A fair amount of stuff this time around, dominated by yet another massive
set from Mauro toward the completion of the RST conversion.  I *really*
hope we are getting close to the end of this.  Meanwhile, those patches
reach pretty far afield to update document references around the tree;
there should be no actual code changes there.  There will be, alas, more of
the usual trivial merge conflicts.

Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots of
fixes.

Merge tag 'docs-5.8' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "A fair amount of stuff this time around, dominated by yet another
  massive set from Mauro toward the completion of the RST conversion. I
  *really* hope we are getting close to the end of this. Meanwhile,
  those patches reach pretty far afield to update document references
  around the tree; there should be no actual code changes there. There
  will be, alas, more of the usual trivial merge conflicts.

  Beyond that we have more translations, improvements to the sphinx
  scripting, a number of additions to the sysctl documentation, and lots
  of fixes"

* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
  Documentation: fixes to the maintainer-entry-profile template
  zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
  tracing: Fix events.rst section numbering
  docs: acpi: fix old http link and improve document format
  docs: filesystems: add info about efivars content
  Documentation: LSM: Correct the basic LSM description
  mailmap: change email for Ricardo Ribalda
  docs: sysctl/kernel: document unaligned controls
  Documentation: admin-guide: update bug-hunting.rst
  docs: sysctl/kernel: document ngroups_max
  nvdimm: fixes to maintainter-entry-profile
  Documentation/features: Correct RISC-V kprobes support entry
  Documentation/features: Refresh the arch support status files
  Revert "docs: sysctl/kernel: document ngroups_max"
  docs: move locking-specific documents to locking/
  docs: move digsig docs to the security book
  docs: move the kref doc into the core-api book
  docs: add IRQ documentation at the core-api book
  docs: debugging-via-ohci1394.txt: add it to the core-api book
  docs: fix references for ipmi.rst file
  ...
Linus Torvalds 2020-06-01 15:45:27 -07:00
commit b23c4771ff
267 changed files with 6300 additions and 4337 deletions

View File

@ -152,6 +152,7 @@ Krzysztof Kozlowski <krzk@kernel.org> <k.kozlowski.k@gmail.com>
Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Leon Romanovsky <leon@kernel.org> <leon@leon.nu> Leon Romanovsky <leon@kernel.org> <leon@leon.nu>
Leon Romanovsky <leon@kernel.org> <leonro@mellanox.com> Leon Romanovsky <leon@kernel.org> <leonro@mellanox.com>
Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com>
Leonid I Ananiev <leonid.i.ananiev@intel.com> Leonid I Ananiev <leonid.i.ananiev@intel.com>
Linas Vepstas <linas@austin.ibm.com> Linas Vepstas <linas@austin.ibm.com>
Linus Lüssing <linus.luessing@c0d3.blue> <linus.luessing@web.de> Linus Lüssing <linus.luessing@c0d3.blue> <linus.luessing@web.de>
@ -234,7 +235,9 @@ Ralf Baechle <ralf@linux-mips.org>
Ralf Wildenhues <Ralf.Wildenhues@gmx.de> Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
Randy Dunlap <rdunlap@infradead.org> <rdunlap@xenotime.net> Randy Dunlap <rdunlap@infradead.org> <rdunlap@xenotime.net>
Rémi Denis-Courmont <rdenis@simphalempin.com> Rémi Denis-Courmont <rdenis@simphalempin.com>
Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com> Ricardo Ribalda <ribalda@kernel.org> <ricardo.ribalda@gmail.com>
Ricardo Ribalda <ribalda@kernel.org> <ricardo@ribalda.com>
Ricardo Ribalda <ribalda@kernel.org> Ricardo Ribalda Delgado <ribalda@kernel.org>
Ross Zwisler <zwisler@kernel.org> <ross.zwisler@linux.intel.com> Ross Zwisler <zwisler@kernel.org> <ross.zwisler@linux.intel.com>
Rudolf Marek <R.Marek@sh.cvut.cz> Rudolf Marek <R.Marek@sh.cvut.cz>
Rui Saraiva <rmps@joel.ist.utl.pt> Rui Saraiva <rmps@joel.ist.utl.pt>

View File

@ -3104,14 +3104,16 @@ W: http://www.qsl.net/dl1bke/
D: Generic Z8530 driver, AX.25 DAMA slave implementation D: Generic Z8530 driver, AX.25 DAMA slave implementation
D: Several AX.25 hacks D: Several AX.25 hacks
N: Ricardo Ribalda Delgado N: Ricardo Ribalda
E: ricardo.ribalda@gmail.com E: ribalda@kernel.org
W: http://ribalda.com W: http://ribalda.com
D: PLX USB338x driver D: PLX USB338x driver
D: PCA9634 driver D: PCA9634 driver
D: Option GTM671WFS D: Option GTM671WFS
D: Fintek F81216A D: Fintek F81216A
D: AD5761 iio driver D: AD5761 iio driver
D: TI DAC7612 driver
D: Sony IMX214 driver
D: Various kernel hacks D: Various kernel hacks
S: Qtechnology A/S S: Qtechnology A/S
S: Valby Langgade 142 S: Valby Langgade 142

View File

@ -54,7 +54,7 @@ Date: October 2002
Contact: Linux Memory Management list <linux-mm@kvack.org> Contact: Linux Memory Management list <linux-mm@kvack.org>
Description: Description:
Provides information about the node's distribution and memory Provides information about the node's distribution and memory
utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.rst
What: /sys/devices/system/node/nodeX/numastat What: /sys/devices/system/node/nodeX/numastat
Date: October 2002 Date: October 2002

View File

@ -11,7 +11,7 @@ Description:
Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem
are not present in /proc/pid/smaps. These fields represent are not present in /proc/pid/smaps. These fields represent
the sum of the Pss field of each type (anon, file, shmem). the sum of the Pss field of each type (anon, file, shmem).
For more details, see Documentation/filesystems/proc.txt For more details, see Documentation/filesystems/proc.rst
and the procfs man page. and the procfs man page.
Typical output looks like this: Typical output looks like this:

View File

@ -98,7 +98,11 @@ else # HAVE_PDFLATEX
pdfdocs: latexdocs pdfdocs: latexdocs
@$(srctree)/scripts/sphinx-pre-install --version-check @$(srctree)/scripts/sphinx-pre-install --version-check
$(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;) $(foreach var,$(SPHINXDIRS), \
$(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit; \
mkdir -p $(BUILDDIR)/$(var)/pdf; \
mv $(subst .tex,.pdf,$(wildcard $(BUILDDIR)/$(var)/latex/*.tex)) $(BUILDDIR)/$(var)/pdf/; \
)
endif # HAVE_PDFLATEX endif # HAVE_PDFLATEX
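
The reworked `pdfdocs` recipe above runs the LaTeX build for each Sphinx directory and then moves the generated PDFs into a sibling `pdf/` directory. The move step can be sketched in Python as follows. This is an illustration only, not part of the commit: the function name `collect_pdfs` is hypothetical, and the sketch assumes each `NAME.tex` yields a `NAME.pdf` in the same directory. It swaps only the file suffix, whereas `$(subst .tex,.pdf,...)` replaces every `.tex` occurrence in the path.

```python
import shutil
from pathlib import Path


def collect_pdfs(build_dir):
    """Move <build_dir>/latex/NAME.pdf into <build_dir>/pdf/ for every NAME.tex.

    Hypothetical helper mirroring the mkdir/mv steps of the Makefile recipe.
    """
    latex_dir = Path(build_dir) / "latex"
    pdf_dir = Path(build_dir) / "pdf"
    pdf_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    moved = []
    for tex in sorted(latex_dir.glob("*.tex")):
        pdf = tex.with_suffix(".pdf")
        # Unlike the Makefile's `mv`, a PDF that was never built is skipped
        # instead of causing an error.
        if pdf.is_file():
            target = pdf_dir / pdf.name
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

Calling `collect_pdfs("Documentation/output/admin-guide")` (the path is an example, not taken from the commit) would leave the `.tex` sources in `latex/` and return the list of PDFs now under `pdf/`.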

View File

@ -32,12 +32,13 @@ interrupt goes unhandled over time, they are tracked by the Linux kernel as
Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it
reaches a specific count with the error "nobody cared". This disabled IRQ reaches a specific count with the error "nobody cared". This disabled IRQ
now prevents valid usage by an existing interrupt which may happen to share now prevents valid usage by an existing interrupt which may happen to share
the IRQ line. the IRQ line::
irq 19: nobody cared (try booting with the "irqpoll" option) irq 19: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1 CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1
Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020 Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020
Call Trace: Call Trace:
<IRQ> <IRQ>
? dump_stack+0x46/0x5e ? dump_stack+0x46/0x5e
? __report_bad_irq+0x2e/0xb0 ? __report_bad_irq+0x2e/0xb0
@ -85,15 +86,18 @@ Mitigations
The mitigations take the form of PCI quirks. The preference has been to The mitigations take the form of PCI quirks. The preference has been to
first identify and make use of a means to disable the routing to the PCH. first identify and make use of a means to disable the routing to the PCH.
In such a case a quirk to disable boot interrupt generation can be In such a case a quirk to disable boot interrupt generation can be
added.[1] added. [1]_
Intel® 6300ESB I/O Controller Hub Intel® 6300ESB I/O Controller Hub
Alternate Base Address Register: Alternate Base Address Register:
BIE: Boot Interrupt Enable BIE: Boot Interrupt Enable
0 = Boot interrupt is enabled.
1 = Boot interrupt is disabled.
Intel® Sandy Bridge through Sky Lake based Xeon servers: == ===========================
0 Boot interrupt is enabled.
1 Boot interrupt is disabled.
== ===========================
Intel® Sandy Bridge through Sky Lake based Xeon servers:
Coherent Interface Protocol Interrupt Control Coherent Interface Protocol Interrupt Control
dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2: dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2:
When this bit is set. Local INTx messages received from the When this bit is set. Local INTx messages received from the
@ -109,12 +113,12 @@ line by default. Therefore, on chipsets where this INTx routing cannot be
disabled, the Linux kernel will reroute the valid interrupt to its legacy disabled, the Linux kernel will reroute the valid interrupt to its legacy
interrupt. This redirection of the handler will prevent the occurrence of interrupt. This redirection of the handler will prevent the occurrence of
the spurious interrupt detection which would ordinarily disable the IRQ the spurious interrupt detection which would ordinarily disable the IRQ
line due to excessive unhandled counts.[2] line due to excessive unhandled counts. [2]_
The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or
disable) the redirection of the interrupt handler to the PCH interrupt disable) the redirection of the interrupt handler to the PCH interrupt
line. The option can be overridden by either pci=ioapicreroute or line. The option can be overridden by either pci=ioapicreroute or
pci=noioapicreroute.[3] pci=noioapicreroute. [3]_
More Documentation More Documentation
@ -127,19 +131,19 @@ into the evolution of its handling with chipsets.
Example of disabling of the boot interrupt Example of disabling of the boot interrupt
------------------------------------------ ------------------------------------------
Intel® 6300ESB I/O Controller Hub (Document # 300641-004US) - Intel® 6300ESB I/O Controller Hub (Document # 300641-004US)
5.7.3 Boot Interrupt 5.7.3 Boot Interrupt
https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf
Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families - Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families
Datasheet - Volume 2: Registers (Document # 330784-003) Datasheet - Volume 2: Registers (Document # 330784-003)
6.6.41 cipintrc Coherent Interface Protocol Interrupt Control 6.6.41 cipintrc Coherent Interface Protocol Interrupt Control
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf
Example of handler rerouting Example of handler rerouting
---------------------------- ----------------------------
Intel® 6700PXH 64-bit PCI Hub (Document # 302628) - Intel® 6700PXH 64-bit PCI Hub (Document # 302628)
2.15.2 PCI Express Legacy INTx Support and Boot Interrupt 2.15.2 PCI Express Legacy INTx Support and Boot Interrupt
https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf
@ -150,6 +154,6 @@ Cheers,
Sean V Kelley Sean V Kelley
sean.v.kelley@linux.intel.com sean.v.kelley@linux.intel.com
[1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/ .. [1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/
[2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/ .. [2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/
[3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/ .. [3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/

View File

@ -63,7 +63,7 @@ which can then be compiled to AML binary format::
ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29 [1] https://www.elinux.org/Minnowboard:MinnowMax#Low_Speed_Expansion_.28Top.29
The resulting AML code can then be loaded by the kernel using one of the methods The resulting AML code can then be loaded by the kernel using one of the methods
below. below.

View File

@ -49,15 +49,19 @@ the issue, it may also contain the word **Oops**, as on this one::
Despite being an **Oops** or some other sort of stack trace, the offended Despite being an **Oops** or some other sort of stack trace, the offended
line is usually required to identify and handle the bug. Along this chapter, line is usually required to identify and handle the bug. Along this chapter,
we'll refer to "Oops" for all kinds of stack traces that need to be analized. we'll refer to "Oops" for all kinds of stack traces that need to be analyzed.
.. note:: If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the
quality of the stack trace by using file:`scripts/decode_stacktrace.sh`.
Modules linked in
-----------------
Modules that are tainted or are being loaded or unloaded are marked with
"(...)", where the taint flags are described in
file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is
annotated with "+", and "being unloaded" is annotated with "-".
``ksymoops`` is useless on 2.6 or upper. Please use the Oops in its original
format (from ``dmesg``, etc). Ignore any references in this or other docs to
"decoding the Oops" or "running it through ksymoops".
If you post an Oops from 2.6+ that has been run through ``ksymoops``,
people will just tell you to repost it.
Where is the Oops message is located? Where is the Oops message is located?
------------------------------------- -------------------------------------
@ -71,7 +75,7 @@ by running ``journalctl`` command.
Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to
read the data from the kernel buffers and save it. Or you can read the data from the kernel buffers and save it. Or you can
``cat /proc/kmsg > file``, however you have to break in to stop the transfer, ``cat /proc/kmsg > file``, however you have to break in to stop the transfer,
``kmsg`` is a "never ending file". since ``kmsg`` is a "never ending file".
If the machine has crashed so badly that you cannot enter commands or If the machine has crashed so badly that you cannot enter commands or
the disk is not available then you have three options: the disk is not available then you have three options:
@ -81,9 +85,9 @@ the disk is not available then you have three options:
planned for a crash. Alternatively, you can take a picture of planned for a crash. Alternatively, you can take a picture of
the screen with a digital camera - not nice, but better than the screen with a digital camera - not nice, but better than
nothing. If the messages scroll off the top of the console, you nothing. If the messages scroll off the top of the console, you
may find that booting with a higher resolution (eg, ``vga=791``) may find that booting with a higher resolution (e.g., ``vga=791``)
will allow you to read more of the text. (Caveat: This needs ``vesafb``, will allow you to read more of the text. (Caveat: This needs ``vesafb``,
so won't help for 'early' oopses) so won't help for 'early' oopses.)
(2) Boot with a serial console (see (2) Boot with a serial console (see
:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`), :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`),
@ -104,7 +108,7 @@ Kernel source file. There are two methods for doing that. Usually, using
gdb gdb
^^^ ^^^
The GNU debug (``gdb``) is the best way to figure out the exact file and line The GNU debugger (``gdb``) is the best way to figure out the exact file and line
number of the OOPS from the ``vmlinux`` file. number of the OOPS from the ``vmlinux`` file.
The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``. The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``.
@ -165,7 +169,7 @@ If you have a call trace, such as::
[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
... ...
this shows the problem likely in the :jbd: module. You can load that module this shows the problem likely is in the :jbd: module. You can load that module
in gdb and list the relevant code:: in gdb and list the relevant code::
$ gdb fs/jbd/jbd.ko $ gdb fs/jbd/jbd.ko
@ -199,8 +203,9 @@ in the kernel hacking menu of the menu configuration.) For example::
You need to be at the top level of the kernel tree for this to pick up You need to be at the top level of the kernel tree for this to pick up
your C files. your C files.
If you don't have access to the code you can also debug on some crash dumps If you don't have access to the source code you can still debug some crash
e.g. crash dump output as shown by Dave Miller:: dumps using the following method (example crash dump output as shown by
Dave Miller)::
EIP is at +0x14/0x4c0 EIP is at +0x14/0x4c0
... ...
@ -230,6 +235,9 @@ e.g. crash dump output as shown by Dave Miller::
mov 0x8(%ebp), %ebx ! %ebx = skb->sk mov 0x8(%ebp), %ebx ! %ebx = skb->sk
mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt
file:`scripts/decodecode` can be used to automate most of this, depending
on what CPU architecture is being debugged.
Reporting the bug Reporting the bug
----------------- -----------------
@ -241,7 +249,7 @@ used for the development of the affected code. This can be done by using
the ``get_maintainer.pl`` script. the ``get_maintainer.pl`` script.
For example, if you find a bug at the gspca's sonixj.c file, you can get For example, if you find a bug at the gspca's sonixj.c file, you can get
their maintainers with:: its maintainers with::
$ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
@ -253,16 +261,17 @@ their maintainers with::
Please notice that it will point to: Please notice that it will point to:
- The last developers that touched on the source code. On the above example, - The last developers that touched the source code (if this is done inside
Tejun and Bhaktipriya (in this specific case, none really envolved on the a git tree). On the above example, Tejun and Bhaktipriya (in this
development of this file); specific case, none really envolved on the development of this file);
- The driver maintainer (Hans Verkuil); - The driver maintainer (Hans Verkuil);
- The subsystem maintainer (Mauro Carvalho Chehab); - The subsystem maintainer (Mauro Carvalho Chehab);
- The driver and/or subsystem mailing list (linux-media@vger.kernel.org); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). - the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
Usually, the fastest way to have your bug fixed is to report it to mailing Usually, the fastest way to have your bug fixed is to report it to mailing
list used for the development of the code (linux-media ML) copying the driver maintainer (Hans). list used for the development of the code (linux-media ML) copying the
driver maintainer (Hans).
If you are totally stumped as to whom to send the report, and If you are totally stumped as to whom to send the report, and
``get_maintainer.pl`` didn't provide you anything useful, send it to ``get_maintainer.pl`` didn't provide you anything useful, send it to
@ -303,9 +312,9 @@ protection fault message can be simply cut out of the message files
and forwarded to the kernel developers. and forwarded to the kernel developers.
Two types of address resolution are performed by ``klogd``. The first is Two types of address resolution are performed by ``klogd``. The first is
static translation and the second is dynamic translation. Static static translation and the second is dynamic translation.
translation uses the System.map file in much the same manner that Static translation uses the System.map file.
ksymoops does. In order to do static translation the ``klogd`` daemon In order to do static translation the ``klogd`` daemon
must be able to find a system map file at daemon initialization time. must be able to find a system map file at daemon initialization time.
See the klogd man page for information on how ``klogd`` searches for map See the klogd man page for information on how ``klogd`` searches for map
files. files.
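
The call-trace entry quoted in the gdb hunk of this file, `[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee`, packs an address, an optional module name, a symbol, an offset, and a function size into one line. A small Python sketch of how such a line could be split apart is shown here. It is not a kernel tool: the names `TRACE_RE`, `parse_trace_line`, and `gdb_list_command` are hypothetical, the pattern covers only this bracketed layout, and treating a missing module as "symbol lives in vmlinux" is an assumption.

```python
import re

# Matches entries such as "[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee".
TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-fA-F]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_trace_line(line):
    """Return the fields of one call-trace entry, or None if it does not match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "address": int(m.group("addr"), 16),
        "module": m.group("module"),  # None: assumed to be a vmlinux symbol
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
    }


def gdb_list_command(entry):
    """Build a gdb `list` command pointing at symbol+offset."""
    return "list *({}+{:#x})".format(entry["symbol"], entry["offset"])
```

For the entry above, the module field is `jbd`, which is what points at `fs/jbd/jbd.ko` as the object to load in gdb.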

View File

@ -105,7 +105,7 @@ References
---------- ----------
- http://lkml.org/lkml/2007/2/12/6 - http://lkml.org/lkml/2007/2/12/6
- Documentation/filesystems/proc.txt (1.8) - Documentation/filesystems/proc.rst (1.8)
Thanks Thanks

View File

@ -268,7 +268,7 @@ Guest mitigation mechanisms
/proc/irq/$NR/smp_affinity[_list] files. Limited documentation is /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
available at: available at:
https://www.kernel.org/doc/Documentation/IRQ-affinity.txt https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
.. _smt_control: .. _smt_control:

View File

@ -1,52 +1,48 @@
Explaining the dreaded "No init found." boot hang message Explaining the "No working init found." boot hang message
========================================================= =========================================================
:Authors: Andreas Mohr <andi at lisas period de>
Cristian Souza <cristianmsbr at gmail period com>
OK, so you've got this pretty unintuitive message (currently located This document provides some high-level reasons for failure
in init/main.c) and are wondering what the H*** went wrong. (listed roughly in order of execution) to load the init binary.
Some high-level reasons for failure (listed roughly in order of execution)
to load the init binary are:
A) Unable to mount root FS 1) **Unable to mount root FS**: Set "debug" kernel parameter (in bootloader
B) init binary doesn't exist on rootfs config file or CONFIG_CMDLINE) to get more detailed kernel messages.
C) broken console device
D) binary exists but dependencies not available
E) binary cannot be loaded
Detailed explanations: 2) **init binary doesn't exist on rootfs**: Make sure you have the correct
root FS type (and ``root=`` kernel parameter points to the correct
partition), required drivers such as storage hardware (such as SCSI or
USB!) and filesystem (ext3, jffs2, etc.) are builtin (alternatively as
modules, to be pre-loaded by an initrd).
A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE) 3) **Broken console device**: Possibly a conflict in ``console= setup``
to get more detailed kernel messages. --> initial console unavailable. E.g. some serial consoles are unreliable
B) make sure you have the correct root FS type due to serial IRQ issues (e.g. missing interrupt-based configuration).
(and ``root=`` kernel parameter points to the correct partition),
required drivers such as storage hardware (such as SCSI or USB!)
and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
to be pre-loaded by an initrd)
C) Possibly a conflict in ``console= setup`` --> initial console unavailable.
E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
missing interrupt-based configuration).
Try using a different ``console= device`` or e.g. ``netconsole=``. Try using a different ``console= device`` or e.g. ``netconsole=``.
D) e.g. required library dependencies of the init binary such as
``/lib/ld-linux.so.2`` missing or broken. Use 4) **Binary exists but dependencies not available**: E.g. required library
``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required. dependencies of the init binary such as ``/lib/ld-linux.so.2`` missing or
E) make sure the binary's architecture matches your hardware. broken. Use ``readelf -d <INIT>|grep NEEDED`` to find out which libraries
E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware. are required.
In case you tried loading a non-binary file here (shell script?),
you should make sure that the script specifies an interpreter in its shebang 5) **Binary cannot be loaded**: Make sure the binary's architecture matches
header line (``#!/...``) that is fully working (including its library your hardware. E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM
dependencies). And before tackling scripts, better first test a simple hardware. In case you tried loading a non-binary file here (shell script?),
non-script binary such as ``/bin/sh`` and confirm its successful execution. you should make sure that the script specifies an interpreter in its
To find out more, add code ``to init/main.c`` to display kernel_execve()s shebang header line (``#!/...``) that is fully working (including its
return values. library dependencies). And before tackling scripts, better first test a
simple non-script binary such as ``/bin/sh`` and confirm its successful
execution. To find out more, add code ``to init/main.c`` to display
kernel_execve()s return values.
Please extend this explanation whenever you find new failure causes Please extend this explanation whenever you find new failure causes
(after all loading the init binary is a CRITICAL and hard transition step (after all loading the init binary is a CRITICAL and hard transition step
which needs to be made as painless as possible), then submit patch to LKML. which needs to be made as painless as possible), then submit a patch to LKML.
Further TODOs: Further TODOs:
- Implement the various ``run_init_process()`` invocations via a struct array - Implement the various ``run_init_process()`` invocations via a struct array
which can then store the ``kernel_execve()`` result value and on failure which can then store the ``kernel_execve()`` result value and on failure
log it all by iterating over **all** results (very important usability fix). log it all by iterating over **all** results (very important usability fix).
- try to make the implementation itself more helpful in general, - Try to make the implementation itself more helpful in general, e.g. by
e.g. by providing additional error messages at affected places. providing additional error messages at affected places.
Andreas Mohr <andi at lisas period de>
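
Failure cases 2 and 5 above (init binary missing, wrong architecture, or a script without a usable interpreter) can be pre-checked from a working system before rebooting. The Python sketch below is one way to do that, under stated assumptions: `describe_init` and `ELF_MACHINES` are hypothetical names, only four common `e_machine` values are mapped, and library dependencies are not examined (that still needs `readelf -d`).

```python
import struct
from pathlib import Path

# A few e_machine values from the ELF specification.
ELF_MACHINES = {3: "i386", 40: "arm", 62: "x86_64", 183: "aarch64"}


def describe_init(path):
    """Classify an init candidate as missing, a script, an ELF binary, or unknown."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(256)
    if head.startswith(b"#!"):
        # The first word after "#!" is the interpreter the kernel will execute.
        words = head[2:].split(b"\n", 1)[0].split()
        if not words:
            return "script:<no interpreter>"
        return "script:" + words[0].decode(errors="replace")
    if head.startswith(b"\x7fELF") and len(head) >= 20:
        # e_ident[5] (EI_DATA) is 1 for little-endian, 2 for big-endian.
        byte_order = "<" if head[5] == 1 else ">"
        # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
        (machine,) = struct.unpack_from(byte_order + "H", head, 18)
        return "elf:" + ELF_MACHINES.get(machine, "machine-%d" % machine)
    return "unknown"
```

A result such as `elf:i386` for an init intended to run on ARM hardware points at case 5, while `script:/bin/sh` means the same check should be repeated on the interpreter itself.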

View File

@ -3336,7 +3336,7 @@
See Documentation/admin-guide/sysctl/vm.rst for details. See Documentation/admin-guide/sysctl/vm.rst for details.
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
See Documentation/debugging-via-ohci1394.txt for more See Documentation/core-api/debugging-via-ohci1394.rst for more
info. info.
olpc_ec_timeout= [OLPC] ms delay when issuing EC commands olpc_ec_timeout= [OLPC] ms delay when issuing EC commands

View File

@ -10,7 +10,7 @@ them to a "housekeeping" CPU dedicated to such work.
References References
========== ==========
- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. - Documentation/core-api/irq/irq-affinity.rst: Binding interrupts to sets of CPUs.
- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. - Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.

View File

@ -12,107 +12,107 @@ and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do. memory page faults, something otherwise only the kernel code could do.
For example userfaults allows a proper and more optimal implementation For example userfaults allows a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick. of the ``PROT_NONE+SIGSEGV`` trick.
Design Design
====== ======
Userfaults are delivered and resolved through the userfaultfd syscall. Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
The userfaultfd (aside from registering and unregistering virtual The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities: memory ranges) provides two primary functionalities:
1) read/POLLIN protocol to notify a userland thread of the faults 1) ``read/POLLIN`` protocol to notify a userland thread of the faults
happening happening
2) various UFFDIO_* ioctls that can manage the virtual memory regions 2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
registered in the userfaultfd that allows userland to efficiently registered in the ``userfaultfd`` that allows userland to efficiently
resolve the userfaults it receives via 1) or to manage the virtual resolve the userfaults it receives via 1) or to manage the virtual
memory in the background memory in the background
The real advantage of userfaults if compared to regular virtual memory The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the operations never involve heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing). ``userfaultfd`` runtime load never takes the mmap_sem for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that. Terabytes. Too many vmas would be needed for that.
The userfaultfd once opened by invoking the syscall, can also be The ``userfaultfd`` once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on different processes without them being aware about what is going on
(well of course unless they later try to use the userfaultfd (well of course unless they later try to use the ``userfaultfd``
themselves on the same region the manager is already tracking, which themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY). is a corner case that would currently return ``-EBUSY``).
API API
=== ===
When first opened the userfaultfd must be enabled invoking the When first opened the ``userfaultfd`` must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the read/POLLIN protocol a later API version) which will specify the ``read/POLLIN`` protocol
userland intends to speak on the UFFD and the uffdio_api.features userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
userland requires. The UFFDIO_API ioctl if successful (i.e. if the userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the requested ``uffdio_api.api`` is spoken also by the running kernel and the
requested features are going to be enabled) will return into requested features are going to be enabled) will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of ``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of
respectively all the available features of the read(2) protocol and respectively all the available features of the read(2) protocol and
the generic ioctl available. the generic ioctl available.
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the userfaultfd and what defines what memory types are supported by the ``userfaultfd`` and what
events, except page fault notifications, may be generated. events, except page fault notifications, may be generated.
If the kernel supports registering userfaultfd ranges on hugetlbfs If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be ``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
set if the kernel supports registering userfaultfd ranges on shared set if the kernel supports registering ``userfaultfd`` ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
MAP_SHARED, memfd_create, etc). ``MAP_SHARED``, ``memfd_create``, etc).
The userland application that wants to use userfaultfd with hugetlbfs The userland application that wants to use ``userfaultfd`` with hugetlbfs
or shared memory need to set the corresponding flag in or shared memory need to set the corresponding flag in
uffdio_api.features to enable those features. ``uffdio_api.features`` to enable those features.
If the userland desires to receive notifications for events other than If the userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has appropriate page faults, it has to verify that ``uffdio_api.features`` has appropriate
UFFD_FEATURE_EVENT_* bits set. These events are described in more ``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
detail below in "Non-cooperative userfaultfd" section. detail below in `Non-cooperative userfaultfd`_ section.
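The feature check described above can be sketched as a small bitmask decoder. This is a minimal illustration, not kernel code: the bit positions are the ones I recall from ``<linux/userfaultfd.h>`` and should be treated as assumptions, and ``decode_features`` and ``missing_features`` are hypothetical helper names.

```python
# Bit positions as recalled from <linux/userfaultfd.h>; treat them as
# assumptions and verify against the header of the kernel in use.
UFFD_FEATURES = {
    "UFFD_FEATURE_EVENT_FORK": 1 << 1,
    "UFFD_FEATURE_EVENT_REMAP": 1 << 2,
    "UFFD_FEATURE_EVENT_REMOVE": 1 << 3,
    "UFFD_FEATURE_MISSING_HUGETLBFS": 1 << 4,
    "UFFD_FEATURE_MISSING_SHMEM": 1 << 5,
    "UFFD_FEATURE_EVENT_UNMAP": 1 << 6,
}


def decode_features(mask):
    """Return the names of the known feature bits that are set in mask."""
    return [name for name, bit in UFFD_FEATURES.items() if mask & bit]


def missing_features(returned_mask, wanted):
    """Return the wanted feature names that the returned mask does not report."""
    available = set(decode_features(returned_mask))
    return [name for name in wanted if name not in available]
```

A caller would pass the value returned in ``uffdio_api.features`` as ``returned_mask`` and refuse to continue if ``missing_features`` returns a non-empty list.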
Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
register a memory range in the userfaultfd by setting the register a memory range in the ``userfaultfd`` by setting the
uffdio_register structure accordingly. The uffdio_register.mode uffdio_register structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
pages). The UFFDIO_REGISTER ioctl will return the pages). The ``UFFDIO_REGISTER`` ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked memory backend (anonymous memory vs tmpfs vs real filebacked
mappings). mappings).
Userland can use the uffdio_register.ioctls to manage the virtual Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault memory from the ``userfaultfd`` registered range). This means a userfault
could be triggering just before userland maps in the background the could be triggering just before userland maps in the background the
user-faulted page. user-faulted page.
The primary ioctl to resolve userfaults is UFFDIO_COPY. That The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
atomically copies a page into the userfault registered range and wakes atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode & up the blocked userfaults
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to (unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
half copied page since it'll keep userfaulting until the copy has guaranteeing that nothing can see an half copied page since it'll
finished. keep userfaulting until the copy has finished.
Notes: Notes:
- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then - If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
you must provide some kind of page in your thread after reading from you must provide some kind of page in your thread after reading from
the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE. the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
The normal behavior of the OS automatically providing a zero page on The normal behavior of the OS automatically providing a zero page on
an anonymous mapping is not in place. an anonymous mapping is not in place.
@ -122,13 +122,13 @@ Notes:
- You get the address of the access that triggered the missing page - You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the event out of a struct uffd_msg that you read in the thread from the
uffd. You can supply as many pages as you want with UFFDIO_COPY or uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or
UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then
the first of any of those IOCTLs wakes up the faulting thread. the first of any of those IOCTLs wakes up the faulting thread.
- Be sure to test for all errors including (pollfd[0].revents & - Be sure to test for all errors including
POLLERR). This can happen, e.g. when ranges supplied were (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
incorrect. supplied were incorrect.
Write Protect Notifications Write Protect Notifications
--------------------------- ---------------------------
@ -136,41 +136,42 @@ Write Protect Notifications
This is equivalent to (but faster than) using mprotect and a SIGSEGV This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler. signal handler.
Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP. Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT, Instead of using mprotect(2) you use
struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
while ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
in the struct passed in. The range does not default to and does not in the struct passed in. The range does not default to and does not
have to be identical to the range you registered with. You can write have to be identical to the range you registered with. You can write
protect as many ranges as you like (inside the registered range). protect as many ranges as you like (inside the registered range).
Then, in the thread reading from uffd the struct will have Then, in the thread reading from uffd the struct will have
msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send ``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set. again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP``
This wakes up the thread which will continue to run with writes. This set. This wakes up the thread which will continue to run with writes. This
allows you to do the bookkeeping about the write in the uffd reading allows you to do the bookkeeping about the write in the uffd reading
thread before the ioctl. thread before the ioctl.
If you registered with both UFFDIO_REGISTER_MODE_MISSING and If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in ``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
which you supply a page and undo write protect. Note that there is a which you supply a page and undo write protect. Note that there is a
difference between writes into a WP area and into a !WP area. The difference between writes into a WP area and into a !WP area. The
former will have UFFD_PAGEFAULT_FLAG_WP set, the latter former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but ``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but
you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used. used.
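The distinction between the two kinds of write fault can be sketched as a tiny classifier. The flag values are the ones I recall from ``<linux/userfaultfd.h>`` and are assumptions here; ``classify_fault`` and its return strings are hypothetical names used only for illustration.

```python
# Flag values as recalled from <linux/userfaultfd.h>; assumptions, verify
# against the header of the kernel in use.
UFFD_PAGEFAULT_FLAG_WRITE = 1 << 0
UFFD_PAGEFAULT_FLAG_WP = 1 << 1


def classify_fault(flags):
    """Map msg.arg.pagefault.flags to the action the text describes."""
    if flags & UFFD_PAGEFAULT_FLAG_WP:
        # Write into a write-protected area: do the bookkeeping, then issue
        # UFFDIO_WRITEPROTECT again with the WP mode bit cleared.
        return "undo-write-protect"
    # Missing page (a read, or a write into a !WP area): a page has to be
    # supplied with UFFDIO_COPY or UFFDIO_ZEROPAGE.
    return "supply-page"
```

Checking the WP bit first keeps the result unambiguous even if a message carries both bits.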
QEMU/KVM QEMU/KVM
======== ========
QEMU/KVM is using the userfaultfd syscall to implement postcopy live QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live
migration. Postcopy live migration is one form of memory migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of ``userfaultfd`` abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live KVM kernel code had to be modified in order to add postcopy live
migration to QEMU. migration to QEMU.
Guest async page faults, FOLL_NOWAIT and all other GUP features work Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
just fine in combination with userfaults. Userfaults trigger async just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in aren't waiting for userfaults (i.e. network bound) can keep running in
@ -183,19 +184,19 @@ generating userfaults for readonly guest regions.
The implementation of postcopy live migration currently uses one The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
The QEMU in the source node writes all pages that it knows are missing The QEMU in the source node writes all pages that it knows are missing
in the destination node, into the socket, and the migration thread of in the destination node, into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
ioctls on the userfaultfd in order to map the received pages into the ioctls on the ``userfaultfd`` in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page). guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page).
A different postcopy thread in the destination node listens with A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
generated after a userfault triggers, the postcopy thread read() from generated after a userfault triggers, the postcopy thread read() from
the userfaultfd and receives the fault address (or -EAGAIN in case the the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run
by the parallel QEMU migration thread). by the parallel QEMU migration thread).
After the QEMU postcopy thread (running in the destination node) gets After the QEMU postcopy thread (running in the destination node) gets
@ -206,7 +207,7 @@ remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the (just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page was spontaneously sent by the source or if it was an urgent page
requested through a userfault). requested through a userfault).
@ -219,74 +220,74 @@ checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
thread). thread).
Non-cooperative userfaultfd Non-cooperative userfaultfd
=========================== ===========================
When the userfaultfd is monitored by an external manager, the manager When the ``userfaultfd`` is monitored by an external manager, the manager
must be able to track changes in the process virtual memory must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl: bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
UFFD_FEATURE_EVENT_FORK ``UFFD_FEATURE_EVENT_FORK``
enable userfaultfd hooks for fork(). When this feature is enable ``userfaultfd`` hooks for fork(). When this feature is
enabled, the userfaultfd context of the parent process is enabled, the ``userfaultfd`` context of the parent process is
duplicated into the newly created process. The manager duplicated into the newly created process. The manager
receives UFFD_EVENT_FORK with file descriptor of the new receives ``UFFD_EVENT_FORK`` with file descriptor of the new
userfaultfd context in the uffd_msg.fork. ``userfaultfd`` context in the ``uffd_msg.fork``.
UFFD_FEATURE_EVENT_REMAP ``UFFD_FEATURE_EVENT_REMAP``
enable notifications about mremap() calls. When the enable notifications about mremap() calls. When the
non-cooperative process moves a virtual memory area to a non-cooperative process moves a virtual memory area to a
different location, the manager will receive different location, the manager will receive
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
new addresses of the area and its original length. new addresses of the area and its original length.
UFFD_FEATURE_EVENT_REMOVE ``UFFD_FEATURE_EVENT_REMOVE``
enable notifications about madvise(MADV_REMOVE) and enable notifications about madvise(MADV_REMOVE) and
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
be generated upon these calls to madvise. The uffd_msg.remove be generated upon these calls to madvise(). The ``uffd_msg.remove``
will contain start and end addresses of the removed area. will contain start and end addresses of the removed area.
UFFD_FEATURE_EVENT_UNMAP ``UFFD_FEATURE_EVENT_UNMAP``
enable notifications about memory unmapping. The manager will enable notifications about memory unmapping. The manager will
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
end addresses of the unmapped area. end addresses of the unmapped area.
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
are pretty similar, they quite differ in the action expected from the are pretty similar, they quite differ in the action expected from the
userfaultfd manager. In the former case, the virtual memory is ``userfaultfd`` manager. In the former case, the virtual memory is
removed, but the area is not, the area remains monitored by the removed, but the area is not, the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be ``userfaultfd``, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such page fault is delivered to the manager. The proper resolution for such page fault is
to zeromap the faulting address. However, in the latter case, when an to zeromap the faulting address. However, in the latter case, when an
area is unmapped, either explicitly (with munmap() system call), or area is unmapped, either explicitly (with munmap() system call), or
implicitly (e.g. during mremap()), the area is removed and in turn the implicitly (e.g. during mremap()), the area is removed and in turn the
userfaultfd context for such area disappears too and the manager will ``userfaultfd`` context for such area disappears too and the manager will
not get further userland page faults from the removed area. Still, the not get further userland page faults from the removed area. Still, the
notification is required in order to prevent manager from using notification is required in order to prevent manager from using
UFFDIO_COPY on the unmapped area. ``UFFDIO_COPY`` on the unmapped area.
Unlike userland page faults which have to be synchronous and require Unlike userland page faults which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as asynchronously and the non-cooperative process resumes execution as
soon as manager executes read(). The userfaultfd manager should soon as manager executes read(). The ``userfaultfd`` manager should
carefully synchronize calls to UFFDIO_COPY with the events carefully synchronize calls to ``UFFDIO_COPY`` with the events
processing. To aid the synchronization, the UFFDIO_COPY ioctl will processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
return -ENOSPC when the monitored process exits at the time of return ``-ENOSPC`` when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed ``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed
its virtual memory layout simultaneously with outstanding UFFDIO_COPY its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY``
operation. operation.
The current asynchronous model of the event delivery is optimal for The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A single threaded non-cooperative ``userfaultfd`` manager implementations. A
synchronous event delivery model can be added later as a new synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the ``userfaultfd`` feature to facilitate multithreading enhancements of the
non cooperative manager, for example to allow UFFDIO_COPY ioctls to non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
run in parallel to the event reception. Single threaded run in parallel to the event reception. Single threaded
implementations should continue to use the current async event implementations should continue to use the current async event
delivery model instead. delivery model instead.
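The different reactions to ``UFFD_EVENT_REMOVE`` and ``UFFD_EVENT_UNMAP`` can be sketched as bookkeeping in a hypothetical single-threaded manager. ``RangeTracker`` and its methods are illustrative names, not part of any kernel or library API; the sketch only models which addresses remain valid ``UFFDIO_COPY`` targets.

```python
class RangeTracker:
    """Track the [start, end) ranges a manager may still UFFDIO_COPY into."""

    def __init__(self):
        self.ranges = []  # non-overlapping (start, end) tuples

    def register(self, start, end):
        self.ranges.append((start, end))

    def on_remove(self, start, end):
        # UFFD_EVENT_REMOVE: the memory is dropped but the area stays
        # monitored, so the tracked ranges do not change.
        pass

    def on_unmap(self, start, end):
        # UFFD_EVENT_UNMAP: the area and its userfaultfd context are gone,
        # so cut [start, end) out of every tracked range.
        kept = []
        for lo, hi in self.ranges:
            if end <= lo or hi <= start:
                kept.append((lo, hi))  # no overlap, keep as is
                continue
            if lo < start:
                kept.append((lo, start))  # part below the unmapped area
            if end < hi:
                kept.append((end, hi))  # part above the unmapped area
        self.ranges = kept

    def may_copy(self, addr):
        """Return True if addr still lies in a monitored range."""
        return any(lo <= addr < hi for lo, hi in self.ranges)
```

A manager would consult ``may_copy`` before each ``UFFDIO_COPY`` and still handle ``-ENOENT``, since the layout can change between the check and the ioctl.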


@ -18,7 +18,7 @@ Mounting the root filesystem via NFS (nfsroot)
In order to use a diskless system, such as an X-terminal or printer server for In order to use a diskless system, such as an X-terminal or printer server for
example, it is necessary for the root filesystem to be present on a non-disk example, it is necessary for the root filesystem to be present on a non-disk
device. This may be an initramfs (see device. This may be an initramfs (see
Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/filesystems/ramfs-rootfs-initramfs.rst), a ramdisk (see
Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The
following text describes on how to use NFS for the root filesystem. For the rest following text describes on how to use NFS for the root filesystem. For the rest
of this text 'client' means the diskless system, and 'server' means the NFS of this text 'client' means the diskless system, and 'server' means the NFS


@ -6,6 +6,21 @@ Numa policy hit/miss statistics
All units are pages. Hugepages have separate counters. All units are pages. Hugepages have separate counters.
The numa_hit, numa_miss and numa_foreign counters reflect how well processes
are able to allocate memory from nodes they prefer. If they succeed, numa_hit
is incremented on the preferred node, otherwise numa_foreign is incremented on
the preferred node and numa_miss on the node where allocation succeeded.
Usually preferred node is the one local to the CPU where the process executes,
but restrictions such as mempolicies can change that, so there are also two
counters based on CPU local node. local_node is similar to numa_hit and is
incremented on allocation from a node by CPU on the same node. other_node is
similar to numa_miss and is incremented on the node where allocation succeeds
from a CPU from a different node. Note there is no counter analogical to
numa_foreign.
In more detail:
=============== ============================================================ =============== ============================================================
numa_hit A process wanted to allocate memory from this node, numa_hit A process wanted to allocate memory from this node,
and succeeded. and succeeded.
@ -14,11 +29,13 @@ numa_miss A process wanted to allocate memory from another node,
but ended up with memory from this node. but ended up with memory from this node.
numa_foreign A process wanted to allocate on this node, numa_foreign A process wanted to allocate on this node,
but ended up with memory from another one. but ended up with memory from another node.
local_node A process ran on this node and got memory from it. local_node A process ran on this node's CPU,
and got memory from this node.
other_node A process ran on this node and got memory from another node. other_node A process ran on a different node's CPU
and got memory from this node.
interleave_hit Interleaving wanted to allocate from this node interleave_hit Interleaving wanted to allocate from this node
and succeeded. and succeeded.
@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
(http://oss.sgi.com/projects/libnuma/). Note that it only works (http://oss.sgi.com/projects/libnuma/). Note that it only works
well right now on machines with a small number of CPUs. well right now on machines with a small number of CPUs.
Note that on systems with memoryless nodes (where a node has CPUs but no
memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
heavily. In the current kernel implementation, if a process prefers a
memoryless node (i.e. because it is running on one of its local CPU), the
implementation actually treats one of the nearest nodes with memory as the
preferred node. As a result, such allocation will not increase the numa_foreign
counter on the memoryless node, and will skew the numa_hit, numa_miss and
numa_foreign statistics of the nearest node.
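The counter rules described in this file can be sketched as a small accounting function. This is a model of the documented behaviour only, not the kernel implementation; ``new_stats`` and ``account`` are hypothetical names, and nodes are plain integers.

```python
from collections import defaultdict


def new_stats():
    """Return per-node counters that start at zero."""
    return defaultdict(lambda: defaultdict(int))


def account(stats, preferred, allocated, cpu_node):
    """Record one page allocation.

    preferred: node the process wanted memory from
    allocated: node the memory actually came from
    cpu_node:  node of the CPU the process was running on
    """
    if allocated == preferred:
        stats[allocated]["numa_hit"] += 1
    else:
        stats[preferred]["numa_foreign"] += 1
        stats[allocated]["numa_miss"] += 1
    # The CPU-based counters are always accounted on the allocating node.
    if allocated == cpu_node:
        stats[allocated]["local_node"] += 1
    else:
        stats[allocated]["other_node"] += 1
```

For example, a process on node 0 that prefers node 0 but receives memory from node 1 increments ``numa_foreign`` on node 0, and ``numa_miss`` and ``other_node`` on node 1.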


@ -156,11 +156,11 @@ the labels provided by the BIOS won't match the real ones.
ECC memory ECC memory
---------- ----------
As mentioned on the previous section, ECC memory has extra bits to be As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. So, on 64 bit systems, a memory module used for error correction. In the above example, a memory module has
has 64 bits of *data width*, and 74 bits of *total width*. So, there are 64 bits of *data width*, and 72 bits of *total width*. The extra 8
8 bits extra bits to be used for the error detection and correction bits which are used for the error detection and correction mechanisms
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_. are referred to as the *syndrome*\ [#f1]_\ [#f2]_.
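The figure of 8 extra bits for 64 data bits can be checked with a short calculation. The sketch assumes a single-error-correcting, double-error-detecting (SECDED) Hamming code, which the text does not name explicitly; ``secded_check_bits`` is a hypothetical helper.

```python
def secded_check_bits(data_bits):
    """Return the number of check bits a SECDED Hamming code needs.

    Single-error correction needs the smallest r with
    2**r >= data_bits + r + 1; double-error detection adds one
    overall parity bit.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


data_width = 64
total_width = data_width + secded_check_bits(data_width)
```

Under that assumption ``secded_check_bits(64)`` is 8, which gives a total width of 72 bits.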
So, when the cpu requests the memory controller to write a word with So, when the cpu requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time, *data width*, the memory controller calculates the *syndrome* in real time,
@ -212,7 +212,7 @@ EDAC - Error Detection And Correction
purposes. purposes.
When the subsystem was pushed upstream for the first time, on When the subsystem was pushed upstream for the first time, on
Kernel 2.6.16, for the first time, it was renamed to ``EDAC``. Kernel 2.6.16, it was renamed to ``EDAC``.
Purpose Purpose
------- -------
@ -351,15 +351,17 @@ controllers. The following example will assume 2 channels:
+------------+-----------+-----------+ +------------+-----------+-----------+
| | ``ch0`` | ``ch1`` | | | ``ch0`` | ``ch1`` |
+============+===========+===========+ +============+===========+===========+
| ``csrow0`` | DIMM_A0 | DIMM_B0 | | |**DIMM_A0**|**DIMM_B0**|
| | rank0 | rank0 | +------------+-----------+-----------+
+------------+ - | - | | ``csrow0`` | rank0 | rank0 |
+------------+-----------+-----------+
| ``csrow1`` | rank1 | rank1 | | ``csrow1`` | rank1 | rank1 |
+------------+-----------+-----------+ +------------+-----------+-----------+
| ``csrow2`` | DIMM_A1 | DIMM_B1 | | |**DIMM_A1**|**DIMM_B1**|
| | rank0 | rank0 | +------------+-----------+-----------+
+------------+ - | - | | ``csrow2`` | rank0 | rank0 |
| ``csrow3`` | rank1 | rank1 | +------------+-----------+-----------+
| ``csrow3`` | rank1 | rank1 |
+------------+-----------+-----------+ +------------+-----------+-----------+
In the above example, there are 4 physical slots on the motherboard In the above example, there are 4 physical slots on the motherboard


@ -102,6 +102,30 @@ See the ``type_of_loader`` and ``ext_loader_ver`` fields in
:doc:`/x86/boot` for additional information. :doc:`/x86/boot` for additional information.
bpf_stats_enabled
=================
Controls whether the kernel should collect statistics on BPF programs
(total time spent running, number of times run...). Enabling
statistics causes a slight reduction in performance on each program
run. The statistics can be seen using ``bpftool``.
= ===================================
0 Don't collect statistics (default).
1 Collect statistics.
= ===================================
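Reading such an entry from userspace can be sketched as follows. This is a minimal illustration that assumes the sysctl is exposed as an integer under ``/proc/sys/kernel``; ``read_sysctl`` and ``bpf_stats_state`` are hypothetical helper names, and the ``root`` parameter exists only so the function can be pointed at another directory.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys/kernel"):
    """Return the integer value of a sysctl entry, or None if it is absent."""
    path = Path(root) / name
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        # Kernels without the feature do not expose the entry.
        return None


def bpf_stats_state(value):
    """Translate a bpf_stats_enabled value using the table above."""
    return {0: "disabled", 1: "enabled"}.get(value, "unknown")
```

For instance, ``bpf_stats_state(read_sysctl("bpf_stats_enabled"))`` yields ``"unknown"`` on a kernel that lacks the entry.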
cad_pid
=======
This is the pid which will be signalled on reboot (notably, by
Ctrl-Alt-Delete). Writing a value to this file which doesn't
correspond to a running process will result in ``-ESRCH``.
See also `ctrl-alt-del`_.
cap_last_cap cap_last_cap
============ ============
@ -241,6 +265,40 @@ domain names are in general different. For a detailed discussion
see the ``hostname(1)`` man page. see the ``hostname(1)`` man page.
firmware_config
===============
See :doc:`/driver-api/firmware/fallback-mechanisms`.
The entries in this directory allow the firmware loader helper
fallback to be controlled:
* ``force_sysfs_fallback``, when set to 1, forces the use of the
fallback;
* ``ignore_sysfs_fallback``, when set to 1, ignores any fallback.
ftrace_dump_on_oops
===================
Determines whether ``ftrace_dump()`` should be called on an oops (or
kernel panic). This will output the contents of the ftrace buffers to
the console. This is very useful for capturing traces that lead to
crashes and outputting them to a serial console.
= ===================================================
0 Disabled (default).
1 Dump buffers of all CPUs.
2 Dump the buffer of the CPU that triggered the oops.
= ===================================================
ftrace_enabled, stack_tracer_enabled
====================================
See :doc:`/trace/ftrace`.
hardlockup_all_cpu_backtrace hardlockup_all_cpu_backtrace
============================ ============================
@ -344,6 +402,25 @@ Controls whether the panic kmsg data should be reported to Hyper-V.
= ========================================================= = =========================================================
ignore-unaligned-usertrap
=========================
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
currently, ``arc`` and ``ia64``), controls whether all unaligned traps
are logged.
= =============================================================
0 Log all unaligned accesses.
1 Only warn the first time a process traps. This is the default
setting.
= =============================================================
See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``,
this allows system administrators to override the
``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
kexec_load_disabled
===================
@ -459,6 +536,15 @@ Notes:
successful IPC object allocation. If an IPC object allocation syscall
fails, it is undefined if the value remains unmodified or is reset to -1.
ngroups_max
===========
Maximum number of supplementary groups, _i.e._ the maximum size which
``setgroups`` will accept. Exports ``NGROUPS_MAX`` from the kernel.
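A minimal sketch of how the limit could be queried from userspace, assuming a Linux system with Python available. The two function names are hypothetical. The first reads the sysctl file directly; the second asks the C library through ``os.sysconf``. The two values would normally be expected to agree, but that is not verified here.

```python
import os


def ngroups_max_from_procfs(path="/proc/sys/kernel/ngroups_max"):
    """Read the limit from the sysctl file; None if it is missing or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def ngroups_max_from_sysconf():
    """Ask the C library for NGROUPS_MAX; None if it is not available."""
    try:
        value = os.sysconf("SC_NGROUPS_MAX")
    except (AttributeError, ValueError, OSError):
        return None
    # sysconf() reports -1 when the limit is indeterminate.
    return value if value >= 0 else None
```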
nmi_watchdog
============
@ -877,7 +963,7 @@ this sysctl interface anymore.
pty
===
See Documentation/filesystems/devpts.txt. See Documentation/filesystems/devpts.rst.
randomize_va_space
@ -1173,6 +1259,65 @@ If a value outside of this range is written to ``threads-max`` an
``EINVAL`` error occurs.
traceoff_on_warning
===================
When set, disables tracing (see :doc:`/trace/ftrace`) when a
``WARN()`` is hit.
tracepoint_printk
=================
When tracepoints are sent to printk() (enabled by the ``tp_printk``
boot parameter), this entry provides runtime control::
echo 0 > /proc/sys/kernel/tracepoint_printk
will stop tracepoints from being sent to printk(), and::
echo 1 > /proc/sys/kernel/tracepoint_printk
will send them to printk() again.
This only works if the kernel was booted with ``tp_printk`` enabled.
See :doc:`/admin-guide/kernel-parameters` and
:doc:`/trace/boottime-trace`.
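The two ``echo`` commands above could be wrapped in a small script. The following is only a sketch: the function name ``set_tracepoint_printk`` is hypothetical, writing to the real entry requires root, and, as noted above, it only has an effect if the kernel was booted with ``tp_printk``.

```python
def set_tracepoint_printk(enabled, path="/proc/sys/kernel/tracepoint_printk"):
    """Write 1 (send tracepoints to printk()) or 0 (stop) to the sysctl entry.

    Equivalent to 'echo 1 > path' or 'echo 0 > path'. Any OSError, such as
    a permission error, is left for the caller to handle.
    """
    with open(path, "w") as f:
        f.write("1\n" if enabled else "0\n")
```

The ``path`` parameter exists so the helper can be tried against a scratch file before it is pointed at ``/proc/sys``.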
.. _unaligned-dump-stack:
unaligned-dump-stack (ia64)
===========================
When logging unaligned accesses, controls whether the stack is
dumped.
= ===================================================
0 Do not dump the stack. This is the default setting.
1 Dump the stack.
= ===================================================
See also `ignore-unaligned-usertrap`_.
unaligned-trap
==============
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
``arc`` and ``parisc``), controls whether unaligned traps are caught
and emulated (instead of failing).
= ========================================================
0 Do not emulate unaligned accesses.
1 Emulate unaligned accesses. This is the default setting.
= ========================================================
See also `ignore-unaligned-usertrap`_.
unknown_nmi_panic
=================
@ -1184,6 +1329,16 @@ NMI switch that most IA32 servers have fires unknown NMI up, for
example. If a system hangs up, try pressing the NMI switch.
unprivileged_bpf_disabled
=========================
Writing 1 to this entry will disable unprivileged calls to ``bpf()``;
once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` will return
``-EPERM``.
Once set, this can't be cleared.
watchdog
========

@ -24,13 +24,13 @@ optional external memory-mapped interface.
Version 1 of the Activity Monitors architecture implements a counter group Version 1 of the Activity Monitors architecture implements a counter group
of four fixed and architecturally defined 64-bit event counters. of four fixed and architecturally defined 64-bit event counters.
- CPU cycle counter: increments at the frequency of the CPU. - CPU cycle counter: increments at the frequency of the CPU.
- Constant counter: increments at the fixed frequency of the system - Constant counter: increments at the fixed frequency of the system
clock. clock.
- Instructions retired: increments with every architecturally executed - Instructions retired: increments with every architecturally executed
instruction. instruction.
- Memory stall cycles: counts instruction dispatch stall cycles caused by - Memory stall cycles: counts instruction dispatch stall cycles caused by
misses in the last level cache within the clock domain. misses in the last level cache within the clock domain.
When in WFI or WFE these counters do not increment. When in WFI or WFE these counters do not increment.
@ -59,11 +59,11 @@ counters, only the presence of the extension.
Firmware (code running at higher exception levels, e.g. arm-tf) support is Firmware (code running at higher exception levels, e.g. arm-tf) support is
needed to: needed to:
- Enable access for lower exception levels (EL2 and EL1) to the AMU - Enable access for lower exception levels (EL2 and EL1) to the AMU
registers. registers.
- Enable the counters. If not enabled these will read as 0. - Enable the counters. If not enabled these will read as 0.
- Save/restore the counters before/after the CPU is being put/brought up - Save/restore the counters before/after the CPU is being put/brought up
from the 'off' power state. from the 'off' power state.
When using kernels that have this feature enabled but boot with broken When using kernels that have this feature enabled but boot with broken
firmware the user may experience panics or lockups when accessing the firmware the user may experience panics or lockups when accessing the
@ -81,10 +81,10 @@ are not trapped in EL2/EL3.
The fixed counters of AMUv1 are accessible through the following system The fixed counters of AMUv1 are accessible through the following system
register definitions: register definitions:
- SYS_AMEVCNTR0_CORE_EL0 - SYS_AMEVCNTR0_CORE_EL0
- SYS_AMEVCNTR0_CONST_EL0 - SYS_AMEVCNTR0_CONST_EL0
- SYS_AMEVCNTR0_INST_RET_EL0 - SYS_AMEVCNTR0_INST_RET_EL0
- SYS_AMEVCNTR0_MEM_STALL_EL0 - SYS_AMEVCNTR0_MEM_STALL_EL0
Auxiliary platform specific counters can be accessed using Auxiliary platform specific counters can be accessed using
SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15. SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15.
@ -97,9 +97,9 @@ Userspace access
Currently, access from userspace to the AMU registers is disabled due to: Currently, access from userspace to the AMU registers is disabled due to:
- Security reasons: they might expose information about code executed in - Security reasons: they might expose information about code executed in
secure mode. secure mode.
- Purpose: AMU counters are intended for system management use. - Purpose: AMU counters are intended for system management use.
Also, the presence of the feature is not visible to userspace. Also, the presence of the feature is not visible to userspace.
@ -110,8 +110,8 @@ Virtualization
Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM
guest side is disabled due to: guest side is disabled due to:
- Security reasons: they might expose information about code executed - Security reasons: they might expose information about code executed
by other guests or the host. by other guests or the host.
Any attempt to access the AMU registers will result in an UNDEFINED Any attempt to access the AMU registers will result in an UNDEFINED
exception being injected into the guest. exception being injected into the guest.

@ -173,8 +173,10 @@ Before jumping into the kernel, the following conditions must be met:
- Caches, MMUs - Caches, MMUs
The MMU must be off. The MMU must be off.
The instruction cache may be on or off, and must not hold any stale The instruction cache may be on or off, and must not hold any stale
entries corresponding to the loaded kernel image. entries corresponding to the loaded kernel image.
The address range corresponding to the loaded kernel image must be The address range corresponding to the loaded kernel image must be
cleaned to the PoC. In the presence of a system cache or other cleaned to the PoC. In the presence of a system cache or other
coherent masters with caches enabled, this will typically require coherent masters with caches enabled, this will typically require
@ -239,6 +241,7 @@ Before jumping into the kernel, the following conditions must be met:
- The DT or ACPI tables must describe a GICv2 interrupt controller. - The DT or ACPI tables must describe a GICv2 interrupt controller.
For CPUs with pointer authentication functionality: For CPUs with pointer authentication functionality:
- If EL3 is present: - If EL3 is present:
- SCR_EL3.APK (bit 16) must be initialised to 0b1 - SCR_EL3.APK (bit 16) must be initialised to 0b1
@ -250,18 +253,22 @@ Before jumping into the kernel, the following conditions must be met:
- HCR_EL2.API (bit 41) must be initialised to 0b1 - HCR_EL2.API (bit 41) must be initialised to 0b1
For CPUs with Activity Monitors Unit v1 (AMUv1) extension present: For CPUs with Activity Monitors Unit v1 (AMUv1) extension present:
- If EL3 is present: - If EL3 is present:

    - CPTR_EL3.TAM (bit 30) must be initialised to 0b0
    - CPTR_EL2.TAM (bit 30) must be initialised to 0b0
    - AMCNTENSET0_EL0 must be initialised to 0b1111
    - AMCNTENSET1_EL0 must be initialised to a platform specific value
      having 0b1 set for the corresponding bit for each of the auxiliary
      counters present.

- If the kernel is entered at EL1:

    - AMCNTENSET0_EL0 must be initialised to 0b1111
    - AMCNTENSET1_EL0 must be initialised to a platform specific value
      having 0b1 set for the corresponding bit for each of the auxiliary
      counters present.
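The AMUv1 register requirements listed above amount to a handful of bit tests. The sketch below is only a model for checking values by hand. The function names and the ``aux_present_mask`` parameter (one bit per auxiliary counter the platform implements) are assumptions made for this example, and nothing here reads real system registers.

```python
TAM_BIT = 30                  # CPTR_EL3.TAM / CPTR_EL2.TAM bit position
FIXED_COUNTERS_MASK = 0b1111  # the four fixed AMUv1 counters


def tam_clear(cptr):
    """True when the TAM bit (bit 30) of a CPTR_ELx value is 0b0."""
    return ((cptr >> TAM_BIT) & 1) == 0


def amu_requirements_met(amcntenset0, amcntenset1, aux_present_mask,
                         cptr_el3=None, cptr_el2=None):
    """Check register values against the boot requirements listed above.

    Pass cptr_el3 / cptr_el2 as None to skip a check that does not apply.
    """
    if cptr_el3 is not None and not tam_clear(cptr_el3):
        return False
    if cptr_el2 is not None and not tam_clear(cptr_el2):
        return False
    # All four fixed counters must be enabled.
    if (amcntenset0 & FIXED_COUNTERS_MASK) != FIXED_COUNTERS_MASK:
        return False
    # Every auxiliary counter that is present must be enabled.
    return (amcntenset1 & aux_present_mask) == aux_present_mask
```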
The requirements described above for CPU mode, caches, MMUs, architected The requirements described above for CPU mode, caches, MMUs, architected
timers, coherency and system registers apply to all CPUs. All CPUs must timers, coherency and system registers apply to all CPUs. All CPUs must
@ -305,7 +312,8 @@ following manner:
Documentation/devicetree/bindings/arm/psci.yaml. Documentation/devicetree/bindings/arm/psci.yaml.
- Secondary CPU general-purpose register settings - Secondary CPU general-purpose register settings

  - x0 = 0 (reserved for future use)
  - x1 = 0 (reserved for future use)
  - x2 = 0 (reserved for future use)
  - x3 = 0 (reserved for future use)

@ -388,44 +388,6 @@ if major == 1 and minor < 6:
# author, documentclass [howto, manual, or own class]). # author, documentclass [howto, manual, or own class]).
# Sorted in alphabetical order # Sorted in alphabetical order
latex_documents = [ latex_documents = [
('admin-guide/index', 'linux-user.tex', 'Linux Kernel User Documentation',
'The kernel development community', 'manual'),
('core-api/index', 'core-api.tex', 'The kernel core API manual',
'The kernel development community', 'manual'),
('crypto/index', 'crypto-api.tex', 'Linux Kernel Crypto API manual',
'The kernel development community', 'manual'),
('dev-tools/index', 'dev-tools.tex', 'Development tools for the Kernel',
'The kernel development community', 'manual'),
('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide',
'The kernel development community', 'manual'),
('driver-api/index', 'driver-api.tex', 'The kernel driver API manual',
'The kernel development community', 'manual'),
('filesystems/index', 'filesystems.tex', 'Linux Filesystems API',
'The kernel development community', 'manual'),
('admin-guide/ext4', 'ext4-admin-guide.tex', 'ext4 Administration Guide',
'ext4 Community', 'manual'),
('filesystems/ext4/index', 'ext4-data-structures.tex',
'ext4 Data Structures and Algorithms', 'ext4 Community', 'manual'),
('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide',
'The kernel development community', 'manual'),
('input/index', 'linux-input.tex', 'The Linux input driver subsystem',
'The kernel development community', 'manual'),
('kernel-hacking/index', 'kernel-hacking.tex', 'Unreliable Guide To Hacking The Linux Kernel',
'The kernel development community', 'manual'),
('media/index', 'media.tex', 'Linux Media Subsystem Documentation',
'The kernel development community', 'manual'),
('networking/index', 'networking.tex', 'Linux Networking Documentation',
'The kernel development community', 'manual'),
('process/index', 'development-process.tex', 'Linux Kernel Development Documentation',
'The kernel development community', 'manual'),
('security/index', 'security.tex', 'The kernel security subsystem manual',
'The kernel development community', 'manual'),
('sh/index', 'sh.tex', 'SuperH architecture implementation manual',
'The kernel development community', 'manual'),
('sound/index', 'sound.tex', 'Linux Sound Subsystem Documentation',
'The kernel development community', 'manual'),
('userspace-api/index', 'userspace-api.tex', 'The Linux kernel user-space API guide',
'The kernel development community', 'manual'),
] ]
# Add all other index files from Documentation/ subdirectories # Add all other index files from Documentation/ subdirectories
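With the explicit list removed, the entries have to come from scanning the documentation tree. The sketch below shows one way that could work. It is not copied from the real ``conf.py``: the function name ``add_index_documents``, the ``doc_root`` parameter, and the generated title string are assumptions for illustration.

```python
import os


def add_index_documents(latex_documents, doc_root="."):
    """Append a LaTeX document entry for each subdirectory with an index.rst.

    Entries already present in latex_documents are left untouched.
    """
    known = {entry[0] for entry in latex_documents}
    for name in sorted(os.listdir(doc_root)):
        doc = os.path.join(name, "index")
        if doc in known:
            continue
        if os.path.exists(os.path.join(doc_root, doc + ".rst")):
            latex_documents.append((
                doc,
                name + ".tex",
                "Linux %s Documentation" % name.capitalize(),  # assumed title
                "The kernel development community",
                "manual",
            ))
    return latex_documents
```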

@ -18,6 +18,7 @@ it.
kernel-api kernel-api
workqueue workqueue
printk-basics
printk-formats printk-formats
symbol-namespaces symbol-namespaces
@ -30,10 +31,12 @@ Library functionality that is used throughout the kernel.
:maxdepth: 1 :maxdepth: 1
kobject kobject
kref
assoc_array assoc_array
xarray xarray
idr idr
circular-buffers circular-buffers
rbtree
generic-radix-tree generic-radix-tree
packing packing
timekeeping timekeeping
@ -50,6 +53,7 @@ How Linux keeps everything from happening at the same time. See
atomic_ops atomic_ops
refcount-vs-atomic refcount-vs-atomic
irq/index
local_ops local_ops
padata padata
../RCU/index ../RCU/index
@ -78,6 +82,10 @@ more memory-management documentation in :doc:`/vm/index`.
:maxdepth: 1 :maxdepth: 1
memory-allocation memory-allocation
dma-api
dma-api-howto
dma-attributes
dma-isa-lpc
mm-api mm-api
genalloc genalloc
pin_user_pages pin_user_pages
@ -92,6 +100,7 @@ Interfaces for kernel debugging
debug-objects debug-objects
tracepoint tracepoint
debugging-via-ohci1394
Everything else Everything else
=============== ===============

@ -0,0 +1,11 @@
====
IRQs
====
.. toctree::
:maxdepth: 1
concepts
irq-affinity
irq-domain
irqflags-tracing

@ -263,7 +263,8 @@ needs to:
Hierarchy irq_domain is in no way x86 specific, and is heavily used to Hierarchy irq_domain is in no way x86 specific, and is heavily used to
support other architectures, such as ARM, ARM64 etc. support other architectures, such as ARM, ARM64 etc.
Debugging
=========
Most of the internals of the IRQ subsystem are exposed in debugfs by Most of the internals of the IRQ subsystem are exposed in debugfs by
turning CONFIG_GENERIC_IRQ_DEBUGFS on. turning CONFIG_GENERIC_IRQ_DEBUGFS on.

@ -80,11 +80,11 @@ what is the pointer to the containing structure? You must avoid tricks
(such as assuming that the kobject is at the beginning of the structure) (such as assuming that the kobject is at the beginning of the structure)
and, instead, use the container_of() macro, found in ``<linux/kernel.h>``:: and, instead, use the container_of() macro, found in ``<linux/kernel.h>``::
container_of(pointer, type, member) container_of(ptr, type, member)
where: where:
* ``pointer`` is the pointer to the embedded kobject, * ``ptr`` is the pointer to the embedded kobject,
* ``type`` is the type of the containing structure, and * ``type`` is the type of the containing structure, and
* ``member`` is the name of the structure field to which ``pointer`` points. * ``member`` is the name of the structure field to which ``pointer`` points.
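The pointer arithmetic behind container_of() can be tried outside the kernel. The sketch below uses Python's ``ctypes`` purely as a stand-in for C structures: it subtracts the member's offset from the member's address to recover the containing structure. The names ``Kobject`` and ``UioMap`` and their fields are hypothetical, and this is a model of the idea, not the kernel macro itself.

```python
import ctypes


class Kobject(ctypes.Structure):
    # Stand-in for struct kobject; the single field is illustrative only.
    _fields_ = [("refcount", ctypes.c_int)]


class UioMap(ctypes.Structure):
    # A containing structure with the kobject embedded after another field.
    _fields_ = [("size", ctypes.c_ulong), ("kobj", Kobject)]


def container_of(ptr, type_, member):
    """Return the type_ instance that contains `member` at address `ptr`."""
    offset = getattr(type_, member).offset
    return type_.from_address(ptr - offset)


uio_map = UioMap(size=4096)
kobj_ptr = ctypes.addressof(uio_map.kobj)  # address of the embedded member
recovered = container_of(kobj_ptr, UioMap, "kobj")
```

Here ``recovered`` refers to the same memory as ``uio_map``, so ``recovered.size`` reads back the value stored in the original structure.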
@ -140,7 +140,7 @@ the name of the kobject, call kobject_rename()::
int kobject_rename(struct kobject *kobj, const char *new_name); int kobject_rename(struct kobject *kobj, const char *new_name);
kobject_rename does not perform any locking or have a solid notion of kobject_rename() does not perform any locking or have a solid notion of
what names are valid so the caller must provide their own sanity checking what names are valid so the caller must provide their own sanity checking
and serialization. and serialization.
@ -210,7 +210,7 @@ statically and will warn the developer of this improper usage.
If all that you want to use a kobject for is to provide a reference counter If all that you want to use a kobject for is to provide a reference counter
for your structure, please use the struct kref instead; a kobject would be for your structure, please use the struct kref instead; a kobject would be
overkill. For more information on how to use struct kref, please see the overkill. For more information on how to use struct kref, please see the
file Documentation/kref.txt in the Linux kernel source tree. file Documentation/core-api/kref.rst in the Linux kernel source tree.
Creating "simple" kobjects Creating "simple" kobjects
@ -222,17 +222,17 @@ ksets, show and store functions, and other details. This is the one
exception where a single kobject should be created. To create such an exception where a single kobject should be created. To create such an
entry, use the function:: entry, use the function::
struct kobject *kobject_create_and_add(char *name, struct kobject *parent); struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);
This function will create a kobject and place it in sysfs in the location This function will create a kobject and place it in sysfs in the location
underneath the specified parent kobject. To create simple attributes underneath the specified parent kobject. To create simple attributes
associated with this kobject, use:: associated with this kobject, use::
int sysfs_create_file(struct kobject *kobj, struct attribute *attr); int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
or:: or::
int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);
Both types of attributes used here, with a kobject that has been created Both types of attributes used here, with a kobject that has been created
with the kobject_create_and_add(), can be of type kobj_attribute, so no with the kobject_create_and_add(), can be of type kobj_attribute, so no
@ -300,8 +300,10 @@ kobj_type::
void (*release)(struct kobject *kobj); void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops; const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs; struct attribute **default_attrs;
const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj); const void *(*namespace)(struct kobject *kobj);
void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid);
}; };
This structure is used to describe a particular type of kobject (or, more This structure is used to describe a particular type of kobject (or, more
@ -352,12 +354,12 @@ created and never declared statically or on the stack. To create a new
kset use:: kset use::
struct kset *kset_create_and_add(const char *name, struct kset *kset_create_and_add(const char *name,
struct kset_uevent_ops *u, const struct kset_uevent_ops *uevent_ops,
struct kobject *parent); struct kobject *parent_kobj);
When you are finished with the kset, call:: When you are finished with the kset, call::
void kset_unregister(struct kset *kset); void kset_unregister(struct kset *k);
to destroy it. This removes the kset from sysfs and decrements its reference to destroy it. This removes the kset from sysfs and decrements its reference
count. When the reference count goes to zero, the kset will be released. count. When the reference count goes to zero, the kset will be released.
@ -371,9 +373,9 @@ If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it:: associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops { struct kset_uevent_ops {
int (*filter)(struct kset *kset, struct kobject *kobj); int (* const filter)(struct kset *kset, struct kobject *kobj);
const char *(*name)(struct kset *kset, struct kobject *kobj); const char *(* const name)(struct kset *kset, struct kobject *kobj);
int (*uevent)(struct kset *kset, struct kobject *kobj, int (* const uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env); struct kobj_uevent_env *env);
}; };

@ -0,0 +1,115 @@
.. SPDX-License-Identifier: GPL-2.0
===========================
Message logging with printk
===========================
printk() is one of the most widely known functions in the Linux kernel. It's the
standard tool we have for printing messages and usually the most basic way of
tracing and debugging. If you're familiar with printf(3) you can tell printk()
is based on it, although it has some functional differences:
- printk() messages can specify a log level.
- the format string, while largely compatible with C99, doesn't follow the
exact same specification. It has some extensions and a few limitations
(no ``%n`` or floating point conversion specifiers). See :ref:`How to get
printk format specifiers right <printk-specifiers>`.
All printk() messages are printed to the kernel log buffer, which is a ring
buffer exported to userspace through /dev/kmsg. The usual way to read it is
using ``dmesg``.
printk() is typically used like this::
printk(KERN_INFO "Message: %s\n", arg);
where ``KERN_INFO`` is the log level (note that it's concatenated to the format
string; the log level is not a separate argument). The available log levels are:
+----------------+--------+-----------------------------------------------+
| Name | String | Alias function |
+================+========+===============================================+
| KERN_EMERG | "0" | pr_emerg() |
+----------------+--------+-----------------------------------------------+
| KERN_ALERT | "1" | pr_alert() |
+----------------+--------+-----------------------------------------------+
| KERN_CRIT | "2" | pr_crit() |
+----------------+--------+-----------------------------------------------+
| KERN_ERR | "3" | pr_err() |
+----------------+--------+-----------------------------------------------+
| KERN_WARNING | "4" | pr_warn() |
+----------------+--------+-----------------------------------------------+
| KERN_NOTICE | "5" | pr_notice() |
+----------------+--------+-----------------------------------------------+
| KERN_INFO | "6" | pr_info() |
+----------------+--------+-----------------------------------------------+
| KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined |
+----------------+--------+-----------------------------------------------+
| KERN_DEFAULT | "" | |
+----------------+--------+-----------------------------------------------+
| KERN_CONT | "c" | pr_cont() |
+----------------+--------+-----------------------------------------------+
The log level specifies the importance of a message. The kernel decides whether
to show the message immediately (printing it to the current console) depending
on its log level and the current *console_loglevel* (a kernel variable). If the
message priority is higher (lower log level value) than the *console_loglevel*
the message will be printed to the console.
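The comparison described above can be written down as a small model. This is a simplified sketch of the rule as stated in this document, not the kernel's implementation; ``LOG_LEVELS`` and ``shown_on_console`` are names invented for the example, and the numeric values come from the table of log levels.

```python
# Numeric log levels, taken from the table above.
LOG_LEVELS = {
    "KERN_EMERG": 0,
    "KERN_ALERT": 1,
    "KERN_CRIT": 2,
    "KERN_ERR": 3,
    "KERN_WARNING": 4,
    "KERN_NOTICE": 5,
    "KERN_INFO": 6,
    "KERN_DEBUG": 7,
}


def shown_on_console(level, console_loglevel):
    """True when a message at `level` is printed to the console.

    A message is shown when its priority is higher, i.e. its numeric
    level is lower, than the current console_loglevel.
    """
    return level < console_loglevel
```

With a console_loglevel of 4, for instance, a ``KERN_ERR`` message is shown but a ``KERN_WARNING`` message is not; with 8, every level in the table is shown.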
If the log level is omitted, the message is printed with ``KERN_DEFAULT``
level.
You can check the current *console_loglevel* with::
$ cat /proc/sys/kernel/printk
4 4 1 7
The result shows the *current*, *default*, *minimum* and *boot-time-default* log
levels.
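A script that needs these four values could split the line into named fields. The sketch below assumes the four-field layout shown above; the names ``PrintkLevels`` and ``parse_printk_sysctl`` are hypothetical.

```python
from collections import namedtuple

# Field order follows the description above.
PrintkLevels = namedtuple("PrintkLevels",
                          ["current", "default", "minimum", "boot_default"])


def parse_printk_sysctl(text):
    """Parse the contents of /proc/sys/kernel/printk into named integers."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return PrintkLevels(*(int(field) for field in fields))
```

Applied to the sample output above, ``parse_printk_sysctl("4 4 1 7").current`` is 4.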
To change the current console_loglevel simply write the desired level to
``/proc/sys/kernel/printk``. For example, to print all messages to the console::
# echo 8 > /proc/sys/kernel/printk
Another way, using ``dmesg``::
# dmesg -n 5
sets the console_loglevel to print KERN_WARNING (4) or more severe messages to
console. See ``dmesg(1)`` for more information.
As an alternative to printk() you can use the ``pr_*()`` aliases for
logging. This family of macros embeds the log level in the macro names. For
example::
pr_info("Info message no. %d\n", msg_num);
prints a ``KERN_INFO`` message.
Besides being more concise than the equivalent printk() calls, they can use a
common definition for the format string through the pr_fmt() macro. For
instance, defining this at the top of a source file (before any ``#include``
directive)::
#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
would prefix every pr_*() message in that file with the module and function name
that originated the message.
For debugging purposes there are also two conditionally-compiled macros:
pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or
also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined.
Function reference
==================
.. kernel-doc:: kernel/printk/printk.c
:functions: printk
.. kernel-doc:: include/linux/printk.h
:functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info
pr_fmt pr_debug pr_devel pr_cont

@ -2,6 +2,8 @@
How to get printk format specifiers right How to get printk format specifiers right
========================================= =========================================
.. _printk-specifiers:
:Author: Randy Dunlap <rdunlap@infradead.org> :Author: Randy Dunlap <rdunlap@infradead.org>
:Author: Andrew Murray <amurray@mpc-data.co.uk> :Author: Andrew Murray <amurray@mpc-data.co.uk>

@ -6,7 +6,7 @@ Documentation subsystem maintainer entry profile
The documentation "subsystem" is the central coordinating point for the The documentation "subsystem" is the central coordinating point for the
kernel's documentation and associated infrastructure. It covers the kernel's documentation and associated infrastructure. It covers the
hierarchy under Documentation/ (with the exception of hierarchy under Documentation/ (with the exception of
Documentation/device-tree), various utilities under scripts/ and, at least Documentation/devicetree), various utilities under scripts/ and, at least
some of the time, LICENSES/. some of the time, LICENSES/.
It's worth noting, though, that the boundaries of this subsystem are rather It's worth noting, though, that the boundaries of this subsystem are rather

@ -11,7 +11,7 @@ course not limited to GPU use cases.
The three main components of this are: (1) dma-buf, representing a The three main components of this are: (1) dma-buf, representing a
sg_table and exposed to userspace as a file descriptor to allow passing sg_table and exposed to userspace as a file descriptor to allow passing
between devices, (2) fence, which provides a mechanism to signal when between devices, (2) fence, which provides a mechanism to signal when
one device as finished access, and (3) reservation, which manages the one device has finished access, and (3) reservation, which manages the
shared or exclusive fence(s) associated with the buffer. shared or exclusive fence(s) associated with the buffer.
Shared DMA Buffers Shared DMA Buffers
@ -31,7 +31,7 @@ The exporter
- implements and manages operations in :c:type:`struct dma_buf_ops - implements and manages operations in :c:type:`struct dma_buf_ops
<dma_buf_ops>` for the buffer, <dma_buf_ops>` for the buffer,
- allows other users to share the buffer by using dma_buf sharing APIs, - allows other users to share the buffer by using dma_buf sharing APIs,
- manages the details of buffer allocation, wrapped int a :c:type:`struct - manages the details of buffer allocation, wrapped in a :c:type:`struct
dma_buf <dma_buf>`, dma_buf <dma_buf>`,
- decides about the actual backing storage where this allocation happens, - decides about the actual backing storage where this allocation happens,
- and takes care of any migration of scatterlist - for all (shared) users of - and takes care of any migration of scatterlist - for all (shared) users of

@ -50,10 +50,10 @@ Attributes
Attributes of devices can be exported by a device driver through sysfs. Attributes of devices can be exported by a device driver through sysfs.
Please see Documentation/filesystems/sysfs.txt for more information Please see Documentation/filesystems/sysfs.rst for more information
on how sysfs works. on how sysfs works.
As explained in Documentation/kobject.txt, device attributes must be As explained in Documentation/core-api/kobject.rst, device attributes must be
created before the KOBJ_ADD uevent is generated. The only way to realize created before the KOBJ_ADD uevent is generated. The only way to realize
that is by defining an attribute group. that is by defining an attribute group.

@ -121,4 +121,4 @@ device-specific data or tunable interfaces.
More information about the sysfs directory layout can be found in More information about the sysfs directory layout can be found in
the other documents in this directory and in the file the other documents in this directory and in the file
Documentation/filesystems/sysfs.txt. Documentation/filesystems/sysfs.rst.

@ -39,6 +39,7 @@ available subsections can be seen below.
spi spi
i2c i2c
ipmb ipmb
ipmi
i3c/index i3c/index
interconnect interconnect
devfreq devfreq

@ -278,8 +278,8 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).
be contiguous in DPA-space. be contiguous in DPA-space.
This bus is provided by the kernel under the device This bus is provided by the kernel under the device
/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the tools/testing/nvdimm is loaded. This not only tests LIBNVDIMM but the
acpi_nfit.ko driver as well. acpi_nfit.ko driver as well.

@ -1,3 +1,6 @@
================
CPU Idle Cooling
================
Situation: Situation:
---------- ----------

@ -8,6 +8,7 @@ Thermal
:maxdepth: 1 :maxdepth: 1
cpu-cooling-api cpu-cooling-api
cpu-idle-cooling
sysfs-api sysfs-api
power_allocator power_allocator

@ -23,7 +23,7 @@
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | TODO |
| powerpc: | ok | | powerpc: | ok |
| riscv: | TODO | | riscv: | ok |
| s390: | ok | | s390: | ok |
| sh: | TODO | | sh: | TODO |
| sparc: | ok | | sparc: | ok |

@ -22,9 +22,9 @@
| nios2: | TODO | | nios2: | TODO |
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | TODO |
| powerpc: | TODO | | powerpc: | ok |
| riscv: | TODO | | riscv: | ok |
| s390: | TODO | | s390: | ok |
| sh: | TODO | | sh: | TODO |
| sparc: | TODO | | sparc: | TODO |
| um: | TODO | | um: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |

@ -11,7 +11,7 @@
| arm: | TODO | | arm: | TODO |
| arm64: | TODO | | arm64: | TODO |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | ok | | ia64: | ok |
@ -23,7 +23,7 @@
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | ok | | parisc: | ok |
| powerpc: | ok | | powerpc: | ok |
| riscv: | ok | | riscv: | TODO |
| s390: | ok | | s390: | ok |
| sh: | ok | | sh: | ok |
| sparc: | ok | | sparc: | ok |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | ok | | ia64: | ok |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |

@ -16,7 +16,7 @@
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |
| m68k: | TODO | | m68k: | TODO |
| microblaze: | TODO | | microblaze: | ok |
| mips: | ok | | mips: | ok |
| nds32: | TODO | | nds32: | TODO |
| nios2: | TODO | | nios2: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | ok | | hexagon: | ok |
| ia64: | TODO | | ia64: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | ok | | hexagon: | ok |
| ia64: | TODO | | ia64: | TODO |
@ -21,7 +21,7 @@
| nds32: | ok | | nds32: | ok |
| nios2: | TODO | | nios2: | TODO |
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | ok |
| powerpc: | ok | | powerpc: | ok |
| riscv: | TODO | | riscv: | TODO |
| s390: | ok | | s390: | ok |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |
@ -23,7 +23,7 @@
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | TODO |
| powerpc: | ok | | powerpc: | ok |
| riscv: | TODO | | riscv: | ok |
| s390: | ok | | s390: | ok |
| sh: | TODO | | sh: | TODO |
| sparc: | TODO | | sparc: | TODO |

@ -11,7 +11,7 @@
| arm: | ok | | arm: | ok |
| arm64: | ok | | arm64: | ok |
| c6x: | TODO | | c6x: | TODO |
| csky: | TODO | | csky: | ok |
| h8300: | TODO | | h8300: | TODO |
| hexagon: | TODO | | hexagon: | TODO |
| ia64: | TODO | | ia64: | TODO |
@ -23,7 +23,7 @@
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | TODO |
| powerpc: | ok | | powerpc: | ok |
| riscv: | TODO | | riscv: | ok |
| s390: | ok | | s390: | ok |
| sh: | TODO | | sh: | TODO |
| sparc: | TODO | | sparc: | TODO |

@ -23,7 +23,7 @@
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | ok | | parisc: | ok |
| powerpc: | ok | | powerpc: | ok |
| riscv: | TODO | | riscv: | ok |
| s390: | ok | | s390: | ok |
| sh: | TODO | | sh: | TODO |
| sparc: | TODO | | sparc: | TODO |

@ -22,7 +22,7 @@
| nios2: | TODO | | nios2: | TODO |
| openrisc: | TODO | | openrisc: | TODO |
| parisc: | TODO | | parisc: | TODO |
| powerpc: | TODO | | powerpc: | ok |
| riscv: | TODO | | riscv: | TODO |
| s390: | TODO | | s390: | TODO |
| sh: | TODO | | sh: | TODO |

@ -17,7 +17,7 @@
| ia64: | TODO | | ia64: | TODO |
| m68k: | TODO | | m68k: | TODO |
| microblaze: | TODO | | microblaze: | TODO |
| mips: | TODO | | mips: | ok |
| nds32: | TODO | | nds32: | TODO |
| nios2: | TODO | | nios2: | TODO |
| openrisc: | TODO | | openrisc: | TODO |

@ -192,4 +192,4 @@ For more information on the Plan 9 Operating System check out
http://plan9.bell-labs.com/plan9 http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/

@ -1,3 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
=================
Automount Support
=================
Support is available for filesystems that wish to do automounting Support is available for filesystems that wish to do automounting
support (such as kAFS which can be found in fs/afs/ and NFS in support (such as kAFS which can be found in fs/afs/ and NFS in
fs/nfs/). This facility includes allowing in-kernel mounts to be fs/nfs/). This facility includes allowing in-kernel mounts to be
@ -5,13 +12,12 @@ performed and mountpoint degradation to be requested. The latter can
also be requested by userspace. also be requested by userspace.
====================== In-Kernel Automounting
IN-KERNEL AUTOMOUNTING
====================== ======================
See section "Mount Traps" of Documentation/filesystems/autofs.rst See section "Mount Traps" of Documentation/filesystems/autofs.rst
Then from userspace, you can just do something like: Then from userspace, you can just do something like::
[root@andromeda root]# mount -t afs \#root.afs. /afs [root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs [root@andromeda root]# ls /afs
@ -21,7 +27,7 @@ Then from userspace, you can just do something like:
[root@andromeda root]# ls /afs/cambridge/afsdoc/ [root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2 ChangeLog html LICENSE pdf RELNOTES-1.2.2
And then if you look in the mountpoint catalogue, you'll see something like: And then if you look in the mountpoint catalogue, you'll see something like::
[root@andromeda root]# cat /proc/mounts [root@andromeda root]# cat /proc/mounts
... ...
@ -30,8 +36,7 @@ And then if you look in the mountpoint catalogue, you'll see something like:
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
=========================== Automatic Mountpoint Expiry
AUTOMATIC MOUNTPOINT EXPIRY
=========================== ===========================
Automatic expiration of mountpoints is easy, provided you've mounted the Automatic expiration of mountpoints is easy, provided you've mounted the
@ -43,7 +48,8 @@ To do expiration, you need to follow these steps:
hung. hung.
(2) When a new mountpoint is created in the ->d_automount method, add (2) When a new mountpoint is created in the ->d_automount method, add
the mnt to the list using mnt_set_expiry() the mnt to the list using mnt_set_expiry()::
mnt_set_expiry(newmnt, &afs_vfsmounts); mnt_set_expiry(newmnt, &afs_vfsmounts);
(3) When you want mountpoints to be expired, call mark_mounts_for_expiry() (3) When you want mountpoints to be expired, call mark_mounts_for_expiry()
@ -70,8 +76,7 @@ and the copies of those that are on an expiration list will be added to the
same expiration list. same expiration list.
======================= Userspace Driven Expiry
USERSPACE DRIVEN EXPIRY
======================= =======================
As an alternative, it is possible for userspace to request expiry of any As an alternative, it is possible for userspace to request expiry of any

@ -1,6 +1,8 @@
========================== .. SPDX-License-Identifier: GPL-2.0
FS-CACHE CACHE BACKEND API
========================== ==========================
FS-Cache Cache backend API
==========================
The FS-Cache system provides an API by which actual caches can be supplied to The FS-Cache system provides an API by which actual caches can be supplied to
FS-Cache for it to then serve out to network filesystems and other interested FS-Cache for it to then serve out to network filesystems and other interested
@ -9,15 +11,14 @@ parties.
This API is declared in <linux/fscache-cache.h>. This API is declared in <linux/fscache-cache.h>.
==================================== Initialising and Registering a Cache
INITIALISING AND REGISTERING A CACHE
==================================== ====================================
To start off, a cache definition must be initialised and registered for each To start off, a cache definition must be initialised and registered for each
cache the backend wants to make available. For instance, CacheFS does this in cache the backend wants to make available. For instance, CacheFS does this in
the fill_super() operation on mounting. the fill_super() operation on mounting.
The cache definition (struct fscache_cache) should be initialised by calling: The cache definition (struct fscache_cache) should be initialised by calling::
void fscache_init_cache(struct fscache_cache *cache, void fscache_init_cache(struct fscache_cache *cache,
struct fscache_cache_ops *ops, struct fscache_cache_ops *ops,
@ -26,17 +27,17 @@ The cache definition (struct fscache_cache) should be initialised by calling:
Where: Where:
(*) "cache" is a pointer to the cache definition; * "cache" is a pointer to the cache definition;
(*) "ops" is a pointer to the table of operations that the backend supports on * "ops" is a pointer to the table of operations that the backend supports on
this cache; and this cache; and
(*) "idfmt" is a format and printf-style arguments for constructing a label * "idfmt" is a format and printf-style arguments for constructing a label
for the cache. for the cache.
The cache should then be registered with FS-Cache by passing a pointer to the The cache should then be registered with FS-Cache by passing a pointer to the
previously initialised cache definition to: previously initialised cache definition to::
int fscache_add_cache(struct fscache_cache *cache, int fscache_add_cache(struct fscache_cache *cache,
struct fscache_object *fsdef, struct fscache_object *fsdef,
@ -44,12 +45,12 @@ previously initialised cache definition to:
Two extra arguments should also be supplied: Two extra arguments should also be supplied:
(*) "fsdef" which should point to the object representation for the FS-Cache * "fsdef" which should point to the object representation for the FS-Cache
master index in this cache. Netfs primary index entries will be created master index in this cache. Netfs primary index entries will be created
here. FS-Cache keeps the caller's reference to the index object if here. FS-Cache keeps the caller's reference to the index object if
successful and will release it upon withdrawal of the cache. successful and will release it upon withdrawal of the cache.
(*) "tagname" which, if given, should be a text string naming this cache. If * "tagname" which, if given, should be a text string naming this cache. If
this is NULL, the identifier will be used instead. For CacheFS, the this is NULL, the identifier will be used instead. For CacheFS, the
identifier is set to name the underlying block device and the tag can be identifier is set to name the underlying block device and the tag can be
supplied by mount. supplied by mount.
@ -58,20 +59,18 @@ This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
is already in use. 0 will be returned on success. is already in use. 0 will be returned on success.
===================== Unregistering a Cache
UNREGISTERING A CACHE
===================== =====================
A cache can be withdrawn from the system by calling this function with a A cache can be withdrawn from the system by calling this function with a
pointer to the cache definition: pointer to the cache definition::
void fscache_withdraw_cache(struct fscache_cache *cache); void fscache_withdraw_cache(struct fscache_cache *cache);
In CacheFS's case, this is called by put_super(). In CacheFS's case, this is called by put_super().
======== Security
SECURITY
======== ========
The cache methods are executed one of two contexts: The cache methods are executed one of two contexts:
@ -89,8 +88,7 @@ be masqueraded for the duration of the cache driver's access to the cache.
This is left to the cache to handle; FS-Cache makes no effort in this regard. This is left to the cache to handle; FS-Cache makes no effort in this regard.
=================================== Control and Statistics Presentation
CONTROL AND STATISTICS PRESENTATION
=================================== ===================================
The cache may present data to the outside world through FS-Cache's interfaces The cache may present data to the outside world through FS-Cache's interfaces
@ -101,11 +99,10 @@ is enabled. This is accessible through the kobject struct fscache_cache::kobj
and is for use by the cache as it sees fit. and is for use by the cache as it sees fit.
======================== Relevant Data Structures
RELEVANT DATA STRUCTURES
======================== ========================
(*) Index/Data file FS-Cache representation cookie: * Index/Data file FS-Cache representation cookie::
struct fscache_cookie { struct fscache_cookie {
struct fscache_object_def *def; struct fscache_object_def *def;
@ -121,7 +118,7 @@ RELEVANT DATA STRUCTURES
cache operations. cache operations.
(*) In-cache object representation: * In-cache object representation::
struct fscache_object { struct fscache_object {
int debug_id; int debug_id;
@ -150,7 +147,7 @@ RELEVANT DATA STRUCTURES
initialised by calling fscache_object_init(object). initialised by calling fscache_object_init(object).
(*) FS-Cache operation record: * FS-Cache operation record::
struct fscache_operation { struct fscache_operation {
atomic_t usage; atomic_t usage;
@ -173,7 +170,7 @@ RELEVANT DATA STRUCTURES
an operation needs more processing time, it should be enqueued again. an operation needs more processing time, it should be enqueued again.
(*) FS-Cache retrieval operation record: * FS-Cache retrieval operation record::
struct fscache_retrieval { struct fscache_retrieval {
struct fscache_operation op; struct fscache_operation op;
@ -198,7 +195,7 @@ RELEVANT DATA STRUCTURES
it sees fit. it sees fit.
(*) FS-Cache storage operation record: * FS-Cache storage operation record::
struct fscache_storage { struct fscache_storage {
struct fscache_operation op; struct fscache_operation op;
@ -212,16 +209,17 @@ RELEVANT DATA STRUCTURES
storage. storage.
================ Cache Operations
CACHE OPERATIONS
================ ================
The cache backend provides FS-Cache with a table of operations that can be The cache backend provides FS-Cache with a table of operations that can be
performed on the denizens of the cache. These are held in a structure of type: performed on the denizens of the cache. These are held in a structure of type:
struct fscache_cache_ops ::
(*) Name of cache provider [mandatory]: struct fscache_cache_ops
* Name of cache provider [mandatory]::
const char *name const char *name
@ -229,7 +227,7 @@ performed on the denizens of the cache. These are held in a structure of type:
the backend. the backend.
(*) Allocate a new object [mandatory]: * Allocate a new object [mandatory]::
struct fscache_object *(*alloc_object)(struct fscache_cache *cache, struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
struct fscache_cookie *cookie) struct fscache_cookie *cookie)
@ -244,7 +242,7 @@ performed on the denizens of the cache. These are held in a structure of type:
form once lookup is complete or aborted. form once lookup is complete or aborted.
(*) Look up and create object [mandatory]: * Look up and create object [mandatory]::
void (*lookup_object)(struct fscache_object *object) void (*lookup_object)(struct fscache_object *object)
@ -263,7 +261,7 @@ performed on the denizens of the cache. These are held in a structure of type:
to abort the lookup of that object. to abort the lookup of that object.
(*) Release lookup data [mandatory]: * Release lookup data [mandatory]::
void (*lookup_complete)(struct fscache_object *object) void (*lookup_complete)(struct fscache_object *object)
@ -271,7 +269,7 @@ performed on the denizens of the cache. These are held in a structure of type:
using to perform a lookup. using to perform a lookup.
(*) Increment object refcount [mandatory]: * Increment object refcount [mandatory]::
struct fscache_object *(*grab_object)(struct fscache_object *object) struct fscache_object *(*grab_object)(struct fscache_object *object)
@ -280,7 +278,7 @@ performed on the denizens of the cache. These are held in a structure of type:
It should return the object pointer if successful. It should return the object pointer if successful.
(*) Lock/Unlock object [mandatory]: * Lock/Unlock object [mandatory]::
void (*lock_object)(struct fscache_object *object) void (*lock_object)(struct fscache_object *object)
void (*unlock_object)(struct fscache_object *object) void (*unlock_object)(struct fscache_object *object)
@ -289,7 +287,7 @@ performed on the denizens of the cache. These are held in a structure of type:
to schedule with the lock held, so a spinlock isn't sufficient. to schedule with the lock held, so a spinlock isn't sufficient.
(*) Pin/Unpin object [optional]: * Pin/Unpin object [optional]::
int (*pin_object)(struct fscache_object *object) int (*pin_object)(struct fscache_object *object)
void (*unpin_object)(struct fscache_object *object) void (*unpin_object)(struct fscache_object *object)
@ -299,7 +297,7 @@ performed on the denizens of the cache. These are held in a structure of type:
enough space in the cache to permit this. enough space in the cache to permit this.
(*) Check coherency state of an object [mandatory]: * Check coherency state of an object [mandatory]::
int (*check_consistency)(struct fscache_object *object) int (*check_consistency)(struct fscache_object *object)
@ -308,7 +306,7 @@ performed on the denizens of the cache. These are held in a structure of type:
if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
may also be returned. may also be returned.
(*) Update object [mandatory]: * Update object [mandatory]::
int (*update_object)(struct fscache_object *object) int (*update_object)(struct fscache_object *object)
@ -317,7 +315,7 @@ performed on the denizens of the cache. These are held in a structure of type:
obtained by calling object->cookie->def->get_aux()/get_attr(). obtained by calling object->cookie->def->get_aux()/get_attr().
(*) Invalidate data object [mandatory]: * Invalidate data object [mandatory]::
int (*invalidate_object)(struct fscache_operation *op) int (*invalidate_object)(struct fscache_operation *op)
@ -329,7 +327,7 @@ performed on the denizens of the cache. These are held in a structure of type:
fscache_op_complete() must be called on op before returning. fscache_op_complete() must be called on op before returning.
(*) Discard object [mandatory]: * Discard object [mandatory]::
void (*drop_object)(struct fscache_object *object) void (*drop_object)(struct fscache_object *object)
@ -341,7 +339,7 @@ performed on the denizens of the cache. These are held in a structure of type:
caller. The caller will invoke the put_object() method as appropriate. caller. The caller will invoke the put_object() method as appropriate.
(*) Release object reference [mandatory]: * Release object reference [mandatory]::
void (*put_object)(struct fscache_object *object) void (*put_object)(struct fscache_object *object)
@ -349,7 +347,7 @@ performed on the denizens of the cache. These are held in a structure of type:
be freed when all the references to it are released. be freed when all the references to it are released.
(*) Synchronise a cache [mandatory]: * Synchronise a cache [mandatory]::
void (*sync)(struct fscache_cache *cache) void (*sync)(struct fscache_cache *cache)
@ -357,7 +355,7 @@ performed on the denizens of the cache. These are held in a structure of type:
device. device.
(*) Dissociate a cache [mandatory]: * Dissociate a cache [mandatory]::
void (*dissociate_pages)(struct fscache_cache *cache) void (*dissociate_pages)(struct fscache_cache *cache)
@ -365,7 +363,7 @@ performed on the denizens of the cache. These are held in a structure of type:
cache withdrawal. cache withdrawal.
(*) Notification that the attributes on a netfs file changed [mandatory]: * Notification that the attributes on a netfs file changed [mandatory]::
int (*attr_changed)(struct fscache_object *object); int (*attr_changed)(struct fscache_object *object);
@ -386,7 +384,7 @@ performed on the denizens of the cache. These are held in a structure of type:
execution of this operation. execution of this operation.
(*) Reserve cache space for an object's data [optional]: * Reserve cache space for an object's data [optional]::
int (*reserve_space)(struct fscache_object *object, loff_t size); int (*reserve_space)(struct fscache_object *object, loff_t size);
@ -404,7 +402,7 @@ performed on the denizens of the cache. These are held in a structure of type:
size if larger than that already. size if larger than that already.
(*) Request page be read from cache [mandatory]: * Request page be read from cache [mandatory]::
int (*read_or_alloc_page)(struct fscache_retrieval *op, int (*read_or_alloc_page)(struct fscache_retrieval *op,
struct page *page, struct page *page,
@ -446,7 +444,7 @@ performed on the denizens of the cache. These are held in a structure of type:
with. This will complete the operation when all pages are dealt with. with. This will complete the operation when all pages are dealt with.
(*) Request pages be read from cache [mandatory]: * Request pages be read from cache [mandatory]::
int (*read_or_alloc_pages)(struct fscache_retrieval *op, int (*read_or_alloc_pages)(struct fscache_retrieval *op,
struct list_head *pages, struct list_head *pages,
@ -457,7 +455,7 @@ performed on the denizens of the cache. These are held in a structure of type:
of pages instead of one page. Any pages on which a read operation is of pages instead of one page. Any pages on which a read operation is
started must be added to the page cache for the specified mapping and also started must be added to the page cache for the specified mapping and also
to the LRU. Such pages must also be removed from the pages list and to the LRU. Such pages must also be removed from the pages list and
*nr_pages decremented per page. ``*nr_pages`` decremented per page.
If there was an error such as -ENOMEM, then that should be returned; else If there was an error such as -ENOMEM, then that should be returned; else
if one or more pages couldn't be read or allocated, then -ENOBUFS should if one or more pages couldn't be read or allocated, then -ENOBUFS should
@ -466,7 +464,7 @@ performed on the denizens of the cache. These are held in a structure of type:
returned. returned.
(*) Request page be allocated in the cache [mandatory]: * Request page be allocated in the cache [mandatory]::
int (*allocate_page)(struct fscache_retrieval *op, int (*allocate_page)(struct fscache_retrieval *op,
struct page *page, struct page *page,
@ -482,7 +480,7 @@ performed on the denizens of the cache. These are held in a structure of type:
allocated, then the netfs page should be marked and 0 returned. allocated, then the netfs page should be marked and 0 returned.
(*) Request pages be allocated in the cache [mandatory]: * Request pages be allocated in the cache [mandatory]::
int (*allocate_pages)(struct fscache_retrieval *op, int (*allocate_pages)(struct fscache_retrieval *op,
struct list_head *pages, struct list_head *pages,
@ -493,7 +491,7 @@ performed on the denizens of the cache. These are held in a structure of type:
nr_pages should be treated as for the read_or_alloc_pages() method. nr_pages should be treated as for the read_or_alloc_pages() method.
(*) Request page be written to cache [mandatory]: * Request page be written to cache [mandatory]::
int (*write_page)(struct fscache_storage *op, int (*write_page)(struct fscache_storage *op,
struct page *page); struct page *page);
@ -514,7 +512,7 @@ performed on the denizens of the cache. These are held in a structure of type:
appropriately. appropriately.
(*) Discard retained per-page metadata [mandatory]: * Discard retained per-page metadata [mandatory]::
void (*uncache_page)(struct fscache_object *object, struct page *page) void (*uncache_page)(struct fscache_object *object, struct page *page)
@ -523,13 +521,12 @@ performed on the denizens of the cache. These are held in a structure of type:
maintains for this page. maintains for this page.
================== FS-Cache Utilities
FS-CACHE UTILITIES
================== ==================
FS-Cache provides some utilities that a cache backend may make use of: FS-Cache provides some utilities that a cache backend may make use of:
(*) Note occurrence of an I/O error in a cache: * Note occurrence of an I/O error in a cache::
void fscache_io_error(struct fscache_cache *cache) void fscache_io_error(struct fscache_cache *cache)
@ -541,7 +538,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
This does not actually withdraw the cache. That must be done separately. This does not actually withdraw the cache. That must be done separately.
(*) Invoke the retrieval I/O completion function: * Invoke the retrieval I/O completion function::
void fscache_end_io(struct fscache_retrieval *op, struct page *page, void fscache_end_io(struct fscache_retrieval *op, struct page *page,
int error); int error);
@ -550,8 +547,8 @@ FS-Cache provides some utilities that a cache backend may make use of:
error value should be 0 if successful and an error otherwise. error value should be 0 if successful and an error otherwise.
(*) Record that one or more pages being retrieved or allocated have been dealt * Record that one or more pages being retrieved or allocated have been dealt
with: with::
void fscache_retrieval_complete(struct fscache_retrieval *op, void fscache_retrieval_complete(struct fscache_retrieval *op,
int n_pages); int n_pages);
@ -562,7 +559,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
completed. completed.
(*) Record operation completion: * Record operation completion::
void fscache_op_complete(struct fscache_operation *op); void fscache_op_complete(struct fscache_operation *op);
@ -571,7 +568,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
one or more pending operations to start running. one or more pending operations to start running.
(*) Set highest store limit: * Set highest store limit::
void fscache_set_store_limit(struct fscache_object *object, void fscache_set_store_limit(struct fscache_object *object,
loff_t i_size); loff_t i_size);
@ -581,7 +578,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
rejected by fscache_read_alloc_page() and co with -ENOBUFS. rejected by fscache_read_alloc_page() and co with -ENOBUFS.
(*) Mark pages as being cached: * Mark pages as being cached::
void fscache_mark_pages_cached(struct fscache_retrieval *op, void fscache_mark_pages_cached(struct fscache_retrieval *op,
struct pagevec *pagevec); struct pagevec *pagevec);
@ -590,7 +587,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
the netfs must call fscache_uncache_page() to unmark the pages. the netfs must call fscache_uncache_page() to unmark the pages.
(*) Perform coherency check on an object: * Perform coherency check on an object::
enum fscache_checkaux fscache_check_aux(struct fscache_object *object, enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
const void *data, const void *data,
@ -603,29 +600,26 @@ FS-Cache provides some utilities that a cache backend may make use of:
One of three values will be returned: One of three values will be returned:
(*) FSCACHE_CHECKAUX_OKAY FSCACHE_CHECKAUX_OKAY
The coherency data indicates the object is valid as is. The coherency data indicates the object is valid as is.
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE FSCACHE_CHECKAUX_NEEDS_UPDATE
The coherency data needs updating, but otherwise the object is The coherency data needs updating, but otherwise the object is
valid. valid.
(*) FSCACHE_CHECKAUX_OBSOLETE FSCACHE_CHECKAUX_OBSOLETE
The coherency data indicates that the object is obsolete and should The coherency data indicates that the object is obsolete and should
be discarded. be discarded.
(*) Initialise a freshly allocated object: * Initialise a freshly allocated object::
void fscache_object_init(struct fscache_object *object); void fscache_object_init(struct fscache_object *object);
This initialises all the fields in an object representation. This initialises all the fields in an object representation.
(*) Indicate the destruction of an object: * Indicate the destruction of an object::
void fscache_object_destroyed(struct fscache_cache *cache); void fscache_object_destroyed(struct fscache_cache *cache);
@ -635,7 +629,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
all the objects. all the objects.
(*) Indicate negative lookup on an object: * Indicate negative lookup on an object::
void fscache_object_lookup_negative(struct fscache_object *object); void fscache_object_lookup_negative(struct fscache_object *object);
@ -650,7 +644,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
significant - all subsequent calls are ignored. significant - all subsequent calls are ignored.
(*) Indicate an object has been obtained: * Indicate an object has been obtained::
void fscache_obtained_object(struct fscache_object *object); void fscache_obtained_object(struct fscache_object *object);
@ -667,7 +661,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
(2) that writes may now proceed against this object. (2) that writes may now proceed against this object.
(*) Indicate that object lookup failed: * Indicate that object lookup failed::
void fscache_object_lookup_error(struct fscache_object *object); void fscache_object_lookup_error(struct fscache_object *object);
@ -676,7 +670,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
as possible. as possible.
(*) Indicate that a stale object was found and discarded: * Indicate that a stale object was found and discarded::
void fscache_object_retrying_stale(struct fscache_object *object); void fscache_object_retrying_stale(struct fscache_object *object);
@ -685,7 +679,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
discarded from the cache and the lookup will be performed again. discarded from the cache and the lookup will be performed again.
(*) Indicate that the caching backend killed an object: * Indicate that the caching backend killed an object::
void fscache_object_mark_killed(struct fscache_object *object, void fscache_object_mark_killed(struct fscache_object *object,
enum fscache_why_object_killed why); enum fscache_why_object_killed why);
@ -693,13 +687,20 @@ FS-Cache provides some utilities that a cache backend may make use of:
This is called to indicate that the cache backend preemptively killed an This is called to indicate that the cache backend preemptively killed an
object. The why parameter should be set to indicate the reason: object. The why parameter should be set to indicate the reason:
FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding. FSCACHE_OBJECT_IS_STALE
FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space - the object was stale and needs discarding.
FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished.
FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space. FSCACHE_OBJECT_NO_SPACE
- there was insufficient cache space
FSCACHE_OBJECT_WAS_RETIRED
- the object was retired when relinquished.
FSCACHE_OBJECT_WAS_CULLED
- the object was culled to make space.
(*) Get and release references on a retrieval record: * Get and release references on a retrieval record::
void fscache_get_retrieval(struct fscache_retrieval *op); void fscache_get_retrieval(struct fscache_retrieval *op);
void fscache_put_retrieval(struct fscache_retrieval *op); void fscache_put_retrieval(struct fscache_retrieval *op);
@ -708,7 +709,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
asynchronous data retrieval and block allocation. asynchronous data retrieval and block allocation.
(*) Enqueue a retrieval record for processing. * Enqueue a retrieval record for processing::
void fscache_enqueue_retrieval(struct fscache_retrieval *op); void fscache_enqueue_retrieval(struct fscache_retrieval *op);
@ -718,7 +719,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
within the callback function. within the callback function.
(*) List of object state names: * List of object state names::
const char *fscache_object_states[]; const char *fscache_object_states[];


@ -1,8 +1,10 @@
=============================================== .. SPDX-License-Identifier: GPL-2.0
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
===============================================
Contents: ===============================================
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
===============================================
.. Contents:
(*) Overview. (*) Overview.
@ -27,8 +29,8 @@ Contents:
(*) Debugging. (*) Debugging.
========
OVERVIEW Overview
======== ========
CacheFiles is a caching backend that's meant to use as a cache a directory on CacheFiles is a caching backend that's meant to use as a cache a directory on
@ -58,8 +60,8 @@ spare space and automatically contract when the set of data requires more
space. space.
============
REQUIREMENTS Requirements
============ ============
The use of CacheFiles and its daemon requires the following features to be The use of CacheFiles and its daemon requires the following features to be
@ -79,84 +81,70 @@ It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache. filesystems being used as a cache.
============= Configuration
CONFIGURATION
============= =============
The cache is configured by a script in /etc/cachefilesd.conf. These commands The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up cache ready for use. The following script commands are available: set up cache ready for use. The following script commands are available:
(*) brun <N>% brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>%
(*) bcull <N>%
(*) bstop <N>%
(*) frun <N>%
(*) fcull <N>%
(*) fstop <N>%
Configure the culling limits. Optional. See the section on culling Configure the culling limits. Optional. See the section on culling
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits. beginning with an 'f' are file count limits.
(*) dir <path> dir <path>
Specify the directory containing the root of the cache. Mandatory. Specify the directory containing the root of the cache. Mandatory.
(*) tag <name> tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches. Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles". Optional. The default is "CacheFiles".
(*) debug <mask> debug <mask>
Specify a numeric bitmask to control debugging in the kernel module. Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information: OR'd into the mask to collect various information:
== =================================================
1 Turn on trace of function entry (_enter() macros) 1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros) 2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug()) 4 Turn on trace of internal debug points (_debug())
== =================================================
This mask can also be set through sysfs, eg: This mask can also be set through sysfs, eg::
echo 5 >/sys/modules/cachefiles/parameters/debug echo 5 >/sys/modules/cachefiles/parameters/debug
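The configuration commands listed above can be read with a small parser. The following is a minimal sketch, not the daemon's own parser: the helper name `parse_cachefilesd_conf`, the one-command-per-line layout, and the skipping of blank lines and `#` comments are all assumptions made for illustration. Only the command names, the `<N>%` argument form, the mandatory `dir`, and the default values come from the text.

```python
from typing import Dict

# Limit commands named in the text; each takes an "<N>%" argument.
LIMIT_COMMANDS = ("brun", "bcull", "bstop", "frun", "fcull", "fstop")

# Documented defaults: 7% (run), 5% (cull), 1% (stop), tag "CacheFiles",
# debug mask zero.
DEFAULTS = {
    "brun": 7, "bcull": 5, "bstop": 1,
    "frun": 7, "fcull": 5, "fstop": 1,
    "tag": "CacheFiles", "debug": 0,
}


def parse_cachefilesd_conf(text: str) -> Dict[str, object]:
    """Parse configuration text into a dict (hypothetical helper)."""
    conf: Dict[str, object] = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        command = parts[0]
        arg = parts[1].strip() if len(parts) > 1 else ""
        if command in LIMIT_COMMANDS:
            if not arg.endswith("%"):
                raise ValueError(f"{command}: expected <N>%, got {arg!r}")
            conf[command] = int(arg[:-1])
        elif command in ("dir", "tag"):
            conf[command] = arg
        elif command == "debug":
            conf[command] = int(arg, 0)  # decimal or 0x-prefixed mask
        else:
            raise ValueError(f"unknown command: {command!r}")
    if "dir" not in conf:
        raise ValueError("'dir' is mandatory")
    return conf
```

Any command that is left out keeps its documented default, so a file containing only a `dir` line yields a complete configuration.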
================== Starting the Cache
STARTING THE CACHE
================== ==================
The cache is started by running the daemon. The daemon opens the cache device, The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live. binds to fscache and the cache becomes live.
The daemon is run as follows: The daemon is run as follows::
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
The flags are: The flags are:
(*) -d ``-d``
Increase the debugging level. This can be specified multiple times and Increase the debugging level. This can be specified multiple times and
is cumulative with itself. is cumulative with itself.
(*) -s ``-s``
Send messages to stderr instead of syslog. Send messages to stderr instead of syslog.
(*) -n ``-n``
Don't daemonise and go into background. Don't daemonise and go into background.
(*) -f <configfile> ``-f <configfile>``
Use an alternative configuration file rather than the default one. Use an alternative configuration file rather than the default one.
=============== Things to Avoid
THINGS TO AVOID
=============== ===============
Do not mount other things within the cache as this will cause problems. The Do not mount other things within the cache as this will cause problems. The
@ -179,8 +167,7 @@ Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly. permissions to prevent random users being able to access them directly.
============= Cache Culling
CACHE CULLING
============= =============
The cache may need culling occasionally to make space. This involves The cache may need culling occasionally to make space. This involves
@ -192,27 +179,21 @@ Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six percentage of files available in the underlying filesystem. There are six
"limits": "limits":
(*) brun brun, frun
(*) frun
If the amount of free space and the number of available files in the cache If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off. rises above both these limits, then culling is turned off.
(*) bcull bcull, fcull
(*) fcull
If the amount of available space or the number of available files in the If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started. cache falls below either of these limits, then culling is started.
(*) bstop bstop, fstop
(*) fstop
If the amount of available space or the number of available files in the If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above disk space or files is permitted until culling has raised things above
these limits again. these limits again.
These must be configured thusly: These must be configured thusly::
0 <= bstop < bcull < brun < 100 0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100 0 <= fstop < fcull < frun < 100
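The ordering constraint and the three limit pairs can be illustrated with a short sketch. The names `Limits` and `update_state` are hypothetical and exist only for this example; the behaviour when a value sits exactly on a limit is also an assumption, since the text only says "rises above" and "falls below".

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """One family of limits in percent: block (b*) or file (f*)."""
    stop: int = 1   # documented default for bstop/fstop
    cull: int = 5   # documented default for bcull/fcull
    run: int = 7    # documented default for brun/frun

    def is_valid(self) -> bool:
        # Mirrors the constraint 0 <= stop < cull < run < 100.
        return 0 <= self.stop < self.cull < self.run < 100


def update_state(culling: bool, free_blocks: float, free_files: float,
                 blocks: Limits, files: Limits):
    """Return (culling, allocation_allowed) for the given free percentages."""
    if free_blocks < blocks.cull or free_files < files.cull:
        culling = True    # either resource fell below its cull limit
    elif free_blocks > blocks.run and free_files > files.run:
        culling = False   # both resources rose above their run limits
    # Assumption: a value exactly equal to a stop limit still allows allocation.
    allowed = free_blocks >= blocks.stop and free_files >= files.stop
    return culling, allowed
```

Between the cull and run limits the previous culling state is kept, which gives the hysteresis the three-level scheme implies.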
@ -226,16 +207,14 @@ started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them. their atimes have changed or if the kernel module says it is still using them.
=============== Cache Structure
CACHE STRUCTURE
=============== ===============
The CacheFiles module will create two directories in the directory it was The CacheFiles module will create two directories in the directory it was
given: given:
(*) cache/ * cache/
* graveyard/
(*) graveyard/
The active cache objects all reside in the first directory. The CacheFiles The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink kernel module moves any retired or culled objects that it can't simply unlink
@ -261,10 +240,10 @@ If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child this directory, if possible, will be placed the representations of the child
objects: objects::
INDEX INDEX INDEX DATA FILES /INDEX /INDEX /INDEX /DATA FILES
========= ========== ================================= ================ /=========/==========/=================================/================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
@ -275,7 +254,7 @@ If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have inside the last directory. The names of the intermediate directories will have
'+' prepended: '+' prepended::
J1223/@23/+xy...z/+kl...m/Epqr J1223/@23/+xy...z/+kl...m/Epqr
@ -288,11 +267,13 @@ To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of "base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding: object filenames indicate the encoding:
=============== =============== ===============
OBJECT TYPE PRINTABLE ENCODED OBJECT TYPE PRINTABLE ENCODED
=============== =============== =============== =============== =============== ===============
Index "I..." "J..." Index "I..." "J..."
Data "D..." "E..." Data "D..." "E..."
Special "S..." "T..." Special "S..." "T..."
=============== =============== ===============
Intermediate directories are always "@" or "+" as appropriate. Intermediate directories are always "@" or "+" as appropriate.
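The prefix table above can be turned into a lookup on the first character of each path component. This is a sketch only: `classify_component` is a hypothetical helper, not part of CacheFiles, and it inspects nothing beyond the leading character.

```python
# First character of a name -> (object type, encoding), per the table above.
OBJECT_PREFIXES = {
    "I": ("index", "printable"), "J": ("index", "encoded"),
    "D": ("data", "printable"), "E": ("data", "encoded"),
    "S": ("special", "printable"), "T": ("special", "encoded"),
}


def classify_component(name: str):
    """Classify one path component of the cache tree (hypothetical helper)."""
    if not name:
        raise ValueError("empty component")
    first = name[0]
    if first == "@":
        return ("intermediate", "hash")          # hash of child object keys
    if first == "+":
        return ("intermediate", "key fragment")  # piece of an over-long key
    if first in OBJECT_PREFIXES:
        return OBJECT_PREFIXES[first]
    # CacheFiles erases files it does not recognise; here we only report them.
    return ("unrecognised", None)
```

Applied to the components of `J1223/@23/+xy...z/+kl...m/Epqr`, this gives an encoded index, a hash directory, two key-fragment directories, and an encoded data object.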
@ -307,8 +288,7 @@ Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file). any file of an incorrect type (such as a FIFO file or a device file).
========================== Security Model and SELinux
SECURITY MODEL AND SELINUX
========================== ==========================
CacheFiles is implemented to deal properly with the LSM security features of CacheFiles is implemented to deal properly with the LSM security features of
@ -331,26 +311,26 @@ When the CacheFiles module is asked to bind to its cache, it:
(1) Finds the security label attached to the root cache directory and uses (1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default, that as the security label with which it will create files. By default,
this is: this is::
cachefiles_var_t cachefiles_var_t
(2) Finds the security label of the process which issued the bind request (2) Finds the security label of the process which issued the bind request
(presumed to be the cachefilesd daemon), which by default will be: (presumed to be the cachefilesd daemon), which by default will be::
cachefilesd_t cachefilesd_t
and asks LSM to supply a security ID as which it should act given the and asks LSM to supply a security ID as which it should act given the
daemon's label. By default, this will be: daemon's label. By default, this will be::
cachefiles_kernel_t cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID SELinux transitions the daemon's security ID to the module's security ID
based on a rule of this form in the policy. based on a rule of this form in the policy::
type_transition <daemon's-ID> kernel_t : process <module's-ID>; type_transition <daemon's-ID> kernel_t : process <module's-ID>;
For instance: For instance::
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
@ -370,7 +350,7 @@ There are policy source files available in:
http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
and later versions. In that tarball, see the files: and later versions. In that tarball, see the files::
cachefilesd.te cachefilesd.te
cachefilesd.fc cachefilesd.fc
@ -379,7 +359,7 @@ and later versions. In that tarball, see the files:
They are built and installed directly by the RPM. They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own If a non-RPM based system is being used, then copy the above files to their own
directory and run: directory and run::
make -f /usr/share/selinux/devel/Makefile make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp semodule -i cachefilesd.pp
@ -394,7 +374,7 @@ an auxiliary policy must be installed to label the alternate location of the
cache. cache.
For instructions on how to add an auxiliary policy to enable the cache to be For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see: located elsewhere when SELinux is in enforcing mode, please see::
/usr/share/doc/cachefilesd-*/move-cache.txt /usr/share/doc/cachefilesd-*/move-cache.txt
@ -402,8 +382,7 @@ When the cachefilesd rpm is installed; alternatively, the document can be found
in the sources. in the sources.
================== A Note on Security
A NOTE ON SECURITY
================== ==================
CacheFiles makes use of the split security in the task_struct. It allocates CacheFiles makes use of the split security in the task_struct. It allocates
@ -445,17 +424,18 @@ for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label. files and directories with another security label.
======================= Statistical Information
STATISTICAL INFORMATION
======================= =======================
If FS-Cache is compiled with the following option enabled: If FS-Cache is compiled with the following option enabled::
CONFIG_CACHEFILES_HISTOGRAM=y CONFIG_CACHEFILES_HISTOGRAM=y
then it will gather certain statistics and display them through a proc file. then it will gather certain statistics and display them through a proc file.
(*) /proc/fs/cachefiles/histogram /proc/fs/cachefiles/histogram
::
cat /proc/fs/cachefiles/histogram cat /proc/fs/cachefiles/histogram
JIFS SECS LOOKUPS MKDIRS CREATES JIFS SECS LOOKUPS MKDIRS CREATES
@ -465,36 +445,39 @@ then it will gather certain statistics and display them through a proc file.
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
columns are as follows: columns are as follows:
======= =======================================================
COLUMN TIME MEASUREMENT COLUMN TIME MEASUREMENT
======= ======================================================= ======= =======================================================
LOOKUPS Length of time to perform a lookup on the backing fs LOOKUPS Length of time to perform a lookup on the backing fs
MKDIRS Length of time to perform a mkdir on the backing fs MKDIRS Length of time to perform a mkdir on the backing fs
CREATES Length of time to perform a create on the backing fs CREATES Length of time to perform a create on the backing fs
======= =======================================================
Each row shows the number of events that took a particular range of times. Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds. jiffy range covered, and the SECS field the equivalent number of seconds.
========= Debugging
DEBUGGING
========= =========
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
debugging enabled by adjusting the value in: debugging enabled by adjusting the value in::
/sys/module/cachefiles/parameters/debug /sys/module/cachefiles/parameters/debug
This is a bitmask of debugging streams to enable: This is a bitmask of debugging streams to enable:
======= ======= =============================== =======================
BIT VALUE STREAM POINT BIT VALUE STREAM POINT
======= ======= =============================== ======================= ======= ======= =============================== =======================
0 1 General Function entry trace 0 1 General Function entry trace
1 2 Function exit trace 1 2 Function exit trace
2 4 General 2 4 General
======= ======= =============================== =======================
The appropriate set of values should be OR'd together and the result written to The appropriate set of values should be OR'd together and the result written to
the control file. For example: the control file. For example::
echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
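The same mask can be computed programmatically. The sketch below uses only the three values listed in the table; the constant names and the `debug_mask` helper are hypothetical, and actually writing the result to the control file requires suitable privileges, so that step is left as a comment.

```python
# Values from the table above (bit 0, bit 1, bit 2).
TRACE_ENTER = 1   # function entry trace
TRACE_LEAVE = 2   # function exit trace
TRACE_DEBUG = 4   # general debug points


def debug_mask(*streams: int) -> int:
    """OR together the selected stream values."""
    mask = 0
    for value in streams:
        mask |= value
    return mask


# Example: entry trace plus general debug points.
mask = debug_mask(TRACE_ENTER, TRACE_DEBUG)
# To apply it (as root), write str(mask) to
# /sys/module/cachefiles/parameters/debug
```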


@ -0,0 +1,565 @@
.. SPDX-License-Identifier: GPL-2.0
==========================
General Filesystem Caching
==========================
Overview
========
This facility is a general purpose cache for network filesystems, though it
could be used for caching other things such as ISO9660 filesystems too.
FS-Cache mediates between cache backends (such as CacheFS) and network
filesystems::
+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+
Or to look at it another way, FS-Cache is a module that provides a caching
facility to a network filesystem such that the cache is transparent to the
user::
+---------+
| |
| Server |
| |
+---------+
| NETWORK
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| +----------+
V | |
+---------+ | |
| | | |
| NFS |----->| FS-Cache |
| | | |--+
+---------+ | | | +--------------+ +--------------+
| | | | | | | |
V +----------+ +-->| CacheFiles |-->| Ext3 |
+---------+ | /var/cache | | /dev/sda6 |
| | +--------------+ +--------------+
| VFS | ^ ^
| | | |
+---------+ +--------------+ |
| KERNEL SPACE | |
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
| USER SPACE | |
V | |
+---------+ +--------------+
| | | |
| Process | | cachefilesd |
| | | |
+---------+ +--------------+
FS-Cache does not follow the idea of completely loading every netfs file
opened in its entirety into a cache before permitting it to be accessed and
then serving the pages out of that cache rather than the netfs inode because:
(1) It must be practical to operate without a cache.
(2) The size of any accessible file must not be limited to the size of the
cache.
(3) The combined size of all opened files (this includes mapped libraries)
must not be limited to the size of the cache.
(4) The user should not be forced to download an entire file just to do a
one-off access of a small portion of it (such as might be done with the
"file" program).
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
the netfs('s) using it.
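To make the PAGE_SIZE granularity concrete, the sketch below maps a byte range onto the page indices it touches. It is an illustration of the arithmetic only, not FS-Cache code: `pages_for_range` is a hypothetical helper, and a PAGE_SIZE of 4096 is an assumption (the real value depends on the architecture).

```python
PAGE_SIZE = 4096  # assumption; the actual value is architecture-dependent


def pages_for_range(offset: int, length: int) -> range:
    """Return the page indices covering bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return range(first, last + 1)
```

A one-off read of a few bytes therefore involves a single page, regardless of the size of the file, which is the situation reason (4) above describes.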
FS-Cache provides the following facilities:
(1) More than one cache can be used at once. Caches can be selected
explicitly by use of tags.
(2) Caches can be added / removed at any time.
(3) The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (2)).
(4) The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.
(5) Cookies are used to represent indices, files and other objects to the
netfs. The simplest cookie is just a NULL pointer - indicating nothing
cached there.
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
desires, though it must be aware that the index search function is
recursive, stack space is limited, and indices can only be children of
indices.
(7) Data I/O is done direct to and from the netfs's pages. The netfs
indicates that page A is at index B of the data-file represented by cookie
C, and that it should be read or written. The cache backend may or may
not start I/O on that page, but if it does, a netfs callback will be
invoked to indicate completion. The I/O may be either synchronous or
asynchronous.
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
them as obsolete and the index hierarchy rooted at that point will get
recycled.
(9) The netfs provides a "match" function for index searches. In addition to
saying whether a match was made or not, this can also specify that an
entry should be updated or deleted.
(10) As much as possible is done asynchronously.
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches::
FSDEF
|
+------------------------------------+
| |
NFS AFS
| |
+--------------------------+ +-----------+
| | | |
homedir mirror afs.org redhat.com
| | |
+------------+ +---------------+ +----------+
| | | | | |
00001 00002 00007 00125 vol00001 vol00002
| | | | |
+---+---+ +-----+ +---+ +------+------+ +-----+----+
| | | | | | | | | | | | |
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
| |
PG0 +-------+
| |
00001 00003
|
+---+---+
| | |
PG0 PG1 PG2
In the example above, you can see two netfs's being backed: NFS and AFS. These
have different index hierarchies:
* The NFS primary index contains per-server indices. Each server index is
indexed by NFS file handles to get data file objects. Each data file
object can have an array of pages, but may also have further child
objects, such as extended attributes and directory entries. Extended
attribute objects themselves have page-array contents.
* The AFS primary index contains per-cell indices. Each cell index contains
per-logical-volume indices. Each volume index contains up to three
indices for the read-write, read-only and backup mirrors of those volumes.
Each of these contains vnode data file objects, each of which contains an
array of pages.
The very top index is the FS-Cache master index in which individual netfs's
have entries.
Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.
The netfs API to FS-Cache can be found in:
Documentation/filesystems/caching/netfs-api.rst
The cache backend API to FS-Cache can be found in:
Documentation/filesystems/caching/backend-api.rst
A description of the internal representations and object state machine can be
found in:
Documentation/filesystems/caching/object.rst
Statistical Information
=======================
If FS-Cache is compiled with the following options enabled::
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y
then it will gather certain statistics and display them through a number of
proc files.
/proc/fs/fscache/stats
----------------------
This shows counts of a number of events that can happen in FS-Cache:
+--------------+-------+-------------------------------------------------------+
|CLASS |EVENT |MEANING |
+==============+=======+=======================================================+
|Cookies |idx=N |Number of index cookies allocated |
+ +-------+-------------------------------------------------------+
| |dat=N |Number of data storage cookies allocated |
+ +-------+-------------------------------------------------------+
| |spc=N |Number of special cookies allocated |
+--------------+-------+-------------------------------------------------------+
|Objects |alc=N |Number of objects allocated |
+ +-------+-------------------------------------------------------+
| |nal=N |Number of object allocation failures |
+ +-------+-------------------------------------------------------+
| |avl=N |Number of objects that reached the available state |
+ +-------+-------------------------------------------------------+
| |ded=N |Number of objects that reached the dead state |
+--------------+-------+-------------------------------------------------------+
|ChkAux |non=N |Number of objects that didn't have a coherency check |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of objects that passed a coherency check |
+ +-------+-------------------------------------------------------+
| |upd=N |Number of objects that needed a coherency data update |
+ +-------+-------------------------------------------------------+
| |obs=N |Number of objects that were declared obsolete |
+--------------+-------+-------------------------------------------------------+
|Pages |mrk=N |Number of pages marked as being cached |
| |unc=N |Number of uncache page requests seen |
+--------------+-------+-------------------------------------------------------+
|Acquire |n=N |Number of acquire cookie requests seen |
+ +-------+-------------------------------------------------------+
| |nul=N |Number of acq reqs given a NULL parent |
+ +-------+-------------------------------------------------------+
| |noc=N |Number of acq reqs rejected due to no cache available |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of acq reqs succeeded |
+ +-------+-------------------------------------------------------+
| |nbf=N |Number of acq reqs rejected due to error |
+ +-------+-------------------------------------------------------+
| |oom=N |Number of acq reqs failed on ENOMEM |
+--------------+-------+-------------------------------------------------------+
|Lookups |n=N |Number of lookup calls made on cache backends |
+ +-------+-------------------------------------------------------+
| |neg=N |Number of negative lookups made |
+ +-------+-------------------------------------------------------+
| |pos=N |Number of positive lookups made |
+ +-------+-------------------------------------------------------+
| |crt=N |Number of objects created by lookup |
+ +-------+-------------------------------------------------------+
| |tmo=N |Number of lookups timed out and requeued |
+--------------+-------+-------------------------------------------------------+
|Updates |n=N |Number of update cookie requests seen |
+ +-------+-------------------------------------------------------+
| |nul=N |Number of upd reqs given a NULL parent |
+ +-------+-------------------------------------------------------+
| |run=N |Number of upd reqs granted CPU time |
+--------------+-------+-------------------------------------------------------+
|Relinqs |n=N |Number of relinquish cookie requests seen |
+ +-------+-------------------------------------------------------+
| |nul=N |Number of rlq reqs given a NULL parent |
+ +-------+-------------------------------------------------------+
| |wcr=N |Number of rlq reqs waited on completion of creation |
+--------------+-------+-------------------------------------------------------+
|AttrChg |n=N |Number of attribute changed requests seen |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of attr changed requests queued |
+ +-------+-------------------------------------------------------+
| |nbf=N |Number of attr changed rejected -ENOBUFS |
+ +-------+-------------------------------------------------------+
| |oom=N |Number of attr changed failed -ENOMEM |
+ +-------+-------------------------------------------------------+
| |run=N |Number of attr changed ops given CPU time |
+--------------+-------+-------------------------------------------------------+
|Allocs |n=N |Number of allocation requests seen |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of successful alloc reqs |
+ +-------+-------------------------------------------------------+
| |wt=N |Number of alloc reqs that waited on lookup completion |
+ +-------+-------------------------------------------------------+
| |nbf=N |Number of alloc reqs rejected -ENOBUFS |
+ +-------+-------------------------------------------------------+
| |int=N |Number of alloc reqs aborted -ERESTARTSYS |
+ +-------+-------------------------------------------------------+
| |ops=N |Number of alloc reqs submitted |
+ +-------+-------------------------------------------------------+
| |owt=N |Number of alloc reqs waited for CPU time |
+ +-------+-------------------------------------------------------+
| |abt=N |Number of alloc reqs aborted due to object death |
+--------------+-------+-------------------------------------------------------+
|Retrvls |n=N |Number of retrieval (read) requests seen |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of successful retr reqs |
+ +-------+-------------------------------------------------------+
| |wt=N |Number of retr reqs that waited on lookup completion |
+ +-------+-------------------------------------------------------+
| |nod=N |Number of retr reqs returned -ENODATA |
+ +-------+-------------------------------------------------------+
| |nbf=N |Number of retr reqs rejected -ENOBUFS |
+ +-------+-------------------------------------------------------+
| |int=N |Number of retr reqs aborted -ERESTARTSYS |
+ +-------+-------------------------------------------------------+
| |oom=N |Number of retr reqs failed -ENOMEM |
+ +-------+-------------------------------------------------------+
| |ops=N |Number of retr reqs submitted |
+ +-------+-------------------------------------------------------+
| |owt=N |Number of retr reqs waited for CPU time |
+ +-------+-------------------------------------------------------+
| |abt=N |Number of retr reqs aborted due to object death |
+--------------+-------+-------------------------------------------------------+
|Stores |n=N |Number of storage (write) requests seen |
+ +-------+-------------------------------------------------------+
| |ok=N |Number of successful store reqs |
+ +-------+-------------------------------------------------------+
| |agn=N |Number of store reqs on a page already pending storage |
+ +-------+-------------------------------------------------------+
| |nbf=N |Number of store reqs rejected -ENOBUFS |
+ +-------+-------------------------------------------------------+
| |oom=N |Number of store reqs failed -ENOMEM |
+ +-------+-------------------------------------------------------+
| |ops=N |Number of store reqs submitted |
+ +-------+-------------------------------------------------------+
| |run=N |Number of store reqs granted CPU time |
+ +-------+-------------------------------------------------------+
| |pgs=N |Number of pages given store req processing time |
+ +-------+-------------------------------------------------------+
| |rxd=N |Number of store reqs deleted from tracking tree |
+ +-------+-------------------------------------------------------+
| |olm=N |Number of store reqs over store limit |
+--------------+-------+-------------------------------------------------------+
|VmScan |nos=N |Number of release reqs against pages with no |
| | |pending store |
+ +-------+-------------------------------------------------------+
| |gon=N |Number of release reqs against pages stored by |
| | |time lock granted |
+ +-------+-------------------------------------------------------+
| |bsy=N |Number of release reqs ignored due to in-progress store|
+ +-------+-------------------------------------------------------+
| |can=N |Number of page stores cancelled due to release req |
+--------------+-------+-------------------------------------------------------+
|Ops |pend=N |Number of times async ops added to pending queues |
+ +-------+-------------------------------------------------------+
| |run=N |Number of times async ops given CPU time |
+ +-------+-------------------------------------------------------+
| |enq=N |Number of times async ops queued for processing |
+ +-------+-------------------------------------------------------+
| |can=N |Number of async ops cancelled |
+ +-------+-------------------------------------------------------+
| |rej=N |Number of async ops rejected due to object |
| | |lookup/create failure |
+ +-------+-------------------------------------------------------+
| |ini=N |Number of async ops initialised |
+ +-------+-------------------------------------------------------+
| |dfr=N |Number of async ops queued for deferred release |
+ +-------+-------------------------------------------------------+
| |rel=N |Number of async ops released |
| | |(should equal ini=N when idle) |
+ +-------+-------------------------------------------------------+
| |gc=N |Number of deferred-release async ops garbage collected |
+--------------+-------+-------------------------------------------------------+
|CacheOp |alo=N |Number of in-progress alloc_object() cache ops |
+ +-------+-------------------------------------------------------+
| |luo=N |Number of in-progress lookup_object() cache ops |
+ +-------+-------------------------------------------------------+
| |luc=N |Number of in-progress lookup_complete() cache ops |
+ +-------+-------------------------------------------------------+
| |gro=N |Number of in-progress grab_object() cache ops |
+ +-------+-------------------------------------------------------+
| |upo=N |Number of in-progress update_object() cache ops |
+ +-------+-------------------------------------------------------+
| |dro=N |Number of in-progress drop_object() cache ops |
+ +-------+-------------------------------------------------------+
| |pto=N |Number of in-progress put_object() cache ops |
+ +-------+-------------------------------------------------------+
| |syn=N |Number of in-progress sync_cache() cache ops |
+ +-------+-------------------------------------------------------+
| |atc=N |Number of in-progress attr_changed() cache ops |
+ +-------+-------------------------------------------------------+
| |rap=N |Number of in-progress read_or_alloc_page() cache ops |
+ +-------+-------------------------------------------------------+
| |ras=N |Number of in-progress read_or_alloc_pages() cache ops |
+ +-------+-------------------------------------------------------+
| |alp=N |Number of in-progress allocate_page() cache ops |
+ +-------+-------------------------------------------------------+
| |als=N |Number of in-progress allocate_pages() cache ops |
+ +-------+-------------------------------------------------------+
| |wrp=N |Number of in-progress write_page() cache ops |
+ +-------+-------------------------------------------------------+
| |ucp=N |Number of in-progress uncache_page() cache ops |
+ +-------+-------------------------------------------------------+
| |dsp=N |Number of in-progress dissociate_pages() cache ops |
+--------------+-------+-------------------------------------------------------+
|CacheEv |nsp=N |Number of object lookups/creations rejected due to |
| | |lack of space |
+ +-------+-------------------------------------------------------+
| |stl=N |Number of stale objects deleted |
+ +-------+-------------------------------------------------------+
| |rtr=N |Number of objects retired when relinquished |
+ +-------+-------------------------------------------------------+
| |cul=N |Number of objects culled |
+--------------+-------+-------------------------------------------------------+
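The counters above lend themselves to scripted checks. The following is a minimal sketch, assuming each line of the stats file has the shape ``Class : key=N key=N ...`` (the table lists the classes and events, but not the exact line layout, so that shape is an assumption). The sample text and its numbers are invented, and ``parse_stats()`` and ``ops_balanced()`` are hypothetical helper names:

```python
import re
from collections import defaultdict

# Invented sample; the line layout "Class : key=N ..." is an assumption.
SAMPLE = """\
FS-Cache statistics
Cookies: idx=2 dat=10 spc=0
Ops    : pend=3 run=40 enq=120 can=0 rej=1
Ops    : ini=45 dfr=0 rel=45 gc=0
"""

PAIR = re.compile(r"(\w+)=(\d+)")


def parse_stats(text):
    """Return {class: {event: count}} for lines shaped 'Class : k=N k=N'."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # heading or blank line, no class prefix
        for key, value in PAIR.findall(rest):
            stats[name.strip()][key] = int(value)
    return dict(stats)


def ops_balanced(stats):
    """True when rel=N equals ini=N, as the table says it should when idle."""
    ops = stats.get("Ops", {})
    return "ini" in ops and ops["ini"] == ops.get("rel")
```

On a live system the text would presumably come from reading /proc/fs/fscache/stats rather than from ``SAMPLE``.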
/proc/fs/fscache/histogram
--------------------------
::
cat /proc/fs/fscache/histogram
JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
===== ===== ========= ========= ========= ========= =========
This shows how many times a variety of tasks took each amount of time to run,
in the range from 0 jiffies to HZ-1 jiffies. The columns are as follows:
========= =======================================================
COLUMN TIME MEASUREMENT
========= =======================================================
OBJ INST Length of time to instantiate an object
OP RUNS Length of time a call to process an operation took
OBJ RUNS Length of time a call to process an object event took
RETRV DLY Time between requesting a read and lookup completing
RETRIEVLS Time between beginning and end of a retrieval
========= =======================================================
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
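The relation between the JIFS and SECS columns is a division by HZ. A minimal sketch, assuming the caller knows the HZ value of the running kernel (it depends on the kernel configuration, and the values used below are illustrative only); both function names are hypothetical:

```python
def jiffies_to_seconds(jiffies: int, hz: int) -> float:
    """Convert a jiffy count to seconds for a kernel ticking at `hz`."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies / hz


def histogram_buckets(hz: int):
    """Yield (jiffies, seconds) for each one-jiffy step from 0 to hz-1."""
    for jif in range(hz):
        yield jif, jiffies_to_seconds(jif, hz)
```

With an assumed HZ of 250, for instance, the row for 125 jiffies corresponds to half a second.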
Object List
===========
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
list of all the objects currently allocated and allow them to be viewed
through::
/proc/fs/fscache/objects
This will look something like::
[root@andromeda ~]# head /proc/fs/fscache/objects
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
where the first set of columns before the '|' describes the object:
======= ===============================================================
COLUMN DESCRIPTION
======= ===============================================================
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
PARENT Debugging ID of parent object
STAT Object state
CHLDN Number of child objects of this object
OPS Number of outstanding operations on this object
OOP Number of outstanding child object management operations
IPR
EX Number of outstanding exclusive operations
READS Number of outstanding read operations
EM Object's event mask
EV Events raised on this object
F Object flags
S Object work item busy state mask (1:pending 2:running)
======= ===============================================================
and the second set of columns describes the object's cookie, if present:
================ ======================================================
COLUMN DESCRIPTION
================ ======================================================
NETFS_COOKIE_DEF Name of netfs cookie definition
TY Cookie type (IX - index, DT - data, hex - special)
FL Cookie flags
NETFS_DATA Netfs private data stored in the cookie
OBJECT_KEY Object key } 1 column, with separating comma
AUX_DATA Object aux data } presence may be configured
================ ======================================================
The data shown may be filtered by attaching a key to an appropriate keyring
before viewing the file. Something like::
keyctl add user fscache:objlist <restrictions> @s
where <restrictions> are a selection of the following letters:
== =========================================================
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
== =========================================================
and the following paired letters:
== =========================================================
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have work queued
s Show objects that don't have work queued
== =========================================================
If neither side of a letter pair is given, then both are implied. For example::
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
By default all objects and all fields will be shown.
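The pairing rule can be spelled out in a few lines. This is only a sketch of the rule as described above, not the kernel's parser; ``expand_restrictions()`` is a hypothetical name, and it models the case where a key is present (K and A are then shown only if given):

```python
PAIRS = ("Cc", "Bb", "Ww", "Rr", "Ss")  # paired letters from the table
FIELDS = "KA"                           # unpaired hexdump letters


def expand_restrictions(given: str) -> str:
    """Return the effective letters for a restriction string.

    A pair with neither letter given implies both letters; a pair with
    one letter given keeps only that letter.
    """
    unknown = set(given) - set(FIELDS) - set("".join(PAIRS))
    if unknown:
        raise ValueError(f"unknown restriction letters: {sorted(unknown)}")
    out = [f for f in FIELDS if f in given]
    for pair in PAIRS:
        chosen = [c for c in pair if c in given]
        out.extend(chosen or pair)
    return "".join(out)
```

For the "KB" example this yields K, B and the implied "CcWwRrSs", with 'b' left out.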
Debugging
=========
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
debugging enabled by adjusting the value in::
/sys/module/fscache/parameters/debug
This is a bitmask of debugging streams to enable:
======= ======= =============================== =======================
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 Cache management Function entry trace
1 2 Function exit trace
2 4 General
3 8 Cookie management Function entry trace
4 16 Function exit trace
5 32 General
6 64 Page handling Function entry trace
7 128 Function exit trace
8 256 General
9 512 Operation management Function entry trace
10 1024 Function exit trace
11 2048 General
======= ======= =============================== =======================
The appropriate set of values should be OR'd together and the result written to
the control file. For example::
echo $((1|8|64|512)) >/sys/module/fscache/parameters/debug
will turn on all function entry debugging.
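The table follows a regular pattern: each stream owns three consecutive bits, in the order entry, exit, general. A minimal sketch that computes a mask from that pattern; the short stream and point names are labels invented here for illustration, not identifiers from the kernel:

```python
# Order follows the table: each stream owns three consecutive bits.
STREAMS = ("cache", "cookie", "page", "operation")
POINTS = ("enter", "exit", "general")


def debug_bit(stream: str, point: str) -> int:
    """Bit number for one (stream, point) cell of the table."""
    return 3 * STREAMS.index(stream) + POINTS.index(point)


def debug_mask(selections) -> int:
    """OR together the values for an iterable of (stream, point) pairs."""
    mask = 0
    for stream, point in selections:
        mask |= 1 << debug_bit(stream, point)
    return mask


# Per the table, the four function entry values are 1, 8, 64 and 512.
ALL_ENTRY = debug_mask((s, "enter") for s in STREAMS)
```

The resulting integer is what would be written to the control file.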


@ -1,448 +0,0 @@
==========================
General Filesystem Caching
==========================
========
OVERVIEW
========
This facility is a general purpose cache for network filesystems, though it
could be used for caching other things such as ISO9660 filesystems too.
FS-Cache mediates between cache backends (such as CacheFS) and network
filesystems:
+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+
Or to look at it another way, FS-Cache is a module that provides a caching
facility to a network filesystem such that the cache is transparent to the
user:
+---------+
| |
| Server |
| |
+---------+
| NETWORK
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| +----------+
V | |
+---------+ | |
| | | |
| NFS |----->| FS-Cache |
| | | |--+
+---------+ | | | +--------------+ +--------------+
| | | | | | | |
V +----------+ +-->| CacheFiles |-->| Ext3 |
+---------+ | /var/cache | | /dev/sda6 |
| | +--------------+ +--------------+
| VFS | ^ ^
| | | |
+---------+ +--------------+ |
| KERNEL SPACE | |
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
| USER SPACE | |
V | |
+---------+ +--------------+
| | | |
| Process | | cachefilesd |
| | | |
+---------+ +--------------+
FS-Cache does not follow the idea of completely loading every netfs file
opened in its entirety into a cache before permitting it to be accessed and
then serving the pages out of that cache rather than the netfs inode because:
(1) It must be practical to operate without a cache.
(2) The size of any accessible file must not be limited to the size of the
cache.
(3) The combined size of all opened files (this includes mapped libraries)
must not be limited to the size of the cache.
(4) The user should not be forced to download an entire file just to do a
one-off access of a small portion of it (such as might be done with the
"file" program).
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
the netfs('s) using it.
FS-Cache provides the following facilities:
(1) More than one cache can be used at once. Caches can be selected
explicitly by use of tags.
(2) Caches can be added / removed at any time.
(3) The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (2)).
(4) The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.
(5) Cookies are used to represent indices, files and other objects to the
netfs. The simplest cookie is just a NULL pointer - indicating nothing
cached there.
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
desires, though it must be aware that the index search function is
recursive, stack space is limited, and indices can only be children of
indices.
(7) Data I/O is done direct to and from the netfs's pages. The netfs
indicates that page A is at index B of the data-file represented by cookie
C, and that it should be read or written. The cache backend may or may
not start I/O on that page, but if it does, a netfs callback will be
invoked to indicate completion. The I/O may be either synchronous or
asynchronous.
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
them as obsolete and the index hierarchy rooted at that point will get
recycled.
(9) The netfs provides a "match" function for index searches. In addition to
saying whether a match was made or not, this can also specify that an
entry should be updated or deleted.
(10) As much as possible is done asynchronously.
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches.
FSDEF
|
+------------------------------------+
| |
NFS AFS
| |
+--------------------------+ +-----------+
| | | |
homedir mirror afs.org redhat.com
| | |
+------------+ +---------------+ +----------+
| | | | | |
00001 00002 00007 00125 vol00001 vol00002
| | | | |
+---+---+ +-----+ +---+ +------+------+ +-----+----+
| | | | | | | | | | | | |
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
| |
PG0 +-------+
| |
00001 00003
|
+---+---+
| | |
PG0 PG1 PG2
In the example above, you can see two netfs's being backed: NFS and AFS. These
have different index hierarchies:
(*) The NFS primary index contains per-server indices. Each server index is
indexed by NFS file handles to get data file objects. Each data file
object can have an array of pages, but may also have further child
objects, such as extended attributes and directory entries. Extended
attribute objects themselves have page-array contents.
(*) The AFS primary index contains per-cell indices. Each cell index contains
per-logical-volume indices. Each volume index contains up to three
indices for the read-write, read-only and backup mirrors of those volumes.
Each of these contains vnode data file objects, each of which contains an
array of pages.
The very top index is the FS-Cache master index in which individual netfs's
have entries.
Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.
The netfs API to FS-Cache can be found in:
Documentation/filesystems/caching/netfs-api.txt
The cache backend API to FS-Cache can be found in:
Documentation/filesystems/caching/backend-api.txt
A description of the internal representations and object state machine can be
found in:
Documentation/filesystems/caching/object.txt
=======================
STATISTICAL INFORMATION
=======================
If FS-Cache is compiled with the following options enabled:
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y
then it will gather certain statistics and display them through a number of
proc files.
(*) /proc/fs/fscache/stats
This shows counts of a number of events that can happen in FS-Cache:
CLASS EVENT MEANING
======= ======= =======================================================
Cookies idx=N Number of index cookies allocated
dat=N Number of data storage cookies allocated
spc=N Number of special cookies allocated
Objects alc=N Number of objects allocated
nal=N Number of object allocation failures
avl=N Number of objects that reached the available state
ded=N Number of objects that reached the dead state
ChkAux non=N Number of objects that didn't have a coherency check
ok=N Number of objects that passed a coherency check
upd=N Number of objects that needed a coherency data update
obs=N Number of objects that were declared obsolete
Pages mrk=N Number of pages marked as being cached
unc=N Number of uncache page requests seen
Acquire n=N Number of acquire cookie requests seen
nul=N Number of acq reqs given a NULL parent
noc=N Number of acq reqs rejected due to no cache available
ok=N Number of acq reqs succeeded
nbf=N Number of acq reqs rejected due to error
oom=N Number of acq reqs failed on ENOMEM
Lookups n=N Number of lookup calls made on cache backends
neg=N Number of negative lookups made
pos=N Number of positive lookups made
crt=N Number of objects created by lookup
tmo=N Number of lookups timed out and requeued
Updates n=N Number of update cookie requests seen
nul=N Number of upd reqs given a NULL parent
run=N Number of upd reqs granted CPU time
Relinqs n=N Number of relinquish cookie requests seen
nul=N Number of rlq reqs given a NULL parent
wcr=N Number of rlq reqs waited on completion of creation
AttrChg n=N Number of attribute changed requests seen
ok=N Number of attr changed requests queued
nbf=N Number of attr changed rejected -ENOBUFS
oom=N Number of attr changed failed -ENOMEM
run=N Number of attr changed ops given CPU time
Allocs n=N Number of allocation requests seen
ok=N Number of successful alloc reqs
wt=N Number of alloc reqs that waited on lookup completion
nbf=N Number of alloc reqs rejected -ENOBUFS
int=N Number of alloc reqs aborted -ERESTARTSYS
ops=N Number of alloc reqs submitted
owt=N Number of alloc reqs waited for CPU time
abt=N Number of alloc reqs aborted due to object death
Retrvls n=N Number of retrieval (read) requests seen
ok=N Number of successful retr reqs
wt=N Number of retr reqs that waited on lookup completion
nod=N Number of retr reqs returned -ENODATA
nbf=N Number of retr reqs rejected -ENOBUFS
int=N Number of retr reqs aborted -ERESTARTSYS
oom=N Number of retr reqs failed -ENOMEM
ops=N Number of retr reqs submitted
owt=N Number of retr reqs waited for CPU time
abt=N Number of retr reqs aborted due to object death
Stores n=N Number of storage (write) requests seen
ok=N Number of successful store reqs
agn=N Number of store reqs on a page already pending storage
nbf=N Number of store reqs rejected -ENOBUFS
oom=N Number of store reqs failed -ENOMEM
ops=N Number of store reqs submitted
run=N Number of store reqs granted CPU time
pgs=N Number of pages given store req processing time
rxd=N Number of store reqs deleted from tracking tree
olm=N Number of store reqs over store limit
VmScan nos=N Number of release reqs against pages with no pending store
gon=N Number of release reqs against pages stored by time lock granted
bsy=N Number of release reqs ignored due to in-progress store
can=N Number of page stores cancelled due to release req
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
can=N Number of async ops cancelled
rej=N Number of async ops rejected due to object lookup/create failure
ini=N Number of async ops initialised
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released (should equal ini=N when idle)
gc=N Number of deferred-release async ops garbage collected
CacheOp alo=N Number of in-progress alloc_object() cache ops
luo=N Number of in-progress lookup_object() cache ops
luc=N Number of in-progress lookup_complete() cache ops
gro=N Number of in-progress grab_object() cache ops
upo=N Number of in-progress update_object() cache ops
dro=N Number of in-progress drop_object() cache ops
pto=N Number of in-progress put_object() cache ops
syn=N Number of in-progress sync_cache() cache ops
atc=N Number of in-progress attr_changed() cache ops
rap=N Number of in-progress read_or_alloc_page() cache ops
ras=N Number of in-progress read_or_alloc_pages() cache ops
alp=N Number of in-progress allocate_page() cache ops
als=N Number of in-progress allocate_pages() cache ops
wrp=N Number of in-progress write_page() cache ops
ucp=N Number of in-progress uncache_page() cache ops
dsp=N Number of in-progress dissociate_pages() cache ops
CacheEv nsp=N Number of object lookups/creations rejected due to lack of space
stl=N Number of stale objects deleted
rtr=N Number of objects retired when relinquished
cul=N Number of objects culled
(*) /proc/fs/fscache/histogram
cat /proc/fs/fscache/histogram
JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
===== ===== ========= ========= ========= ========= =========
This shows how many times a variety of tasks took each amount of time to run,
in the range from 0 jiffies to HZ-1 jiffies. The columns are as follows:
COLUMN TIME MEASUREMENT
======= =======================================================
OBJ INST Length of time to instantiate an object
OP RUNS Length of time a call to process an operation took
OBJ RUNS Length of time a call to process an object event took
RETRV DLY Time between requesting a read and lookup completing
RETRIEVLS Time between beginning and end of a retrieval
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
===========
OBJECT LIST
===========
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
list of all the objects currently allocated and allow them to be viewed
through:
/proc/fs/fscache/objects
This will look something like:
[root@andromeda ~]# head /proc/fs/fscache/objects
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
where the first set of columns before the '|' describes the object:
COLUMN DESCRIPTION
======= ===============================================================
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
PARENT Debugging ID of parent object
STAT Object state
CHLDN Number of child objects of this object
OPS Number of outstanding operations on this object
OOP Number of outstanding child object management operations
IPR
EX Number of outstanding exclusive operations
READS Number of outstanding read operations
EM Object's event mask
EV Events raised on this object
F Object flags
S Object work item busy state mask (1:pending 2:running)
and the second set of columns describes the object's cookie, if present:
COLUMN DESCRIPTION
=============== =======================================================
NETFS_COOKIE_DEF Name of netfs cookie definition
TY Cookie type (IX - index, DT - data, hex - special)
FL Cookie flags
NETFS_DATA Netfs private data stored in the cookie
OBJECT_KEY Object key } 1 column, with separating comma
AUX_DATA Object aux data } presence may be configured
The data shown may be filtered by attaching a key to an appropriate keyring
before viewing the file. Something like:
keyctl add user fscache:objlist <restrictions> @s
where <restrictions> are a selection of the following letters:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
and the following paired letters:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have work queued
s Show objects that don't have work queued
If neither side of a letter pair is given, then both are implied. For example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
By default all objects and all fields will be shown.
=========
DEBUGGING
=========
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
debugging enabled by adjusting the value in:
/sys/module/fscache/parameters/debug
This is a bitmask of debugging streams to enable:
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 Cache management Function entry trace
1 2 Function exit trace
2 4 General
3 8 Cookie management Function entry trace
4 16 Function exit trace
5 32 General
6 64 Page handling Function entry trace
7 128 Function exit trace
8 256 General
9 512 Operation management Function entry trace
10 1024 Function exit trace
11 2048 General
The appropriate set of values should be OR'd together and the result written to
the control file. For example:
echo $((1|8|64|512)) >/sys/module/fscache/parameters/debug
will turn on all function entry debugging.


@ -0,0 +1,14 @@
.. SPDX-License-Identifier: GPL-2.0
Filesystem Caching
==================
.. toctree::
:maxdepth: 2
fscache
object
backend-api
cachefiles
netfs-api
operations


@ -1,6 +1,8 @@
=============================== .. SPDX-License-Identifier: GPL-2.0
FS-CACHE NETWORK FILESYSTEM API
=============================== ===============================
FS-Cache Network Filesystem API
===============================
There's an API by which a network filesystem can make use of the FS-Cache There's an API by which a network filesystem can make use of the FS-Cache
facilities. This is based around a number of principles: facilities. This is based around a number of principles:
@ -19,7 +21,7 @@ facilities. This is based around a number of principles:
This API is declared in <linux/fscache.h>. This API is declared in <linux/fscache.h>.
This document contains the following sections: .. This document contains the following sections:
(1) Network filesystem definition (1) Network filesystem definition
(2) Index definition (2) Index definition
@ -41,12 +43,11 @@ This document contains the following sections:
(18) FS-Cache specific page flags. (18) FS-Cache specific page flags.
============================= Network Filesystem Definition
NETWORK FILESYSTEM DEFINITION
============================= =============================
FS-Cache needs a description of the network filesystem. This is specified FS-Cache needs a description of the network filesystem. This is specified
using a record of the following structure: using a record of the following structure::
struct fscache_netfs { struct fscache_netfs {
uint32_t version; uint32_t version;
@ -71,7 +72,7 @@ The fields are:
another parameter passed into the registration function. another parameter passed into the registration function.
For example, kAFS (linux/fs/afs/) uses the following definitions to describe For example, kAFS (linux/fs/afs/) uses the following definitions to describe
itself: itself::
struct fscache_netfs afs_cache_netfs = { struct fscache_netfs afs_cache_netfs = {
.version = 0, .version = 0,
@ -79,8 +80,7 @@ itself:
}; };
================ Index Definition
INDEX DEFINITION
================ ================
Indices are used for two purposes: Indices are used for two purposes:
@ -114,11 +114,10 @@ There are some limits on indices:
function is recursive. Too many layers will run the kernel out of stack. function is recursive. Too many layers will run the kernel out of stack.
================= Object Definition
OBJECT DEFINITION
================= =================
To define an object, a structure of the following type should be filled out: To define an object, a structure of the following type should be filled out::
struct fscache_cookie_def struct fscache_cookie_def
{ {
@ -149,16 +148,13 @@ This has the following fields:
This is one of the following values: This is one of the following values:
(*) FSCACHE_COOKIE_TYPE_INDEX FSCACHE_COOKIE_TYPE_INDEX
This defines an index, which is a special FS-Cache type. This defines an index, which is a special FS-Cache type.
(*) FSCACHE_COOKIE_TYPE_DATAFILE FSCACHE_COOKIE_TYPE_DATAFILE
This defines an ordinary data file. This defines an ordinary data file.
(*) Any other value between 2 and 255 Any other value between 2 and 255
This defines an extraordinary object such as an XATTR. This defines an extraordinary object such as an XATTR.
(2) The name of the object type (NUL terminated unless all 16 chars are used) (2) The name of the object type (NUL terminated unless all 16 chars are used)
@ -192,9 +188,14 @@ This has the following fields:
If present, the function should return one of the following values: If present, the function should return one of the following values:
(*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is FSCACHE_CHECKAUX_OKAY
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update - the entry is okay as is
(*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
FSCACHE_CHECKAUX_NEEDS_UPDATE
- the entry requires update
FSCACHE_CHECKAUX_OBSOLETE
- the entry should be deleted
This function can also be used to extract data from the auxiliary data in This function can also be used to extract data from the auxiliary data in
the cache and copy it into the netfs's structures. the cache and copy it into the netfs's structures.
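The three return values above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the enum is a stand-in rather than the kernel's definition, and `struct demo_aux` and `demo_check_aux()` are hypothetical names, not part of the FS-Cache API. It only shows how a netfs might map a comparison of stored and current auxiliary data onto "okay", "needs update" and "obsolete".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in result codes; the values are illustrative, not the kernel's. */
enum checkaux_result {
	CHECKAUX_OKAY,		/* the entry is okay as is */
	CHECKAUX_NEEDS_UPDATE,	/* the entry requires update */
	CHECKAUX_OBSOLETE,	/* the entry should be deleted */
};

/* Hypothetical auxiliary data layout chosen for this sketch. */
struct demo_aux {
	uint64_t data_version;
	uint64_t mtime;
};

/* Compare the blob stored in the cache against the netfs's current view. */
enum checkaux_result demo_check_aux(const void *cached, size_t cached_len,
				    const struct demo_aux *current)
{
	struct demo_aux stored;

	assert(current != NULL);

	/* A blob of the wrong size cannot be interpreted: discard the entry. */
	if (cached == NULL || cached_len != sizeof(stored))
		return CHECKAUX_OBSOLETE;

	memcpy(&stored, cached, sizeof(stored));

	/* Different data version: the cached contents are stale. */
	if (stored.data_version != current->data_version)
		return CHECKAUX_OBSOLETE;

	/* Same data, but the recorded metadata has drifted. */
	if (stored.mtime != current->mtime)
		return CHECKAUX_NEEDS_UPDATE;

	return CHECKAUX_OKAY;
}
```

Which fields count as "obsolete" versus "needs update" is a policy choice of the netfs; the split shown here is only one plausible arrangement.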
@ -236,32 +237,30 @@ This has the following fields:
This function is not required for indices as they're not permitted data. This function is not required for indices as they're not permitted data.
=================================== Network Filesystem (Un)registration
NETWORK FILESYSTEM (UN)REGISTRATION
=================================== ===================================
The first step is to declare the network filesystem to the cache. This also The first step is to declare the network filesystem to the cache. This also
involves specifying the layout of the primary index (for AFS, this would be the involves specifying the layout of the primary index (for AFS, this would be the
"cell" level). "cell" level).
The registration function is: The registration function is::
int fscache_register_netfs(struct fscache_netfs *netfs); int fscache_register_netfs(struct fscache_netfs *netfs);
It just takes a pointer to the netfs definition. It returns 0 or an error as It just takes a pointer to the netfs definition. It returns 0 or an error as
appropriate. appropriate.
For kAFS, registration is done as follows: For kAFS, registration is done as follows::
ret = fscache_register_netfs(&afs_cache_netfs); ret = fscache_register_netfs(&afs_cache_netfs);
The last step is, of course, unregistration: The last step is, of course, unregistration::
void fscache_unregister_netfs(struct fscache_netfs *netfs); void fscache_unregister_netfs(struct fscache_netfs *netfs);
================ Cache Tag Lookup
CACHE TAG LOOKUP
================ ================
FS-Cache permits the use of more than one cache. To permit particular index FS-Cache permits the use of more than one cache. To permit particular index
@ -270,7 +269,7 @@ representation tags. This step is optional; it can be left entirely up to
FS-Cache as to which cache should be used. The problem with doing that is that FS-Cache as to which cache should be used. The problem with doing that is that
FS-Cache will always pick the first cache that was registered. FS-Cache will always pick the first cache that was registered.
To get the representation for a named tag: To get the representation for a named tag::
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
@ -278,7 +277,7 @@ This takes a text string as the name and returns a representation of a tag. It
will never return an error. It may return a dummy tag, however, if it runs out will never return an error. It may return a dummy tag, however, if it runs out
of memory; this will inhibit caching with this tag. of memory; this will inhibit caching with this tag.
Any representation so obtained must be released by passing it to this function: Any representation so obtained must be released by passing it to this function::
void fscache_release_cache_tag(struct fscache_cache_tag *tag); void fscache_release_cache_tag(struct fscache_cache_tag *tag);
@ -286,13 +285,12 @@ The tag will be retrieved by FS-Cache when it calls the object definition
operation select_cache(). operation select_cache().
================== Index Registration
INDEX REGISTRATION
================== ==================
The third step is to inform FS-Cache about part of an index hierarchy that can The third step is to inform FS-Cache about part of an index hierarchy that can
be used to locate files. This is done by requesting a cookie for each index in be used to locate files. This is done by requesting a cookie for each index in
the path to the file: the path to the file::
struct fscache_cookie * struct fscache_cookie *
fscache_acquire_cookie(struct fscache_cookie *parent, fscache_acquire_cookie(struct fscache_cookie *parent,
@ -339,7 +337,7 @@ must be enabled to do anything with it. A disabled cookie can be enabled by
calling fscache_enable_cookie() (see below). calling fscache_enable_cookie() (see below).
For example, with AFS, a cell would be added to the primary index. This index For example, with AFS, a cell would be added to the primary index. This index
entry would have a dependent inode containing volume mappings within this cell: entry would have a dependent inode containing volume mappings within this cell::
cell->cache = cell->cache =
fscache_acquire_cookie(afs_cache_netfs.primary_index, fscache_acquire_cookie(afs_cache_netfs.primary_index,
@ -349,7 +347,7 @@ entry would have a dependent inode containing volume mappings within this cell:
cell, 0, true); cell, 0, true);
And then a particular volume could be added to that index by ID, creating And then a particular volume could be added to that index by ID, creating
another index for vnodes (AFS inode equivalents): another index for vnodes (AFS inode equivalents)::
volume->cache = volume->cache =
fscache_acquire_cookie(volume->cell->cache, fscache_acquire_cookie(volume->cell->cache,
@ -359,13 +357,12 @@ another index for vnodes (AFS inode equivalents):
volume, 0, true); volume, 0, true);
====================== Data File Registration
DATA FILE REGISTRATION
====================== ======================
The fourth step is to request a data file be created in the cache. This is The fourth step is to request a data file be created in the cache. This is
identical to index cookie acquisition. The only difference is that the type in identical to index cookie acquisition. The only difference is that the type in
the object definition should be something other than index type. the object definition should be something other than index type::
vnode->cache = vnode->cache =
fscache_acquire_cookie(volume->cache, fscache_acquire_cookie(volume->cache,
@ -375,15 +372,14 @@ the object definition should be something other than index type.
vnode, vnode->status.size, true); vnode, vnode->status.size, true);
================================= Miscellaneous Object Registration
MISCELLANEOUS OBJECT REGISTRATION
================================= =================================
An optional step is to request an object of miscellaneous type be created in An optional step is to request an object of miscellaneous type be created in
the cache. This is almost identical to index cookie acquisition. The only the cache. This is almost identical to index cookie acquisition. The only
difference is that the type in the object definition should be something other difference is that the type in the object definition should be something other
than index type. While the parent object could be an index, it's more likely than index type. While the parent object could be an index, it's more likely
it would be some other type of object such as a data file. it would be some other type of object such as a data file::
xattr->cache = xattr->cache =
fscache_acquire_cookie(vnode->cache, fscache_acquire_cookie(vnode->cache,
@ -396,13 +392,12 @@ Miscellaneous objects might be used to store extended attributes or directory
entries for example. entries for example.
========================== Setting the Data File Size
SETTING THE DATA FILE SIZE
========================== ==========================
The fifth step is to set the physical attributes of the file, such as its size. The fifth step is to set the physical attributes of the file, such as its size.
This doesn't automatically reserve any space in the cache, but permits the This doesn't automatically reserve any space in the cache, but permits the
cache to adjust its metadata for data tracking appropriately: cache to adjust its metadata for data tracking appropriately::
int fscache_attr_changed(struct fscache_cookie *cookie); int fscache_attr_changed(struct fscache_cookie *cookie);
@ -417,8 +412,7 @@ some point in the future, and as such, it may happen after the function returns
to the caller. The attribute adjustment excludes read and write operations. to the caller. The attribute adjustment excludes read and write operations.
===================== Page Alloc/Read/Write
PAGE ALLOC/READ/WRITE
===================== =====================
And the sixth step is to store and retrieve pages in the cache. There are And the sixth step is to store and retrieve pages in the cache. There are
@ -441,7 +435,7 @@ PAGE READ
Firstly, the netfs should ask FS-Cache to examine the caches and read the Firstly, the netfs should ask FS-Cache to examine the caches and read the
contents cached for a particular page of a particular file if present, or else contents cached for a particular page of a particular file if present, or else
allocate space to store the contents if not: allocate space to store the contents if not::
typedef typedef
void (*fscache_rw_complete_t)(struct page *page, void (*fscache_rw_complete_t)(struct page *page,
@ -474,14 +468,14 @@ Else if there's a copy of the page resident in the cache:
(4) When the read is complete, end_io_func() will be invoked with: (4) When the read is complete, end_io_func() will be invoked with:
(*) The netfs data supplied when the cookie was created. * The netfs data supplied when the cookie was created.
(*) The page descriptor. * The page descriptor.
(*) The context argument passed to the above function. This will be * The context argument passed to the above function. This will be
maintained with the get_context/put_context functions mentioned above. maintained with the get_context/put_context functions mentioned above.
(*) An argument that's 0 on success or negative for an error code. * An argument that's 0 on success or negative for an error code.
If an error occurs, it should be assumed that the page contains no usable If an error occurs, it should be assumed that the page contains no usable
data. fscache_readpages_cancel() may need to be called. data. fscache_readpages_cancel() may need to be called.
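As a rough illustration of the completion rule described above, the following sketch shows a callback that trusts the page only when the error argument is 0. `struct demo_page` and `demo_read_complete()` are hypothetical stand-ins invented for this example (the real callback operates on `struct page`), and the argument order of page, context, error is assumed from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; tracks only what the sketch needs. */
struct demo_page {
	bool uptodate;		/* contents may be handed to readers */
	bool needs_server_read;	/* must be fetched from the server instead */
};

/* Completion callback sketch: error is 0 on success or a negative code. */
void demo_read_complete(struct demo_page *page, void *context, int error)
{
	(void)context;	/* unused in this sketch */
	assert(page != NULL);

	if (error == 0) {
		/* The cache supplied the data. */
		page->uptodate = true;
		page->needs_server_read = false;
	} else {
		/* Assume the page holds no usable data on any error. */
		page->uptodate = false;
		page->needs_server_read = true;
	}
}
```

A real netfs would also unlock the page and, on error, decide whether the cancellation call mentioned above is needed; both are omitted here.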
@ -504,11 +498,11 @@ This function may also return -ENOMEM or -EINTR, in which case it won't have
read any data from the cache. read any data from the cache.
PAGE ALLOCATE Page Allocate
------------- -------------
Alternatively, if there's not expected to be any data in the cache for a page Alternatively, if there's not expected to be any data in the cache for a page
because the file has been extended, a block can simply be allocated instead: because the file has been extended, a block can simply be allocated instead::
int fscache_alloc_page(struct fscache_cookie *cookie, int fscache_alloc_page(struct fscache_cookie *cookie,
struct page *page, struct page *page,
@ -523,12 +517,12 @@ The mark_pages_cached() cookie operation will be called on the page if
successful. successful.
PAGE WRITE Page Write
---------- ----------
Secondly, if the netfs changes the contents of the page (either due to an Secondly, if the netfs changes the contents of the page (either due to an
initial download or if a user performs a write), then the page should be initial download or if a user performs a write), then the page should be
written back to the cache: written back to the cache::
int fscache_write_page(struct fscache_cookie *cookie, int fscache_write_page(struct fscache_cookie *cookie,
struct page *page, struct page *page,
@ -566,11 +560,11 @@ place if unforeseen circumstances arose (such as a disk error).
Writing takes place asynchronously. Writing takes place asynchronously.
MULTIPLE PAGE READ Multiple Page Read
------------------ ------------------
A facility is provided to read several pages at once, as requested by the A facility is provided to read several pages at once, as requested by the
readpages() address space operation: readpages() address space operation::
int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
struct address_space *mapping, struct address_space *mapping,
@ -598,7 +592,7 @@ This works in a similar way to fscache_read_or_alloc_page(), except:
be returned. be returned.
Otherwise, if all pages had reads dispatched, then 0 will be returned, the Otherwise, if all pages had reads dispatched, then 0 will be returned, the
list will be empty and *nr_pages will be 0. list will be empty and ``*nr_pages`` will be 0.
(4) end_io_func will be called once for each page being read as the reads (4) end_io_func will be called once for each page being read as the reads
complete. It will be called in process context if error != 0, but it may complete. It will be called in process context if error != 0, but it may
@ -609,13 +603,13 @@ some of the pages being read and some being allocated. Those pages will have
been marked appropriately and will need uncaching. been marked appropriately and will need uncaching.
CANCELLATION OF UNREAD PAGES Cancellation of Unread Pages
---------------------------- ----------------------------
If one or more pages are passed to fscache_read_or_alloc_pages() but not then If one or more pages are passed to fscache_read_or_alloc_pages() but not then
read from the cache and also not read from the underlying filesystem then read from the cache and also not read from the underlying filesystem then
those pages will need to have any marks and reservations removed. This can be those pages will need to have any marks and reservations removed. This can be
done by calling: done by calling::
void fscache_readpages_cancel(struct fscache_cookie *cookie, void fscache_readpages_cancel(struct fscache_cookie *cookie,
struct list_head *pages); struct list_head *pages);
@ -625,11 +619,10 @@ fscache_read_or_alloc_pages(). Every page in the pages list will be examined
and any that have PG_fscache set will be uncached. and any that have PG_fscache set will be uncached.
============== Page Uncaching
PAGE UNCACHING
============== ==============
To uncache a page, this function should be called: To uncache a page, this function should be called::
void fscache_uncache_page(struct fscache_cookie *cookie, void fscache_uncache_page(struct fscache_cookie *cookie,
struct page *page); struct page *page);
@ -644,12 +637,12 @@ data file must be retired (see the relinquish cookie function below).
Furthermore, note that this does not cancel the asynchronous read or write Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions, so the page operation started by the read/alloc and write functions, so the page
invalidation functions must use: invalidation functions must use::
bool fscache_check_page_write(struct fscache_cookie *cookie, bool fscache_check_page_write(struct fscache_cookie *cookie,
struct page *page); struct page *page);
to see if a page is being written to the cache, and: to see if a page is being written to the cache, and::
void fscache_wait_on_page_write(struct fscache_cookie *cookie, void fscache_wait_on_page_write(struct fscache_cookie *cookie,
struct page *page); struct page *page);
@ -660,7 +653,7 @@ to wait for it to finish if it is.
When releasepage() is being implemented, a special FS-Cache function exists to When releasepage() is being implemented, a special FS-Cache function exists to
manage the heuristics of coping with vmscan trying to eject pages, which may manage the heuristics of coping with vmscan trying to eject pages, which may
conflict with the cache trying to write pages to the cache (which may itself conflict with the cache trying to write pages to the cache (which may itself
need to allocate memory): need to allocate memory)::
bool fscache_maybe_release_page(struct fscache_cookie *cookie, bool fscache_maybe_release_page(struct fscache_cookie *cookie,
struct page *page, struct page *page,
@ -676,12 +669,12 @@ storage request to complete, or it may attempt to cancel the storage request -
in which case the page will not be stored in the cache this time. in which case the page will not be stored in the cache this time.
BULK INODE PAGE UNCACHE Bulk Inode Page Uncache
----------------------- -----------------------
A convenience routine is provided to perform an uncache on all the pages A convenience routine is provided to perform an uncache on all the pages
attached to an inode. This assumes that the pages on the inode correspond on a attached to an inode. This assumes that the pages on the inode correspond on a
1:1 basis with the pages in the cache. 1:1 basis with the pages in the cache::
void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
struct inode *inode); struct inode *inode);
@ -692,12 +685,11 @@ written to the cache and for the cache to finish with the page generally. No
error is returned. error is returned.
=============================== Index and Data File Consistency
INDEX AND DATA FILE CONSISTENCY
=============================== ===============================
To find out whether auxiliary data for an object is up to date within the To find out whether auxiliary data for an object is up to date within the
cache, the following function can be called: cache, the following function can be called::
int fscache_check_consistency(struct fscache_cookie *cookie, int fscache_check_consistency(struct fscache_cookie *cookie,
const void *aux_data); const void *aux_data);
@ -708,7 +700,7 @@ data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also
return -ENOMEM and -ERESTARTSYS. return -ENOMEM and -ERESTARTSYS.
To request an update of the index data for an index or other object, the To request an update of the index data for an index or other object, the
following function should be called: following function should be called::
void fscache_update_cookie(struct fscache_cookie *cookie, void fscache_update_cookie(struct fscache_cookie *cookie,
const void *aux_data); const void *aux_data);
@ -721,8 +713,7 @@ Note that partial updates may happen automatically at other times, such as when
data blocks are added to a data file object. data blocks are added to a data file object.
================= Cookie Enablement
COOKIE ENABLEMENT
================= =================
Cookies exist in one of two states: enabled and disabled. If a cookie is Cookies exist in one of two states: enabled and disabled. If a cookie is
@ -731,7 +722,7 @@ invalidate its state; allocate, read or write backing pages - though it is
still possible to uncache pages and relinquish the cookie. still possible to uncache pages and relinquish the cookie.
The initial enablement state is set by fscache_acquire_cookie(), but the cookie The initial enablement state is set by fscache_acquire_cookie(), but the cookie
can be enabled or disabled later. To disable a cookie, call: can be enabled or disabled later. To disable a cookie, call::
void fscache_disable_cookie(struct fscache_cookie *cookie, void fscache_disable_cookie(struct fscache_cookie *cookie,
const void *aux_data, const void *aux_data,
@ -746,7 +737,7 @@ All possible failures are handled internally. The caller should consider
calling fscache_uncache_all_inode_pages() afterwards to make sure all page calling fscache_uncache_all_inode_pages() afterwards to make sure all page
markings are cleared up. markings are cleared up.
Cookies can be enabled or reenabled with: Cookies can be enabled or reenabled with::
void fscache_enable_cookie(struct fscache_cookie *cookie, void fscache_enable_cookie(struct fscache_cookie *cookie,
const void *aux_data, const void *aux_data,
@ -771,13 +762,12 @@ In both cases, the cookie's auxiliary data buffer is updated from aux_data if
that is non-NULL inside the enablement lock before proceeding. that is non-NULL inside the enablement lock before proceeding.
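The two-state rule can be modelled in a few lines. This is a sketch under stated assumptions: `struct demo_cookie`, `enum demo_cookie_op` and `demo_op_permitted()` are hypothetical names, and the operation list covers only the cases named in the text above (invalidation and page allocate/read/write are ignored while disabled; uncaching and relinquishment stay possible).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation classes, limited to those named in the text. */
enum demo_cookie_op {
	DEMO_OP_INVALIDATE,
	DEMO_OP_ALLOC_PAGE,
	DEMO_OP_READ_PAGE,
	DEMO_OP_WRITE_PAGE,
	DEMO_OP_UNCACHE_PAGE,
	DEMO_OP_RELINQUISH,
};

/* Hypothetical cookie carrying nothing but its enablement state. */
struct demo_cookie {
	bool enabled;
};

/* Return true if the operation would be acted on in the current state. */
bool demo_op_permitted(const struct demo_cookie *cookie,
		       enum demo_cookie_op op)
{
	assert(cookie != NULL);

	if (cookie->enabled)
		return true;

	/* A disabled cookie can still be cleaned up. */
	switch (op) {
	case DEMO_OP_UNCACHE_PAGE:
	case DEMO_OP_RELINQUISH:
		return true;
	default:
		return false;
	}
}
```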
=============================== Miscellaneous Cookie Operations
MISCELLANEOUS COOKIE OPERATIONS
=============================== ===============================
There are a number of operations that can be used to control cookies: There are a number of operations that can be used to control cookies:
(*) Cookie pinning: * Cookie pinning::
int fscache_pin_cookie(struct fscache_cookie *cookie); int fscache_pin_cookie(struct fscache_cookie *cookie);
void fscache_unpin_cookie(struct fscache_cookie *cookie); void fscache_unpin_cookie(struct fscache_cookie *cookie);
@ -790,7 +780,7 @@ There are a number of operations that can be used to control cookies:
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
-EIO if there's any other problem. -EIO if there's any other problem.
(*) Data space reservation: * Data space reservation::
int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
@ -809,11 +799,10 @@ There are a number of operations that can be used to control cookies:
make space if it's not in use. make space if it's not in use.
===================== Cookie Unregistration
COOKIE UNREGISTRATION
===================== =====================
To get rid of a cookie, this function should be called. To get rid of a cookie, this function should be called::
void fscache_relinquish_cookie(struct fscache_cookie *cookie, void fscache_relinquish_cookie(struct fscache_cookie *cookie,
const void *aux_data, const void *aux_data,
@ -835,16 +824,14 @@ the cookies for "child" indices, objects and pages have been relinquished
first. first.
================== Index Invalidation
INDEX INVALIDATION
================== ==================
There is no direct way to invalidate an index subtree. To do this, the caller There is no direct way to invalidate an index subtree. To do this, the caller
should relinquish and retire the cookie they have, and then acquire a new one. should relinquish and retire the cookie they have, and then acquire a new one.
====================== Data File Invalidation
DATA FILE INVALIDATION
====================== ======================
Sometimes it will be necessary to invalidate an object that contains data. Sometimes it will be necessary to invalidate an object that contains data.
@ -853,7 +840,7 @@ change - at which point the netfs has to throw away all the state it had for an
inode and reload from the server. inode and reload from the server.
To indicate that a cache object should be invalidated, the following function To indicate that a cache object should be invalidated, the following function
can be called: can be called::
void fscache_invalidate(struct fscache_cookie *cookie); void fscache_invalidate(struct fscache_cookie *cookie);
@ -868,13 +855,12 @@ auxiliary data update operation as it is very likely these will have changed.
Using the following function, the netfs can wait for the invalidation operation Using the following function, the netfs can wait for the invalidation operation
to have reached a point at which it can start submitting ordinary operations to have reached a point at which it can start submitting ordinary operations
once again: once again::
void fscache_wait_on_invalidate(struct fscache_cookie *cookie); void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
=========================== FS-Cache Specific Page Flag
FS-CACHE SPECIFIC PAGE FLAG
=========================== ===========================
FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
@ -898,7 +884,7 @@ was given under certain circumstances.
This bit does not overlap with other flags such as PG_private. This means that FS-Cache This bit does not overlap with other flags such as PG_private. This means that FS-Cache
can be used with a filesystem that uses the block buffering code. can be used with a filesystem that uses the block buffering code.
There are a number of operations defined on this flag: There are a number of operations defined on this flag::
int PageFsCache(struct page *page); int PageFsCache(struct page *page);
void SetPageFsCache(struct page *page) void SetPageFsCache(struct page *page)


@ -1,10 +1,12 @@
==================================================== .. SPDX-License-Identifier: GPL-2.0
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
==================================================== ====================================================
In-Kernel Cache Object Representation and Management
====================================================
By: David Howells <dhowells@redhat.com> By: David Howells <dhowells@redhat.com>
Contents: .. Contents:
(*) Representation (*) Representation
@ -18,8 +20,7 @@ Contents:
(*) The set of events. (*) The set of events.
============== Representation
REPRESENTATION
============== ==============
FS-Cache maintains an in-kernel representation of each object that a netfs is FS-Cache maintains an in-kernel representation of each object that a netfs is
@ -38,7 +39,7 @@ or even by no objects (it may not be cached).
Furthermore, both cookies and objects are hierarchical. The two hierarchies Furthermore, both cookies and objects are hierarchical. The two hierarchies
correspond, but the cookies tree is a superset of the union of the object trees correspond, but the cookies tree is a superset of the union of the object trees
of multiple caches: of multiple caches::
NETFS INDEX TREE : CACHE 1 : CACHE 2 NETFS INDEX TREE : CACHE 1 : CACHE 2
: : : :
@ -89,8 +90,7 @@ pointers to the cookies. The cookies themselves and any objects attached to
those cookies are hidden from it. those cookies are hidden from it.
=============================== Object Management State Machine
OBJECT MANAGEMENT STATE MACHINE
=============================== ===============================
Within FS-Cache, each active object is managed by its own individual state Within FS-Cache, each active object is managed by its own individual state
@ -124,7 +124,7 @@ is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()). fscache_enqueue_object()).
PROVISION OF CPU TIME Provision of CPU Time
--------------------- ---------------------
The work to be done by the various states was given CPU time by the threads of The work to be done by the various states was given CPU time by the threads of
@ -141,7 +141,7 @@ because:
workqueues don't necessarily have the right numbers of threads. workqueues don't necessarily have the right numbers of threads.
LOCKING SIMPLIFICATION Locking Simplification
---------------------- ----------------------
Because only one worker thread may be operating on any particular object's Because only one worker thread may be operating on any particular object's
@ -151,8 +151,7 @@ from the cache backend's representation (fscache_object) - which may be
requested from either end. requested from either end.
================= The Set of States
THE SET OF STATES
================= =================
The object state machine has a set of states that it can be in. There are The object state machine has a set of states that it can be in. There are
@ -275,19 +274,17 @@ memory and potentially deletes stuff from disk:
this state. this state.
THE SET OF EVENTS The Set of Events
----------------- -----------------
There are a number of events that can be raised to an object state machine: There are a number of events that can be raised to an object state machine:
(*) FSCACHE_OBJECT_EV_UPDATE FSCACHE_OBJECT_EV_UPDATE
The netfs requested that an object be updated. The state machine will ask The netfs requested that an object be updated. The state machine will ask
the cache backend to update the object, and the cache backend will ask the the cache backend to update the object, and the cache backend will ask the
netfs for details of the change through its cookie definition ops. netfs for details of the change through its cookie definition ops.
(*) FSCACHE_OBJECT_EV_CLEARED FSCACHE_OBJECT_EV_CLEARED
This is signalled in two circumstances: This is signalled in two circumstances:
(a) when an object's last child object is dropped and (a) when an object's last child object is dropped and
@ -296,20 +293,16 @@ There are a number of events that can be raised to an object state machine:
This is used to proceed from the dying state. This is used to proceed from the dying state.
(*) FSCACHE_OBJECT_EV_ERROR FSCACHE_OBJECT_EV_ERROR
This is signalled when an I/O error occurs during the processing of some This is signalled when an I/O error occurs during the processing of some
object. object.
(*) FSCACHE_OBJECT_EV_RELEASE FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE
(*) FSCACHE_OBJECT_EV_RETIRE
These are signalled when the netfs relinquishes a cookie it was using. These are signalled when the netfs relinquishes a cookie it was using.
The event selected depends on whether the netfs asks for the backing The event selected depends on whether the netfs asks for the backing
object to be retired (deleted) or retained. object to be retired (deleted) or retained.
(*) FSCACHE_OBJECT_EV_WITHDRAW FSCACHE_OBJECT_EV_WITHDRAW
This is signalled when the cache backend wants to withdraw an object. This is signalled when the cache backend wants to withdraw an object.
This means that the object will have to be detached from the netfs's This means that the object will have to be detached from the netfs's
cookie. cookie.
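A small model of how events might be raised against a per-state mask, and of how relinquishment picks between the release and retire events, is sketched below. All names (`enum demo_event`, `struct demo_object`, `demo_raise_event()`, `demo_relinquish_event()`) and the event numbering are assumptions for illustration; the kernel's own definitions may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative event numbers; not the kernel's enum. */
enum demo_event {
	DEMO_EV_UPDATE,
	DEMO_EV_CLEARED,
	DEMO_EV_ERROR,
	DEMO_EV_RELEASE,
	DEMO_EV_RETIRE,
	DEMO_EV_WITHDRAW,
	DEMO_NR_EVENTS
};

/* Hypothetical object: pending events plus the mask of events listened for. */
struct demo_object {
	unsigned long events;		/* pending events, one bit each */
	unsigned long event_mask;	/* events the current state accepts */
	bool queued;			/* set once the object needs processing */
};

/* Relinquishing a cookie raises RETIRE to delete the backing object,
 * or RELEASE to retain it. */
enum demo_event demo_relinquish_event(bool retire)
{
	return retire ? DEMO_EV_RETIRE : DEMO_EV_RELEASE;
}

/* Record an event; queue the object only if the event is not masked off. */
void demo_raise_event(struct demo_object *obj, enum demo_event ev)
{
	assert(obj != NULL && ev < DEMO_NR_EVENTS);

	obj->events |= 1UL << ev;
	if (obj->events & obj->event_mask)
		obj->queued = true;
}
```

In this model a masked event stays pending in `events` and takes effect once a later state unmasks it, which is one way to read the masking behaviour described for the state machine.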


@ -1,10 +1,12 @@
================================ .. SPDX-License-Identifier: GPL-2.0
ASYNCHRONOUS OPERATIONS HANDLING
================================ ================================
Asynchronous Operations Handling
================================
By: David Howells <dhowells@redhat.com> By: David Howells <dhowells@redhat.com>
Contents: .. Contents:
(*) Overview. (*) Overview.
@ -17,8 +19,7 @@ Contents:
(*) Asynchronous callback. (*) Asynchronous callback.
======== Overview
OVERVIEW
======== ========
FS-Cache has an asynchronous operations handling facility that it uses for its FS-Cache has an asynchronous operations handling facility that it uses for its
@ -33,11 +34,10 @@ backend for completion.
To make use of this facility, <linux/fscache-cache.h> should be #included. To make use of this facility, <linux/fscache-cache.h> should be #included.
=============================== Operation Record Initialisation
OPERATION RECORD INITIALISATION
=============================== ===============================
An operation is recorded in an fscache_operation struct: An operation is recorded in an fscache_operation struct::
struct fscache_operation { struct fscache_operation {
union { union {
@ -50,7 +50,7 @@ An operation is recorded in an fscache_operation struct:
}; };
Someone wanting to issue an operation should allocate something with this Someone wanting to issue an operation should allocate something with this
struct embedded in it. They should initialise it by calling: struct embedded in it. They should initialise it by calling::
void fscache_operation_init(struct fscache_operation *op, void fscache_operation_init(struct fscache_operation *op,
fscache_operation_release_t release); fscache_operation_release_t release);
@ -67,8 +67,7 @@ FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards. operation and waited for afterwards.
========== Parameters
PARAMETERS
========== ==========
There are a number of parameters that can be set in the operation record's flag There are a number of parameters that can be set in the operation record's flag
@ -87,7 +86,7 @@ operations:
If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
before submitting the operation, and the operating thread must wait for it before submitting the operation, and the operating thread must wait for it
to be cleared before proceeding: to be cleared before proceeding::
wait_on_bit(&op->flags, FSCACHE_OP_WAITING, wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
TASK_UNINTERRUPTIBLE); TASK_UNINTERRUPTIBLE);
@ -101,7 +100,7 @@ operations:
page to a netfs page after the backing fs has read the page in. page to a netfs page after the backing fs has read the page in.
If this option is used, op->fast_work and op->processor must be If this option is used, op->fast_work and op->processor must be
initialised before submitting the operation: initialised before submitting the operation::
INIT_WORK(&op->fast_work, do_some_work); INIT_WORK(&op->fast_work, do_some_work);
@ -114,7 +113,7 @@ operations:
pages that have just been fetched from a remote server. pages that have just been fetched from a remote server.
If this option is used, op->slow_work and op->processor must be If this option is used, op->slow_work and op->processor must be
initialised before submitting the operation: initialised before submitting the operation::
fscache_operation_init_slow(op, processor) fscache_operation_init_slow(op, processor)
@ -132,8 +131,7 @@ Furthermore, operations may be one of two types:
operations running at the same time. operations running at the same time.
========= Procedure
PROCEDURE
========= =========
Operations are used through the following procedure: Operations are used through the following procedure:
@ -143,7 +141,7 @@ Operations are used through the following procedure:
generic op embedded within. generic op embedded within.
(2) The submitting thread must then submit the operation for processing using (2) The submitting thread must then submit the operation for processing using
one of the following two functions: one of the following two functions::
int fscache_submit_op(struct fscache_object *object, int fscache_submit_op(struct fscache_object *object,
struct fscache_operation *op); struct fscache_operation *op);
@ -164,7 +162,7 @@ Operations are used through the following procedure:
operation of conflicting exclusivity is in progress on the object. operation of conflicting exclusivity is in progress on the object.
If the operation is asynchronous, the manager will retain a reference to If the operation is asynchronous, the manager will retain a reference to
it, so the caller should put their reference to it by passing it to: it, so the caller should put their reference to it by passing it to::
void fscache_put_operation(struct fscache_operation *op); void fscache_put_operation(struct fscache_operation *op);
@ -179,12 +177,12 @@ Operations are used through the following procedure:
(4) The operation holds an effective lock upon the object, preventing other (4) The operation holds an effective lock upon the object, preventing other
exclusive ops conflicting until it is released. The operation can be exclusive ops conflicting until it is released. The operation can be
enqueued for further immediate asynchronous processing by adjusting the enqueued for further immediate asynchronous processing by adjusting the
CPU time provisioning option if necessary, eg: CPU time provisioning option if necessary, eg::
op->flags &= ~FSCACHE_OP_TYPE; op->flags &= ~FSCACHE_OP_TYPE;
op->flags |= FSCACHE_OP_FAST; op->flags |= FSCACHE_OP_FAST;
and calling: and calling::
void fscache_enqueue_operation(struct fscache_operation *op) void fscache_enqueue_operation(struct fscache_operation *op)
@ -192,13 +190,12 @@ Operations are used through the following procedure:
pools. pools.
===================== Asynchronous Callback
ASYNCHRONOUS CALLBACK
===================== =====================
When used in asynchronous mode, the worker thread pool will invoke the When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. This should then get at the processor method with a pointer to the operation. This should then get at the
container struct by using container_of(): container struct by using container_of()::
static void fscache_write_op(struct fscache_operation *_op) static void fscache_write_op(struct fscache_operation *_op)
{ {


@ -1,7 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
===========================================
Mounting root file system via SMB (cifs.ko) Mounting root file system via SMB (cifs.ko)
=========================================== ===========================================
Written 2019 by Paulo Alcantara <palcantara@suse.de> Written 2019 by Paulo Alcantara <palcantara@suse.de>
Written 2019 by Aurelien Aptel <aaptel@suse.com> Written 2019 by Aurelien Aptel <aaptel@suse.com>
The CONFIG_CIFS_ROOT option enables experimental root file system The CONFIG_CIFS_ROOT option enables experimental root file system
@ -32,7 +36,7 @@ Server configuration
==================== ====================
To enable SMB1+UNIX extensions you will need to set these global To enable SMB1+UNIX extensions you will need to set these global
settings in Samba smb.conf: settings in Samba smb.conf::
[global] [global]
server min protocol = NT1 server min protocol = NT1
@ -41,12 +45,16 @@ settings in Samba smb.conf:
Kernel command line Kernel command line
=================== ===================
root=/dev/cifs ::
root=/dev/cifs
This is just a virtual device that basically tells the kernel to mount This is just a virtual device that basically tells the kernel to mount
the root file system via SMB protocol. the root file system via SMB protocol.
cifsroot=//<server-ip>/<share>[,options] ::
cifsroot=//<server-ip>/<share>[,options]
Enables the kernel to mount the root file system via SMB that is Enables the kernel to mount the root file system via SMB that is
located at the <server-ip> and <share> specified in this option. located at the <server-ip> and <share> specified in this option.
@ -65,33 +73,33 @@ options
Examples Examples
======== ========
Export root file system as a Samba share in smb.conf file. Export root file system as a Samba share in smb.conf file::
... ...
[linux] [linux]
path = /path/to/rootfs path = /path/to/rootfs
read only = no read only = no
guest ok = yes guest ok = yes
force user = root force user = root
force group = root force group = root
browseable = yes browseable = yes
writeable = yes writeable = yes
admin users = root admin users = root
public = yes public = yes
create mask = 0777 create mask = 0777
directory mask = 0777 directory mask = 0777
... ...
Restart smb service. Restart smb service::
# systemctl restart smb # systemctl restart smb
Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and
CONFIG_IP_PNP options enabled. CONFIG_IP_PNP options enabled::
# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \ # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
-kernel /path/to/linux/arch/x86/boot/bzImage -nographic \ -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
-append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3" -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
1: https://wiki.samba.org/index.php/UNIX_Extensions 1: https://wiki.samba.org/index.php/UNIX_Extensions


@ -1,5 +1,6 @@
=======================================================
configfs - Userspace-driven kernel object configuration. Configfs - Userspace-driven Kernel Object Configuration
=======================================================
Joel Becker <joel.becker@oracle.com> Joel Becker <joel.becker@oracle.com>
@ -9,7 +10,8 @@ Copyright (c) 2005 Oracle Corporation,
Joel Becker <joel.becker@oracle.com> Joel Becker <joel.becker@oracle.com>
[What is configfs?] What is configfs?
=================
configfs is a ram-based filesystem that provides the converse of configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of sysfs's functionality. Where sysfs is a filesystem-based view of
@ -35,10 +37,11 @@ kernel modules backing the items must respond to this.
Both sysfs and configfs can and should exist together on the same Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other. system. One is not a replacement for the other.
[Using configfs] Using configfs
==============
configfs can be compiled as a module or into the kernel. You can access configfs can be compiled as a module or into the kernel. You can access
it by doing it by doing::
mount -t configfs none /config mount -t configfs none /config
@ -56,28 +59,29 @@ values. Don't mix more than one attribute in one attribute file.
There are two types of configfs attributes: There are two types of configfs attributes:
* Normal attributes, which, similar to sysfs attributes, are small ASCII text * Normal attributes, which, similar to sysfs attributes, are small ASCII text
files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
only one value per file should be used, and the same caveats from sysfs apply. only one value per file should be used, and the same caveats from sysfs apply.
Configfs expects write(2) to store the entire buffer at once. When writing to Configfs expects write(2) to store the entire buffer at once. When writing to
normal configfs attributes, userspace processes should first read the entire normal configfs attributes, userspace processes should first read the entire
file, modify the portions they wish to change, and then write the entire file, modify the portions they wish to change, and then write the entire
buffer back. buffer back.
* Binary attributes, which are somewhat similar to sysfs binary attributes, * Binary attributes, which are somewhat similar to sysfs binary attributes,
but with a few slight changes to semantics. The PAGE_SIZE limitation does not but with a few slight changes to semantics. The PAGE_SIZE limitation does not
apply, but the whole binary item must fit in a single kernel vmalloc'ed buffer. apply, but the whole binary item must fit in a single kernel vmalloc'ed buffer.
The write(2) calls from user space are buffered, and the attributes' The write(2) calls from user space are buffered, and the attributes'
write_bin_attribute method will be invoked on the final close, therefore it is write_bin_attribute method will be invoked on the final close, therefore it is
imperative for user-space to check the return code of close(2) in order to imperative for user-space to check the return code of close(2) in order to
verify that the operation finished successfully. verify that the operation finished successfully.
To avoid a malicious user OOMing the kernel, there's a per-binary attribute To avoid a malicious user OOMing the kernel, there's a per-binary attribute
maximum buffer value. maximum buffer value.
When an item needs to be destroyed, remove it with rmdir(2). An When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2). symlink(2)). Links can be removed via unlink(2).
[Configuring FakeNBD: an Example] Configuring FakeNBD: an Example
===============================
Imagine there's a Network Block Device (NBD) driver that allows you to Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs access remote block devices. Call it FakeNBD. FakeNBD uses configfs
@ -86,14 +90,14 @@ sysadmins use to configure FakeNBD, but somehow that program has to tell
the driver about it. Here's where configfs comes in. the driver about it. Here's where configfs comes in.
When the FakeNBD driver is loaded, it registers itself with configfs. When the FakeNBD driver is loaded, it registers itself with configfs.
readdir(3) sees this just fine: readdir(3) sees this just fine::
# ls /config # ls /config
fakenbd fakenbd
A fakenbd connection can be created with mkdir(2). The name is A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps arbitrary, but likely the tool will make some use of the name. Perhaps
it is a uuid or a disk name: it is a uuid or a disk name::
# mkdir /config/fakenbd/disk1 # mkdir /config/fakenbd/disk1
# ls /config/fakenbd/disk1 # ls /config/fakenbd/disk1
@ -102,7 +106,7 @@ it is a uuid or a disk name:
The target attribute contains the IP address of the server FakeNBD will The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server. connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is Predictably, the rw attribute determines whether the connection is
read-only or read-write. read-only or read-write::
# echo 10.0.0.1 > /config/fakenbd/disk1/target # echo 10.0.0.1 > /config/fakenbd/disk1/target
# echo /dev/sda1 > /config/fakenbd/disk1/device # echo /dev/sda1 > /config/fakenbd/disk1/device
@ -111,7 +115,8 @@ read-only or read-write.
That's it. That's all there is. Now the device is configured, via the That's it. That's all there is. Now the device is configured, via the
shell no less. shell no less.
[Coding With configfs] Coding With configfs
====================
Every object in configfs is a config_item. A config_item reflects an Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that object in the subsystem. It has attributes that match values on that
@ -130,7 +135,10 @@ appears as a directory at the top of the configfs filesystem. A
subsystem is also a config_group, and can do everything a config_group subsystem is also a config_group, and can do everything a config_group
can. can.
[struct config_item] struct config_item
==================
::
struct config_item { struct config_item {
char *ci_name; char *ci_name;
@ -168,7 +176,10 @@ By itself, a config_item cannot do much more than appear in configfs.
Usually a subsystem wants the item to display and/or store attributes, Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type. among other things. For that, it needs a type.
[struct config_item_type] struct config_item_type
=======================
::
struct configfs_item_operations { struct configfs_item_operations {
void (*release)(struct config_item *); void (*release)(struct config_item *);
@ -192,7 +203,10 @@ allocated dynamically will need to provide the ct_item_ops->release()
method. This method is called when the config_item's reference count method. This method is called when the config_item's reference count
reaches zero. reaches zero.
[struct configfs_attribute] struct configfs_attribute
=========================
::
struct configfs_attribute { struct configfs_attribute {
char *ca_name; char *ca_name;
@ -214,7 +228,10 @@ be called whenever userspace asks for a read(2) on the attribute. If an
attribute is writable and provides a ->store method, that method will be attribute is writable and provides a ->store method, that method will be
called whenever userspace asks for a write(2) on the attribute. called whenever userspace asks for a write(2) on the attribute.
[struct configfs_bin_attribute] struct configfs_bin_attribute
=============================
::
struct configfs_bin_attribute { struct configfs_bin_attribute {
struct configfs_attribute cb_attr; struct configfs_attribute cb_attr;
@ -240,11 +257,12 @@ will happen for write(2). The reads/writes are buffered so only a
single read/write will occur; the attribute need not concern itself single read/write will occur; the attribute need not concern itself
with it. with it.
[struct config_group] struct config_group
===================
A config_item cannot live in a vacuum. The only way one can be created A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a is via mkdir(2) on a config_group. This will trigger creation of a
child item. child item::
struct config_group { struct config_group {
struct config_item cg_item; struct config_item cg_item;
@ -264,7 +282,7 @@ The config_group structure contains a config_item. Properly configuring
that item means that a group can behave as an item in its own right. that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's accomplished via the group operations specified on the group's
config_item_type. config_item_type::
struct configfs_group_operations { struct configfs_group_operations {
struct config_item *(*make_item)(struct config_group *group, struct config_item *(*make_item)(struct config_group *group,
@ -279,7 +297,8 @@ config_item_type.
}; };
A group creates child items by providing the A group creates child items by providing the
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new ct_group_ops->make_item() method. If provided, this method is called from
mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it, config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item. tree to reflect the new item.
@ -296,13 +315,14 @@ upon item allocation. If a subsystem has no work to do, it may omit
the ct_group_ops->drop_item() method, and configfs will call the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem. config_item_put() on the item on behalf of the subsystem.
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) Important:
is called, configfs WILL remove the item from the filesystem tree drop_item() is void, and as such cannot fail. When rmdir(2)
(assuming that it has no children to keep it busy). The subsystem is is called, configfs WILL remove the item from the filesystem tree
responsible for responding to this. If the subsystem has references to (assuming that it has no children to keep it busy). The subsystem is
the item in other threads, the memory is safe. It may take some time responsible for responding to this. If the subsystem has references to
for the item to actually disappear from the subsystem's usage. But it the item in other threads, the memory is safe. It may take some time
is gone from configfs. for the item to actually disappear from the subsystem's usage. But it
is gone from configfs.
When drop_item() is called, the item's linkage has already been torn When drop_item() is called, the item's linkage has already been torn
down. It no longer has a reference on its parent and has no place in down. It no longer has a reference on its parent and has no place in
@ -319,10 +339,11 @@ is implemented in the configfs rmdir(2) code. ->drop_item() will not be
called, as the item has not been dropped. rmdir(2) will fail, as the called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty. directory is not empty.
[struct configfs_subsystem] struct configfs_subsystem
=========================
A subsystem must register itself, usually at module_init time. This A subsystem must register itself, usually at module_init time. This
tells configfs to make the subsystem appear in the file tree. tells configfs to make the subsystem appear in the file tree::
struct configfs_subsystem { struct configfs_subsystem {
struct config_group su_group; struct config_group su_group;
@ -332,17 +353,19 @@ tells configfs to make the subsystem appear in the file tree.
int configfs_register_subsystem(struct configfs_subsystem *subsys); int configfs_register_subsystem(struct configfs_subsystem *subsys);
void configfs_unregister_subsystem(struct configfs_subsystem *subsys); void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
A subsystem consists of a toplevel config_group and a mutex. A subsystem consists of a toplevel config_group and a mutex.
The group is where child config_items are created. For a subsystem, The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have group via the usual group _init() functions, and it must also have
initialized the mutex. initialized the mutex.
When the register call returns, the subsystem is live, and it
When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it. the subsystem must be ready for it.
[An Example] An Example
==========
The best example of these basic concepts is the simple_children The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in subsystem/group and the simple_child item in
@ -350,7 +373,8 @@ samples/configfs/configfs_sample.c. It shows a trivial object displaying
and storing an attribute, and a simple group creating and destroying and storing an attribute, and a simple group creating and destroying
these children. these children.
[Hierarchy Navigation and the Subsystem Mutex] Hierarchy Navigation and the Subsystem Mutex
============================================
There is an extra bonus that configfs provides. The config_groups and There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they config_items are arranged in a hierarchy due to the fact that they
@ -375,7 +399,8 @@ be in its parent's cg_children list for the same duration. This allows
a subsystem to trust ci_parent and cg_children while they hold the a subsystem to trust ci_parent and cg_children while they hold the
mutex. mutex.
[Item Aggregation Via symlink(2)] Item Aggregation Via symlink(2)
===============================
configfs provides a simple group via the group->item parent/child configfs provides a simple group via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation relationship. Often, however, a larger environment requires aggregation
@ -403,7 +428,8 @@ A config_item cannot be removed while it links to any other item, nor
can it be removed while an item links to it. Dangling symlinks are not can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs. allowed in configfs.
[Automatically Created Subgroups] Automatically Created Subgroups
===============================
A new config_group may want to have two types of child config_items. A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much While this could be codified by magic names in ->make_item(), it is much
@ -433,7 +459,8 @@ As a consequence of this, default groups cannot be removed directly via
rmdir(2). They also are not considered when rmdir(2) on the parent rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children. group is checking for children.
[Dependent Subsystems] Dependent Subsystems
====================
Sometimes other drivers depend on particular configfs items. For Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that example, ocfs2 mounts depend on a heartbeat region item. If that
@ -460,9 +487,11 @@ succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error. pass up an error.
[Committable Items] Committable Items
=================
NOTE: Committable items are currently unimplemented. Note:
Committable items are currently unimplemented.
Some config_items cannot have a valid initial state. That is, no Some config_items cannot have a valid initial state. That is, no
default values can be specified for the item's attributes such that the default values can be specified for the item's attributes such that the
@ -504,5 +533,3 @@ As rmdir(2) does not work in the "live" directory, an item must be
shut down, or "uncommitted". Again, this is done via rename(2), this shut down, or "uncommitted". Again, this is done via rename(2), this
time from the "live" directory back to the "pending" one. The subsystem time from the "live" directory back to the "pending" one. The subsystem
is notified by the ct_group_ops->uncommit_object() method. is notified by the ct_group_ops->uncommit_object() method.
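
The userspace rules for normal attributes described earlier in this file (read the whole attribute, modify it, store the entire buffer with a single write(2), and check the result of close(2)) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `read_attribute`, `write_attribute`, and `update_attribute` are hypothetical helper names, and the 4096-byte limit assumes the i386 page size quoted in the text.

```python
import os

PAGE_SIZE = 4096  # assumption: the i386 page size quoted in the text


def read_attribute(path):
    """Read a normal attribute in one read(2); it fits in a single page."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, PAGE_SIZE)
    finally:
        os.close(fd)


def write_attribute(path, data):
    """Store the entire buffer with a single write(2)."""
    if len(data) > PAGE_SIZE:
        raise ValueError("normal attributes are limited to one page")
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        written = os.write(fd, data)
    finally:
        # os.close() raises OSError when close(2) fails; for binary
        # attributes that is where a failed store is reported.
        os.close(fd)
    if written != len(data):
        raise OSError("short write: attribute was not stored in one call")


def update_attribute(path, transform):
    """Read-modify-write: transform the whole buffer, then write it back."""
    new = transform(read_attribute(path))
    write_attribute(path, new)
    return new
```

With the imaginary FakeNBD driver from the example, a call such as `update_attribute("/config/fakenbd/disk1/target", lambda old: b"10.0.0.2\n")` would replace the target address in one write.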


@ -74,7 +74,7 @@ are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap. exposure of uninitialized data through mmap.
These filesystems may be used for inspiration: These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.txt - ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/ - ext4: see Documentation/filesystems/ext4/
- xfs: see Documentation/admin-guide/xfs.rst - xfs: see Documentation/admin-guide/xfs.rst


@ -166,16 +166,17 @@ file::
}; };
struct debugfs_regset32 { struct debugfs_regset32 {
struct debugfs_reg32 *regs; const struct debugfs_reg32 *regs;
int nregs; int nregs;
void __iomem *base; void __iomem *base;
struct device *dev; /* Optional device for Runtime PM */
}; };
debugfs_create_regset32(const char *name, umode_t mode, debugfs_create_regset32(const char *name, umode_t mode,
struct dentry *parent, struct dentry *parent,
struct debugfs_regset32 *regset); struct debugfs_regset32 *regset);
void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs,
int nregs, void __iomem *base, char *prefix); int nregs, void __iomem *base, char *prefix);
The "base" argument may be 0, but you may want to build the reg32 array The "base" argument may be 0, but you may want to build the reg32 array


@ -0,0 +1,36 @@
.. SPDX-License-Identifier: GPL-2.0
=====================
The Devpts Filesystem
=====================
Each mount of the devpts filesystem is now distinct such that ptys
and their indices allocated in one mount are independent from ptys
and their indices in all other mounts.
All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
with permissions ``0000``.
To retain backwards compatibility a ptmx device node (aka any node
created with ``mknod name c 5 2``) when opened will look for an instance
of devpts under the name ``pts`` in the same directory as the ptmx device
node.
As an option instead of placing a ``/dev/ptmx`` device node at ``/dev/ptmx``
it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or
to bind mount ``/dev/pts/ptmx`` to ``/dev/ptmx``. If you opt for using
the devpts filesystem in this manner devpts should be mounted with
the ``ptmxmode=0666`` option, or ``chmod 0666 /dev/pts/ptmx`` should be called.
Total count of pty pairs in all instances is limited by sysctls::
kernel.pty.max = 4096 - global limit
kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
kernel.pty.nr - current count of ptys
Per-instance limit could be set by adding mount option ``max=<count>``.
This feature was added in kernel 3.4 together with
``sysctl kernel.pty.reserve``.
In kernels older than 3.4 sysctl ``kernel.pty.max`` works as per-instance limit.
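
The three sysctls listed above are exposed as files under `/proc/sys/kernel/pty/`, following the usual sysctl-to-procfs mapping. A minimal Python sketch for reading them is shown below. The helper name `read_pty_sysctls` is made up for this example, and the `base` parameter exists only so the function can be pointed at another directory.

```python
import os


def read_pty_sysctls(base="/proc/sys/kernel/pty"):
    """Return kernel.pty.max, kernel.pty.reserve and kernel.pty.nr as ints."""
    values = {}
    for name in ("max", "reserve", "nr"):
        with open(os.path.join(base, name)) as f:
            values[name] = int(f.read().strip())
    return values
```

Comparing `nr` against `max` gives a rough idea of how close the system is to the global limit.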


@ -1,26 +0,0 @@
Each mount of the devpts filesystem is now distinct such that ptys
and their indicies allocated in one mount are independent from ptys
and their indicies in all other mounts.
All mounts of the devpts filesystem now create a /dev/pts/ptmx node
with permissions 0000.
To retain backwards compatibility the a ptmx device node (aka any node
created with "mknod name c 5 2") when opened will look for an instance
of devpts under the name "pts" in the same directory as the ptmx device
node.
As an option instead of placing a /dev/ptmx device node at /dev/ptmx
it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or
to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using
the devpts filesystem in this manner devpts should be mounted with
the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.
Total count of pty pairs in all instances is limited by sysctls:
kernel.pty.max = 4096 - global limit
kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
kernel.pty.nr - current count of ptys
Per-instance limit could be set by adding mount option "max=<count>".
This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve.
In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.


@ -1,5 +1,8 @@
Linux Directory Notification .. SPDX-License-Identifier: GPL-2.0
============================
============================
Linux Directory Notification
============================
Stephen Rothwell <sfr@canb.auug.org.au> Stephen Rothwell <sfr@canb.auug.org.au>
@ -12,6 +15,7 @@ being delivered using signals.
The application decides which "events" it wants to be notified about. The application decides which "events" it wants to be notified about.
The currently defined events are: The currently defined events are:
========= =====================================================
DN_ACCESS A file in the directory was accessed (read) DN_ACCESS A file in the directory was accessed (read)
DN_MODIFY A file in the directory was modified (write,truncate) DN_MODIFY A file in the directory was modified (write,truncate)
DN_CREATE A file was created in the directory DN_CREATE A file was created in the directory
@ -19,6 +23,7 @@ The currently defined events are:
DN_RENAME A file in the directory was renamed DN_RENAME A file in the directory was renamed
DN_ATTRIB A file in the directory had its attributes DN_ATTRIB A file in the directory had its attributes
changed (chmod,chown) changed (chmod,chown)
========= =====================================================
Usually, the application must reregister after each notification, but Usually, the application must reregister after each notification, but
if DN_MULTISHOT is or'ed with the event mask, then the registration will if DN_MULTISHOT is or'ed with the event mask, then the registration will
@ -36,7 +41,7 @@ especially important if DN_MULTISHOT is specified. Note that SIGRTMIN
is often blocked, so it is better to use (at least) SIGRTMIN + 1. is often blocked, so it is better to use (at least) SIGRTMIN + 1.
Implementation expectations (features and bugs :-)) Implementation expectations (features and bugs :-))
--------------------------- ---------------------------------------------------
The notification should work for any local access to files even if the The notification should work for any local access to files even if the
actual file system is on a remote server. This implies that remote actual file system is on a remote server. This implies that remote
@ -67,4 +72,4 @@ See tools/testing/selftests/filesystems/dnotify_test.c for an example.
NOTE NOTE
---- ----
Beginning with Linux 2.6.13, dnotify has been replaced by inotify. Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
See Documentation/filesystems/inotify.txt for more information on it. See Documentation/filesystems/inotify.rst for more information on it.
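
As a rough userspace sketch of the registration described in this file, the following Python code builds an event mask from the DN_* flags and registers a directory with `F_NOTIFY`. It is an illustration rather than the reference example (that remains the selftest mentioned above). `build_mask` and `watch_directory` are hypothetical names, and the default signal follows the text's advice to use SIGRTMIN + 1.

```python
import fcntl
import os
import signal


def build_mask(events, multishot=False):
    """OR together DN_* event flags, optionally adding DN_MULTISHOT."""
    mask = 0
    for event in events:
        mask |= event
    if multishot:
        # Without DN_MULTISHOT the application must reregister after
        # each notification.
        mask |= fcntl.DN_MULTISHOT
    return mask


def watch_directory(path, mask, handler, signum=None):
    """Register `path` for dnotify events and return the directory fd.

    The caller must keep the returned descriptor open for as long as
    notifications are wanted.
    """
    if signum is None:
        # SIGRTMIN is often blocked, so use SIGRTMIN + 1.
        signum = signal.SIGRTMIN + 1
    signal.signal(signum, handler)
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.fcntl(fd, fcntl.F_SETSIG, signum)
        fcntl.fcntl(fd, fcntl.F_NOTIFY, mask)
    except OSError:
        os.close(fd)
        raise
    return fd
```

A caller would pass something like `build_mask([fcntl.DN_MODIFY, fcntl.DN_CREATE], multishot=True)` as the mask, together with a handler taking the usual `(signum, frame)` arguments.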


@ -24,3 +24,20 @@ files that are not well-known standardized variables are created
as immutable files. This doesn't prevent removal - "chattr -i" will work - as immutable files. This doesn't prevent removal - "chattr -i" will work -
but it does prevent this kind of failure from being accomplished but it does prevent this kind of failure from being accomplished
accidentally. accidentally.
.. warning ::
When the content of a UEFI variable in /sys/firmware/efi/efivars is
displayed, for example using "hexdump", note that the first
4 bytes of the output represent the UEFI variable attributes,
in little-endian format.
Practically the output of each efivar is composed of:
+-----------------------------------+
|4_bytes_of_attributes + efivar_data|
+-----------------------------------+
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
- Documentation/ABI/stable/sysfs-firmware-efi-vars


@ -1,3 +1,5 @@
.. SPDX-License-Identifier: GPL-2.0
============ ============
Fiemap Ioctl Fiemap Ioctl
============ ============
@ -10,9 +12,9 @@ returns a list of extents.
Request Basics Request Basics
-------------- --------------
A fiemap request is encoded within struct fiemap: A fiemap request is encoded within struct fiemap::
struct fiemap { struct fiemap {
__u64 fm_start; /* logical offset (inclusive) at __u64 fm_start; /* logical offset (inclusive) at
* which to start mapping (in) */ * which to start mapping (in) */
__u64 fm_length; /* logical length of mapping which __u64 fm_length; /* logical length of mapping which
@ -23,7 +25,7 @@ struct fiemap {
__u32 fm_extent_count; /* size of fm_extents array (in) */ __u32 fm_extent_count; /* size of fm_extents array (in) */
__u32 fm_reserved; __u32 fm_reserved;
struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
}; };
fm_start, and fm_length specify the logical range within the file fm_start, and fm_length specify the logical range within the file
@ -51,12 +53,12 @@ nothing to prevent the file from changing between calls to FIEMAP.
The following flags can be set in fm_flags: The following flags can be set in fm_flags:
* FIEMAP_FLAG_SYNC FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping extents. If this flag is set, the kernel will sync the file before mapping extents.
* FIEMAP_FLAG_XATTR FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inodes If this flag is set, the extents returned will describe the inodes
extended attribute lookup tree, instead of its data tree. extended attribute lookup tree, instead of its data tree.
Extent Mapping Extent Mapping
@ -75,18 +77,18 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags). flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as Each extent is described by a single fiemap_extent structure as
returned in fm_extents. returned in fm_extents::
struct fiemap_extent { struct fiemap_extent {
__u64 fe_logical; /* logical offset in bytes for the start of __u64 fe_logical; /* logical offset in bytes for the start of
* the extent */ * the extent */
__u64 fe_physical; /* physical offset in bytes for the start __u64 fe_physical; /* physical offset in bytes for the start
* of the extent */ * of the extent */
__u64 fe_length; /* length in bytes for the extent */ __u64 fe_length; /* length in bytes for the extent */
__u64 fe_reserved64[2]; __u64 fe_reserved64[2];
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_reserved[3]; __u32 fe_reserved[3];
}; };
All offsets and lengths are in bytes and mirror those on disk. It is valid All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical for an extents logical offset to start before the request or its logical
@ -114,26 +116,27 @@ worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone. FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
* FIEMAP_EXTENT_LAST FIEMAP_EXTENT_LAST
This is generally the last extent in the file. A mapping attempt past This is generally the last extent in the file. A mapping attempt past
this extent may return nothing. Some implementations set this flag to this extent may return nothing. Some implementations set this flag to
indicate this extent is the last one in the range queried by the user indicate this extent is the last one in the range queried by the user
(via fiemap->fm_length). (via fiemap->fm_length).
* FIEMAP_EXTENT_UNKNOWN FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet. been allocated for the file yet.
* FIEMAP_EXTENT_DELALLOC FIEMAP_EXTENT_DELALLOC
- This will also set FIEMAP_EXTENT_UNKNOWN. This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.
* FIEMAP_EXTENT_ENCODED Delayed allocation - while there is data for this extent, its
This extent does not consist of plain filesystem blocks but is physical location has not been allocated yet.
encoded (e.g. encrypted or compressed). Reading the data in this
extent via I/O to the block device will have undefined results. FIEMAP_EXTENT_ENCODED
This extent does not consist of plain filesystem blocks but is
encoded (e.g. encrypted or compressed). Reading the data in this
extent via I/O to the block device will have undefined results.
Note that it is *always* undefined to try to update the data Note that it is *always* undefined to try to update the data
in-place by writing to the indicated location without the in-place by writing to the indicated location without the
@ -145,32 +148,32 @@ unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is
clear; user applications must not try reading or writing to the clear; user applications must not try reading or writing to the
filesystem via the block device under any other circumstances. filesystem via the block device under any other circumstances.
* FIEMAP_EXTENT_DATA_ENCRYPTED FIEMAP_EXTENT_DATA_ENCRYPTED
- This will also set FIEMAP_EXTENT_ENCODED This will also set FIEMAP_EXTENT_ENCODED
The data in this extent has been encrypted by the file system. The data in this extent has been encrypted by the file system.
* FIEMAP_EXTENT_NOT_ALIGNED FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned. Extent offsets and length are not guaranteed to be block aligned.
* FIEMAP_EXTENT_DATA_INLINE FIEMAP_EXTENT_DATA_INLINE
This will also set FIEMAP_EXTENT_NOT_ALIGNED This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is located within a meta data block. Data is located within a meta data block.
* FIEMAP_EXTENT_DATA_TAIL FIEMAP_EXTENT_DATA_TAIL
This will also set FIEMAP_EXTENT_NOT_ALIGNED This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is packed into a block with data from other files. Data is packed into a block with data from other files.
* FIEMAP_EXTENT_UNWRITTEN FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been Unwritten extent - the extent is allocated but its data has not been
initialized. This indicates the extent's data will be all zero if read initialized. This indicates the extent's data will be all zero if read
through the filesystem but the contents are undefined if read directly from through the filesystem but the contents are undefined if read directly from
the device. the device.
* FIEMAP_EXTENT_MERGED FIEMAP_EXTENT_MERGED
This will be set when a file does not support extents, i.e., it uses a block This will be set when a file does not support extents, i.e., it uses a block
based addressing scheme. Since returning an extent for each block back to based addressing scheme. Since returning an extent for each block back to
userspace would be highly inefficient, the kernel will try to merge most userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'. adjacent blocks into 'extents'.
VFS -> File System Implementation VFS -> File System Implementation
@ -179,23 +182,23 @@ VFS -> File System Implementation
File systems wishing to support fiemap must implement a ->fiemap callback on File systems wishing to support fiemap must implement a ->fiemap callback on
their inode_operations structure. The fs ->fiemap call is responsible for their inode_operations structure. The fs ->fiemap call is responsible for
defining its set of supported fiemap flags, and calling a helper function on defining its set of supported fiemap flags, and calling a helper function on
each discovered extent: each discovered extent::
struct inode_operations { struct inode_operations {
... ...
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len); u64 len);
->fiemap is passed struct fiemap_extent_info which describes the ->fiemap is passed struct fiemap_extent_info which describes the
fiemap request: fiemap request::
struct fiemap_extent_info { struct fiemap_extent_info {
unsigned int fi_flags; /* Flags as passed from user */ unsigned int fi_flags; /* Flags as passed from user */
unsigned int fi_extents_mapped; /* Number of mapped extents */ unsigned int fi_extents_mapped; /* Number of mapped extents */
unsigned int fi_extents_max; /* Size of fiemap_extent array */ unsigned int fi_extents_max; /* Size of fiemap_extent array */
struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
}; };
It is intended that the file system should not need to access any of this It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return structure directly. Filesystem handlers should be tolerant to signals and return
@ -203,9 +206,9 @@ EINTR once fatal signal received.
Flag checking should be done at the beginning of the ->fiemap callback via the Flag checking should be done at the beginning of the ->fiemap callback via the
fiemap_check_flags() helper: fiemap_check_flags() helper::
int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
The struct fieinfo should be passed in as received from ioctl_fiemap(). The The struct fieinfo should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via fs_flags. If set of fiemap flags which the fs understands should be passed via fs_flags. If
@ -216,10 +219,10 @@ ioctl_fiemap().
For each extent in the request range, the file system should call For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent(): the helper function, fiemap_fill_next_extent()::
int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
u64 phys, u64 len, u32 flags, u32 dev); u64 phys, u64 len, u32 flags, u32 dev);
fiemap_fill_next_extent() will use the passed values to populate the fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will next free extent in the fm_extents array. 'General' extent flags will


@ -1,5 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
===================================
File management in the Linux kernel File management in the Linux kernel
----------------------------------- ===================================
This document describes how locking for files (struct file) This document describes how locking for files (struct file)
and file descriptor table (struct files) works. and file descriptor table (struct files) works.
@ -34,7 +37,7 @@ appear atomic. Here are the locking rules for
the fdtable structure - the fdtable structure -
1. All references to the fdtable must be done through 1. All references to the fdtable must be done through
the files_fdtable() macro : the files_fdtable() macro::
struct fdtable *fdt; struct fdtable *fdt;
@ -61,7 +64,8 @@ the fdtable structure -
4. To look up the file structure given an fd, a reader 4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup. take care of barrier requirements due to lock-free lookup.
An example :
An example::
struct file *file; struct file *file;
@ -77,7 +81,7 @@ the fdtable structure -
of the fd (fget()/fget_light()) are lock-free, it is possible of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the that look-up may race with the last put() operation on the
file structure. This is avoided using atomic_long_inc_not_zero() file structure. This is avoided using atomic_long_inc_not_zero()
on ->f_count : on ->f_count::
rcu_read_lock(); rcu_read_lock();
file = fcheck_files(files, fd); file = fcheck_files(files, fd);
@ -106,7 +110,8 @@ the fdtable structure -
holding files->file_lock. If ->file_lock is dropped, then holding files->file_lock. If ->file_lock is dropped, then
another thread may expand the files, thereby creating a new another thread may expand the files, thereby creating a new
fdtable and making the earlier fdtable pointer stale. fdtable and making the earlier fdtable pointer stale.
For example :
For example::
spin_lock(&files->file_lock); spin_lock(&files->file_lock);
fd = locate_fd(files, file, start); fd = locate_fd(files, file, start);


@ -1,3 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
==============
Fuse I/O Modes
==============
Fuse supports the following I/O modes: Fuse supports the following I/O modes:
- direct-io - direct-io


@ -24,6 +24,22 @@ algorithms work.
splice splice
locking locking
directory-locking directory-locking
devpts
dnotify
fiemap
files
locks
mandatory-locking
mount_api
quota
seq_file
sharedsubtree
sysfs-pci
sysfs-tagging
automount-support
caching/index
porting porting
@ -57,7 +73,10 @@ Documentation for filesystem implementations.
befs befs
bfs bfs
btrfs btrfs
cifs/cifsroot
ceph ceph
coda
configfs
cramfs cramfs
debugfs debugfs
dlmfs dlmfs
@ -73,6 +92,7 @@ Documentation for filesystem implementations.
hfsplus hfsplus
hpfs hpfs
fuse fuse
fuse-io
inotify inotify
isofs isofs
nilfs2 nilfs2
@ -88,6 +108,7 @@ Documentation for filesystem implementations.
ramfs-rootfs-initramfs ramfs-rootfs-initramfs
relay relay
romfs romfs
spufs/index
squashfs squashfs
sysfs sysfs
sysv-fs sysv-fs
@ -97,4 +118,6 @@ Documentation for filesystem implementations.
udf udf
virtiofs virtiofs
vfat vfat
xfs-delayed-logging-design
xfs-self-describing-metadata
zonefs zonefs


@ -1,4 +1,8 @@
File Locking Release Notes .. SPDX-License-Identifier: GPL-2.0
==========================
File Locking Release Notes
==========================
Andy Walker <andy@lysaker.kvaerner.no> Andy Walker <andy@lysaker.kvaerner.no>
@ -6,7 +10,7 @@
1. What's New? 1. What's New?
-------------- ==============
1.1 Broken Flock Emulation 1.1 Broken Flock Emulation
-------------------------- --------------------------
@ -25,7 +29,7 @@ anyway (see the file "Documentation/process/changes.rst".)
--------------------------- ---------------------------
1.2.1 Typical Problems - Sendmail 1.2.1 Typical Problems - Sendmail
--------------------------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because sendmail was unable to use the old flock() emulation, many sendmail Because sendmail was unable to use the old flock() emulation, many sendmail
installations use fcntl() instead of flock(). This is true of Slackware 3.0 installations use fcntl() instead of flock(). This is true of Slackware 3.0
for example. This gave rise to some other subtle problems if sendmail was for example. This gave rise to some other subtle problems if sendmail was
@ -37,7 +41,7 @@ to lock solid with deadlocked processes.
1.2.2 The Solution 1.2.2 The Solution
------------------ ^^^^^^^^^^^^^^^^^^
The solution I have chosen, after much experimentation and discussion, The solution I have chosen, after much experimentation and discussion,
is to make flock() and fcntl() locks oblivious to each other. Both can is to make flock() and fcntl() locks oblivious to each other. Both can
exist, and neither will have any effect on the other. exist, and neither will have any effect on the other.
@ -54,7 +58,7 @@ fcntl(), with all the problems that implies.
--------------------------------------- ---------------------------------------
Mandatory locking, as described in Mandatory locking, as described in
'Documentation/filesystems/mandatory-locking.txt' was prior to this release a 'Documentation/filesystems/mandatory-locking.rst' was prior to this release a
general configuration option that was valid for all mounted filesystems. This general configuration option that was valid for all mounted filesystems. This
had a number of inherent dangers, not the least of which was the ability to had a number of inherent dangers, not the least of which was the ability to
freeze an NFS server by asking it to read a file for which a mandatory lock freeze an NFS server by asking it to read a file for which a mandatory lock


@ -1,8 +1,13 @@
Mandatory File Locking For The Linux Operating System .. SPDX-License-Identifier: GPL-2.0
=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================
Andy Walker <andy@lysaker.kvaerner.no> Andy Walker <andy@lysaker.kvaerner.no>
15 April 1996 15 April 1996
(Updated September 2007) (Updated September 2007)
0. Why you should avoid mandatory locking 0. Why you should avoid mandatory locking
@ -53,15 +58,17 @@ possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf() as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks. interface for applying locks just as if they were normal, advisory locks.
Note 1: In saying "file" in the paragraphs above I am actually not telling .. Note::
the whole truth. System V locking is based on fcntl(). The granularity of
fcntl() is such that it allows the locking of byte ranges in files, in addition
to entire files, so the mandatory locking rules also have byte level
granularity.
Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite 1. In saying "file" in the paragraphs above I am actually not telling
borrowing the fcntl() locking scheme from System V. The mandatory locking the whole truth. System V locking is based on fcntl(). The granularity of
scheme is defined by the System V Interface Definition (SVID) Version 3. fcntl() is such that it allows the locking of byte ranges in files, in
addition to entire files, so the mandatory locking rules also have byte
level granularity.
2. POSIX.1 does not specify any scheme for mandatory locking, despite
borrowing the fcntl() locking scheme from System V. The mandatory locking
scheme is defined by the System V Interface Definition (SVID) Version 3.
2. Marking a file for mandatory locking 2. Marking a file for mandatory locking
--------------------------------------- ---------------------------------------

View File

@ -1,8 +1,10 @@
==================== .. SPDX-License-Identifier: GPL-2.0
FILESYSTEM MOUNT API
====================
CONTENTS ====================
Filesystem Mount API
====================
.. CONTENTS
(1) Overview. (1) Overview.
@ -21,8 +23,7 @@ CONTENTS
(8) Parameter helper functions. (8) Parameter helper functions.
======== Overview
OVERVIEW
======== ========
The creation of new mounts is now to be done in a multistep process: The creation of new mounts is now to be done in a multistep process:
@ -43,7 +44,7 @@ The creation of new mounts is now to be done in a multistep process:
(7) Destroy the context. (7) Destroy the context.
To support this, the file_system_type struct gains two new fields: To support this, the file_system_type struct gains two new fields::
int (*init_fs_context)(struct fs_context *fc); int (*init_fs_context)(struct fs_context *fc);
const struct fs_parameter_description *parameters; const struct fs_parameter_description *parameters;
@ -57,12 +58,11 @@ Note that security initialisation is done *after* the filesystem is called so
that the namespaces may be adjusted first. that the namespaces may be adjusted first.
====================== The Filesystem context
THE FILESYSTEM CONTEXT
====================== ======================
The creation and reconfiguration of a superblock is governed by a filesystem The creation and reconfiguration of a superblock is governed by a filesystem
context. This is represented by the fs_context structure: context. This is represented by the fs_context structure::
struct fs_context { struct fs_context {
const struct fs_context_operations *ops; const struct fs_context_operations *ops;
@ -86,78 +86,106 @@ context. This is represented by the fs_context structure:
The fs_context fields are as follows: The fs_context fields are as follows:
(*) const struct fs_context_operations *ops * ::
const struct fs_context_operations *ops
These are operations that can be done on a filesystem context (see These are operations that can be done on a filesystem context (see
below). This must be set by the ->init_fs_context() file_system_type below). This must be set by the ->init_fs_context() file_system_type
operation. operation.
(*) struct file_system_type *fs_type * ::
struct file_system_type *fs_type
A pointer to the file_system_type of the filesystem that is being A pointer to the file_system_type of the filesystem that is being
constructed or reconfigured. This retains a reference on the type owner. constructed or reconfigured. This retains a reference on the type owner.
(*) void *fs_private * ::
void *fs_private
A pointer to the file system's private data. This is where the filesystem A pointer to the file system's private data. This is where the filesystem
will need to store any options it parses. will need to store any options it parses.
(*) struct dentry *root * ::
struct dentry *root
A pointer to the root of the mountable tree (and indirectly, the A pointer to the root of the mountable tree (and indirectly, the
superblock thereof). This is filled in by the ->get_tree() op. If this superblock thereof). This is filled in by the ->get_tree() op. If this
is set, an active reference on root->d_sb must also be held. is set, an active reference on root->d_sb must also be held.
(*) struct user_namespace *user_ns * ::
(*) struct net *net_ns
struct user_namespace *user_ns
struct net *net_ns
These are a subset of the namespaces in use by the invoking process. They These are a subset of the namespaces in use by the invoking process. They
retain references on each namespace. The subscribed namespaces may be retain references on each namespace. The subscribed namespaces may be
replaced by the filesystem to reflect other sources, such as the parent replaced by the filesystem to reflect other sources, such as the parent
mount superblock on an automount. mount superblock on an automount.
(*) const struct cred *cred * ::
const struct cred *cred
The mounter's credentials. This retains a reference on the credentials. The mounter's credentials. This retains a reference on the credentials.
(*) char *source * ::
char *source
This specifies the source. It may be a block device (e.g. /dev/sda1) or This specifies the source. It may be a block device (e.g. /dev/sda1) or
something more exotic, such as the "host:/path" that NFS desires. something more exotic, such as the "host:/path" that NFS desires.
(*) char *subtype * ::
char *subtype
This is a string to be added to the type displayed in /proc/mounts to This is a string to be added to the type displayed in /proc/mounts to
qualify it (used by FUSE). This is available for the filesystem to set if qualify it (used by FUSE). This is available for the filesystem to set if
desired. desired.
(*) void *security * ::
void *security
A place for the LSMs to hang their security data for the superblock. The A place for the LSMs to hang their security data for the superblock. The
relevant security operations are described below. relevant security operations are described below.
(*) void *s_fs_info * ::
void *s_fs_info
The proposed s_fs_info for a new superblock, set in the superblock by The proposed s_fs_info for a new superblock, set in the superblock by
sget_fc(). This can be used to distinguish superblocks. sget_fc(). This can be used to distinguish superblocks.
(*) unsigned int sb_flags * ::
(*) unsigned int sb_flags_mask
unsigned int sb_flags
unsigned int sb_flags_mask
Which bits SB_* flags are to be set/cleared in super_block::s_flags. Which bits SB_* flags are to be set/cleared in super_block::s_flags.
(*) unsigned int s_iflags * ::
unsigned int s_iflags
These will be bitwise-OR'd with s->s_iflags when a superblock is created. These will be bitwise-OR'd with s->s_iflags when a superblock is created.
(*) enum fs_context_purpose * ::
enum fs_context_purpose
This indicates the purpose for which the context is intended. The This indicates the purpose for which the context is intended. The
available values are: available values are:
FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount ========================== ======================================
FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount FS_CONTEXT_FOR_MOUNT, New superblock for explicit mount
FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount FS_CONTEXT_FOR_SUBMOUNT New automatic submount of extant mount
FS_CONTEXT_FOR_RECONFIGURE Change an existing mount
========================== ======================================
The mount context is created by calling vfs_new_fs_context() or The mount context is created by calling vfs_new_fs_context() or
vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
@ -176,11 +204,10 @@ mount context. For instance, NFS might pin the appropriate protocol version
module. module.
================================= The Filesystem Context Operations
THE FILESYSTEM CONTEXT OPERATIONS
================================= =================================
The filesystem context points to a table of operations: The filesystem context points to a table of operations::
struct fs_context_operations { struct fs_context_operations {
void (*free)(struct fs_context *fc); void (*free)(struct fs_context *fc);
@ -195,24 +222,32 @@ The filesystem context points to a table of operations:
These operations are invoked by the various stages of the mount procedure to These operations are invoked by the various stages of the mount procedure to
manage the filesystem context. They are as follows: manage the filesystem context. They are as follows:
(*) void (*free)(struct fs_context *fc); * ::
void (*free)(struct fs_context *fc);
Called to clean up the filesystem-specific part of the filesystem context Called to clean up the filesystem-specific part of the filesystem context
when the context is destroyed. It should be aware that parts of the when the context is destroyed. It should be aware that parts of the
context may have been removed and NULL'd out by ->get_tree(). context may have been removed and NULL'd out by ->get_tree().
(*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc); * ::
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
Called when a filesystem context has been duplicated to duplicate the Called when a filesystem context has been duplicated to duplicate the
filesystem-private data. An error may be returned to indicate failure to filesystem-private data. An error may be returned to indicate failure to
do this. do this.
[!] Note that even if this fails, put_fs_context() will be called .. Warning::
Note that even if this fails, put_fs_context() will be called
immediately thereafter, so ->dup() *must* make the immediately thereafter, so ->dup() *must* make the
filesystem-private data safe for ->free(). filesystem-private data safe for ->free().
(*) int (*parse_param)(struct fs_context *fc, * ::
struct fs_parameter *param);
int (*parse_param)(struct fs_context *fc,
struct fs_parameter *param);
Called when a parameter is being added to the filesystem context. param Called when a parameter is being added to the filesystem context. param
points to the key name and maybe a value object. VFS-specific options points to the key name and maybe a value object. VFS-specific options
@ -224,7 +259,9 @@ manage the filesystem context. They are as follows:
If successful, 0 should be returned or a negative error code otherwise. If successful, 0 should be returned or a negative error code otherwise.
(*) int (*parse_monolithic)(struct fs_context *fc, void *data); * ::
int (*parse_monolithic)(struct fs_context *fc, void *data);
Called when the mount(2) system call is invoked to pass the entire data Called when the mount(2) system call is invoked to pass the entire data
page in one go. If this is expected to be just a list of "key[=val]" page in one go. If this is expected to be just a list of "key[=val]"
@ -236,7 +273,9 @@ manage the filesystem context. They are as follows:
finds it's the standard key-val list then it may pass it off to finds it's the standard key-val list then it may pass it off to
generic_parse_monolithic(). generic_parse_monolithic().
(*) int (*get_tree)(struct fs_context *fc); * ::
int (*get_tree)(struct fs_context *fc);
Called to get or create the mountable root and superblock, using the Called to get or create the mountable root and superblock, using the
information stored in the filesystem context (reconfiguration goes via a information stored in the filesystem context (reconfiguration goes via a
@ -249,7 +288,9 @@ manage the filesystem context. They are as follows:
The phase on a userspace-driven context will be set to only allow this to The phase on a userspace-driven context will be set to only allow this to
be called once on any particular context. be called once on any particular context.
(*) int (*reconfigure)(struct fs_context *fc); * ::
int (*reconfigure)(struct fs_context *fc);
Called to effect reconfiguration of a superblock using information stored Called to effect reconfiguration of a superblock using information stored
in the filesystem context. It may detach any resources it desires from in the filesystem context. It may detach any resources it desires from
@ -259,19 +300,20 @@ manage the filesystem context. They are as follows:
On success it should return 0. In the case of an error, it should return On success it should return 0. In the case of an error, it should return
a negative error code. a negative error code.
[NOTE] reconfigure is intended as a replacement for remount_fs. .. Note:: reconfigure is intended as a replacement for remount_fs.
=========================== Filesystem context Security
FILESYSTEM CONTEXT SECURITY
=========================== ===========================
The filesystem context contains a security pointer that the LSMs can use for The filesystem context contains a security pointer that the LSMs can use for
building up a security context for the superblock to be mounted. There are a building up a security context for the superblock to be mounted. There are a
number of operations used by the new mount code for this purpose: number of operations used by the new mount code for this purpose:
(*) int security_fs_context_alloc(struct fs_context *fc, * ::
struct dentry *reference);
int security_fs_context_alloc(struct fs_context *fc,
struct dentry *reference);
Called to initialise fc->security (which is preset to NULL) and allocate Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. It should return 0 on success or a negative error any resources needed. It should return 0 on success or a negative error
@ -283,22 +325,28 @@ number of operations used by the new mount code for this purpose:
non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
it indicates the automount point. it indicates the automount point.
(*) int security_fs_context_dup(struct fs_context *fc, * ::
struct fs_context *src_fc);
int security_fs_context_dup(struct fs_context *fc,
struct fs_context *src_fc);
Called to initialise fc->security (which is preset to NULL) and allocate Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. The original filesystem context is pointed to by any resources needed. The original filesystem context is pointed to by
src_fc and may be used for reference. It should return 0 on success or a src_fc and may be used for reference. It should return 0 on success or a
negative error code on failure. negative error code on failure.
(*) void security_fs_context_free(struct fs_context *fc); * ::
void security_fs_context_free(struct fs_context *fc);
Called to clean up anything attached to fc->security. Note that the Called to clean up anything attached to fc->security. Note that the
contents may have been transferred to a superblock and the pointer cleared contents may have been transferred to a superblock and the pointer cleared
during get_tree. during get_tree.
(*) int security_fs_context_parse_param(struct fs_context *fc, * ::
struct fs_parameter *param);
int security_fs_context_parse_param(struct fs_context *fc,
struct fs_parameter *param);
Called for each mount parameter, including the source. The arguments are Called for each mount parameter, including the source. The arguments are
as for the ->parse_param() method. It should return 0 to indicate that as for the ->parse_param() method. It should return 0 to indicate that
@ -310,7 +358,9 @@ number of operations used by the new mount code for this purpose:
(provided the value pointer is NULL'd out). If it is stolen, 1 must be (provided the value pointer is NULL'd out). If it is stolen, 1 must be
returned to prevent it being passed to the filesystem. returned to prevent it being passed to the filesystem.
(*) int security_fs_context_validate(struct fs_context *fc); * ::
int security_fs_context_validate(struct fs_context *fc);
Called after all the options have been parsed to validate the collection Called after all the options have been parsed to validate the collection
as a whole and to do any necessary allocation so that as a whole and to do any necessary allocation so that
@ -320,36 +370,43 @@ number of operations used by the new mount code for this purpose:
In the case of reconfiguration, the target superblock will be accessible In the case of reconfiguration, the target superblock will be accessible
via fc->root. via fc->root.
(*) int security_sb_get_tree(struct fs_context *fc); * ::
int security_sb_get_tree(struct fs_context *fc);
Called during the mount procedure to verify that the specified superblock Called during the mount procedure to verify that the specified superblock
is allowed to be mounted and to transfer the security data there. It is allowed to be mounted and to transfer the security data there. It
should return 0 or a negative error code. should return 0 or a negative error code.
(*) void security_sb_reconfigure(struct fs_context *fc); * ::
void security_sb_reconfigure(struct fs_context *fc);
Called to apply any reconfiguration to an LSM's context. It must not Called to apply any reconfiguration to an LSM's context. It must not
fail. Error checking and resource allocation must be done in advance by fail. Error checking and resource allocation must be done in advance by
the parameter parsing and validation hooks. the parameter parsing and validation hooks.
-(*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
-                               unsigned int mnt_flags);
+* ::
+
+    int security_sb_mountpoint(struct fs_context *fc,
+                               struct path *mountpoint,
+                               unsigned int mnt_flags);
Called during the mount procedure to verify that the root dentry attached Called during the mount procedure to verify that the root dentry attached
to the context is permitted to be attached to the specified mountpoint. to the context is permitted to be attached to the specified mountpoint.
It should return 0 on success or a negative error code on failure. It should return 0 on success or a negative error code on failure.
-==========================
-VFS FILESYSTEM CONTEXT API
+VFS Filesystem context API
 ==========================
There are four operations for creating a filesystem context and one for There are four operations for creating a filesystem context and one for
destroying a context: destroying a context:
-(*) struct fs_context *fs_context_for_mount(
-            struct file_system_type *fs_type,
-            unsigned int sb_flags);
+* ::
+
+    struct fs_context *fs_context_for_mount(struct file_system_type *fs_type,
+                                            unsigned int sb_flags);
Allocate a filesystem context for the purpose of setting up a new mount, Allocate a filesystem context for the purpose of setting up a new mount,
whether that be with a new superblock or sharing an existing one. This whether that be with a new superblock or sharing an existing one. This
@ -359,7 +416,9 @@ destroying a context:
fs_type specifies the filesystem type that will manage the context and fs_type specifies the filesystem type that will manage the context and
sb_flags presets the superblock flags stored therein. sb_flags presets the superblock flags stored therein.
(*) struct fs_context *fs_context_for_reconfigure( * ::
struct fs_context *fs_context_for_reconfigure(
struct dentry *dentry, struct dentry *dentry,
unsigned int sb_flags, unsigned int sb_flags,
unsigned int sb_flags_mask); unsigned int sb_flags_mask);
@ -369,7 +428,9 @@ destroying a context:
configured. sb_flags and sb_flags_mask indicate which superblock flags configured. sb_flags and sb_flags_mask indicate which superblock flags
need changing and to what. need changing and to what.
(*) struct fs_context *fs_context_for_submount( * ::
struct fs_context *fs_context_for_submount(
struct file_system_type *fs_type, struct file_system_type *fs_type,
struct dentry *reference); struct dentry *reference);
@ -382,7 +443,9 @@ destroying a context:
Note that it's not a requirement that the reference dentry be of the same Note that it's not a requirement that the reference dentry be of the same
filesystem type as fs_type. filesystem type as fs_type.
(*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc); * ::
struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
Duplicate a filesystem context, copying any options noted and duplicating Duplicate a filesystem context, copying any options noted and duplicating
or additionally referencing any resources held therein. This is available or additionally referencing any resources held therein. This is available
@ -392,14 +455,18 @@ destroying a context:
The purpose in the new context is inherited from the old one. The purpose in the new context is inherited from the old one.
(*) void put_fs_context(struct fs_context *fc); * ::
void put_fs_context(struct fs_context *fc);
Destroy a filesystem context, releasing any resources it holds. This Destroy a filesystem context, releasing any resources it holds. This
calls the ->free() operation. This is intended to be called by anyone who calls the ->free() operation. This is intended to be called by anyone who
created a filesystem context. created a filesystem context.
-[!] filesystem contexts are not refcounted, so this causes unconditional
-    destruction.
+.. Warning::
+
+   filesystem contexts are not refcounted, so this causes unconditional
+   destruction.
In all the above operations, apart from the put op, the return is a mount In all the above operations, apart from the put op, the return is a mount
context pointer or a negative error code. context pointer or a negative error code.
@ -407,8 +474,10 @@ context pointer or a negative error code.
For the remaining operations, if an error occurs, a negative error code will be For the remaining operations, if an error occurs, a negative error code will be
returned. returned.
(*) int vfs_parse_fs_param(struct fs_context *fc, * ::
struct fs_parameter *param);
int vfs_parse_fs_param(struct fs_context *fc,
struct fs_parameter *param);
Supply a single mount parameter to the filesystem context. This include Supply a single mount parameter to the filesystem context. This include
the specification of the source/device which is specified as the "source" the specification of the source/device which is specified as the "source"
@ -423,53 +492,64 @@ returned.
The parameter value is typed and can be one of: The parameter value is typed and can be one of:
-fs_value_is_flag,      Parameter not given a value.
-fs_value_is_string,    Value is a string
-fs_value_is_blob,      Value is a binary blob
-fs_value_is_filename,  Value is a filename* + dirfd
-fs_value_is_file,      Value is an open file (file*)
+====================  =============================
+fs_value_is_flag      Parameter not given a value
+fs_value_is_string    Value is a string
+fs_value_is_blob      Value is a binary blob
+fs_value_is_filename  Value is a filename* + dirfd
+fs_value_is_file      Value is an open file (file*)
+====================  =============================
If there is a value, that value is stored in a union in the struct in one If there is a value, that value is stored in a union in the struct in one
of param->{string,blob,name,file}. Note that the function may steal and of param->{string,blob,name,file}. Note that the function may steal and
clear the pointer, but then becomes responsible for disposing of the clear the pointer, but then becomes responsible for disposing of the
object. object.
(*) int vfs_parse_fs_string(struct fs_context *fc, const char *key, * ::
const char *value, size_t v_size);
int vfs_parse_fs_string(struct fs_context *fc, const char *key,
const char *value, size_t v_size);
A wrapper around vfs_parse_fs_param() that copies the value string it is A wrapper around vfs_parse_fs_param() that copies the value string it is
passed. passed.
(*) int generic_parse_monolithic(struct fs_context *fc, void *data); * ::
int generic_parse_monolithic(struct fs_context *fc, void *data);
Parse a sys_mount() data page, assuming the form to be a text list Parse a sys_mount() data page, assuming the form to be a text list
consisting of key[=val] options separated by commas. Each item in the consisting of key[=val] options separated by commas. Each item in the
list is passed to vfs_mount_option(). This is the default when the list is passed to vfs_mount_option(). This is the default when the
->parse_monolithic() method is NULL. ->parse_monolithic() method is NULL.
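
The key[=val] splitting described for generic_parse_monolithic() can be illustrated with a small userspace sketch. This is not the kernel's implementation: `split_mount_options()`, `opt_cb`, `struct opt_stats` and `count_option()` are hypothetical names invented for this example, and the choice to skip empty items is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: receives one key and its value (NULL if absent). */
typedef int (*opt_cb)(const char *key, const char *val, void *ctx);

/*
 * Split "key[=val],key[=val],..." in place and hand each item to cb.
 * Returns 0 on success, or the first nonzero value returned by cb.
 * Empty items (as in "a,,b") are skipped; that policy is an assumption.
 */
static int split_mount_options(char *data, opt_cb cb, void *ctx)
{
    assert(cb != NULL);

    while (data && *data) {
        char *item = data;
        char *next = strchr(item, ',');
        char *val;
        int ret;

        if (next)
            *next++ = '\0';     /* terminate this item */
        data = next;            /* advance before any 'continue' */

        if (*item == '\0')
            continue;

        val = strchr(item, '=');
        if (val)
            *val++ = '\0';      /* terminate key; val points past '=' */

        ret = cb(item, val, ctx);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: count items and how many of them carried a value. */
struct opt_stats {
    int items;
    int with_value;
};

static int count_option(const char *key, const char *val, void *ctx)
{
    struct opt_stats *st = ctx;

    (void)key;
    st->items++;
    if (val)
        st->with_value++;
    return 0;
}
```

Calling `split_mount_options()` on a writable copy of a string such as `ro,source=/dev/sda1,noatime` with `count_option` would visit three items, one of which has a value. The buffer is modified in place, so a string literal must not be passed directly.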
(*) int vfs_get_tree(struct fs_context *fc); * ::
int vfs_get_tree(struct fs_context *fc);
Get or create the mountable root and superblock, using the parameters in Get or create the mountable root and superblock, using the parameters in
the filesystem context to select/configure the superblock. This invokes the filesystem context to select/configure the superblock. This invokes
the ->get_tree() method. the ->get_tree() method.
(*) struct vfsmount *vfs_create_mount(struct fs_context *fc); * ::
struct vfsmount *vfs_create_mount(struct fs_context *fc);
Create a mount given the parameters in the specified filesystem context. Create a mount given the parameters in the specified filesystem context.
Note that this does not attach the mount to anything. Note that this does not attach the mount to anything.
-===========================
-SUPERBLOCK CREATION HELPERS
+Superblock Creation Helpers
 ===========================
A number of VFS helpers are available for use by filesystems for the creation A number of VFS helpers are available for use by filesystems for the creation
or looking up of superblocks. or looking up of superblocks.
-(*) struct super_block *
-    sget_fc(struct fs_context *fc,
-            int (*test)(struct super_block *sb, struct fs_context *fc),
-            int (*set)(struct super_block *sb, struct fs_context *fc));
+* ::
+
+    struct super_block *
+    sget_fc(struct fs_context *fc,
+            int (*test)(struct super_block *sb, struct fs_context *fc),
+            int (*set)(struct super_block *sb, struct fs_context *fc));
This is the core routine. If test is non-NULL, it searches for an This is the core routine. If test is non-NULL, it searches for an
existing superblock matching the criteria held in the fs_context, using existing superblock matching the criteria held in the fs_context, using
@ -482,10 +562,12 @@ or looking up of superblocks.
The following helpers all wrap sget_fc(): The following helpers all wrap sget_fc():
-(*) int vfs_get_super(struct fs_context *fc,
-                      enum vfs_get_super_keying keying,
-                      int (*fill_super)(struct super_block *sb,
-                                        struct fs_context *fc))
+* ::
+
+    int vfs_get_super(struct fs_context *fc,
+                      enum vfs_get_super_keying keying,
+                      int (*fill_super)(struct super_block *sb,
+                                        struct fs_context *fc))
This creates/looks up a deviceless superblock. The keying indicates how This creates/looks up a deviceless superblock. The keying indicates how
many superblocks of this type may exist and in what manner they may be many superblocks of this type may exist and in what manner they may be
@ -515,14 +597,14 @@ PARAMETER DESCRIPTION
===================== =====================
Parameters are described using structures defined in linux/fs_parser.h. Parameters are described using structures defined in linux/fs_parser.h.
There's a core description struct that links everything together: There's a core description struct that links everything together::
struct fs_parameter_description { struct fs_parameter_description {
const struct fs_parameter_spec *specs; const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums; const struct fs_parameter_enum *enums;
}; };
For example: For example::
enum { enum {
Opt_autocell, Opt_autocell,
@ -539,10 +621,12 @@ For example:
The members are as follows: The members are as follows:
(1) const struct fs_parameter_specification *specs; (1) ::
const struct fs_parameter_specification *specs;
Table of parameter specifications, terminated with a null entry, where the Table of parameter specifications, terminated with a null entry, where the
entries are of type: entries are of type::
struct fs_parameter_spec { struct fs_parameter_spec {
const char *name; const char *name;
@ -558,6 +642,7 @@ The members are as follows:
The 'type' field indicates the desired value type and must be one of: The 'type' field indicates the desired value type and must be one of:
======================= ======================= =====================
TYPE NAME EXPECTED VALUE RESULT IN TYPE NAME EXPECTED VALUE RESULT IN
======================= ======================= ===================== ======================= ======================= =====================
fs_param_is_flag No value n/a fs_param_is_flag No value n/a
@ -573,19 +658,23 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32 fs_param_is_fd File descriptor result->int_32
======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try Note that if the value is of fs_param_is_bool type, fs_parse() will try
to match any string value against "0", "1", "no", "yes", "false", "true". to match any string value against "0", "1", "no", "yes", "false", "true".
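
A minimal userspace sketch of that six-string match follows. It is an illustration only, not the fs_parse() code: `parse_bool_string()` is a hypothetical name, and both the case-sensitive comparison and the mapping of "0"/"no"/"false" to false are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Map one of the six accepted spellings onto a boolean.
 * Returns 0 and stores the result in *out on a match, or -1 if the
 * string is not one of the six. Comparison is case-sensitive here.
 */
static int parse_bool_string(const char *s, bool *out)
{
    static const struct {
        const char *name;
        bool value;
    } tbl[] = {
        { "0", false },     { "1", true },
        { "no", false },    { "yes", true },
        { "false", false }, { "true", true },
    };
    size_t i;

    assert(s != NULL && out != NULL);

    for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
        if (strcmp(s, tbl[i].name) == 0) {
            *out = tbl[i].value;
            return 0;
        }
    }
    return -1;
}
```

A caller would treat the -1 case as "value is erroneous" and report it in whatever way suits its own error handling.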
Each parameter can also be qualified with 'flags': Each parameter can also be qualified with 'flags':
======================= ================================================
fs_param_v_optional The value is optional fs_param_v_optional The value is optional
fs_param_neg_with_no result->negated set if key is prefixed with "no" fs_param_neg_with_no result->negated set if key is prefixed with "no"
fs_param_neg_with_empty result->negated set if value is "" fs_param_neg_with_empty result->negated set if value is ""
fs_param_deprecated The parameter is deprecated. fs_param_deprecated The parameter is deprecated.
======================= ================================================
These are wrapped with a number of convenience wrappers: These are wrapped with a number of convenience wrappers:
======================= ===============================================
MACRO SPECIFIES MACRO SPECIFIES
======================= =============================================== ======================= ===============================================
fsparam_flag() fs_param_is_flag fsparam_flag() fs_param_is_flag
@ -602,9 +691,10 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd fsparam_fd() fs_param_is_fd
======================= ===============================================
all of which take two arguments, name string and option number - for all of which take two arguments, name string and option number - for
example: example::
static const struct fs_parameter_spec afs_param_specs[] = { static const struct fs_parameter_spec afs_param_specs[] = {
fsparam_flag ("autocell", Opt_autocell), fsparam_flag ("autocell", Opt_autocell),
@ -618,10 +708,12 @@ The members are as follows:
of arguments to specify the type and the flags for anything that doesn't of arguments to specify the type and the flags for anything that doesn't
match one of the above macros. match one of the above macros.
(2) const struct fs_parameter_enum *enums; (2) ::
const struct fs_parameter_enum *enums;
Table of enum value names to integer mappings, terminated with a null Table of enum value names to integer mappings, terminated with a null
entry. This is of type: entry. This is of type::
struct fs_parameter_enum { struct fs_parameter_enum {
u8 opt; u8 opt;
@ -630,7 +722,7 @@ The members are as follows:
}; };
Where the array is an unsorted list of { parameter ID, name }-keyed Where the array is an unsorted list of { parameter ID, name }-keyed
elements that indicate the value to map to, e.g.: elements that indicate the value to map to, e.g.::
static const struct fs_parameter_enum afs_param_enums[] = { static const struct fs_parameter_enum afs_param_enums[] = {
{ Opt_bar, "x", 1}, { Opt_bar, "x", 1},
@ -648,18 +740,19 @@ CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
userspace using the fsinfo() syscall. userspace using the fsinfo() syscall.
-==========================
-PARAMETER HELPER FUNCTIONS
+Parameter Helper Functions
 ==========================
A number of helper functions are provided to help a filesystem or an LSM A number of helper functions are provided to help a filesystem or an LSM
process the parameters it is given. process the parameters it is given.
(*) int lookup_constant(const struct constant_table tbl[], * ::
const char *name, int not_found);
int lookup_constant(const struct constant_table tbl[],
const char *name, int not_found);
Look up a constant by name in a table of name -> integer mappings. The Look up a constant by name in a table of name -> integer mappings. The
table is an array of elements of the following type: table is an array of elements of the following type::
struct constant_table { struct constant_table {
const char *name; const char *name;
@ -669,9 +762,11 @@ process the parameters it is given.
If a match is found, the corresponding value is returned. If a match If a match is found, the corresponding value is returned. If a match
isn't found, the not_found value is returned instead. isn't found, the not_found value is returned instead.
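
The lookup behaviour described for lookup_constant() can be sketched in userspace as below. This is a hedged illustration rather than the kernel routine: the struct and function carry a `_sketch` suffix to mark them as hypothetical, the `value` member is assumed (the hunk above only shows `name`), and terminating the table with a NULL name plus scanning it linearly are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Name -> integer mapping; the 'value' member is an assumption. */
struct constant_table_sketch {
    const char *name;
    int value;
};

/*
 * Return the value mapped to 'name', or 'not_found' if the name is
 * absent. The table is assumed to end with an entry whose name is NULL.
 */
static int lookup_constant_sketch(const struct constant_table_sketch tbl[],
                                  const char *name, int not_found)
{
    size_t i;

    assert(tbl != NULL && name != NULL);

    for (i = 0; tbl[i].name != NULL; i++) {
        if (strcmp(tbl[i].name, name) == 0)
            return tbl[i].value;
    }
    return not_found;
}
```

Passing a sentinel such as -1 for `not_found` lets the caller tell a miss apart from any legitimate table value, provided the sentinel is not itself in the table.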
-(*) bool validate_constant_table(const struct constant_table *tbl,
-                                 size_t tbl_size,
-                                 int low, int high, int special);
+* ::
+
+    bool validate_constant_table(const struct constant_table *tbl,
+                                 size_t tbl_size,
+                                 int low, int high, int special);
Validate a constant table. Checks that all the elements are appropriately Validate a constant table. Checks that all the elements are appropriately
ordered, that there are no duplicates and that the values are between low ordered, that there are no duplicates and that the values are between low
@ -682,16 +777,20 @@ process the parameters it is given.
If all is good, true is returned. If the table is invalid, errors are If all is good, true is returned. If the table is invalid, errors are
logged to dmesg and false is returned. logged to dmesg and false is returned.
(*) bool fs_validate_description(const struct fs_parameter_description *desc); * ::
bool fs_validate_description(const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will returns true if the description is good and false if it is not. It will
log errors to dmesg if validation fails. log errors to dmesg if validation fails.
-(*) int fs_parse(struct fs_context *fc,
-                 const struct fs_parameter_description *desc,
-                 struct fs_parameter *param,
-                 struct fs_parse_result *result);
+* ::
+
+    int fs_parse(struct fs_context *fc,
+                 const struct fs_parameter_description *desc,
+                 struct fs_parameter *param,
+                 struct fs_parse_result *result);
This is the main interpreter of parameters. It uses the parameter This is the main interpreter of parameters. It uses the parameter
description to look up a parameter by key name and to convert that to an description to look up a parameter by key name and to convert that to an
@ -711,14 +810,16 @@ process the parameters it is given.
parameter is matched, but the value is erroneous, -EINVAL will be parameter is matched, but the value is erroneous, -EINVAL will be
returned; otherwise the parameter's option number will be returned. returned; otherwise the parameter's option number will be returned.
-(*) int fs_lookup_param(struct fs_context *fc,
-                        struct fs_parameter *value,
-                        bool want_bdev,
-                        struct path *_path);
+* ::
+
+    int fs_lookup_param(struct fs_context *fc,
+                        struct fs_parameter *value,
+                        bool want_bdev,
+                        struct path *_path);
This takes a parameter that carries a string or filename type and attempts This takes a parameter that carries a string or filename type and attempts
to do a path lookup on it. If the parameter expects a blockdev, a check to do a path lookup on it. If the parameter expects a blockdev, a check
is made that the inode actually represents one. is made that the inode actually represents one.
-Returns 0 if successful and *_path will be set; returns a negative error
-code if not.
+Returns 0 if successful and ``*_path`` will be set; returns a negative
+error code if not.

@ -119,9 +119,7 @@ it comes to that question::
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
-Create an /etc/pvfs2tab file::
-
-Localhost is fine for your pvfs2tab file:
+Create an /etc/pvfs2tab file (localhost is fine)::
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
/etc/pvfs2tab /etc/pvfs2tab

@ -1871,7 +1871,7 @@ unbindable mount is unbindable
For more information on mount propagation see: For more information on mount propagation see:
Documentation/filesystems/sharedsubtree.txt Documentation/filesystems/sharedsubtree.rst
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm

@ -1,4 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
===============
Quota subsystem Quota subsystem
=============== ===============
@ -39,6 +41,7 @@ Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
This command is used to send a notification about any of the above mentioned This command is used to send a notification about any of the above mentioned
events. Each message has six attributes. These are (type of the argument is events. Each message has six attributes. These are (type of the argument is
in parentheses): in parentheses):
QUOTA_NL_A_QTYPE (u32) QUOTA_NL_A_QTYPE (u32)
- type of quota being exceeded (one of USRQUOTA, GRPQUOTA) - type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
QUOTA_NL_A_EXCESS_ID (u64) QUOTA_NL_A_EXCESS_ID (u64)
@ -48,20 +51,34 @@ in parentheses):
- UID of a user who caused the event - UID of a user who caused the event
QUOTA_NL_A_WARNING (u32) QUOTA_NL_A_WARNING (u32)
- what kind of limit is exceeded: - what kind of limit is exceeded:
-    QUOTA_NL_IHARDWARN - inode hardlimit
-    QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
-    than given grace period
-    QUOTA_NL_ISOFTWARN - inode softlimit
-    QUOTA_NL_BHARDWARN - space (block) hardlimit
-    QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
-    longer than given grace period.
-    QUOTA_NL_BSOFTWARN - space (block) softlimit
+
+    QUOTA_NL_IHARDWARN
+        inode hardlimit
+    QUOTA_NL_ISOFTLONGWARN
+        inode softlimit is exceeded longer
+        than given grace period
+    QUOTA_NL_ISOFTWARN
+        inode softlimit
+    QUOTA_NL_BHARDWARN
+        space (block) hardlimit
+    QUOTA_NL_BSOFTLONGWARN
+        space (block) softlimit is exceeded
+        longer than given grace period.
+    QUOTA_NL_BSOFTWARN
+        space (block) softlimit
+
   - four warnings are also defined for the event when user stops
     exceeding some limit:
-    QUOTA_NL_IHARDBELOW - inode hardlimit
-    QUOTA_NL_ISOFTBELOW - inode softlimit
-    QUOTA_NL_BHARDBELOW - space (block) hardlimit
-    QUOTA_NL_BSOFTBELOW - space (block) softlimit
+
+    QUOTA_NL_IHARDBELOW
+        inode hardlimit
+    QUOTA_NL_ISOFTBELOW
+        inode softlimit
+    QUOTA_NL_BHARDBELOW
+        space (block) hardlimit
+    QUOTA_NL_BSOFTBELOW
+        space (block) softlimit
+
QUOTA_NL_A_DEV_MAJOR (u32) QUOTA_NL_A_DEV_MAJOR (u32)
- major number of a device with the affected filesystem - major number of a device with the affected filesystem
QUOTA_NL_A_DEV_MINOR (u32) QUOTA_NL_A_DEV_MINOR (u32)

@ -71,7 +71,7 @@ be allowed write access to a ramfs mount.
A ramfs derivative called tmpfs was created to add size limits, and the ability A ramfs derivative called tmpfs was created to add size limits, and the ability
to write the data to swap space. Normal users can be allowed write access to to write the data to swap space. Normal users can be allowed write access to
tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information.
What is rootfs? What is rootfs?
--------------- ---------------

@ -1,6 +1,11 @@
The seq_file interface .. SPDX-License-Identifier: GPL-2.0
======================
The seq_file Interface
======================
Copyright 2003 Jonathan Corbet <corbet@lwn.net> Copyright 2003 Jonathan Corbet <corbet@lwn.net>
This file is originally from the LWN.net Driver Porting series at This file is originally from the LWN.net Driver Porting series at
http://lwn.net/Articles/driver-porting/ http://lwn.net/Articles/driver-porting/
@ -43,7 +48,7 @@ loadable module which creates a file called /proc/sequence. The file, when
read, simply produces a set of increasing integer values, one per line. The read, simply produces a set of increasing integer values, one per line. The
sequence will continue until the user loses patience and finds something sequence will continue until the user loses patience and finds something
better to do. The file is seekable, in that one can do something like the better to do. The file is seekable, in that one can do something like the
following: following::
dd if=/proc/sequence of=out1 count=1 dd if=/proc/sequence of=out1 count=1
dd if=/proc/sequence skip=1 of=out2 count=1 dd if=/proc/sequence skip=1 of=out2 count=1
@ -55,16 +60,18 @@ wanting to see the full source for this module can find it at
http://lwn.net/Articles/22359/). http://lwn.net/Articles/22359/).
Deprecated create_proc_entry Deprecated create_proc_entry
============================
Note that the above article uses create_proc_entry which was removed in Note that the above article uses create_proc_entry which was removed in
kernel 3.10. Current versions require the following update kernel 3.10. Current versions require the following update::
- entry = create_proc_entry("sequence", 0, NULL); - entry = create_proc_entry("sequence", 0, NULL);
- if (entry) - if (entry)
- entry->proc_fops = &ct_file_ops; - entry->proc_fops = &ct_file_ops;
+ entry = proc_create("sequence", 0, NULL, &ct_file_ops); + entry = proc_create("sequence", 0, NULL, &ct_file_ops);
The iterator interface The iterator interface
======================
Modules implementing a virtual file with seq_file must implement an Modules implementing a virtual file with seq_file must implement an
iterator object that allows stepping through the data of interest iterator object that allows stepping through the data of interest
@ -99,7 +106,7 @@ position. The pos passed to start() will always be either zero, or
the most recent pos used in the previous session. the most recent pos used in the previous session.
For our simple sequence example, For our simple sequence example,
the start() function looks like: the start() function looks like::
static void *ct_seq_start(struct seq_file *s, loff_t *pos) static void *ct_seq_start(struct seq_file *s, loff_t *pos)
{ {
@ -129,7 +136,7 @@ move the iterator forward to the next position in the sequence. The
example module can simply increment the position by one; more useful example module can simply increment the position by one; more useful
modules will do what is needed to step through some data structure. The modules will do what is needed to step through some data structure. The
next() function returns a new iterator, or NULL if the sequence is next() function returns a new iterator, or NULL if the sequence is
complete. Here's the example version: complete. Here's the example version::
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{ {
@ -141,10 +148,10 @@ complete. Here's the example version:
The stop() function closes a session; its job, of course, is to clean The stop() function closes a session; its job, of course, is to clean
up. If dynamic memory is allocated for the iterator, stop() is the up. If dynamic memory is allocated for the iterator, stop() is the
place to free it; if a lock was taken by start(), stop() must release place to free it; if a lock was taken by start(), stop() must release
that lock. The value that *pos was set to by the last next() call that lock. The value that ``*pos`` was set to by the last next() call
before stop() is remembered, and used for the first start() call of before stop() is remembered, and used for the first start() call of
the next session unless lseek() has been called on the file; in that the next session unless lseek() has been called on the file; in that
case next start() will be asked to start at position zero. case next start() will be asked to start at position zero::
static void ct_seq_stop(struct seq_file *s, void *v) static void ct_seq_stop(struct seq_file *s, void *v)
{ {
@ -152,7 +159,7 @@ case next start() will be asked to start at position zero.
} }
Finally, the show() function should format the object currently pointed to Finally, the show() function should format the object currently pointed to
by the iterator for output. The example module's show() function is: by the iterator for output. The example module's show() function is::
static int ct_seq_show(struct seq_file *s, void *v) static int ct_seq_show(struct seq_file *s, void *v)
{ {
@ -169,7 +176,7 @@ generated output before returning SEQ_SKIP, that output will be dropped.
We will look at seq_printf() in a moment. But first, the definition of the We will look at seq_printf() in a moment. But first, the definition of the
seq_file iterator is finished by creating a seq_operations structure with seq_file iterator is finished by creating a seq_operations structure with
the four functions we have just defined: the four functions we have just defined::
static const struct seq_operations ct_seq_ops = { static const struct seq_operations ct_seq_ops = {
.start = ct_seq_start, .start = ct_seq_start,
@ -194,6 +201,7 @@ other locks while the iterator is active.
Formatted output Formatted output
================
The seq_file code manages positioning within the output created by the The seq_file code manages positioning within the output created by the
iterator and getting it into the user's buffer. But, for that to work, that iterator and getting it into the user's buffer. But, for that to work, that
@ -203,7 +211,7 @@ been defined which make this task easy.
Most code will simply use seq_printf(), which works pretty much like Most code will simply use seq_printf(), which works pretty much like
printk(), but which requires the seq_file pointer as an argument. printk(), but which requires the seq_file pointer as an argument.
For straight character output, the following functions may be used: For straight character output, the following functions may be used::
seq_putc(struct seq_file *m, char c); seq_putc(struct seq_file *m, char c);
seq_puts(struct seq_file *m, const char *s); seq_puts(struct seq_file *m, const char *s);
@ -213,7 +221,7 @@ The first two output a single character and a string, just like one would
expect. seq_escape() is like seq_puts(), except that any character in s expect. seq_escape() is like seq_puts(), except that any character in s
which is in the string esc will be represented in octal form in the output. which is in the string esc will be represented in octal form in the output.
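The escaping rule described for seq_escape() can be modelled in userspace. The sketch below is not the kernel implementation: `escape_octal()` is a hypothetical helper that writes into a plain buffer instead of a seq_file, and the exact output form (a backslash followed by three octal digits) is an assumption, since the text only says "octal form".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace model only: copy s into out, replacing every character that
 * also appears in esc with a backslash and three octal digits (assumed
 * format). Returns the number of bytes written, not counting the
 * terminating NUL, or -1 if out is too small.
 */
static int escape_octal(char *out, size_t outlen, const char *s,
			const char *esc)
{
	size_t used = 0;

	assert(out && s && esc);
	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;

		if (strchr(esc, c)) {
			/* Need room for four characters plus the NUL. */
			if (used + 4 >= outlen)
				return -1;
			snprintf(out + used, 5, "\\%03o", (unsigned int)c);
			used += 4;
		} else {
			/* Need room for one character plus the NUL. */
			if (used + 1 >= outlen)
				return -1;
			out[used++] = (char)c;
		}
	}
	if (used >= outlen)
		return -1;
	out[used] = '\0';
	return (int)used;
}
```

Escaping whitespace this way keeps each record on one line and each field free of separators, which is typically why a virtual file would escape its strings at all.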
There are also a pair of functions for printing filenames: There are also a pair of functions for printing filenames::
int seq_path(struct seq_file *m, const struct path *path, int seq_path(struct seq_file *m, const struct path *path,
const char *esc); const char *esc);
@ -226,8 +234,10 @@ the path relative to the current process's filesystem root. If a different
root is desired, it can be used with seq_path_root(). If it turns out that root is desired, it can be used with seq_path_root(). If it turns out that
path cannot be reached from root, seq_path_root() returns SEQ_SKIP. path cannot be reached from root, seq_path_root() returns SEQ_SKIP.
A function producing complicated output may want to check A function producing complicated output may want to check::
bool seq_has_overflowed(struct seq_file *m); bool seq_has_overflowed(struct seq_file *m);
and avoid further seq_<output> calls if true is returned. and avoid further seq_<output> calls if true is returned.
A true return from seq_has_overflowed means that the seq_file buffer will A true return from seq_has_overflowed means that the seq_file buffer will
@ -236,6 +246,7 @@ buffer and retry printing.
Making it all work Making it all work
==================
So far, we have a nice set of functions which can produce output within the So far, we have a nice set of functions which can produce output within the
seq_file system, but we have not yet turned them into a file that a user seq_file system, but we have not yet turned them into a file that a user
@ -244,7 +255,7 @@ creation of a set of file_operations which implement the operations on that
file. The seq_file interface provides a set of canned operations which do file. The seq_file interface provides a set of canned operations which do
most of the work. The virtual file author still must implement the open() most of the work. The virtual file author still must implement the open()
method, however, to hook everything up. The open function is often a single method, however, to hook everything up. The open function is often a single
line, as in the example module: line, as in the example module::
static int ct_open(struct inode *inode, struct file *file) static int ct_open(struct inode *inode, struct file *file)
{ {
@ -263,7 +274,7 @@ by the iterator functions.
There is also a wrapper function to seq_open() called seq_open_private(). It There is also a wrapper function to seq_open() called seq_open_private(). It
kmallocs a zero filled block of memory and stores a pointer to it in the kmallocs a zero filled block of memory and stores a pointer to it in the
private field of the seq_file structure, returning 0 on success. The private field of the seq_file structure, returning 0 on success. The
block size is specified in a third parameter to the function, e.g.: block size is specified in a third parameter to the function, e.g.::
static int ct_open(struct inode *inode, struct file *file) static int ct_open(struct inode *inode, struct file *file)
{ {
@ -273,7 +284,7 @@ block size is specified in a third parameter to the function, e.g.:
There is also a variant function, __seq_open_private(), which is functionally There is also a variant function, __seq_open_private(), which is functionally
identical except that, if successful, it returns the pointer to the allocated identical except that, if successful, it returns the pointer to the allocated
memory block, allowing further initialisation e.g.: memory block, allowing further initialisation e.g.::
static int ct_open(struct inode *inode, struct file *file) static int ct_open(struct inode *inode, struct file *file)
{ {
@ -295,7 +306,7 @@ frees the memory allocated in the corresponding open.
The other operations of interest - read(), llseek(), and release() - are The other operations of interest - read(), llseek(), and release() - are
all implemented by the seq_file code itself. So a virtual file's all implemented by the seq_file code itself. So a virtual file's
file_operations structure will look like: file_operations structure will look like::
static const struct file_operations ct_file_ops = { static const struct file_operations ct_file_ops = {
.owner = THIS_MODULE, .owner = THIS_MODULE,
@ -309,7 +320,7 @@ There is also a seq_release_private() which passes the contents of the
seq_file private field to kfree() before releasing the structure. seq_file private field to kfree() before releasing the structure.
The final step is the creation of the /proc file itself. In the example The final step is the creation of the /proc file itself. In the example
code, that is done in the initialization code in the usual way: code, that is done in the initialization code in the usual way::
static int ct_init(void) static int ct_init(void)
{ {
@ -325,9 +336,10 @@ And that is pretty much it.
seq_list seq_list
========
If your file will be iterating through a linked list, you may find these If your file will be iterating through a linked list, you may find these
routines useful: routines useful::
struct list_head *seq_list_start(struct list_head *head, struct list_head *seq_list_start(struct list_head *head,
loff_t pos); loff_t pos);
@ -338,15 +350,16 @@ routines useful:
These helpers will interpret pos as a position within the list and iterate These helpers will interpret pos as a position within the list and iterate
accordingly. Your start() and next() functions need only invoke the accordingly. Your start() and next() functions need only invoke the
seq_list_* helpers with a pointer to the appropriate list_head structure. ``seq_list_*`` helpers with a pointer to the appropriate list_head structure.
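To make "interpret pos as a position within the list" concrete, here is a minimal userspace sketch. It is an illustration under stated assumptions, not the kernel code: `struct list_node`, `model_list_start()` and `model_list_next()` are hypothetical stand-ins for list_head and the seq_list helpers, and a plain `long` replaces loff_t.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a circular, doubly linked list head/entry. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Return the entry at index pos, or NULL when pos is past the end. */
static struct list_node *model_list_start(struct list_node *head, long pos)
{
	struct list_node *n;

	assert(head);
	for (n = head->next; n != head; n = n->next)
		if (pos-- == 0)
			return n;
	return NULL;
}

/* Step to the following entry and advance *pos; NULL marks the end. */
static struct list_node *model_list_next(struct list_node *cur,
					 struct list_node *head, long *pos)
{
	++*pos;
	return cur->next == head ? NULL : cur->next;
}
```

With helpers of this shape, a start() function reduces to one call with the list head and `*pos`, and next() to one call with the current entry.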
The extra-simple version The extra-simple version
========================
For extremely simple virtual files, there is an even easier interface. A For extremely simple virtual files, there is an even easier interface. A
module can define only the show() function, which should create all the module can define only the show() function, which should create all the
output that the virtual file will contain. The file's open() method then output that the virtual file will contain. The file's open() method then
calls: calls::
int single_open(struct file *file, int single_open(struct file *file,
int (*show)(struct seq_file *m, void *p), int (*show)(struct seq_file *m, void *p),


@ -1,7 +1,10 @@
Shared Subtrees .. SPDX-License-Identifier: GPL-2.0
---------------
Contents: ===============
Shared Subtrees
===============
.. Contents:
1) Overview 1) Overview
2) Features 2) Features
3) Setting mount states 3) Setting mount states
@ -41,31 +44,38 @@ replicas continue to be exactly same.
Here is an example: Here is an example:
Let's say /mnt has a mount that is shared. Let's say /mnt has a mount that is shared::
mount --make-shared /mnt
mount --make-shared /mnt
Note: mount(8) command now supports the --make-shared flag, Note: mount(8) command now supports the --make-shared flag,
so the sample 'smount' program is no longer needed and has been so the sample 'smount' program is no longer needed and has been
removed. removed.
# mount --bind /mnt /tmp ::
# mount --bind /mnt /tmp
The above command replicates the mount at /mnt to the mountpoint /tmp The above command replicates the mount at /mnt to the mountpoint /tmp
and the contents of both the mounts remain identical. and the contents of both the mounts remain identical.
#ls /mnt ::
a b c
#ls /tmp #ls /mnt
a b c a b c
Now let's say we mount a device at /tmp/a #ls /tmp
# mount /dev/sd0 /tmp/a a b c
#ls /tmp/a Now let's say we mount a device at /tmp/a::
t1 t2 t3
#ls /mnt/a # mount /dev/sd0 /tmp/a
t1 t2 t3
#ls /tmp/a
t1 t2 t3
#ls /mnt/a
t1 t2 t3
Note that the mount has propagated to the mount at /mnt as well. Note that the mount has propagated to the mount at /mnt as well.
@ -123,14 +133,15 @@ replicas continue to be exactly same.
2d) A unbindable mount is a unbindable private mount 2d) A unbindable mount is a unbindable private mount
let's say we have a mount at /mnt and we make it unbindable let's say we have a mount at /mnt and we make it unbindable::
# mount --make-unbindable /mnt # mount --make-unbindable /mnt
Let's try to bind mount this mount somewhere else. Let's try to bind mount this mount somewhere else::
# mount --bind /mnt /tmp
mount: wrong fs type, bad option, bad superblock on /mnt, # mount --bind /mnt /tmp
or too many mounted file systems mount: wrong fs type, bad option, bad superblock on /mnt,
or too many mounted file systems
Binding a unbindable mount is a invalid operation. Binding a unbindable mount is a invalid operation.
@ -138,12 +149,12 @@ replicas continue to be exactly same.
3) Setting mount states 3) Setting mount states
The mount command (util-linux package) can be used to set mount The mount command (util-linux package) can be used to set mount
states: states::
mount --make-shared mountpoint mount --make-shared mountpoint
mount --make-slave mountpoint mount --make-slave mountpoint
mount --make-private mountpoint mount --make-private mountpoint
mount --make-unbindable mountpoint mount --make-unbindable mountpoint
4) Use cases 4) Use cases
@ -154,9 +165,10 @@ replicas continue to be exactly same.
Solution: Solution:
The system administrator can make the mount at /cdrom shared The system administrator can make the mount at /cdrom shared::
mount --bind /cdrom /cdrom
mount --make-shared /cdrom mount --bind /cdrom /cdrom
mount --make-shared /cdrom
Now any process that clones off a new namespace will have a Now any process that clones off a new namespace will have a
mount at /cdrom which is a replica of the same mount in the mount at /cdrom which is a replica of the same mount in the
@ -172,14 +184,14 @@ replicas continue to be exactly same.
Solution: Solution:
To begin with, the administrator can mark the entire mount tree To begin with, the administrator can mark the entire mount tree
as shareable. as shareable::
mount --make-rshared / mount --make-rshared /
A new process can clone off a new namespace. And mark some part A new process can clone off a new namespace. And mark some part
of its namespace as slave of its namespace as slave::
mount --make-rslave /myprivatetree mount --make-rslave /myprivatetree
Hence forth any mounts within the /myprivatetree done by the Hence forth any mounts within the /myprivatetree done by the
process will not show up in any other namespace. However mounts process will not show up in any other namespace. However mounts
@ -206,13 +218,13 @@ replicas continue to be exactly same.
versions of the file depending on the path used to access that versions of the file depending on the path used to access that
file. file.
An example is: An example is::
mount --make-shared / mount --make-shared /
mount --rbind / /view/v1 mount --rbind / /view/v1
mount --rbind / /view/v2 mount --rbind / /view/v2
mount --rbind / /view/v3 mount --rbind / /view/v3
mount --rbind / /view/v4 mount --rbind / /view/v4
and if /usr has a versioning filesystem mounted, then that and if /usr has a versioning filesystem mounted, then that
mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
@ -224,8 +236,8 @@ replicas continue to be exactly same.
filesystem is being requested and return the corresponding filesystem is being requested and return the corresponding
inode. inode.
5) Detailed semantics: 5) Detailed semantics
------------------- ---------------------
The section below explains the detailed semantics of The section below explains the detailed semantics of
bind, rbind, move, mount, umount and clone-namespace operations. bind, rbind, move, mount, umount and clone-namespace operations.
@ -235,6 +247,7 @@ replicas continue to be exactly same.
5a) Mount states 5a) Mount states
A given mount can be in one of the following states A given mount can be in one of the following states
1) shared 1) shared
2) slave 2) slave
3) shared and slave 3) shared and slave
@ -252,7 +265,8 @@ replicas continue to be exactly same.
A 'shared mount' is defined as a vfsmount that belongs to a A 'shared mount' is defined as a vfsmount that belongs to a
'peer group'. 'peer group'.
For example: For example::
mount --make-shared /mnt mount --make-shared /mnt
mount --bind /mnt /tmp mount --bind /mnt /tmp
@ -270,7 +284,7 @@ replicas continue to be exactly same.
A slave mount as the name implies has a master mount from which A slave mount as the name implies has a master mount from which
mount/unmount events are received. Events do not propagate from mount/unmount events are received. Events do not propagate from
the slave mount to the master. Only a shared mount can be made the slave mount to the master. Only a shared mount can be made
a slave by executing the following command a slave by executing the following command::
mount --make-slave mount mount --make-slave mount
@ -290,8 +304,10 @@ replicas continue to be exactly same.
peer group. peer group.
Only a slave vfsmount can be made as 'shared and slave' by Only a slave vfsmount can be made as 'shared and slave' by
either executing the following command either executing the following command::
mount --make-shared mount mount --make-shared mount
or by moving the slave vfsmount under a shared vfsmount. or by moving the slave vfsmount under a shared vfsmount.
(4) Private mount (4) Private mount
@ -307,30 +323,32 @@ replicas continue to be exactly same.
State diagram: State diagram:
The state diagram below explains the state transition of a mount, The state diagram below explains the state transition of a mount,
in response to various commands. in response to various commands::
------------------------------------------------------------------------
| |make-shared | make-slave | make-private |make-unbindab|
--------------|------------|--------------|--------------|-------------|
|shared |shared |*slave/private| private | unbindable |
| | | | | |
|-------------|------------|--------------|--------------|-------------|
|slave |shared | **slave | private | unbindable |
| |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|shared |shared | slave | private | unbindable |
|and slave |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|private |shared | **private | private | unbindable |
|-------------|------------|--------------|--------------|-------------|
|unbindable |shared |**unbindable | private | unbindable |
------------------------------------------------------------------------
* if the shared mount is the only mount in its peer group, making it -----------------------------------------------------------------------
slave, makes it private automatically. Note that there is no master to | |make-shared | make-slave | make-private |make-unbindab|
which it can be slaved to. --------------|------------|--------------|--------------|-------------|
|shared |shared |*slave/private| private | unbindable |
| | | | | |
|-------------|------------|--------------|--------------|-------------|
|slave |shared | **slave | private | unbindable |
| |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|shared |shared | slave | private | unbindable |
|and slave |and slave | | | |
|-------------|------------|--------------|--------------|-------------|
|private |shared | **private | private | unbindable |
|-------------|------------|--------------|--------------|-------------|
|unbindable |shared |**unbindable | private | unbindable |
------------------------------------------------------------------------
** slaving a non-shared mount has no effect on the mount. * if the shared mount is the only mount in its peer group, making it
slave, makes it private automatically. Note that there is no master to
which it can be slaved to.
** slaving a non-shared mount has no effect on the mount.
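The state diagram above can be read as a small transition function. The following is a userspace sketch only; the enum names and `next_state()` are hypothetical and do not correspond to kernel symbols. The `alone_in_peer_group` flag models the single-asterisk footnote.

```c
#include <assert.h>
#include <stdbool.h>

enum mnt_state { SHARED, SLAVE, SHARED_SLAVE, PRIVATE, UNBINDABLE };
enum mnt_cmd { MAKE_SHARED, MAKE_SLAVE, MAKE_PRIVATE, MAKE_UNBINDABLE };

/* Return the state a mount ends up in after cmd, following the table. */
static enum mnt_state next_state(enum mnt_state cur, enum mnt_cmd cmd,
				 bool alone_in_peer_group)
{
	switch (cmd) {
	case MAKE_PRIVATE:
		return PRIVATE;
	case MAKE_UNBINDABLE:
		return UNBINDABLE;
	case MAKE_SHARED:
		/* A mount that already has a master keeps it. */
		if (cur == SLAVE || cur == SHARED_SLAVE)
			return SHARED_SLAVE;
		return SHARED;
	case MAKE_SLAVE:
		if (cur == SHARED)
			/* (*) no peer left to act as master */
			return alone_in_peer_group ? PRIVATE : SLAVE;
		if (cur == SHARED_SLAVE)
			return SLAVE;
		/* (**) slaving a non-shared mount has no effect */
		return cur;
	}
	return cur;
}
```

Note that make-private and make-unbindable ignore the current state entirely, which is why they are handled first.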
Apart from the commands listed below, the 'move' operation also changes Apart from the commands listed below, the 'move' operation also changes
the state of a mount depending on type of the destination mount. Its the state of a mount depending on type of the destination mount. Its
@ -338,31 +356,32 @@ replicas continue to be exactly same.
5b) Bind semantics 5b) Bind semantics
Consider the following command Consider the following command::
mount --bind A/a B/b mount --bind A/a B/b
where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B' where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
is the destination mount and 'b' is the dentry in the destination mount. is the destination mount and 'b' is the dentry in the destination mount.
The outcome depends on the type of mount of 'A' and 'B'. The table The outcome depends on the type of mount of 'A' and 'B'. The table
below contains quick reference. below contains quick reference::
---------------------------------------------------------------------------
| BIND MOUNT OPERATION | --------------------------------------------------------------------------
|************************************************************************** | BIND MOUNT OPERATION |
|source(A)->| shared | private | slave | unbindable | |************************************************************************|
| dest(B) | | | | | |source(A)->| shared | private | slave | unbindable |
| | | | | | | | dest(B) | | | | |
| v | | | | | | | | | | | |
|************************************************************************** | v | | | | |
| shared | shared | shared | shared & slave | invalid | |************************************************************************|
| | | | | | | shared | shared | shared | shared & slave | invalid |
|non-shared| shared | private | slave | invalid | | | | | | |
*************************************************************************** |non-shared| shared | private | slave | invalid |
**************************************************************************
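The quick-reference table for bind mounts can also be expressed as a lookup. This is a sketch under assumptions: `bind_result()` and the enum names are hypothetical, and the function only reports the type of the new mount 'C', not the propagation trees described in the details.

```c
#include <assert.h>
#include <stdbool.h>

enum src_type { SRC_SHARED, SRC_PRIVATE, SRC_SLAVE, SRC_UNBINDABLE };
enum bind_res { RES_SHARED, RES_PRIVATE, RES_SLAVE, RES_SHARED_SLAVE,
		RES_INVALID };

/*
 * Type of the new mount created by "mount --bind A/a B/b", indexed by
 * whether the destination B is shared and by the type of the source A.
 */
static enum bind_res bind_result(enum src_type a, bool dest_shared)
{
	static const enum bind_res table[2][4] = {
		/* B non-shared */
		{ RES_SHARED, RES_PRIVATE, RES_SLAVE, RES_INVALID },
		/* B shared */
		{ RES_SHARED, RES_SHARED, RES_SHARED_SLAVE, RES_INVALID },
	};

	assert(a >= SRC_SHARED && a <= SRC_UNBINDABLE);
	return table[dest_shared ? 1 : 0][a];
}
```

The two rows differ only for private and slave sources: binding onto a shared destination always yields a mount that is itself shared.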
Details: Details:
1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a' . 'C' is which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B' are created and mounted at the dentry 'b' on all mounts where 'B'
@ -371,7 +390,7 @@ replicas continue to be exactly same.
'B'. And finally the peer-group of 'C' is merged with the peer group 'B'. And finally the peer-group of 'C' is merged with the peer group
of 'A'. of 'A'.
2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a'. 'C' is which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B' are created and mounted at the dentry 'b' on all mounts where 'B'
@ -379,7 +398,7 @@ replicas continue to be exactly same.
'C', 'C1', .., 'Cn' with exactly the same configuration as the 'C', 'C1', .., 'Cn' with exactly the same configuration as the
propagation tree for 'B'. propagation tree for 'B'.
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
'C3' ... are created and mounted at the dentry 'b' on all mounts where 'C3' ... are created and mounted at the dentry 'b' on all mounts where
@ -389,19 +408,19 @@ replicas continue to be exactly same.
is made the slave of mount 'Z'. In other words, mount 'C' is in the is made the slave of mount 'Z'. In other words, mount 'C' is in the
state 'slave and shared'. state 'slave and shared'.
4. 'A' is a unbindable mount and 'B' is a shared mount. This is a 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
invalid operation. invalid operation.
5. 'A' is a private mount and 'B' is a non-shared(private or slave or 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. A new mount 'C' which is clone of 'A', is created. unbindable) mount. A new mount 'C' which is clone of 'A', is created.
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. 'C' is made a member of the mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
peer-group of 'A'. peer-group of 'A'.
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
new mount 'C' which is a clone of 'A' is created. Its root dentry is new mount 'C' which is a clone of 'A' is created. Its root dentry is
'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
@ -409,7 +428,7 @@ replicas continue to be exactly same.
mount/unmount on 'A' do not propagate anywhere else. Similarly mount/unmount on 'A' do not propagate anywhere else. Similarly
mount/unmount on 'C' do not propagate anywhere else. mount/unmount on 'C' do not propagate anywhere else.
8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
invalid operation. A unbindable mount cannot be bind mounted. invalid operation. A unbindable mount cannot be bind mounted.
5c) Rbind semantics 5c) Rbind semantics
@ -422,7 +441,9 @@ replicas continue to be exactly same.
then the subtree under the unbindable mount is pruned in the new then the subtree under the unbindable mount is pruned in the new
location. location.
eg: let's say we have the following mount tree. eg:
let's say we have the following mount tree::
A A
/ \ / \
@ -430,12 +451,12 @@ replicas continue to be exactly same.
/ \ / \ / \ / \
D E F G D E F G
Let's say all the mount except the mount C in the tree are Let's say all the mount except the mount C in the tree are
of a type other than unbindable. of a type other than unbindable.
If this tree is rbound to say Z If this tree is rbound to say Z
We will have the following tree at the new location. We will have the following tree at the new location::
Z Z
| |
@ -457,24 +478,26 @@ replicas continue to be exactly same.
the dentry in the destination mount. the dentry in the destination mount.
The outcome depends on the type of the mount of 'A' and 'B'. The table The outcome depends on the type of the mount of 'A' and 'B'. The table
below is a quick reference. below is a quick reference::
---------------------------------------------------------------------------
| MOVE MOUNT OPERATION | ---------------------------------------------------------------------------
|************************************************************************** | MOVE MOUNT OPERATION |
| source(A)->| shared | private | slave | unbindable | |**************************************************************************
| dest(B) | | | | | | source(A)->| shared | private | slave | unbindable |
| | | | | | | | dest(B) | | | | |
| v | | | | | | | | | | | |
|************************************************************************** | v | | | | |
| shared | shared | shared |shared and slave| invalid | |**************************************************************************
| | | | | | | shared | shared | shared |shared and slave| invalid |
|non-shared| shared | private | slave | unbindable | | | | | | |
*************************************************************************** |non-shared| shared | private | slave | unbindable |
NOTE: moving a mount residing under a shared mount is invalid. ***************************************************************************
.. Note:: moving a mount residing under a shared mount is invalid.
Details follow: Details follow:
1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
are created and mounted at dentry 'b' on all mounts that receive are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. A new propagation tree is created in the propagation from mount 'B'. A new propagation tree is created in the
@ -483,7 +506,7 @@ replicas continue to be exactly same.
propagation tree is appended to the already existing propagation tree propagation tree is appended to the already existing propagation tree
of 'A'. of 'A'.
2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
are created and mounted at dentry 'b' on all mounts that receive are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. The mount 'A' becomes a shared mount and a propagation from mount 'B'. The mount 'A' becomes a shared mount and a
@ -491,7 +514,7 @@ replicas continue to be exactly same.
'B'. This new propagation tree contains all the new mounts 'A1', 'B'. This new propagation tree contains all the new mounts 'A1',
'A2'... 'An'. 'A2'... 'An'.
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
'A2'... 'An' are created and mounted at dentry 'b' on all mounts that 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
receive propagation from mount 'B'. A new propagation tree is created receive propagation from mount 'B'. A new propagation tree is created
@ -501,32 +524,32 @@ replicas continue to be exactly same.
'A'. Mount 'A' continues to be the slave mount of 'Z' but it also 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
becomes 'shared'. becomes 'shared'.
4. 'A' is a unbindable mount and 'B' is a shared mount. The operation 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
is invalid. Because mounting anything on the shared mount 'B' can is invalid. Because mounting anything on the shared mount 'B' can
create new mounts that get mounted on the mounts that receive create new mounts that get mounted on the mounts that receive
propagation from 'B'. And since the mount 'A' is unbindable, cloning propagation from 'B'. And since the mount 'A' is unbindable, cloning
it to mount at other mountpoints is not possible. it to mount at other mountpoints is not possible.
5. 'A' is a private mount and 'B' is a non-shared(private or slave or 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
shared mount. shared mount.
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
continues to be a slave mount of mount 'Z'. continues to be a slave mount of mount 'Z'.
8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
unbindable mount. unbindable mount.
5e) Mount semantics 5e) Mount semantics
Consider the following command Consider the following command::
mount device B/b mount device B/b
'B' is the destination mount and 'b' is the dentry in the destination 'B' is the destination mount and 'b' is the dentry in the destination
mount. mount.
@ -537,9 +560,9 @@ replicas continue to be exactly same.
5f) Unmount semantics 5f) Unmount semantics
Consider the following command Consider the following command::
umount A umount A
where 'A' is a mount mounted on mount 'B' at dentry 'b'. where 'A' is a mount mounted on mount 'B' at dentry 'b'.
@ -592,10 +615,12 @@ replicas continue to be exactly same.
A. What is the result of the following command sequence? A. What is the result of the following command sequence?
mount --bind /mnt /mnt ::
mount --make-shared /mnt
mount --bind /mnt /tmp mount --bind /mnt /mnt
mount --move /tmp /mnt/1 mount --make-shared /mnt
mount --bind /mnt /tmp
mount --move /tmp /mnt/1
what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
Should they all be identical? or should /mnt and /mnt/1 be Should they all be identical? or should /mnt and /mnt/1 be
@ -604,23 +629,27 @@ replicas continue to be exactly same.
B. What is the result of the following command sequence? B. What is the result of the following command sequence?
mount --make-rshared / ::
mkdir -p /v/1
mount --rbind / /v/1 mount --make-rshared /
mkdir -p /v/1
mount --rbind / /v/1
what should be the content of /v/1/v/1 be? what should be the content of /v/1/v/1 be?
C. What is the result of the following command sequence? C. What is the result of the following command sequence?
mount --bind /mnt /mnt ::
mount --make-shared /mnt
mkdir -p /mnt/1/2/3 /mnt/1/test mount --bind /mnt /mnt
mount --bind /mnt/1 /tmp mount --make-shared /mnt
mount --make-slave /mnt mkdir -p /mnt/1/2/3 /mnt/1/test
mount --make-shared /mnt mount --bind /mnt/1 /tmp
mount --bind /mnt/1/2 /tmp1 mount --make-slave /mnt
mount --make-slave /mnt mount --make-shared /mnt
mount --bind /mnt/1/2 /tmp1
mount --make-slave /mnt
At this point we have the first mount at /tmp and At this point we have the first mount at /tmp and
its root dentry is 1. Let's call this mount 'A' its root dentry is 1. Let's call this mount 'A'
@ -668,7 +697,8 @@ replicas continue to be exactly same.
step 1: step 1:
let's say the root tree has just two directories with let's say the root tree has just two directories with
one vfsmount. one vfsmount::
root root
/ \ / \
tmp usr tmp usr
@ -676,14 +706,17 @@ replicas continue to be exactly same.
And we want to replicate the tree at multiple And we want to replicate the tree at multiple
mountpoints under /root/tmp mountpoints under /root/tmp
step2: step 2:
mount --make-shared /root ::
mkdir -p /tmp/m1
mount --rbind /root /tmp/m1 mount --make-shared /root
the new tree now looks like this: mkdir -p /tmp/m1
mount --rbind /root /tmp/m1
the new tree now looks like this::
root root
/ \ / \
@ -697,11 +730,13 @@ replicas continue to be exactly same.
it has two vfsmounts it has two vfsmounts
step3: step 3:
::
mkdir -p /tmp/m2 mkdir -p /tmp/m2
mount --rbind /root /tmp/m2 mount --rbind /root /tmp/m2
the new tree now looks like this: the new tree now looks like this::
root root
/ \ / \
@ -724,6 +759,7 @@ replicas continue to be exactly same.
it has 6 vfsmounts it has 6 vfsmounts
step 4: step 4:
::
mkdir -p /tmp/m3 mkdir -p /tmp/m3
mount --rbind /root /tmp/m3 mount --rbind /root /tmp/m3
@ -740,7 +776,8 @@ replicas continue to be exactly same.
step 1: step 1:
let's say the root tree has just two directories with let's say the root tree has just two directories with
one vfsmount. one vfsmount::
root root
/ \ / \
tmp usr tmp usr
@ -748,17 +785,20 @@ replicas continue to be exactly same.
How do we set up the same tree at multiple locations under How do we set up the same tree at multiple locations under
/root/tmp /root/tmp
step2: step 2:
mount --bind /root/tmp /root/tmp ::
mount --make-rshared /root
mount --make-unbindable /root/tmp
mkdir -p /tmp/m1 mount --bind /root/tmp /root/tmp
mount --rbind /root /tmp/m1 mount --make-rshared /root
mount --make-unbindable /root/tmp
the new tree now looks like this: mkdir -p /tmp/m1
mount --rbind /root /tmp/m1
the new tree now looks like this::
root root
/ \ / \
@ -768,11 +808,13 @@ replicas continue to be exactly same.
/ \ / \
tmp usr tmp usr
step3: step 3:
::
mkdir -p /tmp/m2 mkdir -p /tmp/m2
mount --rbind /root /tmp/m2 mount --rbind /root /tmp/m2
the new tree now looks like this: the new tree now looks like this::
root root
/ \ / \
@ -782,12 +824,13 @@ replicas continue to be exactly same.
/ \ / \ / \ / \
tmp usr tmp usr tmp usr tmp usr
step4: step 4:
::
mkdir -p /tmp/m3 mkdir -p /tmp/m3
mount --rbind /root /tmp/m3 mount --rbind /root /tmp/m3
the new tree now looks like this: the new tree now looks like this::
root root
/ \ / \
@ -801,25 +844,31 @@ replicas continue to be exactly same.
8A) Datastructure 8A) Datastructure
4 new fields are introduced to struct vfsmount 4 new fields are introduced to struct vfsmount:
->mnt_share
->mnt_slave_list
->mnt_slave
->mnt_master
->mnt_share links together all the mount to/from which this vfsmount * ->mnt_share
* ->mnt_slave_list
* ->mnt_slave
* ->mnt_master
->mnt_share
links together all the mount to/from which this vfsmount
send/receives propagation events. send/receives propagation events.
->mnt_slave_list links all the mounts to which this vfsmount propagates ->mnt_slave_list
links all the mounts to which this vfsmount propagates
to. to.
->mnt_slave links together all the slaves that its master vfsmount ->mnt_slave
links together all the slaves that its master vfsmount
propagates to. propagates to.
->mnt_master points to the master vfsmount from which this vfsmount ->mnt_master
points to the master vfsmount from which this vfsmount
receives propagation. receives propagation.
->mnt_flags takes two more flags to indicate the propagation status of ->mnt_flags
takes two more flags to indicate the propagation status of
the vfsmount. MNT_SHARE indicates that the vfsmount is a shared the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
replicated. replicated.
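The relationship between the four fields can be pictured with a toy model. The sketch below is not kernel code: the `Mount` class and the `make_peers`, `add_slave` and `receivers` helpers are hypothetical names chosen for illustration, and plain Python lists stand in for the kernel's linked lists. It assumes the simplified rule that an event travels to peers and down to slaves, but never from a slave up to its master.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Mount:
    """Toy stand-in for struct vfsmount; only propagation links are modeled."""
    name: str
    # Peers this mount exchanges propagation events with (models ->mnt_share).
    peers: List["Mount"] = field(default_factory=list)
    # Mounts this mount propagates to one-way (models ->mnt_slave_list).
    slaves: List["Mount"] = field(default_factory=list)
    # Mount this one receives propagation from (models ->mnt_master).
    master: Optional["Mount"] = None


def make_peers(*mounts: Mount) -> None:
    """Put all given mounts into one peer group."""
    for m in mounts:
        m.peers = [p for p in mounts if p is not m]


def add_slave(master: Mount, slave: Mount) -> None:
    """Link a slave to its master in both directions."""
    master.slaves.append(slave)
    slave.master = master


def receivers(mount: Mount) -> List[str]:
    """Names of the mounts reached by an event on `mount` (simplified rule)."""
    reached = []
    visited = {id(mount)}
    stack = [mount]
    while stack:
        cur = stack.pop()
        for nxt in cur.peers + cur.slaves:
            if id(nxt) not in visited:
                visited.add(id(nxt))
                reached.append(nxt.name)
                stack.append(nxt)
    return sorted(reached)
```

With `A` and `B` as peers and `E` as a slave of `A`, an event on `B` reaches `A` and `E`, while an event on `E` reaches nothing because a slave does not propagate to its master.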
@ -842,7 +891,7 @@ replicas continue to be exactly same.
A example propagation tree looks as shown in the figure below. A example propagation tree looks as shown in the figure below.
[ NOTE: Though it looks like a forest, if we consider all the shared [ NOTE: Though it looks like a forest, if we consider all the shared
mounts as a conceptual entity called 'pnode', it becomes a tree] mounts as a conceptual entity called 'pnode', it becomes a tree]::
A <--> B <--> C <---> D A <--> B <--> C <---> D
@ -864,14 +913,19 @@ replicas continue to be exactly same.
A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
E's ->mnt_share links with ->mnt_share of K E's ->mnt_share links with ->mnt_share of K
'E', 'K', 'F', 'G' have their ->mnt_master point to struct
vfsmount of 'A' 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
J and K's ->mnt_master points to struct vfsmount of C J and K's ->mnt_master points to struct vfsmount of C
and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
@ -903,6 +957,7 @@ replicas continue to be exactly same.
Prepare phase: Prepare phase:
for each mount in the source tree: for each mount in the source tree:
a) Create the necessary number of mount trees to a) Create the necessary number of mount trees to
be attached to each of the mounts that receive be attached to each of the mounts that receive
propagation from the destination mount. propagation from the destination mount.
@ -929,11 +984,12 @@ replicas continue to be exactly same.
Abort phase Abort phase
delete all the newly created trees. delete all the newly created trees.
NOTE: all the propagation related functionality resides in the file .. Note::
pnode.c all the propagation related functionality resides in the file pnode.c
------------------------------------------------------------------------ ------------------------------------------------------------------------
version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
version 0.2 (Incorporated comments from Al Viro) version 0.2 (Incorporated comments from Al Viro)


@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
==============
SPU Filesystem
==============
.. toctree::
:maxdepth: 1
spufs
spu_create
spu_run


@ -0,0 +1,131 @@
.. SPDX-License-Identifier: GPL-2.0
==========
spu_create
==========
Name
====
spu_create - create a new spu context
Synopsis
========
::
#include <sys/types.h>
#include <sys/spu.h>
int spu_create(const char *pathname, int flags, mode_t mode);
Description
===========
The spu_create system call is used on PowerPC machines that implement
the Cell Broadband Engine Architecture in order to access Synergistic
Processor Units (SPUs). It creates a new logical context for an SPU in
pathname and returns a handle associated with it. pathname must
point to a non-existing directory in the mount point of the SPU file
system (spufs). When spu_create is successful, a directory gets cre-
ated on pathname and it is populated with files.
The returned file handle can only be passed to spu_run(2) or closed,
other operations are not defined on it. When it is closed, all associ-
ated directory entries in spufs are removed. When the last file handle
pointing either inside of the context directory or to this file
descriptor is closed, the logical SPU context is destroyed.
The parameter flags can be zero or any bitwise or'd combination of the
following constants:
SPU_RAWIO
Allow mapping of some of the hardware registers of the SPU into
user space. This flag requires the CAP_SYS_RAWIO capability, see
capabilities(7).
The mode parameter specifies the permissions used for creating the new
directory in spufs. mode is modified with the user's umask(2) value
and then used for both the directory and the files contained in it. The
file permissions mask out some more bits of mode because they typically
support only read or write access. See stat(2) for a full list of the
possible mode values.
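The umask adjustment described above can be sketched as follows. This is a minimal illustration of the usual `mode & ~umask` rule, assuming spufs applies it the same way as other file creation calls; the helper name `effective_dir_mode` is hypothetical, and the additional per-file masking mentioned in the text is not modeled.

```python
import stat


def effective_dir_mode(mode: int, umask: int) -> int:
    """Permission bits left after clearing the bits set in the umask."""
    return mode & ~umask & 0o777


def describe(mode: int) -> str:
    """Render directory permission bits in ls-style notation."""
    return stat.filemode(stat.S_IFDIR | mode)
```

For example, a requested mode of `0o777` under a umask of `0o022` leaves `0o755`, which `describe` renders as `drwxr-xr-x`.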
Return Value
============
spu_create returns a new file descriptor. It may return -1 to indicate
an error condition and set errno to one of the error codes listed
below.
Errors
======
EACCES
The current user does not have write access on the spufs mount
point.
EEXIST An SPU context already exists at the given path name.
EFAULT pathname is not a valid string pointer in the current address
space.
EINVAL pathname is not a directory in the spufs mount point.
ELOOP Too many symlinks were found while resolving pathname.
EMFILE The process has reached its maximum open file limit.
ENAMETOOLONG
pathname was too long.
ENFILE The system has reached the global open file limit.
ENOENT Part of pathname could not be resolved.
ENOMEM The kernel could not allocate all resources required.
ENOSPC There are not enough SPU resources available to create a new
context or the user specific limit for the number of SPU con-
texts has been reached.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
ENOTDIR
A part of pathname is not a directory.
Notes
=====
spu_create is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
Files
=====
pathname must point to a location beneath the mount point of spufs. By
convention, it gets mounted in /spu.
Conforming to
=============
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
Bugs
====
The code does not yet fully implement all features outlined here.
Author
======
Arnd Bergmann <arndb@de.ibm.com>
See Also
========
capabilities(7), close(2), spu_run(2), spufs(7)


@ -0,0 +1,138 @@
.. SPDX-License-Identifier: GPL-2.0
=======
spu_run
=======
Name
====
spu_run - execute an spu context
Synopsis
========
::
#include <sys/spu.h>
int spu_run(int fd, unsigned int *npc, unsigned int *event);
Description
===========
The spu_run system call is used on PowerPC machines that implement the
Cell Broadband Engine Architecture in order to access Synergistic Pro-
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
ate(2) to address a specific SPU context. When the context gets sched-
uled to a physical SPU, it starts execution at the instruction pointer
passed in npc.
Execution of SPU code happens synchronously, meaning that spu_run does
not return while the SPU is still running. If there is a need to exe-
cute SPU code in parallel with other code on either the main CPU or
other SPUs, you need to create a new thread of execution first, e.g.
using the pthread_create(3) call.
When spu_run returns, the current value of the SPU instruction pointer
is written back to npc, so you can call spu_run again without updating
the pointers.
event can be a NULL pointer or point to an extended status code that
gets filled when spu_run returns. It can be one of the following con-
stants:
SPE_EVENT_DMA_ALIGNMENT
A DMA alignment error
SPE_EVENT_SPE_DATA_SEGMENT
A DMA segmentation error
SPE_EVENT_SPE_DATA_STORAGE
A DMA storage error
If NULL is passed as the event argument, these errors will result in a
signal delivered to the calling process.
Return Value
============
spu_run returns the value of the spu_status register or -1 to indicate
an error and set errno to one of the error codes listed below. The
spu_status register value contains a bit mask of status codes and
optionally a 14 bit code returned from the stop-and-signal instruction
on the SPU. The bit masks for the status codes are:
0x02
SPU was stopped by stop-and-signal.
0x04
SPU was stopped by halt.
0x08
SPU is waiting for a channel.
0x10
SPU is in single-step mode.
0x20
SPU has tried to execute an invalid instruction.
0x40
SPU has tried to access an invalid channel.
0x3fff0000
The bits masked with this value contain the code returned from
stop-and-signal.
There are always one or more of the lower eight bits set or an error
code is returned from spu_run.
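A caller might unpack the returned status word along these lines. This is a sketch that uses only the bit masks listed above; the function name `decode_spu_status` and the dictionary layout are illustrative assumptions, not part of any SPU library.

```python
# Status bit masks as listed in the Return Value section.
STATUS_BITS = {
    0x02: "stopped by stop-and-signal",
    0x04: "stopped by halt",
    0x08: "waiting for a channel",
    0x10: "single-step mode",
    0x20: "invalid instruction",
    0x40: "invalid channel",
}
STOP_CODE_MASK = 0x3FFF0000  # 14-bit stop-and-signal code, bits 16..29


def decode_spu_status(status: int):
    """Return (list of reasons, stop code or None) for a status word."""
    reasons = [text for bit, text in STATUS_BITS.items() if status & bit]
    # The stop code is only meaningful after a stop-and-signal.
    stop_code = (status & STOP_CODE_MASK) >> 16 if status & 0x02 else None
    return reasons, stop_code
```

The stop code is extracted only when the stop-and-signal bit is set, since the text describes it as optional.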
Errors
======
EAGAIN or EWOULDBLOCK
fd is in non-blocking mode and spu_run would block.
EBADF fd is not a valid file descriptor.
EFAULT npc is not a valid pointer or event is neither NULL nor a valid
pointer.
EINTR A signal occurred while spu_run was in progress. The npc value
has been updated to the new program counter value if necessary.
EINVAL fd is not a file descriptor returned from spu_create(2).
ENOMEM Insufficient memory was available to handle a page fault result-
ing from an MFC direct memory access.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
Notes
=====
spu_run is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
Conforming to
=============
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
Bugs
====
The code does not yet fully implement all features outlined here.
Author
======
Arnd Bergmann <arndb@de.ibm.com>
See Also
========
capabilities(7), close(2), spu_create(2), spufs(7)


@ -1,12 +1,18 @@
SPUFS(2) Linux Programmer's Manual SPUFS(2) .. SPDX-License-Identifier: GPL-2.0
=====
spufs
=====
Name
====
NAME
spufs - the SPU file system spufs - the SPU file system
DESCRIPTION Description
===========
The SPU file system is used on PowerPC machines that implement the Cell The SPU file system is used on PowerPC machines that implement the Cell
Broadband Engine Architecture in order to access Synergistic Processor Broadband Engine Architecture in order to access Synergistic Processor
Units (SPUs). Units (SPUs).
@ -21,7 +27,9 @@ DESCRIPTION
ally add or remove files. ally add or remove files.
MOUNT OPTIONS Mount Options
=============
uid=<uid> uid=<uid>
set the user owning the mount point, the default is 0 (root). set the user owning the mount point, the default is 0 (root).
@ -29,7 +37,9 @@ MOUNT OPTIONS
set the group owning the mount point, the default is 0 (root). set the group owning the mount point, the default is 0 (root).
FILES Files
=====
The files in spufs mostly follow the standard behavior for regular sys- The files in spufs mostly follow the standard behavior for regular sys-
tem calls like read(2) or write(2), but often support only a subset of tem calls like read(2) or write(2), but often support only a subset of
the operations supported on regular file systems. This list details the the operations supported on regular file systems. This list details the
@ -125,14 +135,12 @@ FILES
space is available for writing. space is available for writing.
/mbox_stat /mbox_stat, /ibox_stat, /wbox_stat
/ibox_stat
/wbox_stat
Read-only files that contain the length of the current queue, i.e. how Read-only files that contain the length of the current queue, i.e. how
many words can be read from mbox or ibox or how many words can be many words can be read from mbox or ibox or how many words can be
written to wbox without blocking. The files can be read only in 4-byte written to wbox without blocking. The files can be read only in 4-byte
units and return a big-endian binary integer number. The possible units and return a big-endian binary integer number. The possible
operations on an open *box_stat file are: operations on an open ``*box_stat`` file are:
read(2) read(2)
If a count smaller than four is requested, read returns -1 and If a count smaller than four is requested, read returns -1 and
@ -143,12 +151,7 @@ FILES
in EAGAIN. in EAGAIN.
/npc /npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0
/decr
/decr_status
/spu_tag_mask
/event_mask
/srr0
Internal registers of the SPU. The representation is an ASCII string Internal registers of the SPU. The representation is an ASCII string
with the numeric value of the next instruction to be executed. These with the numeric value of the next instruction to be executed. These
can be used in read/write mode for debugging, but normal operation of can be used in read/write mode for debugging, but normal operation of
@ -157,17 +160,14 @@ FILES
The contents of these files are: The contents of these files are:
=================== ===================================
npc Next Program Counter npc Next Program Counter
decr SPU Decrementer decr SPU Decrementer
decr_status Decrementer Status decr_status Decrementer Status
spu_tag_mask MFC tag mask for SPU DMA spu_tag_mask MFC tag mask for SPU DMA
event_mask Event mask for SPU interrupts event_mask Event mask for SPU interrupts
srr0 Interrupt Return address register srr0 Interrupt Return address register
=================== ===================================
The possible operations on an open npc, decr, decr_status, The possible operations on an open npc, decr, decr_status,
@ -206,8 +206,7 @@ FILES
from the data buffer, updating the value of the fpcr register. from the data buffer, updating the value of the fpcr register.
/signal1 /signal1, /signal2
/signal2
The two signal notification channels of an SPU. These are read-write The two signal notification channels of an SPU. These are read-write
files that operate on a 32 bit word. Writing to one of these files files that operate on a 32 bit word. Writing to one of these files
triggers an interrupt on the SPU. The value written to the signal triggers an interrupt on the SPU. The value written to the signal
@ -233,8 +232,7 @@ FILES
file. file.
/signal1_type /signal1_type, /signal2_type
/signal2_type
These two files change the behavior of the signal1 and signal2 notifi- These two files change the behavior of the signal1 and signal2 notifi-
cation files. The contain a numerical ASCII string which is read as cation files. The contain a numerical ASCII string which is read as
either "1" or "0". In mode 0 (overwrite), the hardware replaces the either "1" or "0". In mode 0 (overwrite), the hardware replaces the
@ -259,263 +257,17 @@ FILES
the previous setting. the previous setting.
EXAMPLES Examples
========
/etc/fstab entry /etc/fstab entry
none /spu spufs gid=spu 0 0 none /spu spufs gid=spu 0 0
AUTHORS Authors
=======
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
Ulrich Weigand <Ulrich.Weigand@de.ibm.com> Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
SEE ALSO See Also
========
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPUFS(2)
------------------------------------------------------------------------------
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
NAME
spu_run - execute an spu context
SYNOPSIS
#include <sys/spu.h>
int spu_run(int fd, unsigned int *npc, unsigned int *event);
DESCRIPTION
The spu_run system call is used on PowerPC machines that implement the
Cell Broadband Engine Architecture in order to access Synergistic Pro-
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
ate(2) to address a specific SPU context. When the context gets sched-
uled to a physical SPU, it starts execution at the instruction pointer
passed in npc.
Execution of SPU code happens synchronously, meaning that spu_run does
not return while the SPU is still running. If there is a need to exe-
cute SPU code in parallel with other code on either the main CPU or
other SPUs, you need to create a new thread of execution first, e.g.
using the pthread_create(3) call.
When spu_run returns, the current value of the SPU instruction pointer
is written back to npc, so you can call spu_run again without updating
the pointers.
event can be a NULL pointer or point to an extended status code that
gets filled when spu_run returns. It can be one of the following con-
stants:
SPE_EVENT_DMA_ALIGNMENT
A DMA alignment error
SPE_EVENT_SPE_DATA_SEGMENT
A DMA segmentation error
SPE_EVENT_SPE_DATA_STORAGE
A DMA storage error
If NULL is passed as the event argument, these errors will result in a
signal delivered to the calling process.
RETURN VALUE
spu_run returns the value of the spu_status register or -1 to indicate
an error and set errno to one of the error codes listed below. The
spu_status register value contains a bit mask of status codes and
optionally a 14 bit code returned from the stop-and-signal instruction
on the SPU. The bit masks for the status codes are:
0x02 SPU was stopped by stop-and-signal.
0x04 SPU was stopped by halt.
0x08 SPU is waiting for a channel.
0x10 SPU is in single-step mode.
0x20 SPU has tried to execute an invalid instruction.
0x40 SPU has tried to access an invalid channel.
0x3fff0000
The bits masked with this value contain the code returned from
stop-and-signal.
There are always one or more of the lower eight bits set or an error
code is returned from spu_run.
ERRORS
EAGAIN or EWOULDBLOCK
fd is in non-blocking mode and spu_run would block.
EBADF fd is not a valid file descriptor.
EFAULT npc is not a valid pointer or status is neither NULL nor a valid
pointer.
EINTR A signal occurred while spu_run was in progress. The npc value
has been updated to the new program counter value if necessary.
EINVAL fd is not a file descriptor returned from spu_create(2).
ENOMEM Insufficient memory was available to handle a page fault result-
ing from an MFC direct memory access.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
NOTES
spu_run is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features lined out here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_create(2), spufs(7)
Linux 2005-09-28 SPU_RUN(2)
------------------------------------------------------------------------------
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
NAME
spu_create - create a new spu context
SYNOPSIS
#include <sys/types.h>
#include <sys/spu.h>
int spu_create(const char *pathname, int flags, mode_t mode);
DESCRIPTION
The spu_create system call is used on PowerPC machines that implement
the Cell Broadband Engine Architecture in order to access Synergistic
Processor Units (SPUs). It creates a new logical context for an SPU in
pathname and returns a handle to associated with it. pathname must
point to a non-existing directory in the mount point of the SPU file
system (spufs). When spu_create is successful, a directory gets cre-
ated on pathname and it is populated with files.
The returned file handle can only be passed to spu_run(2) or closed,
other operations are not defined on it. When it is closed, all associ-
ated directory entries in spufs are removed. When the last file handle
pointing either inside of the context directory or to this file
descriptor is closed, the logical SPU context is destroyed.
The parameter flags can be zero or any bitwise or'd combination of the
following constants:
SPU_RAWIO
Allow mapping of some of the hardware registers of the SPU into
user space. This flag requires the CAP_SYS_RAWIO capability, see
capabilities(7).
The mode parameter specifies the permissions used for creating the new
directory in spufs. mode is modified with the user's umask(2) value
and then used for both the directory and the files contained in it. The
file permissions mask out some more bits of mode because they typically
support only read or write access. See stat(2) for a full list of the
possible mode values.
RETURN VALUE
spu_create returns a new file descriptor. It may return -1 to indicate
an error condition and set errno to one of the error codes listed
below.
ERRORS
EACCES
The current user does not have write access on the spufs mount
point.
EEXIST An SPU context already exists at the given path name.
EFAULT pathname is not a valid string pointer in the current address
space.
EINVAL pathname is not a directory in the spufs mount point.
ELOOP Too many symlinks were found while resolving pathname.
EMFILE The process has reached its maximum open file limit.
ENAMETOOLONG
pathname was too long.
ENFILE The system has reached the global open file limit.
ENOENT Part of pathname could not be resolved.
ENOMEM The kernel could not allocate all resources required.
ENOSPC There are not enough SPU resources available to create a new
context or the user specific limit for the number of SPU con-
texts has been reached.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
ENOTDIR
A part of pathname is not a directory.
NOTES
spu_create is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
FILES
pathname must point to a location beneath the mount point of spufs. By
convention, it gets mounted in /spu.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features lined out here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPU_CREATE(2)


@ -1,8 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
============================================
Accessing PCI device resources through sysfs Accessing PCI device resources through sysfs
-------------------------------------------- ============================================
sysfs, usually mounted at /sys, provides access to PCI resources on platforms sysfs, usually mounted at /sys, provides access to PCI resources on platforms
that support it. For example, a given bus might look like this: that support it. For example, a given bus might look like this::
/sys/devices/pci0000:17 /sys/devices/pci0000:17
|-- 0000:17:00.0 |-- 0000:17:00.0
@ -30,8 +33,9 @@ This bus contains a single function device in slot 0. The domain and bus
numbers are reproduced for convenience. Under the device directory are several numbers are reproduced for convenience. Under the device directory are several
files, each with their own function. files, each with their own function.
=================== =====================================================
file function file function
---- -------- =================== =====================================================
class PCI class (ascii, ro) class PCI class (ascii, ro)
config PCI config space (binary, rw) config PCI config space (binary, rw)
device PCI device (ascii, ro) device PCI device (ascii, ro)
@ -40,13 +44,16 @@ files, each with their own function.
local_cpus nearby CPU mask (cpumask, ro) local_cpus nearby CPU mask (cpumask, ro)
remove remove device from kernel's list (ascii, wo) remove remove device from kernel's list (ascii, wo)
resource PCI resource host addresses (ascii, ro) resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap, rw[1]) resource0..N PCI resource N, if present (binary, mmap, rw\ [1]_)
resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
revision PCI revision (ascii, ro) revision PCI revision (ascii, ro)
rom PCI ROM resource, if present (binary, ro) rom PCI ROM resource, if present (binary, ro)
subsystem_device PCI subsystem device (ascii, ro) subsystem_device PCI subsystem device (ascii, ro)
subsystem_vendor PCI subsystem vendor (ascii, ro) subsystem_vendor PCI subsystem vendor (ascii, ro)
vendor PCI vendor (ascii, ro) vendor PCI vendor (ascii, ro)
=================== =====================================================
::
ro - read only file ro - read only file
rw - file is readable and writable rw - file is readable and writable
@ -56,7 +63,7 @@ files, each with their own function.
binary - file contains binary data binary - file contains binary data
cpumask - file contains a cpumask type cpumask - file contains a cpumask type
[1] rw for RESOURCE_IO (I/O port) regions only .. [1] rw for RESOURCE_IO (I/O port) regions only
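The difference between the binary and ASCII files in the table can be illustrated with a short sketch. It relies on the standard PCI configuration header layout, in which the vendor ID occupies bytes 0-1 and the device ID bytes 2-3, both little-endian; the assumption that the ASCII files hold a hexadecimal string such as `0x8086` is labeled as such, and both function names are hypothetical.

```python
import struct


def ids_from_config(config: bytes):
    """Extract (vendor_id, device_id) from raw PCI config space bytes."""
    if len(config) < 4:
        raise ValueError("need at least the first 4 bytes of config space")
    # Two little-endian 16-bit fields at offsets 0x00 and 0x02.
    vendor, device = struct.unpack_from("<HH", config, 0)
    return vendor, device


def id_from_ascii(text: str) -> int:
    """Parse the content of an ASCII file such as 'vendor' (assumed hex)."""
    return int(text.strip(), 16)
```

In practice the bytes would come from reading a device's `config` file in binary mode and the string from its `vendor` file; both should then agree on the vendor ID.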
The read only files are informational, writes to them will be ignored, with The read only files are informational, writes to them will be ignored, with
the exception of the 'rom' file. Writable files can be used to perform the exception of the 'rom' file. Writable files can be used to perform
@ -67,11 +74,11 @@ don't support mmapping of certain resources, so be sure to check the return
value from any attempted mmap. The most notable of these are I/O port value from any attempted mmap. The most notable of these are I/O port
resources, which also provide read/write access. resources, which also provide read/write access.
The 'enable' file provides a counter that indicates how many times the device The 'enable' file provides a counter that indicates how many times the device
has been enabled. If the 'enable' file currently returns '4', and a '1' is has been enabled. If the 'enable' file currently returns '4', and a '1' is
echoed into it, it will then return '5'. Echoing a '0' into it will decrease echoed into it, it will then return '5'. Echoing a '0' into it will decrease
the count. Even when it returns to 0, though, some of the initialisation the count. Even when it returns to 0, though, some of the initialisation
may not be reversed. may not be reversed.
The 'rom' file is special in that it provides read-only access to the device's The 'rom' file is special in that it provides read-only access to the device's
ROM file, if available. It's disabled by default, however, so applications ROM file, if available. It's disabled by default, however, so applications
@ -93,7 +100,7 @@ Accessing legacy resources through sysfs
Legacy I/O port and ISA memory resources are also provided in sysfs if the Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class hierarchy, underlying platform supports them. They're located in the PCI class hierarchy,
e.g. e.g.::
/sys/class/pci_bus/0000:17/ /sys/class/pci_bus/0000:17/
|-- bridge -> ../../../devices/pci0000:17 |-- bridge -> ../../../devices/pci0000:17

Some files were not shown because too many files have changed in this diff.