Merge drm/drm-next into drm-xe-next
Backmerging drm-next in order to get up-to-date and in particular to access commit 9ca5facd0400f610f3f7f71aeb7fc0b949a48c67.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
commit 79790b6818
@ -1,4 +1,5 @@
|
||||
Alan Cox <alan@lxorguk.ukuu.org.uk>
|
||||
Alan Cox <root@hraefn.swansea.linux.org.uk>
|
||||
Christoph Hellwig <hch@lst.de>
|
||||
Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
||||
Marc Gonzalez <marc.w.gonzalez@free.fr>
|
||||
|
1
.gitignore
vendored
@ -52,6 +52,7 @@
|
||||
*.xz
|
||||
*.zst
|
||||
Module.symvers
|
||||
dtbs-list
|
||||
modules.order
|
||||
|
||||
#
|
||||
|
27
.mailmap
@ -191,10 +191,11 @@ Gao Xiang <xiang@kernel.org> <gaoxiang25@huawei.com>
|
||||
Gao Xiang <xiang@kernel.org> <hsiangkao@aol.com>
|
||||
Gao Xiang <xiang@kernel.org> <hsiangkao@linux.alibaba.com>
|
||||
Gao Xiang <xiang@kernel.org> <hsiangkao@redhat.com>
|
||||
Geliang Tang <geliang.tang@linux.dev> <geliang.tang@suse.com>
|
||||
Geliang Tang <geliang.tang@linux.dev> <geliangtang@xiaomi.com>
|
||||
Geliang Tang <geliang.tang@linux.dev> <geliangtang@gmail.com>
|
||||
Geliang Tang <geliang.tang@linux.dev> <geliangtang@163.com>
|
||||
Geliang Tang <geliang@kernel.org> <geliang.tang@linux.dev>
|
||||
Geliang Tang <geliang@kernel.org> <geliang.tang@suse.com>
|
||||
Geliang Tang <geliang@kernel.org> <geliangtang@xiaomi.com>
|
||||
Geliang Tang <geliang@kernel.org> <geliangtang@gmail.com>
|
||||
Geliang Tang <geliang@kernel.org> <geliangtang@163.com>
|
||||
Georgi Djakov <djakov@kernel.org> <georgi.djakov@linaro.org>
|
||||
Gerald Schaefer <gerald.schaefer@linux.ibm.com> <geraldsc@de.ibm.com>
|
||||
Gerald Schaefer <gerald.schaefer@linux.ibm.com> <gerald.schaefer@de.ibm.com>
|
||||
@ -289,6 +290,7 @@ Johan Hovold <johan@kernel.org> <johan@hovoldconsulting.com>
|
||||
John Crispin <john@phrozen.org> <blogic@openwrt.org>
|
||||
John Fastabend <john.fastabend@gmail.com> <john.r.fastabend@intel.com>
|
||||
John Keeping <john@keeping.me.uk> <john@metanate.com>
|
||||
John Moon <john@jmoon.dev> <quic_johmoo@quicinc.com>
|
||||
John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
|
||||
John Stultz <johnstul@us.ibm.com>
|
||||
<jon.toppins+linux@gmail.com> <jtoppins@cumulusnetworks.com>
|
||||
@ -323,6 +325,7 @@ Kenneth W Chen <kenneth.w.chen@intel.com>
|
||||
Kenneth Westfield <quic_kwestfie@quicinc.com> <kwestfie@codeaurora.org>
|
||||
Kiran Gunda <quic_kgunda@quicinc.com> <kgunda@codeaurora.org>
|
||||
Kirill Tkhai <tkhai@ya.ru> <ktkhai@virtuozzo.com>
|
||||
Kishon Vijay Abraham I <kishon@kernel.org> <kishon@ti.com>
|
||||
Konstantin Khlebnikov <koct9i@gmail.com> <khlebnikov@yandex-team.ru>
|
||||
Konstantin Khlebnikov <koct9i@gmail.com> <k.khlebnikov@samsung.com>
|
||||
Koushik <raghavendra.koushik@neterion.com>
|
||||
@ -337,13 +340,15 @@ Lee Jones <lee@kernel.org> <joneslee@google.com>
|
||||
Lee Jones <lee@kernel.org> <lee.jones@canonical.com>
|
||||
Lee Jones <lee@kernel.org> <lee.jones@linaro.org>
|
||||
Lee Jones <lee@kernel.org> <lee@ubuntu.com>
|
||||
Leonard Crestez <leonard.crestez@nxp.com> Leonard Crestez <cdleonard@gmail.com>
|
||||
Leonard Crestez <cdleonard@gmail.com> <leonard.crestez@nxp.com>
|
||||
Leonard Crestez <cdleonard@gmail.com> <leonard.crestez@intel.com>
|
||||
Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com>
|
||||
Leonard Göhrs <l.goehrs@pengutronix.de>
|
||||
Leonid I Ananiev <leonid.i.ananiev@intel.com>
|
||||
Leon Romanovsky <leon@kernel.org> <leon@leon.nu>
|
||||
Leon Romanovsky <leon@kernel.org> <leonro@mellanox.com>
|
||||
Leon Romanovsky <leon@kernel.org> <leonro@nvidia.com>
|
||||
Leo Yan <leo.yan@linux.dev> <leo.yan@linaro.org>
|
||||
Liam Mark <quic_lmark@quicinc.com> <lmark@codeaurora.org>
|
||||
Linas Vepstas <linas@austin.ibm.com>
|
||||
Linus Lüssing <linus.luessing@c0d3.blue> <linus.luessing@ascom.ch>
|
||||
@ -435,6 +440,8 @@ Mukesh Ojha <quic_mojha@quicinc.com> <mojha@codeaurora.org>
|
||||
Muna Sinada <quic_msinada@quicinc.com> <msinada@codeaurora.org>
|
||||
Murali Nalajala <quic_mnalajal@quicinc.com> <mnalajal@codeaurora.org>
|
||||
Mythri P K <mythripk@ti.com>
|
||||
Nadav Amit <nadav.amit@gmail.com> <namit@vmware.com>
|
||||
Nadav Amit <nadav.amit@gmail.com> <namit@cs.technion.ac.il>
|
||||
Nadia Yvette Chambers <nyc@holomorphy.com> William Lee Irwin III <wli@holomorphy.com>
|
||||
Naoya Horiguchi <naoya.horiguchi@nec.com> <n-horiguchi@ah.jp.nec.com>
|
||||
Nathan Chancellor <nathan@kernel.org> <natechancellor@gmail.com>
|
||||
@ -491,7 +498,8 @@ Prasad Sodagudi <quic_psodagud@quicinc.com> <psodagud@codeaurora.org>
|
||||
Punit Agrawal <punitagrawal@gmail.com> <punit.agrawal@arm.com>
|
||||
Qais Yousef <qyousef@layalina.io> <qais.yousef@imgtec.com>
|
||||
Qais Yousef <qyousef@layalina.io> <qais.yousef@arm.com>
|
||||
Quentin Monnet <quentin@isovalent.com> <quentin.monnet@netronome.com>
|
||||
Quentin Monnet <qmo@kernel.org> <quentin.monnet@netronome.com>
|
||||
Quentin Monnet <qmo@kernel.org> <quentin@isovalent.com>
|
||||
Quentin Perret <qperret@qperret.net> <quentin.perret@arm.com>
|
||||
Rafael J. Wysocki <rjw@rjwysocki.net> <rjw@sisk.pl>
|
||||
Rajeev Nandan <quic_rajeevny@quicinc.com> <rajeevny@codeaurora.org>
|
||||
@ -550,6 +558,7 @@ Senthilkumar N L <quic_snlakshm@quicinc.com> <snlakshm@codeaurora.org>
|
||||
Serge Hallyn <sergeh@kernel.org> <serge.hallyn@canonical.com>
|
||||
Serge Hallyn <sergeh@kernel.org> <serue@us.ibm.com>
|
||||
Seth Forshee <sforshee@kernel.org> <seth.forshee@canonical.com>
|
||||
Shakeel Butt <shakeel.butt@linux.dev> <shakeelb@google.com>
|
||||
Shannon Nelson <shannon.nelson@amd.com> <snelson@pensando.io>
|
||||
Shannon Nelson <shannon.nelson@amd.com> <shannon.nelson@intel.com>
|
||||
Shannon Nelson <shannon.nelson@amd.com> <shannon.nelson@oracle.com>
|
||||
@ -568,6 +577,7 @@ Simon Kelley <simon@thekelleys.org.uk>
|
||||
Sricharan Ramabadhran <quic_srichara@quicinc.com> <sricharan@codeaurora.org>
|
||||
Srinivas Ramana <quic_sramana@quicinc.com> <sramana@codeaurora.org>
|
||||
Sriram R <quic_srirrama@quicinc.com> <srirrama@codeaurora.org>
|
||||
Stefan Wahren <wahrenst@gmx.net> <stefan.wahren@i2se.com>
|
||||
Stéphane Witzmann <stephane.witzmann@ubpmes.univ-bpclermont.fr>
|
||||
Stephen Hemminger <stephen@networkplumber.org> <shemminger@linux-foundation.org>
|
||||
Stephen Hemminger <stephen@networkplumber.org> <shemminger@osdl.org>
|
||||
@ -605,6 +615,11 @@ TripleX Chung <xxx.phy@gmail.com> <triplex@zh-kernel.org>
|
||||
TripleX Chung <xxx.phy@gmail.com> <zhongyu@18mail.cn>
|
||||
Tsuneo Yoshioka <Tsuneo.Yoshioka@f-secure.com>
|
||||
Tudor Ambarus <tudor.ambarus@linaro.org> <tudor.ambarus@microchip.com>
|
||||
Tvrtko Ursulin <tursulin@ursulin.net> <tvrtko.ursulin@intel.com>
|
||||
Tvrtko Ursulin <tursulin@ursulin.net> <tvrtko.ursulin@linux.intel.com>
|
||||
Tvrtko Ursulin <tursulin@ursulin.net> <tvrtko.ursulin@sophos.com>
|
||||
Tvrtko Ursulin <tursulin@ursulin.net> <tvrtko.ursulin@onelan.co.uk>
|
||||
Tvrtko Ursulin <tursulin@ursulin.net> <tvrtko@ursulin.net>
|
||||
Tycho Andersen <tycho@tycho.pizza> <tycho@tycho.ws>
|
||||
Tzung-Bi Shih <tzungbi@kernel.org> <tzungbi@google.com>
|
||||
Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
|
||||
|
10
CREDITS
@ -63,6 +63,11 @@ D: dosfs, LILO, some fd features, ATM, various other hacks here and there
|
||||
S: Buenos Aires
|
||||
S: Argentina
|
||||
|
||||
NTFS FILESYSTEM
|
||||
N: Anton Altaparmakov
|
||||
E: anton@tuxera.com
|
||||
D: NTFS filesystem
|
||||
|
||||
N: Tim Alpaerts
|
||||
E: tim_alpaerts@toyota-motor-europe.com
|
||||
D: 802.2 class II logical link control layer,
|
||||
@ -2955,6 +2960,11 @@ S: 2364 Old Trail Drive
|
||||
S: Reston, Virginia 20191
|
||||
S: USA
|
||||
|
||||
N: Sekhar Nori
|
||||
E: nori.sekhar@gmail.com
|
||||
D: Maintainer of Texas Instruments DaVinci machine support, contributor
|
||||
D: to device drivers relevant to that SoC family.
|
||||
|
||||
N: Fredrik Noring
|
||||
E: noring@nocrew.org
|
||||
W: http://www.lysator.liu.se/~noring/
|
||||
|
@ -28,5 +28,5 @@ Description:
|
||||
/label ... (r/o) descriptive, not necessarily unique
|
||||
/ngpio ... (r/o) number of GPIOs; numbered N to N + (ngpio - 1)
|
||||
|
||||
This ABI is deprecated and will be removed after 2020. It is
|
||||
replaced with the GPIO character device.
|
||||
This ABI is obsoleted by Documentation/ABI/testing/gpio-cdev and will be
|
||||
removed after 2020.
|
||||
|
@ -4,6 +4,14 @@ KernelVersion: 3.13
|
||||
Description: The purpose of this directory is to create and remove it.
|
||||
|
||||
A corresponding USB function instance is created/removed.
|
||||
There are no attributes here.
|
||||
|
||||
All parameters are set through FunctionFS.
|
||||
All attributes are read only:
|
||||
|
||||
============= ============================================
|
||||
ready 1 if the function is ready to be used, e.g.
|
||||
if userspace has written descriptors and
|
||||
strings to ep0, so the gadget can be
|
||||
enabled - 0 otherwise.
|
||||
============= ============================================
|
||||
|
||||
All other parameters are set through FunctionFS.
|
||||
|
@ -33,3 +33,37 @@ Description:
|
||||
device cannot clear poison from the address, -ENXIO is returned.
|
||||
The clear_poison attribute is only visible for devices
|
||||
supporting the capability.
|
||||
|
||||
What: /sys/kernel/debug/cxl/einj_types
|
||||
Date: January, 2024
|
||||
KernelVersion: v6.9
|
||||
Contact: linux-cxl@vger.kernel.org
|
||||
Description:
|
||||
(RO) Prints the CXL protocol error types made available by
|
||||
the platform in the format:
|
||||
|
||||
0x<error number> <error type>
|
||||
|
||||
The possible error types are (as of ACPI v6.5):
|
||||
|
||||
0x1000 CXL.cache Protocol Correctable
|
||||
0x2000 CXL.cache Protocol Uncorrectable non-fatal
|
||||
0x4000 CXL.cache Protocol Uncorrectable fatal
|
||||
0x8000 CXL.mem Protocol Correctable
|
||||
0x10000 CXL.mem Protocol Uncorrectable non-fatal
|
||||
0x20000 CXL.mem Protocol Uncorrectable fatal
|
||||
|
||||
The <error number> can be written to einj_inject to inject
|
||||
<error type> into a chosen dport.
|
||||
|
||||
What: /sys/kernel/debug/cxl/$dport_dev/einj_inject
|
||||
Date: January, 2024
|
||||
KernelVersion: v6.9
|
||||
Contact: linux-cxl@vger.kernel.org
|
||||
Description:
|
||||
(WO) Writing an integer to this file injects the corresponding
|
||||
CXL protocol error into $dport_dev ($dport_dev will be a device
|
||||
name from /sys/bus/pci/devices). The integer to type mapping for
|
||||
injection can be found by reading from einj_types. If the dport
|
||||
was enumerated in RCH mode, a CXL 1.1 error is injected, otherwise
|
||||
a CXL 2.0 error is injected.
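The einj_types format described above lends itself to a small parser. The following is a minimal Python sketch, assuming each line holds an error number, whitespace, then the error type. The names `parse_einj_types` and `inject` are hypothetical helpers and not part of the kernel, and writing the number in hexadecimal to einj_inject is an assumption.

```python
from pathlib import Path


def parse_einj_types(text):
    """Map each error number (int) to its error type string."""
    types = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        number, name = fields
        types[int(number, 16)] = name.strip()
    return types


def inject(dport_dev, number, root="/sys/kernel/debug/cxl"):
    """Write an error number to <root>/<dport_dev>/einj_inject (needs root).

    Hypothetical helper: the hexadecimal text format is an assumption.
    """
    Path(root, dport_dev, "einj_inject").write_text(f"{number:#x}\n")


# Sample text copied from the list of possible error types above.
SAMPLE = """\
0x1000 CXL.cache Protocol Correctable
0x8000 CXL.mem Protocol Correctable
0x20000 CXL.mem Protocol Uncorrectable fatal
"""

einj_types = parse_einj_types(SAMPLE)
```

On a real system the text would come from reading /sys/kernel/debug/cxl/einj_types, and only numbers reported there should be passed to `inject`.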
|
||||
|
@ -81,3 +81,29 @@ Description: (RO) Read returns, for each Acceleration Engine (AE), the number
|
||||
<N>: Number of Compress and Verify (CnV) errors and type
|
||||
of the last CnV error detected by Acceleration
|
||||
Engine N.
|
||||
|
||||
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/inject_error
|
||||
Date: March 2024
|
||||
KernelVersion: 6.8
|
||||
Contact: qat-linux@intel.com
|
||||
Description: (WO) Write to inject an error that simulates a heartbeat
|
||||
failure. This is to be used for testing purposes.
|
||||
|
||||
After writing this file, the driver stops arbitration on a
|
||||
random engine and disables the fetching of heartbeat counters.
|
||||
If a workload is running on the device, a job submitted to the
|
||||
accelerator might not get a response and a read of the
|
||||
`heartbeat/status` attribute might report -1, i.e. device
|
||||
unresponsive.
|
||||
The error is unrecoverable, so the device must be restarted to
|
||||
restore its functionality.
|
||||
|
||||
This attribute is available only when the kernel is built with
|
||||
CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION=y.
|
||||
|
||||
A write of 1 enables error injection.
|
||||
|
||||
The following example shows how to enable error injection::
|
||||
|
||||
# cd /sys/kernel/debug/qat_<device>_<BDF>
|
||||
# echo 1 > heartbeat/inject_error
|
||||
|
@ -111,6 +111,28 @@ Description: QM debug registers(regs) read hardware register value. This
|
||||
node is used to show the change of the qm register values. This
|
||||
node can help users check changes in register values.
|
||||
|
||||
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/qm_state
|
||||
Date: Jan 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the state of the device.
|
||||
0: busy, 1: idle.
|
||||
Only available for PF, and has no other effect on HPRE.
|
||||
|
||||
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/dev_timeout
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Set the wait time when stop queue fails. Available for both PF
|
||||
and VF, and has no other effect on HPRE.
|
||||
0: do not wait (default); other values: wait dev_timeout * 20 microseconds.
|
||||
|
||||
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/dev_state
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the stop queue status of the QM. The default value is 0,
|
||||
if dev_timeout is set, when stop queue fails, the dev_state
|
||||
will return a non-zero value. Available for both PF and VF,
|
||||
and has no other effect on HPRE.
|
||||
|
||||
What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/diff_regs
|
||||
Date: Mar 2022
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
|
@ -91,6 +91,28 @@ Description: QM debug registers(regs) read hardware register value. This
|
||||
node is used to show the change of the qm register values. This
|
||||
node can help users check changes in register values.
|
||||
|
||||
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_state
|
||||
Date: Jan 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the state of the device.
|
||||
0: busy, 1: idle.
|
||||
Only available for PF, and has no other effect on SEC.
|
||||
|
||||
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/dev_timeout
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Set the wait time when stop queue fails. Available for both PF
|
||||
and VF, and has no other effect on SEC.
|
||||
0: do not wait (default); other values: wait dev_timeout * 20 microseconds.
|
||||
|
||||
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/dev_state
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the stop queue status of the QM. The default value is 0,
|
||||
if dev_timeout is set, when stop queue fails, the dev_state
|
||||
will return a non-zero value. Available for both PF and VF,
|
||||
and has no other effect on SEC.
|
||||
|
||||
What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/diff_regs
|
||||
Date: Mar 2022
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
|
@ -104,6 +104,28 @@ Description: QM debug registers(regs) read hardware register value. This
|
||||
node is used to show the change of the qm registers value. This
|
||||
node can help users check changes in register values.
|
||||
|
||||
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/qm_state
|
||||
Date: Jan 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the state of the device.
|
||||
0: busy, 1: idle.
|
||||
Only available for PF, and has no other effect on ZIP.
|
||||
|
||||
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/dev_timeout
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Set the wait time when stop queue fails. Available for both PF
|
||||
and VF, and has no other effect on ZIP.
|
||||
0: do not wait (default); other values: wait dev_timeout * 20 microseconds.
|
||||
|
||||
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/dev_state
|
||||
Date: Feb 2024
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
Description: Dump the stop queue status of the QM. The default value is 0,
|
||||
if dev_timeout is set, when stop queue fails, the dev_state
|
||||
will return a non-zero value. Available for both PF and VF,
|
||||
and has no other effect on ZIP.
|
||||
|
||||
What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/diff_regs
|
||||
Date: Mar 2022
|
||||
Contact: linux-crypto@vger.kernel.org
|
||||
|
276
Documentation/ABI/testing/debugfs-intel-iommu
Normal file
@ -0,0 +1,276 @@
|
||||
What: /sys/kernel/debug/iommu/intel/iommu_regset
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file dumps all the register contents for each IOMMU device.
|
||||
|
||||
Example in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/iommu_regset
|
||||
|
||||
IOMMU: dmar0 Register Base Address: 26be37000
|
||||
|
||||
Name Offset Contents
|
||||
VER 0x00 0x0000000000000010
|
||||
GCMD 0x18 0x0000000000000000
|
||||
GSTS 0x1c 0x00000000c7000000
|
||||
FSTS 0x34 0x0000000000000000
|
||||
FECTL 0x38 0x0000000000000000
|
||||
|
||||
[...]
|
||||
|
||||
IOMMU: dmar1 Register Base Address: fed90000
|
||||
|
||||
Name Offset Contents
|
||||
VER 0x00 0x0000000000000010
|
||||
GCMD 0x18 0x0000000000000000
|
||||
GSTS 0x1c 0x00000000c7000000
|
||||
FSTS 0x34 0x0000000000000000
|
||||
FECTL 0x38 0x0000000000000000
|
||||
|
||||
[...]
|
||||
|
||||
IOMMU: dmar2 Register Base Address: fed91000
|
||||
|
||||
Name Offset Contents
|
||||
VER 0x00 0x0000000000000010
|
||||
GCMD 0x18 0x0000000000000000
|
||||
GSTS 0x1c 0x00000000c7000000
|
||||
FSTS 0x34 0x0000000000000000
|
||||
FECTL 0x38 0x0000000000000000
|
||||
|
||||
[...]
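The register dump shown above has a regular shape, so it can be turned into a lookup table for scripting. This is a minimal Python sketch under the assumption that the real output matches the example (an `IOMMU:` header line followed by name, offset and contents columns). `parse_regset` is a hypothetical helper name.

```python
import re

# Header line, e.g. "IOMMU: dmar0 Register Base Address: 26be37000"
HEADER = re.compile(r"IOMMU:\s+(\S+)\s+Register Base Address:\s+([0-9a-fA-F]+)")
# Register line, e.g. "GSTS 0x1c 0x00000000c7000000"
REGISTER = re.compile(r"^(\w+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)$")


def parse_regset(text):
    """Return {iommu_name: {"base": int, "regs": {name: (offset, contents)}}}."""
    iommus = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {"base": int(header.group(2), 16), "regs": {}}
            iommus[header.group(1)] = current
            continue
        reg = REGISTER.match(line)
        if reg and current is not None:
            current["regs"][reg.group(1)] = (
                int(reg.group(2), 16),
                int(reg.group(3), 16),
            )
    return iommus


# Sample text copied from the Kabylake example above.
SAMPLE = """\
IOMMU: dmar0 Register Base Address: 26be37000

Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
"""

regset = parse_regset(SAMPLE)
```

The column header line is skipped because it carries no `0x` values; elided `[...]` lines are ignored for the same reason.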
|
||||
|
||||
What: /sys/kernel/debug/iommu/intel/ir_translation_struct
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file dumps the table entries for Interrupt
|
||||
remapping and Interrupt posting.
|
||||
|
||||
Example in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/ir_translation_struct
|
||||
|
||||
Remapped Interrupt supported on IOMMU: dmar0
|
||||
IR table address:100900000
|
||||
|
||||
Entry SrcID DstID Vct IRTE_high IRTE_low
|
||||
0 00:0a.0 00000080 24 0000000000040050 000000800024000d
|
||||
1 00:0a.0 00000001 ef 0000000000040050 0000000100ef000d
|
||||
|
||||
Remapped Interrupt supported on IOMMU: dmar1
|
||||
IR table address:100300000
|
||||
Entry SrcID DstID Vct IRTE_high IRTE_low
|
||||
0 00:02.0 00000002 26 0000000000040010 000000020026000d
|
||||
|
||||
[...]
|
||||
|
||||
****
|
||||
|
||||
Posted Interrupt supported on IOMMU: dmar0
|
||||
IR table address:100900000
|
||||
Entry SrcID PDA_high PDA_low Vct IRTE_high IRTE_low
|
||||
|
||||
What: /sys/kernel/debug/iommu/intel/dmar_translation_struct
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file dumps Intel IOMMU DMA remapping tables, such
|
||||
as root table, context table, PASID directory and PASID
|
||||
table entries in debugfs. For legacy mode, it doesn't
|
||||
support PASID, and hence PASID field is defaulted to
|
||||
'-1' and other PASID related fields are invalid.
|
||||
|
||||
Example in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_translation_struct
|
||||
|
||||
IOMMU dmar1: Root Table Address: 0x103027000
|
||||
B.D.F Root_entry
|
||||
00:02.0 0x0000000000000000:0x000000010303e001
|
||||
|
||||
Context_entry
|
||||
0x0000000000000102:0x000000010303f005
|
||||
|
||||
PASID PASID_table_entry
|
||||
-1 0x0000000000000000:0x0000000000000000:0x0000000000000000
|
||||
|
||||
IOMMU dmar0: Root Table Address: 0x103028000
|
||||
B.D.F Root_entry
|
||||
00:0a.0 0x0000000000000000:0x00000001038a7001
|
||||
|
||||
Context_entry
|
||||
0x0000000000000000:0x0000000103220e7d
|
||||
|
||||
PASID PASID_table_entry
|
||||
0 0x0000000000000000:0x0000000000800002:0x00000001038a5089
|
||||
|
||||
[...]
|
||||
|
||||
What: /sys/kernel/debug/iommu/intel/invalidation_queue
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file exports invalidation queue internals of each
|
||||
IOMMU device.
|
||||
|
||||
Example in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/invalidation_queue
|
||||
|
||||
Invalidation queue on IOMMU: dmar0
|
||||
Base: 0x10022e000 Head: 20 Tail: 20
|
||||
Index qw0 qw1 qw2
|
||||
0 0000000000000014 0000000000000000 0000000000000000
|
||||
1 0000000200000025 0000000100059c04 0000000000000000
|
||||
2 0000000000000014 0000000000000000 0000000000000000
|
||||
|
||||
qw3 status
|
||||
0000000000000000 0000000000000000
|
||||
0000000000000000 0000000000000000
|
||||
0000000000000000 0000000000000000
|
||||
|
||||
[...]
|
||||
|
||||
Invalidation queue on IOMMU: dmar1
|
||||
Base: 0x10026e000 Head: 32 Tail: 32
|
||||
Index qw0 qw1 status
|
||||
0 0000000000000004 0000000000000000 0000000000000000
|
||||
1 0000000200000025 0000000100059804 0000000000000000
|
||||
2 0000000000000011 0000000000000000 0000000000000000
|
||||
|
||||
[...]
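For monitoring scripts, the per-IOMMU summary line of the dump above is often enough. The sketch below, in Python, extracts the base, head and tail values. It assumes Head and Tail are printed in decimal as in the example, and it treats head equal to tail as "no descriptors pending", which is an assumption about the queue semantics rather than something this file states. `parse_queues` is a hypothetical helper name.

```python
import re

QUEUE = re.compile(r"Invalidation queue on IOMMU:\s+(\S+)")
STATE = re.compile(r"Base:\s+(0x[0-9a-fA-F]+)\s+Head:\s+(\d+)\s+Tail:\s+(\d+)")


def parse_queues(text):
    """Return {iommu_name: {"base", "head", "tail", "idle"}} from a dump."""
    queues = {}
    name = None
    for line in text.splitlines():
        queue = QUEUE.search(line)
        if queue:
            name = queue.group(1)
            continue
        state = STATE.search(line)
        if state and name is not None:
            head, tail = int(state.group(2)), int(state.group(3))
            queues[name] = {
                "base": int(state.group(1), 16),
                "head": head,
                "tail": tail,
                # Assumption: equal head and tail means nothing is pending.
                "idle": head == tail,
            }
    return queues


# Sample text copied from the Kabylake example above.
SAMPLE = """\
Invalidation queue on IOMMU: dmar0
Base: 0x10022e000 Head: 20 Tail: 20
Invalidation queue on IOMMU: dmar1
Base: 0x10026e000 Head: 32 Tail: 32
"""

queues = parse_queues(SAMPLE)
```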
|
||||
|
||||
What: /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file is used to control and show counts of
|
||||
execution time ranges for various types per DMAR.
|
||||
|
||||
Firstly, write a value to
|
||||
/sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
to enable sampling.
|
||||
|
||||
The possible values are as follows:
|
||||
|
||||
* 0 - disable sampling all latency data
|
||||
|
||||
* 1 - enable sampling IOTLB invalidation latency data
|
||||
|
||||
* 2 - enable sampling devTLB invalidation latency data
|
||||
|
||||
* 3 - enable sampling intr entry cache invalidation latency data
|
||||
|
||||
Next, reading /sys/kernel/debug/iommu/intel/dmar_perf_latency gives
|
||||
a snapshot of the sampling results of all enabled monitors.
|
||||
|
||||
Examples in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
1) Disable sampling all latency data:
|
||||
|
||||
$ echo 0 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
|
||||
2) Enable sampling IOTLB invalidation latency data
|
||||
|
||||
$ echo 1 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
|
||||
IOMMU: dmar0 Register Base Address: 26be37000
|
||||
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
|
||||
inv_iotlb 0 0 0 0 0
|
||||
|
||||
1ms-10ms >=10ms min(us) max(us) average(us)
|
||||
inv_iotlb 0 0 0 0 0
|
||||
|
||||
[...]
|
||||
|
||||
IOMMU: dmar2 Register Base Address: fed91000
|
||||
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
|
||||
inv_iotlb 0 0 18 0 0
|
||||
|
||||
1ms-10ms >=10ms min(us) max(us) average(us)
|
||||
inv_iotlb 0 0 2 2 2
|
||||
|
||||
3) Enable sampling devTLB invalidation latency data
|
||||
|
||||
$ echo 2 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
|
||||
|
||||
IOMMU: dmar0 Register Base Address: 26be37000
|
||||
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
|
||||
inv_devtlb 0 0 0 0 0
|
||||
|
||||
>=10ms min(us) max(us) average(us)
|
||||
inv_devtlb 0 0 0 0
|
||||
|
||||
[...]
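The four control values listed above can be wrapped in a small helper so that callers use names instead of bare integers. This is a minimal Python sketch; the mode names and `set_sampling` are hypothetical, and the demonstration writes to a temporary file standing in for the debugfs node so that it runs without root.

```python
import tempfile
from pathlib import Path

# Values taken from the list of possible values above; the keys are
# hypothetical names chosen for this sketch.
SAMPLING_MODES = {
    "disable": 0,
    "iotlb": 1,
    "devtlb": 2,
    "intr_entry_cache": 3,
}

DEFAULT_NODE = "/sys/kernel/debug/iommu/intel/dmar_perf_latency"


def set_sampling(mode, path=DEFAULT_NODE):
    """Write the control value for `mode` to the latency node."""
    value = SAMPLING_MODES[mode]
    Path(path).write_text(f"{value}\n")
    return value


# Demonstration against a stand-in file instead of the real debugfs node.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "dmar_perf_latency"
    stand_in.touch()
    written = set_sampling("iotlb", stand_in)
    content = stand_in.read_text()
```

On a real system, `set_sampling("iotlb")` would need to run as root with debugfs mounted, after which the node can be read back for the sampling snapshot.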
|
||||
|
||||
What: /sys/kernel/debug/iommu/intel/<bdf>/domain_translation_struct
|
||||
Date: December 2023
|
||||
Contact: Jingqi Liu <Jingqi.liu@intel.com>
|
||||
Description:
|
||||
This file dumps a specified page table of Intel IOMMU
|
||||
in legacy mode or scalable mode.
|
||||
|
||||
For a device that only supports legacy mode, dump its
|
||||
page table by the debugfs file in the debugfs device
|
||||
directory. e.g.
|
||||
/sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct.
|
||||
|
||||
For a device that supports scalable mode, dump the
|
||||
page table of specified pasid by the debugfs file in
|
||||
the debugfs pasid directory. e.g.
|
||||
/sys/kernel/debug/iommu/intel/0000:00:02.0/1/domain_translation_struct.
|
||||
|
||||
Examples in Kabylake:
|
||||
|
||||
::
|
||||
|
||||
1) Dump the page table of device "0000:00:02.0" that only supports legacy mode.
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct
|
||||
|
||||
Device 0000:00:02.0 @0x1017f8000
|
||||
IOVA_PFN PML5E PML4E
|
||||
0x000000008d800 | 0x0000000000000000 0x00000001017f9003
|
||||
0x000000008d801 | 0x0000000000000000 0x00000001017f9003
|
||||
0x000000008d802 | 0x0000000000000000 0x00000001017f9003
|
||||
|
||||
PDPE PDE PTE
|
||||
0x00000001017fa003 0x00000001017fb003 0x000000008d800003
|
||||
0x00000001017fa003 0x00000001017fb003 0x000000008d801003
|
||||
0x00000001017fa003 0x00000001017fb003 0x000000008d802003
|
||||
|
||||
[...]
|
||||
|
||||
2) Dump the page table of device "0000:00:0a.0" with PASID "1" that
|
||||
supports scalable mode.
|
||||
|
||||
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:0a.0/1/domain_translation_struct
|
||||
|
||||
Device 0000:00:0a.0 with pasid 1 @0x10c112000
|
||||
IOVA_PFN PML5E PML4E
|
||||
0x0000000000000 | 0x0000000000000000 0x000000010df93003
|
||||
0x0000000000001 | 0x0000000000000000 0x000000010df93003
|
||||
0x0000000000002 | 0x0000000000000000 0x000000010df93003
|
||||
|
||||
PDPE PDE PTE
|
||||
0x0000000106ae6003 0x0000000104b38003 0x0000000147c00803
|
||||
0x0000000106ae6003 0x0000000104b38003 0x0000000147c01803
|
||||
0x0000000106ae6003 0x0000000104b38003 0x0000000147c02803
|
||||
|
||||
[...]
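To relate the IOVA_PFN column to the entries in the dump above, it can help to split each value into its page-aligned part and its low bits. The Python sketch below assumes 4 KiB pages (a 12-bit page offset) and treats everything above the low 12 bits as the address part, which is a simplification: upper bits of an entry may carry attributes as well. The function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def iova_from_pfn(pfn):
    """Convert an IOVA page frame number into the IOVA it starts at."""
    return pfn << PAGE_SHIFT


def split_entry(entry):
    """Split a table entry into (page-aligned part, low 12 bits)."""
    low_mask = (1 << PAGE_SHIFT) - 1
    return entry & ~low_mask, entry & low_mask


# First row of the legacy-mode example above.
iova = iova_from_pfn(0x8D800)
addr, low = split_entry(0x000000008D800003)
```

For that first row the page-aligned part of the PTE equals the IOVA, so the sample row describes a mapping where the device address and the target address coincide.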
|
@ -6,8 +6,9 @@ Description:
|
||||
The character device files /dev/gpiochip* are the interface
|
||||
between GPIO chips and userspace.
|
||||
|
||||
The ioctl(2)-based ABI is defined and documented in
|
||||
[include/uapi]<linux/gpio.h>.
|
||||
The ioctl(2)-based ABI is defined in
|
||||
[include/uapi]<linux/gpio.h> and documented in
|
||||
Documentation/userspace-api/gpio/chardev.rst.
|
||||
|
||||
The following file operations are supported:
|
||||
|
||||
@ -17,8 +18,8 @@ Description:
|
||||
ioctl(2)
|
||||
Initiate various actions.
|
||||
|
||||
See the inline documentation in [include/uapi]<linux/gpio.h>
|
||||
for descriptions of all ioctls.
|
||||
See Documentation/userspace-api/gpio/chardev.rst
|
||||
for a description of all ioctls.
|
||||
|
||||
close(2)
|
||||
Stops and frees up the I/O contexts that were associated
|
||||
|
@ -170,3 +170,90 @@ Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_t
|
||||
Description:
|
||||
(RW) Set/Get the MSR(mux select register) for the DSB subunit
|
||||
TPDM.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_mode
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description: (Write) Set the data collection mode of CMB tpdm. Continuous
|
||||
change creates CMB data set elements on every CMBCLK edge.
|
||||
Trace-on-change creates CMB data set elements only when a new
|
||||
data set element differs in value from the previous element
|
||||
in a CMB data set.
|
||||
|
||||
Accepts only one of the 2 values - 0 or 1.
|
||||
0 : Continuous CMB collection mode.
|
||||
1 : Trace-on-change CMB collection mode.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_patt/xpr[0:1]
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the value of the trigger pattern for the CMB
|
||||
subunit TPDM.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_patt/xpmr[0:1]
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the mask of the trigger pattern for the CMB
|
||||
subunit TPDM.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/dsb_patt/tpr[0:1]
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the value of the pattern for the CMB subunit TPDM.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/dsb_patt/tpmr[0:1]
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the mask of the pattern for the CMB subunit TPDM.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_patt/enable_ts
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(Write) Set the pattern timestamp of CMB tpdm. Read
|
||||
the pattern timestamp of CMB tpdm.
|
||||
|
||||
Accepts only one of the 2 values - 0 or 1.
|
||||
0 : Disable CMB pattern timestamp.
|
||||
1 : Enable CMB pattern timestamp.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_ts
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the trigger timestamp of the CMB for tpdm.
|
||||
|
||||
Accepts only one of the 2 values - 0 or 1.
|
||||
0 : Set the CMB trigger type to false
|
||||
1 : Set the CMB trigger type to true
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_ts_all
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Read or write the timestamp status for all interfaces.
|
||||
Only the values 0 and 1 can be written to this node. Set this node to 1 to request
|
||||
timestamps on all trace packets.
|
||||
Accepts only one of the 2 values - 0 or 1.
|
||||
0 : Disable the timestamp of all trace packets.
|
||||
1 : Enable the timestamp of all trace packets.
|
||||
|
||||
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_msr/msr[0:31]
|
||||
Date: January 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
|
||||
Description:
|
||||
(RW) Set/Get the MSR(mux select register) for the CMB subunit
|
||||
TPDM.
|
||||
|
@ -552,3 +552,37 @@ Description:
|
||||
attribute is only visible for devices supporting the
|
||||
capability. The retrieved errors are logged as kernel
|
||||
events when cxl_poison event tracing is enabled.
|
||||
|
||||
|
||||
What: /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
|
||||
/sys/bus/cxl/devices/regionZ/accessY/write_bandwidth
|
||||
Date: Jan, 2024
|
||||
KernelVersion: v6.9
|
||||
Contact: linux-cxl@vger.kernel.org
|
||||
Description:
|
||||
(RO) The aggregated read or write bandwidth of the region. The
|
||||
number is the accumulated read or write bandwidth of all CXL memory
|
||||
devices that contribute to the region in MB/s. It is
|
||||
identical data that should appear in
|
||||
/sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth or
|
||||
/sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth.
|
||||
See Documentation/ABI/stable/sysfs-devices-node. access0 provides
|
||||
the number to the closest initiator and access1 provides the
|
||||
number to the closest CPU.
|
||||
|
||||
|
||||
What: /sys/bus/cxl/devices/regionZ/accessY/read_latency
|
||||
/sys/bus/cxl/devices/regionZ/accessY/write_latency
|
||||
Date: Jan, 2024
|
||||
KernelVersion: v6.9
|
||||
Contact: linux-cxl@vger.kernel.org
|
||||
Description:
|
||||
(RO) The read or write latency of the region. The number is
|
||||
the worst read or write latency of all CXL memory devices that
|
||||
contribute to the region in nanoseconds. It is identical data
|
||||
that should appear in
|
||||
/sys/devices/system/node/nodeX/accessY/initiators/read_latency or
|
||||
/sys/devices/system/node/nodeX/accessY/initiators/write_latency.
|
||||
See Documentation/ABI/stable/sysfs-devices-node. access0 provides
|
||||
the number to the closest initiator and access1 provides the
|
||||
number to the closest CPU.
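As an illustration of how the region access attributes described above might be consumed from userspace, here is a minimal Python sketch. The helper name `read_region_access` and the choice to pass the `accessY` directory as an argument are assumptions made for this example; only the attribute file names come from the entries above, and the values are assumed to be plain decimal integers.

```python
from pathlib import Path

# Attribute names taken from the ABI entries; everything else is illustrative.
ATTRS = ("read_bandwidth", "write_bandwidth", "read_latency", "write_latency")


def read_region_access(access_dir):
    """Return {attribute: int} for each attribute file present in access_dir.

    Bandwidth values are documented in MB/s, latency values in nanoseconds.
    Attributes that are not present are simply left out of the result.
    """
    values = {}
    for name in ATTRS:
        path = Path(access_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values
```

A caller would pass something like the `access0` directory of a region; skipping missing files keeps the helper usable on kernels or devices that do not expose every attribute.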
|
||||
|
153
Documentation/ABI/testing/sysfs-bus-dax
Normal file
@ -0,0 +1,153 @@
|
||||
What: /sys/bus/dax/devices/daxX.Y/align
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RW) Provides a way to specify an alignment for a dax device.
|
||||
Values allowed are constrained by the physical address ranges
|
||||
that back the dax device, and also by arch requirements.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/mapping
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(WO) Provides a way to allocate a mapping range under a dax
|
||||
device. Specified in the format <start>-<end>.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
|
||||
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
|
||||
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) A dax device may have multiple constituent discontiguous
|
||||
address ranges. These are represented by the different
|
||||
'mappingX' subdirectories. The 'start' attribute indicates the
|
||||
start physical address for the given range. The 'end' attribute
|
||||
indicates the end physical address for the given range. The
|
||||
'page_offset' attribute indicates the offset of the current
|
||||
range in the dax device.
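The layout of the 'mappingX' subdirectories can be walked with a short script. The following Python sketch is only an illustration: the function name `read_dax_mappings` is hypothetical, and it assumes each attribute holds a single integer, written either with a `0x` prefix or as plain decimal.

```python
import re
from pathlib import Path


def read_dax_mappings(dax_dir):
    """Return one dict per mappingN subdirectory, ordered by N.

    Each dict has the keys 'start', 'end' and 'page_offset'.
    """
    found = []
    for sub in Path(dax_dir).iterdir():
        match = re.fullmatch(r"mapping(\d+)", sub.name)
        # Skip the write-only 'mapping' file and anything that is not mappingN/.
        if match is None or not sub.is_dir():
            continue
        fields = {
            # int(text, 0) accepts both "0x..." and plain decimal strings.
            attr: int((sub / attr).read_text().strip(), 0)
            for attr in ("start", "end", "page_offset")
        }
        found.append((int(match.group(1)), fields))
    return [fields for _, fields in sorted(found, key=lambda item: item[0])]
```

Sorting on the numeric suffix rather than the directory name keeps `mapping10` after `mapping2`.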
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/resource
|
||||
Date: June, 2019
|
||||
KernelVersion: v5.3
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The resource attribute indicates the starting physical
|
||||
address of a dax device. In case of a device with multiple
|
||||
constituent ranges, it indicates the starting address of the
|
||||
first range.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/size
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RW) The size attribute indicates the total size of a dax
|
||||
device. For creating subdivided dax devices, or for resizing
|
||||
an existing device, the new size can be written to this as
|
||||
part of the reconfiguration process.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/numa_node
|
||||
Date: November, 2019
|
||||
KernelVersion: v5.5
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) If NUMA is enabled and the platform has affinitized the
|
||||
backing device for this dax device, emit the CPU node
|
||||
affinity for this device.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/target_node
|
||||
Date: February, 2019
|
||||
KernelVersion: v5.1
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The target-node attribute is the Linux numa-node that a
|
||||
device-dax instance may create when it is online. Prior to
|
||||
being online the device's 'numa_node' property reflects the
|
||||
closest online cpu node which is the typical expectation of a
|
||||
device 'numa_node'. Once it is online it becomes its own
|
||||
distinct numa node.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The available_size attribute tracks available dax region
|
||||
capacity. This only applies to volatile hmem devices, not pmem
|
||||
devices, since pmem devices are defined by nvdimm namespace
|
||||
boundaries.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size
|
||||
Date: July, 2017
|
||||
KernelVersion: v5.1
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The size attribute indicates the size of a given dax region
|
||||
in bytes.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The align attribute indicates alignment of the dax region.
|
||||
Changes on align may not always be valid, when say certain
|
||||
mappings were created with 2M and then we switch to 1G. This
|
||||
validates all ranges against the new value being attempted, post
|
||||
resizing.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/seed
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The seed device is a concept for dynamic dax regions to be
|
||||
able to split the region amongst multiple sub-instances. The
|
||||
seed device, similar to libnvdimm seed devices, is a device
|
||||
that starts with zero capacity allocated and unbound to a
|
||||
driver.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/create
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RW) The create interface to the dax region provides a way to
|
||||
create a new unconfigured dax device under the given region, which
|
||||
can then be configured (with a size etc.) and then probed.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/delete
|
||||
Date: October, 2020
|
||||
KernelVersion: v5.10
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(WO) The delete interface for a dax region provides for deletion
|
||||
of any 0-sized and idle dax devices.
|
||||
|
||||
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/id
|
||||
Date: July, 2017
|
||||
KernelVersion: v5.1
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RO) The id attribute indicates the region id of a dax region.
|
||||
|
||||
What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory
|
||||
Date: January, 2024
|
||||
KernelVersion: v6.8
|
||||
Contact: nvdimm@lists.linux.dev
|
||||
Description:
|
||||
(RW) Control the memmap_on_memory setting if the dax device
|
||||
were to be hotplugged as system memory. This determines whether
|
||||
the 'altmap' for the hotplugged memory will be placed on the
|
||||
device being hotplugged (memmap_on_memory=1) or if it will be
|
||||
placed on regular memory (memmap_on_memory=0). This attribute
|
||||
must be set before the device is handed over to the 'kmem'
|
||||
driver (i.e. hotplugged into system-ram). Additionally, this
|
||||
depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
|
||||
memmap_on_memory parameter for memory_hotplug. This is
|
||||
typically set on the kernel command line -
|
||||
memory_hotplug.memmap_on_memory set to 'true' or 'force'.
|
9
Documentation/ABI/testing/sysfs-bus-iio-adc-pac1934
Normal file
@ -0,0 +1,9 @@
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_shunt_resistorY
|
||||
KernelVersion: 6.7
|
||||
Contact: linux-iio@vger.kernel.org
|
||||
Description:
|
||||
The value of the shunt resistor may be known only at runtime
|
||||
and set by a client application. This attribute allows userspace to
|
||||
set its value in micro-ohms. X is the IIO index of the device.
|
||||
Y is the channel number. The value is used to calculate
|
||||
current, power and accumulated energy.
|
@ -11,7 +11,7 @@ saw any problems).
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: List of correctable errors seen and reported by this
|
||||
PCI device using ERR_COR. Note that since multiple errors may
|
||||
@ -32,7 +32,7 @@ Description: List of correctable errors seen and reported by this
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: List of uncorrectable fatal errors seen and reported by this
|
||||
PCI device using ERR_FATAL. Note that since multiple errors may
|
||||
@ -62,7 +62,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: List of uncorrectable nonfatal errors seen and reported by this
|
||||
PCI device using ERR_NONFATAL. Note that since multiple errors
|
||||
@ -100,20 +100,20 @@ collectors) that are AER capable. These indicate the number of error messages as
|
||||
device, so these counters include them and are thus cumulative of all the error
|
||||
messages on the PCI hierarchy originating at that root port.
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_cor
|
||||
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: Total number of ERR_COR messages reported to rootport.
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_fatal
|
||||
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: Total number of ERR_FATAL messages reported to rootport.
|
||||
|
||||
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_nonfatal
|
||||
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
|
||||
Date: July 2018
|
||||
KernelVersion: 4.19.0
|
||||
KernelVersion: 4.19.0
|
||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||
Description: Total number of ERR_NONFATAL messages reported to rootport.
|
||||
|
8
Documentation/ABI/testing/sysfs-bus-pci-devices-avs
Normal file
@ -0,0 +1,8 @@
|
||||
What: /sys/devices/pci0000:00/<dev>/avs/fw_version
|
||||
Date: February 2024
|
||||
Contact: Cezary Rojewski <cezary.rojewski@intel.com>
|
||||
Description:
|
||||
Version of the AudioDSP firmware that the ASoC avs driver is communicating
|
||||
with.
|
||||
|
||||
Format: %d.%d.%d.%d, type:major:minor:build.
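A reader of this attribute could split the dotted string into its four fields. The sketch below is a hypothetical helper, not part of the driver: the names `FwVersion` and `parse_fw_version` are invented for illustration, and the field order simply follows the "type:major:minor:build" wording of the Format line.

```python
from typing import NamedTuple


class FwVersion(NamedTuple):
    """Four fields, in the order given by the Format line."""
    fw_type: int
    major: int
    minor: int
    build: int


def parse_fw_version(text):
    """Parse a '%d.%d.%d.%d' string into a FwVersion tuple."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return FwVersion(*(int(part) for part in parts))
```

Because the result is a tuple, two versions compare field by field in the documented order.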
|
@ -442,6 +442,16 @@ What: /sys/bus/usb/devices/usbX/descriptors
|
||||
Description:
|
||||
Contains the interface descriptors, in binary.
|
||||
|
||||
What: /sys/bus/usb/devices/usbX/bos_descriptors
|
||||
Date: March 2024
|
||||
Contact: Elbert Mai <code@elbertmai.com>
|
||||
Description:
|
||||
Binary file containing the cached binary device object store (BOS)
|
||||
of the device. This consists of the BOS descriptor followed by the
|
||||
set of device capability descriptors. All descriptors read from
|
||||
this file are in bus-endian format. Note that the kernel will not
|
||||
request the BOS from a device if its bcdUSB is less than 0x0201.
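To make the layout concrete, the following Python sketch splits the contents of such a file into its device capability descriptors. It is an illustration under stated assumptions: the function name `parse_bos` is hypothetical, "bus-endian" is taken to mean little-endian as on the USB wire, and the header layout (bLength, bDescriptorType 0x0F, wTotalLength, bNumDeviceCaps) and capability type 0x10 are taken from the USB specification rather than from this entry.

```python
import struct

BOS_DESCRIPTOR_TYPE = 0x0F        # bDescriptorType of the BOS descriptor
DEVICE_CAPABILITY_TYPE = 0x10     # bDescriptorType of a device capability


def parse_bos(data):
    """Return (bNumDeviceCaps, [(bDevCapabilityType, payload_bytes), ...])."""
    if len(data) < 5:
        raise ValueError("BOS header needs 5 bytes")
    # Little-endian: bLength, bDescriptorType, wTotalLength, bNumDeviceCaps.
    b_length, b_type, total_length, num_caps = struct.unpack_from("<BBHB", data, 0)
    if b_type != BOS_DESCRIPTOR_TYPE or b_length < 5:
        raise ValueError("not a BOS descriptor")
    end = min(total_length, len(data))
    caps = []
    offset = b_length
    while offset + 3 <= end:
        cap_len, cap_type, cap_kind = struct.unpack_from("<BBB", data, offset)
        if cap_len < 3 or offset + cap_len > end:
            break  # malformed or truncated descriptor: stop rather than guess
        if cap_type == DEVICE_CAPABILITY_TYPE:
            caps.append((cap_kind, bytes(data[offset + 3:offset + cap_len])))
        offset += cap_len
    return num_caps, caps
```

The declared count is returned alongside the parsed list so a caller can notice when the two disagree.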
|
||||
|
||||
What: /sys/bus/usb/devices/usbX/idProduct
|
||||
Description:
|
||||
Product ID, in hexadecimal.
|
||||
|
@ -1,6 +1,6 @@
|
||||
What: /sys/bus/vdpa/drivers_autoprobe
|
||||
Date: March 2020
|
||||
Contact: virtualization@lists.linux-foundation.org
|
||||
Contact: virtualization@lists.linux.dev
|
||||
Description:
|
||||
This file determines whether new devices are immediately bound
|
||||
to a driver after the creation. It initially contains 1, which
|
||||
@ -12,7 +12,7 @@ Description:
|
||||
|
||||
What: /sys/bus/vdpa/driver_probe
|
||||
Date: March 2020
|
||||
Contact: virtualization@lists.linux-foundation.org
|
||||
Contact: virtualization@lists.linux.dev
|
||||
Description:
|
||||
Writing a device name to this file will cause the kernel to bind
|
||||
devices to a compatible driver.
|
||||
@ -22,7 +22,7 @@ Description:
|
||||
|
||||
What: /sys/bus/vdpa/drivers/.../bind
|
||||
Date: March 2020
|
||||
Contact: virtualization@lists.linux-foundation.org
|
||||
Contact: virtualization@lists.linux.dev
|
||||
Description:
|
||||
Writing a device name to this file will cause the driver to
|
||||
attempt to bind to the device. This is useful for overriding
|
||||
@ -30,7 +30,7 @@ Description:
|
||||
|
||||
What: /sys/bus/vdpa/drivers/.../unbind
|
||||
Date: March 2020
|
||||
Contact: virtualization@lists.linux-foundation.org
|
||||
Contact: virtualization@lists.linux.dev
|
||||
Description:
|
||||
Writing a device name to this file will cause the driver to
|
||||
attempt to unbind from the device. This may be useful when
|
||||
@ -38,7 +38,7 @@ Description:
|
||||
|
||||
What: /sys/bus/vdpa/devices/.../driver_override
|
||||
Date: November 2021
|
||||
Contact: virtualization@lists.linux-foundation.org
|
||||
Contact: virtualization@lists.linux.dev
|
||||
Description:
|
||||
This file allows the driver for a device to be specified.
|
||||
When specified, only a driver with a name matching the value
|
||||
|
@ -149,6 +149,15 @@ Description:
|
||||
|
||||
RW
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/inY_fault
|
||||
Description:
|
||||
Reports a voltage hard failure (e.g. a shorted component)
|
||||
|
||||
- 1: Failed
|
||||
- 0: Ok
|
||||
|
||||
RO
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/cpuY_vid
|
||||
Description:
|
||||
CPU core reference voltage.
|
||||
@ -968,6 +977,15 @@ Description:
|
||||
|
||||
RW
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/humidityY_max_alarm
|
||||
Description:
|
||||
Maximum humidity detection
|
||||
|
||||
- 0: OK
|
||||
- 1: Maximum humidity detected
|
||||
|
||||
RO
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/humidityY_max_hyst
|
||||
Description:
|
||||
Humidity hysteresis value for max limit.
|
||||
@ -987,6 +1005,15 @@ Description:
|
||||
|
||||
RW
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/humidityY_min_alarm
|
||||
Description:
|
||||
Minimum humidity detection
|
||||
|
||||
- 0: OK
|
||||
- 1: Minimum humidity detected
|
||||
|
||||
RO
|
||||
|
||||
What: /sys/class/hwmon/hwmonX/humidityY_min_hyst
|
||||
Description:
|
||||
Humidity hysteresis value for min limit.
|
||||
|
@ -88,6 +88,8 @@ Description:
|
||||
speed of 10Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 10Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/link_100
|
||||
Date: Jun 2023
|
||||
KernelVersion: 6.5
|
||||
@ -101,6 +103,8 @@ Description:
|
||||
speed of 100Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 100Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/link_1000
|
||||
Date: Jun 2023
|
||||
KernelVersion: 6.5
|
||||
@ -114,6 +118,8 @@ Description:
|
||||
speed of 1000Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 1000Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/link_2500
|
||||
Date: Nov 2023
|
||||
KernelVersion: 6.8
|
||||
@ -127,6 +133,8 @@ Description:
|
||||
speed of 2500Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 2500Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/link_5000
|
||||
Date: Nov 2023
|
||||
KernelVersion: 6.8
|
||||
@ -140,6 +148,8 @@ Description:
|
||||
speed of 5000Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 5000Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/link_10000
|
||||
Date: Nov 2023
|
||||
KernelVersion: 6.8
|
||||
@ -153,6 +163,8 @@ Description:
|
||||
speed of 10000Mbps of the named network device.
|
||||
Setting this value also immediately changes the LED state.
|
||||
|
||||
Present only if the named network device supports 10000Mbps link speed.
|
||||
|
||||
What: /sys/class/leds/<led>/half_duplex
|
||||
Date: Jun 2023
|
||||
KernelVersion: 6.5
|
||||
|
@ -1,11 +1,11 @@
|
||||
What: /sys/class/leds/<led>/ttyname
|
||||
What: /sys/class/leds/<tty_led>/ttyname
|
||||
Date: Dec 2020
|
||||
KernelVersion: 5.10
|
||||
Contact: linux-leds@vger.kernel.org
|
||||
Description:
|
||||
Specifies the tty device name of the triggering tty
|
||||
|
||||
What: /sys/class/leds/<led>/rx
|
||||
What: /sys/class/leds/<tty_led>/rx
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
@ -13,7 +13,7 @@ Description:
|
||||
If set to 0, the LED will not blink on reception.
|
||||
If set to 1 (default), the LED will blink on reception.
|
||||
|
||||
What: /sys/class/leds/<led>/tx
|
||||
What: /sys/class/leds/<tty_led>/tx
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
@ -21,7 +21,7 @@ Description:
|
||||
If set to 0, the LED will not blink on transmission.
|
||||
If set to 1 (default), the LED will blink on transmission.
|
||||
|
||||
What: /sys/class/leds/<led>/cts
|
||||
What: /sys/class/leds/<tty_led>/cts
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
@ -31,7 +31,7 @@ Description:
|
||||
If set to 0 (default), the LED will not evaluate CTS.
|
||||
If set to 1, the LED will evaluate CTS.
|
||||
|
||||
What: /sys/class/leds/<led>/dsr
|
||||
What: /sys/class/leds/<tty_led>/dsr
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
@ -41,7 +41,7 @@ Description:
|
||||
If set to 0 (default), the LED will not evaluate DSR.
|
||||
If set to 1, the LED will evaluate DSR.
|
||||
|
||||
What: /sys/class/leds/<led>/dcd
|
||||
What: /sys/class/leds/<tty_led>/dcd
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
@ -51,7 +51,7 @@ Description:
|
||||
If set to 0 (default), the LED will not evaluate CAR (DCD).
|
||||
If set to 1, the LED will evaluate CAR (DCD).
|
||||
|
||||
What: /sys/class/leds/<led>/rng
|
||||
What: /sys/class/leds/<tty_led>/rng
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8
|
||||
Description:
|
||||
|
@ -96,3 +96,26 @@ Description:
|
||||
Indicates the absolute minimum limit of bytes allowed to be
|
||||
queued on this network device transmit queue. Default value is
|
||||
0.
|
||||
|
||||
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
|
||||
Date: Jan 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: netdev@vger.kernel.org
|
||||
Description:
|
||||
Tx completion stall detection threshold in ms. Kernel will
|
||||
guarantee to detect all stalls longer than this threshold but
|
||||
may also detect stalls longer than half of the threshold.
|
||||
|
||||
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
|
||||
Date: Jan 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: netdev@vger.kernel.org
|
||||
Description:
|
||||
Number of detected Tx completion stalls.
|
||||
|
||||
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
|
||||
Date: Jan 2024
|
||||
KernelVersion: 6.9
|
||||
Contact: netdev@vger.kernel.org
|
||||
Description:
|
||||
Longest detected Tx completion stall. Write 0 to clear.
|
||||
|
@ -1,4 +1,4 @@
|
||||
What: /sys/class/<iface>/statistics/collisions
|
||||
What: /sys/class/net/<iface>/statistics/collisions
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -6,7 +6,7 @@ Description:
|
||||
Indicates the number of collisions seen by this network device.
|
||||
This value might not be relevant with all MAC layers.
|
||||
|
||||
What: /sys/class/<iface>/statistics/multicast
|
||||
What: /sys/class/net/<iface>/statistics/multicast
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -14,7 +14,7 @@ Description:
|
||||
Indicates the number of multicast packets received by this
|
||||
network device.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_bytes
|
||||
What: /sys/class/net/<iface>/statistics/rx_bytes
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -23,7 +23,7 @@ Description:
|
||||
See the network driver for the exact meaning of when this
|
||||
value is incremented.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_compressed
|
||||
What: /sys/class/net/<iface>/statistics/rx_compressed
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -32,7 +32,7 @@ Description:
|
||||
network device. This value might only be relevant for interfaces
|
||||
that support packet compression (e.g: PPP).
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_crc_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_crc_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -41,7 +41,7 @@ Description:
|
||||
by this network device. Note that the specific meaning might
|
||||
depend on the MAC layer used by the interface.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_dropped
|
||||
What: /sys/class/net/<iface>/statistics/rx_dropped
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -51,7 +51,7 @@ Description:
|
||||
packet processing. See the network driver for the exact
|
||||
meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -59,7 +59,7 @@ Description:
|
||||
Indicates the number of receive errors on this network device.
|
||||
See the network driver for the exact meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_fifo_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_fifo_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -68,7 +68,7 @@ Description:
|
||||
network device. See the network driver for the exact
|
||||
meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_frame_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_frame_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -78,7 +78,7 @@ Description:
|
||||
on the MAC layer protocol used. See the network driver for
|
||||
the exact meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_length_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_length_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -87,7 +87,7 @@ Description:
|
||||
error, oversized or undersized. See the network driver for the
|
||||
exact meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_missed_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_missed_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -96,7 +96,7 @@ Description:
|
||||
due to lack of capacity in the receive side. See the network
|
||||
driver for the exact meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_nohandler
|
||||
What: /sys/class/net/<iface>/statistics/rx_nohandler
|
||||
Date: February 2016
|
||||
KernelVersion: 4.6
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -104,7 +104,7 @@ Description:
|
||||
Indicates the number of received packets that were dropped on
|
||||
an inactive device by the network core.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_over_errors
|
||||
What: /sys/class/net/<iface>/statistics/rx_over_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -114,7 +114,7 @@ Description:
|
||||
(e.g: larger than MTU). See the network driver for the exact
|
||||
meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/rx_packets
|
||||
What: /sys/class/net/<iface>/statistics/rx_packets
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -122,7 +122,7 @@ Description:
|
||||
Indicates the total number of good packets received by this
|
||||
network device.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_aborted_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_aborted_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -132,7 +132,7 @@ Description:
|
||||
a medium collision). See the network driver for the exact
|
||||
meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_bytes
|
||||
What: /sys/class/net/<iface>/statistics/tx_bytes
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -143,7 +143,7 @@ Description:
|
||||
transmitted packets or all packets that have been queued for
|
||||
transmission.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_carrier_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_carrier_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -152,7 +152,7 @@ Description:
|
||||
because of carrier errors (e.g: physical link down). See the
|
||||
network driver for the exact meaning of this value.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_compressed
|
||||
What: /sys/class/net/<iface>/statistics/tx_compressed
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -161,7 +161,7 @@ Description:
|
||||
this might only be relevant for devices that support
|
||||
compression (e.g: PPP).
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_dropped
|
||||
What: /sys/class/net/<iface>/statistics/tx_dropped
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -170,7 +170,7 @@ Description:
|
||||
See the driver for the exact reasons as to why the packets were
|
||||
dropped.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -179,7 +179,7 @@ Description:
|
||||
a network device. See the driver for the exact reasons as to
|
||||
why the packets were dropped.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_fifo_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_fifo_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -188,7 +188,7 @@ Description:
|
||||
FIFO error. See the driver for the exact reasons as to why the
|
||||
packets were dropped.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_heartbeat_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_heartbeat_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -197,7 +197,7 @@ Description:
|
||||
reported as heartbeat errors. See the driver for the exact
|
||||
reasons as to why the packets were dropped.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_packets
|
||||
What: /sys/class/net/<iface>/statistics/tx_packets
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
@ -206,7 +206,7 @@ Description:
|
||||
device. See the driver for whether this reports the number of all
|
||||
attempted or successful transmissions.
|
||||
|
||||
What: /sys/class/<iface>/statistics/tx_window_errors
|
||||
What: /sys/class/net/<iface>/statistics/tx_window_errors
|
||||
Date: April 2005
|
||||
KernelVersion: 2.6.12
|
||||
Contact: netdev@vger.kernel.org
|
||||
|
@ -19,3 +19,9 @@ Description:
|
||||
- none
|
||||
- host
|
||||
- device
|
||||
|
||||
What: /sys/class/usb_role/<switch>/connector
|
||||
Date: Feb 2024
|
||||
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
|
||||
Description:
|
||||
Optional symlink to the USB Type-C connector.
|
||||
|
@ -516,6 +516,7 @@ What: /sys/devices/system/cpu/vulnerabilities
|
||||
/sys/devices/system/cpu/vulnerabilities/mds
|
||||
/sys/devices/system/cpu/vulnerabilities/meltdown
|
||||
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
|
||||
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
|
||||
/sys/devices/system/cpu/vulnerabilities/retbleed
|
||||
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
|
||||
/sys/devices/system/cpu/vulnerabilities/spectre_v1
|
||||
|
10
Documentation/ABI/testing/sysfs-driver-panfrost-profiling
Normal file
@ -0,0 +1,10 @@
|
||||
What: /sys/bus/platform/drivers/panfrost/.../profiling
|
||||
Date: February 2024
|
||||
KernelVersion: 6.8.0
|
||||
Contact: Adrian Larumbe <adrian.larumbe@collabora.com>
|
||||
Description:
|
||||
Get/set drm fdinfo's engine and cycles profiling status.
|
||||
Valid values are:
|
||||
0: Don't enable fdinfo job profiling sources.
|
||||
1: Enable fdinfo job profiling sources; this enables both the GPU's
|
||||
timestamp and cycle counter registers.
|
@ -141,3 +141,23 @@ Description:
|
||||
64
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
|
||||
What: /sys/bus/pci/devices/<BDF>/qat/auto_reset
|
||||
Date: March 2024
|
||||
KernelVersion: 6.8
|
||||
Contact: qat-linux@intel.com
|
||||
Description: (RW) Reports the current state of the autoreset feature
|
||||
for a QAT device.
|
||||
|
||||
Write to the attribute to enable or disable device auto reset.
|
||||
|
||||
Device auto reset is disabled by default.
|
||||
|
||||
The values are:
|
||||
|
||||
* 1/Yy/on: auto reset enabled. If the device encounters an
|
||||
unrecoverable error, it will be reset automatically.
|
||||
* 0/Nn/off: auto reset disabled. If the device encounters an
|
||||
unrecoverable error, it will not be reset.
|
||||
|
||||
This attribute is only available for qat_4xxx devices.
|
||||
|
@ -205,7 +205,7 @@ Description: Controls the idle timing of system, if there is no FS operation
|
||||
What: /sys/fs/f2fs/<disk>/discard_idle_interval
|
||||
Date: September 2018
|
||||
Contact: "Chao Yu" <yuchao0@huawei.com>
|
||||
Contact: "Sahitya Tummala" <stummala@codeaurora.org>
|
||||
Contact: "Sahitya Tummala" <quic_stummala@quicinc.com>
|
||||
Description: Controls the idle timing of discard thread given
|
||||
this time interval.
|
||||
Default is 5 secs.
|
||||
@ -213,7 +213,7 @@ Description: Controls the idle timing of discard thread given
|
||||
What: /sys/fs/f2fs/<disk>/gc_idle_interval
|
||||
Date: September 2018
|
||||
Contact: "Chao Yu" <yuchao0@huawei.com>
|
||||
Contact: "Sahitya Tummala" <stummala@codeaurora.org>
|
||||
Contact: "Sahitya Tummala" <quic_stummala@quicinc.com>
|
||||
Description: Controls the idle timing for gc path. Set to 5 seconds by default.
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/iostat_enable
|
||||
@ -701,29 +701,31 @@ Description: Support configuring fault injection type, should be
|
||||
enabled with fault_injection option, fault type value
|
||||
is shown below, it supports single or combined type.
|
||||
|
||||
=================== ===========
|
||||
Type_Name Type_Value
|
||||
=================== ===========
|
||||
FAULT_KMALLOC 0x000000001
|
||||
FAULT_KVMALLOC 0x000000002
|
||||
FAULT_PAGE_ALLOC 0x000000004
|
||||
FAULT_PAGE_GET 0x000000008
|
||||
FAULT_ALLOC_BIO 0x000000010 (obsolete)
|
||||
FAULT_ALLOC_NID 0x000000020
|
||||
FAULT_ORPHAN 0x000000040
|
||||
FAULT_BLOCK 0x000000080
|
||||
FAULT_DIR_DEPTH 0x000000100
|
||||
FAULT_EVICT_INODE 0x000000200
|
||||
FAULT_TRUNCATE 0x000000400
|
||||
FAULT_READ_IO 0x000000800
|
||||
FAULT_CHECKPOINT 0x000001000
|
||||
FAULT_DISCARD 0x000002000
|
||||
FAULT_WRITE_IO 0x000004000
|
||||
FAULT_SLAB_ALLOC 0x000008000
|
||||
FAULT_DQUOT_INIT 0x000010000
|
||||
FAULT_LOCK_OP 0x000020000
|
||||
FAULT_BLKADDR 0x000040000
|
||||
=================== ===========
|
||||
=========================== ===========
|
||||
Type_Name Type_Value
|
||||
=========================== ===========
|
||||
FAULT_KMALLOC 0x000000001
|
||||
FAULT_KVMALLOC 0x000000002
|
||||
FAULT_PAGE_ALLOC 0x000000004
|
||||
FAULT_PAGE_GET 0x000000008
|
||||
FAULT_ALLOC_BIO 0x000000010 (obsolete)
|
||||
FAULT_ALLOC_NID 0x000000020
|
||||
FAULT_ORPHAN 0x000000040
|
||||
FAULT_BLOCK 0x000000080
|
||||
FAULT_DIR_DEPTH 0x000000100
|
||||
FAULT_EVICT_INODE 0x000000200
|
||||
FAULT_TRUNCATE 0x000000400
|
||||
FAULT_READ_IO 0x000000800
|
||||
FAULT_CHECKPOINT 0x000001000
|
||||
FAULT_DISCARD 0x000002000
|
||||
FAULT_WRITE_IO 0x000004000
|
||||
FAULT_SLAB_ALLOC 0x000008000
|
||||
FAULT_DQUOT_INIT 0x000010000
|
||||
FAULT_LOCK_OP 0x000020000
|
||||
FAULT_BLKADDR_VALIDITY 0x000040000
|
||||
FAULT_BLKADDR_CONSISTENCE 0x000080000
|
||||
FAULT_NO_SEGMENT 0x000100000
|
||||
=========================== ===========
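
Because the fault types are single bits, a combined type is just the bitwise OR of the individual values. The Python sketch below copies the values from the updated table; the helper names `encode_fault_mask` and `decode_fault_mask` are hypothetical and are not part of f2fs.

```python
from typing import Iterable, List

# Values copied from the Type_Name / Type_Value table above.
FAULT_TYPES = {
    "FAULT_KMALLOC": 0x000000001,
    "FAULT_KVMALLOC": 0x000000002,
    "FAULT_PAGE_ALLOC": 0x000000004,
    "FAULT_PAGE_GET": 0x000000008,
    "FAULT_ALLOC_BIO": 0x000000010,  # marked obsolete in the table
    "FAULT_ALLOC_NID": 0x000000020,
    "FAULT_ORPHAN": 0x000000040,
    "FAULT_BLOCK": 0x000000080,
    "FAULT_DIR_DEPTH": 0x000000100,
    "FAULT_EVICT_INODE": 0x000000200,
    "FAULT_TRUNCATE": 0x000000400,
    "FAULT_READ_IO": 0x000000800,
    "FAULT_CHECKPOINT": 0x000001000,
    "FAULT_DISCARD": 0x000002000,
    "FAULT_WRITE_IO": 0x000004000,
    "FAULT_SLAB_ALLOC": 0x000008000,
    "FAULT_DQUOT_INIT": 0x000010000,
    "FAULT_LOCK_OP": 0x000020000,
    "FAULT_BLKADDR_VALIDITY": 0x000040000,
    "FAULT_BLKADDR_CONSISTENCE": 0x000080000,
    "FAULT_NO_SEGMENT": 0x000100000,
}


def encode_fault_mask(names: Iterable[str]) -> int:
    """OR together the bits for the given fault type names."""
    mask = 0
    for name in names:
        mask |= FAULT_TYPES[name]
    return mask


def decode_fault_mask(mask: int) -> List[str]:
    """List the fault type names whose bits are set in mask, in table order."""
    return [name for name, bit in FAULT_TYPES.items() if mask & bit]
```

Combining FAULT_KMALLOC and FAULT_PAGE_ALLOC this way gives 0x5, which is the kind of value one would write for a combined type.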
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/discard_io_aware_gran
|
||||
Date: January 2023
|
||||
|
11
Documentation/ABI/testing/sysfs-fs-virtiofs
Normal file
@ -0,0 +1,11 @@
|
||||
What: /sys/fs/virtiofs/<n>/tag
|
||||
Date: Feb 2024
|
||||
Contact: virtio-fs@lists.linux.dev
|
||||
Description:
|
||||
[RO] The mount "tag" that can be used to mount this filesystem.
|
||||
|
||||
What: /sys/fs/virtiofs/<n>/device
|
||||
Date: Feb 2024
|
||||
Contact: virtio-fs@lists.linux.dev
|
||||
Description:
|
||||
Symlink to the virtio device that exports this filesystem.
|
@ -23,3 +23,9 @@ Date: Feb 2021
|
||||
Contact: Minchan Kim <minchan@kernel.org>
|
||||
Description:
|
||||
the number of pages CMA API failed to allocate
|
||||
|
||||
What: /sys/kernel/mm/cma/<cma-heap-name>/release_pages_success
|
||||
Date: Feb 2024
|
||||
Contact: Anshuman Khandual <anshuman.khandual@arm.com>
|
||||
Description:
|
||||
the number of pages CMA API succeeded in releasing
|
||||
|
@ -34,7 +34,9 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
|
||||
kdamond. Writing 'update_schemes_tried_bytes' to the file
|
||||
updates only '.../tried_regions/total_bytes' files of this
|
||||
kdamond. Writing 'clear_schemes_tried_regions' to the file
|
||||
removes contents of the 'tried_regions' directory.
|
||||
removes contents of the 'tried_regions' directory. Writing
|
||||
'update_schemes_effective_quotas' to the file updates
|
||||
'.../quotas/effective_bytes' files of this kdamond.
|
||||
|
||||
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
|
||||
Date: Mar 2022
|
||||
@ -208,6 +210,12 @@ Contact: SeongJae Park <sj@kernel.org>
|
||||
Description: Writing to and reading from this file sets and gets the size
|
||||
quota of the scheme in bytes.
|
||||
|
||||
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/effective_bytes
|
||||
Date: Feb 2024
|
||||
Contact: SeongJae Park <sj@kernel.org>
|
||||
Description: Reading from this file gets the effective size quota of the
|
||||
scheme in bytes, which is adjusted for the time quota and goals.
|
||||
|
||||
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/reset_interval_ms
|
||||
Date: Mar 2022
|
||||
Contact: SeongJae Park <sj@kernel.org>
|
||||
@ -221,6 +229,12 @@ Description: Writing a number 'N' to this file creates the number of
|
||||
directories for setting automatic tuning of the scheme's
|
||||
aggressiveness named '0' to 'N-1' under the goals/ directory.
|
||||
|
||||
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_metric
|
||||
Date: Feb 2024
|
||||
Contact: SeongJae Park <sj@kernel.org>
|
||||
Description: Writing to and reading from this file sets and gets the quota
|
||||
auto-tuning goal metric.
|
||||
|
||||
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
|
||||
Date: Nov 2023
|
||||
Contact: SeongJae Park <sj@kernel.org>
|
||||
|
4
Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
Normal file
@ -0,0 +1,4 @@
|
||||
What: /sys/kernel/mm/mempolicy/
|
||||
Date: January 2024
|
||||
Contact: Linux memory management mailing list <linux-mm@kvack.org>
|
||||
Description: Interface for Mempolicy
|
@ -0,0 +1,25 @@
|
||||
What: /sys/kernel/mm/mempolicy/weighted_interleave/
|
||||
Date: January 2024
|
||||
Contact: Linux memory management mailing list <linux-mm@kvack.org>
|
||||
Description: Configuration Interface for the Weighted Interleave policy
|
||||
|
||||
What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
|
||||
Date: January 2024
|
||||
Contact: Linux memory management mailing list <linux-mm@kvack.org>
|
||||
Description: Weight configuration interface for nodeN
|
||||
|
||||
The interleave weight for a memory node (N). These weights are
|
||||
utilized by tasks which have set their mempolicy to
|
||||
MPOL_WEIGHTED_INTERLEAVE.
|
||||
|
||||
These weights only affect new allocations, and changes at runtime
|
||||
will not cause migrations on already allocated pages.
|
||||
|
||||
The minimum weight for a node is always 1.
|
||||
|
||||
Minimum weight: 1
|
||||
Maximum weight: 255
|
||||
|
||||
Writing an empty string or `0` will reset the weight to the
|
||||
system default. The system default may be set by the kernel
|
||||
or drivers at boot or during hotplug events.
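
To make the weight rules concrete, here is a small Python sketch, not the kernel implementation. It assumes that each node receives `weight` consecutive pages per round, which is only one plausible reading of "weighted interleave". The names `validate_weight` and `interleave` and the `default` parameter are hypothetical.

```python
from typing import Dict, List

MIN_WEIGHT = 1
MAX_WEIGHT = 255


def validate_weight(raw: str, default: int = 1) -> int:
    """Apply the documented rules: empty string or 0 resets to the default,
    otherwise the weight must lie in 1..255."""
    text = raw.strip()
    if text == "":
        return default
    weight = int(text)  # raises ValueError for non-numeric input
    if weight == 0:
        return default
    if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
        raise ValueError(f"weight out of range: {weight}")
    return weight


def interleave(weights: Dict[int, int], pages: int) -> List[int]:
    """Return the node chosen for each of `pages` new allocations.

    Assumption: nodes are visited in ascending order and each node gets
    `weight` consecutive pages per round.
    """
    if not weights:
        raise ValueError("at least one node weight is required")
    order: List[int] = []
    nodes = sorted(weights)
    while len(order) < pages:
        for node in nodes:
            for _ in range(weights[node]):
                if len(order) == pages:
                    return order
                order.append(node)
    return order
```

With weights of 3 for node 0 and 1 for node 1, eight new pages would land as three on node 0, one on node 1, and then the same pattern again. Pages that are already allocated are not moved by a weight change.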
|
@ -4,18 +4,18 @@ KernelVersion: 6.5
|
||||
Contact: Miquel Raynal <miquel.raynal@bootlin.com>
|
||||
Description:
|
||||
The "cells" folder contains one file per cell exposed by the
|
||||
NVMEM device. The name of the file is: <name>@<where>, with
|
||||
<name> being the cell name and <where> its location in the NVMEM
|
||||
device, in hexadecimal (without the '0x' prefix, to mimic device
|
||||
tree node names). The length of the file is the size of the cell
|
||||
(when known). The content of the file is the binary content of
|
||||
the cell (may sometimes be ASCII, likely without trailing
|
||||
character).
|
||||
NVMEM device. The name of the file is: "<name>@<byte>,<bit>",
|
||||
with <name> being the cell name and <byte>,<bit> its location in
|
||||
the NVMEM device, in hexadecimal bytes and bits (without the
|
||||
'0x' prefix, to mimic device tree node names). The length of
|
||||
the file is the size of the cell (when known). The content of
|
||||
the file is the binary content of the cell (may sometimes be
|
||||
ASCII, likely without trailing character).
|
||||
Note: This file is only present if CONFIG_NVMEM_SYSFS
|
||||
is enabled.
|
||||
|
||||
Example::
|
||||
|
||||
hexdump -C /sys/bus/nvmem/devices/1-00563/cells/product-name@d
|
||||
hexdump -C /sys/bus/nvmem/devices/1-00563/cells/product-name@d,0
|
||||
00000000 54 4e 34 38 4d 2d 50 2d 44 4e |TN48M-P-DN|
|
||||
0000000a
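
The file naming scheme described above can be sketched in Python as follows. The helper name `parse_cell_name` is hypothetical, and the sketch assumes that both the byte and the bit offsets are written in hexadecimal without a '0x' prefix, as the description suggests.

```python
from typing import Tuple


def parse_cell_name(filename: str) -> Tuple[str, int, int]:
    """Split "<name>@<byte>,<bit>" into (name, byte_offset, bit_offset)."""
    # Split on the last '@' so a cell name containing '@' is kept whole.
    name, _, where = filename.rpartition("@")
    if not name or not where:
        raise ValueError(f"not a cell file name: {filename!r}")
    byte_text, sep, bit_text = where.partition(",")
    if not sep:
        raise ValueError(f"missing ',<bit>' part: {filename!r}")
    # Offsets are hexadecimal without a '0x' prefix.
    return name, int(byte_text, 16), int(bit_text, 16)
```

Applied to the example file above, `product-name@d,0` splits into the cell name `product-name`, byte offset 13 and bit offset 0.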
|
||||
|
@ -111,7 +111,9 @@ $(YNL_INDEX): $(YNL_RST_FILES)
|
||||
$(YNL_RST_DIR)/%.rst: $(YNL_YAML_DIR)/%.yaml $(YNL_TOOL)
|
||||
$(Q)$(YNL_TOOL) -i $< -o $@
|
||||
|
||||
htmldocs: $(YNL_INDEX)
|
||||
htmldocs texinfodocs latexdocs epubdocs xmldocs: $(YNL_INDEX)
|
||||
|
||||
htmldocs:
|
||||
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
|
||||
|
||||
@ -176,6 +178,7 @@ refcheckdocs:
|
||||
$(Q)cd $(srctree);scripts/documentation-file-ref-check
|
||||
|
||||
cleandocs:
|
||||
$(Q)rm -f $(YNL_INDEX) $(YNL_RST_FILES)
|
||||
$(Q)rm -rf $(BUILDDIR)
|
||||
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean
|
||||
|
||||
|
@ -68,7 +68,8 @@ over a rather long period of time, but improvements are always welcome!
|
||||
rcu_read_lock_sched(), or by the appropriate update-side lock.
|
||||
Explicit disabling of preemption (preempt_disable(), for example)
|
||||
can serve as rcu_read_lock_sched(), but is less readable and
|
||||
prevents lockdep from detecting locking issues.
|
||||
prevents lockdep from detecting locking issues. Acquiring a
|
||||
spinlock also enters an RCU read-side critical section.
|
||||
|
||||
Please note that you *cannot* rely on code known to be built
|
||||
only in non-preemptible kernels. Such code can and will break,
|
||||
@ -382,16 +383,17 @@ over a rather long period of time, but improvements are always welcome!
|
||||
must use whatever locking or other synchronization is required
|
||||
to safely access and/or modify that data structure.
|
||||
|
||||
Do not assume that RCU callbacks will be executed on the same
|
||||
CPU that executed the corresponding call_rcu() or call_srcu().
|
||||
For example, if a given CPU goes offline while having an RCU
|
||||
callback pending, then that RCU callback will execute on some
|
||||
surviving CPU. (If this was not the case, a self-spawning RCU
|
||||
callback would prevent the victim CPU from ever going offline.)
|
||||
Furthermore, CPUs designated by rcu_nocbs= might well *always*
|
||||
have their RCU callbacks executed on some other CPUs, in fact,
|
||||
for some real-time workloads, this is the whole point of using
|
||||
the rcu_nocbs= kernel boot parameter.
|
||||
Do not assume that RCU callbacks will be executed on
|
||||
the same CPU that executed the corresponding call_rcu(),
|
||||
call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(), or
|
||||
call_rcu_tasks_trace(). For example, if a given CPU goes offline
|
||||
while having an RCU callback pending, then that RCU callback
|
||||
will execute on some surviving CPU. (If this was not the case,
|
||||
a self-spawning RCU callback would prevent the victim CPU from
|
||||
ever going offline.) Furthermore, CPUs designated by rcu_nocbs=
|
||||
might well *always* have their RCU callbacks executed on some
|
||||
other CPUs. In fact, for some real-time workloads, this is the
|
||||
whole point of using the rcu_nocbs= kernel boot parameter.
|
||||
|
||||
In addition, do not assume that callbacks queued in a given order
|
||||
will be invoked in that order, even if they all are queued on the
|
||||
@ -444,7 +446,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
real-time workloads than is synchronize_rcu_expedited().
|
||||
|
||||
It is also permissible to sleep in RCU Tasks Trace read-side
|
||||
critical, which are delimited by rcu_read_lock_trace() and
|
||||
critical sections, which are delimited by rcu_read_lock_trace() and
|
||||
rcu_read_unlock_trace(). However, this is a specialized flavor
|
||||
of RCU, and you should not use it without first checking with
|
||||
its current users. In most cases, you should instead use SRCU.
|
||||
@ -490,6 +492,12 @@ over a rather long period of time, but improvements are always welcome!
|
||||
since the last time that you passed that same object to
|
||||
call_rcu() (or friends).
|
||||
|
||||
CONFIG_RCU_STRICT_GRACE_PERIOD:
|
||||
combine with KASAN to check for pointers leaked out
|
||||
of RCU read-side critical sections. This Kconfig
|
||||
option is tough on both performance and scalability,
|
||||
and so is limited to four-CPU systems.
|
||||
|
||||
__rcu sparse checks:
|
||||
tag the pointer to the RCU-protected data structure
|
||||
with __rcu, and sparse will warn you if you access that
|
||||
|
@ -408,7 +408,10 @@ member of the rcu_dereference() to use in various situations:
|
||||
RCU flavors, an RCU read-side critical section is entered
|
||||
using rcu_read_lock(), anything that disables bottom halves,
|
||||
anything that disables interrupts, or anything that disables
|
||||
preemption.
|
||||
preemption. Please note that spinlock critical sections
|
||||
are also implied RCU read-side critical sections, even when
|
||||
they are preemptible, as they are in kernels built with
|
||||
CONFIG_PREEMPT_RT=y.
|
||||
|
||||
2. If the access might be within an RCU read-side critical section
|
||||
on the one hand, or protected by (say) my_lock on the other,
|
||||
|
@ -318,7 +318,7 @@ Suppose that a previous kvm.sh run left its output in this directory::
|
||||
|
||||
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
|
||||
|
||||
Then this run can be re-run without rebuilding as follow:
|
||||
Then this run can be re-run without rebuilding as follows::
|
||||
|
||||
kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
|
||||
|
||||
|
@ -172,14 +172,25 @@ rcu_read_lock()
|
||||
critical section. Reference counts may be used in conjunction
|
||||
with RCU to maintain longer-term references to data structures.
|
||||
|
||||
Note that anything that disables bottom halves, preemption,
|
||||
or interrupts also enters an RCU read-side critical section.
|
||||
Acquiring a spinlock also enters an RCU read-side critical
|
||||
section, even for spinlocks that do not disable preemption,
|
||||
as is the case in kernels built with CONFIG_PREEMPT_RT=y.
|
||||
Sleeplocks do *not* enter RCU read-side critical sections.
|
||||
|
||||
rcu_read_unlock()
|
||||
^^^^^^^^^^^^^^^^^
|
||||
void rcu_read_unlock(void);
|
||||
|
||||
This temporal primitive is used by a reader to inform the
|
||||
reclaimer that the reader is exiting an RCU read-side critical
|
||||
section. Note that RCU read-side critical sections may be nested
|
||||
and/or overlapping.
|
||||
section. Anything that enables bottom halves, preemption,
|
||||
or interrupts also exits an RCU read-side critical section.
|
||||
Releasing a spinlock also exits an RCU read-side critical section.
|
||||
|
||||
Note that RCU read-side critical sections may be nested and/or
|
||||
overlapping.
|
||||
|
||||
synchronize_rcu()
|
||||
^^^^^^^^^^^^^^^^^
|
||||
@ -952,8 +963,8 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
|
||||
initialized after each and every call to kmem_cache_alloc(), which renders
|
||||
reference-free spinlock acquisition completely unsafe. Therefore, when
|
||||
using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
|
||||
(Those willing to use a kmem_cache constructor may also use locking,
|
||||
including cache-friendly sequence locking.)
|
||||
(Those willing to initialize their locks in a kmem_cache constructor
|
||||
may also use locking, including cache-friendly sequence locking.)
|
||||
|
||||
With traditional reference counting -- such as that implemented by the
|
||||
kref library in Linux -- there is typically code that runs when the last
|
||||
|
24
Documentation/admin-guide/RAS/address-translation.rst
Normal file
@ -0,0 +1,24 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Address translation
|
||||
===================
|
||||
|
||||
x86 AMD
|
||||
-------
|
||||
|
||||
Zen-based AMD systems include a Data Fabric that manages the layout of
|
||||
physical memory. Devices attached to the Fabric, like memory controllers,
|
||||
I/O, etc., may not have a complete view of the system physical memory map.
|
||||
These devices may provide a "normalized", i.e. device physical, address
|
||||
when reporting memory errors. Normalized addresses must be translated to
|
||||
a system physical address for the kernel to act on the memory.
|
||||
|
||||
AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
|
||||
this case.
|
||||
|
||||
Glossary of acronyms used in address translation for Zen-based systems
|
||||
|
||||
* CCM = Cache Coherent Moderator
|
||||
* COD = Cluster-on-Die
|
||||
* COH_ST = Coherent Station
|
||||
* DF = Data Fabric
|
@ -1,15 +1,10 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Reliability, Availability and Serviceability features
|
||||
=====================================================
|
||||
|
||||
This documents different aspects of the RAS functionality present in the
|
||||
kernel.
|
||||
|
||||
Error decoding
|
||||
---------------
|
||||
==============
|
||||
|
||||
* x86
|
||||
x86
|
||||
---
|
||||
|
||||
Error decoding on AMD systems should be done using the rasdaemon tool:
|
||||
https://github.com/mchehab/rasdaemon/
|
7
Documentation/admin-guide/RAS/index.rst
Normal file
@ -0,0 +1,7 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
main
|
||||
error-decoding
|
||||
address-translation
|
@ -1,8 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
============================================
|
||||
Reliability, Availability and Serviceability
|
||||
============================================
|
||||
==================================================
|
||||
Reliability, Availability and Serviceability (RAS)
|
||||
==================================================
|
||||
|
||||
This documents different aspects of the RAS functionality present in the
|
||||
kernel.
|
||||
|
||||
RAS concepts
|
||||
************
|
@ -262,9 +262,11 @@ Compiling the kernel
|
||||
- Make sure you have at least gcc 5.1 available.
|
||||
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
|
||||
|
||||
- Do a ``make`` to create a compressed kernel image. It is also
|
||||
possible to do ``make install`` if you have lilo installed to suit the
|
||||
kernel makefiles, but you may want to check your particular lilo setup first.
|
||||
- Do a ``make`` to create a compressed kernel image. It is also possible to do
|
||||
``make install`` if you have lilo installed or if your distribution has an
|
||||
install script recognised by the kernel's installer. Most popular
|
||||
distributions will have a recognized install script. You may want to
|
||||
check your distribution's setup first.
|
||||
|
||||
To do the actual install, you have to be root, but none of the normal
|
||||
build should require that. Don't take the name of root in vain.
|
||||
@ -301,32 +303,51 @@ Compiling the kernel
|
||||
image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
|
||||
to the place where your regular bootable kernel is found.
|
||||
|
||||
- Booting a kernel directly from a floppy without the assistance of a
|
||||
bootloader such as LILO, is no longer supported.
|
||||
- Booting a kernel directly from a storage device without the assistance
|
||||
of a bootloader such as LILO or GRUB is no longer supported on BIOS
|
||||
(non-EFI) systems. On UEFI/EFI systems, however, you can use EFISTUB
|
||||
which allows the motherboard to boot directly to the kernel.
|
||||
On modern workstations and desktops, it's generally recommended to use a
|
||||
bootloader as difficulties can arise with multiple kernels and secure boot.
|
||||
For more details on EFISTUB,
|
||||
see "Documentation/admin-guide/efi-stub.rst".
|
||||
|
||||
If you boot Linux from the hard drive, chances are you use LILO, which
|
||||
uses the kernel image as specified in the file /etc/lilo.conf. The
|
||||
kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
|
||||
/boot/bzImage. To use the new kernel, save a copy of the old image
|
||||
and copy the new image over the old one. Then, you MUST RERUN LILO
|
||||
to update the loading map! If you don't, you won't be able to boot
|
||||
the new kernel image.
|
||||
- It's important to note that as of 2016 LILO (LInux LOader) is no longer in
|
||||
active development, though as it was extremely popular, it often comes up
|
||||
in documentation. Popular alternatives include GRUB2, rEFInd, Syslinux,
|
||||
systemd-boot, or EFISTUB. For various reasons, it's not recommended to use
|
||||
software that's no longer in active development.
|
||||
|
||||
Reinstalling LILO is usually a matter of running /sbin/lilo.
|
||||
You may wish to edit /etc/lilo.conf to specify an entry for your
|
||||
old kernel image (say, /vmlinux.old) in case the new one does not
|
||||
work. See the LILO docs for more information.
|
||||
- Chances are your distribution includes an install script and running
|
||||
``make install`` will be all that's needed. Should that not be the case
|
||||
you'll have to identify your bootloader and reference its documentation or
|
||||
configure your EFI.
|
||||
|
||||
After reinstalling LILO, you should be all set. Shutdown the system,
|
||||
Legacy LILO Instructions
|
||||
------------------------
|
||||
|
||||
|
||||
- If you use LILO the kernel images are specified in the file /etc/lilo.conf.
|
||||
The kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
|
||||
/boot/bzImage. To use the new kernel, save a copy of the old image and copy
|
||||
the new image over the old one. Then, you MUST RERUN LILO to update the
|
||||
loading map! If you don't, you won't be able to boot the new kernel image.
|
||||
|
||||
- Reinstalling LILO is usually a matter of running /sbin/lilo. You may wish
|
||||
to edit /etc/lilo.conf to specify an entry for your old kernel image
|
||||
(say, /vmlinux.old) in case the new one does not work. See the LILO docs
|
||||
for more information.
|
||||
|
||||
- After reinstalling LILO, you should be all set. Shut down the system,
|
||||
reboot, and enjoy!
|
||||
|
||||
If you ever need to change the default root device, video mode,
|
||||
etc. in the kernel image, use your bootloader's boot options
|
||||
where appropriate. No need to recompile the kernel to change
|
||||
these parameters.
|
||||
- If you ever need to change the default root device, video mode, etc. in the
|
||||
kernel image, use your bootloader's boot options where appropriate. No need
|
||||
to recompile the kernel to change these parameters.
|
||||
|
||||
- Reboot with the new kernel and enjoy.
|
||||
|
||||
|
||||
If something goes wrong
|
||||
-----------------------
|
||||
|
||||
|
@ -179,7 +179,7 @@ files describing that cpuset:
|
||||
- cpuset.mem_hardwall flag: is memory allocation hardwalled
|
||||
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
|
||||
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
|
||||
- cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
|
||||
- cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
|
||||
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
|
||||
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
|
||||
|
||||
|
@ -65,10 +65,12 @@ files include::
|
||||
|
||||
1. Page fault accounting
|
||||
|
||||
hugetlb.<hugepagesize>.limit_in_bytes
|
||||
hugetlb.<hugepagesize>.max_usage_in_bytes
|
||||
hugetlb.<hugepagesize>.usage_in_bytes
|
||||
hugetlb.<hugepagesize>.failcnt
|
||||
::
|
||||
|
||||
hugetlb.<hugepagesize>.limit_in_bytes
|
||||
hugetlb.<hugepagesize>.max_usage_in_bytes
|
||||
hugetlb.<hugepagesize>.usage_in_bytes
|
||||
hugetlb.<hugepagesize>.failcnt
|
||||
|
||||
The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
|
||||
control group and enforces the limit during page fault. Since HugeTLB
|
||||
@ -82,10 +84,12 @@ getting SIGBUS.
|
||||
|
||||
2. Reservation accounting
|
||||
|
||||
hugetlb.<hugepagesize>.rsvd.limit_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.failcnt
|
||||
::
|
||||
|
||||
hugetlb.<hugepagesize>.rsvd.limit_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
|
||||
hugetlb.<hugepagesize>.rsvd.failcnt
|
||||
|
||||
The HugeTLB controller allows users to limit the HugeTLB reservations per control
|
||||
group and enforces the controller limit at reservation time and at the fault of
|
||||
|
@ -28,7 +28,7 @@ Introduction
|
||||
high performance safe distributed caching (leases/oplocks), optional packet
|
||||
signing, large files, Unicode support and other internationalization
|
||||
improvements. Since both Samba server and this filesystem client support the
|
||||
CIFS Unix extensions, and the Linux client also suppors SMB3 POSIX extensions,
|
||||
CIFS Unix extensions, and the Linux client also supports SMB3 POSIX extensions,
|
||||
the combination can provide a reasonable alternative to other network and
|
||||
cluster file systems for fileserving in some Linux to Linux environments,
|
||||
not just in Linux to Windows (or Linux to Mac) environments.
|
||||
|
@ -34,6 +34,8 @@ Device Mapper
|
||||
switch
|
||||
thin-provisioning
|
||||
unstriped
|
||||
vdo-design
|
||||
vdo
|
||||
verity
|
||||
writecache
|
||||
zero
|
||||
|
633
Documentation/admin-guide/device-mapper/vdo-design.rst
Normal file
@ -0,0 +1,633 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
|
||||
================
|
||||
Design of dm-vdo
|
||||
================
|
||||
|
||||
The dm-vdo (virtual data optimizer) target provides inline deduplication,
|
||||
compression, zero-block elimination, and thin provisioning. A dm-vdo target
|
||||
can be backed by up to 256TB of storage, and can present a logical size of
|
||||
up to 4PB. This target was originally developed at Permabit Technology
|
||||
Corp. starting in 2009. It was first released in 2013 and has been used in
|
||||
production environments ever since. It was made open-source in 2017 after
|
||||
Permabit was acquired by Red Hat. This document describes the design of
|
||||
dm-vdo. For usage, see vdo.rst in the same directory as this file.
|
||||
|
||||
Because deduplication rates fall drastically as the block size increases, a
|
||||
vdo target has a maximum block size of 4K. However, it can achieve
|
||||
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
|
||||
reference a single 4K of actual storage. It can achieve compression rates
|
||||
of 14:1. All zero blocks consume no storage at all.
|
||||
|
||||
Theory of Operation
|
||||
===================
|
||||
|
||||
The design of dm-vdo is based on the idea that deduplication is a two-part
|
||||
problem. The first is to recognize duplicate data. The second is to avoid
|
||||
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
|
||||
parts: a deduplication index (called UDS) that is used to discover
|
||||
duplicate data, and a data store with a reference counted block map that
|
||||
maps from logical block addresses to the actual storage location of the
|
||||
data.
|
||||
|
||||
Zones and Threading
|
||||
-------------------
|
||||
|
||||
Due to the complexity of data optimization, the number of metadata
|
||||
structures involved in a single write operation to a vdo target is larger
|
||||
than most other targets. Furthermore, because vdo must operate on small
|
||||
block sizes in order to achieve good deduplication rates, acceptable
|
||||
performance can only be achieved through parallelism. Therefore, vdo's
|
||||
design attempts to be lock-free.
|
||||
|
||||
Most of a vdo's main data structures are designed to be easily divided into
|
||||
"zones" such that any given bio must only access a single zone of any zoned
|
||||
structure. Safety with minimal locking is achieved by ensuring that during
|
||||
normal operation, each zone is assigned to a specific thread, and only that
|
||||
thread will access the portion of the data structure in that zone.
|
||||
Associated with each thread is a work queue. Each bio is associated with a
|
||||
request object (the "data_vio") which will be added to a work queue when
|
||||
the next phase of its operation requires access to the structures in the
|
||||
zone associated with that queue.
|
||||
|
||||
Another way of thinking about this arrangement is that the work queue for
|
||||
each zone has an implicit lock on the structures it manages for all its
|
||||
operations, because vdo guarantees that no other thread will alter those
|
||||
structures.
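
The ownership idea described above can be sketched in Python. This is a toy model under stated assumptions, not dm-vdo code. The names `Zone`, `zone_for` and `run_demo` are hypothetical, requests are routed by a simple modulo on an integer key, and a request is reduced to a single table update rather than a multi-phase data_vio.

```python
import queue
import threading


class Zone:
    """One zone: a private table plus the work queue of its owning thread."""

    def __init__(self, zone_id: int):
        self.zone_id = zone_id
        self.table = {}  # touched only by this zone's thread while it runs
        self.work = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.work.get()
            if item is None:  # sentinel: shut the zone down
                break
            key, value = item
            # No lock needed: this thread is the table's only writer.
            self.table[key] = value

    def stop(self):
        self.work.put(None)
        self.thread.join()


def zone_for(key: int, zones):
    """Route a request to exactly one zone (assumed modulo routing)."""
    return zones[key % len(zones)]


def run_demo(num_zones: int = 4, num_requests: int = 100):
    zones = [Zone(i) for i in range(num_zones)]
    for key in range(num_requests):
        zone_for(key, zones).work.put((key, key * 2))
    for zone in zones:
        zone.stop()  # after join(), reading the tables is safe
    return zones
```

The queue is the only shared object; each table is modified by a single thread, which is what makes the implicit lock in the text work without explicit locking on the table itself.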
|
||||
|
||||
Although each structure is divided into zones, this division is not
|
||||
reflected in the on-disk representation of each data structure. Therefore,
|
||||
the number of zones for each structure, and hence the number of threads,
|
||||
can be reconfigured each time a vdo target is started.
|
||||
|
||||
The Deduplication Index
|
||||
-----------------------
|
||||
|
||||
In order to identify duplicate data efficiently, vdo was designed to
|
||||
leverage some common characteristics of duplicate data. From empirical
|
||||
observations, we gathered two key insights. The first is that in most data
|
||||
sets with significant amounts of duplicate data, the duplicates tend to
|
||||
have temporal locality. When a duplicate appears, it is more likely that
|
||||
other duplicates will be detected, and that those duplicates will have been
|
||||
written at about the same time. This is why the index keeps records in
|
||||
temporal order. The second insight is that new data is more likely to
|
||||
duplicate recent data than it is to duplicate older data and in general,
|
||||
there are diminishing returns to looking further back in time. Therefore,
|
||||
when the index is full, it should cull its oldest records to make space for
|
||||
new ones. Another important idea behind the design of the index is that the
|
||||
ultimate goal of deduplication is to reduce storage costs. Since there is a
|
||||
trade-off between the storage saved and the resources expended to achieve
|
||||
those savings, vdo does not attempt to find every last duplicate block. It
|
||||
is sufficient to find and eliminate most of the redundancy.
|
||||
|
||||
Each block of data is hashed to produce a 16-byte block name. An index
|
||||
record consists of this block name paired with the presumed location of
|
||||
that data on the underlying storage. However, it is not possible to
|
||||
guarantee that the index is accurate. In the most common case, this occurs
|
||||
because it is too costly to update the index when a block is over-written
|
||||
or discarded. Doing so would require either storing the block name along
|
||||
with the blocks, which is difficult to do efficiently in block-based
|
||||
storage, or reading and rehashing each block before overwriting it.
|
||||
Inaccuracy can also result from a hash collision where two different blocks
|
||||
have the same name. In practice, this is extremely unlikely, but because
|
||||
vdo does not use a cryptographic hash, a malicious workload could be
|
||||
constructed. Because of these inaccuracies, vdo treats the locations in the
|
||||
index as hints, and reads each indicated block to verify that it is indeed
|
||||
a duplicate before sharing the existing block with a new one.
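
The "index as hint" behavior can be illustrated with a short Python sketch. It is a simplified model, not the dm-vdo implementation. The class `HintedStore` is hypothetical, storage is a plain list of blocks, and a 16-byte BLAKE2b digest merely stands in for vdo's actual (non-cryptographic) hash.

```python
import hashlib


def block_name(data: bytes) -> bytes:
    """16-byte block name; BLAKE2b is only a stand-in for vdo's real hash."""
    return hashlib.blake2b(data, digest_size=16).digest()


class HintedStore:
    def __init__(self):
        self.blocks = []  # "physical" storage, one entry per block
        self.index = {}   # block name -> presumed location (a hint only)

    def write(self, data: bytes) -> int:
        """Store data, sharing an existing block only after verifying it."""
        name = block_name(data)
        hint = self.index.get(name)
        if hint is not None and self.blocks[hint] == data:
            return hint  # verified duplicate: reuse the existing block
        # No hint, or the hint was stale or a collision: store a new copy.
        self.blocks.append(data)
        location = len(self.blocks) - 1
        self.index[name] = location
        return location

    def overwrite(self, location: int, data: bytes) -> None:
        """Overwrite in place without touching the index, leaving a stale hint."""
        self.blocks[location] = data
```

Writing the same block twice returns the same location. After `overwrite()` changes that location, the stale hint fails verification and the next write of the original data gets a fresh block instead of sharing the wrong one.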
|
||||
|
||||
Records are collected into groups called chapters. New records are added to
|
||||
the newest chapter, called the open chapter. This chapter is stored in a
|
||||
format optimized for adding and modifying records, and the content of the
|
||||
open chapter is not finalized until it runs out of space for new records.
|
||||
When the open chapter fills up, it is closed and a new open chapter is
|
||||
created to collect new records.
|
||||
|
||||
Closing a chapter converts it to a different format which is optimized for
|
||||
reading. The records are written to a series of record pages based on the
|
||||
order in which they were received. This means that records with temporal
|
||||
locality should be on a small number of pages, reducing the I/O required to
|
||||
retrieve them. The chapter also compiles an index that indicates which
|
||||
record page contains any given name. This index means that a request for a
|
||||
name can determine exactly which record page may contain that record,
|
||||
without having to load the entire chapter from storage. This index uses
|
||||
only a subset of the block name as its key, so it cannot guarantee that an
|
||||
index entry refers to the desired block name. It can only guarantee that if
|
||||
there is a record for this name, it will be on the indicated page. Closed
|
||||
chapters are read-only structures and their contents are never altered in
|
||||
any way.
|
||||
|
||||
Once enough records have been written to fill up all the available index
|
||||
space, the oldest chapter is removed to make space for new chapters. Any
|
||||
time a request finds a matching record in the index, that record is copied
|
||||
into the open chapter. This ensures that useful block names remain available
|
||||
in the index, while unreferenced block names are forgotten over time.
|
||||
|
||||
In order to find records in older chapters, the index also maintains a
|
||||
higher level structure called the volume index, which contains entries
|
||||
mapping each block name to the chapter containing its newest record. This
|
||||
mapping is updated as records for the block name are copied or updated,
|
||||
ensuring that only the newest record for a given block name can be found.
|
||||
An older record for a block name will no longer be found even though it has
|
||||
not been deleted from its chapter. Like the chapter index, the volume index
|
||||
uses only a subset of the block name as its key and cannot definitively
|
||||
say that a record exists for a name. It can only say which chapter would
|
||||
contain the record if a record exists. The volume index is stored entirely
|
||||
in memory and is saved to storage only when the vdo target is shut down.
|
||||
|
||||
From the viewpoint of a request for a particular block name, it will first
|
||||
look up the name in the volume index. This search will either indicate that
|
||||
the name is new, or which chapter to search. If it returns a chapter, the
|
||||
request looks up its name in the chapter index. This will indicate either
|
||||
that the name is new, or which record page to search. Finally, if it is not
|
||||
new, the request will look for its name in the indicated record page.
|
||||
This process may require up to two page reads per request (one for the
|
||||
chapter index page and one for the record page). However, recently
|
||||
accessed pages are cached so that these page reads can be amortized across
|
||||
many block name requests.
|
||||
|
||||
The volume index and the chapter indexes are implemented using a
|
||||
memory-efficient structure called a delta index. Instead of storing the
|
||||
entire block name (the key) for each entry, the entries are sorted by name
|
||||
and only the difference between adjacent keys (the delta) is stored.
|
||||
Because we expect the hashes to be randomly distributed, the size of the
|
||||
deltas follows an exponential distribution. Because of this distribution,
|
||||
the deltas are expressed using a Huffman code to take up even less space.
|
||||
The entire sorted list of keys is called a delta list. This structure
|
||||
allows the index to use many fewer bytes per entry than a traditional hash
|
||||
table, but it is slightly more expensive to look up entries, because a
|
||||
request must read every entry in a delta list to add up the deltas in order
|
||||
to find the record it needs. The delta index reduces this lookup cost by
|
||||
splitting its key space into many sub-lists, each starting at a fixed key
|
||||
value, so that each individual list is short.
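The delta list idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function names are hypothetical, keys are plain 32-bit integers rather than portions of a block name, and the deltas are stored as whole integers instead of being Huffman coded.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store each key as the difference from the previous key. The list
 * covers keys at or above `base`, the fixed starting key of this
 * sub-list. `keys` must be sorted in ascending order.
 */
void delta_list_encode(const uint32_t *keys, size_t count, uint32_t base,
		       uint32_t *deltas)
{
	uint32_t previous = base;
	size_t i;

	for (i = 0; i < count; i++) {
		deltas[i] = keys[i] - previous;
		previous = keys[i];
	}
}

/*
 * Walk the list, summing deltas to rebuild each key. Returns the
 * position of `key` in the list, or -1 if the list has no entry for it.
 */
long delta_list_find(const uint32_t *deltas, size_t count, uint32_t base,
		     uint32_t key)
{
	uint32_t current = base;
	size_t i;

	for (i = 0; i < count; i++) {
		current += deltas[i];
		if (current == key)
			return (long)i;
		if (current > key)
			break; /* keys are sorted, so the key is absent */
	}
	return -1;
}
```

The linear walk in delta_list_find() is the lookup cost described above. Keeping each sub-list short bounds the number of deltas that must be summed.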
|
||||
|
||||
The default index size can hold 64 million records, corresponding to about
|
||||
256GB of data. This means that the index can identify duplicate data if the
|
||||
original data was written within the last 256GB of writes. This range is
|
||||
called the deduplication window. If new writes duplicate data that is older
|
||||
than that, the index will not be able to find it because the records of the
|
||||
older data have been removed. This means that if an application writes a
|
||||
200 GB file to a vdo target and then immediately writes it again, the two
|
||||
copies will deduplicate perfectly. Doing the same with a 500 GB file will
|
||||
result in no deduplication, because the beginning of the file will no
|
||||
longer be in the index by the time the second write begins (assuming there
|
||||
is no duplication within the file itself).
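The window arithmetic can be checked with a short sketch. The names are hypothetical, and it assumes one index record per 4K block and binary units (64 * 2^20 records, 2^30 bytes per GB).

```c
#include <stdint.h>

#define VDO_BLOCK_BYTES 4096ULL

/* Bytes of recent writes that an index of `record_count` records covers. */
uint64_t dedupe_window_bytes(uint64_t record_count)
{
	return record_count * VDO_BLOCK_BYTES;
}

/*
 * Nonzero if a file written twice back to back can deduplicate against
 * itself: the first copy must still fit inside the window when the
 * second copy starts.
 */
int rewrite_can_deduplicate(uint64_t file_bytes, uint64_t window_bytes)
{
	return file_bytes <= window_bytes;
}
```

With 64 * 2^20 records, dedupe_window_bytes() yields 2^38 bytes, which is 256GB. A 200 GB file fits inside that window and a 500 GB file does not.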
|
||||
|
||||
If an application anticipates a data workload that will see useful
|
||||
deduplication beyond the 256GB threshold, vdo can be configured to use a
|
||||
larger index with a correspondingly larger deduplication window. (This
|
||||
configuration can only be set when the target is created, not altered
|
||||
later. It is important to consider the expected workload for a vdo target
|
||||
before configuring it.) There are two ways to do this.
|
||||
|
||||
One way is to increase the memory size of the index, which also increases
|
||||
the amount of backing storage required. Doubling the size of the index will
|
||||
double the length of the deduplication window at the expense of doubling
|
||||
the storage size and the memory requirements.
|
||||
|
||||
The other option is to enable sparse indexing. Sparse indexing increases
|
||||
the deduplication window by a factor of 10, at the expense of also
|
||||
increasing the storage size by a factor of 10. However, with sparse
|
||||
indexing, the memory requirements do not increase. The trade-off is
|
||||
slightly more computation per request and a slight decrease in the amount
|
||||
of deduplication detected. For most workloads with significant amounts of
|
||||
duplicate data, sparse indexing will detect 97-99% of the deduplication
|
||||
that a standard index will detect.
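The two options can be compared with a small cost model. This is a rough sketch only: the structure and function names are hypothetical, costs are expressed relative to the default index, and it assumes the two options multiply when combined, which the text above does not state. The small loss in detected deduplication with sparse indexing is not modeled.

```c
#include <stdint.h>

#define SPARSE_FACTOR 10

struct index_cost {
	uint64_t window_gb;     /* deduplication window */
	uint64_t memory_units;  /* memory, relative to the default index */
	uint64_t storage_units; /* backing storage, relative to the default */
};

/*
 * Estimate the cost of an index that is `size_multiplier` times the
 * default memory size, optionally with sparse indexing enabled.
 */
struct index_cost estimate_index_cost(uint64_t base_window_gb,
				      unsigned int size_multiplier,
				      int sparse)
{
	unsigned int factor = sparse ? SPARSE_FACTOR : 1;
	struct index_cost cost;

	cost.window_gb = base_window_gb * size_multiplier * factor;
	cost.memory_units = size_multiplier;
	cost.storage_units = (uint64_t)size_multiplier * factor;
	return cost;
}
```

For example, estimate_index_cost(256, 1, 1) describes a sparse index of the default memory size: a 2560GB window, unchanged memory, and ten times the storage.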
|
||||
|
||||
The vio and data_vio Structures
|
||||
-------------------------------
|
||||
|
||||
A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
|
||||
fields and data to track vdo-specific information. A struct vio maintains a
|
||||
pointer to a bio but also tracks other fields specific to the operation of
|
||||
vdo. The vio is kept separate from its related bio because there are many
|
||||
circumstances where vdo completes the bio but must continue to do work
|
||||
related to deduplication or compression.
|
||||
|
||||
Metadata reads and writes, and other writes that originate within vdo, use
|
||||
a struct vio directly. Application reads and writes use a larger structure
|
||||
called a data_vio to track information about their progress. A struct
|
||||
data_vio contains a struct vio and also includes several other fields
|
||||
related to deduplication and other vdo features. The data_vio is the
|
||||
primary unit of application work in vdo. Each data_vio proceeds through a
|
||||
set of steps to handle the application data, after which it is reset and
|
||||
returned to a pool of data_vios for reuse.
|
||||
|
||||
There is a fixed pool of 2048 data_vios. This number was chosen to bound
|
||||
the amount of work that is required to recover from a crash. In addition,
|
||||
benchmarks have indicated that increasing the size of the pool does not
|
||||
significantly improve performance.
|
||||
|
||||
The Data Store
|
||||
--------------
|
||||
|
||||
The data store is implemented by three main data structures, all of which
|
||||
work in concert to reduce or amortize metadata updates across as many data
|
||||
writes as possible.
|
||||
|
||||
*The Slab Depot*
|
||||
|
||||
Most of the vdo volume belongs to the slab depot. The depot contains a
|
||||
collection of slabs. The slabs can be up to 32GB, and are divided into
|
||||
three sections. Most of a slab consists of a linear sequence of 4K blocks.
|
||||
These blocks are used either to store data, or to hold portions of the
|
||||
block map (see below). In addition to the data blocks, each slab has a set
|
||||
of reference counters, using 1 byte for each data block. Finally, each slab
|
||||
has a journal.
|
||||
|
||||
Reference updates are written to the slab journal. Slab journal blocks are
|
||||
written out either when they are full, or when the recovery journal
|
||||
requests they do so in order to allow the main recovery journal (see below)
|
||||
to free up space. The slab journal is used both to ensure that the main
|
||||
recovery journal can regularly free up space, and also to amortize the cost
|
||||
of updating individual reference blocks. The reference counters are kept in
|
||||
memory and are written out, a block at a time in oldest-dirtied order, only
|
||||
when there is a need to reclaim slab journal space. The write operations
|
||||
are performed in the background as needed so they do not add latency to
|
||||
particular I/O operations.
|
||||
|
||||
Each slab is independent of every other. They are assigned to "physical
|
||||
zones" in round-robin fashion. If there are P physical zones, then slab n
|
||||
is assigned to zone n mod P.
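The round-robin assignment amounts to a single modulo operation. A minimal sketch, with a hypothetical function name:

```c
/*
 * Return the physical zone that owns a slab. `zone_count` is the number
 * of physical zones (P in the text) and must be nonzero.
 */
unsigned int physical_zone_for_slab(unsigned int slab_number,
				    unsigned int zone_count)
{
	return slab_number % zone_count;
}
```

With three zones, slabs 0, 3, 6 and so on land in zone 0, slabs 1, 4, 7 in zone 1, and slabs 2, 5, 8 in zone 2, so consecutive slabs are spread evenly across zones.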
|
||||
|
||||
The slab depot maintains an additional small data structure, the "slab
|
||||
summary," which is used to reduce the amount of work needed to come back
|
||||
online after a crash. The slab summary maintains an entry for each slab
|
||||
indicating whether or not the slab has ever been used, whether all of its
|
||||
reference count updates have been persisted to storage, and approximately
|
||||
how full it is. During recovery, each physical zone will attempt to recover
|
||||
at least one slab, stopping whenever it has recovered a slab which has some
|
||||
free blocks. Once each zone has some space, or has determined that none is
|
||||
available, the target can resume normal operation in a degraded mode. Read
|
||||
and write requests can be serviced, perhaps with degraded performance,
|
||||
while the remainder of the dirty slabs are recovered.
|
||||
|
||||
*The Block Map*
|
||||
|
||||
The block map contains the logical to physical mapping. It can be thought
|
||||
of as an array with one entry per logical address. Each entry is 5 bytes,
|
||||
36 bits of which contain the physical block number which holds the data for
|
||||
the given logical address. The other 4 bits are used to indicate the nature
|
||||
of the mapping. Of the 16 possible states, one represents a logical address
|
||||
which is unmapped (i.e. it has never been written, or has been discarded),
|
||||
one represents an uncompressed block, and the other 14 states are used to
|
||||
indicate that the mapped data is compressed, and which of the compression
|
||||
slots in the compressed block contains the data for this logical address.
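One way to pack a 36-bit physical block number and a 4-bit state into 5 bytes is sketched below. The byte layout and the numbering of the states are assumptions made for illustration; this is not a description of the actual on-disk format.

```c
#include <stdint.h>

/* Hypothetical numbering of the 16 mapping states. */
enum mapping_state {
	STATE_UNMAPPED = 0,
	STATE_UNCOMPRESSED = 1,
	STATE_COMPRESSED_BASE = 2, /* states 2..15 select slots 0..13 */
};

struct packed_mapping {
	uint8_t bytes[5];
};

/* Pack the state into the top nibble, followed by the 36-bit PBN. */
struct packed_mapping pack_mapping(uint64_t pbn, unsigned int state)
{
	struct packed_mapping entry;

	entry.bytes[0] = (uint8_t)(((state & 0xF) << 4) | ((pbn >> 32) & 0xF));
	entry.bytes[1] = (uint8_t)(pbn >> 24);
	entry.bytes[2] = (uint8_t)(pbn >> 16);
	entry.bytes[3] = (uint8_t)(pbn >> 8);
	entry.bytes[4] = (uint8_t)pbn;
	return entry;
}

uint64_t unpack_pbn(struct packed_mapping entry)
{
	return ((uint64_t)(entry.bytes[0] & 0xF) << 32) |
	       ((uint64_t)entry.bytes[1] << 24) |
	       ((uint64_t)entry.bytes[2] << 16) |
	       ((uint64_t)entry.bytes[3] << 8) |
	       (uint64_t)entry.bytes[4];
}

unsigned int unpack_state(struct packed_mapping entry)
{
	return entry.bytes[0] >> 4;
}
```

Under this numbering, a compressed mapping in slot s would be stored as state STATE_COMPRESSED_BASE + s, which accounts for the 14 compression slots.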
|
||||
|
||||
In practice, the array of mapping entries is divided into "block map
|
||||
pages," each of which fits in a single 4K block. Each block map page
|
||||
consists of a header and 812 mapping entries. Each mapping page is actually
|
||||
a leaf of a radix tree which consists of block map pages at each level.
|
||||
There are 60 radix trees which are assigned to "logical zones" in round
|
||||
robin fashion. (If there are L logical zones, tree n will belong to zone n
|
||||
mod L.) At each level, the trees are interleaved, so logical addresses
|
||||
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
|
||||
on. The interleaving is maintained all the way up to the 60 root nodes.
|
||||
Choosing 60 trees results in an evenly distributed number of trees per zone
|
||||
for a large number of possible logical zone counts. The storage for the 60
|
||||
tree roots is allocated at format time. All other block map pages are
|
||||
allocated out of the slabs as needed. This flexible allocation avoids the
|
||||
need to pre-allocate space for the entire set of logical mappings and also
|
||||
makes growing the logical size of a vdo relatively easy.
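The interleaving described above implies a simple mapping from a logical address to a leaf page, a tree, and a slot within the page. The sketch below uses hypothetical names and only models the leaf level; it does not show how interior pages are laid out.

```c
#include <stdint.h>

#define ENTRIES_PER_PAGE 812
#define TREE_ROOT_COUNT 60

struct block_map_position {
	uint64_t leaf_page; /* index of the leaf page across the whole map */
	unsigned int tree;  /* which of the 60 trees holds that page */
	unsigned int slot;  /* entry within the leaf page */
};

/* Find where the mapping for a logical address lives. */
struct block_map_position locate_mapping(uint64_t logical_address)
{
	struct block_map_position position;

	position.leaf_page = logical_address / ENTRIES_PER_PAGE;
	position.tree = (unsigned int)(position.leaf_page % TREE_ROOT_COUNT);
	position.slot = (unsigned int)(logical_address % ENTRIES_PER_PAGE);
	return position;
}

/* Trees are assigned to logical zones round robin: tree n mod L. */
unsigned int logical_zone_for_tree(unsigned int tree, unsigned int zone_count)
{
	return tree % zone_count;
}
```

Because 60 is divisible by 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, and 30, each of those zone counts gives every logical zone the same number of trees.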
|
||||
|
||||
In operation, the block map maintains two caches. It is prohibitive to keep
|
||||
the entire leaf level of the trees in memory, so each logical zone
|
||||
maintains its own cache of leaf pages. The size of this cache is
|
||||
configurable at target start time. The second cache is allocated at start
|
||||
time, and is large enough to hold all the non-leaf pages of the entire
|
||||
block map. This cache is populated as pages are needed.
|
||||
|
||||
*The Recovery Journal*
|
||||
|
||||
The recovery journal is used to amortize updates across the block map and
|
||||
slab depot. Each write request causes an entry to be made in the journal.
|
||||
Entries are either "data remappings" or "block map remappings." For a data
|
||||
remapping, the journal records the logical address affected and its old and
|
||||
new physical mappings. For a block map remapping, the journal records the
|
||||
block map page number and the physical block allocated for it. Block map
|
||||
pages are never reclaimed or repurposed, so the old mapping is always 0.
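The two kinds of journal entry can be pictured as one record with a kind tag. The structure and helper names below are hypothetical and do not reflect the on-disk journal format.

```c
#include <stdint.h>

enum journal_entry_kind {
	DATA_REMAPPING,
	BLOCK_MAP_REMAPPING,
};

struct journal_entry {
	enum journal_entry_kind kind;
	uint64_t subject; /* logical address, or block map page number */
	uint64_t old_pbn; /* previous physical mapping */
	uint64_t new_pbn; /* new physical mapping */
};

/* A data remapping records both the old and the new physical mapping. */
struct journal_entry data_remapping(uint64_t logical_address,
				    uint64_t old_pbn, uint64_t new_pbn)
{
	struct journal_entry entry = {
		DATA_REMAPPING, logical_address, old_pbn, new_pbn
	};
	return entry;
}

/*
 * A block map remapping records a newly allocated tree page. Pages are
 * never reclaimed, so the old mapping is always 0.
 */
struct journal_entry block_map_remapping(uint64_t page_number,
					 uint64_t allocated_pbn)
{
	struct journal_entry entry = {
		BLOCK_MAP_REMAPPING, page_number, 0, allocated_pbn
	};
	return entry;
}
```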
|
||||
|
||||
Each journal entry is an intent record summarizing the metadata updates
|
||||
that are required for a data_vio. The recovery journal issues a flush
|
||||
before each journal block write to ensure that the physical data for the
|
||||
new block mappings in that block are stable on storage, and journal block
|
||||
writes are all issued with the FUA bit set to ensure the recovery journal
|
||||
entries themselves are stable. The journal entry and the data write it
|
||||
represents must be stable on disk before the other metadata structures may
|
||||
be updated to reflect the operation. These entries allow the vdo device to
|
||||
reconstruct the logical to physical mappings after an unexpected
|
||||
interruption such as a loss of power.
|
||||
|
||||
*Write Path*
|
||||
|
||||
All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
|
||||
as vdo has done enough work to guarantee that it can complete the write
|
||||
eventually. Generally, the data for acknowledged but unflushed write I/O
|
||||
can be treated as though it is cached in memory. If an application
|
||||
requires data to be stable on storage, it must issue a flush or write the
|
||||
data with the FUA bit set like any other asynchronous I/O. Shutting down
|
||||
the vdo target will also flush any remaining I/O.
|
||||
|
||||
Application write bios follow the steps outlined below.
|
||||
|
||||
1. A data_vio is obtained from the data_vio pool and associated with the
|
||||
application bio. If there are no data_vios available, the incoming bio
|
||||
will block until a data_vio is available. This provides back pressure
|
||||
to the application. The data_vio pool is protected by a spin lock.
|
||||
|
||||
The newly acquired data_vio is reset and the bio's data is copied into
|
||||
the data_vio if it is a write and the data is not all zeroes. The data
|
||||
must be copied because the application bio can be acknowledged before
|
||||
the data_vio processing is complete, which means later processing steps
|
||||
will no longer have access to the application bio. The application bio
|
||||
may also be smaller than 4K, in which case the data_vio will have
|
||||
already read the underlying block and the data is instead copied over
|
||||
the relevant portion of the larger block.
|
||||
|
||||
2. The data_vio places a claim (the "logical lock") on the logical address
|
||||
of the bio. It is vital to prevent simultaneous modifications of the
|
||||
same logical address, because deduplication involves sharing blocks.
|
||||
This claim is implemented as an entry in a hashtable where the key is
|
||||
the logical address and the value is a pointer to the data_vio
|
||||
currently handling that address.
|
||||
|
||||
If a data_vio looks in the hashtable and finds that another data_vio is
|
||||
already operating on that logical address, it waits until the previous
|
||||
operation finishes. It also sends a message to inform the current
|
||||
lock holder that it is waiting. Most notably, a new data_vio waiting
|
||||
for a logical lock will flush the previous lock holder out of the
|
||||
compression packer (step 8e) rather than allowing it to continue
|
||||
waiting to be packed.
|
||||
|
||||
This stage requires the data_vio to get an implicit lock on the
|
||||
appropriate logical zone to prevent concurrent modifications of the
|
||||
hashtable. This implicit locking is handled by the zone divisions
|
||||
described above.
|
||||
|
||||
3. The data_vio traverses the block map tree to ensure that all the
|
||||
necessary internal tree nodes have been allocated, by trying to find
|
||||
the leaf page for its logical address. If any interior tree page is
|
||||
missing, it is allocated at this time out of the same physical storage
|
||||
pool used to store application data.
|
||||
|
||||
a. If any page-node in the tree has not yet been allocated, it must be
|
||||
allocated before the write can continue. This step requires the
|
||||
data_vio to lock the page-node that needs to be allocated. This
|
||||
lock, like the logical block lock in step 2, is a hashtable entry
|
||||
that causes other data_vios to wait for the allocation process to
|
||||
complete.
|
||||
|
||||
The implicit logical zone lock is released while the allocation is
|
||||
happening, in order to allow other operations in the same logical
|
||||
zone to proceed. The details of allocation are the same as in
|
||||
step 4. Once a new node has been allocated, that node is added to
|
||||
the tree using a similar process to adding a new data block mapping.
|
||||
The data_vio journals the intent to add the new node to the block
|
||||
map tree (step 10), updates the reference count of the new block
|
||||
(step 11), and reacquires the implicit logical zone lock to add the
|
||||
new mapping to the parent tree node (step 12). Once the tree is
|
||||
updated, the data_vio proceeds down the tree. Any other data_vios
|
||||
waiting on this allocation also proceed.
|
||||
|
||||
b. In the steady-state case, the block map tree nodes will already be
|
||||
allocated, so the data_vio just traverses the tree until it finds
|
||||
the required leaf node. The location of the mapping (the "block map
|
||||
slot") is recorded in the data_vio so that later steps do not need
|
||||
to traverse the tree again. The data_vio then releases the implicit
|
||||
logical zone lock.
|
||||
|
||||
4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
|
||||
made to allocate a free data block. This allocation ensures that the
|
||||
data_vio can write its data somewhere even if deduplication and
|
||||
compression are not possible. This stage gets an implicit lock on a
|
||||
physical zone to search for free space within that zone.
|
||||
|
||||
The data_vio will search each slab in a zone until it finds a free
|
||||
block or decides there are none. If the first zone has no free space,
|
||||
it will proceed to search the next physical zone by taking the implicit
|
||||
lock for that zone and releasing the previous one until it finds a
|
||||
free block or runs out of zones to search. The data_vio will acquire a
|
||||
struct pbn_lock (the "physical block lock") on the free block. The
|
||||
struct pbn_lock also has several fields to record the various kinds of
|
||||
claims that data_vios can have on physical blocks. The pbn_lock is
|
||||
added to a hashtable like the logical block locks in step 2. This
|
||||
hashtable is also covered by the implicit physical zone lock. The
|
||||
reference count of the free block is updated to prevent any other
|
||||
data_vio from considering it free. The reference counters are a
|
||||
sub-component of the slab and are thus also covered by the implicit
|
||||
physical zone lock.
|
||||
|
||||
5. If an allocation was obtained, the data_vio has all the resources it
|
||||
needs to complete the write. The application bio can safely be
|
||||
acknowledged at this point. The acknowledgment happens on a separate
|
||||
thread to prevent the application callback from blocking other data_vio
|
||||
operations.
|
||||
|
||||
If an allocation could not be obtained, the data_vio continues to
|
||||
attempt to deduplicate or compress the data, but the bio is not
|
||||
acknowledged because the vdo device may be out of space.
|
||||
|
||||
6. At this point vdo must determine where to store the application data.
|
||||
The data_vio's data is hashed and the hash (the "record name") is
|
||||
recorded in the data_vio.
|
||||
|
||||
7. The data_vio reserves or joins a struct hash_lock, which manages all of
|
||||
the data_vios currently writing the same data. Active hash locks are
|
||||
tracked in a hashtable similar to the way logical block locks are
|
||||
tracked in step 2. This hashtable is covered by the implicit lock on
|
||||
the hash zone.
|
||||
|
||||
If there is no existing hash lock for this data_vio's record_name, the
|
||||
data_vio obtains a hash lock from the pool, adds it to the hashtable,
|
||||
and sets itself as the new hash lock's "agent." The hash_lock pool is
|
||||
also covered by the implicit hash zone lock. The hash lock agent will
|
||||
do all the work to decide where the application data will be
|
||||
written. If a hash lock for the data_vio's record_name already exists,
|
||||
and the data_vio's data is the same as the agent's data, the new
|
||||
data_vio will wait for the agent to complete its work and then share
|
||||
its result.
|
||||
|
||||
In the rare case that a hash lock exists for the data_vio's hash but
|
||||
the data does not match the hash lock's agent, the data_vio skips to
|
||||
step 8h and attempts to write its data directly. This can happen if two
|
||||
different data blocks produce the same hash, for example.
|
||||
|
||||
8. The hash lock agent attempts to deduplicate or compress its data with
|
||||
the following steps.
|
||||
|
||||
a. The agent initializes and sends its embedded deduplication request
|
||||
(struct uds_request) to the deduplication index. This does not
|
||||
require the data_vio to get any locks because the index components
|
||||
manage their own locking. The data_vio waits until it either gets a
|
||||
response from the index or times out.
|
||||
|
||||
b. If the deduplication index returns advice, the data_vio attempts to
|
||||
obtain a physical block lock on the indicated physical address, in
|
||||
order to read the data and verify that it is the same as the
|
||||
data_vio's data, and that it can accept more references. If the
|
||||
physical address is already locked by another data_vio, the data at
|
||||
that address may soon be overwritten so it is not safe to use the
|
||||
address for deduplication.
|
||||
|
||||
c. If the data matches and the physical block can add references, the
|
||||
agent and any other data_vios waiting on it will record this
|
||||
physical block as their new physical address and proceed to step 9
|
||||
to record their new mapping. If there are more data_vios in the hash
|
||||
lock than there are references available, one of the remaining
|
||||
data_vios becomes the new agent and continues to step 8d as if no
|
||||
valid advice was returned.
|
||||
|
||||
d. If no usable duplicate block was found, the agent first checks that
|
||||
it has an allocated physical block (from step 4) that it can write
|
||||
to. If the agent does not have an allocation, some other data_vio in
|
||||
the hash lock that does have an allocation takes over as agent. If
|
||||
none of the data_vios have an allocated physical block, these writes
|
||||
are out of space, so they proceed to step 13 for cleanup.
|
||||
|
||||
e. The agent attempts to compress its data. If the data does not
|
||||
compress, the data_vio will continue to step 8h to write its data
|
||||
directly.
|
||||
|
||||
If the compressed size is small enough, the agent will release the
|
||||
implicit hash zone lock and go to the packer (struct packer) where
|
||||
it will be placed in a bin (struct packer_bin) along with other
|
||||
data_vios. All compression operations require the implicit lock on
|
||||
the packer zone.
|
||||
|
||||
The packer can combine up to 14 compressed blocks in a single 4K
|
||||
data block. Compression is only helpful if vdo can pack at least 2
|
||||
data_vios into a single data block. This means that a data_vio may
|
||||
wait in the packer for an arbitrarily long time for other data_vios
|
||||
to fill out the compressed block. There is a mechanism for vdo to
|
||||
evict waiting data_vios when continuing to wait would cause
|
||||
problems. Circumstances causing an eviction include an application
|
||||
flush, device shutdown, or a subsequent data_vio trying to overwrite
|
||||
the same logical block address. A data_vio may also be evicted from
|
||||
the packer if it cannot be paired with any other compressed block
|
||||
before more compressible blocks need to use its bin. An evicted
|
||||
data_vio will proceed to step 8h to write its data directly.
|
||||
|
||||
f. If the agent fills a packer bin, either because all 14 of its slots
|
||||
are used or because it has no remaining space, it is written out
|
||||
using the allocated physical block from one of its data_vios. Step
|
||||
8d has already ensured that an allocation is available.
|
||||
|
||||
g. Each data_vio sets the compressed block as its new physical address.
|
||||
The data_vio obtains an implicit lock on the physical zone and
|
||||
acquires the struct pbn_lock for the compressed block, which is
|
||||
modified to be a shared lock. Then it releases the implicit physical
|
||||
zone lock and proceeds to step 8i.
|
||||
|
||||
h. Any data_vio evicted from the packer will have an allocation from
|
||||
step 4. It will write its data to that allocated physical block.
|
||||
|
||||
i. After the data is written, if the data_vio is the agent of a hash
|
||||
lock, it will reacquire the implicit hash zone lock and share its
|
||||
physical address with as many other data_vios in the hash lock as
|
||||
possible. Each data_vio will then proceed to step 9 to record its
|
||||
new mapping.
|
||||
|
||||
j. If the agent actually wrote new data (whether compressed or not),
|
||||
the deduplication index is updated to reflect the location of the
|
||||
new data. The agent then releases the implicit hash zone lock.
|
||||
|
||||
9. The data_vio determines the previous mapping of the logical address.
|
||||
There is a cache for block map leaf pages (the "block map cache"),
|
||||
because there are usually too many block map leaf nodes to store
|
||||
entirely in memory. If the desired leaf page is not in the cache, the
|
||||
data_vio will reserve a slot in the cache and load the desired page
|
||||
into it, possibly evicting an older cached page. The data_vio then
|
||||
finds the current physical address for this logical address (the "old
|
||||
physical mapping"), if any, and records it. This step requires a lock
|
||||
on the block map cache structures, covered by the implicit logical zone
|
||||
lock.
|
||||
|
||||
10. The data_vio makes an entry in the recovery journal containing the
|
||||
logical block address, the old physical mapping, and the new physical
|
||||
mapping. Making this journal entry requires holding the implicit
|
||||
recovery journal lock. The data_vio will wait in the journal until all
|
||||
recovery blocks up to the one containing its entry have been written
|
||||
and flushed to ensure the transaction is stable on storage.
|
||||
|
||||
11. Once the recovery journal entry is stable, the data_vio makes two slab
|
||||
journal entries: an increment entry for the new mapping, and a
|
||||
decrement entry for the old mapping. These two operations each require
|
||||
holding a lock on the affected physical slab, covered by its implicit
|
||||
physical zone lock. For correctness during recovery, the slab journal
|
||||
entries in any given slab journal must be in the same order as the
|
||||
corresponding recovery journal entries. Therefore, if the two entries
|
||||
are in different zones, they are made concurrently, and if they are in
|
||||
the same zone, the increment is always made before the decrement in
|
||||
order to avoid underflow. After each slab journal entry is made in
|
||||
memory, the associated reference count is also updated in memory.
|
||||
|
||||
12. Once both of the reference count updates are done, the data_vio
|
||||
acquires the implicit logical zone lock and updates the
|
||||
logical-to-physical mapping in the block map to point to the new
|
||||
physical block. At this point the write operation is complete.
|
||||
|
||||
13. If the data_vio has a hash lock, it acquires the implicit hash zone
|
||||
lock and releases its hash lock to the pool.
|
||||
|
||||
The data_vio then acquires the implicit physical zone lock and releases
|
||||
the struct pbn_lock it holds for its allocated block. If it had an
|
||||
allocation that it did not use, it also sets the reference count for
|
||||
that block back to zero to free it for use by subsequent data_vios.
|
||||
|
||||
The data_vio then acquires the implicit logical zone lock and releases
|
||||
the logical block lock acquired in step 2.
|
||||
|
||||
The application bio is then acknowledged if it has not previously been
|
||||
acknowledged, and the data_vio is returned to the pool.
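The logical lock from step 2 can be sketched as a small chained hash table keyed by logical address. All names here are hypothetical, the table size is arbitrary, and no locking is shown because the sketch assumes, as the steps above describe, that only the owning logical zone touches the table.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCK_BUCKETS 64 /* arbitrary size for this sketch */

/* Stand-in for a data_vio: only the fields the lock table needs. */
struct data_vio_stub {
	uint64_t logical;
	struct data_vio_stub *next_in_bucket;
};

struct lock_table {
	struct data_vio_stub *buckets[LOCK_BUCKETS];
};

/* Multiplicative hash; the top 6 bits select one of 64 buckets. */
static size_t bucket_of(uint64_t logical)
{
	return (size_t)((logical * 0x9E3779B97F4A7C15ULL) >> 58);
}

/*
 * Try to claim a logical address. Returns NULL if the claim succeeded,
 * or the current holder that the caller must wait behind.
 */
struct data_vio_stub *claim_logical(struct lock_table *table,
				    struct data_vio_stub *vio)
{
	struct data_vio_stub **head = &table->buckets[bucket_of(vio->logical)];
	struct data_vio_stub *holder;

	for (holder = *head; holder != NULL; holder = holder->next_in_bucket) {
		if (holder->logical == vio->logical)
			return holder;
	}
	vio->next_in_bucket = *head;
	*head = vio;
	return NULL;
}

/* Drop a claim, as in step 13. Does nothing if `vio` holds no claim. */
void release_logical(struct lock_table *table, struct data_vio_stub *vio)
{
	struct data_vio_stub **link = &table->buckets[bucket_of(vio->logical)];

	while (*link != NULL && *link != vio)
		link = &(*link)->next_in_bucket;
	if (*link != NULL)
		*link = vio->next_in_bucket;
}
```

A second data_vio for the same address gets the first one back from claim_logical(), which is the point at which it would notify the holder and wait. The sketch leaves out the wait queue and the notification.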
|
||||
|
||||
*Read Path*
|
||||
|
||||
An application read bio follows a much simpler set of steps. It does steps
|
||||
1 and 2 in the write path to obtain a data_vio and lock its logical
|
||||
address. If there is already a write data_vio in progress for that logical
|
||||
address that is guaranteed to complete, the read data_vio will copy the
|
||||
data from the write data_vio and return it. Otherwise, it will look up the
|
||||
logical-to-physical mapping by traversing the block map tree as in step 3,
|
||||
and then read and possibly decompress the indicated data at the indicated
|
||||
physical block address. A read data_vio will not allocate block map tree
|
||||
nodes if they are missing. If the interior block map nodes do not exist
|
||||
yet, the logical block map address must still be unmapped and the read
|
||||
data_vio will return all zeroes. A read data_vio handles cleanup and
|
||||
acknowledgment as in step 13, although it only needs to release the logical
|
||||
lock and return itself to the pool.
|
||||
|
||||
*Small Writes*
|
||||
|
||||
All storage within vdo is managed as 4KB blocks, but it can accept writes
|
||||
as small as 512 bytes. Processing a write that is smaller than 4K requires
|
||||
a read-modify-write operation that reads the relevant 4K block, copies the
|
||||
new data over the appropriate sectors of the block, and then launches a
|
||||
write operation for the modified data block. The read and write stages of
|
||||
this operation are nearly identical to the normal read and write
|
||||
operations, and a single data_vio is used throughout this operation.
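The modify step of the read-modify-write cycle is a bounded copy into the 4K buffer that was just read. A minimal sketch, with a hypothetical function name and the read and write stages omitted:

```c
#include <stdint.h>
#include <string.h>

#define VDO_BLOCK_BYTES 4096
#define SECTOR_BYTES 512
#define SECTORS_PER_BLOCK (VDO_BLOCK_BYTES / SECTOR_BYTES)

/*
 * Copy `sector_count` sectors of new data over a previously read 4K
 * block, starting at `first_sector`. Returns 0 on success, or -1 if the
 * range is empty or does not fit inside one block.
 */
int merge_partial_write(uint8_t *block, unsigned int first_sector,
			unsigned int sector_count, const uint8_t *data)
{
	if (first_sector >= SECTORS_PER_BLOCK || sector_count == 0 ||
	    sector_count > SECTORS_PER_BLOCK - first_sector)
		return -1;

	memcpy(block + (size_t)first_sector * SECTOR_BYTES, data,
	       (size_t)sector_count * SECTOR_BYTES);
	return 0;
}
```

After the merge, the whole 4K buffer would go through the normal write path, so deduplication and compression always operate on full blocks.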
|
||||
|
||||
*Recovery*
|
||||
|
||||
When a vdo is restarted after a crash, it will attempt to recover from the
|
||||
recovery journal. During the pre-resume phase of the next start, the
|
||||
recovery journal is read. The increment portion of valid entries is played
|
||||
into the block map. Next, valid entries are played, in order as required,
|
||||
into the slab journals. Finally, each physical zone attempts to replay at
|
||||
least one slab journal to reconstruct the reference counts of one slab.
|
||||
Once each zone has some free space (or has determined that it has none),
|
||||
the vdo comes back online, while the remainder of the slab journals are
|
||||
used to reconstruct the rest of the reference counts in the background.
|
||||
|
||||
*Read-only Rebuild*
|
||||
|
||||
If a vdo encounters an unrecoverable error, it will enter read-only mode.
|
||||
This mode indicates that some previously acknowledged data may have been
|
||||
lost. The vdo may be instructed to rebuild as best it can in order to
|
||||
return to a writable state. However, this is never done automatically due
|
||||
to the possibility that data has been lost. During a read-only rebuild, the
|
||||
block map is recovered from the recovery journal as before. However, the
|
||||
reference counts are not rebuilt from the slab journals. Instead, the
|
||||
reference counts are zeroed, the entire block map is traversed, and the
|
||||
reference counts are updated from the block mappings. While this may lose
|
||||
some data, it ensures that the block map and reference counts are
|
||||
consistent with each other. This allows vdo to resume normal operation and
|
||||
accept further writes.
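
The reference-count reconstruction in a read-only rebuild can be sketched as a simple recount. This is a toy model under stated assumptions: the block map is represented as a plain dictionary from logical to physical block numbers (the real block map is an on-disk tree), and ``rebuild_reference_counts`` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Optional


def rebuild_reference_counts(block_map: Dict[int, Optional[int]]) -> Counter:
    """Recount references by walking every logical-to-physical mapping.

    Hypothetical model: `block_map` maps a logical block number to a
    physical block number, or to None when the logical block is unmapped.
    """
    counts: Counter = Counter()  # every reference count starts at zero
    for physical in block_map.values():
        if physical is not None:  # unmapped logical blocks hold no reference
            counts[physical] += 1
    return counts
```

Because the counts are derived only from the mappings, they agree with the block map by construction, which is the consistency property the text describes.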
|
406
Documentation/admin-guide/device-mapper/vdo.rst
Normal file
@ -0,0 +1,406 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
|
||||
dm-vdo
|
||||
======
|
||||
|
||||
The dm-vdo (virtual data optimizer) device mapper target provides
|
||||
block-level deduplication, compression, and thin provisioning. As a device
|
||||
mapper target, it can add these features to the storage stack, compatible
|
||||
with any file system. The vdo target does not protect against data
|
||||
corruption, relying instead on integrity protection of the storage below
|
||||
it. It is strongly recommended that lvm be used to manage vdo volumes. See
|
||||
lvmvdo(7).
|
||||
|
||||
Userspace component
|
||||
===================
|
||||
|
||||
Formatting a vdo volume requires the use of the 'vdoformat' tool, available
|
||||
at:
|
||||
|
||||
https://github.com/dm-vdo/vdo/
|
||||
|
||||
In most cases, a vdo target will recover from a crash automatically the
|
||||
next time it is started. In cases where it encountered an unrecoverable
|
||||
error (either during normal operation or crash recovery) the target will
|
||||
enter or come up in read-only mode. Because read-only mode is indicative of
|
||||
data-loss, a positive action must be taken to bring vdo out of read-only
|
||||
mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
|
||||
prepare a read-only vdo to exit read-only mode. After running this tool,
|
||||
the vdo target will rebuild its metadata the next time it is
|
||||
started. Although some data may be lost, the rebuilt vdo's metadata will be
|
||||
internally consistent and the target will be writable again.
|
||||
|
||||
The repo also contains additional userspace tools which can be used to
|
||||
inspect a vdo target's on-disk metadata. Fortunately, these tools are
|
||||
rarely needed except by dm-vdo developers.
|
||||
|
||||
Metadata requirements
|
||||
=====================
|
||||
|
||||
Each vdo volume reserves 3GB of space for metadata, or more depending on
|
||||
its configuration. It is helpful to check that the space saved by
|
||||
deduplication and compression is not cancelled out by the metadata
|
||||
requirements. An estimation of the space saved for a specific dataset can
|
||||
be computed with the vdo estimator tool, which is available at:
|
||||
|
||||
https://github.com/dm-vdo/vdoestimator/
|
||||
|
||||
Target interface
|
||||
================
|
||||
|
||||
Table line
|
||||
----------
|
||||
|
||||
::
|
||||
|
||||
<offset> <logical device size> vdo V4 <storage device>
|
||||
<storage device size> <minimum I/O size> <block map cache size>
|
||||
<block map era length> [optional arguments]
|
||||
|
||||
|
||||
Required parameters:
|
||||
|
||||
offset:
|
||||
The offset, in sectors, at which the vdo volume's logical
|
||||
space begins.
|
||||
|
||||
logical device size:
|
||||
The size of the device which the vdo volume will service,
|
||||
in sectors. Must match the current logical size of the vdo
|
||||
volume.
|
||||
|
||||
storage device:
|
||||
The device holding the vdo volume's data and metadata.
|
||||
|
||||
storage device size:
|
||||
The size of the device holding the vdo volume, as a number
|
||||
of 4096-byte blocks. Must match the current size of the vdo
|
||||
volume.
|
||||
|
||||
minimum I/O size:
|
||||
The minimum I/O size for this vdo volume to accept, in
|
||||
bytes. Valid values are 512 or 4096. The recommended value
|
||||
is 4096.
|
||||
|
||||
block map cache size:
|
||||
The size of the block map cache, as a number of 4096-byte
|
||||
blocks. The minimum and recommended value is 32768 blocks.
|
||||
If the logical thread count is non-zero, the cache size
|
||||
must be at least 4096 blocks per logical thread.
|
||||
|
||||
block map era length:
|
||||
The speed with which the block map cache writes out
|
||||
modified block map pages. A smaller era length is likely to
|
||||
reduce the amount of time spent rebuilding, at the cost of
|
||||
increased block map writes during normal operation. The
|
||||
maximum and recommended value is 16380; the minimum value
|
||||
is 1.
|
||||
|
||||
Optional parameters:
|
||||
--------------------
|
||||
Some or all of these parameters may be specified as <key> <value> pairs.
|
||||
|
||||
Thread related parameters:
|
||||
|
||||
Different categories of work are assigned to separate thread groups, and
|
||||
the number of threads in each group can be configured separately.
|
||||
|
||||
If <hash>, <logical>, and <physical> are all set to 0, the work handled by
|
||||
all three thread types will be handled by a single thread. If any of these
|
||||
values are non-zero, all of them must be non-zero.
|
||||
|
||||
ack:
|
||||
The number of threads used to complete bios. Since
|
||||
completing a bio calls an arbitrary completion function
|
||||
outside the vdo volume, threads of this type allow the vdo
|
||||
volume to continue processing requests even when bio
|
||||
completion is slow. The default is 1.
|
||||
|
||||
bio:
|
||||
The number of threads used to issue bios to the underlying
|
||||
storage. Threads of this type allow the vdo volume to
|
||||
continue processing requests even when bio submission is
|
||||
slow. The default is 4.
|
||||
|
||||
bioRotationInterval:
|
||||
The number of bios to enqueue on each bio thread before
|
||||
switching to the next thread. The value must be greater
|
||||
than 0 and not more than 1024; the default is 64.
|
||||
|
||||
cpu:
|
||||
The number of threads used to do CPU-intensive work, such
|
||||
as hashing and compression. The default is 1.
|
||||
|
||||
hash:
|
||||
The number of threads used to manage data comparisons for
|
||||
deduplication based on the hash value of data blocks. The
|
||||
default is 0.
|
||||
|
||||
logical:
|
||||
The number of threads used to manage caching and locking
|
||||
based on the logical address of incoming bios. The default
|
||||
is 0; the maximum is 60.
|
||||
|
||||
physical:
|
||||
The number of threads used to manage administration of the
|
||||
underlying storage device. At format time, a slab size for
|
||||
the vdo is chosen; the vdo storage device must be large
|
||||
enough to have at least 1 slab per physical thread. The
|
||||
default is 0; the maximum is 16.
|
||||
|
||||
Miscellaneous parameters:
|
||||
|
||||
maxDiscard:
|
||||
The maximum size of discard bio accepted, in 4096-byte
|
||||
blocks. I/O requests to a vdo volume are normally split
|
||||
into 4096-byte blocks, and processed up to 2048 at a time.
|
||||
However, discard requests to a vdo volume can be
|
||||
automatically split to a larger size, up to <maxDiscard>
|
||||
4096-byte blocks in a single bio, and are limited to 1500
|
||||
at a time. Increasing this value may provide better overall
|
||||
performance, at the cost of increased latency for the
|
||||
individual discard requests. The default and minimum is 1;
|
||||
the maximum is UINT_MAX / 4096.
|
||||
|
||||
deduplication:
|
||||
Whether deduplication is enabled. The default is 'on'; the
|
||||
acceptable values are 'on' and 'off'.
|
||||
|
||||
compression:
|
||||
Whether compression is enabled. The default is 'off'; the
|
||||
acceptable values are 'on' and 'off'.
|
||||
|
||||
Device modification
|
||||
-------------------
|
||||
|
||||
A modified table may be loaded into a running, non-suspended vdo volume.
|
||||
The modifications will take effect when the device is next resumed. The
|
||||
modifiable parameters are <logical device size>, <physical device size>,
|
||||
<maxDiscard>, <compression>, and <deduplication>.
|
||||
|
||||
If the logical device size or physical device size are changed, upon
|
||||
successful resume vdo will store the new values and require them on future
|
||||
startups. These two parameters may not be decreased. The logical device
|
||||
size may not exceed 4 PB. The physical device size must increase by at
|
||||
least 32832 4096-byte blocks if at all, and must not exceed the size of the
|
||||
underlying storage device. Additionally, when formatting the vdo device, a
|
||||
slab size is chosen: the physical device size may never increase above the
|
||||
size which provides 8192 slabs, and each increase must be large enough to
|
||||
add at least one new slab.
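
As a rough illustration of the resize rules above, the following sketch checks a proposed change before a table reload. The function name ``check_resize`` is hypothetical, 4 PB is assumed to mean 4 * 1024^5 bytes, and the slab-related limits are not checked because the slab size is chosen at format time and is not known here.

```python
SECTOR_SIZE = 512
MAX_LOGICAL_BYTES = 4 * 1024**5      # 4 PB, assuming binary petabytes
MIN_PHYSICAL_GROWTH_BLOCKS = 32832   # minimum growth, in 4096-byte blocks


def check_resize(old_logical_sectors, new_logical_sectors,
                 old_physical_blocks, new_physical_blocks):
    """Return a list of rule violations; an empty list means none were found.

    Hypothetical helper; slab-count limits are deliberately not modelled.
    """
    problems = []
    if new_logical_sectors < old_logical_sectors:
        problems.append("logical size may not decrease")
    if new_logical_sectors * SECTOR_SIZE > MAX_LOGICAL_BYTES:
        problems.append("logical size may not exceed 4 PB")
    growth = new_physical_blocks - old_physical_blocks
    if growth < 0:
        problems.append("physical size may not decrease")
    elif 0 < growth < MIN_PHYSICAL_GROWTH_BLOCKS:
        problems.append("physical growth must be at least 32832 blocks")
    return problems
```

A check like this only catches the arithmetic rules; the target itself remains the authority on whether a reload is accepted.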
|
||||
|
||||
Examples:
|
||||
|
||||
Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
|
||||
physical space, storing to /dev/dm-1 which has more than 1 GB of space.
|
||||
|
||||
::
|
||||
|
||||
dmsetup create vdo0 --table \
  "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"
|
||||
|
||||
Grow the logical size to 4 GB.
|
||||
|
||||
::
|
||||
|
||||
dmsetup reload vdo0 --table \
  "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
dmsetup resume vdo0
|
||||
|
||||
Grow the physical size to 2 GB.
|
||||
|
||||
::
|
||||
|
||||
dmsetup reload vdo0 --table \
  "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
dmsetup resume vdo0
|
||||
|
||||
Grow the physical size by 1 GB more and increase max discard sectors.
|
||||
|
||||
::
|
||||
|
||||
dmsetup reload vdo0 --table \
  "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
dmsetup resume vdo0
|
||||
|
||||
Stop the vdo volume.
|
||||
|
||||
::
|
||||
|
||||
dmsetup remove vdo0
|
||||
|
||||
Start the vdo volume again. Note that the logical and physical device sizes
|
||||
must still match, but other parameters can change.
|
||||
|
||||
::
|
||||
|
||||
dmsetup create vdo1 --table \
  "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"
|
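
The numbers in the table lines above follow from two unit conversions: the logical size is given in 512-byte sectors and the storage size in 4096-byte blocks. A minimal sketch of that arithmetic, assuming a hypothetical helper named ``vdo_table_line`` and the recommended values from the parameter descriptions as defaults:

```python
SECTOR_SIZE = 512
BLOCK_SIZE = 4096


def vdo_table_line(logical_bytes, storage_device, physical_bytes,
                   min_io_size=4096, cache_blocks=32768, era_length=16380,
                   **optional):
    """Build a dm-vdo table line string (hypothetical helper).

    Optional parameters are appended as <key> <value> pairs in the order
    they are passed.
    """
    if min_io_size not in (512, 4096):
        raise ValueError("minimum I/O size must be 512 or 4096")
    fields = [
        "0",                                  # offset, in sectors
        str(logical_bytes // SECTOR_SIZE),    # logical size, in sectors
        "vdo", "V4", storage_device,
        str(physical_bytes // BLOCK_SIZE),    # storage size, in 4 KB blocks
        str(min_io_size), str(cache_blocks), str(era_length),
    ]
    for key, value in optional.items():
        fields += [key, str(value)]
    return " ".join(fields)
```

With ``GIB = 1024**3``, ``vdo_table_line(GIB, "/dev/dm-1", GIB)`` reproduces the table line of the first example, which is a quick way to confirm that 1 GB corresponds to 2097152 sectors and 262144 blocks.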
||||
|
||||
Messages
|
||||
--------
|
||||
All vdo devices accept messages in the form:
|
||||
|
||||
::
|
||||
dmsetup message <target-name> 0 <message-name> <message-parameters>
|
||||
|
||||
The messages are:
|
||||
|
||||
stats:
|
||||
Outputs the current view of the vdo statistics. Mostly used
|
||||
by the vdostats userspace program to interpret the output
|
||||
buffer.
|
||||
|
||||
dump:
|
||||
Dumps many internal structures to the system log. This is
|
||||
not always safe to run, so it should only be used to debug
|
||||
a hung vdo. Optional parameters to specify structures to
|
||||
dump are:
|
||||
|
||||
viopool: The pool of I/O requests for incoming bios
|
||||
pools: A synonym of 'viopool'
|
||||
vdo: Most of the structures managing on-disk data
|
||||
queues: Basic information about each vdo thread
|
||||
threads: A synonym of 'queues'
|
||||
default: Equivalent to 'queues vdo'
|
||||
all: All of the above.
|
||||
|
||||
dump-on-shutdown:
|
||||
Perform a default dump next time vdo shuts down.
|
||||
|
||||
|
||||
Status
|
||||
------
|
||||
|
||||
::
|
||||
|
||||
<device> <operating mode> <in recovery> <index state>
|
||||
<compression state> <physical blocks used> <total physical blocks>
|
||||
|
||||
device:
|
||||
The name of the vdo volume.
|
||||
|
||||
operating mode:
|
||||
The current operating mode of the vdo volume; values may be
|
||||
'normal', 'recovering' (the volume has detected an issue
|
||||
with its metadata and is attempting to repair itself), and
|
||||
'read-only' (an error has occurred that forces the vdo
|
||||
volume to only support read operations and not writes).
|
||||
|
||||
in recovery:
|
||||
Whether the vdo volume is currently in recovery mode;
|
||||
values may be 'recovering' or '-' which indicates not
|
||||
recovering.
|
||||
|
||||
index state:
|
||||
The current state of the deduplication index in the vdo
|
||||
volume; values may be 'closed', 'closing', 'error',
|
||||
'offline', 'online', 'opening', and 'unknown'.
|
||||
|
||||
compression state:
|
||||
The current state of compression in the vdo volume; values
|
||||
may be 'offline' and 'online'.
|
||||
|
||||
used physical blocks:
|
||||
The number of physical blocks in use by the vdo volume.
|
||||
|
||||
total physical blocks:
|
||||
The total number of physical blocks the vdo volume may use;
|
||||
the difference between this value and the
|
||||
<used physical blocks> is the number of blocks the vdo
|
||||
volume has left before being full.
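
To make the field layout concrete, here is a small parsing sketch. It assumes the caller passes only the seven target-specific fields listed above (any leading fields printed by ``dmsetup status`` are assumed to have been stripped already); ``VdoStatus`` and ``parse_vdo_status`` are hypothetical names, and the sample string in the usage note is made up.

```python
from typing import NamedTuple


class VdoStatus(NamedTuple):
    device: str
    operating_mode: str
    in_recovery: bool
    index_state: str
    compression_state: str
    used_physical_blocks: int
    total_physical_blocks: int

    @property
    def free_physical_blocks(self) -> int:
        # Blocks left before the volume is full, as described above.
        return self.total_physical_blocks - self.used_physical_blocks


def parse_vdo_status(fields: str) -> VdoStatus:
    """Parse the seven vdo status fields from one whitespace-separated line."""
    parts = fields.split()
    if len(parts) != 7:
        raise ValueError("expected 7 status fields, got %d" % len(parts))
    device, mode, recovery, index, compression, used, total = parts
    return VdoStatus(device, mode, recovery == "recovering", index,
                     compression, int(used), int(total))
```

For example, parsing the invented line ``/dev/dm-1 normal - online offline 1024 262144`` yields ``in_recovery`` False and 261120 free physical blocks.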
|
||||
|
||||
Memory Requirements
|
||||
===================
|
||||
|
||||
A vdo target requires a fixed 38 MB of RAM along with the following amounts
|
||||
that scale with the target:
|
||||
|
||||
- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
|
||||
block map cache requires a minimum of 150 MB.
|
||||
- 1.6 MB of RAM for each 1 TB of logical space.
|
||||
- 268 MB of RAM for each 1 TB of physical storage managed by the volume.
|
||||
|
||||
The deduplication index requires additional memory which scales with the
|
||||
size of the deduplication window. For dense indexes, the index requires 1
|
||||
GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
|
||||
of RAM per 10 TB of window. The index configuration is set when the target
|
||||
is formatted and may not be modified.
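
The figures above can be combined into a rough estimate. The sketch below is only arithmetic on the numbers quoted in this section; the function name ``vdo_ram_estimate_mb`` is hypothetical, the 150 MB minimum is interpreted as a floor on the block map cache term, and 1 GB is taken as 1024 MB.

```python
def vdo_ram_estimate_mb(cache_mb, logical_tb, physical_tb,
                        window_tb=0.0, sparse=False):
    """Estimate RAM use in MB from the figures quoted above (a sketch)."""
    fixed = 38.0
    # 1.15 MB per MB of configured cache, assumed never below 150 MB.
    cache = max(1.15 * cache_mb, 150.0)
    logical = 1.6 * logical_tb
    physical = 268.0 * physical_tb
    # Dense index: 1 GB per 1 TB of window; sparse: 1 GB per 10 TB.
    index_gb = window_tb / 10.0 if sparse else window_tb
    return fixed + cache + logical + physical + index_gb * 1024.0
```

For a 128 MB cache with 1 TB of logical space and 1 TB of physical storage, this gives roughly 457.6 MB before the deduplication index is counted.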
|
||||
|
||||
Module Parameters
|
||||
=================
|
||||
|
||||
The vdo driver has a numeric parameter 'log_level' which controls the
|
||||
verbosity of logging from the driver. The default setting is 6
|
||||
(LOGLEVEL_INFO and more severe messages).
|
||||
|
||||
Run-time Usage
|
||||
==============
|
||||
|
||||
When using dm-vdo, it is important to be aware of the ways in which its
|
||||
behavior differs from other storage targets.
|
||||
|
||||
- There is no guarantee that over-writes of existing blocks will succeed.
|
||||
Because the underlying storage may be multiply referenced, over-writing
|
||||
an existing block generally requires a vdo to have a free block
|
||||
available.
|
||||
|
||||
- When blocks are no longer in use, sending a discard request for those
|
||||
blocks lets the vdo release references for those blocks. If the vdo is
|
||||
thinly provisioned, discarding unused blocks is essential to prevent the
|
||||
target from running out of space. However, due to the sharing of
|
||||
duplicate blocks, no discard request for any given logical block is
|
||||
guaranteed to reclaim space.
|
||||
|
||||
- Assuming the underlying storage properly implements flush requests, vdo
|
||||
is resilient against crashes; however, unflushed writes may or may not
|
||||
persist after a crash.
|
||||
|
||||
- Each write to a vdo target entails a significant amount of processing.
|
||||
However, much of the work is parallelizable. Therefore, vdo targets
|
||||
achieve better throughput at higher I/O depths, and can support up to 2048
|
||||
requests in parallel.
|
||||
|
||||
Tuning
|
||||
======
|
||||
|
||||
The vdo device has many options, and it can be difficult to make optimal
|
||||
choices without perfect knowledge of the workload. Additionally, most
|
||||
configuration options must be set when a vdo target is started, and cannot
|
||||
be changed without shutting it down completely; the configuration cannot be
|
||||
changed while the target is active. Ideally, tuning with simulated
|
||||
workloads should be performed before deploying vdo in production
|
||||
environments.
|
||||
|
||||
The most important value to adjust is the block map cache size. In order to
|
||||
service a request for any logical address, a vdo must load the portion of
|
||||
the block map which holds the relevant mapping. These mappings are cached.
|
||||
Performance will suffer when the working set does not fit in the cache. By
|
||||
default, a vdo allocates 128 MB of metadata cache in RAM to support
|
||||
efficient access to 100 GB of logical space at a time. It should be scaled
|
||||
up proportionally for larger working sets.
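
Proportional scaling of the block map cache can be worked out as follows. This is a sketch of the rule of thumb in the paragraph above (128 MB, that is 32768 blocks of 4096 bytes, per 100 GB of working set); the function name is hypothetical, and the result is never allowed below the documented minimum of 32768 blocks.

```python
import math

DEFAULT_CACHE_BLOCKS = 32768   # 128 MB expressed in 4096-byte blocks
DEFAULT_COVERAGE_GB = 100      # logical space covered by the default cache


def cache_blocks_for_working_set(working_set_gb: float) -> int:
    """Scale the block map cache size linearly with the working set."""
    scaled = math.ceil(DEFAULT_CACHE_BLOCKS * working_set_gb / DEFAULT_COVERAGE_GB)
    return max(DEFAULT_CACHE_BLOCKS, scaled)
```

A 500 GB working set would therefore suggest 163840 blocks (640 MB) as the <block map cache size> value, subject to the memory cost described under Memory Requirements.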
|
||||
|
||||
The logical and physical thread counts should also be adjusted. A logical
|
||||
thread controls a disjoint section of the block map, so additional logical
|
||||
threads increase parallelism and can increase throughput. Physical threads
|
||||
control a disjoint section of the data blocks, so additional physical
|
||||
threads can also increase throughput. However, excess threads can waste
|
||||
resources and increase contention.
|
||||
|
||||
Bio submission threads control the parallelism involved in sending I/O to
|
||||
the underlying storage; fewer threads mean there is more opportunity to
|
||||
reorder I/O requests for performance benefit, but also that each I/O
|
||||
request has to wait longer before being submitted.
|
||||
|
||||
Bio acknowledgment threads are used for finishing I/O requests. This is
|
||||
done on dedicated threads since the amount of work required to execute a
|
||||
bio's callback can not be controlled by the vdo itself. Usually one thread
|
||||
is sufficient but additional threads may be beneficial, particularly when
|
||||
bios have CPU-heavy callbacks.
|
||||
|
||||
CPU threads are used for hashing and for compression; in workloads with
|
||||
compression enabled, more threads may result in higher throughput.
|
||||
|
||||
Hash threads are used to sort active requests by hash and determine whether
|
||||
they should deduplicate; the most CPU intensive actions done by these
|
||||
threads are comparison of 4096-byte data blocks. In most cases, a single
|
||||
hash thread is sufficient.
|
@ -24,37 +24,4 @@ restrictions later on.
|
||||
As a remedy for such situations, the kernel configuration item
|
||||
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows providing an
|
||||
individually prepared or corrected EDID data set in the /lib/firmware
|
||||
directory from where it is loaded via the firmware interface. The code
|
||||
(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
|
||||
commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
|
||||
1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
|
||||
not contain code to create these data. In order to elucidate the origin
|
||||
of the built-in binary EDID blobs and to facilitate the creation of
|
||||
individual data for a specific misbehaving monitor, commented sources
|
||||
and a Makefile environment are given here.
|
||||
|
||||
To create binary EDID and C source code files from the existing data
|
||||
material, simply type "make" in tools/edid/.
|
||||
|
||||
If you want to create your own EDID file, copy the file 1024x768.S,
|
||||
replace the settings with your own data and add a new target to the
|
||||
Makefile. Please note that the EDID data structure expects the timing
|
||||
values in a different way as compared to the standard X11 format.
|
||||
|
||||
X11:
|
||||
HTimings:
|
||||
hdisp hsyncstart hsyncend htotal
|
||||
VTimings:
|
||||
vdisp vsyncstart vsyncend vtotal
|
||||
|
||||
EDID::
|
||||
|
||||
#define XPIX hdisp
|
||||
#define XBLANK htotal-hdisp
|
||||
#define XOFFSET hsyncstart-hdisp
|
||||
#define XPULSE hsyncend-hsyncstart
|
||||
|
||||
#define YPIX vdisp
|
||||
#define YBLANK vtotal-vdisp
|
||||
#define YOFFSET vsyncstart-vdisp
|
||||
#define YPULSE vsyncend-vsyncstart
|
||||
directory from where it is loaded via the firmware interface.
|
||||
|
@ -3,6 +3,14 @@
|
||||
GPIO Testing Driver
|
||||
===================
|
||||
|
||||
.. note::
|
||||
|
||||
This module has been obsoleted by the more flexible gpio-sim.rst.
|
||||
New developments should use that API and existing developments are
|
||||
encouraged to migrate as soon as possible.
|
||||
This module will continue to be maintained but no new features will be
|
||||
added.
|
||||
|
||||
The GPIO Testing Driver (gpio-mockup) provides a way to create simulated GPIO
|
||||
chips for testing purposes. The lines exposed by these chips can be accessed
|
||||
using the standard GPIO character device interface as well as manipulated
|
||||
|
@ -1,16 +1,16 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====
|
||||
gpio
|
||||
GPIO
|
||||
====
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
Character Device Userspace API <../../userspace-api/gpio/chardev>
|
||||
gpio-aggregator
|
||||
sysfs
|
||||
gpio-mockup
|
||||
gpio-sim
|
||||
Obsolete APIs <obsolete>
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
|
13
Documentation/admin-guide/gpio/obsolete.rst
Normal file
@ -0,0 +1,13 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==================
|
||||
Obsolete GPIO APIs
|
||||
==================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
Character Device Userspace API (v1) <../../userspace-api/gpio/chardev_v1>
|
||||
Sysfs Interface <../../userspace-api/gpio/sysfs>
|
||||
Mockup Testing Module <gpio-mockup>
|
||||
|
@ -21,3 +21,4 @@ are configurable at compile, boot or run time.
|
||||
cross-thread-rsb
|
||||
srso
|
||||
gather_data_sampling
|
||||
reg-file-data-sampling
|
||||
|
104
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
Normal file
@ -0,0 +1,104 @@
|
||||
==================================
|
||||
Register File Data Sampling (RFDS)
|
||||
==================================
|
||||
|
||||
Register File Data Sampling (RFDS) is a microarchitectural vulnerability that
|
||||
only affects Intel Atom parts (also branded as E-cores). RFDS may allow
|
||||
a malicious actor to infer data values previously used in floating point
|
||||
registers, vector registers, or integer registers. RFDS does not provide the
|
||||
ability to choose which data is inferred. CVE-2023-28746 is assigned to RFDS.
|
||||
|
||||
Affected Processors
|
||||
===================
|
||||
Below is the list of affected Intel processors [#f1]_:
|
||||
|
||||
=================== ============
Common name         Family_Model
=================== ============
ATOM_GOLDMONT       06_5CH
ATOM_GOLDMONT_D     06_5FH
ATOM_GOLDMONT_PLUS  06_7AH
ATOM_TREMONT_D      06_86H
ATOM_TREMONT        06_96H
ALDERLAKE           06_97H
ALDERLAKE_L         06_9AH
ATOM_TREMONT_L      06_9CH
RAPTORLAKE          06_B7H
RAPTORLAKE_P        06_BAH
ATOM_GRACEMONT      06_BEH
RAPTORLAKE_S        06_BFH
=================== ============
|
||||
|
||||
As an exception to this table, Intel Xeon E family parts ALDERLAKE (06_97H) and
|
||||
RAPTORLAKE (06_B7H) codenamed Catlow are not affected. They are reported as
|
||||
vulnerable in Linux because they share the same family/model with an affected
|
||||
part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or
|
||||
CPUID.HYBRID. This information could be used to distinguish between the
|
||||
affected and unaffected parts, but it is deemed not worth adding complexity as
|
||||
the reporting is fixed automatically when these parts enumerate RFDS_NO.
|
||||
|
||||
Mitigation
|
||||
==========
|
||||
Intel released a microcode update that enables software to clear sensitive
|
||||
information using the VERW instruction. Like MDS, RFDS deploys the same
|
||||
mitigation strategy to force the CPU to clear the affected buffers before an
|
||||
attacker can extract the secrets. This is achieved by using the otherwise
|
||||
unused and obsolete VERW instruction in combination with a microcode update.
|
||||
The microcode clears the affected CPU buffers when the VERW instruction is
|
||||
executed.
|
||||
|
||||
Mitigation points
|
||||
-----------------
|
||||
VERW is executed by the kernel before returning to user space, and by KVM
|
||||
before VMentry. None of the affected cores support SMT, so VERW is not required
|
||||
at C-state transitions.
|
||||
|
||||
New bits in IA32_ARCH_CAPABILITIES
|
||||
----------------------------------
|
||||
Newer processors and microcode updates on existing affected processors added new
|
||||
bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
|
||||
vulnerability and mitigation capability:
|
||||
|
||||
- Bit 27 - RFDS_NO - When set, processor is not affected by RFDS.
|
||||
- Bit 28 - RFDS_CLEAR - When set, processor is affected by RFDS, and has the
|
||||
microcode that clears the affected buffers on VERW execution.
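
Decoding the two bits is a matter of masking. The sketch below assumes the caller has already obtained the 64-bit value of the IA32_ARCH_CAPABILITIES MSR by some other means; reading the MSR is out of scope here, and ``decode_rfds_bits`` is a hypothetical name.

```python
RFDS_NO = 1 << 27      # bit 27: processor is not affected by RFDS
RFDS_CLEAR = 1 << 28   # bit 28: affected, and VERW clears the buffers


def decode_rfds_bits(arch_capabilities: int) -> dict:
    """Report the RFDS-related bits of an IA32_ARCH_CAPABILITIES value."""
    return {
        "rfds_no": bool(arch_capabilities & RFDS_NO),
        "rfds_clear": bool(arch_capabilities & RFDS_CLEAR),
    }
```

A value with neither bit set does not by itself say whether a part is affected; as described above, the kernel then falls back on the family/model table.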
|
||||
|
||||
Mitigation control on the kernel command line
|
||||
---------------------------------------------
|
||||
The kernel command line allows control of RFDS mitigation at boot time with the
|
||||
parameter "reg_file_data_sampling=". The valid arguments are:
|
||||
|
||||
========== =================================================================
on         If the CPU is vulnerable, enable mitigation; CPU buffer clearing
           on exit to userspace and before entering a VM.
off        Disables mitigation.
========== =================================================================
|
||||
|
||||
Mitigation default is selected by CONFIG_MITIGATION_RFDS.
|
||||
|
||||
Mitigation status information
|
||||
-----------------------------
|
||||
The Linux kernel provides a sysfs interface to enumerate the current
|
||||
vulnerability status of the system: whether the system is vulnerable, and
|
||||
which mitigations are active. The relevant sysfs file is:
|
||||
|
||||
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
|
||||
|
||||
The possible values in this file are:
|
||||
|
||||
.. list-table::

   * - 'Not affected'
     - The processor is not vulnerable.
   * - 'Vulnerable'
     - The processor is vulnerable, but no mitigation is enabled.
   * - 'Vulnerable: No microcode'
     - The processor is vulnerable but microcode is not updated.
   * - 'Mitigation: Clear Register File'
     - The processor is vulnerable and the CPU buffer clearing mitigation is
       enabled.
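
A monitoring script might reduce these strings to a coarse state. The following is a minimal sketch, assuming the sysfs file holds one of the values listed above without the surrounding quotes; the function names and the returned labels are hypothetical.

```python
from pathlib import Path

RFDS_SYSFS = Path(
    "/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling"
)


def classify_rfds_status(text: str) -> str:
    """Map the sysfs string to a coarse label (labels are illustrative)."""
    status = text.strip()
    if status == "Not affected":
        return "not-affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_rfds_status(path: Path = RFDS_SYSFS) -> str:
    """Read and classify the status; 'unknown' if the file is unreadable."""
    try:
        return classify_rfds_status(path.read_text())
    except OSError:
        # Kernels without RFDS reporting will not have this file.
        return "unknown"
```

Matching on the ``Vulnerable`` prefix covers both the plain and the "No microcode" variants, so callers that need to tell them apart should inspect the full string.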
|
||||
|
||||
References
|
||||
----------
|
||||
.. [#f1] Affected Processors
|
||||
https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
|
@ -473,8 +473,8 @@ Spectre variant 2
|
||||
-mindirect-branch=thunk-extern -mindirect-branch-register options.
|
||||
If the kernel is compiled with a Clang compiler, the compiler needs
|
||||
to support -mretpoline-external-thunk option. The kernel config
|
||||
CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
|
||||
the latest updated microcode.
|
||||
CONFIG_MITIGATION_RETPOLINE needs to be turned on, and the CPU needs
|
||||
to run with the latest updated microcode.
|
||||
|
||||
On Intel Skylake-era systems the mitigation covers most, but not all,
|
||||
cases. See :ref:`[3] <spec_ref3>` for more details.
|
||||
@ -609,8 +609,8 @@ kernel command line.
|
||||
Selecting 'on' will, and 'auto' may, choose a
|
||||
mitigation method at run time according to the
|
||||
CPU, the available microcode, the setting of the
|
||||
CONFIG_RETPOLINE configuration option, and the
|
||||
compiler with which the kernel was built.
|
||||
CONFIG_MITIGATION_RETPOLINE configuration option,
|
||||
and the compiler with which the kernel was built.
|
||||
|
||||
Selecting 'on' will also enable the mitigation
|
||||
against user space to user space task attacks.
|
||||
|
@ -1,3 +1,4 @@
|
||||
=================================================
|
||||
The Linux kernel user's and administrator's guide
|
||||
=================================================
|
||||
|
||||
@ -37,6 +38,7 @@ problems and bugs in particular.
|
||||
reporting-issues
|
||||
reporting-regressions
|
||||
quickly-build-trimmed-linux
|
||||
verify-bugs-and-bisect-regressions
|
||||
bug-hunting
|
||||
bug-bisect
|
||||
tainted-kernels
|
||||
@ -122,7 +124,7 @@ configure specific aspects of kernel behavior to your liking.
|
||||
pmf
|
||||
pnp
|
||||
rapidio
|
||||
ras
|
||||
RAS/index
|
||||
rtc
|
||||
serial-console
|
||||
svga
|
||||
|
@ -191,9 +191,7 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
|
||||
CPU is enough for kdump kernel to dump vmcore on most of systems.
|
||||
|
||||
However, you can also specify nr_cpus=X to enable multiple processors
|
||||
in kdump kernel. In this case, "disable_cpu_apicid=" is needed to
|
||||
tell kdump kernel which cpu is 1st kernel's BSP. Please refer to
|
||||
admin-guide/kernel-parameters.txt for more details.
|
||||
in kdump kernel.
|
||||
|
||||
With CONFIG_SMP=n, the above things are not related.
|
||||
|
||||
@ -454,8 +452,7 @@ Notes on loading the dump-capture kernel:
|
||||
to use multi-thread programs with it, such as parallel dump feature of
|
||||
makedumpfile. Otherwise, the multi-thread program may have a great
|
||||
performance degradation. To enable multi-cpu support, you should bring up an
|
||||
SMP dump-capture kernel and specify maxcpus/nr_cpus, disable_cpu_apicid=[X]
|
||||
options while loading it.
|
||||
SMP dump-capture kernel and specify maxcpus/nr_cpus options while loading it.
|
||||
|
||||
* For s390x there are two kdump modes: If an ELF header is specified with
|
||||
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
|
||||
|
@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
|
||||
the kernel start address. Used to convert a virtual address from the
|
||||
direct kernel map to a physical address.
|
||||
|
||||
vmap_area_list
|
||||
--------------
|
||||
VMALLOC_START
|
||||
-------------
|
||||
|
||||
Stores the virtual area list. makedumpfile gets the vmalloc start value
|
||||
from this variable and its value is necessary for vmalloc translation.
|
||||
Stores the base address of vmalloc area. makedumpfile gets this value
|
||||
since it is necessary for vmalloc translation.
|
||||
|
||||
mem_map
|
||||
-------
|
||||
|
@ -108,6 +108,7 @@ is applicable::
|
||||
CMA Contiguous Memory Area support is enabled.
|
||||
DRM Direct Rendering Management support is enabled.
|
||||
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
|
||||
EARLY Parameter processed too early to be embedded in initrd.
|
||||
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
|
||||
EFI EFI Partitioning (GPT) is enabled
|
||||
EVM Extended Verification Module
|
||||
|
File diff suppressed because it is too large
Load Diff
@ -444,7 +444,9 @@ event code Key Notes
|
||||
|
||||
0x1008 0x07 FN+F8 IBM: toggle screen expand
|
||||
Lenovo: configure UltraNav,
|
||||
or toggle screen expand
|
||||
or toggle screen expand.
|
||||
On newer platforms (2024+)
|
||||
replaced by 0x131f (see below)
|
||||
|
||||
0x1009 0x08 FN+F9 -
|
||||
|
||||
@ -504,6 +506,9 @@ event code Key Notes
|
||||
|
||||
0x1019 0x18 unknown
|
||||
|
||||
0x131f ... FN+F8 Platform Mode change.
|
||||
Implemented in driver.
|
||||
|
||||
... ... ...
|
||||
|
||||
0x1020 0x1F unknown
|
||||
|
@ -49,6 +49,10 @@ Module parameters
|
||||
visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
|
||||
buffer data through debugfs instead.
|
||||
|
||||
- tpg_verbose: Write extra information on each output frame to ease debugging
|
||||
the API. When set to true, the output frames are not stable for a given input
|
||||
as some information like pointers or queue status will be added to them.
|
||||
|
||||
What is the default use case for this driver?
|
||||
---------------------------------------------
|
||||
|
||||
@ -57,8 +61,12 @@ This assumes that a working client is run against visl and that the ftrace and
|
||||
OUTPUT buffer data is subsequently used to debug a work-in-progress
|
||||
implementation.
|
||||
|
||||
Information on reference frames, their timestamps, the status of the OUTPUT and
|
||||
CAPTURE queues and more can be read directly from the CAPTURE buffers.
|
||||
Even though no video decoding is actually done, the output frames can be used
|
||||
against a reference for a given input, except if tpg_verbose is set to true.
|
||||
|
||||
Depending on the tpg_verbose parameter value, information on reference frames,
|
||||
their timestamps, the status of the OUTPUT and CAPTURE queues and more can be
|
||||
read directly from the CAPTURE buffers.
|
||||
|
||||
Supported codecs
|
||||
----------------
|
||||
|
@ -60,7 +60,7 @@ all configurable using the following module options:
|
||||
- node_types:
|
||||
|
||||
which devices should each driver instance create. An array of
|
||||
hexadecimal values, one for each instance. The default is 0x1d3d.
|
||||
hexadecimal values, one for each instance. The default is 0xe1d3d.
|
||||
Each value is a bitmask with the following meaning:
|
||||
|
||||
- bit 0: Video Capture node
|
||||
|
@ -117,6 +117,33 @@ milliseconds.
|
||||
|
||||
1 second by default.
|
||||
|
||||
quota_mem_pressure_us
|
||||
---------------------
|
||||
|
||||
Desired level of memory pressure-stall time in microseconds.
|
||||
|
||||
While keeping the caps that are set by other quotas, DAMON_RECLAIM automatically
|
||||
increases and decreases the effective level of the quota so that this level of
|
||||
memory pressure is incurred. System-wide ``some`` memory PSI in microseconds
|
||||
per quota reset interval (``quota_reset_interval_ms``) is collected and
|
||||
compared to this value to see if the aim is satisfied. Value zero means
|
||||
disabling this auto-tuning feature.
|
||||
|
||||
Disabled by default.
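
The comparison described above can be approximated from user space. The following is a minimal sketch, not DAMON_RECLAIM's internal code: it assumes the usual PSI file ``/proc/pressure/memory``, and the function name ``psi_some_total_us`` and its optional path argument are made up for illustration.

```shell
# Sketch: print the cumulative "some" memory stall time in microseconds.
# The function name and the optional path argument are illustrative
# assumptions; /proc/pressure/memory is the usual PSI location.
psi_some_total_us() {
    psi_file="${1:-/proc/pressure/memory}"
    # A "some" line looks like:
    #   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    awk '$1 == "some" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^total=/) { sub(/^total=/, "", $i); print $i }
    }' "$psi_file"
}
```

Sampling this value twice, one ``quota_reset_interval_ms`` apart, and subtracting gives a stall time that can be compared with ``quota_mem_pressure_us``.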
|
||||
|
||||
quota_autotune_feedback
|
||||
-----------------------
|
||||
|
||||
User-specifiable feedback for auto-tuning of the effective quota.
|
||||
|
||||
While keeping the caps that are set by other quotas, DAMON_RECLAIM automatically
|
||||
increases and decreases the effective level of the quota aiming to receive this
|
||||
feedback of value ``10,000`` from the user. DAMON_RECLAIM assumes the feedback
|
||||
value and the quota are positively proportional. Value zero means disabling
|
||||
this auto-tuning feature.
|
||||
|
||||
Disabled by default.
|
||||
|
||||
wmarks_interval
|
||||
---------------
|
||||
|
||||
|
@ -83,10 +83,10 @@ comma (",").
|
||||
│ │ │ │ │ │ │ │ sz/min,max
|
||||
│ │ │ │ │ │ │ │ nr_accesses/min,max
|
||||
│ │ │ │ │ │ │ │ age/min,max
|
||||
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
|
||||
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
|
||||
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
|
||||
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
|
||||
│ │ │ │ │ │ │ │ │ 0/target_value,current_value
|
||||
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
|
||||
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
|
||||
│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
|
||||
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
|
||||
@ -153,6 +153,9 @@ Users can write below commands for the kdamond to the ``state`` file.
|
||||
- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
|
||||
action tried regions directory for each DAMON-based operation scheme of the
|
||||
kdamond.
|
||||
- ``update_schemes_effective_bytes``: Update the contents of
|
||||
``effective_bytes`` files for each DAMON-based operation scheme of the
|
||||
kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`.
|
||||
|
||||
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
|
||||
|
||||
@ -180,19 +183,14 @@ In each context directory, two files (``avail_operations`` and ``operations``)
|
||||
and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
|
||||
exist.
|
||||
|
||||
DAMON supports multiple types of monitoring operations, including those for
|
||||
virtual address space and the physical address space. You can get the list of
|
||||
available monitoring operations set on the currently running kernel by reading
|
||||
DAMON supports multiple types of :ref:`monitoring operations
|
||||
<damon_design_configurable_operations_set>`, including those for virtual address
|
||||
space and the physical address space. You can get the list of available
|
||||
monitoring operations set on the currently running kernel by reading
|
||||
``avail_operations`` file. Based on the kernel configuration, the file will
|
||||
list some or all of below keywords.
|
||||
|
||||
- vaddr: Monitor virtual address spaces of specific processes
|
||||
- fvaddr: Monitor fixed virtual address ranges
|
||||
- paddr: Monitor the physical address space of the system
|
||||
|
||||
Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
|
||||
differences between the operations sets in terms of the monitoring target
|
||||
regions.
|
||||
list different available operation sets. Please refer to the :ref:`design
|
||||
<damon_operations_set>` for the list of all available operation sets and their
|
||||
brief explanations.
|
||||
|
||||
You can set and get what type of monitoring operations DAMON will use for the
|
||||
context by writing one of the keywords listed in ``avail_operations`` file and
|
||||
@ -247,17 +245,11 @@ process to the ``pid_target`` file.
|
||||
targets/<N>/regions
|
||||
-------------------
|
||||
|
||||
When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
|
||||
the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
|
||||
monitoring target regions so that entire memory mappings of target processes
|
||||
can be covered. However, users could want to set the initial monitoring region
|
||||
to specific address ranges.
|
||||
|
||||
In contrast, DAMON does not automatically set and update the monitoring target
|
||||
regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
|
||||
(``fvaddr`` or ``paddr`` is written to the ``contexts/<N>/operations`` file).
|
||||
Therefore, users should set the monitoring target regions by themselves in the
|
||||
cases.
|
||||
In case of ``fvaddr`` or ``paddr`` monitoring operations sets, users are
|
||||
required to set the monitoring target address ranges. In case of ``vaddr``
|
||||
operations set, it is not mandatory, but users can optionally set the initial
|
||||
monitoring region to specific address ranges. Please refer to the :ref:`design
|
||||
<damon_design_vaddr_target_regions_construction>` for more details.
|
||||
|
||||
For such cases, users can explicitly set the initial monitoring target regions
|
||||
as they want, by writing proper values to the files under this directory.
|
||||
@ -302,27 +294,8 @@ In each scheme directory, five directories (``access_pattern``, ``quotas``,
|
||||
|
||||
The ``action`` file is for setting and getting the scheme's :ref:`action
|
||||
<damon_design_damos_action>`. The keywords that can be written to and read
|
||||
from the file and their meaning are as below.
|
||||
|
||||
Note that support of each action depends on the running DAMON operations set
|
||||
:ref:`implementation <sysfs_context>`.
|
||||
|
||||
- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
|
||||
Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
|
||||
- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``lru_prio``: Prioritize the region on its LRU lists.
|
||||
Supported by ``paddr`` operations set.
|
||||
- ``lru_deprio``: Deprioritize the region on its LRU lists.
|
||||
Supported by ``paddr`` operations set.
|
||||
- ``stat``: Do nothing but count the statistics.
|
||||
Supported by all operations sets.
|
||||
from the file and their meanings are the same as those of the list on
|
||||
:ref:`design doc <damon_design_damos_action>`.
|
||||
|
||||
The ``apply_interval_us`` file is for setting and getting the scheme's
|
||||
:ref:`apply_interval <damon_design_damos>` in microseconds.
|
||||
@ -350,8 +323,9 @@ schemes/<N>/quotas/
|
||||
The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
|
||||
DAMON-based operation scheme.
|
||||
|
||||
Under ``quotas`` directory, three files (``ms``, ``bytes``,
|
||||
``reset_interval_ms``) and two directories (``weights`` and ``goals``) exist.
|
||||
Under ``quotas`` directory, four files (``ms``, ``bytes``,
|
||||
``reset_interval_ms``, ``effective_bytes``) and two directories (``weights`` and
|
||||
``goals``) exist.
|
||||
|
||||
You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
|
||||
``reset interval`` in milliseconds by writing the values to the three files,
|
||||
@ -359,7 +333,17 @@ respectively. Then, DAMON tries to use only up to ``time quota`` milliseconds
|
||||
for applying the ``action`` to memory regions of the ``access_pattern``, and to
|
||||
apply the action to only up to ``bytes`` bytes of memory regions within the
|
||||
``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
|
||||
quota limits.
|
||||
quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is
|
||||
set.
|
||||
|
||||
The time quota is internally transformed to a size quota. Between the
|
||||
transformed size quota and user-specified size quota, the smaller one is applied.
|
||||
Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the
|
||||
effective size quota is further adjusted. Reading ``effective_bytes`` returns
|
||||
the current effective size quota. The file is not updated in real time, so
|
||||
users should ask DAMON sysfs interface to update the content of the file for
|
||||
the stats by writing a special keyword, ``update_schemes_effective_bytes`` to
|
||||
the relevant ``kdamonds/<N>/state`` file.
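
The update-then-read sequence can be sketched as a small helper. This is only an illustration: ``/sys/kernel/mm/damon/admin`` is assumed as the DAMON sysfs root, and the function name, the ``DAMON_SYSFS_ROOT`` override and the index arguments are hypothetical.

```shell
# Sketch: refresh and print the effective size quota of one scheme.
# DAMON_SYSFS_ROOT defaults to the usual DAMON sysfs location; the function
# name and the kdamond/context/scheme index arguments are illustrative.
damos_effective_bytes() {
    root="${DAMON_SYSFS_ROOT:-/sys/kernel/mm/damon/admin}"
    kdamond="${1:-0}"; context="${2:-0}"; scheme="${3:-0}"
    kd="$root/kdamonds/$kdamond"
    # Ask DAMON to update effective_bytes; the file is not updated in real time.
    echo update_schemes_effective_bytes > "$kd/state" || return 1
    cat "$kd/contexts/$context/schemes/$scheme/quotas/effective_bytes"
}
```

Calling ``damos_effective_bytes`` with no arguments targets kdamond 0, context 0, scheme 0.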
|
||||
|
||||
Under ``weights`` directory, three files (``sz_permil``,
|
||||
``nr_accesses_permil``, and ``age_permil``) exist.
|
||||
@ -382,11 +366,11 @@ number (``N``) to the file creates the number of child directories named ``0``
|
||||
to ``N-1``. Each directory represents each goal and current achievement.
|
||||
Among the multiple feedback, the best one is used.
|
||||
|
||||
Each goal directory contains two files, namely ``target_value`` and
|
||||
``current_value``. Users can set and get any number to those files to set the
|
||||
feedback. User space main workload's latency or throughput, system metrics
|
||||
like free memory ratio or memory pressure stall time (PSI) could be example
|
||||
metrics for the values. Note that users should write
|
||||
Each goal directory contains three files, namely ``target_metric``,
|
||||
``target_value`` and ``current_value``. Users can set and get the three
|
||||
parameters for the quota auto-tuning goals that are specified on the :ref:`design
|
||||
doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each
|
||||
of the files. Note that users should further write
|
||||
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
|
||||
directory <sysfs_kdamond>` to pass the feedback to DAMON.
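
Setting one goal and committing it might look like the sketch below. It assumes the usual sysfs root ``/sys/kernel/mm/damon/admin`` and kdamond 0, context 0, scheme 0. The function name and the ``DAMON_SYSFS_ROOT`` override are hypothetical, and the metric keyword is passed through unchanged because the valid keywords are defined by the design doc, not here.

```shell
# Sketch: configure a single quota auto-tuning goal and pass it to DAMON.
# Arguments: target_metric keyword, target_value, current_value.
# The function name and DAMON_SYSFS_ROOT are illustrative assumptions.
damos_set_quota_goal() {
    root="${DAMON_SYSFS_ROOT:-/sys/kernel/mm/damon/admin}"
    metric="$1"; target="$2"; current="$3"
    kd="$root/kdamonds/0"
    goals="$kd/contexts/0/schemes/0/quotas/goals"
    echo 1 > "$goals/nr_goals" || return 1     # creates goal directory "0"
    echo "$metric" > "$goals/0/target_metric" || return 1
    echo "$target" > "$goals/0/target_value" || return 1
    echo "$current" > "$goals/0/current_value" || return 1
    # The values take effect only after this commit command.
    echo commit_schemes_quota_goals > "$kd/state"
}
```

The commit is written last on purpose, so that DAMON never sees a half-written goal.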
|
||||
|
||||
@ -579,11 +563,11 @@ monitoring results recording.
|
||||
While the monitoring is turned on, you could record the tracepoint events and
|
||||
show results using tracepoint supporting tools like ``perf``. For example::
|
||||
|
||||
# echo on > monitor_on
|
||||
# echo on > kdamonds/0/state
|
||||
# perf record -e damon:damon_aggregated &
|
||||
# sleep 5
|
||||
# kill $(pidof perf)
|
||||
# echo off > monitor_on
|
||||
# echo off > kdamonds/0/state
|
||||
# perf script
|
||||
kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
|
||||
[...]
|
||||
@ -628,9 +612,17 @@ debugfs Interface (DEPRECATED!)
|
||||
move, please report your usecase to damon@lists.linux.dev and
|
||||
linux-mm@kvack.org.
|
||||
|
||||
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
|
||||
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
|
||||
``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
|
||||
DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``,
|
||||
``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``,
|
||||
``mk_contexts`` and ``rm_contexts`` under its debugfs directory,
|
||||
``<debugfs>/damon/``.
|
||||
|
||||
|
||||
``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation
|
||||
notice. Reading it returns the deprecation notice, as below::
|
||||
|
||||
# cat DEPRECATED
|
||||
DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org.
|
||||
|
||||
|
||||
Attributes
|
||||
@ -755,19 +747,17 @@ Action
|
||||
~~~~~~
|
||||
|
||||
The ``<action>`` is a predefined integer for memory management :ref:`actions
|
||||
<damon_design_damos_action>`. The supported numbers and their meanings are as
|
||||
below.
|
||||
<damon_design_damos_action>`. The mapping between the ``<action>`` values and
|
||||
the memory management actions is as below. For the detailed meaning of the
|
||||
action and DAMON operations set supporting each action, please refer to the
|
||||
list on :ref:`design doc <damon_design_damos_action>`.
|
||||
|
||||
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
|
||||
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 5: Do nothing but count the statistics
|
||||
- 0: ``willneed``
|
||||
- 1: ``cold``
|
||||
- 2: ``pageout``
|
||||
- 3: ``hugepage``
|
||||
- 4: ``nohugepage``
|
||||
- 5: ``stat``
|
||||
|
||||
Quota
|
||||
~~~~~
|
||||
@ -848,16 +838,16 @@ Turning On/Off
|
||||
|
||||
Setting the files as described above has no effect unless you explicitly
|
||||
start the monitoring. You can start, stop, and check the current status of the
|
||||
monitoring by writing to and reading from the ``monitor_on`` file. Writing
|
||||
``on`` to the file starts the monitoring of the targets with the attributes.
|
||||
Writing ``off`` to the file stops those. DAMON also stops if every target
|
||||
process is terminated. Below example commands turn on, off, and check the
|
||||
status of DAMON::
|
||||
monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file.
|
||||
Writing ``on`` to the file starts the monitoring of the targets with the
|
||||
attributes. Writing ``off`` to the file stops those. DAMON also stops if
|
||||
every target process is terminated. Below example commands turn on, off, and
|
||||
check the status of DAMON::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo on > monitor_on
|
||||
# echo off > monitor_on
|
||||
# cat monitor_on
|
||||
# echo on > monitor_on_DEPRECATED
|
||||
# echo off > monitor_on_DEPRECATED
|
||||
# cat monitor_on_DEPRECATED
|
||||
off
|
||||
|
||||
Please note that you cannot write to the above-mentioned debugfs files while
|
||||
@ -873,11 +863,11 @@ can get the pid of the thread by reading the ``kdamond_pid`` file. When the
|
||||
monitoring is turned off, reading the file returns ``none``. ::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# cat monitor_on
|
||||
# cat monitor_on_DEPRECATED
|
||||
off
|
||||
# cat kdamond_pid
|
||||
none
|
||||
# echo on > monitor_on
|
||||
# echo on > monitor_on_DEPRECATED
|
||||
# cat kdamond_pid
|
||||
18594
|
||||
|
||||
@ -907,5 +897,5 @@ directory by putting the name of the context to the ``rm_contexts`` file. ::
|
||||
# ls foo
|
||||
# ls: cannot access 'foo': No such file or directory
|
||||
|
||||
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
|
||||
root directory only.
|
||||
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files
|
||||
are in the root directory only.
|
||||
|
@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
|
||||
can fall back to all existing numa nodes. This is effectively
|
||||
MPOL_PREFERRED allowed for a mask rather than a single node.
|
||||
|
||||
MPOL_WEIGHTED_INTERLEAVE
|
||||
This mode operates the same as MPOL_INTERLEAVE, except that
|
||||
interleaving behavior is executed based on weights set in
|
||||
/sys/kernel/mm/mempolicy/weighted_interleave/
|
||||
|
||||
Weighted interleave allocates pages on nodes according to a
|
||||
weight. For example if nodes [0,1] are weighted [5,2], 5 pages
|
||||
will be allocated on node0 for every 2 pages allocated on node1.
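
The arithmetic behind the example can be sketched in shell. This is a simplified model rather than the kernel's allocator: it assumes each round places ``weight`` consecutive pages on a node before moving to the next one, and the function name is made up for illustration.

```shell
# Sketch: count how many of `total` pages land on each node under weighted
# interleave. Arguments: total page count, then one weight per node.
# Assumes each round fills a node with `weight` pages before moving on.
weighted_interleave_counts() {
    total="$1"; shift
    sum=0
    for w in "$@"; do sum=$((sum + w)); done
    [ "$sum" -gt 0 ] || return 1
    full=$((total / sum))      # complete rounds over all nodes
    rest=$((total % sum))      # pages left for the final, partial round
    out=""
    for w in "$@"; do
        extra=$(( rest < w ? rest : w ))
        rest=$((rest - extra))
        out="$out${out:+ }$((full * w + extra))"
    done
    echo "$out"
}
```

For instance, ``weighted_interleave_counts 14 5 2`` prints ``10 4``: two full rounds of 5 pages on node0 and 2 pages on node1.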
|
||||
|
||||
NUMA memory policy supports the following optional mode flags:
|
||||
|
||||
MPOL_F_STATIC_NODES
|
||||
|
@ -37,9 +37,21 @@ Example usage of perf::
|
||||
hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
|
||||
------------------------------------------
|
||||
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
|
||||
$# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0xffff/
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/
|
||||
|
||||
The related events are usually used to calculate the bandwidth, latency or others.
|
||||
They need to start and end counting at the same time, therefore related events
|
||||
are best used in the same event group to get the expected value. There are two
|
||||
ways to know if they are related events:
|
||||
|
||||
a) By event name, such as the latency events "xxx_latency, xxx_cnt" or
|
||||
bandwidth events "xxx_flux, xxx_time".
|
||||
b) By event type, such as "event=0xXXXX, event=0x1XXXX".
|
||||
|
||||
Example usage of perf group::
|
||||
|
||||
$# perf stat -e "{hisi_pcie0_core0/rx_mwr_latency,port=0xffff/,hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/}"
|
||||
|
||||
The current driver does not support sampling. So "perf record" is unsupported.
|
||||
Attaching to a task is also unsupported for PCIe PMU.
|
||||
@ -51,8 +63,12 @@ Filter options
|
||||
|
||||
PMU could only monitor the performance of traffic downstream target Root
|
||||
Ports or downstream target Endpoint. PCIe PMU driver support "port" and
|
||||
"bdf" interfaces for users, and these two interfaces aren't supported at the
|
||||
same time.
|
||||
"bdf" interfaces for users.
|
||||
Please note that one of these two interfaces must be set, and these two
|
||||
interfaces aren't supported at the same time. If they are both set, only
|
||||
"port" filter is valid.
|
||||
If the "port" filter is not set or is explicitly set to zero (default), the
|
||||
"bdf" filter will be in effect, because "bdf=0" means 0000:00:00.0.
|
||||
|
||||
- port
|
||||
|
||||
@ -95,7 +111,7 @@ Filter options
|
||||
|
||||
Example usage of perf::
|
||||
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,trig_len=0x4,trig_mode=1/ sleep 5
|
||||
|
||||
3. Threshold filter
|
||||
|
||||
@ -109,7 +125,7 @@ Filter options
|
||||
|
||||
Example usage of perf::
|
||||
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,thr_len=0x4,thr_mode=1/ sleep 5
|
||||
|
||||
4. TLP Length filter
|
||||
|
||||
@ -127,4 +143,4 @@ Filter options
|
||||
|
||||
Example usage of perf::
|
||||
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
|
||||
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,len_mode=0x1/ sleep 5
|
||||
|
@ -13,6 +13,7 @@ Performance monitor support
|
||||
imx-ddr
|
||||
qcom_l2_pmu
|
||||
qcom_l3_pmu
|
||||
starfive_starlink_pmu
|
||||
arm-ccn
|
||||
arm-cmn
|
||||
xgene-pmu
|
||||
|
46
Documentation/admin-guide/perf/starfive_starlink_pmu.rst
Normal file
@ -0,0 +1,46 @@
|
||||
================================================
|
||||
StarFive StarLink Performance Monitor Unit (PMU)
|
||||
================================================
|
||||
|
||||
StarFive StarLink Performance Monitor Unit (PMU) exists within the
|
||||
StarLink Coherent Network on Chip (CNoC) that connects multiple CPU
|
||||
clusters with an L3 memory system.
|
||||
|
||||
The uncore PMU supports overflow interrupt, up to 16 programmable 64bit
|
||||
event counters, and an independent 64bit cycle counter.
|
||||
The PMU can only be accessed via Memory Mapped I/O and is common to the
|
||||
cores connected to the same PMU.
|
||||
|
||||
Driver exposes supported PMU events in sysfs "events" directory under::
|
||||
|
||||
/sys/bus/event_source/devices/starfive_starlink_pmu/events/
|
||||
|
||||
Driver exposes cpu used to handle PMU events in sysfs "cpumask" directory
|
||||
under::
|
||||
|
||||
/sys/bus/event_source/devices/starfive_starlink_pmu/cpumask/
|
||||
|
||||
Driver describes the format of config (event ID) in sysfs "format" directory
|
||||
under::
|
||||
|
||||
/sys/bus/event_source/devices/starfive_starlink_pmu/format/
|
||||
|
||||
Example of perf usage::
|
||||
|
||||
$ perf list
|
||||
|
||||
starfive_starlink_pmu/cycles/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/read_hit/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/read_miss/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/read_request/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/release_request/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/write_hit/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/write_miss/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/write_request/ [Kernel PMU event]
|
||||
starfive_starlink_pmu/writeback/ [Kernel PMU event]
|
||||
|
||||
|
||||
$ perf stat -a -e starfive_starlink_pmu/cycles/ sleep 1
|
||||
|
||||
Sampling is not supported. As a result, "perf record" is not supported.
|
||||
Attaching to a task is not supported, only system-wide counting is supported.
|
@ -300,8 +300,8 @@ platforms. The AMD P-States mechanism is the more performance and energy
|
||||
efficiency frequency management method on AMD processors.
|
||||
|
||||
|
||||
AMD Pstate Driver Operation Modes
|
||||
=================================
|
||||
``amd-pstate`` Driver Operation Modes
|
||||
======================================
|
||||
|
||||
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
|
||||
non-autonomous (passive) mode and guided autonomous (guided) mode.
|
||||
@ -353,6 +353,48 @@ is activated. In this mode, driver requests minimum and maximum performance
|
||||
level and the platform autonomously selects a performance level in this range
|
||||
and appropriate to the current workload.
|
||||
|
||||
``amd-pstate`` Preferred Core
|
||||
=================================
|
||||
|
||||
The core frequency is subject to the process variation in semiconductors.
|
||||
Not all cores are able to reach the maximum frequency respecting the
|
||||
infrastructure limits. Consequently, AMD has redefined the concept of
|
||||
maximum frequency of a part. This means that a fraction of cores can reach
|
||||
maximum frequency. To find the best process scheduling policy for a given
|
||||
scenario, OS needs to know the core ordering informed by the platform through
|
||||
highest performance capability register of the CPPC interface.
|
||||
|
||||
``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
|
||||
cores that can achieve a higher frequency with lower voltage. The preferred
|
||||
core rankings can dynamically change based on the workload, platform conditions,
|
||||
thermals and ageing.
|
||||
|
||||
The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
|
||||
driver will also determine whether or not ``amd-pstate`` preferred core is
|
||||
supported by the platform.
|
||||
|
||||
``amd-pstate`` driver will provide an initial core ordering when the system boots.
|
||||
The platform uses the CPPC interfaces to communicate the core ranking to the
|
||||
operating system and scheduler to make sure that OS is choosing the cores
|
||||
with the highest performance first when scheduling processes. When ``amd-pstate``
|
||||
driver receives a message with the highest performance change, it will
|
||||
update the core ranking and set the cpu's priority.
|
||||
|
||||
``amd-pstate`` Preferred Core Switch
|
||||
=====================================
|
||||
Kernel Parameters
|
||||
-----------------
|
||||
|
||||
``amd-pstate`` preferred core has two states: enable and disable.
|
||||
Enable/disable states can be chosen by different kernel parameters.
|
||||
``amd-pstate`` preferred core is enabled by default.
|
||||
|
||||
``amd_prefcore=disable``
|
||||
|
||||
For systems that support ``amd-pstate`` preferred core, the core rankings will
|
||||
always be advertised by the platform. But OS can choose to ignore that via the
|
||||
kernel parameter ``amd_prefcore=disable``.
|
||||
|
||||
User Space Interface in ``sysfs`` - General
|
||||
===========================================
|
||||
|
||||
@ -385,6 +427,19 @@ control its functionality at the system level. They are located in the
|
||||
to the operation mode represented by that string - or to be
|
||||
unregistered in the "disable" case.
|
||||
|
||||
``prefcore``
|
||||
Preferred core state of the driver: "enabled" or "disabled".
|
||||
|
||||
"enabled"
|
||||
Enable the ``amd-pstate`` preferred core.
|
||||
|
||||
"disabled"
|
||||
Disable the ``amd-pstate`` preferred core
|
||||
|
||||
|
||||
This attribute is read-only to check the state of preferred core set
|
||||
by the kernel parameter.
|
||||
|
||||
``cpupower`` tool support for ``amd-pstate``
|
||||
===============================================
|
||||
|
||||
|
@ -31,7 +31,7 @@ The important bits (aka "TL;DR")
|
||||
Linux kernel regression tracking bot "regzbot" track the issue by specifying
|
||||
when the regression started like this::
|
||||
|
||||
#regzbot introduced v5.13..v5.14-rc1
|
||||
#regzbot introduced: v5.13..v5.14-rc1
|
||||
|
||||
|
||||
All the details on Linux kernel regressions relevant for users
|
||||
|
@ -296,12 +296,30 @@ kernel panic). This will output the contents of the ftrace buffers to
|
||||
the console. This is very useful for capturing traces that lead to
|
||||
crashes and outputting them to a serial console.
|
||||
|
||||
= ===================================================
|
||||
0 Disabled (default).
|
||||
1 Dump buffers of all CPUs.
|
||||
2 Dump the buffer of the CPU that triggered the oops.
|
||||
= ===================================================
|
||||
======================= ===========================================
|
||||
0 Disabled (default).
|
||||
1 Dump buffers of all CPUs.
|
||||
2(orig_cpu) Dump the buffer of the CPU that triggered the
|
||||
oops.
|
||||
<instance> Dump the specific instance buffer on all CPUs.
|
||||
<instance>=2(orig_cpu) Dump the specific instance buffer on the CPU
|
||||
that triggered the oops.
|
||||
======================= ===========================================
|
||||
|
||||
Multiple instance dump is also supported, and instances are separated
|
||||
by commas. If global buffer also needs to be dumped, please specify
|
||||
the dump mode (1/2/orig_cpu) first for global buffer.
|
||||
|
||||
So for example to dump "foo" and "bar" instance buffer on all CPUs,
|
||||
user can::
|
||||
|
||||
echo "foo,bar" > /proc/sys/kernel/ftrace_dump_on_oops
|
||||
|
||||
To dump global buffer and "foo" instance buffer on all
|
||||
CPUs along with the "bar" instance buffer on CPU that triggered the
|
||||
oops, user can::
|
||||
|
||||
echo "1,foo,bar=2" > /proc/sys/kernel/ftrace_dump_on_oops
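
To make the value syntax concrete, the sketch below decodes such a string into one line per buffer. It only covers the forms listed in the table above and is not how the kernel parses the parameter. The function name is hypothetical.

```shell
# Sketch: describe what an ftrace_dump_on_oops value requests, one line per
# buffer. Only the forms from the table above are handled; the function name
# is an illustrative assumption.
describe_ftrace_dump() {
    old_ifs="$IFS"; IFS=','
    set -- $1                  # split the value on commas
    IFS="$old_ifs"
    for item in "$@"; do
        case "$item" in
            0)              echo "global: disabled" ;;
            1)              echo "global: all CPUs" ;;
            2|orig_cpu)     echo "global: oops CPU" ;;
            *=2|*=orig_cpu) echo "${item%%=*}: oops CPU" ;;
            *=*)            echo "$item: not covered by this sketch" ;;
            *)              echo "$item: all CPUs" ;;
        esac
    done
}
```

Running ``describe_ftrace_dump "1,foo,bar=2"`` reports the global buffer and ``foo`` on all CPUs, and ``bar`` on the CPU that triggered the oops.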
|
||||
|
||||
ftrace_enabled, stack_tracer_enabled
|
||||
====================================
|
||||
@ -594,6 +612,9 @@ default (``MSGMNB``).
|
||||
``msgmni`` is the maximum number of IPC queues. 32000 by default
|
||||
(``MSGMNI``).
|
||||
|
||||
All of these parameters are set per ipc namespace. The maximum number of bytes
|
||||
in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
|
||||
respected hierarchically in each user namespace.
|
||||
|
||||
msg_next_id, sem_next_id, and shm_next_id (System V IPC)
|
||||
========================================================
|
||||
@ -850,6 +871,7 @@ bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
|
||||
bit 4 print ftrace buffer
|
||||
bit 5 print all printk messages in buffer
|
||||
bit 6 print all CPUs backtrace (if available in the arch)
|
||||
bit 7 print only tasks in uninterruptible (blocked) state
|
||||
===== ============================================
|
||||
|
||||
So for example to print tasks and memory info on panic, user can::
|
||||
@ -1274,15 +1296,20 @@ are doing anyway :)
|
||||
shmall
|
||||
======
|
||||
|
||||
This parameter sets the total amount of shared memory pages that
|
||||
can be used system wide. Hence, ``shmall`` should always be at least
|
||||
``ceil(shmmax/PAGE_SIZE)``.
|
||||
This parameter sets the total amount of shared memory pages that can be used
|
||||
inside an ipc namespace. The shared memory pages counting occurs for each ipc
|
||||
namespace separately and is not inherited. Hence, ``shmall`` should always be at
|
||||
least ``ceil(shmmax/PAGE_SIZE)``.
|
||||
|
||||
If you are not sure what the default ``PAGE_SIZE`` is on your Linux
|
||||
system, you can run the following command::
|
||||
|
||||
# getconf PAGE_SIZE
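
The lower bound ``ceil(shmmax/PAGE_SIZE)`` can be computed with integer arithmetic, as in this minimal sketch. The function name is made up for illustration. Shell arithmetic is typically signed 64-bit, so very large ``shmmax`` values may overflow.

```shell
# Sketch: smallest shmall (in pages) that can hold one segment of shmmax bytes.
# Arguments: shmmax in bytes, optional page size (defaults to getconf PAGE_SIZE).
min_shmall_pages() {
    shmmax="$1"
    page_size="${2:-$(getconf PAGE_SIZE)}"
    # Integer ceiling division: add (page_size - 1) before dividing.
    echo $(( (shmmax + page_size - 1) / page_size ))
}
```

With a 4096-byte page, ``min_shmall_pages 8193 4096`` prints ``3``, because the last byte needs a third page.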
|
||||
|
||||
To reduce or disable the ability to allocate shared memory, you must create a
|
||||
new ipc namespace, set this parameter to the required value and prohibit the
|
||||
creation of a new ipc namespace in the current user namespace. Alternatively,
|
||||
cgroups can be used.
|
||||
|
||||
shmmax
|
||||
======
|
||||
|
@ -206,6 +206,11 @@ Will increase power usage.
|
||||
|
||||
Default: 0 (off)
|
||||
|
||||
mem_pcpu_rsv
|
||||
------------
|
||||
|
||||
Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
|
||||
|
||||
rmem_default
|
||||
------------
|
||||
|
||||
|
@ -34,7 +34,7 @@ name of the command ('Comm:') that triggered the event::
|
||||
|
||||
You'll find a 'Not tainted: ' there if the kernel was not tainted at the
|
||||
time of the event; if it was, then it will print 'Tainted: ' and characters
|
||||
either letters or blanks. In above example it looks like this::
|
||||
either letters or blanks. In the example above it looks like this::
|
||||
|
||||
Tainted: P W O
|
||||
|
||||
@ -52,7 +52,7 @@ At runtime, you can query the tainted state by reading
|
||||
tainted; any other number indicates the reasons why it is. The easiest way to
|
||||
decode that number is the script ``tools/debugging/kernel-chktaint``, which your
|
||||
distribution might ship as part of a package called ``linux-tools`` or
|
||||
``kernel-tools``; if it doesn't you can download the script from
|
||||
``kernel-tools``; if it doesn't, you can download the script from
|
||||
`git.kernel.org <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/debugging/kernel-chktaint>`_
|
||||
and execute it with ``sh kernel-chktaint``, which would print something like
|
||||
this on the machine that had the statements in the logs that were quoted earlier::
|
||||
|
1985
Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
Normal file
File diff suppressed because it is too large
@ -317,6 +317,55 @@ HWCAP2_LRCPC3
|
||||
HWCAP2_LSE128
|
||||
Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011.
|
||||
|
||||
HWCAP2_FPMR
|
||||
Functionality implied by ID_AA64PFR2_EL1.FPMR == 0b0001.
|
||||
|
||||
HWCAP2_LUT
|
||||
Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001.
|
||||
|
||||
HWCAP2_FAMINMAX
|
||||
Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001.
|
||||
|
||||
HWCAP2_F8CVT
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1.
|
||||
|
||||
HWCAP2_F8FMA
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1.
|
||||
|
||||
HWCAP2_F8DP4
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1.
|
||||
|
||||
HWCAP2_F8DP2
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1.
|
||||
|
||||
HWCAP2_F8E4M3
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1.
|
||||
|
||||
HWCAP2_F8E5M2
|
||||
Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1.
|
||||
|
||||
HWCAP2_SME_LUTV2
|
||||
Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1.
|
||||
|
||||
HWCAP2_SME_F8F16
|
||||
Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1.
|
||||
|
||||
HWCAP2_SME_F8F32
|
||||
Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1.
|
||||
|
||||
HWCAP2_SME_SF8FMA
|
||||
Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1.
|
||||
|
||||
HWCAP2_SME_SF8DP4
|
||||
Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1.
|
||||
|
||||
HWCAP2_SME_SF8DP2
|
||||
Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1.
|
||||
|
||||
HWCAP2_SME_SF8DP4
|
||||
Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1.
|
||||
|
||||
|
||||
4. Unused AT_HWCAP bits
|
||||
-----------------------
|
||||
|
||||
|
@ -35,8 +35,9 @@ can be triggered by Linux).
|
||||
For software workarounds that may adversely impact systems unaffected by
|
||||
the erratum in question, a Kconfig entry is added under "Kernel
|
||||
Features" -> "ARM errata workarounds via the alternatives framework".
|
||||
These are enabled by default and patched in at runtime when an affected
|
||||
CPU is detected. For less-intrusive workarounds, a Kconfig option is not
|
||||
With the exception of workarounds for errata deemed "rare" by Arm, these
|
||||
are enabled by default and patched in at runtime when an affected CPU is
|
||||
detected. For less-intrusive workarounds, a Kconfig option is not
|
||||
available and the code is structured (preferably with a comment) in such
|
||||
a way that the erratum will not be hit.
|
||||
|
||||
@ -243,3 +244,10 @@ stable kernels.
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| ASR | ASR8601 | #8601001 | N/A |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Microsoft | Azure Cobalt 100| #2139208 | ARM64_ERRATUM_2139208 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Microsoft | Azure Cobalt 100| #2067961 | ARM64_ERRATUM_2067961 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
| Microsoft | Azure Cobalt 100| #2253138 | ARM64_ERRATUM_2253138 |
|
||||
+----------------+-----------------+-----------------+-----------------------------+
|
||||
|
@ -75,7 +75,7 @@ model features for SME is included in Appendix A.
|
||||
2. Vector lengths
|
||||
------------------
|
||||
|
||||
SME defines a second vector length similar to the SVE vector length which is
|
||||
SME defines a second vector length similar to the SVE vector length which
|
||||
controls the size of the streaming mode SVE vectors and the ZA matrix array.
|
||||
The ZA matrix is square with each side having as many bytes as a streaming
|
||||
mode SVE vector.
|
||||
@ -238,12 +238,12 @@ prctl(PR_SME_SET_VL, unsigned long arg)
|
||||
bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
|
||||
unspecified, including both streaming and non-streaming SVE state.
|
||||
Calling PR_SME_SET_VL with vl equal to the thread's current vector
|
||||
length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
|
||||
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
|
||||
does not constitute a change to the vector length for this purpose.
|
||||
|
||||
* Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
|
||||
Calling PR_SME_SET_VL with vl equal to the thread's current vector
|
||||
length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
|
||||
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
|
||||
does not constitute a change to the vector length for this purpose.
|
||||
|
||||
|
||||
@ -379,9 +379,8 @@ The regset data starts with struct user_za_header, containing:
|
||||
/proc/sys/abi/sme_default_vector_length
|
||||
|
||||
Writing the text representation of an integer to this file sets the system
|
||||
default vector length to the specified value, unless the value is greater
|
||||
than the maximum vector length supported by the system in which case the
|
||||
default vector length is set to that maximum.
|
||||
default vector length to the specified value rounded to a supported value
|
||||
using the same rules as for setting vector length via PR_SME_SET_VL.
|
||||
|
||||
The result can be determined by reopening the file and reading its
|
||||
contents.
|
||||
|
@ -117,11 +117,6 @@ the SVE instruction set architecture.
|
||||
* The SVE registers are not used to pass arguments to or receive results from
|
||||
any syscall.
|
||||
|
||||
* In practice the affected registers/bits will be preserved or will be replaced
|
||||
with zeros on return from a syscall, but userspace should not make
|
||||
assumptions about this. The kernel behaviour may vary on a case-by-case
|
||||
basis.
|
||||
|
||||
* All other SVE state of a thread, including the currently configured vector
|
||||
length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector
|
||||
length (if any), is preserved across all syscalls, subject to the specific
|
||||
@ -428,9 +423,8 @@ The regset data starts with struct user_sve_header, containing:
|
||||
/proc/sys/abi/sve_default_vector_length
|
||||
|
||||
Writing the text representation of an integer to this file sets the system
|
||||
default vector length to the specified value, unless the value is greater
|
||||
than the maximum vector length supported by the system in which case the
|
||||
default vector length is set to that maximum.
|
||||
default vector length to the specified value rounded to a supported value
|
||||
using the same rules as for setting vector length via PR_SVE_SET_VL.
|
||||
|
||||
The result can be determined by reopening the file and reading its
|
||||
contents.
|
||||
|
@ -144,14 +144,8 @@ passing 0 into the hint address parameter of mmap. On CPUs with an address space
|
||||
smaller than sv48, the CPU maximum supported address space will be the default.
|
||||
|
||||
Software can "opt-in" to receiving VAs from another VA space by providing
|
||||
a hint address to mmap. A hint address passed to mmap will cause the largest
|
||||
address space that fits entirely into the hint to be used, unless there is no
|
||||
space left in the address space. If there is no space available in the requested
|
||||
address space, an address in the next smallest available address space will be
|
||||
returned.
|
||||
|
||||
For example, in order to obtain 48-bit VA space, a hint address greater than
|
||||
:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace
|
||||
ending at :code:`1 << 47` and the addresses beyond this are reserved for the
|
||||
kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater
|
||||
than or equal to :code:`1 << 56` must be provided.
|
||||
a hint address to mmap. When a hint address is passed to mmap, the returned
|
||||
address will never use more bits than the hint address. For example, if a hint
|
||||
address of `1 << 40` is passed to mmap, a valid returned address will never use
|
||||
bits 41 through 63. If no mappable addresses are available in that range, mmap
|
||||
will return `MAP_FAILED`.
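The "never use more bits than the hint address" rule can be expressed as a small check. This is a sketch of the rule as worded above, not of the kernel's address selection logic; ``fits_hint`` is a hypothetical helper name.

```python
def fits_hint(hint: int, addr: int) -> bool:
    """Return True if addr uses no more bits than a non-zero hint address."""
    if hint <= 0:
        raise ValueError("a hint of 0 selects the default address space instead")
    # A hint of 1 << 40 has bit 40 as its highest set bit, so a valid
    # returned address must leave bits 41 through 63 clear.
    return 0 <= addr and addr.bit_length() <= hint.bit_length()
```

With a hint of ``1 << 40``, any address below ``1 << 41`` passes the check and any address with bit 41 or higher set does not.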
|
||||
|
@ -87,14 +87,14 @@ The state of SME in the Linux kernel can be documented as follows:
|
||||
kernel is non-zero).
|
||||
|
||||
SME can also be enabled and activated in the BIOS. If SME is enabled and
|
||||
activated in the BIOS, then all memory accesses will be encrypted and it will
|
||||
not be necessary to activate the Linux memory encryption support. If the BIOS
|
||||
merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
|
||||
memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
|
||||
by supplying mem_encrypt=on on the kernel command line. However, if BIOS does
|
||||
not enable SME, then Linux will not be able to activate memory encryption, even
|
||||
if configured to do so by default or the mem_encrypt=on command line parameter
|
||||
is specified.
|
||||
activated in the BIOS, then all memory accesses will be encrypted and it
|
||||
will not be necessary to activate the Linux memory encryption support.
|
||||
|
||||
If the BIOS merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG),
|
||||
then memory encryption can be enabled by supplying mem_encrypt=on on the
|
||||
kernel command line. However, if BIOS does not enable SME, then Linux
|
||||
will not be able to activate memory encryption, even if configured to do
|
||||
so by default or the mem_encrypt=on command line parameter is specified.
|
||||
|
||||
Secure Nested Paging (SNP)
|
||||
==========================
|
||||
|
@ -13,7 +13,8 @@ set of mailbox registers.
|
||||
|
||||
More details on the interface can be found in chapter
|
||||
"7 Host System Management Port (HSMP)" of the family/model PPR
|
||||
Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
|
||||
Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
|
||||
|
||||
|
||||
HSMP interface is supported on EPYC server CPU models only.
|
||||
|
||||
@ -97,8 +98,8 @@ what happened. The transaction returns 0 on success.
|
||||
|
||||
More details on the interface and message definitions can be found in chapter
|
||||
"7 Host System Management Port (HSMP)" of the respective family/model PPR
|
||||
eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
|
||||
eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
|
||||
|
||||
User space C-APIs are made available by linking against the esmi library,
|
||||
which is provided by the E-SMS project https://developer.amd.com/e-sms/.
|
||||
which is provided by the E-SMS project https://www.amd.com/en/developer/e-sms.html.
|
||||
See: https://github.com/amd/esmi_ib_library
|
||||
|
@ -878,7 +878,8 @@ Protocol: 2.10+
|
||||
address if possible.
|
||||
|
||||
A non-relocatable kernel will unconditionally move itself and run
|
||||
at this address.
|
||||
at this address. A relocatable kernel will move itself to this address if it
|
||||
loaded below this address.
|
||||
|
||||
============ =======
|
||||
Field name: init_size
|
||||
|
@ -95,6 +95,9 @@ The kernel provides a function to invoke the buffer clearing:
|
||||
|
||||
mds_clear_cpu_buffers()
|
||||
|
||||
The CLEAR_CPU_BUFFERS macro can also be used in ASM late in the exit-to-user path.
|
||||
Other than EFLAGS.ZF, this macro doesn't clobber any registers.
|
||||
|
||||
The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
|
||||
(idle) transitions.
|
||||
|
||||
@ -138,17 +141,30 @@ Mitigation points
|
||||
|
||||
When transitioning from kernel to user space the CPU buffers are flushed
|
||||
on affected CPUs when the mitigation is not disabled on the kernel
|
||||
command line. The migitation is enabled through the static key
|
||||
mds_user_clear.
|
||||
command line. The mitigation is enabled through the feature flag
|
||||
X86_FEATURE_CLEAR_CPU_BUF.
|
||||
|
||||
The mitigation is invoked in prepare_exit_to_usermode() which covers
|
||||
all but one of the kernel to user space transitions. The exception
|
||||
is when we return from a Non Maskable Interrupt (NMI), which is
|
||||
handled directly in do_nmi().
|
||||
The mitigation is invoked just before transitioning to userspace after
|
||||
user registers are restored. This is done to minimize the window in
|
||||
which kernel data could be accessed after VERW e.g. via an NMI after
|
||||
VERW.
|
||||
|
||||
(The reason that NMI is special is that prepare_exit_to_usermode() can
|
||||
enable IRQs. In NMI context, NMIs are blocked, and we don't want to
|
||||
enable IRQs with NMIs blocked.)
|
||||
**Corner case not handled**
|
||||
Interrupts returning to kernel don't clear CPU buffers since the
|
||||
exit-to-user path is expected to do that anyways. But, there could be
|
||||
a case when an NMI is generated in kernel after the exit-to-user path
|
||||
has cleared the buffers. This case is not handled and NMIs returning to
|
||||
kernel don't clear CPU buffers because:
|
||||
|
||||
1. It is rare to get an NMI after VERW, but before returning to userspace.
|
||||
2. For an unprivileged user, there is no known way to make that NMI
|
||||
less rare or target it.
|
||||
3. It would take a large number of these precisely-timed NMIs to mount
|
||||
an actual attack. There's presumably not enough bandwidth.
|
||||
4. The NMI in question occurs after a VERW, i.e. when user state is
|
||||
restored and most interesting data is already scrubbed. What's left
|
||||
is only the data that NMI touches, and that may or may not be of
|
||||
any interest.
|
||||
|
||||
|
||||
2. C-State transition
|
||||
|
@ -26,9 +26,9 @@ comments in pti.c).
|
||||
|
||||
This approach helps to ensure that side-channel attacks leveraging
|
||||
the paging structures do not function when PTI is enabled. It can be
|
||||
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
|
||||
Once enabled at compile-time, it can be disabled at boot with the
|
||||
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
|
||||
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
|
||||
time. Once enabled at compile-time, it can be disabled at boot with
|
||||
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
|
||||
|
||||
Page Table Management
|
||||
=====================
|
||||
|
@ -45,7 +45,7 @@ mount options are:
|
||||
Enable code/data prioritization in L2 cache allocations.
|
||||
"mba_MBps":
|
||||
Enable the MBA Software Controller(mba_sc) to specify MBA
|
||||
bandwidth in MBps
|
||||
bandwidth in MiBps
|
||||
"debug":
|
||||
Make debug files accessible. Available debug files are annotated with
|
||||
"Available only with debug option".
|
||||
@ -526,7 +526,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
|
||||
increase or vary although user specified bandwidth percentage is same.
|
||||
|
||||
In order to mitigate this and make the interface more user friendly,
|
||||
resctrl added support for specifying the bandwidth in MBps as well. The
|
||||
resctrl added support for specifying the bandwidth in MiBps as well. The
|
||||
kernel underneath would use a software feedback mechanism or a "Software
|
||||
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
|
||||
and adjust the memory bandwidth percentages to ensure::
|
||||
@ -573,13 +573,13 @@ Memory b/w domain is L3 cache.
|
||||
|
||||
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
|
||||
|
||||
Memory bandwidth Allocation specified in MBps
|
||||
---------------------------------------------
|
||||
Memory bandwidth Allocation specified in MiBps
|
||||
----------------------------------------------
|
||||
|
||||
Memory bandwidth domain is L3 cache.
|
||||
::
|
||||
|
||||
MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
|
||||
MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
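A schemata line in the format above can be assembled from a mapping of cache IDs to bandwidth values. The helper below is a hypothetical sketch that only builds the string; writing it to a resctrl group's schemata file and the choice of valid values are outside its scope.

```python
from typing import Dict


def format_mb_schemata(bw_mibps_by_cache_id: Dict[int, int]) -> str:
    """Build an 'MB:' schemata line from {cache_id: bandwidth in MiBps}."""
    if not bw_mibps_by_cache_id:
        raise ValueError("at least one cache domain is required")
    # Emit domains in ascending cache_id order, separated by ';'.
    parts = [f"{cache_id}={bw}" for cache_id, bw in sorted(bw_mibps_by_cache_id.items())]
    return "MB:" + ";".join(parts)
```

For example, ``format_mb_schemata({0: 1024, 1: 500})`` yields ``MB:0=1024;1=500``.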
|
||||
|
||||
Slow Memory Bandwidth Allocation (SMBA)
|
||||
---------------------------------------
|
||||
|
@ -47,17 +47,21 @@ AMD nomenclature for package is 'Node'.
|
||||
|
||||
Package-related topology information in the kernel:
|
||||
|
||||
- cpuinfo_x86.x86_max_cores:
|
||||
- topology_num_threads_per_package()
|
||||
|
||||
The number of cores in a package. This information is retrieved via CPUID.
|
||||
The number of threads in a package.
|
||||
|
||||
- cpuinfo_x86.x86_max_dies:
|
||||
- topology_num_cores_per_package()
|
||||
|
||||
The number of dies in a package. This information is retrieved via CPUID.
|
||||
The number of cores in a package.
|
||||
|
||||
- topology_max_dies_per_package()
|
||||
|
||||
The maximum number of dies in a package.
|
||||
|
||||
- cpuinfo_x86.topo.die_id:
|
||||
|
||||
The physical ID of the die. This information is retrieved via CPUID.
|
||||
The physical ID of the die.
|
||||
|
||||
- cpuinfo_x86.topo.pkg_id:
|
||||
|
||||
@ -96,16 +100,6 @@ are SMT- or CMT-type threads.
|
||||
AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
|
||||
"core".
|
||||
|
||||
Core-related topology information in the kernel:
|
||||
|
||||
- smp_num_siblings:
|
||||
|
||||
The number of threads in a core. The number of threads in a package can be
|
||||
calculated by::
|
||||
|
||||
threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
|
||||
|
||||
|
||||
Threads
|
||||
=======
|
||||
A thread is a single scheduling unit. It's the equivalent to a logical Linux
|
||||
|
96
Documentation/arch/x86/x86_64/fred.rst
Normal file
@ -0,0 +1,96 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================================
|
||||
Flexible Return and Event Delivery (FRED)
|
||||
=========================================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The FRED architecture defines simple new transitions that change
|
||||
privilege level (ring transitions). The FRED architecture was
|
||||
designed with the following goals:
|
||||
|
||||
1) Improve overall performance and response time by replacing event
|
||||
delivery through the interrupt descriptor table (IDT event
|
||||
delivery) and event return by the IRET instruction with lower
|
||||
latency transitions.
|
||||
|
||||
2) Improve software robustness by ensuring that event delivery
|
||||
establishes the full supervisor context and that event return
|
||||
establishes the full user context.
|
||||
|
||||
The new transitions defined by the FRED architecture are FRED event
|
||||
delivery and, for returning from events, two FRED return instructions.
|
||||
FRED event delivery can effect a transition from ring 3 to ring 0, but
|
||||
it is used also to deliver events incident to ring 0. One FRED
|
||||
instruction (ERETU) effects a return from ring 0 to ring 3, while the
|
||||
other (ERETS) returns while remaining in ring 0. Collectively, FRED
|
||||
event delivery and the FRED return instructions are FRED transitions.
|
||||
|
||||
In addition to these transitions, the FRED architecture defines a new
|
||||
instruction (LKGS) for managing the state of the GS segment register.
|
||||
The LKGS instruction can be used by 64-bit operating systems that do
|
||||
not use the new FRED transitions.
|
||||
|
||||
Furthermore, the FRED architecture is easy to extend for future CPU
|
||||
architectures.
|
||||
|
||||
Software based event dispatching
|
||||
================================
|
||||
|
||||
FRED operates differently from IDT in terms of event handling. Instead
|
||||
of directly dispatching an event to its handler based on the event
|
||||
vector, FRED requires the software to dispatch an event to its handler
|
||||
based on both the event's type and vector. Therefore, an event dispatch
|
||||
framework must be implemented to facilitate the event-to-handler
|
||||
dispatch process. The FRED event dispatch framework takes control
|
||||
once an event is delivered, and employs a two-level dispatch.
|
||||
|
||||
The first level dispatching is event type based, and the second level
|
||||
dispatching is event vector based.
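The two-level scheme can be pictured as a table keyed first by event type and then by vector. The sketch below is a conceptual model only: the class, the string event-type names, and the handler signature are all hypothetical and do not reflect the kernel's actual FRED entry code.

```python
from typing import Callable, Dict

Handler = Callable[[int], str]


class EventDispatcher:
    """Toy two-level dispatcher: first by event type, then by vector."""

    def __init__(self) -> None:
        self._by_type: Dict[str, Dict[int, Handler]] = {}

    def register(self, event_type: str, vector: int, handler: Handler) -> None:
        self._by_type.setdefault(event_type, {})[vector] = handler

    def dispatch(self, event_type: str, vector: int) -> str:
        # First level: select the per-type table.
        vectors = self._by_type.get(event_type)
        if vectors is None:
            raise LookupError(f"no handlers for event type {event_type!r}")
        # Second level: select the handler for this vector.
        handler = vectors.get(vector)
        if handler is None:
            raise LookupError(f"no handler for vector {vector} of type {event_type!r}")
        return handler(vector)
```

Keeping the type lookup separate from the vector lookup mirrors the point made above: the same vector number can mean different things for different event types.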
|
||||
|
||||
Full supervisor/user context
|
||||
============================
|
||||
|
||||
FRED event delivery atomically saves and restores full supervisor/user
|
||||
context upon event delivery and return. Thus it avoids the problem of
|
||||
transient states due to %cr2 and/or %dr6, and it is no longer needed
|
||||
to handle all the ugly corner cases caused by half baked entry states.
|
||||
|
||||
FRED allows explicit unblock of NMI with new event return instructions
|
||||
ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
|
||||
unblocks NMI, e.g., when an exception happens during NMI handling.
|
||||
|
||||
FRED always restores the full value of %rsp, thus ESPFIX is no longer
|
||||
needed when FRED is enabled.
|
||||
|
||||
LKGS
|
||||
====
|
||||
|
||||
LKGS behaves like the MOV to GS instruction except that it loads the
|
||||
base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
|
||||
segment’s descriptor cache. LKGS thus avoids
|
||||
mucking with kernel GS, i.e., an operating system can always operate
|
||||
with its own GS base address.
|
||||
|
||||
Because FRED event delivery from ring 3 and ERETU both swap the value
|
||||
of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
|
||||
the introduction of LKGS instruction, the SWAPGS instruction is no
|
||||
longer needed when FRED is enabled and is thus disallowed (#UD).
|
||||
|
||||
Stack levels
|
||||
============
|
||||
|
||||
Four stack levels, 0 to 3, are introduced to replace the nonreentrant IST for
|
||||
event handling, and each stack level should be configured to use a
|
||||
dedicated stack.
|
||||
|
||||
The current stack level could be unchanged or go higher upon FRED
|
||||
event delivery. If unchanged, the CPU keeps using the current event
|
||||
stack. If higher, the CPU switches to a new event stack specified by
|
||||
the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].
|
||||
|
||||
Only execution of a FRED return instruction, ERET[US], can lower the
|
||||
current stack level, causing the CPU to switch back to the stack it was
|
||||
on before a previous event delivery that promoted the stack level.
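The stack level behaviour described above can be modelled in a few lines. This is a simplified sketch, assuming that event delivery selects the higher of the current level and the level configured for the event (which matches "unchanged or go higher"); the class and method names are hypothetical.

```python
class StackLevelModel:
    """Toy model of FRED stack levels 0..3 across delivery and return."""

    def __init__(self) -> None:
        self.current = 0
        self._saved = []  # levels to restore on each event return

    def deliver(self, event_level: int) -> int:
        if not 0 <= event_level <= 3:
            raise ValueError("stack level must be in the range 0..3")
        self._saved.append(self.current)
        # Assumption: delivery never lowers the level, it keeps or raises it.
        self.current = max(self.current, event_level)
        return self.current

    def event_return(self) -> int:
        # Only a return instruction lowers the level, back to the saved one.
        self.current = self._saved.pop()
        return self.current
```

Delivering a level 2 event followed by a level 1 event leaves the model at level 2 until both returns have executed, at which point it is back at level 0.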
|
@ -15,3 +15,4 @@ x86_64 Support
|
||||
cpu-hotplug-spec
|
||||
machinecheck
|
||||
fsgs
|
||||
fred
|
||||
|
@ -177,10 +177,10 @@ In addition to kfuncs' arguments, verifier may need more information about the
|
||||
type of kfunc(s) being registered with the BPF subsystem. To do so, we define
|
||||
flags on a set of kfuncs as follows::
|
||||
|
||||
BTF_SET8_START(bpf_task_set)
|
||||
BTF_KFUNCS_START(bpf_task_set)
|
||||
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
|
||||
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
|
||||
BTF_SET8_END(bpf_task_set)
|
||||
BTF_KFUNCS_END(bpf_task_set)
|
||||
|
||||
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
|
||||
along with it. Of course, it is also allowed to specify no flags.
|
||||
@ -347,10 +347,10 @@ Once the kfunc is prepared for use, the final step to making it visible is
|
||||
registering it with the BPF subsystem. Registration is done per BPF program
|
||||
type. An example is shown below::
|
||||
|
||||
BTF_SET8_START(bpf_task_set)
|
||||
BTF_KFUNCS_START(bpf_task_set)
|
||||
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
|
||||
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
|
||||
BTF_SET8_END(bpf_task_set)
|
||||
BTF_KFUNCS_END(bpf_task_set)
|
||||
|
||||
static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
|
||||
.owner = THIS_MODULE,
|
||||
|
@ -17,7 +17,7 @@ significant byte.
|
||||
|
||||
LPM tries may be created with a maximum prefix length that is a multiple
|
||||
of 8, in the range from 8 to 2048. The key used for lookup and update
|
||||
operations is a ``struct bpf_lpm_trie_key``, extended by
|
||||
operations is a ``struct bpf_lpm_trie_key_u8``, extended by
|
||||
``max_prefixlen/8`` bytes.
|
||||
|
||||
- For IPv4 addresses the data length is 4 bytes
|
||||
|
@ -1,11 +1,11 @@
|
||||
.. contents::
|
||||
.. sectnum::
|
||||
|
||||
=======================================
|
||||
BPF Instruction Set Specification, v1.0
|
||||
=======================================
|
||||
======================================
|
||||
BPF Instruction Set Architecture (ISA)
|
||||
======================================
|
||||
|
||||
This document specifies version 1.0 of the BPF instruction set.
|
||||
This document specifies the BPF instruction set architecture (ISA).
|
||||
|
||||
Documentation conventions
|
||||
=========================
|
||||
@ -24,22 +24,22 @@ a type's signedness (`S`) and bit width (`N`), respectively.
|
||||
.. table:: Meaning of signedness notation.
|
||||
|
||||
==== =========
|
||||
`S` Meaning
|
||||
S Meaning
|
||||
==== =========
|
||||
`u` unsigned
|
||||
`s` signed
|
||||
u unsigned
|
||||
s signed
|
||||
==== =========
|
||||
|
||||
.. table:: Meaning of bit-width notation.
|
||||
|
||||
===== =========
|
||||
`N` Bit width
|
||||
N Bit width
|
||||
===== =========
|
||||
`8` 8 bits
|
||||
`16` 16 bits
|
||||
`32` 32 bits
|
||||
`64` 64 bits
|
||||
`128` 128 bits
|
||||
8 8 bits
|
||||
16 16 bits
|
||||
32 32 bits
|
||||
64 64 bits
|
||||
128 128 bits
|
||||
===== =========
|
||||
|
||||
For example, `u32` is a type whose valid values are all the 32-bit unsigned
|
||||
@ -48,31 +48,31 @@ numbers.
|
||||
|
||||
Functions
|
||||
---------
|
||||
* `htobe16`: Takes an unsigned 16-bit number in host-endian format and
|
||||
* htobe16: Takes an unsigned 16-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 16-bit number in big-endian
|
||||
format.
|
||||
* `htobe32`: Takes an unsigned 32-bit number in host-endian format and
|
||||
* htobe32: Takes an unsigned 32-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 32-bit number in big-endian
|
||||
format.
|
||||
* `htobe64`: Takes an unsigned 64-bit number in host-endian format and
|
||||
* htobe64: Takes an unsigned 64-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 64-bit number in big-endian
|
||||
format.
|
||||
* `htole16`: Takes an unsigned 16-bit number in host-endian format and
|
||||
* htole16: Takes an unsigned 16-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 16-bit number in little-endian
|
||||
format.
|
||||
* `htole32`: Takes an unsigned 32-bit number in host-endian format and
|
||||
* htole32: Takes an unsigned 32-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 32-bit number in little-endian
|
||||
format.
|
||||
* `htole64`: Takes an unsigned 64-bit number in host-endian format and
|
||||
* htole64: Takes an unsigned 64-bit number in host-endian format and
|
||||
returns the equivalent number as an unsigned 64-bit number in little-endian
|
||||
format.
|
||||
* `bswap16`: Takes an unsigned 16-bit number in either big- or little-endian
|
||||
* bswap16: Takes an unsigned 16-bit number in either big- or little-endian
|
||||
format and returns the equivalent number with the same bit width but
|
||||
opposite endianness.
|
||||
* `bswap32`: Takes an unsigned 32-bit number in either big- or little-endian
|
||||
* bswap32: Takes an unsigned 32-bit number in either big- or little-endian
|
||||
format and returns the equivalent number with the same bit width but
|
||||
opposite endianness.
|
||||
* `bswap64`: Takes an unsigned 64-bit number in either big- or little-endian
|
||||
* bswap64: Takes an unsigned 64-bit number in either big- or little-endian
|
||||
format and returns the equivalent number with the same bit width but
|
||||
opposite endianness.
|
||||
|
||||
@ -97,40 +97,101 @@ Definitions
|
||||
A: 10000110
|
||||
B: 11111111 10000110
|
||||
|
||||
Conformance groups
|
||||
------------------
|
||||
|
||||
An implementation does not need to support all instructions specified in this
|
||||
document (e.g., deprecated instructions). Instead, a number of conformance
|
||||
groups are specified. An implementation must support the base32 conformance
|
||||
group and may support additional conformance groups, where supporting a
|
||||
conformance group means it must support all instructions in that conformance
|
||||
group.
|
||||
|
||||
The use of named conformance groups enables interoperability between a runtime
|
||||
that executes instructions, and tools such as compilers that generate
|
||||
instructions for the runtime. Thus, capability discovery in terms of
|
||||
conformance groups might be done manually by users or automatically by tools.
|
||||
|
||||
Each conformance group has a short ASCII label (e.g., "base32") that
|
||||
corresponds to a set of instructions that are mandatory. That is, each
|
||||
instruction has one or more conformance groups of which it is a member.
|
||||
|
||||
This document defines the following conformance groups:
|
||||
|
||||
* base32: includes all instructions defined in this
|
||||
specification unless otherwise noted.
|
||||
* base64: includes base32, plus instructions explicitly noted
|
||||
as being in the base64 conformance group.
|
||||
* atomic32: includes 32-bit atomic operation instructions (see `Atomic operations`_).
|
||||
* atomic64: includes atomic32, plus 64-bit atomic operation instructions.
|
||||
* divmul32: includes 32-bit division, multiplication, and modulo instructions.
|
||||
* divmul64: includes divmul32, plus 64-bit division, multiplication,
|
||||
and modulo instructions.
|
||||
* packet: deprecated packet access instructions.
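The inclusion relationships in the list above lend themselves to a simple capability check. The following is an illustrative sketch; the table and function names are hypothetical and only encode the "includes" statements given in the list.

```python
from typing import Dict, Iterable, Set

# Direct inclusions stated in the list: e.g. base64 includes base32.
INCLUDES: Dict[str, Set[str]] = {
    "base64": {"base32"},
    "atomic64": {"atomic32"},
    "divmul64": {"divmul32"},
}


def expand_groups(groups: Iterable[str]) -> Set[str]:
    """Return the given groups plus every group they include."""
    result: Set[str] = set()
    pending = list(groups)
    while pending:
        group = pending.pop()
        if group in result:
            continue
        result.add(group)
        pending.extend(INCLUDES.get(group, ()))
    return result


def supports_required_base(groups: Iterable[str]) -> bool:
    """An implementation must support at least the base32 group."""
    return "base32" in expand_groups(groups)
```

A tool could compare ``expand_groups`` of what a runtime advertises against the groups a compiled program needs before loading it.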
|
||||
|
||||
Instruction encoding
|
||||
====================
|
||||
|
||||
BPF has two instruction encodings:
|
||||
|
||||
* the basic instruction encoding, which uses 64 bits to encode an instruction
|
||||
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
|
||||
constant) value after the basic instruction for a total of 128 bits.
|
||||
* the wide instruction encoding, which appends a second 64 bits
|
||||
after the basic instruction for a total of 128 bits.
|
||||
|
||||
The fields conforming an encoded basic instruction are stored in the
|
||||
following order::
|
||||
Basic instruction encoding
|
||||
--------------------------
|
||||
|
||||
opcode:8 src_reg:4 dst_reg:4 offset:16 imm:32 // In little-endian BPF.
|
||||
opcode:8 dst_reg:4 src_reg:4 offset:16 imm:32 // In big-endian BPF.
|
||||
A basic instruction is encoded as follows::
|
||||
|
||||
**imm**
|
||||
signed integer immediate value
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| opcode | regs | offset |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| imm |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|
||||
**opcode**
|
||||
operation to perform, encoded as follows::
|
||||
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|specific |class|
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|
||||
**specific**
|
||||
The format of these bits varies by instruction class
|
||||
|
||||
**class**
|
||||
The instruction class (see `Instruction classes`_)
|
||||
|
||||
**regs**
|
||||
The source and destination register numbers, encoded as follows
|
||||
on a little-endian host::
|
||||
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|src_reg|dst_reg|
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|
||||
and as follows on a big-endian host::
|
||||
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|dst_reg|src_reg|
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|
||||
**src_reg**
|
||||
the source register number (0-10), except where otherwise specified
|
||||
(`64-bit immediate instructions`_ reuse this field for other purposes)
|
||||
|
||||
**dst_reg**
|
||||
destination register number (0-10)
|
||||
|
||||
**offset**
|
||||
signed integer offset used with pointer arithmetic
|
||||
|
||||
**src_reg**
|
||||
the source register number (0-10), except where otherwise specified
|
||||
(`64-bit immediate instructions`_ reuse this field for other purposes)
|
||||
**imm**
|
||||
signed integer immediate value
|
||||
|
||||
**dst_reg**
|
||||
destination register number (0-10)
|
||||
|
||||
**opcode**
|
||||
operation to perform
|
||||
|
||||
Note that the contents of multi-byte fields ('imm' and 'offset') are
|
||||
stored using big-endian byte ordering in big-endian BPF and
|
||||
little-endian byte ordering in little-endian BPF.
|
||||
Note that the contents of multi-byte fields ('offset' and 'imm') are
|
||||
stored using big-endian byte ordering on big-endian hosts and
|
||||
little-endian byte ordering on little-endian hosts.
|
||||
|
||||
For example::
|
||||
|
||||
@ -143,71 +204,83 @@ For example::
|
||||
Note that most instructions do not use all of the fields.
|
||||
Unused fields shall be cleared to zero.
|
||||
|
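As an illustration of the basic encoding described above, the following Python sketch packs and unpacks one 64-bit instruction as it would be laid out on a little-endian host. The helper names ``encode_basic`` and ``decode_basic`` are hypothetical and are not defined by this specification.

```python
import struct

def encode_basic(opcode, dst_reg, src_reg, offset, imm):
    """Pack a basic instruction as stored on a little-endian host."""
    # On a little-endian host the 'regs' byte carries src_reg in the upper
    # four bits and dst_reg in the lower four bits.
    regs = ((src_reg & 0xF) << 4) | (dst_reg & 0xF)
    # "<BBhi": little-endian u8 opcode, u8 regs, s16 offset, s32 imm.
    return struct.pack("<BBhi", opcode, regs, offset, imm)

def decode_basic(raw):
    """Unpack the 8 bytes produced by encode_basic()."""
    opcode, regs, offset, imm = struct.unpack("<BBhi", raw)
    return {
        "opcode": opcode,
        "dst_reg": regs & 0xF,
        "src_reg": regs >> 4,
        "offset": offset,
        "imm": imm,
    }

# {ADD, X, ALU64} with dst_reg = 1 and src_reg = 2; unused fields are zero.
insn = encode_basic(0x0F, 1, 2, 0, 0)
```

A big-endian host would swap the two nibbles of 'regs' and use big-endian byte order for 'offset' and 'imm'; the sketch does not cover that case.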
||||
As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
|
||||
instruction uses a 64-bit immediate value that is constructed as follows.
|
||||
The 64 bits following the basic instruction contain a pseudo instruction
|
||||
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
|
||||
and imm containing the high 32 bits of the immediate value.
|
||||
Wide instruction encoding
|
||||
--------------------------
|
||||
|
||||
Some instructions are defined to use the wide instruction encoding,
|
||||
which uses two 32-bit immediate values. The 64 bits following
|
||||
the basic instruction format contain a pseudo instruction
|
||||
with 'opcode', 'dst_reg', 'src_reg', and 'offset' all set to zero.
|
||||
|
||||
This is depicted in the following figure::
|
||||
|
||||
basic_instruction
|
||||
.-----------------------------.
|
||||
| |
|
||||
code:8 regs:8 offset:16 imm:32 unused:32 imm:32
|
||||
| |
|
||||
'--------------'
|
||||
pseudo instruction
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| opcode | regs | offset |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| imm |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| reserved |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| next_imm |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|
||||
Thus the 64-bit immediate value is constructed as follows:
|
||||
**opcode**
|
||||
operation to perform, encoded as explained above
|
||||
|
||||
imm64 = (next_imm << 32) | imm
|
||||
**regs**
|
||||
The source and destination register numbers, encoded as explained above
|
||||
|
||||
where 'next_imm' refers to the imm value of the pseudo instruction
|
||||
following the basic instruction. The unused bytes in the pseudo
|
||||
instruction are reserved and shall be cleared to zero.
|
||||
**offset**
|
||||
signed integer offset used with pointer arithmetic
|
||||
|
||||
**imm**
|
||||
signed integer immediate value
|
||||
|
||||
**reserved**
|
||||
unused, set to zero
|
||||
|
||||
**next_imm**
|
||||
second signed integer immediate value
|
||||
|
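The way the two immediates of a wide instruction combine into one 64-bit value (``(next_imm << 32) | imm``) can be sketched as follows. The function names are hypothetical; the masking step is an assumption made here so that signed 32-bit fields do not smear sign bits into the upper half.

```python
def to_s32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def imm64(imm, next_imm):
    """Combine 'imm' (low half) and 'next_imm' (high half) into 64 bits."""
    # Mask each signed field to its 32-bit pattern before combining.
    return ((next_imm & 0xFFFFFFFF) << 32) | (imm & 0xFFFFFFFF)

def split_imm64(value):
    """Inverse of imm64(): return (imm, next_imm) as signed 32-bit values."""
    return to_s32(value), to_s32(value >> 32)
```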
||||
Instruction classes
|
||||
-------------------
|
||||
|
||||
The three LSB bits of the 'opcode' field store the instruction class:
|
||||
The three least significant bits of the 'opcode' field store the instruction class:
|
||||
|
||||
========= ===== =============================== ===================================
|
||||
class value description reference
|
||||
========= ===== =============================== ===================================
|
||||
BPF_LD 0x00 non-standard load operations `Load and store instructions`_
|
||||
BPF_LDX 0x01 load into register operations `Load and store instructions`_
|
||||
BPF_ST 0x02 store from immediate operations `Load and store instructions`_
|
||||
BPF_STX 0x03 store from register operations `Load and store instructions`_
|
||||
BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
|
||||
BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
|
||||
BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
|
||||
BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
|
||||
========= ===== =============================== ===================================
|
||||
===== ===== =============================== ===================================
|
||||
class value description reference
|
||||
===== ===== =============================== ===================================
|
||||
LD 0x0 non-standard load operations `Load and store instructions`_
|
||||
LDX 0x1 load into register operations `Load and store instructions`_
|
||||
ST 0x2 store from immediate operations `Load and store instructions`_
|
||||
STX 0x3 store from register operations `Load and store instructions`_
|
||||
ALU 0x4 32-bit arithmetic operations `Arithmetic and jump instructions`_
|
||||
JMP 0x5 64-bit jump operations `Arithmetic and jump instructions`_
|
||||
JMP32 0x6 32-bit jump operations `Arithmetic and jump instructions`_
|
||||
ALU64 0x7 64-bit arithmetic operations `Arithmetic and jump instructions`_
|
||||
===== ===== =============================== ===================================
|
||||
|
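Because the class occupies the three least significant bits of 'opcode', it can be recovered with a single mask. A minimal sketch, using a hypothetical helper name:

```python
# Index in this list equals the class value from the table above.
CLASS_NAMES = ["LD", "LDX", "ST", "STX", "ALU", "JMP", "JMP32", "ALU64"]

def instruction_class(opcode):
    """Return the class name encoded in the low three bits of 'opcode'."""
    return CLASS_NAMES[opcode & 0x07]
```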
||||
Arithmetic and jump instructions
|
||||
================================
|
||||
|
||||
For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
|
||||
``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
|
||||
For arithmetic and jump instructions (``ALU``, ``ALU64``, ``JMP`` and
|
||||
``JMP32``), the 8-bit 'opcode' field is divided into three parts::
|
||||
|
||||
============== ====== =================
|
||||
4 bits (MSB) 1 bit 3 bits (LSB)
|
||||
============== ====== =================
|
||||
code source instruction class
|
||||
============== ====== =================
|
||||
+-+-+-+-+-+-+-+-+
|
||||
| code |s|class|
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|
||||
**code**
|
||||
the operation code, whose meaning varies by instruction class
|
||||
|
||||
**source**
|
||||
**s (source)**
|
||||
the source operand location, which unless otherwise specified is one of:
|
||||
|
||||
====== ===== ==============================================
|
||||
source value description
|
||||
====== ===== ==============================================
|
||||
BPF_K 0x00 use 32-bit 'imm' value as source operand
|
||||
BPF_X 0x08 use 'src_reg' register value as source operand
|
||||
K 0 use 32-bit 'imm' value as source operand
|
||||
X 1 use 'src_reg' register value as source operand
|
||||
====== ===== ==============================================
|
||||
|
||||
**instruction class**
|
||||
@ -216,70 +289,75 @@ code source instruction class
|
||||
Arithmetic instructions
|
||||
-----------------------
|
||||
|
||||
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
|
||||
otherwise identical operations.
|
||||
``ALU`` uses 32-bit wide operands while ``ALU64`` uses 64-bit wide operands for
|
||||
otherwise identical operations. ``ALU64`` instructions belong to the
|
||||
base64 conformance group unless noted otherwise.
|
||||
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
|
||||
to the values of the source and destination registers, respectively.
|
||||
|
||||
========= ===== ======= ==========================================================
|
||||
code value offset description
|
||||
========= ===== ======= ==========================================================
|
||||
BPF_ADD 0x00 0 dst += src
|
||||
BPF_SUB 0x10 0 dst -= src
|
||||
BPF_MUL 0x20 0 dst \*= src
|
||||
BPF_DIV 0x30 0 dst = (src != 0) ? (dst / src) : 0
|
||||
BPF_SDIV 0x30 1 dst = (src != 0) ? (dst s/ src) : 0
|
||||
BPF_OR 0x40 0 dst \|= src
|
||||
BPF_AND 0x50 0 dst &= src
|
||||
BPF_LSH 0x60 0 dst <<= (src & mask)
|
||||
BPF_RSH 0x70 0 dst >>= (src & mask)
|
||||
BPF_NEG 0x80 0 dst = -dst
|
||||
BPF_MOD 0x90 0 dst = (src != 0) ? (dst % src) : dst
|
||||
BPF_SMOD 0x90 1 dst = (src != 0) ? (dst s% src) : dst
|
||||
BPF_XOR 0xa0 0 dst ^= src
|
||||
BPF_MOV 0xb0 0 dst = src
|
||||
BPF_MOVSX 0xb0 8/16/32 dst = (s8,s16,s32)src
|
||||
BPF_ARSH 0xc0 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
|
||||
BPF_END 0xd0 0 byte swap operations (see `Byte swap instructions`_ below)
|
||||
========= ===== ======= ==========================================================
|
||||
===== ===== ======= ==========================================================
|
||||
name code offset description
|
||||
===== ===== ======= ==========================================================
|
||||
ADD 0x0 0 dst += src
|
||||
SUB 0x1 0 dst -= src
|
||||
MUL 0x2 0 dst \*= src
|
||||
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
|
||||
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
|
||||
OR 0x4 0 dst \|= src
|
||||
AND 0x5 0 dst &= src
|
||||
LSH 0x6 0 dst <<= (src & mask)
|
||||
RSH 0x7 0 dst >>= (src & mask)
|
||||
NEG 0x8 0 dst = -dst
|
||||
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
|
||||
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
|
||||
XOR 0xa 0 dst ^= src
|
||||
MOV 0xb 0 dst = src
|
||||
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
|
||||
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
|
||||
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
|
||||
===== ===== ======= ==========================================================
|
||||
|
||||
Underflow and overflow are allowed during arithmetic operations, meaning
|
||||
the 64-bit or 32-bit value will wrap. If BPF program execution would
|
||||
result in division by zero, the destination register is instead set to zero.
|
||||
If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
|
||||
the destination register is unchanged whereas for ``BPF_ALU`` the upper
|
||||
If execution would result in modulo by zero, for ``ALU64`` the value of
|
||||
the destination register is unchanged whereas for ``ALU`` the upper
|
||||
32 bits of the destination register are zeroed.
|
||||
|
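The wrap-around and zero-divisor rules above can be illustrated with a small emulation of a few unsigned operations. This is only a sketch: the function name ``alu`` and the string operation names are assumptions made for illustration, and register values are modelled as non-negative Python integers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def alu(op, dst, src, is64):
    """Emulate ADD, SUB, MUL, DIV and MOD for the ALU (32-bit) or ALU64 class."""
    mask = MASK64 if is64 else MASK32
    d, s = dst & mask, src & mask
    if op == "ADD":
        result = d + s
    elif op == "SUB":
        result = d - s
    elif op == "MUL":
        result = d * s
    elif op == "DIV":
        # Division by zero yields zero instead of trapping.
        result = d // s if s != 0 else 0
    elif op == "MOD":
        # Modulo by zero keeps dst; for the 32-bit class 'd' is already
        # truncated, so the upper 32 bits come out as zero.
        result = d % s if s != 0 else d
    else:
        raise ValueError("operation not covered by this sketch")
    # Overflow and underflow wrap to the operand width.
    return result & mask
```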
||||
``BPF_ADD | BPF_X | BPF_ALU`` means::
|
||||
``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::
|
||||
|
||||
dst = (u32) ((u32) dst + (u32) src)
|
||||
|
||||
where '(u32)' indicates that the upper 32 bits are zeroed.
|
||||
|
||||
``BPF_ADD | BPF_X | BPF_ALU64`` means::
|
||||
``{ADD, X, ALU64}`` means::
|
||||
|
||||
dst = dst + src
|
||||
|
||||
``BPF_XOR | BPF_K | BPF_ALU`` means::
|
||||
``{XOR, K, ALU}`` means::
|
||||
|
||||
dst = (u32) dst ^ (u32) imm32
|
||||
dst = (u32) dst ^ (u32) imm
|
||||
|
||||
``BPF_XOR | BPF_K | BPF_ALU64`` means::
|
||||
``{XOR, K, ALU64}`` means::
|
||||
|
||||
dst = dst ^ imm32
|
||||
dst = dst ^ imm
|
||||
|
||||
Note that most instructions have an instruction offset of 0. Only three instructions
|
||||
(``BPF_SDIV``, ``BPF_SMOD``, ``BPF_MOVSX``) have a non-zero offset.
|
||||
(``SDIV``, ``SMOD``, ``MOVSX``) have a non-zero offset.
|
||||
|
||||
Division, multiplication, and modulo operations for ``ALU`` are part
|
||||
of the "divmul32" conformance group, and division, multiplication, and
|
||||
modulo operations for ``ALU64`` are part of the "divmul64" conformance
|
||||
group.
|
||||
The division and modulo operations support both unsigned and signed flavors.
|
||||
|
||||
For unsigned operations (``BPF_DIV`` and ``BPF_MOD``), for ``BPF_ALU``,
|
||||
'imm' is interpreted as a 32-bit unsigned value. For ``BPF_ALU64``,
|
||||
For unsigned operations (``DIV`` and ``MOD``), for ``ALU``,
|
||||
'imm' is interpreted as a 32-bit unsigned value. For ``ALU64``,
|
||||
'imm' is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
|
||||
interpreted as a 64-bit unsigned value.
|
||||
|
||||
For signed operations (``BPF_SDIV`` and ``BPF_SMOD``), for ``BPF_ALU``,
|
||||
'imm' is interpreted as a 32-bit signed value. For ``BPF_ALU64``, 'imm'
|
||||
For signed operations (``SDIV`` and ``SMOD``), for ``ALU``,
|
||||
'imm' is interpreted as a 32-bit signed value. For ``ALU64``, 'imm'
|
||||
is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
|
||||
interpreted as a 64-bit signed value.
|
||||
|
||||
@ -291,11 +369,15 @@ etc. This specification requires that signed modulo use truncated division
|
||||
|
||||
a % n = a - n * trunc(a / n)
|
||||
|
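Truncated division differs from the floored division that some languages use for negative operands. The sketch below contrasts the two in Python, whose built-in ``%`` is floored; ``sdiv_trunc`` and ``smod_trunc`` are hypothetical helper names, and a non-zero divisor is assumed.

```python
def sdiv_trunc(a, n):
    """Signed division that rounds toward zero (n must be non-zero)."""
    quotient = abs(a) // abs(n)
    return -quotient if (a < 0) != (n < 0) else quotient

def smod_trunc(a, n):
    """Signed modulo defined as a - n * trunc(a / n)."""
    return a - n * sdiv_trunc(a, n)

# Python's own % operator floors, so -13 % 3 evaluates to 2,
# whereas the truncated definition gives -1.
truncated = smod_trunc(-13, 3)
```

With truncated division the result takes the sign of the dividend, which is why ``smod_trunc(13, -3)`` is 1 rather than -2.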
||||
The ``BPF_MOVSX`` instruction does a move operation with sign extension.
|
||||
``BPF_ALU | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
|
||||
The ``MOVSX`` instruction does a move operation with sign extension.
|
||||
``{MOVSX, X, ALU}`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
|
||||
bit operands, and zeroes the remaining upper 32 bits.
|
||||
``BPF_ALU64 | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
|
||||
operands into 64 bit operands.
|
||||
``{MOVSX, X, ALU64}`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
|
||||
operands into 64 bit operands. Unlike other arithmetic instructions,
|
||||
``MOVSX`` is only defined for register source operands (``X``).
|
||||
|
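The sign-extending move can be sketched as below. The helper names are hypothetical; ``width`` stands for the 8/16/32 value carried in the 'offset' field, and registers are modelled as unsigned 64-bit integers.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign extend the low 'from_bits' of value to 'to_bits' bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)

def movsx(src, width, is64):
    """Emulate MOVSX: extend to 64 bits for ALU64, to 32 bits for ALU.

    For the 32-bit class the result fits in 32 bits, so the upper
    32 bits of the destination are zero.
    """
    return sign_extend(src, width, 64 if is64 else 32)
```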
||||
The ``NEG`` instruction is only defined when the source bit is clear
|
||||
(``K``).
|
||||
|
||||
Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31)
|
||||
for 32-bit operations.
|
||||
@ -303,43 +385,45 @@ for 32-bit operations.
|
||||
Byte swap instructions
|
||||
----------------------
|
||||
|
||||
The byte swap instructions use instruction classes of ``BPF_ALU`` and ``BPF_ALU64``
|
||||
and a 4-bit 'code' field of ``BPF_END``.
|
||||
The byte swap instructions use instruction classes of ``ALU`` and ``ALU64``
|
||||
and a 4-bit 'code' field of ``END``.
|
||||
|
||||
The byte swap instructions operate on the destination register
|
||||
only and do not use a separate source register or immediate value.
|
||||
|
||||
For ``BPF_ALU``, the 1-bit source operand field in the opcode is used to
|
||||
For ``ALU``, the 1-bit source operand field in the opcode is used to
|
||||
select what byte order the operation converts from or to. For
|
||||
``BPF_ALU64``, the 1-bit source operand field in the opcode is reserved
|
||||
``ALU64``, the 1-bit source operand field in the opcode is reserved
|
||||
and must be set to 0.
|
||||
|
||||
========= ========= ===== =================================================
|
||||
class source value description
|
||||
========= ========= ===== =================================================
|
||||
BPF_ALU BPF_TO_LE 0x00 convert between host byte order and little endian
|
||||
BPF_ALU BPF_TO_BE 0x08 convert between host byte order and big endian
|
||||
BPF_ALU64 Reserved 0x00 do byte swap unconditionally
|
||||
========= ========= ===== =================================================
|
||||
===== ======== ===== =================================================
|
||||
class source value description
|
||||
===== ======== ===== =================================================
|
||||
ALU TO_LE 0 convert between host byte order and little endian
|
||||
ALU TO_BE 1 convert between host byte order and big endian
|
||||
ALU64 Reserved 0 do byte swap unconditionally
|
||||
===== ======== ===== =================================================
|
||||
|
||||
The 'imm' field encodes the width of the swap operations. The following widths
|
||||
are supported: 16, 32 and 64.
|
||||
are supported: 16, 32 and 64. Width 64 operations belong to the base64
|
||||
conformance group and other swap operations belong to the base32
|
||||
conformance group.
|
||||
|
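The three byte swap variants can be modelled as follows. This is a sketch under stated assumptions: the function names are hypothetical, the host byte order is passed in explicitly so the behaviour is reproducible, and the result is truncated to the swap width, as the ``htole16``-style conversions in the examples imply.

```python
def bswap(value, width):
    """Reverse the byte order of the low 'width' bits of value."""
    value &= (1 << width) - 1
    return int.from_bytes(value.to_bytes(width // 8, "little"), "big")

def byte_swap_insn(dst, width, source, is64, host="little"):
    """Emulate END for the ALU class (TO_LE / TO_BE) or the ALU64 class."""
    if is64:
        # ALU64: the source bit is reserved and the swap is unconditional.
        return bswap(dst, width)
    target = "little" if source == "TO_LE" else "big"
    value = dst & ((1 << width) - 1)
    # Converting to the host's own byte order only truncates.
    return value if host == target else bswap(value, width)
```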
||||
Examples:
|
||||
|
||||
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
|
||||
``{END, TO_LE, ALU}`` with imm = 16/32/64 means::
|
||||
|
||||
dst = htole16(dst)
|
||||
dst = htole32(dst)
|
||||
dst = htole64(dst)
|
||||
|
||||
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 16/32/64 means::
|
||||
``{END, TO_BE, ALU}`` with imm = 16/32/64 means::
|
||||
|
||||
dst = htobe16(dst)
|
||||
dst = htobe32(dst)
|
||||
dst = htobe64(dst)
|
||||
|
||||
``BPF_ALU64 | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
|
||||
``{END, TO_LE, ALU64}`` with imm = 16/32/64 means::
|
||||
|
||||
dst = bswap16(dst)
|
||||
dst = bswap32(dst)
|
||||
@ -348,56 +432,61 @@ Examples:
|
||||
Jump instructions
|
||||
-----------------
|
||||
|
||||
``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
|
||||
otherwise identical operations.
|
||||
``JMP32`` uses 32-bit wide operands and indicates the base32
|
||||
conformance group, while ``JMP`` uses 64-bit wide operands for
|
||||
otherwise identical operations, and indicates the base64 conformance
|
||||
group unless otherwise specified.
|
||||
The 'code' field encodes the operation as below:
|
||||
|
||||
======== ===== === =========================================== =========================================
|
||||
code value src description notes
|
||||
======== ===== === =========================================== =========================================
|
||||
BPF_JA 0x0 0x0 PC += offset BPF_JMP class
|
||||
BPF_JA 0x0 0x0 PC += imm BPF_JMP32 class
|
||||
BPF_JEQ 0x1 any PC += offset if dst == src
|
||||
BPF_JGT 0x2 any PC += offset if dst > src unsigned
|
||||
BPF_JGE 0x3 any PC += offset if dst >= src unsigned
|
||||
BPF_JSET 0x4 any PC += offset if dst & src
|
||||
BPF_JNE 0x5 any PC += offset if dst != src
|
||||
BPF_JSGT 0x6 any PC += offset if dst > src signed
|
||||
BPF_JSGE 0x7 any PC += offset if dst >= src signed
|
||||
BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_
|
||||
BPF_CALL 0x8 0x1 call PC += imm see `Program-local functions`_
|
||||
BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_
|
||||
BPF_EXIT 0x9 0x0 return BPF_JMP only
|
||||
BPF_JLT 0xa any PC += offset if dst < src unsigned
|
||||
BPF_JLE 0xb any PC += offset if dst <= src unsigned
|
||||
BPF_JSLT 0xc any PC += offset if dst < src signed
|
||||
BPF_JSLE 0xd any PC += offset if dst <= src signed
|
||||
======== ===== === =========================================== =========================================
|
||||
======== ===== ======= =============================== ===================================================
|
||||
code value src_reg description notes
|
||||
======== ===== ======= =============================== ===================================================
|
||||
JA 0x0 0x0 PC += offset {JA, K, JMP} only
|
||||
JA 0x0 0x0 PC += imm {JA, K, JMP32} only
|
||||
JEQ 0x1 any PC += offset if dst == src
|
||||
JGT 0x2 any PC += offset if dst > src unsigned
|
||||
JGE 0x3 any PC += offset if dst >= src unsigned
|
||||
JSET 0x4 any PC += offset if dst & src
|
||||
JNE 0x5 any PC += offset if dst != src
|
||||
JSGT 0x6 any PC += offset if dst > src signed
|
||||
JSGE 0x7 any PC += offset if dst >= src signed
|
||||
CALL 0x8 0x0 call helper function by address {CALL, K, JMP} only, see `Helper functions`_
|
||||
CALL 0x8 0x1 call PC += imm {CALL, K, JMP} only, see `Program-local functions`_
|
||||
CALL 0x8 0x2 call helper function by BTF ID {CALL, K, JMP} only, see `Helper functions`_
|
||||
EXIT 0x9 0x0 return {EXIT, K, JMP} only
|
||||
JLT 0xa any PC += offset if dst < src unsigned
|
||||
JLE 0xb any PC += offset if dst <= src unsigned
|
||||
JSLT 0xc any PC += offset if dst < src signed
|
||||
JSLE 0xd any PC += offset if dst <= src signed
|
||||
======== ===== ======= =============================== ===================================================
|
||||
|
||||
The BPF program needs to store the return value into register R0 before doing a
|
||||
``BPF_EXIT``.
|
||||
The BPF program needs to store the return value into register R0 before doing an
|
||||
``EXIT``.
|
||||
|
||||
Example:
|
||||
|
||||
``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
|
||||
``{JSGE, X, JMP32}`` means::
|
||||
|
||||
if (s32)dst s>= (s32)src goto +offset
|
||||
|
||||
where 's>=' indicates a signed '>=' comparison.
|
||||
|
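To show how operand width changes the outcome of a signed comparison, the following sketch evaluates the ``JSGE`` condition for both jump classes. The helper names are hypothetical and registers are modelled as unsigned 64-bit integers.

```python
def to_signed(value, bits):
    """Interpret the low 'bits' bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def jsge_taken(dst, src, is32):
    """Return True if a JSGE branch would be taken.

    JMP32 compares the low 32 bits as signed values; JMP compares
    all 64 bits as signed values.
    """
    bits = 32 if is32 else 64
    return to_signed(dst, bits) >= to_signed(src, bits)
```

For example, a register holding 0xFFFFFFFF is -1 when viewed as 32 bits but a large positive number when viewed as 64 bits, so the branch outcome against 0 differs between the two classes.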
||||
``BPF_JA | BPF_K | BPF_JMP32`` (0x06) means::
|
||||
``{JA, K, JMP32}`` means::
|
||||
|
||||
gotol +imm
|
||||
|
||||
where 'imm' means the branch offset comes from insn 'imm' field.
|
||||
|
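The opcode byte for an arithmetic or jump instruction is the 4-bit 'code', the 1-bit source, and the 3-bit class packed together. A minimal sketch, with a hypothetical helper name and only the constants needed for the two examples above:

```python
# Class values
JMP, JMP32 = 0x5, 0x6
# Source bit values
K, X = 0, 1
# Jump codes
JA, JSGE = 0x0, 0x7

def make_opcode(code, source, cls):
    """Pack code (4 bits), source (1 bit) and class (3 bits) into one byte."""
    return (code << 4) | (source << 3) | cls
```

``make_opcode(JSGE, X, JMP32)`` yields 0x7e and ``make_opcode(JA, K, JMP32)`` yields 0x06, matching the byte values quoted alongside the older ``BPF_``-prefixed notation.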
||||
Note that there are two flavors of ``BPF_JA`` instructions. The
|
||||
``BPF_JMP`` class permits a 16-bit jump offset specified by the 'offset'
|
||||
field, whereas the ``BPF_JMP32`` class permits a 32-bit jump offset
|
||||
Note that there are two flavors of ``JA`` instructions. The
|
||||
``JMP`` class permits a 16-bit jump offset specified by the 'offset'
|
||||
field, whereas the ``JMP32`` class permits a 32-bit jump offset
|
||||
specified by the 'imm' field. A conditional jump whose target needs more
|
||||
than a 16-bit offset may be converted to a conditional jump with a 16-bit
|
||||
offset plus a 32-bit unconditional jump.
|
||||
|
||||
All ``CALL`` and ``JA`` instructions belong to the
|
||||
base32 conformance group.
|
||||
|
||||
Helper functions
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
@ -416,78 +505,83 @@ Program-local functions
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Program-local functions are functions exposed by the same BPF program as the
|
||||
caller, and are referenced by offset from the call instruction, similar to
|
||||
``BPF_JA``. The offset is encoded in the imm field of the call instruction.
|
||||
A ``BPF_EXIT`` within the program-local function will return to the caller.
|
||||
``JA``. The offset is encoded in the imm field of the call instruction.
|
||||
An ``EXIT`` within the program-local function will return to the caller.
|
||||
|
||||
Load and store instructions
|
||||
===========================
|
||||
|
||||
For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
|
||||
8-bit 'opcode' field is divided as:
|
||||
For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
|
||||
8-bit 'opcode' field is divided as::
|
||||
|
||||
============ ====== =================
|
||||
3 bits (MSB) 2 bits 3 bits (LSB)
|
||||
============ ====== =================
|
||||
mode size instruction class
|
||||
============ ====== =================
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|mode |sz |class|
|
||||
+-+-+-+-+-+-+-+-+
|
||||
|
||||
The mode modifier is one of:
|
||||
**mode**
|
||||
The mode modifier is one of:
|
||||
|
||||
============= ===== ==================================== =============
|
||||
mode modifier value description reference
|
||||
============= ===== ==================================== =============
|
||||
BPF_IMM 0x00 64-bit immediate instructions `64-bit immediate instructions`_
|
||||
BPF_ABS 0x20 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
|
||||
BPF_IND 0x40 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
|
||||
BPF_MEM 0x60 regular load and store operations `Regular load and store operations`_
|
||||
BPF_MEMSX 0x80 sign-extension load operations `Sign-extension load operations`_
|
||||
BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_
|
||||
============= ===== ==================================== =============
|
||||
============= ===== ==================================== =============
|
||||
mode modifier value description reference
|
||||
============= ===== ==================================== =============
|
||||
IMM 0 64-bit immediate instructions `64-bit immediate instructions`_
|
||||
ABS 1 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
|
||||
IND 2 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
|
||||
MEM 3 regular load and store operations `Regular load and store operations`_
|
||||
MEMSX 4 sign-extension load operations `Sign-extension load operations`_
|
||||
ATOMIC 6 atomic operations `Atomic operations`_
|
||||
============= ===== ==================================== =============
|
||||
|
||||
The size modifier is one of:
|
||||
**sz (size)**
|
||||
The size modifier is one of:
|
||||
|
||||
============= ===== =====================
|
||||
size modifier value description
|
||||
============= ===== =====================
|
||||
BPF_W 0x00 word (4 bytes)
|
||||
BPF_H 0x08 half word (2 bytes)
|
||||
BPF_B 0x10 byte
|
||||
BPF_DW 0x18 double word (8 bytes)
|
||||
============= ===== =====================
|
||||
==== ===== =====================
|
||||
size value description
|
||||
==== ===== =====================
|
||||
W 0 word (4 bytes)
|
||||
H 1 half word (2 bytes)
|
||||
B 2 byte
|
||||
DW 3 double word (8 bytes)
|
||||
==== ===== =====================
|
||||
|
||||
Instructions using ``DW`` belong to the base64 conformance group.
|
||||
|
||||
**class**
|
||||
The instruction class (see `Instruction classes`_)
|
||||
|
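Splitting a load or store opcode into its three fields can be sketched as follows; ``decode_ldst`` is a hypothetical helper name, and the lookup tables simply restate the values listed above.

```python
MODES = {0: "IMM", 1: "ABS", 2: "IND", 3: "MEM", 4: "MEMSX", 6: "ATOMIC"}
SIZES = {0: "W", 1: "H", 2: "B", 3: "DW"}
CLASSES = {0: "LD", 1: "LDX", 2: "ST", 3: "STX"}

def decode_ldst(opcode):
    """Return (mode, size, class) for a load or store opcode byte."""
    mode = opcode >> 5           # top three bits
    size = (opcode >> 3) & 0x3   # next two bits
    cls = opcode & 0x7           # low three bits
    return MODES[mode], SIZES[size], CLASSES[cls]
```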
||||
Regular load and store operations
|
||||
---------------------------------
|
||||
|
||||
The ``BPF_MEM`` mode modifier is used to encode regular load and store
|
||||
The ``MEM`` mode modifier is used to encode regular load and store
|
||||
instructions that transfer data between a register and memory.
|
||||
|
||||
``BPF_MEM | <size> | BPF_STX`` means::
|
||||
``{MEM, <size>, STX}`` means::
|
||||
|
||||
*(size *) (dst + offset) = src
|
||||
|
||||
``BPF_MEM | <size> | BPF_ST`` means::
|
||||
``{MEM, <size>, ST}`` means::
|
||||
|
||||
*(size *) (dst + offset) = imm32
|
||||
*(size *) (dst + offset) = imm
|
||||
|
||||
``BPF_MEM | <size> | BPF_LDX`` means::
|
||||
``{MEM, <size>, LDX}`` means::
|
||||
|
||||
dst = *(unsigned size *) (src + offset)
|
||||
|
||||
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW`` and
|
||||
'unsigned size' is one of u8, u16, u32 or u64.
|
||||
Where '<size>' is one of: ``B``, ``H``, ``W``, or ``DW``, and
|
||||
'unsigned size' is one of: u8, u16, u32, or u64.
|
||||
|
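The register store and zero-extending load can be modelled against a flat byte array. This is a sketch under assumptions: the helper names are hypothetical, memory is a Python ``bytearray`` indexed from zero, and little-endian byte order is assumed for multi-byte accesses.

```python
SIZE_BYTES = {"B": 1, "H": 2, "W": 4, "DW": 8}

def stx(mem, dst, offset, src, size):
    """{MEM, <size>, STX}: store the low bytes of src at dst + offset."""
    n = SIZE_BYTES[size]
    addr = dst + offset
    truncated = src & ((1 << (8 * n)) - 1)
    mem[addr:addr + n] = truncated.to_bytes(n, "little")

def ldx(mem, src, offset, size):
    """{MEM, <size>, LDX}: load an unsigned value from src + offset."""
    n = SIZE_BYTES[size]
    addr = src + offset
    # Unsigned interpretation means the result is zero-extended.
    return int.from_bytes(mem[addr:addr + n], "little")

mem = bytearray(16)
stx(mem, 8, -4, 0x1122334455, "W")   # only the low 4 bytes are stored
```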
||||
Sign-extension load operations
|
||||
------------------------------
|
||||
|
||||
The ``BPF_MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
|
||||
The ``MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
|
||||
instructions that transfer data between a register and memory.
|
||||
|
||||
``BPF_MEMSX | <size> | BPF_LDX`` means::
|
||||
``{MEMSX, <size>, LDX}`` means::
|
||||
|
||||
dst = *(signed size *) (src + offset)
|
||||
|
||||
Where size is one of: ``BPF_B``, ``BPF_H`` or ``BPF_W``, and
|
||||
'signed size' is one of s8, s16 or s32.
|
||||
Where size is one of: ``B``, ``H``, or ``W``, and
|
||||
'signed size' is one of: s8, s16, or s32.
|
||||
|
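A sign-extending load differs from a regular load only in how the bytes read from memory are widened. A minimal sketch, again assuming a ``bytearray`` memory model, little-endian byte order, and a hypothetical helper name:

```python
SX_SIZE_BYTES = {"B": 1, "H": 2, "W": 4}

def ldsx(mem, src, offset, size):
    """{MEMSX, <size>, LDX}: load a signed value and extend it to 64 bits."""
    n = SX_SIZE_BYTES[size]
    addr = src + offset
    value = int.from_bytes(mem[addr:addr + n], "little", signed=True)
    # Represent the signed result as a 64-bit register bit pattern.
    return value & 0xFFFFFFFFFFFFFFFF

sx_mem = bytearray([0x80, 0xFF, 0x00, 0x00])
```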
||||
Atomic operations
|
||||
-----------------
|
||||
@ -497,10 +591,12 @@ interrupted or corrupted by other access to the same memory region
|
||||
by other BPF programs or means outside of this specification.
|
||||
|
||||
All atomic operations supported by BPF are encoded as store operations
|
||||
that use the ``BPF_ATOMIC`` mode modifier as follows:
|
||||
that use the ``ATOMIC`` mode modifier as follows:
|
||||
|
||||
* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
|
||||
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
|
||||
* ``{ATOMIC, W, STX}`` for 32-bit operations, which are
|
||||
part of the "atomic32" conformance group.
|
||||
* ``{ATOMIC, DW, STX}`` for 64-bit operations, which are
|
||||
part of the "atomic64" conformance group.
|
||||
* 8-bit and 16-bit wide atomic operations are not supported.
|
||||
|
||||
The 'imm' field is used to encode the actual atomic operation.
|
||||
@ -510,18 +606,18 @@ arithmetic operations in the 'imm' field to encode the atomic operation:
|
||||
======== ===== ===========
|
||||
imm value description
|
||||
======== ===== ===========
|
||||
BPF_ADD 0x00 atomic add
|
||||
BPF_OR 0x40 atomic or
|
||||
BPF_AND 0x50 atomic and
|
||||
BPF_XOR 0xa0 atomic xor
|
||||
ADD 0x00 atomic add
|
||||
OR 0x40 atomic or
|
||||
AND 0x50 atomic and
|
||||
XOR 0xa0 atomic xor
|
||||
======== ===== ===========
|
||||
|
||||
|
||||
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
|
||||
``{ATOMIC, W, STX}`` with 'imm' = ADD means::
|
||||
|
||||
*(u32 *)(dst + offset) += src
|
||||
|
||||
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
|
||||
``{ATOMIC, DW, STX}`` with 'imm' = ADD means::
|
||||
|
||||
*(u64 *)(dst + offset) += src
|
||||
|
||||
@ -531,20 +627,20 @@ two complex atomic operations:
|
||||
=========== ================ ===========================
|
||||
imm value description
|
||||
=========== ================ ===========================
|
||||
BPF_FETCH 0x01 modifier: return old value
|
||||
BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
|
||||
BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
|
||||
FETCH 0x01 modifier: return old value
|
||||
XCHG 0xe0 | FETCH atomic exchange
|
||||
CMPXCHG 0xf0 | FETCH atomic compare and exchange
|
||||
=========== ================ ===========================
|
||||
|
||||
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
|
||||
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
|
||||
The ``FETCH`` modifier is optional for simple atomic operations, and
|
||||
always set for the complex atomic operations. If the ``FETCH`` flag
|
||||
is set, then the operation also overwrites ``src`` with the value that
|
||||
was in memory before it was modified.
|
||||
|
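The effect of the ``FETCH`` modifier on a simple atomic operation can be illustrated with a single-threaded model. The sketch below is not atomic in any real sense: it only shows which values end up in memory and in ``src``. The function name and the dictionary used as memory are assumptions made for illustration.

```python
ADD, OR, AND, XOR = 0x00, 0x40, 0x50, 0xA0
FETCH = 0x01
XCHG = 0xE0 | FETCH
CMPXCHG = 0xF0 | FETCH

def atomic_rmw(memory, addr, src, imm, bits):
    """Apply a simple atomic operation; return the resulting value of src."""
    mask = (1 << bits) - 1
    old = memory[addr] & mask
    operand = src & mask
    op = imm & ~FETCH
    if op == ADD:
        new = old + operand
    elif op == OR:
        new = old | operand
    elif op == AND:
        new = old & operand
    elif op == XOR:
        new = old ^ operand
    else:
        raise ValueError("not a simple atomic operation")
    memory[addr] = new & mask
    # With FETCH, src is overwritten with the value previously in memory.
    return old if imm & FETCH else src

memory = {0x100: 5}
src_after_add = atomic_rmw(memory, 0x100, 3, ADD, 32)
src_after_fetch_add = atomic_rmw(memory, 0x100, 1, ADD | FETCH, 32)
```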
||||
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
|
||||
The ``XCHG`` operation atomically exchanges ``src`` with the value
|
||||
addressed by ``dst + offset``.
|
||||
|
||||
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
|
||||
The ``CMPXCHG`` operation atomically compares the value addressed by
|
||||
``dst + offset`` with ``R0``. If they match, the value addressed by
|
||||
``dst + offset`` is replaced with ``src``. In either case, the
|
||||
value that was at ``dst + offset`` before the operation is zero-extended
|
||||
@ -553,25 +649,25 @@ and loaded back to ``R0``.
|
||||
64-bit immediate instructions
|
||||
-----------------------------
|
||||
|
||||
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
|
||||
encoding defined in `Instruction encoding`_, and use the 'src' field of the
|
||||
Instructions with the ``IMM`` 'mode' modifier use the wide instruction
|
||||
encoding defined in `Instruction encoding`_, and use the 'src_reg' field of the
|
||||
basic instruction to hold an opcode subtype.
|
||||
|
||||
The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
|
||||
with opcode subtypes in the 'src' field, using new terms such as "map"
|
||||
The following table defines a set of ``{IMM, DW, LD}`` instructions
|
||||
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
|
||||
defined further below:
|
||||
|
||||
=========================  ======  ===  =========================================  ===========  ==============
opcode construction        opcode  src  pseudocode                                 imm type     dst type
=========================  ======  ===  =========================================  ===========  ==============
BPF_IMM | BPF_DW | BPF_LD  0x18    0x0  dst = imm64                                integer      integer
BPF_IMM | BPF_DW | BPF_LD  0x18    0x1  dst = map_by_fd(imm)                       map fd       map
BPF_IMM | BPF_DW | BPF_LD  0x18    0x2  dst = map_val(map_by_fd(imm)) + next_imm   map fd       data pointer
BPF_IMM | BPF_DW | BPF_LD  0x18    0x3  dst = var_addr(imm)                        variable id  data pointer
BPF_IMM | BPF_DW | BPF_LD  0x18    0x4  dst = code_addr(imm)                       integer      code pointer
BPF_IMM | BPF_DW | BPF_LD  0x18    0x5  dst = map_by_idx(imm)                      map index    map
BPF_IMM | BPF_DW | BPF_LD  0x18    0x6  dst = map_val(map_by_idx(imm)) + next_imm  map index    data pointer
=========================  ======  ===  =========================================  ===========  ==============
|
||||
=======  =========================================  ===========  ==============
src_reg  pseudocode                                 imm type     dst type
=======  =========================================  ===========  ==============
0x0      dst = (next_imm << 32) | imm               integer      integer
0x1      dst = map_by_fd(imm)                       map fd       map
0x2      dst = map_val(map_by_fd(imm)) + next_imm   map fd       data pointer
0x3      dst = var_addr(imm)                        variable id  data pointer
0x4      dst = code_addr(imm)                       integer      code pointer
0x5      dst = map_by_idx(imm)                      map index    map
0x6      dst = map_val(map_by_idx(imm)) + next_imm  map index    data pointer
=======  =========================================  ===========  ==============
|
||||
|
||||
where
|
||||
|
||||
@ -609,5 +705,9 @@ Legacy BPF Packet access instructions
|
||||
-------------------------------------
|
||||
|
||||
BPF previously introduced special instructions for access to packet data that were
|
||||
carried over from classic BPF. However, these instructions are
|
||||
deprecated and should no longer be used.
|
||||
carried over from classic BPF. These instructions used an instruction
|
||||
class of ``LD``, a size modifier of ``W``, ``H``, or ``B``, and a
|
||||
mode modifier of ``ABS`` or ``IND``. The 'dst_reg' and 'offset' fields were
|
||||
set to zero, and 'src_reg' was set to zero for ``ABS``. However, these
|
||||
instructions are deprecated and should no longer be used. All legacy packet
|
||||
access instructions belong to the "packet" conformance group.
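To make the field combination concrete, the sketch below builds and recognizes the opcode byte of a legacy packet access instruction from its class, size, and mode. The numeric values are the conventional BPF encodings (``LD`` = 0x00; ``W``, ``H``, ``B`` = 0x00, 0x08, 0x10; ``ABS``, ``IND`` = 0x20, 0x40) and are not given in the text above, so treat them as assumptions to check against the instruction set specification. The helper names are hypothetical.

```python
LD = 0x00                                   # instruction class (low 3 bits)
SIZE = {"W": 0x00, "H": 0x08, "B": 0x10}    # size modifier (mask 0x18)
MODE = {"ABS": 0x20, "IND": 0x40}           # mode modifier (mask 0xe0)


def legacy_packet_opcode(mode, size):
    """Combine mode, size and the LD class into one opcode byte."""
    return MODE[mode] | SIZE[size] | LD


def is_legacy_packet_access(opcode):
    """True if the opcode is LD class with W/H/B size and ABS/IND mode."""
    return (
        (opcode & 0x07) == LD
        and (opcode & 0x18) in SIZE.values()
        and (opcode & 0xE0) in MODE.values()
    )


abs_word = legacy_packet_opcode("ABS", "W")
ind_byte = legacy_packet_opcode("IND", "B")
```

A verifier or disassembler could use a check like ``is_legacy_packet_access()`` to flag these deprecated instructions; the 64-bit immediate load (0x18) is not matched because its mode is ``IMM`` and its size is ``DW``.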
|
||||
|
@ -562,7 +562,7 @@ works::
|
||||
* ``checkpoint[0].r1`` is marked as read;
|
||||
|
||||
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
|
||||
by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a
|
||||
by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
|
||||
read mark and all other registers and stack slots are marked as ``NOT_INIT``
|
||||
or ``STACK_INVALID``
|
||||
|
||||
|
@ -346,9 +346,9 @@ sys.stderr.write("Using %s theme\n" % html_theme)
|
||||
html_static_path = ['sphinx-static']
|
||||
|
||||
# If true, Docutils "smart quotes" will be used to convert quotes and dashes
|
||||
# to typographically correct entities. This will convert "--" to "—",
|
||||
# which is not always what we want, so disable it.
|
||||
smartquotes = False
|
||||
# to typographically correct entities. However, conversion of "--" to "—"
|
||||
# is not always what we want, so enable only quotes.
|
||||
smartquotes_action = 'q'
|
||||
|
||||
# Custom sidebar templates, maps document names to template names.
|
||||
# Note that the RTD theme ignores this
|
||||
@ -388,6 +388,12 @@ latex_elements = {
|
||||
verbatimhintsturnover=false,
|
||||
''',
|
||||
|
||||
#
|
||||
# Some of our authors are fond of deep nesting; tell latex to
|
||||
# cope.
|
||||
#
|
||||
'maxlistdepth': '10',
|
||||
|
||||
# For CJK One-half spacing, need to be in front of hyperref
|
||||
'extrapackages': r'\usepackage{setspace}',
|
||||
|
||||
|
@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
|
||||
item pointing to that function and queue that work item on a
|
||||
workqueue.
|
||||
|
||||
Special purpose threads, called worker threads, execute the functions
|
||||
off of the queue, one after the other. If no work is queued, the
|
||||
worker threads become idle. These worker threads are managed in so
|
||||
called worker-pools.
|
||||
A work item can be executed in either a thread or the BH (softirq) context.
|
||||
|
||||
For threaded workqueues, special purpose threads, called [k]workers, execute
|
||||
the functions off of the queue, one after the other. If no work is queued,
|
||||
the worker threads become idle. These worker threads are managed in
|
||||
worker-pools.
|
||||
|
||||
The cmwq design differentiates between the user-facing workqueues that
|
||||
subsystems and drivers queue work items on and the backend mechanism
|
||||
@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
|
||||
worker-pools to serve work items queued on unbound workqueues - the
|
||||
number of these backing pools is dynamic.
|
||||
|
||||
BH workqueues use the same framework. However, as there can only be one
|
||||
concurrent execution context, there's no need to worry about concurrency.
|
||||
Each per-CPU BH worker pool contains only one pseudo worker which represents
|
||||
the BH execution context. A BH workqueue can be considered a convenience
|
||||
interface to softirq.
|
||||
|
||||
Subsystems and drivers can create and queue work items through special
|
||||
workqueue API functions as they see fit. They can influence some
|
||||
aspects of the way the work items are executed by setting flags on the
|
||||
@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
|
||||
be queued on the worklist of either normal or highpri worker-pool that
|
||||
is associated to the CPU the issuer is running on.
|
||||
|
||||
For any worker pool implementation, managing the concurrency level
|
||||
For any thread pool implementation, managing the concurrency level
|
||||
(how many execution contexts are active) is an important issue. cmwq
|
||||
tries to keep the concurrency at a minimal but sufficient level.
|
||||
Minimal to save resources and sufficient in that the system is used at
|
||||
@ -164,6 +172,17 @@ resources, scheduled and executed.
|
||||
``flags``
|
||||
---------
|
||||
|
||||
``WQ_BH``
|
||||
BH workqueues can be considered a convenience interface to softirq. BH
|
||||
workqueues are always per-CPU and all BH work items are executed in the
|
||||
queueing CPU's softirq context in the queueing order.
|
||||
|
||||
All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
|
||||
only allowed additional flag.
|
||||
|
||||
BH work items cannot sleep. All other features such as delayed queueing,
|
||||
flushing and canceling are supported.
|
||||
|
||||
``WQ_UNBOUND``
|
||||
Work items queued to an unbound wq are served by the special
|
||||
worker-pools which host workers which are not bound to any
|
||||
@ -237,15 +256,11 @@ may queue at the same time. Unless there is a specific need for
|
||||
throttling the number of active work items, specifying '0' is
|
||||
recommended.
|
||||
|
||||
Some users depend on the strict execution ordering of ST wq. The
|
||||
combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to
|
||||
achieve this behavior. Work items on such wq were always queued to the
|
||||
unbound worker-pools and only one work item could be active at any given
|
||||
time thus achieving the same ordering property as ST wq.
|
||||
|
||||
In the current implementation the above configuration only guarantees
|
||||
ST behavior within a given NUMA node. Instead ``alloc_ordered_workqueue()`` should
|
||||
be used to achieve system-wide ST behavior.
|
||||
Some users depend on strict execution ordering where only one work item
|
||||
is in flight at any given time and the work items are processed in
|
||||
queueing order. While the combination of ``@max_active`` of 1 and
|
||||
``WQ_UNBOUND`` used to achieve this behavior, this is no longer the
|
||||
case. Use ``alloc_ordered_workqueue()`` instead.
|
||||
|
||||
|
||||
Example Execution Scenarios
|
||||
|
Some files were not shown because too many files have changed in this diff