acpi= [HW,ACPI,X86,ARM64]
Advanced Configuration and Power Interface
Format: { force | on | off | strict | noirq | rsdt |
copy_dsdt }
force -- enable ACPI if default was off
on -- enable ACPI but allow fallback to DT [arm64]
off -- disable ACPI if default was on
noirq -- do not use ACPI for IRQ routing
strict -- Be less tolerant of platforms that are not
strictly ACPI specification compliant.
rsdt -- prefer RSDT over (default) XSDT
copy_dsdt -- copy DSDT to memory
For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
are available.
See also Documentation/power/runtime_pm.rst, pci=noacpi
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
2: use 2nd APIC table, if available
1,0: use 1st APIC table
default: 0
acpi_backlight= [HW,ACPI]
{ vendor | video | native | none }
If set to vendor, prefer vendor-specific driver
(e.g. thinkpad_acpi, sony_acpi, etc.) instead
of the ACPI video.ko driver.
If set to video, use the ACPI video.ko driver.
If set to native, use the device's native backlight mode.
If set to none, disable the ACPI backlight interface.
acpi_force_32bit_fadt_addr
Force FADT to use 32-bit addresses rather than the
64-bit X_* addresses. Some firmware has broken 64-bit
addresses; this option forces ACPI to ignore them and
use the older legacy 32-bit addresses.
acpica_no_return_repair [HW, ACPI]
Disable the AML predefined validation mechanism.
This mechanism can repair the evaluation result to make
the return objects more ACPI specification compliant.
This option is useful for developers to identify the
root cause of an AML interpreter issue when the issue
has something to do with the repair mechanism.
acpi.debug_layer= [HW,ACPI,ACPI_DEBUG]
acpi.debug_level= [HW,ACPI,ACPI_DEBUG]
Format: <int>
CONFIG_ACPI_DEBUG must be enabled to produce any ACPI
debug output. Bits in debug_layer correspond to a
_COMPONENT in an ACPI source file, e.g.,
#define _COMPONENT ACPI_EVENTS
Bits in debug_level correspond to a level in
ACPI_DEBUG_PRINT statements, e.g.,
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
The debug_level mask defaults to "info". See
Documentation/firmware-guide/acpi/debug.rst for more information about
debug layers and levels.
Enable processor driver info messages:
acpi.debug_layer=0x20000000
Enable AML "Debug" output, i.e., stores to the Debug
object while interpreting AML:
acpi.debug_layer=0xffffffff acpi.debug_level=0x2
Enable all messages related to ACPI hardware:
acpi.debug_layer=0x2 acpi.debug_level=0xffffffff
Some values produce so much output that the system is
unusable. The "log_buf_len" parameter may be useful
if you need to capture more output.
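Both parameters are plain bit masks, so settings combine with bitwise OR. The sketch below uses only the example values quoted above; the macro and function names are hypothetical labels for illustration, not kernel identifiers. It checks which bits a mask enables and formats the resulting boot parameters.

```c
#include <stdio.h>

/* Values copied from the examples above; the names are hypothetical. */
#define LAYER_HARDWARE	0x00000002u	/* ACPI hardware messages */
#define LAYER_PROCESSOR	0x20000000u	/* processor driver messages */
#define LAYER_ALL	0xffffffffu
#define LEVEL_DEBUG_OBJ	0x00000002u	/* stores to the AML Debug object */
#define LEVEL_ALL	0xffffffffu

/* Nonzero if every bit in 'bits' is also set in 'mask'. */
int mask_enables(unsigned int mask, unsigned int bits)
{
	return (mask & bits) == bits;
}

/*
 * Write "acpi.debug_layer=0x... acpi.debug_level=0x..." into buf.
 * Returns snprintf()'s result: the length the full string needs.
 */
int format_acpi_debug(char *buf, size_t len, unsigned int layer,
		      unsigned int level)
{
	return snprintf(buf, len, "acpi.debug_layer=0x%x acpi.debug_level=0x%x",
			layer, level);
}
```

For example, LAYER_HARDWARE | LAYER_PROCESSOR (0x20000002) would select both layers in a single acpi.debug_layer= value.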
acpi_enforce_resources= [ACPI]
{ strict | lax | no }
Check for resource conflicts between native drivers
and ACPI OperationRegions (SystemIO and SystemMemory
only). IO ports and memory declared in ACPI might be
used by the ACPI subsystem in arbitrary AML code and
can interfere with legacy drivers.
strict (default): access to resources claimed by ACPI
is denied; legacy drivers trying to access reserved
resources will fail to bind to the devices using them.
lax: access to resources claimed by ACPI is allowed;
legacy drivers trying to access reserved resources
will bind successfully but a warning message is logged.
no: ACPI OperationRegions are not marked as reserved,
no further checks are performed.
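The three modes differ only in what happens when a legacy driver touches a region that an ACPI OperationRegion also covers. A minimal C model of the documented outcomes follows; the enum and function names are hypothetical, and treating an unrecognised value as "strict" is an assumption of this sketch rather than documented behaviour.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum enforce_mode { ENFORCE_STRICT, ENFORCE_LAX, ENFORCE_NO };

/* Map the parameter value to a mode; anything else falls back to the
 * documented default, strict (an assumption of this sketch). */
enum enforce_mode parse_enforce_mode(const char *val)
{
	if (val && strcmp(val, "lax") == 0)
		return ENFORCE_LAX;
	if (val && strcmp(val, "no") == 0)
		return ENFORCE_NO;
	return ENFORCE_STRICT;
}

/* Would a legacy driver bind to a region under 'mode'?
 * 'acpi_claimed' means an ACPI OperationRegion covers the region. */
bool legacy_driver_may_bind(enum enforce_mode mode, bool acpi_claimed)
{
	if (!acpi_claimed || mode == ENFORCE_NO)
		return true;	/* nothing reserved, or checks disabled */
	if (mode == ENFORCE_LAX) {
		fprintf(stderr, "warning: resource conflict with ACPI region\n");
		return true;	/* bind anyway, but log a warning */
	}
	return false;		/* strict: access denied */
}
```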
acpi_force_table_verification [HW,ACPI]
Enable table checksum verification during early stage.
By default, this is disabled due to x86 early mapping
size limitation.
acpi_irq_balance [HW,ACPI]
ACPI will balance active IRQs
default in APIC mode
acpi_irq_nobalance [HW,ACPI]
ACPI will not move active IRQs
default in PIC mode
acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA
Format: <irq>,<irq>...
acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for
use by PCI
Format: <irq>,<irq>...
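Both parameters take a comma-separated IRQ list. The following is a sketch of parsing that format into a bit mask; the function name and the limit of 32 IRQs are assumptions of this example, not the kernel's own parser.

```c
#include <stdlib.h>

/* Hypothetical parser for "<irq>,<irq>...": sets bit n of *mask for each
 * listed IRQ n below 32. Returns 0 on success, -1 on malformed input. */
int parse_irq_list(const char *s, unsigned long *mask)
{
	unsigned long m = 0;

	while (*s) {
		char *end;
		unsigned long irq = strtoul(s, &end, 10);

		if (end == s || irq >= 32)
			return -1;	/* not a number, or out of range */
		m |= 1UL << irq;
		if (*end == ',')
			end++;		/* skip the separator */
		else if (*end != '\0')
			return -1;	/* unexpected character */
		s = end;
	}
	*mask = m;
	return 0;
}
```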
acpi_mask_gpe= [HW,ACPI]
Due to the existence of _Lxx/_Exx, some GPEs triggered
by unsupported hardware/firmware features can result in
GPE floods that cannot be automatically disabled by
the GPE dispatcher.
This facility can be used to prevent such uncontrolled
GPE floods.
Format: <byte> or <bitmap-list>
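A minimal sketch of turning such a list into a GPE bitmap. It assumes GPE numbers fit in one byte (0-255, as the <byte> form suggests) and that the list uses decimal numbers and "a-b" ranges separated by commas; the names are hypothetical, and this is not the kernel's bitmap-list parser, which may accept more syntax.

```c
#include <stdlib.h>
#include <string.h>

#define GPE_MAX 256	/* assumption: a GPE number fits in one byte */

/* Set bits in map for a list such as "6,16-18". Returns 0 or -1. */
int parse_gpe_list(const char *s, unsigned char map[GPE_MAX / 8])
{
	memset(map, 0, GPE_MAX / 8);
	while (*s) {
		char *end;
		unsigned long lo = strtoul(s, &end, 10), hi;

		if (end == s)
			return -1;		/* expected a number */
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo > hi || hi >= GPE_MAX)
			return -1;		/* reversed or out of range */
		for (unsigned long g = lo; g <= hi; g++)
			map[g / 8] |= (unsigned char)(1u << (g % 8));
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return 0;
}

/* Nonzero if the given GPE is set in the bitmap. */
int gpe_is_masked(const unsigned char map[GPE_MAX / 8], unsigned int gpe)
{
	return gpe < GPE_MAX && ((map[gpe / 8] >> (gpe % 8)) & 1);
}
```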
acpi_no_auto_serialize [HW,ACPI]
Disable auto-serialization of AML methods
AML control methods that contain the opcodes to create
named objects will be marked as "Serialized" by the
auto-serialization feature.
This feature is enabled by default.
This option turns off the feature.
acpi_no_memhotplug [ACPI] Disable memory hotplug. Useful for kdump
kernels.
acpi_no_static_ssdt [HW,ACPI]
Disable installation of static SSDTs at early boot time
By default, SSDTs contained in the RSDT/XSDT will be
installed automatically and they will appear under
/sys/firmware/acpi/tables.
This option turns off this feature.
Note that specifying this option does not affect
dynamic table installation which will install SSDT
tables to /sys/firmware/acpi/tables/dynamic.
acpi_no_watchdog [HW,ACPI,WDT]
Ignore the ACPI-based watchdog interface (WDAT) and let
a native driver control the watchdog device instead.
acpi_rsdp= [ACPI,EFI,KEXEC]
Pass the RSDP address to the kernel, mostly used
on machines running EFI runtime service to boot the
second kernel for kdump.
acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS
Format: To spoof as Windows 98: ="Microsoft Windows"
acpi_rev_override [ACPI] Override the _REV object to return 5 (instead
of 2 which is mandated by ACPI 6) as the supported ACPI
specification revision (when using this switch, it may
be necessary to carry out a cold reboot _twice_ in a
row to make it take effect on the platform firmware).
acpi_osi= [HW,ACPI] Modify list of supported OS interface strings
acpi_osi="string1" # add string1
acpi_osi="!string2" # remove string2
acpi_osi=!* # remove all strings
acpi_osi=! # disable all built-in OS vendor
strings
acpi_osi=!! # enable all built-in OS vendor
strings
acpi_osi= # disable all strings
'acpi_osi=!' can be used in combination with single or
multiple 'acpi_osi="string1"' to support specific OS
vendor string(s). Note that such a command can only
affect the default state of the OS vendor strings; it
cannot affect the default state of the feature group
strings or the current state of the OS vendor strings,
so specifying it multiple times on the kernel command
line is meaningless. This command is useful when one
does not care about the state of the feature group
strings, which should be controlled by the OSPM.
Examples:
1. 'acpi_osi=! acpi_osi="Windows 2000"' is equivalent
to 'acpi_osi="Windows 2000" acpi_osi=!'; both
make '_OSI("Windows 2000")' TRUE.
'acpi_osi=' cannot be used in combination with other
'acpi_osi=' command lines; with it, the _OSI method
will not exist in the ACPI namespace. NOTE that such a
command can only affect the _OSI support state, so
specifying it multiple times on the kernel command
line is also meaningless.
Examples:
1. 'acpi_osi=' can make 'CondRefOf(_OSI, Local1)'
FALSE.
'acpi_osi=!*' can be used in combination with single or
multiple 'acpi_osi="string1"' to support specific
string(s). Note that such a command can affect the
current state of both the OS vendor strings and the
feature group strings, so specifying it multiple times
on the kernel command line is meaningful. However, it
may still not be able to affect the final state of a
string if there are quirks related to this string. This
command is useful when one wants to control the state of
the feature group strings to debug BIOS issues related
to the OSPM features.
Examples:
1. 'acpi_osi="Module Device" acpi_osi=!*' can make
'_OSI("Module Device")' FALSE.
2. 'acpi_osi=!* acpi_osi="Module Device"' can make
'_OSI("Module Device")' TRUE.
3. 'acpi_osi=! acpi_osi=!* acpi_osi="Windows 2000"' is
equivalent to
'acpi_osi=!* acpi_osi=! acpi_osi="Windows 2000"'
and
'acpi_osi=!* acpi_osi="Windows 2000" acpi_osi=!',
they all will make '_OSI("Windows 2000")' TRUE.
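The ordering behaviour in examples 1 and 2 for 'acpi_osi=!*' amounts to applying arguments left to right against a table of strings. Below is a toy C model of just that part; the struct and functions are hypothetical, not the kernel's data structures, and the model deliberately ignores 'acpi_osi=!', 'acpi_osi=!!', the empty form, and any quirks.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_OSI 8

/* Hypothetical table of _OSI strings and whether each is enabled. */
struct osi_table {
	const char *name[MAX_OSI];
	bool enabled[MAX_OSI];
	int count;
};

static int osi_find(struct osi_table *t, const char *name)
{
	for (int i = 0; i < t->count; i++)
		if (strcmp(t->name[i], name) == 0)
			return i;
	return -1;
}

/* Apply one argument: "!*" disables all, "!name" disables one,
 * "name" enables (adding it if needed). */
void osi_apply(struct osi_table *t, const char *arg)
{
	bool enable = true;
	int i;

	if (strcmp(arg, "!*") == 0) {
		for (i = 0; i < t->count; i++)
			t->enabled[i] = false;
		return;
	}
	if (arg[0] == '!') {
		enable = false;
		arg++;
	}
	i = osi_find(t, arg);
	if (i < 0) {
		if (!enable || t->count == MAX_OSI)
			return;		/* nothing to remove, or table full */
		i = t->count++;
		t->name[i] = arg;	/* caller keeps the string alive */
	}
	t->enabled[i] = enable;
}

/* What _OSI(name) would report in this model. */
bool osi_query(struct osi_table *t, const char *name)
{
	int i = osi_find(t, name);

	return i >= 0 && t->enabled[i];
}
```

Applying "Module Device" then "!*" leaves the string disabled, while the reverse order leaves it enabled, matching examples 1 and 2.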
acpi_pm_good [X86]
Override the pmtimer bug detection: force the kernel
to assume that this machine's pmtimer latches its value
and always returns good values.
acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
Format: { level | edge | high | low }
acpi_skip_timer_override [HW,ACPI]
Recognize and ignore IRQ0/pin2 Interrupt Override.
For broken nForce2 BIOS resulting in XT-PIC timer.
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_hwsig,
s4_nohwsig, old_ordering, nonvs,
sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
s4_hwsig causes the kernel to check the ACPI hardware
signature during resume from hibernation, and gracefully
refuse to resume if it has changed. This complies with
the ACPI specification but not with reality, since
Windows does not do this and many laptops do change it
on docking. So the default behaviour is to allow resume
and simply warn when the signature changes, unless the
s4_hwsig option is enabled.
s4_nohwsig prevents the ACPI hardware signature from being
used (or even warned about) during resume.
old_ordering causes the ACPI 1.0 ordering of the _PTS
control method, with respect to putting devices into
low power states, to be enforced (the ACPI 2.0 ordering
of _PTS is used by default).
nonvs prevents the kernel from saving/restoring the
ACPI NVS memory during suspend/hibernation and resume.
sci_force_enable causes the kernel to set SCI_EN directly
on resume from S1/S3 (which is against the ACPI spec,
but some broken systems don't work without it).
nobl causes the internal blacklist of systems known to
behave incorrectly in some ways with respect to system
suspend and resume to be ignored (use wisely).
acpi_use_timer_override [HW,ACPI]
Use timer override. For some broken Nvidia NF5 boards
that require a timer override but don't have HPET.
add_efi_memmap [EFI; X86] Include EFI memory map in
kernel's map of available physical RAM.
agp= [AGP]
{ off | try_unsupported }
off: disable AGP support
try_unsupported: try to drive unsupported chipsets
(may crash computer or cause data corruption)
ALSA [HW,ALSA]
See Documentation/sound/alsa-configuration.rst
alignment= [KNL,ARM]
Allow the default userspace alignment fault handler
behaviour to be specified. Bit 0 enables warnings,
bit 1 enables fixups, and bit 2 sends a segfault.
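The value is a small bit field. A sketch decoding it according to the three documented bits (the macro, struct and function names are hypothetical); for instance, alignment=3 would request both warnings and fixups.

```c
#include <stdbool.h>

/* Documented bit assignments; the names are hypothetical. */
#define ALIGN_WARN	(1u << 0)	/* bit 0: enable warnings */
#define ALIGN_FIXUP	(1u << 1)	/* bit 1: enable fixups */
#define ALIGN_SIGNAL	(1u << 2)	/* bit 2: send a segfault */

struct align_action {
	bool warn, fixup, segfault;
};

/* Split an alignment= value into the three independent actions. */
struct align_action decode_alignment(unsigned int val)
{
	struct align_action a = {
		.warn = (val & ALIGN_WARN) != 0,
		.fixup = (val & ALIGN_FIXUP) != 0,
		.segfault = (val & ALIGN_SIGNAL) != 0,
	};
	return a;
}
```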
align_va_addr= [X86-64]
Align virtual addresses by clearing slice [14:12] when
allocating a VMA at process creation time. This option
gives you up to 3% performance improvement on AMD F15h
machines (where it is enabled by default) for a
CPU-intensive style benchmark, and it can vary highly in
a microbenchmark depending on workload and compiler.
32: only for 32-bit processes
64: only for 64-bit processes
on: enable for both 32- and 64-bit processes
off: disable for both 32- and 64-bit processes
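"Clearing slice [14:12]" means zeroing address bits 12 through 14; for a 4 KiB page-aligned address the result is therefore a multiple of 32 KiB. The sketch shows only that bit operation plus a mapping of the four documented values; the names and flag encoding are hypothetical, and the kernel's actual mmap alignment code may do more than this.

```c
#include <stdint.h>
#include <string.h>

#define VA_SLICE_MASK	((uint64_t)0x7 << 12)	/* bits 14:12 */

/* Zero address bits 14..12, leaving every other bit untouched. */
uint64_t clear_va_slice(uint64_t addr)
{
	return addr & ~VA_SLICE_MASK;
}

/* Which process types the setting applies to (hypothetical encoding). */
#define VA_ALIGN_32BIT	1
#define VA_ALIGN_64BIT	2

/* Map an align_va_addr= value to the flags above; -1 if unrecognised. */
int parse_align_va_addr(const char *s)
{
	if (strcmp(s, "32") == 0)
		return VA_ALIGN_32BIT;
	if (strcmp(s, "64") == 0)
		return VA_ALIGN_64BIT;
	if (strcmp(s, "on") == 0)
		return VA_ALIGN_32BIT | VA_ALIGN_64BIT;
	if (strcmp(s, "off") == 0)
		return 0;
	return -1;
}
```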
alloc_snapshot [FTRACE]
Allocate the ftrace snapshot buffer on boot up when the
main buffer is allocated. This is handy if debugging
and you need to use tracing_snapshot() on boot up, and
do not want to use tracing_snapshot_alloc() as it needs
to be done where GFP_KERNEL allocations are allowed.
allow_mismatched_32bit_el0 [ARM64]
Allow execve() of 32-bit applications and setting of the
PER_LINUX32 personality on systems where only a strict
subset of the CPUs support 32-bit EL0. When this
parameter is present, the set of CPUs supporting 32-bit
EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
and hot-unplug operations may be restricted.
See Documentation/arm64/asymmetric-32bit.rst for more
information.
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
fullflush - Deprecated, equivalent to iommu.strict=1
off - do not initialize any AMD IOMMU found in
the system
force_isolation - Force device isolation for all
devices. The IOMMU driver is no longer
allowed to lift isolation requirements
as needed. This option does not
override iommu=pt.
force_enable - Force enable the IOMMU on platforms known
to be buggy with IOMMU enabled. Use this
option with care.
pgtbl_v1 - Use v1 page table for DMA-API (Default).
pgtbl_v2 - Use v2 page table for DMA-API.
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
for AMD IOMMU. With this option enabled, AMD IOMMU
driver will print ACPI tables for AMD IOMMU during
IOMMU initialization.
amd_iommu_intr= [HW,X86-64]
Specifies one of the following AMD IOMMU interrupt
remapping modes:
legacy - Use legacy interrupt remapping mode.
vapic - Use virtual APIC mode, which allows IOMMU
to inject interrupts directly into guest.
This mode requires kvm-amd.avic=1.
(Default when IOMMU HW support is present.)
amijoy.map= [HW,JOY] Amiga joystick support
Map of devices attached to JOY0DAT and JOY1DAT
Format: <a>,<b>
See also Documentation/input/joydev/joystick.rst
analog.map= [HW,JOY] Analog joystick and gamepad support
Specifies type or capabilities of an analog joystick
connected to one of 16 gameports
Format: <type1>,<type2>,..<type16>
apc= [HW,SPARC]
Power management functions (SPARCstation-4/5 + deriv.)
Format: noidle
Disable APC CPU standby support. SPARCstation-Fox does
not play well with APC CPU idle - disable it if you have
APC and your system crashes randomly.
apic= [APIC,X86] Advanced Programmable Interrupt Controller
Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
For X86-32, this can also be used to specify an APIC
driver name.
Format: apic=driver_name
Examples: apic=bigsmp
apic_extnmi= [APIC,X86] External NMI delivery setting
Format: { bsp (default) | all | none }
bsp: External NMI is delivered only to CPU 0
all: External NMIs are broadcast to all CPUs as a
backup of CPU 0
none: External NMI is masked for all CPUs. This is
useful so that a dump capture kernel won't be
shot down by NMI
autoconf= [IPV6]
See Documentation/networking/ipv6.rst.
apm= [APM] Advanced Power Management
See header of arch/x86/kernel/apm_32.c.
apparmor= [APPARMOR] Disable or enable AppArmor at boot time
Format: { "0" | "1" }
See security/apparmor/Kconfig help text
0 -- disable.
1 -- enable.
Default value is set via kernel config option.
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
Format: <io>,<irq>,<nodeID>
arm64.nobti [ARM64] Unconditionally disable Branch Target
Identification support
arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
support
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
arm64.nosve [ARM64] Unconditionally disable Scalable Vector
Extension support
arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
Extension support
ataflop= [HW,M68k]
atarimouse= [HW,MOUSE] Atari Mouse
atkbd.extra= [HW] Enable extra LEDs and keys on IBM RapidAccess,
EzKey and similar keyboards
atkbd.reset= [HW] Reset keyboard during initialization
atkbd.set= [HW] Select keyboard code set
Format: <int> (2 = AT (default), 3 = PS/2)
atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar
keyboards
atkbd.softraw= [HW] Choose between synthetic and real raw mode
Format: <bool> (0 = real, 1 = synthetic (default))
atkbd.softrepeat= [HW]
Use software keyboard repeat
audit= [KNL] Enable the audit sub-system
Format: { "0" | "1" | "off" | "on" }
0 | off - kernel audit is disabled and cannot be
enabled until the next reboot
unset - kernel audit is initialized but disabled and
will be fully enabled by the userspace auditd.
1 | on - kernel audit is initialized and partially
enabled, storing at most audit_backlog_limit
messages in RAM until it is fully enabled by the
userspace auditd.
Default: unset
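Leaving audit= off the command line is not the same as audit=0: the two lead to different states. A sketch of that three-way distinction; the enum and function are hypothetical, and reporting any other value as invalid is an assumption of this example.

```c
#include <string.h>

/* The documented boot-time states of audit=; names are hypothetical. */
enum audit_boot { AUDIT_UNSET, AUDIT_OFF, AUDIT_ON, AUDIT_INVALID };

/* 'val' is NULL when audit= is absent from the command line. */
enum audit_boot parse_audit(const char *val)
{
	if (val == NULL)
		return AUDIT_UNSET;	/* initialized; auditd enables it later */
	if (strcmp(val, "0") == 0 || strcmp(val, "off") == 0)
		return AUDIT_OFF;	/* cannot be enabled until reboot */
	if (strcmp(val, "1") == 0 || strcmp(val, "on") == 0)
		return AUDIT_ON;	/* keeps up to audit_backlog_limit messages */
	return AUDIT_INVALID;
}
```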
audit_backlog_limit= [KNL] Set the audit queue size limit.
Format: <int> (must be >=0)
Default: 64
bau= [X86_UV] Enable the BAU on SGI UV. The default
behavior is to disable the BAU (i.e. bau=0).
Format: { "0" | "1" }
0 - Disable the BAU.
1 - Enable the BAU.
unset - Disable the BAU.
baycom_epp= [HW,AX25]
Format: <io>,<mode>
baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem
Format: <io>,<mode>
See header of drivers/net/hamradio/baycom_par.c.
baycom_ser_fdx= [HW,AX25]
BayCom Serial Port AX.25 Modem (Full Duplex Mode)
Format: <io>,<irq>,<mode>[,<baud>]
See header of drivers/net/hamradio/baycom_ser_fdx.c.
baycom_ser_hdx= [HW,AX25]
BayCom Serial Port AX.25 Modem (Half Duplex Mode)
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
bgrt_disable [ACPI][X86]
Disable BGRT to avoid flickering OEM logo.
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
boot_delay= Milliseconds to delay each printk during boot.
Only works if CONFIG_BOOT_PRINTK_DELAY is enabled,
and you may also have to specify "lpj=". Boot_delay
values larger than 10 seconds (10000) are assumed
erroneous and ignored.
Format: integer
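Because the delay applies to every printk, its cost scales with the number of boot messages. A sketch of the documented 10000 ms limit plus that arithmetic; the function names are hypothetical, and modelling an ignored value as a 0 ms delay is an assumption.

```c
#define BOOT_DELAY_MAX_MS 10000u	/* larger values are ignored */

/* Per-printk delay actually applied for a boot_delay= value. */
unsigned int effective_boot_delay(unsigned int ms)
{
	return ms > BOOT_DELAY_MAX_MS ? 0 : ms;
}

/* Total time added to boot by n printk calls, in milliseconds. */
unsigned long long total_boot_delay_ms(unsigned int ms, unsigned int n)
{
	return (unsigned long long)effective_boot_delay(ms) * n;
}
```

With boot_delay=100 and, say, 500 boot messages, roughly 50 seconds would be added to boot time.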
bootconfig [KNL]
Extended command line options can be added to an initrd;
this option causes the kernel to look for them.
See Documentation/admin-guide/bootconfig.rst
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
bttv.pll= See Documentation/admin-guide/media/bttv.rst
bttv.tuner=
bulk_remove=off [PPC] This parameter disables the use of the pSeries
firmware feature for flushing multiple hpte entries
at a time.
c101= [NET] Moxa C101 synchronous serial card
cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection.
Sometimes CPU hardware bugs make CPUs report the cache
size incorrectly. The kernel will attempt workarounds
to fix known problems, but for some CPUs it is not
possible to determine what the correct size should be.
This option provides an override for these situations.
carrier_timeout=
[NET] Specifies the amount of time (in seconds) that
the kernel should wait for a network carrier. By default
it waits 120 seconds.
ca_keys= [KEYS] This parameter identifies specific key(s) on
the system trusted keyring to be used for certificate
trust validation.
Format: { id:<keyid> | builtin }
cca= [MIPS] Override the kernel pages' cache coherency
algorithm. Accepted values range from 0 to 7
inclusive. See arch/mips/include/asm/pgtable-bits.h
for platform specific values (SB1, Loongson3 and
others).
ccw_timeout_log [S390]
See Documentation/s390/common_io.rst for details.
cgroup_disable= [KNL] Disable a particular controller or optional feature
Format: {name of the controller(s) or feature(s) to disable}
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in
a single hierarchy
- foo isn't visible as an individually mountable
subsystem
- if foo is an optional feature then the feature is
disabled and corresponding cgroup files are not
created
{Currently only the "memory" controller deals with this
and cuts the overhead; others just disable the usage.
So only cgroup_disable=memory is actually worthwhile.}
Specifying "pressure" disables the per-cgroup pressure
stall information accounting feature.
cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
Format: { { controller | "all" | "named" }
[,{ controller | "all" | "named" }...] }
Like cgroup_disable, but only applies to cgroup v1;
the blacklisted controllers remain available in cgroup2.
"all" blacklists all controllers and "named" disables
named mounts. Specifying both "all" and "named" disables
all v1 hierarchies.
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
nobpf -- Disable BPF memory accounting.
checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- check protection applied by kernel (includes
any implied execute protection).
1 -- check protection requested by application.
Default value is set via a kernel config option.
Value can be changed at runtime via
/sys/fs/selinux/checkreqprot.
Setting checkreqprot to 1 is deprecated.
cio_ignore= [S390]
See Documentation/s390/common_io.rst for details.
clearcpuid=X[,X...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
numbers X. Note the Linux-specific bits are not necessarily
stable over kernel options, but the vendor-specific
ones should be.
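Feature numbers in arch/x86/include/asm/cpufeatures.h are written as (word*32 + bit), so X can be computed from, or split back into, a word index and a bit position. A sketch of that conversion and of formatting a clearcpuid= argument; the function names are hypothetical, and error handling is kept minimal.

```c
#include <stdio.h>

/* cpufeatures.h writes each feature as (word*32 + bit), bit in 0..31. */
unsigned int cpuid_feature_number(unsigned int word, unsigned int bit)
{
	return word * 32 + bit;
}

/* Inverse mapping: recover word and bit from a feature number X. */
void cpuid_feature_split(unsigned int x, unsigned int *word, unsigned int *bit)
{
	*word = x / 32;
	*bit = x % 32;
}

/* Write "clearcpuid=X1,X2,..." into buf; returns the length needed. */
int format_clearcpuid(char *buf, size_t len, const unsigned int *x, size_t n)
{
	int used = snprintf(buf, len, "clearcpuid=");

	for (size_t i = 0; i < n && used >= 0 && (size_t)used < len; i++)
		used += snprintf(buf + used, len - used, "%s%u",
				 i ? "," : "", x[i]);
	return used;
}
```

For example, word 1, bit 5 gives X = 37, and X = 100 splits into word 3, bit 4.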
2022-05-25 11:17:41 -07:00
X can also be a string as appearing in the flags: line
in /proc/cpuinfo which does not have the above
instability issue. However, not all features have names
in /proc/cpuinfo.
Note that using this option will taint your kernel.
2022-04-02 22:48:21 -07:00
Also note that user programs calling CPUID directly
or using the feature without checking anything
will still see it. This just prevents it from
being used by the kernel or shown in /proc/cpuinfo.
Also note the kernel might malfunction if you disable
some critical bits.
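You can check whether a feature has a /proc/cpuinfo name by looking for it as a whole word on the flags: line. The following is a minimal bash sketch. The helper name has_cpu_flag is hypothetical. It takes the flags line as an argument instead of reading /proc/cpuinfo itself, so it can be tried on any machine.

```shell
# Hypothetical helper: succeed if FLAG appears as a whole word on a
# /proc/cpuinfo "flags" line, i.e. the feature has a name that
# clearcpuid= could take instead of a bit number.
has_cpu_flag() {
    local flag=$1 flags_line=$2 f
    # Drop a leading "flags<whitespace>: " label, if present.
    flags_line=${flags_line#*: }
    for f in $flags_line; do
        [ "$f" = "$flag" ] && return 0
    done
    return 1
}

# Example (x86): has_cpu_flag avx2 "$(grep -m1 '^flags' /proc/cpuinfo)"
```

A feature that is not found this way can only be cleared by its bit number.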
2013-04-27 14:10:18 -07:00
clk_ignore_unused
[CLK]
2014-09-30 14:24:38 -07:00
Prevents the clock framework from automatically gating
clocks that have not been explicitly enabled by a Linux
device driver but are enabled in hardware at reset or
by the bootloader/firmware. Note that this does not
force such clocks to be always-on nor does it reserve
those clocks in any way. This parameter is useful for
debug and development, but should not be needed on a
platform with proper driver support. For more
2018-05-07 06:35:44 -03:00
information, see Documentation/driver-api/clk.rst.
2008-01-26 14:10:36 +01:00
2007-07-31 00:37:59 -07:00
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
2006-06-26 00:25:05 -07:00
[Deprecated]
2006-10-03 22:45:33 +02:00
Forces specified clocksource (if available) to be used
2006-06-26 00:25:05 -07:00
when calculating gettimeofday(). If specified
2006-10-03 22:45:33 +02:00
clocksource is not available, it defaults to PIT.
2005-04-16 15:20:36 -07:00
Format: { pit | tsc | cyclone | pmtmr }
2010-07-13 17:56:20 -07:00
clocksource= Override the default clocksource
2007-05-23 13:58:16 -07:00
Format: <string>
Override the default clocksource and use the clocksource
with the name specified.
Some clocksource names to choose from, depending on
the platform:
[all] jiffies (this is the base, fallback clocksource)
[ACPI] acpi_pm
[ARM] imx_timer1,OSTS,netx_timer,mpu_timer2,
pxa_timer,timer3,32k_counter,timer0_1
2010-08-23 14:49:11 -07:00
[X86-32] pit,hpet,tsc;
2007-05-23 13:58:16 -07:00
scx200_hrt on Geode; cyclone on IBM x440
[MIPS] MIPS
[PARISC] cr16
[S390] tod
[SH] SuperH
[SPARC64] tick
[X86-64] hpet,tsc
2016-06-27 17:30:13 +01:00
clocksource.arm_arch_timer.evtstrm=
[ARM,ARM64]
Format: <bool>
Enable/disable the eventstream feature of the ARM
architected timer so that code using WFE-based polling
loops can be debugged more effectively on production
systems.
clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can delay
interrupts-disabled regions of code ranging from SMI handlers to vCPU
preemption. It would be good to have some indication as to why the clock
was marked unstable.
Therefore, re-read the watchdog clock on either side of the read from the
clock under test. If the watchdog clock shows an excessive time delta
between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that is, up to
four reads, one initial and up to three retries. If more than one retry
was required, a message is printed on the console (the occasional single
retry is expected behavior, especially in guest OSes). If the maximum
number of retries is exceeded, the clock under test will be marked
unstable. However, the probability of this happening due to various sorts
of delays is quite small. In addition, the reason (clock-read delays) for
the unstable marking will be apparent.
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
2021-05-27 12:01:19 -07:00
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
2021-11-18 14:14:37 -05:00
unstable. Defaults to two retries, that is,
three attempts to read the clock under test.
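The retry policy described in the commit message above can be modelled as a small bash function. This is a simplified sketch, not the kernel's implementation:

- The function name simulate_cswd_read and its arguments are hypothetical.
- Each "delta" stands for the watchdog-clock time measured across one read of the clock under test.

With max_cswd_read_retries=2, up to three reads are made before the clock is treated as unstable.

```shell
# Hypothetical, simplified model of the clocksource watchdog read retry
# policy; it is not the kernel's implementation.
# Arguments: MAX_RETRIES THRESHOLD DELTA...
#   Each DELTA is the watchdog time (ns) measured across one read of the
#   clock under test; a DELTA above THRESHOLD means that read was delayed.
# Prints "ok after N retries", or "unstable" once MAX_RETRIES retries
# (MAX_RETRIES + 1 reads) were all delayed. Running out of DELTA values
# is also reported as "unstable", so supply one DELTA per read.
simulate_cswd_read() {
    local max_retries=$1 threshold=$2 retries=0 delta
    shift 2
    for delta in "$@"; do
        if (( delta <= threshold )); then
            echo "ok after $retries retries"
            return 0
        fi
        (( retries >= max_retries )) && break
        retries=$(( retries + 1 ))
    done
    echo "unstable"
    return 1
}
```

For example, `simulate_cswd_read 2 100 500 500 50` reports success after two retries, while a fourth delayed read is never attempted.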
2021-05-27 12:01:19 -07:00
2021-05-27 12:01:21 -07:00
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
marked with CLOCK_SOURCE_VERIFY_PERCPU that
are marked unstable due to excessive skew.
A negative value says to check all CPUs, while
zero says not to check any. Values larger than
nr_cpu_ids are silently truncated to nr_cpu_ids.
The actual CPUs are chosen randomly, with
no replacement if the same CPU is chosen twice.
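The rules for the requested count can be summarized as a short bash helper. It is a hypothetical sketch of the documented rules only, and the name effective_verify_n_cpus is made up. It says nothing about which CPUs are then picked at random.

```shell
# Hypothetical helper applying the documented clocksource.verify_n_cpus=
# rules: a negative value means all CPUs, zero means none, and values
# above nr_cpu_ids are truncated to nr_cpu_ids.
# Prints how many CPUs would be checked.
effective_verify_n_cpus() {
    local requested=$1 nr_cpu_ids=$2
    if (( requested < 0 || requested > nr_cpu_ids )); then
        echo "$nr_cpu_ids"
    else
        echo "$requested"
    fi
}
```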
clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.
Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel. Very large systems
that boot slowly may need to increase the value of this module parameter.
This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications. This module first verifies that the ->uncertainty_margin
fields of the clocksource structures are set sanely. It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed. Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.
This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.
This facility is intended for diagnostic use only, and should be avoided
on production systems.
Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
2021-05-27 12:01:23 -07:00
clocksource-wdtest.holdoff= [KNL]
Set the time in seconds that the clocksource
watchdog test waits before commencing its tests.
Defaults to zero when built as a module and to
10 seconds when built into the kernel.
2014-06-04 16:06:54 -07:00
cma=nn[MG]@[start[MG][-end[MG]]]
2020-09-18 15:05:58 +08:00
[KNL,CMA]
2014-06-04 16:06:54 -07:00
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
2014-10-09 15:29:41 -07:00
memory allocations. A value of 0 disables CMA
altogether. For more information, see
2020-09-11 10:56:52 +02:00
kernel/dma/contiguous.c
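The bash helpers below show how a cma= value breaks down by splitting size[@start[-end]] into byte counts. Some assumptions apply:

- The helper names are hypothetical.
- Suffixes are assumed to be binary (K = 2^10, M = 2^20, G = 2^30).
- The helpers also accept K, which the format line above does not list.

```shell
# Hypothetical helper: nn[KMG] -> bytes, using binary (1024-based) suffixes.
size_to_bytes() {
    local n=${1%[KkMmGg]}
    case $1 in
        *[Kk]) echo $(( n << 10 )) ;;
        *[Mm]) echo $(( n << 20 )) ;;
        *[Gg]) echo $(( n << 30 )) ;;
        *)     echo $(( n )) ;;
    esac
}

# Hypothetical helper: split a cma= value of the form size[@start[-end]]
# into byte counts; fields that are absent print as empty.
parse_cma() {
    local spec=$1 size range start="" end="" start_b="" end_b=""
    size=${spec%%@*}
    if [[ $spec == *@* ]]; then
        range=${spec#*@}
        start=${range%%-*}
        [[ $range == *-* ]] && end=${range#*-}
    fi
    [ -n "$start" ] && start_b=$(size_to_bytes "$start")
    [ -n "$end" ] && end_b=$(size_to_bytes "$end")
    echo "size=$(size_to_bytes "$size") start=$start_b end=$end_b"
}
```

For instance, `parse_cma 256M@0-4G` describes a 256 MiB area to be placed within the first 4 GiB of physical memory.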
2011-12-29 13:09:51 +01:00
2020-08-24 11:03:07 +12:00
cma_pernuma=nn[MG]
2021-01-24 20:32:02 -08:00
[ARM64,KNL,CMA]
2020-08-24 11:03:07 +12:00
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. If this option is not
2023-01-29 15:10:45 -08:00
specified, the default value is 0.
2020-08-24 11:03:07 +12:00
With per-numa CMA enabled, DMA users on node nid will
first try to allocate a buffer from the pernuma area
located in node nid; if the allocation fails, they
fall back to the global default memory area.
2011-12-29 13:09:51 +01:00
2009-04-15 05:55:32 +00:00
cmo_free_hint= [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed. This is used in CMO environments
to determine OS memory pressure for page stealing by
a hypervisor.
Default: yes
2011-12-29 13:09:51 +01:00
coherent_pool=nn[KMG] [ARM,KNL]
Sets the size of the memory pool for coherent, atomic DMA
2012-07-30 09:11:33 +02:00
allocations, by default set to 256K.
2011-12-29 13:09:51 +01:00
2005-04-16 15:20:36 -07:00
com20020= [HW,NET] ARCnet - COM20020 chipset
2005-10-23 12:57:11 -07:00
Format:
<io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]]
2005-04-16 15:20:36 -07:00
com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers)
Format: <io>[,<irq>]
2005-10-23 12:57:11 -07:00
com90xx= [HW,NET]
ARCnet - COM90xx chipset (memory-mapped buffers)
2005-04-16 15:20:36 -07:00
Format: <io>[,<irq>[,<memstart>]]
condev= [HW,S390] console device
conmode=
2005-10-23 12:57:11 -07:00
s390/con3215: Drop console data printout when buffer full
Using z/VM the 3270 terminal emulator also emulates an IBM 3215 console
which outputs line by line. When the screen is full, the console enters
the MORE... state and waits for the operator to confirm the data
on the screen by pressing a clear key. If this does not happen in the
default time frame (currently 50 seconds) the console enters the HOLDING
state.
It then waits another time frame (currently 10 seconds) before the output
continues on the next screen. When the operator presses the clear key
during these wait times, the output continues immediately.
This may lead to a very long boot time when the console
has to print many messages. The system may also hang because of the
console's limited buffer space, as the system waits for the console
output to drain and finally to finish. This problem can only occur
when a terminal emulator is actually connected to the 3215 console
driver. If not, z/VM simply drops console output.
Remedy this rare situation and add a kernel boot command line parameter
con3215_drop. It can be set to 0 (do not drop) or 1 (do drop), which is
the default. This instructs the kernel to drop console data when the
console buffer is full. This speeds up the boot time considerably and
also no longer hangs the system.
Add a sysfs attribute file for console IBM 3215 named con_drop.
This allows for changing the behavior after the boot, for example when
during interactive debugging a panic/crash is expected.
Here is a test of the new behavior using the following test program:
#!/bin/bash
declare -i cnt=4
mode=$(cat /sys/bus/ccw/drivers/3215/con_drop)
[ "$mode" = yes ] && cnt=25
echo "cons_drop $(cat /sys/bus/ccw/drivers/3215/con_drop)"
echo "vmcp term more 5 2"
vmcp term more 5 2
echo "Run $cnt iterations of "'echo t > /proc/sysrq-trigger'
for i in $(seq $cnt)
do
echo "$i. command 'echo t > /proc/sysrq-trigger' at $(date +%F,%T)"
echo t > /proc/sysrq-trigger
sleep 1
done
echo "droptest done" > /dev/kmsg
#
Output with sysfs attribute con_drop set to 1:
# ./droptest.sh
cons_drop yes
vmcp term more 5 2
Run 25 iterations of echo t > /proc/sysrq-trigger
1. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:09
2. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:10
3. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:11
4. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:12
5. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:13
6. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:14
7. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:15
8. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:16
9. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:17
10. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:18
11. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:19
12. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:20
13. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:21
14. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:22
15. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:23
16. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:24
17. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:25
18. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:26
19. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:27
20. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:28
21. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:29
22. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:30
23. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:31
24. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:32
25. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:33
#
There are no hangs anymore.
Output with sysfs attribute con_drop set to 0 and identical
setting for z/VM console 'term more 5 2'. Sometimes hitting the
clear key at the x3270 console to progress output.
# ./droptest.sh
cons_drop no
vmcp term more 5 2
Run 4 iterations of echo t > /proc/sysrq-trigger
1. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:20:58
2. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:24:32
3. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:28:04
4. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:31:37
#
Details:
Enable function raw3215_write() to handle tab expansion and newlines
and feed it with input not larger than the console buffer of 65536
bytes. Function raw3215_putchar() just forwards its character for
output to raw3215_write().
This moves tab to blank conversion to one function raw3215_write()
which also does call raw3215_make_room() to wait for enough free
buffer space.
Function handle_write() loops over all its input and segments input
into chunks of console buffer size (should the input be larger).
Rework tab expansion handling logic to avoid code duplication.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-09-20 14:26:16 +02:00
con3215_drop= [S390] 3215 console drop mode.
Format: y|n|Y|N|1|0
When set to true, drop data on the 3215 console when
the console buffer is full. In this case the
operator using a 3270 terminal emulator (for example
x3270) does not have to press the clear key for the
console output to advance and the kernel to continue.
This leads to a much faster boot time when a 3270
terminal emulator is active. If no 3270 terminal
emulator is used, this parameter has no effect.
2005-04-16 15:20:36 -07:00
console= [KNL] Output console device and options.
tty<n> Use the virtual console device <n>.
ttyS<n>[,options]
2006-03-25 03:08:17 -08:00
ttyUSB0[,options]
2005-04-16 15:20:36 -07:00
Use the specified serial port. The options are of
2006-03-25 03:08:17 -08:00
the form "bbbbpnf", where "bbbb" is the baud rate,
"p" is parity ("n", "o", or "e"), "n" is number of
bits, and "f" is flow control ("r" for RTS or
omit it). Default is "9600n8".
2016-11-03 12:10:10 +02:00
See Documentation/admin-guide/serial-console.rst for more
2006-03-25 03:08:17 -08:00
information. See
2020-04-30 18:04:02 +02:00
Documentation/networking/netconsole.rst for an
2006-03-25 03:08:17 -08:00
alternative.
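The "bbbbpnf" option string described above can be taken apart with a bash regular expression. This is a minimal sketch:

- The function parse_serial_opts is hypothetical.
- It assumes that an omitted parity or bit count takes the corresponding value from the documented default "9600n8".

```shell
# Hypothetical parser for the console= serial option string "bbbbpnf",
# e.g. 115200n8r: baud rate, parity (n/o/e), data bits, optional "r" for
# RTS flow control. Omitted parity/bits are assumed to default to "n8".
parse_serial_opts() {
    local opts=${1:-9600n8}
    local re='^([0-9]+)([noe]?)([0-9]?)(r?)$'
    if [[ ! $opts =~ $re ]]; then
        echo "invalid options: $opts" >&2
        return 1
    fi
    local baud=${BASH_REMATCH[1]}
    local parity=${BASH_REMATCH[2]:-n}
    local bits=${BASH_REMATCH[3]:-8}
    local flow=none
    [ -n "${BASH_REMATCH[4]}" ] && flow=rts
    echo "baud=$baud parity=$parity bits=$bits flow=$flow"
}
```

So console=ttyS0,115200n8r would request 115200 baud, no parity, 8 data bits and RTS flow control.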
2005-04-16 15:20:36 -07:00
serial: convert early_uart to earlycon for 8250
Because SERIAL_PORT_DFNS has been removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h, the serial8250_ports need to be probed late in
the serial initialization stage. The console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup chain will return -ENODEV, and console
ttyS0 cannot be enabled at that time. It has to wait until uart_add_one_port in
drivers/serial/serial_core.c calls register_console to get console ttyS0,
which is too late.
Make early_uart use early_param, so the UART console can be used earlier. Make
it a bootconsole with the CON_BOOT flag, so it can use the console handover
feature and will switch to the corresponding normal serial console automatically.
The new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
It will print at a very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
Later, for the console, it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-15 23:37:59 -07:00
uart[8250],io,<addr>[,options]
uart[8250],mmio,<addr>[,options]
2015-10-28 12:46:05 +09:00
uart[8250],mmio16,<addr>[,options]
earlycon: 8250: Document kernel command line options
Document the expected behavior of kernel command lines of the forms:
console=uart[8250],io|mmio|mmio32,<addr>[,options]
console=uart[8250],<addr>[,options]
and
earlycon=uart[8250],io|mmio|mmio32,<addr>[,options]
earlycon=uart[8250],<addr>[,options]
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-06 10:52:39 -04:00
uart[8250],mmio32,<addr>[,options]
uart[8250],0x<addr>[,options]
2005-04-16 15:20:36 -07:00
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address,
2015-04-06 10:52:39 -04:00
switching to the matching ttyS device later.
MMIO inter-register address stride is either 8-bit
2015-10-28 12:46:05 +09:00
(mmio), 16-bit (mmio16), or 32-bit (mmio32).
If none of [io|mmio|mmio16|mmio32] is given, <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified in
the same format described for ttyS above; if unspecified,
2015-04-06 10:52:39 -04:00
the h/w is not re-initialized.
2013-02-25 15:54:09 -05:00
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
2005-04-16 15:20:36 -07:00
2022-02-16 12:37:45 -08:00
{ null | "" }
Use to disable console output, i.e., to have kernel
console messages discarded.
This must be the only console= parameter used on the
kernel command line.
2018-04-18 20:51:39 +02:00
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
2008-04-30 00:54:51 -07:00
console=brl,ttyS0
For now, only VisioBraille is supported.
printk: add console_msg_format command line option
0day and kernelCI automatically parse kernel log - basically some sort
of grepping using the pre-defined text patterns - in order to detect
and report regressions/errors. There are several sources they get the
kernel logs from:
a) dmesg or /proc/kmsg
This is the preferred way, because `dmesg --raw' (see later Note)
and /proc/kmsg output contain facility and log level, which greatly
simplifies grepping for EMERG/ALERT/CRIT/ERR messages.
b) serial consoles
This option is harder to maintain, because serial console messages
don't contain facility and log level.
This patch introduces a `console_msg_format=' command line option,
to switch between different message formatting on serial consoles.
For the time being we have just two options - default and syslog.
The "default" option just keeps the existing format, while the
"syslog" option makes serial console messages appear in syslog
format [syslog() syscall], matching the `dmesg -S --raw' and
`cat /proc/kmsg' output formats:
- facility and log level
- time stamp (depends on printk_time/PRINTK_TIME)
- message
<%u>[time stamp] text\n
NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually
"dmesg -S --raw" that always prints messages in syslog format [per
Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog
format sometimes. console_msg_format=syslog enables syslog format,
thus in documentation we mention "dmesg -S --raw", not "dmesg --raw".
Per Kevin Hilman:
: Right now we can get this info from a "dmesg --raw" after bootup,
: but it would be really nice in certain automation frameworks to
: have a kernel command-line option to enable printing of loglevels
: in default boot log.
:
: This is especially useful when ingesting kernel logs into advanced
: search/analytics frameworks (I'm playing with and ELK stack: Elastic
: Search, Logstash, Kibana).
:
: The other important reason for having this on the command line is that
: for testing linux-next (and other bleeding edge developer branches),
: it's common that we never make it to userspace, so can't even run
: "dmesg --raw" (or equivalent.) So we really want this on the primary
: boot (serial) console.
Per Fengguang Wu, 0day scripts should quickly benefit from that
feature, because they will be able to switch to a more reliable
parsing, based on messages' facility and log levels [1]:
`#{grep} -a -E -e '^<[0123]>' -e '^kern :(err |crit |alert |emerg )'
instead of doing text pattern matching
`#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} |
grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern |
grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist`
[1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb
Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-12-21 14:41:49 +09:00
console_msg_format=
[KNL] Change console messages format
default
By default we print messages on consoles in
"[time stamp] text\n" format (time stamp may not be
printed, depending on CONFIG_PRINTK_TIME or
`printk_time' param).
syslog
Switch to syslog format: "<%u>[time stamp] text\n"
IOW, each message will have a facility and loglevel
prefix. The format is similar to the one used by the syslog()
syscall, or to executing "dmesg -S --raw" or to reading
from /proc/kmsg.
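In the syslog format, the number inside "<%u>" combines facility and log level. This sketch assumes the usual syslog convention: value = facility * 8 + level. Under that convention, kernel messages with facility 0 carry just their level. The bash helper below, decode_syslog_prefix, is hypothetical and splits the number back out.

```shell
# Hypothetical helper: decode the "<N>" prefix of a syslog-format console
# line, assuming the syslog convention N = facility * 8 + level.
decode_syslog_prefix() {
    local re='^<([0-9]+)>'
    if [[ ! $1 =~ $re ]]; then
        echo "no <N> prefix" >&2
        return 1
    fi
    local pri=$(( 10#${BASH_REMATCH[1]} ))
    echo "facility=$(( pri >> 3 )) level=$(( pri & 7 ))"
}
```

For a console log that holds only kernel messages, a filter such as grep -a -E '^<[0-3]>' then keeps the EMERG through ERR lines. This is the kind of filtering the 0day scripts quoted in the commit message above aim for.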
2009-06-16 15:33:52 -07:00
consoleblank= [KNL] The console blank (screen saver) timeout in
2017-09-18 22:21:25 -07:00
seconds. A value of 0 disables the blank timer.
2018-04-18 20:51:39 +02:00
Defaults to 0.
2009-06-16 15:33:52 -07:00
2009-01-06 14:42:47 -08:00
coredump_filter=
[KNL] Change the default value for
/proc/<pid>/coredump_filter.
2020-04-02 19:26:14 +02:00
See also Documentation/filesystems/proc.rst.
2009-01-06 14:42:47 -08:00
2017-06-05 14:15:12 -06:00
coresight_cpu_debug.enable
[ARM,ARM64]
Format: <bool>
Enable/disable the CPU sampling based debugging.
0: default value, disable debugging
1: enable debugging at boot time
2022-04-02 22:48:22 -07:00
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
1. Resume from suspend/hibernate depends on CPU0.
Suspend/hibernate will fail if CPU0 is offline and you
need to online CPU0 before suspend/hibernate.
2. PIC interrupts also depend on CPU0. CPU0 can't be
removed if a PIC interrupt is detected.
Poweroff/reboot reportedly may depend on CPU0 on some
machines, although no such issues have been seen so far
with CPU0 offline on a few tested machines.
If the dependencies are under your control, you can
turn on cpu0_hotplug.
2011-04-01 18:13:10 -04:00
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
2018-12-05 23:45:34 +01:00
cpuidle.governor=
[CPU_IDLE] Name of the cpuidle governor to use.
2017-02-28 16:44:16 -05:00
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system
cpufreq: Specify default governor on command line
Currently, the only way to specify the default CPUfreq governor is
via Kconfig options, which suits users who can build the kernel
themselves perfectly.
However, for those who use a distro-like kernel (such as Android,
with the Generic Kernel Image project), the only way to use a
non-default governor is to boot to userspace, and to then switch
using the sysfs interface. Being able to specify the default governor
on the command line, like is the case for cpuidle, would allow those
users to specify their governor of choice earlier on, and to simplify
the userspace boot procedure slightly.
To support this use-case, add a kernel command line parameter
allowing the default governor for CPUfreq to be specified, which
takes precedence over the built-in default.
This implementation has one notable limitation: the default governor
must be registered before the driver. This is solved for builtin
governors and drivers using appropriate *_initcall() functions. And
in the modular case, this must be reflected as a constraint on the
module loading order.
Signed-off-by: Quentin Perret <qperret@google.com>
[ Viresh: Converted 'default_governor' to a string and parsing it only
at initcall level, and several updates to
cpufreq_init_policy(). ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-29 13:55:00 +05:30
cpufreq.default_governor=
[CPU_FREQ] Name of the default cpufreq governor or
policy to use. This governor must be registered in the
kernel before the cpufreq driver probes.
2015-05-11 17:27:09 -04:00
cpu_init_udelay=N
[X86] Delay for N microsec between assert and de-assert
of APIC INIT to start processors. This delay occurs
on every CPU online, such as boot, and resume from suspend.
Default: 10000
2022-04-02 22:48:22 -07:00
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This is only for users who doubt that kdump always
succeeds in every situation.
Note that this also increases risks of kdump failure,
because some panic notifiers can make the crashed
kernel more unstable.
2005-04-16 15:20:36 -07:00
2011-02-20 20:08:35 -08:00
crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel'
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
is selected automatically.
[KNL, X86-64, ARM64] Select a region under 4G first, and
fall back to reserving a region above 4G when '@offset'
hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
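The simple form can be read as a number with an optional K/M/G suffix, optionally followed by '@' and a second number of the same kind. The C sketch below assumes binary multiples (K = 2^10 and so on). parse_size_kmg() and parse_crashkernel_simple_form() are hypothetical names. The kernel has its own memparse() and crashkernel parsers, which this does not reproduce, and the ',high'/',low' suffixes described below are not handled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number with an optional K/M/G suffix
 * (binary multiples) and store the position after it in *end. */
static uint64_t parse_size_kmg(const char *s, const char **end)
{
	char *p;
	uint64_t val = strtoull(s, &p, 0);

	switch (*p) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return val;
}

/* Hypothetical parser for "size[KMG][@offset[KMG]]". Returns false for
 * malformed input; *has_offset is false when a suitable offset would have
 * to be selected automatically. */
static bool parse_crashkernel_simple_form(const char *arg, uint64_t *size,
					  uint64_t *offset, bool *has_offset)
{
	const char *p;

	*size = parse_size_kmg(arg, &p);
	if (p == arg || *size == 0)
		return false;		/* no number, or nothing to reserve */
	*offset = 0;
	*has_offset = false;
	if (*p == '@') {
		const char *q = p + 1;

		*offset = parse_size_kmg(q, &p);
		if (p == q)
			return false;	/* '@' must be followed by a number */
		*has_offset = true;
	}
	return *p == '\0';		/* reject trailing characters */
}
```

For example, "128M@16M" yields a 128MiB size at a 16MiB offset. "256M" yields a 256MiB size and leaves the placement to the kernel.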
crashkernel=range1:size1[,range2:size2,...][@offset]
[KNL] Same as above, but depends on the memory
in the running system. The syntax of range is
start-[end] where start and end are both
a memory unit (amount[KMG]). See also
Documentation/admin-guide/kdump/kdump.rst for an example.
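The C sketch below shows how the range table selects a size. It assumes, as the kdump documentation describes, that each range's start is inclusive and its end exclusive, and that an omitted end means no upper bound. crashkernel_size_for_ram() and parse_kmg() are hypothetical helpers, and the trailing '@offset' is not handled here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: number with an optional K/M/G suffix (binary). */
static uint64_t parse_kmg(const char *s, const char **end)
{
	char *p;
	uint64_t val = strtoull(s, &p, 0);

	switch (*p) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return val;
}

/* Hypothetical: return the size chosen for a machine with system_ram bytes
 * from "start-[end]:size[,start-[end]:size...]", or 0 if no range matches
 * or the string is malformed. */
static uint64_t crashkernel_size_for_ram(const char *arg, uint64_t system_ram)
{
	const char *p = arg;

	for (;;) {
		const char *q;
		uint64_t start, end = UINT64_MAX, size;

		start = parse_kmg(p, &q);
		if (q == p || *q != '-')
			return 0;		/* expected "start-" */
		p = q + 1;
		if (*p != ':') {		/* end is optional */
			end = parse_kmg(p, &q);
			if (q == p)
				return 0;
			p = q;
		}
		if (*p != ':')
			return 0;		/* expected ":size" */
		size = parse_kmg(p + 1, &q);
		if (q == p + 1)
			return 0;
		p = q;
		if (system_ram >= start && system_ram < end)
			return size;		/* first matching range wins */
		if (*p != ',')
			return 0;		/* end of list (or "@offset"): no match */
		p++;
	}
}
```

Take "512M-2G:64M,2G-:128M" as an example. A machine with 256M of RAM gets no reservation, one with 1G gets 64M, and one with 2G or more gets 128M.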
crashkernel=size[KMG],high
[KNL, X86-64, ARM64] The range could be above 4G. Allow
the kernel to allocate the physical memory region from
the top, so it could be above 4G if the system has more
than 4G of RAM installed. Otherwise, the memory region
will be allocated below 4G, if available.
It will be ignored if crashkernel=X is specified.
crashkernel=size[KMG],low
[KNL, X86-64, ARM64] Range under 4G. When crashkernel=X,high
is passed, the kernel may allocate the physical memory
region above 4G. That can cause the second kernel to
crash on systems that require some amount of low memory:
e.g. swiotlb requires at least 64M+32K of low memory, and
enough extra low memory is also needed to make sure DMA
buffers for 32-bit devices won't run out. The kernel
will try to allocate a default amount of memory below
4G automatically. The default size is platform dependent.
--> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
--> arm64: 128MiB
This option lets the user specify their own low range
under 4G for the second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or the memory reserved is below 4G.
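The x86 default from the formula above can be worked through in a few lines of C. This is a sketch: swiotlb_bytes stands in for the value of swiotlb_size_or_default(), and the 64MiB example assumes the common swiotlb default rather than any particular system. x86_default_crash_low_size() is a hypothetical name.

```c
#include <stdint.h>

#define MIB (1ULL << 20)

/* Hypothetical: default low reservation on x86 when crashkernel=X,high is
 * used without an explicit crashkernel=Y,low:
 * max(swiotlb size + 8MiB, 256MiB). swiotlb_bytes stands in for
 * swiotlb_size_or_default(). */
static uint64_t x86_default_crash_low_size(uint64_t swiotlb_bytes)
{
	uint64_t wanted = swiotlb_bytes + 8 * MIB;

	return wanted > 256 * MIB ? wanted : 256 * MIB;
}
```

With a 64MiB swiotlb, 64MiB + 8MiB = 72MiB is below the 256MiB floor, so 256MiB is reserved. A 512MiB swiotlb would raise the reservation to 520MiB.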
cryptomgr.notests
[KNL] Disable crypto self-tests
cs89x0_dma= [HW,NET]
Format: <dma>
cs89x0_media= [HW,NET]
Format: { rj45 | aui | bnc }
csdlock_debug= [KNL] Enable debug add-ons of cross-CPU function call
handling. When switched on, additional debug data is
printed to the console in case a hanging CPU is
detected, and that CPU is pinged again in order to try
to resolve the hang situation.
0: disable csdlock debugging (default)
1: enable basic csdlock debugging (minor impact)
ext: enable extended csdlock debugging (more impact,
but more data)
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.
db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port
(one device per port)
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
debug [KNL] Enable kernel debugging (events log level).
debug_boot_weak_hash
[KNL] Enable printing [hashed] pointers early in the
boot sequence. If enabled, we use a weak hash instead
of siphash to hash pointers. Use this option if you are
seeing instances of '(___ptrval___)' and need to see a
value (hashed pointer) instead. Cryptographically
insecure, please do not use on production kernels.
debug_locks_verbose=
[KNL] verbose locking self-tests
Format: <int>
Print debugging info while doing the locking API
self-tests.
Bitmask for the various LOCKTYPE_ tests. Defaults to 0
(no extra messages), setting it to -1 (all bits set)
will print _a_lot_ more information - normally only
useful to lockdep developers.
debug_objects [KNL] Enable object debugging
no_debug_objects
[KNL] Disable object debugging
debug_guardpage_minorder=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
parameter allows control of the order of pages that will
be intentionally kept free (and hence protected) by the
buddy allocator. Bigger values increase the probability
of catching random memory corruption, but reduce the
amount of memory for normal system use. The maximum
possible value is MAX_ORDER/2. Setting this parameter
to 1 or 2 should be enough to identify most random
memory corruption problems caused by bugs in kernel or
driver code when a CPU writes to (or reads from) a
random memory location. Note that there exists a class
of memory corruption problems caused by buggy H/W or
F/W or by drivers badly programming DMA (basically when
memory is written at bus level and the CPU MMU is
bypassed) which are not detectable by
CONFIG_DEBUG_PAGEALLOC, hence this option will not help
in tracking down these problems.
debug_pagealloc=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
enables the feature at boot time. By default, it is
disabled and the system will work mostly the same as a
kernel built without CONFIG_DEBUG_PAGEALLOC.
Note: to get the most out of debug_pagealloc error reports,
it's useful to also enable the page_owner functionality.
on: enable the feature
debugfs= [KNL] This parameter enables what is exposed to userspace
and debugfs internal clients.
Format: { on, no-mount, off }
on: All functions are enabled.
no-mount:
Filesystem is not registered but kernel clients can
access APIs and a crashkernel can be used to read
its content. There is nothing to mount.
off: Filesystem is not registered and clients
get a -EPERM as result when trying to register files
or directories within debugfs.
This is equivalent to the runtime functionality if
debugfs was not enabled in the kernel at all.
The default value is set at build time via the kernel configuration.
debugpat [X86] Enable PAT debugging
default_hugepagesz=
[HW] The size of the default HugeTLB page. This is
the size represented by the legacy /proc/ hugepages
APIs. In addition, this is the default hugetlb size
used for shmget(), mmap() and mounting hugetlbfs
filesystems. If not specified, defaults to the
architecture's default huge page size. Huge page
sizes are architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
deferred_probe_timeout=
[KNL] Debugging option to set a timeout in seconds for
deferred probe to give up waiting on dependencies to
probe. Only specific dependencies (subsystems or
drivers) that have opted in will be ignored. A timeout
of 0 will time out at the end of initcalls. If the
timeout hasn't expired, it'll be restarted by each
successful driver registration. This option will also
dump out devices still on the deferred probe list after
retrying.
delayacct [KNL] Enable per-task delay accounting
dell_smm_hwmon.ignore_dmi=
[HW] Continue probing hardware even if DMI data
indicates that the driver is running on unsupported
hardware.
dell_smm_hwmon.force=
[HW] Activate driver even if SMM BIOS signature does
not match list of supported models and enable otherwise
blacklisted features.
dell_smm_hwmon.power_status=
[HW] Report power status in /proc/i8k
(disabled by default).
dell_smm_hwmon.restricted=
[HW] Allow controlling fans only if SYS_ADMIN
capability is set.
dell_smm_hwmon.fan_mult=
[HW] Factor to multiply fan speed with.
dell_smm_hwmon.fan_max=
[HW] Maximum configurable fan speed.
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
level 1 and decompression (default)
off: No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
only (decompression)
always: Same as 'on' but ignores the selected compression
level, always using hardware support (used for debugging)
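The five dfltcc modes differ in two ways: which direction uses hardware, and whether the requested compression level is honoured. The hypothetical C decoding below only summarises the option list above. struct dfltcc_mode and dfltcc_parse() are illustrative names, not the s390 zlib code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical decoded form of the dfltcc= modes. */
struct dfltcc_mode {
	bool hw_deflate;	/* hardware compression (level 1 only, unless ignore_level) */
	bool hw_inflate;	/* hardware decompression */
	bool ignore_level;	/* use hardware regardless of the requested level */
};

/* Hypothetical parser: returns false for an unrecognised value. */
static bool dfltcc_parse(const char *arg, struct dfltcc_mode *m)
{
	static const struct {
		const char *name;
		struct dfltcc_mode mode;
	} table[] = {
		{ "on",       { true,  true,  false } },
		{ "off",      { false, false, false } },
		{ "def_only", { true,  false, false } },
		{ "inf_only", { false, true,  false } },
		{ "always",   { true,  true,  true  } },
	};

	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(arg, table[i].name) == 0) {
			*m = table[i].mode;
			return true;
		}
	}
	return false;
}
```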
dhash_entries= [KNL]
Set number of hash buckets for dentry cache.
disable_1tb_segments [PPC]
Disables the use of 1TB hash page table segments. This
causes the kernel to fall back to 256MB segments which
can be useful when debugging issues that require an SLB
miss to occur.
disable= [IPV6]
See Documentation/networking/ipv6.rst.
disable_radix [PPC]
Disable RADIX MMU mode on POWER9
disable_tlbie [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
disable_cpu_apicid= [X86,APIC,SMP]
Format: <int>
The initial APIC ID of the CPU to be disabled
at boot. Mostly used in the kdump 2nd kernel to
disable the BSP, so that multiple CPUs can be
woken up without causing a system reset or hang
due to sending INIT from an AP to the BSP.
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this
to work around buggy firmware.
disable_ipv6= [IPV6]
See Documentation/networking/ipv6.rst.
disable_mtrr_cleanup [X86]
The kernel tries to adjust the MTRR layout from
continuous to discrete, so that the X server driver
is able to add a WB entry later. This parameter
disables that.
disable_mtrr_trim [X86, Intel and AMD only]
By default the kernel will trim any uncacheable
memory out of your available memory pool based on
MTRR settings. This parameter disables that behavior,
possibly causing your machine to run very slowly.
disable_timer_pin_1 [X86]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
dis_ucode_ldr [X86] Disable the microcode loader.
dma_debug=off If the kernel is compiled with DMA_API_DEBUG support,
this option disables the debugging code at boot.
dma_debug_entries=<number>
This option allows tuning the number of preallocated
entries for DMA-API debugging code. One entry is
required per DMA-API allocation. Use this if the
DMA-API debugging code disables itself because the
architectural default is too low.
dma_debug_driver=<driver_name>
With this option the DMA-API debugging driver
filter feature can be enabled at boot time. Just
pass the driver to filter for as the parameter.
The filter can be disabled or changed to another
driver later using sysfs.
driver_async_probe= [KNL]
List of driver names to be probed asynchronously. *
matches with all driver names. If * is specified, the
rest of the listed driver names are those that will NOT
match the *.
Format: <driver_name1>,<driver_name2>...
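The matching rule, including the '*' exception list, can be expressed as a small C function. This is a sketch under assumptions. async_probe_requested() is a hypothetical name, and names are compared exactly. Per-driver probe-type flags, which can still force a driver to probe synchronously, are not modelled.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical: does the driver_async_probe= value ask for async probing
 * of drv? Per-driver probe-type flags are not modelled. */
static bool async_probe_requested(const char *param, const char *drv)
{
	size_t len = strlen(drv);
	bool wildcard = false, listed = false;
	const char *p = param;

	while (*p) {
		const char *comma = strchr(p, ',');
		size_t n = comma ? (size_t)(comma - p) : strlen(p);

		if (n == 1 && *p == '*')
			wildcard = true;
		else if (n == len && strncmp(p, drv, n) == 0)
			listed = true;
		if (!comma)
			break;
		p = comma + 1;
	}
	/* With "*", the listed names are the exceptions; without it, only
	 * the listed names probe asynchronously. */
	return wildcard ? !listed : listed;
}
```

For example, "*,drvA,drvB" makes every driver except drvA and drvB probe asynchronously, while "drvA,drvB" selects only those two.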
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
Broken monitors, graphic adapters, KVMs and EDIDless
panels may send no or incorrect EDID data sets.
This parameter allows specifying EDID data sets
in the /lib/firmware directory to be used instead.
Generic built-in EDID data sets are used if one of
edid/1024x768.bin, edid/1280x1024.bin,
edid/1680x1050.bin, or edid/1920x1080.bin is given
and no file with the same name exists. Details and
instructions on how to build your own EDID data are
available in Documentation/admin-guide/edid.rst. An EDID
data set will only be used for a particular connector
if its name and a colon are prepended to the EDID
name. Each connector may use a unique EDID data
set by separating the files with a comma. An EDID
data set with no connector name will be used for
any connectors not explicitly specified.
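The C sketch below implements the selection rule described above: an entry whose connector name matches exactly is used, otherwise an entry without a connector name. edid_file_for_connector() is a hypothetical helper, and the connector names are only examples. When several entries lack a connector name, this sketch picks the first one, which may differ from the in-kernel parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: choose the EDID file for a connector from a
 * "[<connector>:]<file>[,[<connector>:]<file>]" string and copy it to out.
 * Returns 0 on success, -1 if nothing applies or out is too small. */
static int edid_file_for_connector(const char *param, const char *connector,
				   char *out, size_t outlen)
{
	const char *match = NULL, *generic = NULL, *file;
	size_t match_len = 0, generic_len = 0, len;
	size_t clen = strlen(connector);
	const char *p = param;

	while (*p) {
		const char *comma = strchr(p, ',');
		size_t n = comma ? (size_t)(comma - p) : strlen(p);
		const char *colon = memchr(p, ':', n);

		if (colon) {
			/* "<connector>:<file>": use it only on an exact name match */
			if (!match && (size_t)(colon - p) == clen &&
			    strncmp(p, connector, clen) == 0) {
				match = colon + 1;
				match_len = n - clen - 1;
			}
		} else if (!generic) {
			generic = p;	/* first entry without a connector name */
			generic_len = n;
		}
		if (!comma)
			break;
		p = comma + 1;
	}

	file = match ? match : generic;
	len = match ? match_len : generic_len;
	if (!file || len + 1 > outlen)
		return -1;
	memcpy(out, file, len);
	out[len] = '\0';
	return 0;
}
```

For example, take "DVI-I-1:edid/1280x1024.bin,edid/1920x1080.bin". Connector DVI-I-1 gets edid/1280x1024.bin, and any other connector gets edid/1920x1080.bin.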
dscc4.setup= [NET]
dt_cpu_ftrs= [PPC]
Format: {"off" | "known"}
Control how the dt_cpu_ftrs device-tree binding is
used for CPU feature discovery and setup (if it
exists).
off: Do not use it, fall back to legacy cpu table.
known: Do not pass through unknown features to guests
or userspace, only those that the kernel is aware of.
x86/efi: Retrieve and assign Apple device properties
Apple's EFI drivers supply device properties which are needed to support
Macs optimally. They contain vital information which cannot be obtained
any other way (e.g. Thunderbolt Device ROM). They're also used to convey
the current device state so that OS drivers can pick up where EFI
drivers left (e.g. GPU mode setting).
There's an EFI driver dubbed "AAPL,PathProperties" which implements a
per-device key/value store. Other EFI drivers populate it using a custom
protocol. The macOS bootloader /System/Library/CoreServices/boot.efi
retrieves the properties with the same protocol. The kernel extension
AppleACPIPlatform.kext subsequently merges them into the I/O Kit
registry (see ioreg(8)) where they can be queried by other kernel
extensions and user space.
This commit extends the efistub to retrieve the device properties before
ExitBootServices is called. It assigns them to devices in an fs_initcall
so that they can be queried with the API in <linux/property.h>.
Note that the device properties will only be available if the kernel is
booted with the efistub. Distros should adjust their installers to
always use the efistub on Macs. grub with the "linux" directive will not
work unless the functionality of this commit is duplicated in grub.
(The "linuxefi" directive should work but is not included upstream as of
this writing.)
The custom protocol has GUID 91BD12FE-F6C3-44FB-A5B7-5122AB303AE0 and
looks like this:
typedef struct apple_properties_protocol {
	unsigned long version; /* 0x10000 */
	efi_status_t (*get) (
		IN struct apple_properties_protocol *this,
		IN struct efi_dev_path *device,
		IN efi_char16_t *property_name,
		OUT void *buffer,
		IN OUT u32 *buffer_len);
		/* EFI_SUCCESS, EFI_NOT_FOUND, EFI_BUFFER_TOO_SMALL */
	efi_status_t (*set) (
		IN struct apple_properties_protocol *this,
		IN struct efi_dev_path *device,
		IN efi_char16_t *property_name,
		IN void *property_value,
		IN u32 property_value_len);
		/* allocates copies of property name and value */
		/* EFI_SUCCESS, EFI_OUT_OF_RESOURCES */
	efi_status_t (*del) (
		IN struct apple_properties_protocol *this,
		IN struct efi_dev_path *device,
		IN efi_char16_t *property_name);
		/* EFI_SUCCESS, EFI_NOT_FOUND */
	efi_status_t (*get_all) (
		IN struct apple_properties_protocol *this,
		OUT void *buffer,
		IN OUT u32 *buffer_len);
		/* EFI_SUCCESS, EFI_BUFFER_TOO_SMALL */
} apple_properties_protocol;
Thanks to Pedro Vilaça for this blog post which was helpful in reverse
engineering Apple's EFI drivers and bootloader:
https://reverse.put.as/2016/06/25/apple-efi-firmware-passwords-and-the-scbo-myth/
If someone at Apple is reading this, please note there's a memory leak
in your implementation of the del() function as the property struct is
freed but the name and value allocations are not.
Neither the macOS bootloader nor Apple's EFI drivers check the protocol
version, but we do to avoid breakage if it's ever changed. It's been the
same since at least OS X 10.6 (2009).
The get_all() function conveniently fills a buffer with all properties
in marshalled form which can be passed to the kernel as a setup_data
payload. The number of device properties is dynamic and can change
between a first invocation of get_all() (to determine the buffer size)
and a second invocation (to retrieve the actual buffer), hence the
peculiar loop which does not finish until the buffer size settles.
The macOS bootloader does the same.
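The retry loop around get_all() can be sketched in user space as follows. This is an illustration, not the efistub code: get_all_fn, fetch_all_properties() and fake_get_all() are hypothetical names, the status values are simplified stand-ins modeled on the UEFI error encoding (top bit set), and the fake store simply grows once between calls to mimic a property being added concurrently.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the EFI definitions (errors have the top bit set). */
typedef unsigned long efi_status_t;
#define EFI_ERR(n)           ((1UL << (sizeof(unsigned long) * 8 - 1)) | (n))
#define EFI_SUCCESS          0UL
#define EFI_BUFFER_TOO_SMALL EFI_ERR(5)
#define EFI_OUT_OF_RESOURCES EFI_ERR(9)

/* Hypothetical signature mirroring get_all() without the "this" pointer. */
typedef efi_status_t (*get_all_fn)(void *buffer, uint32_t *buffer_len);

/*
 * Keep calling get_all() until it stops reporting EFI_BUFFER_TOO_SMALL.
 * Each such reply carries the size needed right now; if the store grew
 * since the previous call, the buffer is enlarged and the call repeated,
 * so the loop only ends once the size has settled.
 */
static efi_status_t fetch_all_properties(get_all_fn get_all,
					 void **out_buf, uint32_t *out_len)
{
	void *buf = NULL;
	uint32_t cap = 0;                 /* bytes currently allocated in buf */

	*out_buf = NULL;
	*out_len = 0;
	for (;;) {
		uint32_t len = cap;
		efi_status_t status = get_all(buf, &len);

		if (status != EFI_BUFFER_TOO_SMALL) {
			if (status != EFI_SUCCESS) {
				free(buf);
				return status;
			}
			*out_buf = buf;
			*out_len = len;
			return status;
		}
		if (len <= cap) {             /* no progress: treat as failure */
			free(buf);
			return EFI_OUT_OF_RESOURCES;
		}
		void *nbuf = realloc(buf, len);
		if (!nbuf) {
			free(buf);
			return EFI_OUT_OF_RESOURCES;
		}
		buf = nbuf;
		cap = len;
	}
}

/* Fake firmware store that grows from 100 to 140 bytes after the first call. */
static uint32_t fake_size = 100;
static int fake_calls;

static efi_status_t fake_get_all(void *buffer, uint32_t *buffer_len)
{
	uint32_t need = fake_size;

	if (++fake_calls == 1)
		fake_size = 140;          /* a property is added between calls */
	if (*buffer_len < need) {
		*buffer_len = need;
		return EFI_BUFFER_TOO_SMALL;
	}
	memset(buffer, 0xab, need);       /* stand-in for marshalled properties */
	*buffer_len = need;
	return EFI_SUCCESS;
}
```

With fake_get_all(), fetch_all_properties() makes three calls: the first reports 100 bytes, the second reports 140 because the store grew, and the third succeeds with a 140-byte buffer.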
The setup_data payload is later on unmarshalled in an fs_initcall. The
idea is that most buses instantiate devices in "subsys" initcall level
and drivers are usually bound to these devices in "device" initcall
level, so we assign the properties in-between, i.e. in "fs" initcall
level.
This assumes that devices to which properties pertain are instantiated
from a "subsys" initcall or earlier. That should always be the case
since on macOS, AppleACPIPlatformExpert::matchEFIDevicePath() only
supports ACPI and PCI nodes and we've fully scanned those buses during
"subsys" initcall level.
The second assumption is that properties are only needed from a "device"
initcall or later. Seems reasonable to me, but should this ever not work
out, an alternative approach would be to store the property sets e.g. in
a btree early during boot. Then whenever device_add() is called, an EFI
Device Path would have to be constructed for the newly added device,
and looked up in the btree. That way, the property set could be assigned
to the device immediately on instantiation. And this would also work for
devices instantiated in a deferred fashion. It seems like this approach
would be more complicated and require more code. That doesn't seem
justified without a specific use case.
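The argument above rests on initcall ordering alone. The toy program below is a user-space sketch with hypothetical names: however the calls are listed, running them level by level puts the bus scan (subsys) before property assignment (fs), and both before driver probing (device). In the kernel the ordering comes from per-level linker sections rather than from a loop like run_initcalls().

```c
#include <stddef.h>

/*
 * Main initcall levels, earliest first
 * (pure, core, postcore, arch, subsys, fs, device, late).
 */
enum initcall_level { PURE, CORE, POSTCORE, ARCH, SUBSYS, FS, DEVICE, LATE, NR_LEVELS };

/* Toy device state tracked across the three steps. */
struct toy_device {
	int instantiated;      /* set by the bus scan */
	int has_props;         /* set when properties are assigned */
	int bound_with_props;  /* driver saw the properties at probe time */
};

static struct toy_device dev;

static void bus_scan(void)     { dev.instantiated = 1; }
static void assign_props(void) { if (dev.instantiated) dev.has_props = 1; }
static void driver_probe(void) { dev.bound_with_props = dev.has_props; }

struct initcall {
	enum initcall_level level;
	void (*fn)(void);
};

/* Listed out of order on purpose; only the level decides when each runs. */
static const struct initcall calls[] = {
	{ DEVICE, driver_probe },
	{ FS,     assign_props },
	{ SUBSYS, bus_scan },
};

static void run_initcalls(void)
{
	for (int lvl = PURE; lvl < NR_LEVELS; lvl++)
		for (size_t i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if (calls[i].level == (enum initcall_level)lvl)
				calls[i].fn();
}
```

If assign_props() were moved to a level after DEVICE, driver_probe() would run first and bound_with_props would stay 0, which is the failure mode the "device initcall or later" assumption rules out.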
For comparison, the strategy on macOS is to assign properties to objects
in the ACPI namespace (AppleACPIPlatformExpert::mergeEFIProperties()).
That approach is definitely wrong as it fails for devices not present in
the namespace: The NHI EFI driver supplies properties for attached
Thunderbolt devices, yet on Macs with Thunderbolt 1 only one device
level behind the host controller is described in the namespace.
Consequently macOS cannot assign properties for chained devices. With
Thunderbolt 2 they started to describe three device levels behind host
controllers in the namespace but this grossly inflates the SSDT and
still fails if the user daisy-chained more than three devices.
We copy the property names and values from the setup_data payload to
swappable virtual memory and afterwards make the payload available to
the page allocator. This is just for the sake of good housekeeping; the
payload wouldn't occupy a meaningful amount of physical memory (4444 bytes on my
machine). Only the payload is freed, not the setup_data header since
otherwise we'd break the list linkage and we cannot safely update the
predecessor's ->next link because there's no locking for the list.
The payload is currently not passed on to kexec'ed kernels, same for PCI
ROMs retrieved by setup_efi_pci(). This can be added later if there is
demand by amending setup_efi_state(). The payload can then no longer be
made available to the page allocator of course.
Tested-by: Lukas Wunner <lukas@wunner.de> [MacBookPro9,1]
Tested-by: Pierre Moreau <pierre.morrow@free.fr> [MacBookPro11,3]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andreas Noever <andreas.noever@gmail.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pedro Vilaça <reverser@put.as>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: grub-devel@gnu.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20161112213237.8804-9-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-12 21:32:36 +00:00
dump_apple_properties [X86]
Dump name and content of EFI device properties on
x86 Macs. Useful for driver authors to determine
what data is available or for reverse-engineering.
2012-04-27 14:30:41 -06:00
dyndbg[="val"] [KNL,DYNAMIC_DEBUG]
2020-09-15 19:49:02 -07:00
<module>.dyndbg[="val"]
2012-04-27 14:30:41 -06:00
Enable debug messages at boot time. See
2017-06-14 12:24:12 +02:00
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
2012-04-27 14:30:41 -06:00
2014-04-07 15:39:53 -07:00
early_ioremap_debug [KNL]
Enable debug messages in early_ioremap support. This
is useful for tracking down temporary early mappings
which are not unmapped.
2009-04-05 15:55:22 -07:00
earlycon= [KNL] Output early console device and options.
2014-04-18 17:19:57 -05:00
2019-09-17 09:15:23 +02:00
When used with no options, the early console is
determined by the stdout-path property in the device tree's
chosen node or the ACPI SPCR table if supported by
the platform.
2015-09-14 19:54:07 -05:00
2016-09-22 16:58:16 +01:00
cdns,<addr>[,options]
Start an early, polled-mode console on a Cadence
(xuartps) serial port at the specified address. The only
supported option is the baud rate. If the baud rate is not
specified, the serial port must already be set up and
configured.
2014-09-10 12:43:02 +02:00
2022-11-24 13:39:07 +01:00
uart[8250],io,<addr>[,options[,uartclk]]
uart[8250],mmio,<addr>[,options[,uartclk]]
uart[8250],mmio32,<addr>[,options[,uartclk]]
uart[8250],mmio32be,<addr>[,options[,uartclk]]
earlycon: 8250: Document kernel command line options
Document the expected behavior of kernel command lines of the forms:
console=uart[8250],io|mmio|mmio32,<addr>[,options]
console=uart[8250],<addr>[,options]
and
earlycon=uart[8250],io|mmio|mmio32,<addr>[,options]
earlycon=uart[8250],<addr>[,options]
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-06 10:52:39 -04:00
uart[8250],0x<addr>[,options]
2009-04-05 15:55:22 -07:00
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address.
2011-08-13 12:34:52 -07:00
MMIO inter-register address stride is either 8-bit
2015-05-25 06:54:28 +03:00
(mmio) or 32-bit (mmio32 or mmio32be).
If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified
in the same format described for "console=ttyS<n>"; if
2022-11-24 13:39:07 +01:00
unspecified, the h/w is not initialized. 'uartclk' is
the uart clock frequency; if unspecified, it is set
to 'BASE_BAUD' * 16.
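To illustrate the uart[8250] syntax above, here is a small user-space parser. It is a sketch, not the kernel's earlycon parser: struct earlycon_uart and parse_uart_earlycon() are hypothetical, options are kept as an opaque string in the console=ttyS<n> format, and a uartclk of 0 stands for "not given" (the case where BASE_BAUD * 16 applies).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum uart_iotype { UART_IO, UART_MMIO, UART_MMIO32, UART_MMIO32BE };

/* Hypothetical result type; not a kernel structure. */
struct earlycon_uart {
	enum uart_iotype iotype;
	uint64_t addr;
	char options[32];       /* e.g. "115200n8"; empty if not given */
	unsigned long uartclk;  /* 0 if not given */
};

/* Parse "uart[8250],[io|mmio|mmio32|mmio32be,]<addr>[,options[,uartclk]]". */
static bool parse_uart_earlycon(const char *arg, struct earlycon_uart *out)
{
	static const struct { const char *name; enum uart_iotype type; } types[] = {
		{ "io", UART_IO }, { "mmio", UART_MMIO },
		{ "mmio32", UART_MMIO32 }, { "mmio32be", UART_MMIO32BE },
	};
	char buf[128], *fields[4] = { NULL }, *p, *end;
	int n = 0, i = 0;

	memset(out, 0, sizeof(*out));
	if (strncmp(arg, "uart8250,", 9) == 0)
		arg += 9;
	else if (strncmp(arg, "uart,", 5) == 0)
		arg += 5;
	else
		return false;
	if (strlen(arg) >= sizeof(buf))
		return false;
	strcpy(buf, arg);

	/* Split on commas into at most four fields. */
	for (p = buf; p && n < 4; ) {
		fields[n++] = p;
		p = strchr(p, ',');
		if (p)
			*p++ = '\0';
	}
	if (p)                          /* more than four fields */
		return false;

	/* Optional access type; without one, <addr> is treated as mmio. */
	out->iotype = UART_MMIO;
	for (size_t t = 0; t < sizeof(types) / sizeof(types[0]); t++) {
		if (strcmp(fields[0], types[t].name) == 0) {
			out->iotype = types[t].type;
			i = 1;
			break;
		}
	}

	/* Mandatory address (0x prefix accepted). */
	if (i >= n)
		return false;
	out->addr = strtoull(fields[i], &end, 0);
	if (end == fields[i] || *end != '\0')
		return false;
	i++;

	/* Optional options string, then optional uartclk. */
	if (i < n) {
		if (strlen(fields[i]) >= sizeof(out->options))
			return false;
		strcpy(out->options, fields[i++]);
	}
	if (i < n) {
		out->uartclk = strtoul(fields[i], &end, 10);
		if (end == fields[i] || *end != '\0')
			return false;
		i++;
	}
	return i == n;                  /* reject leftover fields */
}
```

For example, "uart8250,mmio32,0xfe215040,115200n8" yields iotype UART_MMIO32, address 0xfe215040 and options "115200n8", while "uart,0x3f8" falls back to UART_MMIO with no options.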
2009-04-05 15:55:22 -07:00
2014-04-18 17:19:57 -05:00
pl011,<addr>
2016-01-04 15:37:42 -06:00
pl011,mmio32,<addr>
2014-04-18 17:19:57 -05:00
Start an early, polled-mode console on a pl011 serial
port at the specified address. The pl011 serial port
must already be set up and configured. Options are not
2016-01-04 15:37:42 -06:00
yet supported. If 'mmio32' is specified, then the
driver will use only 32-bit accessors to read/write
the device registers.
2014-04-18 17:19:57 -05:00
2021-05-17 20:54:52 +09:00
liteuart,<addr>
Start an early console on a litex serial port at the
specified address. The serial port must already be
set up and configured. Options are not yet supported.
2016-03-06 12:21:24 +01:00
meson,<addr>
Start an early, polled-mode console on a meson serial
port at the specified address. The serial port must
already be set up and configured. Options are not yet
supported.
2014-09-15 17:22:51 -07:00
msm_serial,<addr>
Start an early, polled-mode console on an msm serial
port at the specified address. The serial port
must already be set up and configured. Options are not
yet supported.
msm_serial_dm,<addr>
Start an early, polled-mode console on an msm serial
dm port at the specified address. The serial port
must already be set up and configured. Options are not
yet supported.
2017-06-19 03:46:40 +02:00
owl,<addr>
Start an early, polled-mode console on a serial port
of an Actions Semi SoC, such as S500 or S900, at the
specified address. The serial port must already be
set up and configured. Options are not yet supported.
2018-12-18 20:32:37 +05:30
rda,<addr>
Start an early, polled-mode console on a serial port
of an RDA Micro SoC, such as RDA8810PL, at the
specified address. The serial port must already be
2017-06-19 03:46:40 +02:00
set up and configured. Options are not yet supported.
2019-09-13 13:38:43 -07:00
sbi
Use RISC-V SBI (Supervisor Binary Interface) for early
console.
2014-04-18 17:19:58 -05:00
smh Use ARM semihosting calls for early console.
2015-01-23 14:47:41 +01:00
s3c2410,<addr>
s3c2412,<addr>
s3c2440,<addr>
s3c6400,<addr>
s5pv210,<addr>
exynos4210,<addr>
Use the early console provided by the serial driver available
on Samsung SoCs. This requires selecting the proper type and
a correct base address of the selected UART port. The
serial port must already be set up and configured.
Options are not yet supported.
2016-12-11 21:42:23 +01:00
lantiq,<addr>
Start an early, polled-mode console on a lantiq serial
(lqasc) port at the specified address. The serial port
must already be set up and configured. Options are not
yet supported.
2015-10-17 00:45:55 -07:00
lpuart,<addr>
lpuart32,<addr>
Use the early console provided by the Freescale LP UART driver
found on Freescale Vybrid and QorIQ LS1021A processors.
A valid base address must be provided, and the serial
port must already be set up and configured.
2020-02-29 14:27:48 +01:00
ec_imx21,<addr>
ec_imx6q,<addr>
Start an early, polled-mode, output-only console on the
Freescale i.MX UART at the specified address. The UART
must already be set up and configured.
2017-05-04 00:49:36 +01:00
ar3700_uart,<addr>
2016-02-16 19:14:53 +01:00
Start an early, polled-mode console on the
Armada 3700 serial port at the specified
address. The serial port must already be set up
and configured. Options are not yet supported.
2018-05-03 14:14:40 -06:00
qcom_geni,<addr>
Start an early, polled-mode console on a Qualcomm
Generic Interface (GENI) based serial port at the
specified address. The serial port must already be
set up and configured. Options are not yet supported.
2019-02-02 10:41:18 +01:00
efifb,[options]
Start an early, unaccelerated console on the EFI
memory mapped framebuffer (if available). On cache
coherent non-x86 systems that use system memory for
the framebuffer, pass the 'ram' option so that it is
mapped with the correct attributes.
2019-08-09 11:29:16 +00:00
linflex,<addr>
2019-10-16 15:48:25 +03:00
Use the early console provided by the Freescale LINFlexD UART
2019-08-09 11:29:16 +00:00
serial driver for NXP S32V234 SoCs. A valid base
address must be provided, and the serial port must
already be set up and configured.
2018-03-07 22:23:24 +01:00
earlyprintk= [X86,SH,ARM,M68k,S390]
2005-04-16 15:20:36 -07:00
earlyprintk=vga
2017-01-11 09:14:52 +01:00
earlyprintk=sclp
2013-02-25 15:54:08 -05:00
earlyprintk=xen
2005-04-16 15:20:36 -07:00
earlyprintk=serial[,ttySn[,baudrate]]
2013-04-10 14:03:38 -07:00
earlyprintk=serial[,0x...[,baudrate]]
2009-09-24 09:08:30 -05:00
earlyprintk=ttySn[,baudrate]
2009-08-20 15:39:57 -05:00
earlyprintk=dbgp[debugController#]
2018-10-03 00:49:21 +08:00
earlyprintk=pciserial[,force],bus:device.function[,baudrate]
2017-03-21 16:01:31 +08:00
earlyprintk=xdbc[xhciController#]
2005-04-16 15:20:36 -07:00
2013-04-10 14:03:38 -07:00
earlyprintk is useful when the kernel crashes before
the normal console is initialized. It is not enabled by
default because it has some cosmetic problems.
2005-10-23 12:57:11 -07:00
Append ",keep" to not disable it when the real console
2005-04-16 15:20:36 -07:00
takes over.
2022-03-21 13:58:53 +09:00
Only one of vga, serial, or usb debug port can
2013-10-04 09:36:56 +01:00
be used at a time.
2005-04-16 15:20:36 -07:00
2013-04-10 14:03:38 -07:00
Currently only ttyS0 and ttyS1 may be specified by
name. Other I/O ports may be explicitly specified
on some architectures (x86 and arm at least) by
replacing ttySn with an I/O port address, like this:
earlyprintk=serial,0x1008,115200
You can find the port for a given device in
/proc/tty/driver/serial:
2: uart:ST16650V2 port:00001008 irq:18 ...
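The lookup described above can be automated. A minimal sketch, assuming the line layout shown in the example (a " port:" field holding the I/O port in hex) and that ports without a detected UART are reported as "uart:unknown"; earlyprintk_from_proc_line() is a hypothetical helper, and the resulting I/O port form only applies where the text says it is accepted (x86 and arm at least).

```c
#include <stdio.h>
#include <string.h>

/*
 * Turn one line of /proc/tty/driver/serial into an earlyprintk= argument
 * using the "serial,<io port>,<baud>" form. Returns 0 on success, or -1 if
 * the line has no port field, the UART is unknown, or out is too small.
 */
static int earlyprintk_from_proc_line(const char *line, unsigned int baud,
				      char *out, size_t outlen)
{
	const char *p = strstr(line, " port:");
	unsigned long port;
	int n;

	if (!p || strstr(line, "uart:unknown"))
		return -1;
	if (sscanf(p + strlen(" port:"), "%lx", &port) != 1)
		return -1;
	n = snprintf(out, outlen, "earlyprintk=serial,0x%lx,%u", port, baud);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Fed the example line above with a baud rate of 115200, the helper produces the same string as the hand-written example: earlyprintk=serial,0x1008,115200.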
2005-04-16 15:20:36 -07:00
Interaction with the standard serial driver is not
very good.
2022-03-21 13:58:53 +09:00
The VGA output is eventually overwritten by
2013-10-04 09:36:56 +01:00
the real console.
2005-04-16 15:20:36 -07:00
2021-09-30 14:18:45 +02:00
The xen option can only be used in Xen domains.
2013-02-25 15:54:08 -05:00
2017-01-11 09:14:52 +01:00
The sclp output can only be used on s390.
2018-10-03 00:49:21 +08:00
The optional "force" to "pciserial" enables use of a
PCI device even when its classcode is not of the
UART class.
2013-12-06 01:17:08 -05:00
edac_report= [HW,EDAC] Control how to report EDAC events
Format: {"on" | "off" | "force"}
on: enable EDAC to report H/W events. May be overridden
by other, higher-priority error reporting modules.
off: disable H/W event reporting through EDAC.
force: enforce the use of EDAC to report H/W events.
default: on.
2005-04-16 15:20:36 -07:00
edd= [EDD]
2008-04-29 01:02:45 -07:00
Format: {"off" | "on" | "skip[mbr]"}
2005-04-16 15:20:36 -07:00
2013-10-31 17:25:08 +01:00
efi= [EFI]
2020-06-16 12:40:12 +02:00
Format: { "debug", "disable_early_pci_dma",
"nochunk", "noruntime", "nosoftreserve",
2020-08-17 12:00:17 +02:00
"novamap", "no_disable_early_pci_dma" }
2020-06-16 12:40:12 +02:00
debug: enable misc debug output.
disable_early_pci_dma: disable the busmaster bit on all
PCI bridges while in the EFI boot stub.
2014-08-05 11:52:11 +01:00
nochunk: disable reading files in "chunks" in the EFI
boot stub, as chunking can cause problems with some
firmware implementations.
2014-08-14 17:15:28 +08:00
noruntime: disable EFI runtime services support.
2019-11-06 17:43:11 -08:00
nosoftreserve: The EFI_MEMORY_SP (Specific Purpose)
attribute may cause the kernel to reserve the
memory range for a memory mapping driver to
claim. Specify efi=nosoftreserve to disable this
reservation and treat the memory by its base type
(i.e. EFI_CONVENTIONAL_MEMORY / "System RAM").
2020-06-16 12:40:12 +02:00
novamap: do not call SetVirtualAddressMap().
2020-01-03 12:39:50 +01:00
no_disable_early_pci_dma: Leave the busmaster bit set
on all PCI bridges while in the EFI boot stub.
2013-10-31 17:25:08 +01:00
2013-04-17 01:00:53 +02:00
efi_no_storage_paranoia [EFI; X86]
Using this parameter you can use more than 50% of
your EFI variable storage. Use this parameter only if
you are really sure that your UEFI does sane garbage
collection and fulfills the spec; otherwise your board
may be bricked.
2015-09-30 23:01:56 +09:00
efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
Add arbitrary attribute to specific memory range by
updating original EFI memory map.
The region of memory to which the aa attribute is added
spans from ss to ss+nn.
2019-11-06 17:43:26 -08:00
2015-09-30 23:01:56 +09:00
If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
attribute is added to range 0x100000000-0x180000000 and
0x10a0000000-0x1120000000.
2019-11-06 17:43:26 -08:00
If efi_fake_mem=8G@9G:0x40000 is specified, the
EFI_MEMORY_SP(0x40000) attribute is added to
range 0x240000000-0x43fffffff.
2015-09-30 23:01:56 +09:00
Using this parameter you can do debugging of EFI memmap
2019-11-06 17:43:26 -08:00
related features. For example, you can do debugging of
2015-09-30 23:01:56 +09:00
the Address Range Mirroring feature even if your box
2019-11-06 17:43:26 -08:00
doesn't support it, or mark specific memory as
"soft reserved".
2015-09-30 23:01:56 +09:00
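The range arithmetic in the efi_fake_mem= examples can be checked with a short parser. This is a sketch under assumptions: parse_size() is a simplified, hypothetical stand-in for the kernel's own size parsing (only K, M and G suffixes), struct fake_mem_range and parse_efi_fake_mem() are illustrative names, and end is exclusive, so the last byte of a range is end - 1 (0x43fffffff in the second example).

```c
#include <stdint.h>
#include <stdlib.h>

/* Number (decimal or 0x-prefixed hex) with an optional K/M/G suffix. */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	if (*end == s)                  /* no digits: leave *end at s */
		return 0;
	switch (**end) {
	case 'G': case 'g': v <<= 10;   /* fall through */
	case 'M': case 'm': v <<= 10;   /* fall through */
	case 'K': case 'k': v <<= 10; (*end)++; break;
	}
	return v;
}

struct fake_mem_range {
	uint64_t start, end, attr;      /* end is exclusive */
};

/* Parse one "nn[KMG]@ss[KMG]:aa" entry; returns a pointer past it or NULL. */
static const char *parse_fake_mem_entry(const char *s, struct fake_mem_range *r)
{
	char *end;
	uint64_t size = parse_size(s, &end);

	if (end == s || *end != '@')
		return NULL;
	s = end + 1;
	r->start = parse_size(s, &end);
	if (end == s || *end != ':')
		return NULL;
	s = end + 1;
	r->attr = strtoull(s, &end, 0);
	if (end == s)
		return NULL;
	r->end = r->start + size;       /* region runs from ss to ss+nn */
	return end;
}

/* Parse a comma-separated list; returns the number of entries or -1. */
static int parse_efi_fake_mem(const char *arg, struct fake_mem_range *out, int max)
{
	int n = 0;

	while (*arg && n < max) {
		const char *next = parse_fake_mem_entry(arg, &out[n]);

		if (!next)
			return -1;
		n++;
		if (*next == ',')
			next++;
		else if (*next)
			return -1;
		arg = next;
	}
	return *arg ? -1 : n;           /* -1 if entries were left over */
}
```

Parsing "2G@4G:0x10000,2G@0x10a0000000:0x10000" gives the ranges 0x100000000 to 0x180000000 and 0x10a0000000 to 0x1120000000, matching the first example.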
2016-07-08 19:13:12 +03:00
efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT
that is to be dynamically loaded by Linux. If there are
multiple variables with the same name but with different
vendor GUIDs, all of them will be loaded. See
2019-06-07 15:54:32 -03:00
Documentation/admin-guide/acpi/ssdt-overlays.rst for details.
2016-07-08 19:13:12 +03:00
2005-04-16 15:20:36 -07:00
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
2022-04-02 22:48:21 -07:00
ekgdboc= [X86,KGDB] Allow early kernel console debugging
Format: ekgdboc=kbd
This is designed to be used in conjunction with
the boot argument: earlyprintk=vga
This parameter works in place of the kgdboc parameter
but can only be used if the backing tty is available
very early in the boot process. For early debugging
via a serial port see kgdboc_earlycon instead.
2007-07-31 00:37:59 -07:00
elanfreq= [X86-32]
2005-04-16 15:20:36 -07:00
See comment before function elanfreq_setup() in
2008-07-04 09:59:43 -07:00
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
2005-04-16 15:20:36 -07:00
2011-10-30 15:16:37 +01:00
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
2005-10-23 12:57:11 -07:00
Specifies the physical address of the start of the kernel core
2011-10-30 15:16:37 +01:00
image ELF header and, optionally, its size. Generally the
kexec loader will pass this option to the capture kernel.
2019-06-13 15:21:39 -03:00
See Documentation/admin-guide/kdump/kdump.rst for details.
2005-04-16 15:20:36 -07:00
2009-04-05 15:55:22 -07:00
enable_mtrr_cleanup [X86]
The kernel tries to adjust the MTRR layout from continuous
to discrete, so that the X server driver can add a WB
entry later. This parameter enables that.
2009-05-06 16:02:58 -07:00
enable_timer_pin_1 [X86]
2009-04-05 15:55:22 -07:00
Enable PIN 1 of APIC timer
Can be useful to work around chipset bugs
(in particular on some ATI chipsets).
The kernel tries to set a reasonable default.
2022-02-28 20:14:54 -08:00
enforcing= [SELINUX] Set initial enforcing status.
2005-04-16 15:20:36 -07:00
Format: {"0" | "1"}
See security/selinux/Kconfig help text.
0 -- permissive (log only, no denials).
1 -- enforcing (deny and log).
Default value is 0.
2020-01-07 11:35:04 -05:00
Value can be changed at runtime via
/sys/fs/selinux/enforce.
2005-04-16 15:20:36 -07:00
2010-05-18 14:35:21 +08:00
erst_disable [ACPI]
Disable Error Record Serialization Table (ERST)
support.
2005-04-16 15:20:36 -07:00
ether= [HW,NET] Ethernet cards parameters
This option is obsoleted by the "netdev=" option, which
has equivalent usage. See its documentation for details.
2011-05-12 18:33:20 -04:00
evm= [EVM]
Format: { "fix" }
Permit 'security.evm' to be updated regardless of
current integrity status.
2022-08-25 18:27:14 +08:00
early_page_ext [KNL] Enforces page_ext initialization to earlier
stages to cover more early boot allocations.
Please note that, as a side effect, some optimizations
might be disabled to achieve that (e.g. parallelized
memory initialization is disabled) so the boot process
might take longer, especially on systems with a lot of
memory. Available with CONFIG_PAGE_EXTENSION=y.
2006-12-08 02:39:42 -08:00
failslab=
2020-10-15 20:13:46 -07:00
fail_usercopy=
2006-12-08 02:39:42 -08:00
fail_page_alloc=
fail_make_request=[KNL]
General fault injection mechanism.
Format: <interval>,<probability>,<space>,<times>
2011-08-15 02:02:26 +02:00
See also Documentation/fault-injection/.
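A minimal sketch of splitting the four comma-separated fields of this format, assuming they are plain integers and that <times> is signed so that values such as -1 can be given. struct fault_attr_args and parse_fault_attr() are hypothetical names; what each field means is described in Documentation/fault-injection/.

```c
#include <stdio.h>

/* Field names follow the Format line above. */
struct fault_attr_args {
	unsigned long interval;
	unsigned long probability;
	unsigned long space;
	int times;
};

/* Returns 0 if exactly four fields were parsed, -1 otherwise. */
static int parse_fault_attr(const char *arg, struct fault_attr_args *a)
{
	char tail;  /* catches trailing garbage after the fourth field */
	int n = sscanf(arg, "%lu,%lu,%lu,%d%c",
		       &a->interval, &a->probability, &a->space, &a->times, &tail);

	return n == 4 ? 0 : -1;
}
```

For instance, failslab=10,100,0,-1 would split into interval 10, probability 100, space 0 and times -1.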
2006-12-08 02:39:42 -08:00
2020-08-26 09:05:35 -07:00
fb_tunnels= [NET]
Format: { initns | none }
See Documentation/admin-guide/sysctl/net.rst for
fb_tunnels_only_for_init_ns
2005-04-16 15:20:36 -07:00
floppy= [HW]
2019-06-18 11:47:10 -03:00
See Documentation/admin-guide/blockdev/floppy.rst.
2005-04-16 15:20:36 -07:00
2008-05-08 14:03:23 -06:00
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
buggy SAL_CACHE_FLUSH implementations. Using this
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
2018-04-18 20:51:39 +02:00
forcepae [X86-32]
2014-03-07 18:40:42 +07:00
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
functionally usable PAE implementation.
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
2008-11-01 19:57:37 +01:00
ftrace=[tracer]
2009-05-28 13:37:24 -04:00
[FTRACE] will set and start the specified tracer
2008-11-01 19:57:37 +01:00
as early as possible in order to facilitate early
boot debugging.
2022-03-10 21:37:09 -05:00
ftrace_boot_snapshot
[FTRACE] On boot up, a snapshot will be taken of the
ftrace ring buffer that can be read at:
/sys/kernel/tracing/snapshot.
This is useful if you need tracing information from kernel
boot up that is likely to be overwritten by user space
start up functionality.
2023-02-07 12:28:53 -05:00
Optionally, the snapshot can also be defined for a tracing
instance that was created by the trace_instance= command
line parameter.
trace_instance=foo,sched_switch ftrace_boot_snapshot=foo
The above will cause the "foo" tracing instance to trigger
a snapshot at the end of boot up.
2010-04-18 19:08:41 +02:00
ftrace_dump_on_oops[=orig_cpu]
2009-05-28 13:37:24 -04:00
[FTRACE] will dump the trace buffers on oops.
2010-04-18 19:08:41 +02:00
If no parameter is passed, ftrace will dump
buffers of all CPUs, but if you pass orig_cpu, it will
dump only the buffer of the CPU that triggered the
oops.
2009-05-28 13:37:24 -04:00
ftrace_filter=[function-list]
[FTRACE] Limit the functions traced by the function
2020-12-31 20:08:31 -08:00
tracer at boot up. function-list is a comma-separated
2009-05-28 13:37:24 -04:00
list of functions. This list can be changed at run
time by the set_ftrace_filter file in the debugfs
2011-08-13 12:34:52 -07:00
tracing directory.
2009-05-28 13:37:24 -04:00
ftrace_notrace=[function-list]
[FTRACE] Do not trace the functions specified in
function-list. This list can be changed at run time
by the set_ftrace_notrace file in the debugfs
tracing directory.
2008-11-01 19:57:37 +01:00
2009-10-12 22:17:21 +02:00
ftrace_graph_filter=[function-list]
[FTRACE] Limit the top-level caller functions traced
by the function graph tracer at boot up.
2020-12-31 20:08:31 -08:00
function-list is a comma-separated list of functions
2009-10-12 22:17:21 +02:00
that can be changed at run time by the
set_graph_function file in the debugfs tracing directory.
2014-06-13 01:23:50 +09:00
ftrace_graph_notrace=[function-list]
[FTRACE] Do not trace from the functions specified in
2020-12-31 20:08:31 -08:00
function-list. This list is a comma-separated list of
2014-06-13 01:23:50 +09:00
functions that can be changed at run time by the
set_graph_notrace file in the debugfs tracing directory.
2017-03-02 16:12:15 -08:00
ftrace_graph_max_depth=<uint>
[FTRACE] Used with the function graph tracer. This is
the max depth it will trace into a function. This value
can be changed at run time by the max_graph_depth file
in the tracefs tracing directory. default: 0 (no limit)
2020-02-21 17:40:35 -08:00
fw_devlink= [KNL] Create device links between consumer and supplier
devices by scanning the firmware to infer the
consumer/supplier relationships. This feature is
especially useful when drivers are loaded as modules as
it ensures proper ordering of tasks like device probing
(suppliers first, then consumers), supplier boot state
clean up (only after all consumers have probed),
suspend/resume & runtime PM (consumers first, then
suppliers).
Format: { off | permissive | on | rpm }
off -- Don't create device links from firmware info.
permissive -- Create device links from firmware info
but use it only for ordering boot state clean
up (sync_state() calls).
on -- Create device links from firmware info and use it
to enforce probe and suspend/resume ordering.
rpm -- Like "on", but also use them to order runtime PM.
2021-02-05 14:26:39 -08:00
fw_devlink.strict=<bool>
[KNL] Treat all inferred dependencies as mandatory
dependencies. This only applies for fw_devlink=on|rpm.
Format: <bool>
2005-04-16 15:20:36 -07:00
gamecon.map[2|3]=
[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
support via parallel port (up to 5 devices per port)
Format: <port#>,<pad1>,<pad2>,<pad3>,<pad4>,<pad5>
2017-10-10 12:36:23 -05:00
See also Documentation/input/devices/joystick-parport.rst
2005-04-16 15:20:36 -07:00
gamma= [HW,DRM]
2020-08-09 19:49:41 -07:00
gart_fix_e820= [X86-64] disable the fix e820 for K8 GART
x86: disable the GART early, 64-bit
For K8 systems with 4G of RAM and memory hole remapping enabled, or with
more than 4G of RAM installed:
when a second kernel is started with kexec and the first kernel does not
include gart_shutdown, the second kernel could place the aperture at a
different position than the first kernel did, and could then use as RAM a
hole that is still used by the GART set up by the first kernel. This
happens especially when kexec'ing a 2.6.24 kernel with sparsemem enabled
from a previous kernel (from RHEL 5 or SLES 10): the new kernel uses the
aperture set up by the first kernel's GART for vmemmap, and after the new
kernel sets up its own GART, that position becomes real RAM again and the
_mapcount values stored there are lost.
Bad page state in process 'swapper'
page:ffffe2000e600020 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 0, comm: swapper Not tainted 2.6.24-rc7-smp-gcdf71a10-dirty #13
Call Trace:
[<ffffffff8026401f>] bad_page+0x63/0x8d
[<ffffffff80264169>] __free_pages_ok+0x7c/0x2a5
[<ffffffff80ba75d1>] free_all_bootmem_core+0xd0/0x198
[<ffffffff80ba3a42>] numa_free_all_bootmem+0x3b/0x76
[<ffffffff80ba3461>] mem_init+0x3b/0x152
[<ffffffff80b959d3>] start_kernel+0x236/0x2c2
[<ffffffff80b9511a>] _sinittext+0x11a/0x121
and
[ffffe2000e600000-ffffe2000e7fffff] PMD ->ffff81001c200000 on node 0
phys addr is : 0x1c200000
RHEL 5.1 kernel -53 said:
PCI-DMA: aperture base @ 1c000000 size 65536 KB
new kernel said:
Mapping aperture over 65536 KB of RAM @ 3c000000
So try to disable that GART if possible.
According to Ingo
> hm, i'm wondering, instead of modifying the GART, why dont we simply
> _detect_ whatever GART settings we have inherited, and propagate that
> into our e820 maps? I.e. if there's inconsistency, then punch that out
> from the memory maps and just dont use that memory.
>
> that way it would not matter whether the GART settings came from a [old
> or crashing] Linux kernel that has not called gart_iommu_shutdown(), or
> whether it's a BIOS that has set up an aperture hole inconsistent with
> the memory map it passed. (or the memory map we _think_ i tried to pass
> us)
>
> it would also be more robust to only read and do a memory map quirk
> based on that, than actively trying to change the GART so early in the
> bootup. Later on we have to re-enable the GART _anyway_ and have to
> punch a hole for it.
>
> and as a bonus, we would have shored up our defenses against crappy
> BIOSes as well.
Add an e820 modification for inconsistent GART settings.
gart_fix_e820=off can be used to disable the e820 fix.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:09 +01:00
Format: off | on
default: on
2009-06-17 16:28:08 -07:00
gcov_persist= [GCOV] When non-zero (default), profiling data for
kernel modules is saved and remains accessible via
debugfs, even when the module is unloaded/reloaded.
When zero, profiling data is discarded and associated
debugfs files are removed at module unload time.
2017-02-15 11:11:50 +01:00
goldfish [X86] Enable the goldfish android emulator platform.
Don't use this unless you are running on the
Android emulator.
2021-03-29 13:16:47 +02:00
gpio-mockup.gpio_mockup_ranges
[HW] Sets the ranges of gpiochips for this device.
Format: <start1>,<end1>,<start2>,<end2>...
2021-03-29 13:16:48 +02:00
gpio-mockup.gpio_mockup_named_lines
[HW] Let the driver know GPIO lines should be named.
2021-03-29 13:16:47 +02:00
2005-04-16 15:20:36 -07:00
gpt [EFI] Forces a disk with a valid GPT signature but
2014-01-23 15:56:03 -08:00
an invalid Protective MBR to be treated as GPT. If the
primary GPT is corrupted, it enables the backup/alternate
GPT to be used instead.
2005-04-16 15:20:36 -07:00
2012-11-15 08:47:14 +01:00
grcan.enable0= [HW] Configuration of physical interface 0. Determines
the "Enable 0" bit of the configuration register.
Format: 0 | 1
Default: 0
grcan.enable1= [HW] Configuration of physical interface 1. Determines
the "Enable 1" bit of the configuration register.
Format: 0 | 1
Default: 0
grcan.select= [HW] Select which physical interface to use.
Format: 0 | 1
Default: 0
grcan.txsize= [HW] Sets the size of the tx buffer.
Format: <unsigned int> such that (txsize & ~0x1fffc0) == 0.
Default: 1024
grcan.rxsize= [HW] Sets the size of the rx buffer.
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024
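The format constraint on txsize and rxsize means that no bits may be set outside 0x1fffc0: the value must be a multiple of 64 (bits 0 to 5 clear) and below 0x200000 (2 MiB). A small check and a round-up helper, as a sketch; both function names are hypothetical, and the mask test on its own also accepts 0, which the driver may or may not reject separately.

```c
#include <stdbool.h>

#define GRCAN_SIZE_MASK 0x1fffc0u   /* hypothetical name for the mask above */

/* True if size satisfies (size & ~0x1fffc0) == 0. */
static bool grcan_bufsize_valid(unsigned int size)
{
	return (size & ~GRCAN_SIZE_MASK) == 0;
}

/* Round up to the next multiple of 64, clamped to the largest valid size. */
static unsigned int grcan_bufsize_round_up(unsigned int size)
{
	if (size >= GRCAN_SIZE_MASK)
		return GRCAN_SIZE_MASK;
	return (size + 63u) & ~63u;
}
```

So the default of 1024 passes as-is, whereas a value such as 1000 fails the check and would round up to 1024.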
2022-04-02 22:48:21 -07:00
hardened_usercopy=
[KNL] Under CONFIG_HARDENED_USERCOPY, whether
hardening is enabled for this boot. Hardened
usercopy checking is used to protect the kernel
from reading or writing beyond known memory
allocation boundaries as a proactive defense
against bounds-checking flaws in the kernel's
copy_to_user()/copy_from_user() interface.
on Perform hardened usercopy checks (default).
off Disable hardened usercopy checks.
2015-11-05 18:44:41 -08:00
hardlockup_all_cpu_backtrace=
[KNL] Whether the hard-lockup detector should generate
backtraces on all CPUs.
2020-06-07 21:40:42 -07:00
Format: 0 | 1
2015-11-05 18:44:41 -08:00
2005-04-16 15:20:36 -07:00
hashdist= [KNL,NUMA] Large hashes allocated during boot
are distributed across NUMA nodes. Defaults on
2011-08-13 12:34:52 -07:00
for 64-bit NUMA, off otherwise.
2005-10-23 12:57:11 -07:00
Format: 0 | 1 (for off | on)
2005-04-16 15:20:36 -07:00
hcl= [IA-64] SGI's Hardware Graph compatibility layer
hd= [EIDE] (E)IDE hard drive subsystem geometry
Format: <cyl>,<head>,<sect>
2010-05-18 14:35:15 +08:00
hest_disable [ACPI]
Disable Hardware Error Source Table (HEST) support;
corresponding firmware-first mode error processing
logic will be disabled.
2022-04-02 22:48:21 -07:00
hibernate= [HIBERNATION]
noresume Don't check if there's a hibernation image
present during boot.
nocompress Don't compress/decompress hibernation images.
no Disable hibernation and resume.
protect_image Turn on image protection during restoration
(that will set all pages holding image data
during restoration read-only).
2005-04-16 15:20:36 -07:00
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
size on bigger boxes.
2007-02-16 01:28:11 -08:00
highres= [KNL] Enable/disable high resolution timer mode.
Valid parameters: "on", "off"
Default: "on"
2009-04-05 15:55:22 -07:00
hlt [BUGS=ARM,SH]

hostname= [KNL] Set the hostname (aka UTS nodename).
Format: <string>
This allows setting the system's hostname during early
startup. This sets the name returned by gethostname.
Using this parameter to set the hostname makes it
possible to ensure the hostname is correctly set before
any userspace processes run, avoiding the possibility
that a process may call gethostname before the hostname
has been explicitly set, resulting in the calling
process getting an incorrect result. The string must
not exceed the maximum allowed hostname length (usually
64 characters) and will be truncated otherwise.
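The length rule above can be illustrated with a minimal userspace C sketch. It assumes a 64-character limit plus a terminating NUL. The names `nodename` and `set_early_hostname()` are hypothetical stand-ins, and the code is not the kernel's own handler.

```c
#include <stdio.h>
#include <string.h>

/* Assumed limit: the usual maximum hostname length of 64 characters. */
#define HOSTNAME_MAX_LEN 64

/* Hypothetical stand-in for the kernel's UTS nodename buffer. */
static char nodename[HOSTNAME_MAX_LEN + 1];

/*
 * Store the value of hostname= in nodename, keeping at most
 * HOSTNAME_MAX_LEN characters. Returns 1 if the value was truncated,
 * 0 otherwise.
 */
static int set_early_hostname(const char *arg)
{
	/* snprintf() copies at most sizeof(nodename) - 1 characters
	 * and always NUL-terminates the result. */
	snprintf(nodename, sizeof(nodename), "%s", arg);
	return strlen(arg) > HOSTNAME_MAX_LEN;
}
```

With this sketch, hostname=buildbox leaves "buildbox" in the buffer. A 99-character value is cut to its first 64 characters, and the function reports the truncation.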

hpet= [X86-32,HPET] option to control HPET usage
Format: { enable (default) | disable | force |
verbose }
disable: disable HPET and use PIT instead
force: allow force-enabling of undocumented chips (ICH4,
VIA, nVidia)
verbose: show contents of HPET registers during setup

hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.

hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
the default huge page size. If using node format, the
number of pages to allocate per-node can be specified.
See also Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer> or (node format)
<node>:<integer>[,<node>:<integer>]
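The two accepted forms of the value can be told apart by the presence of a colon. The following C sketch parses either form into a global count or into per-node counts. `MAX_NUMNODES` and `parse_hugepages()` are hypothetical. The kernel's real parser, including its validation of node IDs, may behave differently.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NUMNODES 8	/* hypothetical node limit for this sketch */

/*
 * Parse a hugepages= value. A plain integer is stored in *global;
 * node format "<node>:<integer>[,<node>:<integer>]" fills per_node[].
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_hugepages(const char *s, unsigned long *global,
			   unsigned long per_node[MAX_NUMNODES])
{
	char *end;

	memset(per_node, 0, sizeof(unsigned long) * MAX_NUMNODES);
	*global = 0;

	if (!strchr(s, ':')) {
		/* Plain integer: a count for the current page size. */
		*global = strtoul(s, &end, 10);
		return (end != s && *end == '\0') ? 0 : -1;
	}

	while (*s) {
		unsigned long node = strtoul(s, &end, 10);

		if (end == s || *end != ':' || node >= MAX_NUMNODES)
			return -1;
		s = end + 1;

		unsigned long count = strtoul(s, &end, 10);

		if (end == s)
			return -1;
		per_node[node] = count;

		/* Entries are separated by commas. */
		s = end;
		if (*s == ',')
			s++;
		else if (*s != '\0')
			return -1;
	}
	return 0;
}
```

Under this reading, hugepagesz=1G hugepages=0:2,1:2 would request two 1G pages on each of nodes 0 and 1.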

hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture-dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]

hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
of gigantic hugepages. Alternatively, using node format,
the size of a CMA area per node can be specified.
Format: nn[KMGTPE] or (node format)
<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
Reserve a CMA area of given size and allocate gigantic
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
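Size values such as nn[KMGTPE] use binary multiples. The C sketch below converts such a string to bytes, in the spirit of the kernel's memparse() helper. `parse_size()` is a hypothetical name. Unlike memparse(), it rejects trailing characters, and it does not check for overflow.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse "nn[KMGTPE]" (suffix case-insensitive) into a byte count.
 * Each suffix step multiplies by 1024. Returns 0 on success, -1 on error.
 */
static int parse_size(const char *s, unsigned long long *bytes)
{
	char *end;
	unsigned long long val = strtoull(s, &end, 0);
	int shift = 0;

	if (end == s)
		return -1;

	switch (toupper((unsigned char)*end)) {
	case 'E': shift += 10; /* fall through */
	case 'P': shift += 10; /* fall through */
	case 'T': shift += 10; /* fall through */
	case 'G': shift += 10; /* fall through */
	case 'M': shift += 10; /* fall through */
	case 'K': shift += 10;
		end++;	/* consume the suffix */
		break;
	default:
		break;	/* no suffix: plain bytes */
	}
	if (*end != '\0')
		return -1;
	*bytes = val << shift;
	return 0;
}
```

For example, hugetlb_cma=2G would correspond to 2 << 30 bytes under this sketch.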

hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
Controls whether HugeTLB Vmemmap Optimization (HVO) is enabled.
Allows heavy hugetlb users to free up some more
memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
on: enable HVO
off: disable HVO
If built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.
Note that the vmemmap pages may be allocated from the added
memory block itself when memory_hotplug.memmap_on_memory is
enabled; those vmemmap pages cannot be optimized even if this
feature is enabled. Other vmemmap pages not allocated from
the added memory block itself are not affected.
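The 7 * PAGE_SIZE figure follows from simple arithmetic. The C sketch below makes three assumptions: 4 KiB base pages, a 64-byte struct page (typical on x86-64, but not universal), and that HVO keeps one vmemmap page per HugeTLB page and frees the rest. The helper names are hypothetical.

```c
/* Assumptions for this sketch: 4 KiB base pages, 64-byte struct page. */
#define BASE_PAGE_SIZE   4096UL
#define STRUCT_PAGE_SIZE 64UL

/* Number of vmemmap pages holding the struct pages of one HugeTLB page. */
static unsigned long vmemmap_pages(unsigned long hugepage_bytes)
{
	unsigned long nr_struct_pages = hugepage_bytes / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Pages freed if one vmemmap page is kept and the others are released. */
static unsigned long hvo_freed_pages(unsigned long hugepage_bytes)
{
	unsigned long n = vmemmap_pages(hugepage_bytes);

	return n > 1 ? n - 1 : 0;
}
```

For a 2 MiB page, 512 struct pages occupy 32 KiB, which is 8 vmemmap pages. Seven of those can be freed, matching the text above. Under the same assumptions, a 1 GiB page would free 4095 vmemmap pages.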

hung_task_panic=
[KNL] Controls whether the hung task detector generates panics.
Format: 0 | 1
A value of 1 instructs the kernel to panic when a
hung task is detected. The default value is controlled
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
option. The value selected by this boot parameter can
be changed later by the kernel.hung_task_panic sysctl.

hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8

hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.

hv_nopvspin [X86,HYPER_V] Disables the paravirt spinlock optimizations
which allow the hypervisor to 'idle' the
guest on lock contention.

i2c_bus= [HW] Override the default board-specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>

i8042.debug [HW] Toggle i8042 debug mode
i8042.unmask_kbd_data
[HW] Enable printing of interrupt data from the KBD port
(disabled by default, and as a pre-condition
requires that i8042.debug=1 be enabled)
i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from
keyboard and cannot control its state
(Don't attempt to blink the LEDs)
i8042.noaux [HW] Don't check for auxiliary (== mouse) port
i8042.nokbd [HW] Don't check/create keyboard port
i8042.noloop [HW] Disable the AUX Loopback command while probing
for the AUX port
i8042.nomux [HW] Don't check presence of an active multiplexing
controller
i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX
controllers
i8042.notimeout [HW] Ignore timeout condition signalled by controller
i8042.reset [HW] Reset the controller during init, cleanup and
suspend-to-RAM (s2r) transitions, only during s2r
transitions, or never
Format: { 1 | Y | y | 0 | N | n }
1, Y, y: always reset controller
0, N, n: don't ever reset controller
Default: only on s2r transitions on x86; most other
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
i8042.probe_defer
[HW] Allow deferred probing upon i8042 probe errors
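The i8042.reset values described above amount to a three-way policy. The following C sketch models it, treating an absent parameter as the x86 default. The enum, `parse_i8042_reset()` and `should_reset()` are hypothetical and do not mirror the driver's internal names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies: never, always, or only on s2r transitions. */
enum reset_mode { RESET_NEVER, RESET_ALWAYS, RESET_ON_S2R };

/*
 * Map an i8042.reset value to a policy. A NULL value models the
 * parameter being absent, i.e. the x86 default. Returns -1 for values
 * outside { 1 | Y | y | 0 | N | n }.
 */
static int parse_i8042_reset(const char *val, enum reset_mode *mode)
{
	if (val == NULL) {
		*mode = RESET_ON_S2R;
		return 0;
	}
	if (val[0] != '\0' && val[1] == '\0') {
		switch (val[0]) {
		case '1': case 'Y': case 'y':
			*mode = RESET_ALWAYS;
			return 0;
		case '0': case 'N': case 'n':
			*mode = RESET_NEVER;
			return 0;
		}
	}
	return -1;
}

/* Decide whether to reset the controller for a given transition. */
static bool should_reset(enum reset_mode mode, bool is_s2r)
{
	return mode == RESET_ALWAYS || (mode == RESET_ON_S2R && is_s2r);
}
```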

i810= [HW,DRM]

i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
brightness value of 0 indicates backlight switched off,
and the maximum brightness value sets the backlight
to maximum brightness. If this parameter is set to 0
(default) and the machine requires it, or this parameter
is set to 1, a brightness value of 0 sets the backlight
to maximum brightness, and the maximum brightness
value switches the backlight off.
-1 -- never invert brightness
0 -- machine default
1 -- force brightness inversion
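The three parameter values combine with a per-machine requirement as follows. This C sketch assumes the inverted mapping is a simple max - level reflection. `machine_needs_inversion` and both function names are hypothetical, and the driver's actual backlight programming is not shown.

```c
#include <stdbool.h>

/*
 * invert_param: -1 never invert, 0 machine default, 1 force inversion.
 * machine_needs_inversion: hypothetical per-machine quirk flag.
 */
static bool brightness_inverted(int invert_param, bool machine_needs_inversion)
{
	if (invert_param < 0)
		return false;
	if (invert_param > 0)
		return true;
	return machine_needs_inversion;
}

/* Map a requested level (assumed 0..max) to the value sent to hardware. */
static unsigned int hw_brightness(unsigned int level, unsigned int max,
				  bool inverted)
{
	return inverted ? max - level : level;
}
```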

icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]

idle= [X86]
Format: idle=poll, idle=halt, idle=nomwait
idle=poll: forces a polling idle loop that can slightly
improve the performance of waking up an idle CPU, but
will use a lot of power and make the system run hot.
Not recommended.
idle=halt: Halt is forced to be used for CPU idle.
In that case, C2/C3 states will not be used.
idle=nomwait: Disable mwait for CPU C-states

idxd.sva= [HW]
Format: <bool>
Allow force disabling of Shared Virtual Memory (SVA)
support for the idxd driver. By default it is set to
true (1).

idxd.tc_override= [HW]
Format: <bool>
Allow override of default traffic class configuration
for the device. By default it is set to false (0).

ieee754= [MIPS] Select IEEE Std 754 conformance mode
Format: { strict | legacy | 2008 | relaxed }
Default: strict
Choose which programs will be accepted for execution
based on the IEEE 754 NaN encoding(s) supported by
the FPU and the NaN encoding requested with the value
of an ELF file header flag individually set by each
binary. Hardware implementations are permitted to
support either or both of the legacy and the 2008 NaN
encoding mode.
Available settings are as follows:
strict accept binaries that request a NaN encoding
supported by the FPU
legacy only accept legacy-NaN binaries, if supported
by the FPU
2008 only accept 2008-NaN binaries, if supported
by the FPU
relaxed accept any binaries regardless of whether
supported by the FPU
The FPU emulator is always able to support both NaN
encodings, so if no FPU hardware is present or it has
been disabled with 'nofpu', then the settings of
'legacy' and '2008' strap the emulator accordingly,
'relaxed' straps the emulator for both legacy-NaN and
2008-NaN, whereas 'strict' enables legacy-NaN only on
legacy processors and both NaN encodings on MIPS32 or
MIPS64 CPUs.
The setting for ABS.fmt/NEG.fmt instruction execution
mode generally follows that for the NaN encoding,
except where unsupported by hardware.
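The acceptance rules for a hardware FPU can be sketched in C under one reading of the list above. The FPU-emulator cases are not modelled, and the type and function names are hypothetical. The kernel's actual checks may differ in detail.

```c
#include <stdbool.h>

enum ieee754_mode { IEEE754_STRICT, IEEE754_LEGACY, IEEE754_2008, IEEE754_RELAXED };

/* NaN encodings supported by the FPU hardware (hypothetical type). */
struct fpu_nan_support {
	bool legacy;
	bool nan2008;
};

/*
 * Decide whether a binary may run. binary_wants_2008 models the
 * NaN-encoding flag in the binary's ELF header.
 */
static bool accept_binary(enum ieee754_mode mode, struct fpu_nan_support fpu,
			  bool binary_wants_2008)
{
	bool supported = binary_wants_2008 ? fpu.nan2008 : fpu.legacy;

	switch (mode) {
	case IEEE754_STRICT:	/* requested encoding must be supported */
		return supported;
	case IEEE754_LEGACY:	/* legacy-NaN binaries only */
		return !binary_wants_2008 && fpu.legacy;
	case IEEE754_2008:	/* 2008-NaN binaries only */
		return binary_wants_2008 && fpu.nan2008;
	case IEEE754_RELAXED:	/* accept regardless of FPU support */
		return true;
	}
	return false;
}
```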

ignore_loglevel [KNL]
Ignore loglevel setting - this will print /all/
kernel messages to the console. Useful for debugging.
It is also available as a printk module parameter, so
it can be changed at runtime via
/sys/module/printk/parameters/ignore_loglevel.

ignore_rlimit_data
Ignore RLIMIT_DATA setting for data mappings,
print warning at first misuse. Can be changed via
/sys/module/kernel/parameters/ignore_rlimit_data.

ihash_entries= [KNL]
Set number of hash buckets for inode cache.

ima_appraise= [IMA] appraise integrity measurements
Format: { "off" | "enforce" | "fix" | "log" }
default: "enforce"

ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead.
The builtin appraise policy appraises all files
owned by uid=0.

ima_canonical_fmt [IMA]
Use the canonical format for the binary runtime
measurements, instead of host native format.

ima_hash= [IMA]
Format: { md5 | sha1 | rmd160 | sha256 | sha384
| sha512 | ... }
default: "sha1"
The list of supported hash algorithms is defined
in crypto/hash_info.h.

ima_policy= [IMA]
The builtin policies to load during IMA setup.
Format: "tcb | appraise_tcb | secure_boot |
fail_securely | critical_data"
The "tcb" policy measures all programs exec'd, files
mmap'd for exec, and all files opened with the read
mode bit set by either the effective uid (euid=0) or
uid=0.
The "appraise_tcb" policy appraises the integrity of
all files owned by root.
The "secure_boot" policy appraises the integrity
of files (e.g. kexec kernel image, kernel modules,
firmware, policy, etc.) based on file signatures.
The "fail_securely" policy forces file signature
verification failure also on privileged mounted
filesystems with the SB_I_UNVERIFIABLE_SIGNATURE
flag.
The "critical_data" policy measures kernel integrity
critical data.

ima_tcb [IMA] Deprecated. Use ima_policy= instead.
Load a policy which meets the needs of the Trusted
Computing Base. This means IMA will measure all
programs exec'd, files mmap'd for exec, and all files
opened for read by uid=0.

ima_template= [IMA]
Select one of the defined IMA measurement template formats.
Formats: { "ima" | "ima-ng" | "ima-ngv2" | "ima-sig" |
"ima-sigv2" }
Default: "ima-ng"

ima_template_fmt=
[IMA] Define a custom template format.
Format: { "field1|...|fieldN" }
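A custom format is a '|'-separated list of template field identifiers. The C sketch below splits such a string and rejects empty fields. `split_template_fmt()` and `MAX_TEMPLATE_FIELDS` are hypothetical. The field names in the example ("d-ng", "n-ng", "sig") are only illustrations. The kernel presumably also checks each name against its known fields, which this sketch does not do.

```c
#include <string.h>

#define MAX_TEMPLATE_FIELDS 16	/* hypothetical cap for this sketch */

/*
 * Split fmt (e.g. "d-ng|n-ng|sig") into field names. fmt is copied
 * into buf, which is modified in place; fields[] receives pointers
 * into buf. Returns the number of fields, or -1 on error.
 */
static int split_template_fmt(const char *fmt, char *buf, size_t bufsz,
			      char *fields[MAX_TEMPLATE_FIELDS])
{
	int n = 0;
	char *p;

	if (strlen(fmt) >= bufsz)
		return -1;
	strcpy(buf, fmt);

	p = buf;
	for (;;) {
		char *sep = strchr(p, '|');

		if (sep)
			*sep = '\0';	/* terminate the current field */
		if (*p == '\0' || n == MAX_TEMPLATE_FIELDS)
			return -1;	/* empty field or too many fields */
		fields[n++] = p;
		if (!sep)
			break;
		p = sep + 1;
	}
	return n;
}
```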

ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
Format: <min_file_size>
Set the minimum file size for using asynchronous hash.
If left unspecified, ahash usage is disabled.
ahash performance varies for different data sizes on
different crypto accelerators. This option can be used
to achieve the best performance for particular hardware.

ima.ahash_bufsize= [IMA] Asynchronous hash buffer size
Format: <bufsize>
Set hashing buffer size. Default: 4k.
ahash performance varies for different chunk sizes on
different crypto accelerators. This option can be used
to achieve the best performance for particular hardware.
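The two ahash parameters can be read as a size threshold and a chunk size. This C sketch makes two assumptions: a minimum size of 0 stands for "unspecified", and a file is hashed in consecutive buffers of ima.ahash_bufsize bytes. Both helper names are hypothetical.

```c
#include <stdbool.h>

/* ahash_minsize == 0 models "left unspecified": ahash is disabled. */
static bool use_ahash(unsigned long long file_size,
		      unsigned long long ahash_minsize)
{
	return ahash_minsize != 0 && file_size >= ahash_minsize;
}

/* Number of bufsize-sized chunks needed for file_size bytes (bufsize > 0). */
static unsigned long long ahash_chunks(unsigned long long file_size,
				       unsigned long long bufsize)
{
	return (file_size + bufsize - 1) / bufsize;
}
```

Under these assumptions, with the default 4k buffer, a 10000-byte file would be hashed in three chunks.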

init= [KNL]
Format: <full_path>
Run specified binary instead of /sbin/init as init
process.

initcall_debug [KNL] Trace initcalls as they are executed. Useful
for working out where the kernel is dying during
startup.

initcall_blacklist= [KNL] Do not execute a comma-separated list of
initcall functions. Useful for debugging built-in
modules and initcalls.
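Matching is by whole function name within the comma-separated list. The following is a minimal C sketch of such a lookup. `initcall_blacklisted()` is a hypothetical helper, and the kernel may store and search the list differently.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if fn_name appears as a complete entry in the
 * comma-separated blacklist string given to initcall_blacklist=.
 */
static bool initcall_blacklisted(const char *blacklist, const char *fn_name)
{
	size_t len = strlen(fn_name);
	const char *p = blacklist;

	while (*p) {
		const char *comma = strchr(p, ',');
		size_t entry_len = comma ? (size_t)(comma - p) : strlen(p);

		/* Require an exact match, not just a common prefix. */
		if (entry_len == len && strncmp(p, fn_name, len) == 0)
			return true;
		if (!comma)
			break;
		p = comma + 1;
	}
	return false;
}
```

For example, initcall_blacklist=foo_init,bar_init would skip both initcalls. The names foo_init and bar_init are hypothetical.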
init/initramfs.c: do unpacking asynchronously
Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
These two patches are independent, but better-together.
The second is a rather trivial patch that simply allows the developer to
change "/sbin/modprobe" to something else - e.g. the empty string, so
that all request_module() during early boot return -ENOENT early, without
even spawning a usermode helper, needlessly synchronizing with the
initramfs unpacking.
The first patch delegates decompressing the initramfs to a worker thread,
allowing do_initcalls() in main.c to proceed to the device_ and late_
initcalls without waiting for that decompression (and populating of
rootfs) to finish. Obviously, some of those later calls may rely on the
initramfs being available, so I've added synchronization points in the
firmware loader and usermodehelper paths - there might be other places
that would need this, but so far no one has been able to think of any
places I have missed.
There's not much to win if most of the functionality needed during boot is
only available as modules. But systems with a custom-made .config and
initramfs can boot faster, partly due to utilizing more than one cpu
earlier, partly by avoiding known-futile modprobe calls (which would still
trigger synchronization with the initramfs unpacking, thus eliminating
most of the first benefit).
This patch (of 2):
Most of the boot process doesn't actually need anything from the
initramfs, until of course PID1 is to be executed. So instead of doing
the decompressing and populating of the initramfs synchronously in
populate_rootfs() itself, push that off to a worker thread.
This is primarily motivated by an embedded ppc target, where unpacking
even the rather modest sized initramfs takes 0.6 seconds, which is long
enough that the external watchdog becomes unhappy that it doesn't get
attention soon enough. By doing the initramfs decompression in a worker
thread, we get to do the device_initcalls and hence start petting the
watchdog much sooner.
Normal desktops might benefit as well. On my mostly stock Ubuntu kernel,
my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
That takes almost two seconds:
[ 0.201454] Trying to unpack rootfs image as initramfs...
[ 1.976633] Freeing initrd memory: 29416K
Before this patch, these lines occur consecutively in dmesg. With this
patch, the timestamps on these two lines are roughly the same as above, but
with 172 lines in between - so more than one cpu has been kept busy doing
work that would otherwise only happen after populate_rootfs()
finished.
Should one of the initcalls done after rootfs_initcall time (i.e., device_
and late_ initcalls) need something from the initramfs (say, a kernel
module or a firmware blob), it will simply wait for the initramfs
unpacking to be done before proceeding, which should in theory make this
completely safe.
But if some driver pokes around in the filesystem directly and not via one
of the official kernel interfaces (i.e. request_firmware*(),
call_usermodehelper*) that theory may not hold - also, I certainly might
have missed a spot when sprinkling wait_for_initramfs(). So there is an
escape hatch in the form of an initramfs_async= command line parameter.
Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-06 18:05:42 -07:00
initramfs_async= [KNL]
Format: <bool>
Default: 1
This parameter controls whether the initramfs
image is unpacked asynchronously, concurrently
with devices being probed and
initialized. This should normally just work,
but as a debugging aid, one can get the
historical behaviour of the initramfs
unpacking being completed before device_ and
late_ initcalls.
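The ordering that initramfs_async controls can be sketched in userspace. This is a minimal analogy, assuming POSIX threads stand in for the kernel worker thread and a mutex/condition-variable pair stands in for the synchronization point; populate_rootfs_sketch(), wait_for_initramfs_sketch() and the toy completion type are hypothetical names, not the kernel's implementation. With async set, unpacking runs concurrently and anything needing the initramfs blocks in the wait helper first; with async cleared, unpacking finishes before the call returns, like the historical behaviour.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy "completion": a flag guarded by a mutex, plus a condition
 * variable that waiters sleep on until the flag is set. */
struct toy_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static struct toy_completion unpack_done = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false
};
static int rootfs_entries; /* stands in for the populated rootfs */

static void toy_reinit(struct toy_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = false;
	pthread_mutex_unlock(&c->lock);
}

static void toy_complete(struct toy_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void toy_wait(struct toy_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Pretend to decompress the archive and populate rootfs. */
static void *unpack_initramfs(void *unused)
{
	(void)unused;
	rootfs_entries = 42;
	toy_complete(&unpack_done);
	return NULL;
}

/* async models initramfs_async=1; false gives the old ordering. */
static void populate_rootfs_sketch(bool async)
{
	pthread_t worker;

	toy_reinit(&unpack_done);
	if (async && pthread_create(&worker, NULL, unpack_initramfs, NULL) == 0) {
		pthread_detach(worker);
		return; /* later initcalls proceed while unpacking runs */
	}
	unpack_initramfs(NULL); /* synchronous path, or thread creation failed */
}

/* Called before anything reads from rootfs, e.g. a firmware lookup. */
static void wait_for_initramfs_sketch(void)
{
	toy_wait(&unpack_done);
}
```

In the synchronous case the flag is already set when populate_rootfs_sketch() returns, so the wait helper returns immediately; callers never need to know which mode was chosen.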
2005-04-16 15:20:36 -07:00
initrd= [BOOT] Specify the location of the initial ramdisk
x86/setup: Add an initrdmem= option to specify initrd physical address
Add the initrdmem option:
initrdmem=ss[KMG],nn[KMG]
which is used to specify the physical address of the initrd, almost
always an address in FLASH. Also add code for x86 to use the existing
phys_initrd_start and phys_initrd_size variables in the kernel.
This is useful in cases where a kernel and an initrd are placed in FLASH,
but there is no firmware file system structure in the FLASH.
One such situation occurs when unused FLASH space on UEFI systems has
been reclaimed by, e.g., taking it from the Management Engine. For
example, on many systems, the ME is given half the FLASH part; not only
is 2.75M of an 8M part unused, but 10.75M of a 16M part is unused. This
space can be used to contain an initrd, but Linux needs to be told where
it is.
This space is "raw": due to UEFI limitations, for example, it cannot be
added to UEFI firmware volumes without rebuilding UEFI from source or
writing a UEFI device driver. It can be referenced only as a physical
address and size.
At the same time, if a kernel can be "netbooted" or loaded from GRUB or
syslinux, the option of not using the physical address specification
should be available.
Then, it is easy to boot the kernel and provide an initrd; or boot the
kernel and let it use the initrd in FLASH. In practice, this has
proven to be very helpful when integrating Linux into FLASH on x86.
Hence, the most flexible and convenient path is to enable the initrdmem
command line option in a way that it is the last choice tried.
For example, on the DigitalLoggers Atomic Pi, an image can be burnt into
FLASH with a built-in command line which includes:
initrdmem=0xff968000,0x200000
which specifies a location and size.
[ bp: Massage commit message, make it passive. ]
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: http://lkml.kernel.org/r/CAP6exYLK11rhreX=6QPyDQmW7wPHsKNEFtXE47pjx41xS6O7-A@mail.gmail.com
Link: https://lkml.kernel.org/r/20200426011021.1cskg0AGd%akpm@linux-foundation.org
2020-04-25 18:10:21 -07:00
initrdmem= [KNL] Specify a physical address and size from which to
load the initrd. If an initrd is compiled in or
specified in the bootparams, it takes priority over this
setting.
Format: ss[KMG],nn[KMG]
Default is 0, 0
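The ss[KMG],nn[KMG] format can be illustrated with a small parser. This is a sketch under assumptions: memparse_sketch() is a userspace stand-in for the kernel's size syntax (decimal, or hex with a 0x prefix, plus an optional K, M or G suffix), and parse_initrdmem() is a hypothetical helper, not the x86 setup code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a number (decimal, or hex with a 0x prefix) followed by an
 * optional K, M or G suffix. *end is left just past what was consumed. */
static uint64_t memparse_sketch(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	if (*end == s)
		return 0; /* no digits: consume nothing */
	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Hypothetical helper: split "ss[KMG],nn[KMG]" into start and size. */
static bool parse_initrdmem(const char *arg, uint64_t *start, uint64_t *size)
{
	char *end;
	const char *p;

	*start = memparse_sketch(arg, &end);
	if (end == arg || *end != ',')
		return false;
	p = end + 1;
	*size = memparse_sketch(p, &end);
	return end != p && *end == '\0';
}
```

For the Atomic Pi example, "0xff968000,0x200000" yields a start of 0xff968000 and a size of 2M. Per the description above, an initrd that is compiled in or given in the bootparams would still take priority over whatever this parses.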
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let heap users
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts]
Acked-by: James Morris <jamorris@linux.microsoft.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-11 20:59:19 -07:00
init_on_alloc= [MM] Fill newly allocated pages and heap objects with
zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
init_on_free= [MM] Fill freed pages and heap objects with zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
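Where each option zeroes memory can be shown with a toy fixed-slot allocator. This is only an illustration: the arena, toy_alloc(), toy_free() and the gfp_zero flag (standing in for a __GFP_ZERO request) are hypothetical and bear no resemblance to how the page allocator or SLUB is structured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 64
#define NSLOTS 4

static unsigned char arena[NSLOTS][SLOT_SIZE];
static bool slot_used[NSLOTS];
static bool init_on_alloc; /* models init_on_alloc=1 */
static bool init_on_free;  /* models init_on_free=1 */

/* gfp_zero models a caller explicitly asking for zeroed memory. */
static void *toy_alloc(bool gfp_zero)
{
	for (size_t i = 0; i < NSLOTS; i++) {
		if (slot_used[i])
			continue;
		slot_used[i] = true;
		/* init_on_alloc=1 acts as if every request asked for zeroing. */
		if (gfp_zero || init_on_alloc)
			memset(arena[i], 0, SLOT_SIZE);
		return arena[i];
	}
	return NULL;
}

static void toy_free(void *p)
{
	/* Recover the slot index by pointer arithmetic within arena. */
	size_t i = (size_t)((unsigned char (*)[SLOT_SIZE])p - arena);

	/* init_on_free=1 wipes the object before it can be reused. */
	if (init_on_free)
		memset(arena[i], 0, SLOT_SIZE);
	slot_used[i] = false;
}
```

With both flags clear, a reallocated slot still holds the previous owner's bytes; init_on_free clears them at free time (so a dangling pointer sees zeroes), while init_on_alloc clears them only when the slot is handed out again.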
2020-08-09 19:49:41 -07:00
init_pkru= [X86] Specify the default memory protection keys rights
x86/pkeys: Default to a restrictive init PKRU
PKRU is the register that lets you disallow writes or all access to a given
protection key.
The XSAVE hardware defines an "init state" of 0 for PKRU: its most
permissive state, allowing access/writes to everything. Since we start off
all new processes with the init state, we start all processes off with the
most permissive possible PKRU.
This is unfortunate. If a thread is clone()'d [1] before a program has
time to set PKRU to a restrictive value, that thread will be able to write
to all data, no matter what pkey is set on it. This weakens any integrity
guarantees that we want pkeys to provide.
To fix this, we define a very restrictive PKRU to override the
XSAVE-provided value when we create a new FPU context. We choose a value
that only allows access to pkey 0, which is as restrictive as we can
practically make it.
This does not cause any practical problems with applications using
protection keys because we require them to specify initial permissions for
each key when it is allocated, which override the restrictive default.
In the end, this ensures that threads which do not know how to manage their
own pkey rights cannot do damage to data which is pkey-protected.
I would have thought this was a pretty contrived scenario, except that I
heard a bug report from an MPX user who was creating threads in some very
early code before main(). It may be crazy, but folks evidently _do_ it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: mgorman@techsingularity.net
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 09:30:21 -07:00
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
override in debugfs after boot.
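The default value follows from the PKRU layout: each of the 16 protection keys owns two bits, an access-disable (AD) bit at position 2*pkey and a write-disable (WD) bit at 2*pkey+1. The sketch below (macro and helper names are hypothetical) sets AD for keys 1 through 15 and leaves key 0 untouched, which reproduces 0x55555554; the XSAVE init state of 0 permits everything.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PKEYS 16
#define PKRU_BITS_PER_PKEY 2
#define PKRU_AD 0x1u /* access-disable bit within a key's pair */
#define PKRU_WD 0x2u /* write-disable bit within a key's pair */

static uint32_t pkru_key_bits(uint32_t pkru, int pkey)
{
	return (pkru >> (pkey * PKRU_BITS_PER_PKEY)) & (PKRU_AD | PKRU_WD);
}

/* Disallow all access through every key except pkey 0. */
static uint32_t pkru_restrictive_default(void)
{
	uint32_t pkru = 0;

	for (int pkey = 1; pkey < NR_PKEYS; pkey++)
		pkru |= PKRU_AD << (pkey * PKRU_BITS_PER_PKEY);
	return pkru;
}

static bool pkru_can_read(uint32_t pkru, int pkey)
{
	return !(pkru_key_bits(pkru, pkey) & PKRU_AD);
}

static bool pkru_can_write(uint32_t pkru, int pkey)
{
	return pkru_key_bits(pkru, pkey) == 0; /* neither AD nor WD set */
}
```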
2005-04-16 15:20:36 -07:00
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
Format: <irq>
2020-08-09 19:49:41 -07:00
int_pln_enable [X86] Enable power limit notification interrupt
2013-05-21 15:35:17 -04:00
2013-03-18 14:48:02 -04:00
integrity_audit=[IMA]
Format: { "0" | "1" }
0 -- basic integrity auditing messages. (Default)
1 -- additional integrity auditing messages.
2007-10-21 16:41:49 -07:00
intel_iommu= [DMAR] Intel IOMMU driver (DMAR) option
2009-02-04 14:29:19 -08:00
on
Enable intel iommu driver.
2007-10-21 16:41:49 -07:00
off
Disable intel iommu driver.
igfx_off [Default Off]
By default, gfx is mapped as a normal device. If a gfx
device has a dedicated DMAR unit, the DMAR unit is
bypassed by not enabling DMAR with this option. In
this case, the gfx device will use physical addresses
for DMA.
2008-03-04 15:22:08 -08:00
strict [Default Off]
2021-07-12 19:12:15 +08:00
Deprecated, equivalent to iommu.strict=1.
intel-iommu: Enable super page (2MiB, 1GiB, etc.) support
There are no externally-visible changes with this. In the loop in the
internal __domain_mapping() function, we simply detect if we are mapping:
- size >= 2MiB, and
- virtual address aligned to 2MiB, and
- physical address aligned to 2MiB, and
- on hardware that supports superpages.
(and likewise for larger superpages).
We automatically use a superpage for such mappings. We never have to
worry about *breaking* superpages, since we trust that we will always
*unmap* the same range that was mapped. So all we need to do is ensure
that dma_pte_clear_range() will also cope with superpages.
Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
it can return a PTE at the appropriate level rather than always
extending the page tables all the way down to level 1. Again, this is
simplified by the fact that we should never encounter existing small
pages when we're creating a mapping; any old mapping that used the same
virtual range will have been entirely removed and its obsolete page
tables freed.
Provide an 'intel_iommu=sp_off' argument on the command line as a
chicken bit. Not that it should ever be required.
==
The original commit seen in the iommu-2.6.git was Youquan's
implementation (and completion) of my own half-baked code which I'd
typed into an email. Followed by half a dozen subsequent 'fixes'.
I've taken the unusual step of rewriting history and collapsing the
original commits in order to keep the main history simpler, and make
life easier for the people who are going to have to backport this to
older kernels. And also so I can give it a more coherent commit comment
which (hopefully) gives a better explanation of what's going on.
The original sequence of commits leading to identical code was:
Youquan Song (3):
intel-iommu: super page support
intel-iommu: Fix superpage alignment calculation error
intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()
David Woodhouse (4):
intel-iommu: Precalculate superpage support for dmar_domain
intel-iommu: Fix hardware_largepage_caps()
intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-05-25 19:13:49 +01:00
sp_off [Default Off]
By default, superpages will be used if the Intel IOMMU
has the capability. With this option, superpages will
not be used.
2021-08-18 21:48:47 +08:00
sm_on
Enable the Intel IOMMU scalable mode if the hardware
advertises that it has support for the scalable mode
translation.
sm_off
Disallow use of the Intel IOMMU scalable mode.
2017-04-26 09:18:35 -07:00
tboot_noforce [Default Off]
Do not force the Intel IOMMU to be enabled under tboot.
By default, tboot will force Intel IOMMU on, which
could harm performance of some high-throughput
devices like 40GBit network cards, even if identity
mapping is enabled.
Note that using this option lowers the security
provided by tboot because it makes the system
vulnerable to DMA attacks.
2011-12-15 01:18:52 +09:00
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and falls back on acpi_idle.
Update the maximum depth of C-state from 6 to 9
Hi Jon,
This patch is an old one, we have corrected some minor issues on the newer one.
Please only review the newest version from my last mail with this subject
"[PATCH] ACPI: Update the maximum depth of C-state from 6 to 9".
And I also attached it to this mail.
Thanks,
Baole
On 7/11/2016 6:37 AM, Jonathan Corbet wrote:
> On Mon, 4 Jul 2016 09:55:10 +0800
> "baolex.ni" <baolex.ni@intel.com> wrote:
>
>> Currently, CPUIDLE_STATE_MAX has been defined as 10 in the cpuidle head file,
>> and max_cstate = CPUIDLE_STATE_MAX – 1, so 9 is the right maximum depth of C-state.
>> This change is reflected in one place of the kernel-param file,
>> but not in the other place where I suggest changing.
>>
>> Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
>> Signed-off-by: Baole Ni <baolex.ni@intel.com>
>
> So why are there two signoffs on a single-line patch? Which one of you
> is the actual author?
>
> Thanks,
>
> jon
>
From cf5f8aa6885874f6490b11507d3c0c86fa0a11f4 Mon Sep 17 00:00:00 2001
From: Chuansheng Liu <chuansheng.liu@intel.com>
Date: Mon, 4 Jul 2016 08:52:51 +0800
Subject: [PATCH] Update the maximum depth of C-state from 6 to 9
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently, CPUIDLE_STATE_MAX has been defined as 10 in the cpuidle header file,
and max_cstate = CPUIDLE_STATE_MAX - 1, so 9 is the right maximum depth of C-state.
This change is reflected in one place of the kernel-param file,
but not in the other place where I suggest changing.
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Baole Ni <baolex.ni@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-07-11 09:57:37 +08:00
1 to 9 specify maximum depth of C-state.
2011-12-15 01:18:52 +09:00
2018-04-18 20:51:39 +02:00
intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table specifies the preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
the cpufreq sysfs interface
2013-02-15 22:55:10 +01:00
2010-07-20 11:06:49 -07:00
intremap= [X86-64, Intel-IOMMU]
on enable Interrupt Remapping (default)
off disable Interrupt Remapping
nosid disable Source ID checking
2011-08-23 17:05:18 -07:00
no_x2apic_optout
BIOS x2APIC opt-out request will be ignored
2015-09-18 22:29:56 +08:00
nopost disable Interrupt Posting
2010-07-20 11:06:49 -07:00
2009-04-05 15:55:22 -07:00
iomem= Disable strict checking of access to MMIO memory
regions from userspace.
strict
relaxed
2020-08-09 19:49:41 -07:00
iommu= [X86]
2009-04-05 15:55:22 -07:00
off
force
noforce
biomerge
panic
nopanic
merge
nomerge
soft
2020-08-09 19:49:41 -07:00
pt [X86]
nopt [X86]
2014-10-23 19:19:35 -02:00
nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
2011-10-21 15:56:24 -04:00
2021-03-05 16:32:34 +00:00
iommu.forcedac= [ARM64, X86] Control IOVA allocation for PCI devices.
Format: { "0" | "1" }
0 - Try to allocate a 32-bit DMA address first, before
falling back to the full range if needed.
1 - Allocate directly from the full usable range,
forcing Dual Address Cycle for PCI cards supporting
greater than 32-bit addressing.
2021-06-14 15:57:26 +01:00
iommu.strict= [ARM64, X86] Configure TLB invalidation behaviour
2018-09-20 17:10:23 +01:00
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
invalidation of hardware TLBs, for increased
throughput at the cost of reduced device isolation.
Will fall back to strict mode if not supported by
the relevant IOMMU driver.
2021-07-12 19:12:17 +08:00
1 - Strict mode.
2018-09-20 17:10:23 +01:00
DMA unmap operations invalidate IOMMU hardware TLBs
synchronously.
2021-08-11 13:21:37 +01:00
unset - Use value of CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT}.
Note: on x86, strict mode specified via one of the
legacy driver-specific options takes precedence.
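The trade-off between the two modes can be modelled by counting IOTLB invalidations. This is a toy model under stated assumptions: FQ_SIZE, struct toy_domain and the helpers are hypothetical, and a real lazy implementation also flushes on a timeout, which is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

#define FQ_SIZE 8 /* hypothetical batch depth for deferred invalidation */

struct toy_domain {
	bool strict;        /* models iommu.strict=1 */
	size_t pending;     /* unmapped IOVAs whose TLB entries may be stale */
	size_t tlb_flushes; /* number of IOTLB invalidations issued */
};

static void toy_flush_iotlb(struct toy_domain *d)
{
	d->tlb_flushes++;
	d->pending = 0; /* stale entries gone: these IOVAs may be reused */
}

static void toy_dma_unmap(struct toy_domain *d)
{
	d->pending++;
	/* Strict: invalidate before returning, so the device loses access
	 * immediately. Lazy: batch invalidations, leaving a window in which
	 * the device could still reach the old mapping. */
	if (d->strict || d->pending == FQ_SIZE)
		toy_flush_iotlb(d);
}
```

Sixteen unmaps cost sixteen invalidations in strict mode but only two in this lazy model, which is the throughput gain the description refers to, paid for with the isolation window.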
2018-09-20 17:10:23 +01:00
2017-01-05 18:38:26 +00:00
iommu.passthrough=
2019-08-19 15:22:56 +02:00
[ARM64, X86] Configure DMA to bypass the IOMMU by default.
2017-01-05 18:38:26 +00:00
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
2018-09-20 14:14:26 +01:00
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
2009-04-05 15:55:22 -07:00
2020-09-17 22:47:51 -07:00
io7= [HW] IO7 for Marvel-based Alpha systems
2009-04-05 15:55:22 -07:00
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
2009-04-14 14:03:43 +05:30
io_delay= [X86] I/O delay method
2008-01-30 13:30:05 +01:00
0x80
Standard port 0x80 based delay
0xed
Alternate port 0xed based delay (needed on some systems)
2008-01-30 13:30:05 +01:00
udelay
2008-01-30 13:30:05 +01:00
Simple two-microsecond delay
none
No delay
2008-01-30 13:30:05 +01:00
2005-04-16 15:20:36 -07:00
ip= [IP_PNP]
2020-02-12 19:13:32 +01:00
See Documentation/admin-guide/nfs/nfsroot.rst.
2005-04-16 15:20:36 -07:00
2019-05-14 15:46:29 -07:00
ipcmni_extend [KNL] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 16,777,216.
2016-02-03 19:52:23 +01:00
irqaffinity= [SMP] Set the default irq affinity mask
2016-10-11 13:51:35 -07:00
The argument is a cpu list, as described above.
2016-02-03 19:52:23 +01:00
2017-10-27 10:34:22 +02:00
irqchip.gicv2_force_probe=
[ARM, ARM64]
Format: <bool>
Force the kernel to look for the second 4kB page
of a GICv2 controller even if the memory range
exposed by the device tree is too small.
2018-02-25 11:27:04 +00:00
irqchip.gicv3_nolpi=
[ARM, ARM64]
Force the kernel to ignore the availability of
LPIs (and by consequence ITSs). Intended for systems
that use the kernel as a bootloader, and thus want
to leave secondary kernels in charge of setting up
LPIs.
2019-01-31 14:59:03 +00:00
irqchip.gicv3_pseudo_nmi= [ARM64]
Enables support for pseudo-NMIs in the kernel. This
requires the kernel to be built with
CONFIG_ARM64_PSEUDO_NMI.
2005-06-28 20:45:18 -07:00
irqfixup [HW]
When an interrupt is not handled, search all handlers
for it. Intended to get systems with badly broken
firmware running.
irqpoll [HW]
When an interrupt is not handled, search all handlers
for it. Also check all handlers each timer
interrupt. Intended to get systems with badly broken
firmware running.
2005-04-16 15:20:36 -07:00
isapnp= [ISAPNP]
2005-10-23 12:57:11 -07:00
Format: <RDP>,<reset>,<pci_scan>,<verbosity>
2005-04-16 15:20:36 -07:00
2017-12-14 19:18:27 +01:00
isolcpus= [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
2017-10-31 04:18:34 +01:00
[Deprecated - use cpusets instead]
Format: [flag-list,]<cpu-list>
Specify one or more CPUs to isolate from disturbances
specified in the flag list (default: domain):
nohz
Disable the tick when a single task runs.
2018-02-21 05:17:29 +01:00
A residual 1Hz tick is offloaded to workqueues, which you
need to affine to housekeeping through the global
workqueue's affinity configured via the
/sys/devices/virtual/workqueue/cpumask sysfs file, or
by using the 'domain' flag described below.
NOTE: by default the global workqueue runs on all CPUs,
so to protect individual CPUs the 'cpumask' file has to
be configured manually after bootup.
2017-10-31 04:18:34 +01:00
domain
Isolate from the general SMP balancing and scheduling
algorithms. Note that performing domain isolation this way
is irreversible: it's not possible to bring back a CPU to
the domains once isolated through isolcpus. It's strongly
advised to use cpusets instead to disable scheduler load
balancing through the "cpuset.sched_load_balance" file.
It offers a much more flexible interface where CPUs can
move in and out of an isolated set anytime.
You can move a process onto or off an "isolated" CPU via
the CPU affinity syscalls or cpuset.
<cpu number> begins at 0 and the maximum value is
"number of CPUs in system - 1".
2020-01-20 17:16:25 +08:00
managed_irq
Isolate from being targeted by managed interrupts
which have an interrupt mask containing isolated
CPUs. The affinity of managed interrupts is
handled by the kernel and cannot be changed via
the /proc/irq/* interfaces.
This isolation is best effort and only effective
if the automatically assigned interrupt mask of a
device queue contains isolated and housekeeping
CPUs. If housekeeping CPUs are online then such
interrupts are directed to the housekeeping CPU
so that IO submitted on the housekeeping CPU
cannot disturb the isolated CPU.
If a queue's affinity mask contains only isolated
CPUs then this parameter has no effect on the
interrupt routing decision, though interrupts are
only delivered when tasks running on those
isolated CPUs submit IO. IO submitted on
housekeeping CPUs has no influence on those
queues.
2005-04-16 15:20:36 -07:00
2020-01-20 17:16:25 +08:00
The format of <cpu-list> is described above.
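The [flag-list,]<cpu-list> syntax can be sketched as a small parser. Assumptions, stated plainly: parse_isolcpus() and the ISOL_* constants are hypothetical, only the nohz, domain and managed_irq flags listed above are recognised, and the cpu-list part handles just single CPUs and N-M ranges for CPUs 0 to 63, not the full cpu-list grammar referred to above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { ISOL_NOHZ = 1, ISOL_DOMAIN = 2, ISOL_MANAGED_IRQ = 4 };

static bool parse_isolcpus(const char *arg, unsigned *flags, uint64_t *cpus)
{
	*flags = 0;
	*cpus = 0;
	/* Leading alphabetic tokens are flags; a cpu-list must follow. */
	while (isalpha((unsigned char)*arg)) {
		size_t len = strcspn(arg, ",");

		if (len == 4 && !strncmp(arg, "nohz", 4))
			*flags |= ISOL_NOHZ;
		else if (len == 6 && !strncmp(arg, "domain", 6))
			*flags |= ISOL_DOMAIN;
		else if (len == 11 && !strncmp(arg, "managed_irq", 11))
			*flags |= ISOL_MANAGED_IRQ;
		else
			return false;
		if (arg[len] != ',')
			return false;
		arg += len + 1;
	}
	if (!*flags)
		*flags = ISOL_DOMAIN; /* documented default */

	/* cpu-list: comma-separated "N" or "N-M" entries. */
	do {
		char *end;

		if (!isdigit((unsigned char)*arg))
			return false;
		unsigned long lo = strtoul(arg, &end, 10), hi = lo;
		if (*end == '-') {
			const char *p = end + 1;

			if (!isdigit((unsigned char)*p))
				return false;
			hi = strtoul(p, &end, 10);
		}
		if (hi < lo || hi > 63)
			return false;
		for (unsigned long c = lo; c <= hi; c++)
			*cpus |= 1ULL << c;
		arg = end;
	} while (*arg++ == ',');
	return arg[-1] == '\0'; /* reject trailing junk */
}
```

For example, "nohz,domain,2-3" sets both flags and isolates CPUs 2 and 3, while a bare "1,3-5" falls back to the domain flag.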
2005-04-16 15:20:36 -07:00
2005-10-23 12:57:11 -07:00
iucv= [HW,NET]
2005-04-16 15:20:36 -07:00
2020-08-09 19:49:41 -07:00
ivrs_ioapic [HW,X86-64]
2013-04-09 21:27:19 +02:00
Provide an override to the IOAPIC-ID<->DEVICE-ID
2022-07-06 17:08:22 +05:30
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
2022-09-19 10:56:38 -05:00
For example, to map IOAPIC-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_ioapic=10@0001:00:14.0
Deprecated formats:
2022-07-06 17:08:22 +05:30
* To map IOAPIC-ID decimal 10 to PCI device 00:14.0
write the parameter as:
2013-04-09 21:27:19 +02:00
ivrs_ioapic[10]=00:14.0
2022-07-06 17:08:22 +05:30
* To map IOAPIC-ID decimal 10 to PCI segment 0x1 and
PCI device 00:14.0 write the parameter as:
ivrs_ioapic[10]=0001:00:14.0
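The preferred ID@[segment:]bus:device.function form (shared by ivrs_ioapic and ivrs_hpet) can be parsed with sscanf(). This sketch uses a hypothetical helper name, ignores the deprecated bracketed forms, and packs the address into the conventional 16-bit PCI requester ID, (bus << 8) | (device << 3) | function; it is not the AMD IOMMU driver's parser.

```c
#include <stdbool.h>
#include <stdio.h>

/* Parse "ID@SSSS:BB:DD.F" or "ID@BB:DD.F" (segment defaults to 0). */
static bool parse_ivrs_map(const char *arg, unsigned *id, unsigned *seg,
			   unsigned *devid)
{
	unsigned s = 0, bus = 0, dev = 0, fn = 0;
	int n = 0;
	bool ok;

	/* %n records how much was consumed so trailing junk is rejected. */
	ok = sscanf(arg, "%u@%x:%x:%x.%x%n", id, &s, &bus, &dev, &fn, &n) == 5 &&
	     arg[n] == '\0';
	if (!ok) {
		s = 0;
		n = 0;
		ok = sscanf(arg, "%u@%x:%x.%x%n", id, &bus, &dev, &fn, &n) == 4 &&
		     arg[n] == '\0';
	}
	if (!ok || s > 0xffff || bus > 0xff || dev > 0x1f || fn > 0x7)
		return false;
	*seg = s;
	*devid = (bus << 8) | (dev << 3) | fn; /* 16-bit PCI requester ID */
	return true;
}
```

Both "10@0001:00:14.0" and "10@00:14.0" map ID 10 to device 00:14.0 (requester ID 0xA0), differing only in the segment.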
2013-04-09 21:27:19 +02:00
2020-08-09 19:49:41 -07:00
ivrs_hpet [HW,X86-64]
2013-04-09 21:27:19 +02:00
Provide an override to the HPET-ID<->DEVICE-ID
2022-07-06 17:08:22 +05:30
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
2022-09-19 10:56:38 -05:00
For example, to map HPET-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_hpet=10@0001:00:14.0
Deprecated formats:
2022-07-06 17:08:22 +05:30
* To map HPET-ID decimal 0 to PCI device 00:14.0
write the parameter as:
2013-04-09 21:27:19 +02:00
ivrs_hpet[0]=00:14.0
2022-07-06 17:08:22 +05:30
* To map HPET-ID decimal 10 to PCI segment 0x1 and
PCI device 00:14.0 write the parameter as:
ivrs_hpet[10]=0001:00:14.0
2013-04-09 21:27:19 +02:00
2020-08-09 19:49:41 -07:00
ivrs_acpihid [HW,X86-64]
2016-04-01 09:06:01 -04:00
Provide an override to the ACPI-HID:UID<->DEVICE-ID
2022-07-06 17:08:22 +05:30
mapping provided in the IVRS ACPI table.
2022-09-19 10:56:38 -05:00
By default, PCI segment is 0, and can be omitted.
2022-07-06 17:08:22 +05:30
For example, to map UART-HID:UID AMD0020:0 to
PCI segment 0x1 and PCI device ID 00:14.5,
write the parameter as:
2022-09-19 10:56:38 -05:00
ivrs_acpihid=AMD0020:0@0001:00:14.5
2022-07-06 17:08:22 +05:30
2022-09-19 10:56:38 -05:00
Deprecated formats:
* To map UART-HID:UID AMD0020:0 to PCI segment 0,
PCI device ID 00:14.5, write the parameter as:
2016-04-01 09:06:01 -04:00
ivrs_acpihid[00:14.5]=AMD0020:0
2022-09-19 10:56:38 -05:00
* To map UART-HID:UID AMD0020:0 to PCI segment 0x1 and
PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[0001:00:14.5]=AMD0020:0
2016-04-01 09:06:01 -04:00
2005-04-16 15:20:36 -07:00
js= [HW,JOY] Analog joystick
2017-10-10 12:36:23 -05:00
See Documentation/input/joydev/joystick.rst.
2005-04-16 15:20:36 -07:00
2017-03-31 15:12:04 -07:00
kasan_multi_shot
[KNL] Force KASAN (Kernel Address Sanitizer) to print a
report on every invalid memory access. Without this
parameter, KASAN will print a report only for the first
invalid access.
2022-12-03 17:30:50 -08:00
keep_bootcon [KNL]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
between unregistering the boot console and initializing
the real console.
2009-04-05 15:55:22 -07:00
keepinitrd [HW,ARM]
2016-03-15 14:55:22 -07:00
kernelcore= [KNL,X86,IA-64,PPC]
mm, page_alloc: extend kernelcore and movablecore for percent
Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
the system memory capacity to be known when specifying the command line,
however.
This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory. This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example,
as a proportion of a system's memory rather than a hardcoded byte value.
To define the percentage, the final character of the parameter should be
a '%'.
mhocko: "why is anyone using these options nowadays?"
rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example. When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.
[rientjes@google.com: updates]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 16:23:09 -07:00
Format: nn[KMGTPE] | nn% | "mirror"
This parameter specifies the amount of memory usable by
the kernel for non-movable allocations. The requested
amount is spread evenly throughout all nodes in the
system as ZONE_NORMAL. The remaining memory is used for
movable memory in its own zone, ZONE_MOVABLE. In the
event that a node is too small to have both ZONE_NORMAL and
ZONE_MOVABLE, kernelcore memory will take priority and
other nodes will have a larger ZONE_MOVABLE.
ZONE_MOVABLE is used for the allocation of pages that
may be reclaimed or moved by the page migration
subsystem. Note that allocations like PTEs-from-HighMem
still use the HighMem zone if it exists, and the Normal
2007-07-17 04:03:14 -07:00
zone if it does not.
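The three accepted forms, nn[KMGTPE], nn% and "mirror", can be sketched as a small parser. This is an illustration only: parse_kernelcore() and its return kinds are hypothetical, the percentage is taken of a caller-supplied total in bytes, and neither the per-node spreading nor the handling of "mirror" is modelled.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum kc_kind { KC_BYTES, KC_PERCENT, KC_MIRROR, KC_INVALID };

/* Hypothetical parser for "nn[KMGTPE] | nn% | mirror". */
static enum kc_kind parse_kernelcore(const char *arg, uint64_t total,
				     uint64_t *bytes)
{
	static const char suffixes[] = "KMGTPE";
	char *end;
	uint64_t v;

	*bytes = 0;
	if (strcmp(arg, "mirror") == 0)
		return KC_MIRROR; /* no byte count for this form */
	if (!isdigit((unsigned char)*arg))
		return KC_INVALID;
	v = strtoull(arg, &end, 10);
	if (end[0] == '%' && end[1] == '\0') {
		if (v > 100)
			return KC_INVALID;
		/* floor(total * v / 100) without overflowing total * v */
		*bytes = (total / 100) * v + (total % 100) * v / 100;
		return KC_PERCENT;
	}
	if (*end != '\0') {
		const char *s = strchr(suffixes, toupper((unsigned char)*end));

		if (!s || end[1] != '\0')
			return KC_INVALID;
		v <<= 10 * (s - suffixes + 1); /* K = 2^10 ... E = 2^60 */
	}
	*bytes = v;
	return KC_BYTES;
}
```

On a 16G system, "25%" and "4G" therefore request the same 4G of ZONE_NORMAL; the percentage form just avoids hardcoding the capacity.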
2018-04-05 16:23:09 -07:00
It is possible to specify the exact amount of memory in
the form of "nn[KMGTPE]", a percentage of total system
memory in the form of "nn%", or "mirror". If the "mirror"
2016-03-15 14:55:22 -07:00
option is specified, mirrored (reliable) memory is used
for non-movable allocations and remaining memory is used
2018-04-05 16:23:09 -07:00
for movable pages. "nn[KMGTPE]", "nn%", and "mirror"
are exclusive, so you cannot specify multiple forms.
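To illustrate the three mutually exclusive forms, the following minimal Python sketch classifies a kernelcore= (or movablecore=) value. The function name parse_kernelcore is hypothetical and is not a kernel interface. The sketch assumes K/M/G/T/P/E are binary multipliers accepted in either case, as with the kernel's memparse() helper, and it does not reproduce the kernel's own range checks or error handling.

```python
# Hypothetical helper (not a kernel API): classify a kernelcore=/movablecore=
# value into one of its three mutually exclusive forms.
# Assumption: K/M/G/T/P/E are binary multipliers accepted in either case,
# as with the kernel's memparse(); range checks are not modeled.

_SHIFTS = {"K": 10, "M": 20, "G": 30, "T": 40, "P": 50, "E": 60}


def parse_kernelcore(value: str) -> tuple:
    """Return ("mirror", None), ("percent", n) or ("bytes", n)."""
    if value == "mirror":
        return ("mirror", None)
    if value.endswith("%"):
        # "nn%": a percentage of total system memory
        return ("percent", int(value[:-1]))
    suffix = value[-1].upper()
    if suffix in _SHIFTS:
        # "nn[KMGTPE]": a size with a binary unit suffix
        return ("bytes", int(value[:-1]) << _SHIFTS[suffix])
    # plain "nn": a size in bytes
    return ("bytes", int(value))


print(parse_kernelcore("512M"))   # ('bytes', 536870912)
print(parse_kernelcore("25%"))    # ('percent', 25)
```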
2007-07-17 04:03:14 -07:00
2010-05-20 21:04:31 -05:00
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
port as it is probed via PCI. The poll interval is
optional and is the number of seconds between
each poll of the debug port, in case you need
the ability to interrupt the kernel with
gdb or Ctrl-C on the dbgp connection. When
not using this parameter, use sysrq-g to break into
the kernel debugger.
2010-05-20 21:04:24 -05:00
kgdboc= [KGDB,HW] kgdb over consoles.
2010-05-20 21:04:24 -05:00
Requires a tty driver that supports console polling,
or a supported polling keyboard driver (non-usb).
2010-08-05 09:22:33 -05:00
Serial only format: <serial_device>[,baud]
keyboard only format: kbd
keyboard and serial format: kbd,<serial_device>[,baud]
Optional Kernel mode setting:
kms, kbd format: kms,kbd
kms, kbd and serial format: kms,kbd,<ser_dev>[,baud]
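A minimal Python sketch of how the kgdboc= formats above break down into fields. parse_kgdboc is a hypothetical helper for illustration only, not the kernel's parser. It assumes the fields appear in the order shown (optional "kms", optional "kbd", optional serial device, optional baud rate).

```python
# Hypothetical helper (not the kernel's parser): split a kgdboc= value into
# the fields of the formats listed above, assuming this order:
#   <serial_device>[,baud] | kbd | kbd,<serial_device>[,baud]
#   | kms,kbd | kms,kbd,<serial_device>[,baud]


def parse_kgdboc(value: str) -> dict:
    parts = value.split(",")
    result = {"kms": False, "kbd": False, "serial": None, "baud": None}
    if parts[0] == "kms":
        result["kms"] = True
        parts = parts[1:]
    if parts and parts[0] == "kbd":
        result["kbd"] = True
        parts = parts[1:]
    if result["kms"] and not result["kbd"]:
        raise ValueError("'kms' must be followed by 'kbd'")
    if len(parts) > 2:
        raise ValueError("too many fields")
    if parts:
        result["serial"] = parts[0]        # e.g. ttyS0
    if len(parts) == 2:
        result["baud"] = int(parts[1])     # optional baud rate
    return result
```

For example, a command line containing "kgdboc=kms,kbd,ttyS0,115200 kgdbwait" would, under this sketch, select kernel mode setting, the keyboard, ttyS0 at 115200 baud, and kgdbwait would then stop the kernel at the earliest opportunity.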
2008-04-17 20:05:38 +02:00
2020-05-07 13:08:47 -07:00
kgdboc_earlycon= [KGDB,HW]
If the boot console provides the ability to read
characters and can work in polling mode, you can use
this parameter to tell kgdb to use it as a backend
until the normal console is registered. Intended to
be used together with the kgdboc parameter which
specifies the normal console to transition to.
The name of the early console should be specified
as the value of this parameter. Note that the name of
the early console might be different from the tty
name passed to kgdboc. It's OK to leave the value
blank and the first boot console that implements
read() will be picked.
2010-05-20 21:04:24 -05:00
kgdbwait [KGDB] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
2020-09-17 22:47:22 -07:00
kmac= [MIPS] Korina ethernet MAC address.
2008-08-23 18:54:37 +02:00
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
2009-06-11 13:22:39 +01:00
kmemleak= [KNL] Boot-time kmemleak enable/disable
Valid arguments: on, off
Default: on
2014-10-24 21:24:59 +09:00
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
the default is off.
2009-06-11 13:22:39 +01:00
2019-05-22 17:32:35 +09:00
kprobe_event=[probe-list]
[FTRACE] Add kprobe events and enable at boot time.
The probe-list is a semicolon delimited list of probe
definitions. Each definition uses the same syntax as the
kprobe_events interface, but the parameters are comma
delimited. For example, to add a kprobe event on vfs_read
with arg1 and arg2, add to the command line:
kprobe_event=p,vfs_read,$arg1,$arg2
See also Documentation/trace/kprobetrace.rst "Kernel
Boot Parameter" section.
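The conversion from kprobe_events syntax to the boot parameter form can be sketched in Python as follows. to_kprobe_event_param is a hypothetical helper; it assumes each definition uses spaces between fields, as written to the tracefs kprobe_events file, and it does not handle fetch-args that themselves contain spaces or commas.

```python
# Hypothetical helper: turn kprobe_events-style definitions (fields separated
# by spaces) into one kprobe_event= boot parameter, with fields joined by
# commas and definitions joined by semicolons.


def to_kprobe_event_param(definitions: list) -> str:
    return "kprobe_event=" + ";".join(
        ",".join(d.split()) for d in definitions
    )


print(to_kprobe_event_param(["p vfs_read $arg1 $arg2"]))
# kprobe_event=p,vfs_read,$arg1,$arg2
```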
2019-01-25 12:07:00 -06:00
kpti= [ARM64] Control page table isolation of user
and kernel address spaces.
Default: enabled on cores which need mitigation.
0: force disabled
1: force enabled
2022-08-23 07:24:54 -07:00
kunit.enable= [KUNIT] Enable executing KUnit tests. Requires
CONFIG_KUNIT to be set to be fully enabled. The
default value can be overridden via
KUNIT_DEFAULT_ENABLED.
Default is 1 (enabled)
2009-07-10 14:20:35 +02:00
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page are fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPU execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
kvm.eager_page_split=
[KVM,X86] Controls whether or not KVM will try to
proactively split all huge pages during dirty logging.
Eager page splitting reduces interruptions to vCPU
execution by eliminating the write-protection faults
and MMU lock contention that would otherwise be
required to split huge pages lazily.
VM workloads that rarely perform writes or that write
only to a small region of VM memory may benefit from
disabling eager page splitting to allow huge pages to
still be used for reads.
The behavior of eager page splitting depends on whether
KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
disabled, all huge pages in a memslot will be eagerly
split when dirty logging is enabled on that memslot. If
2022-01-19 23:07:37 +00:00
enabled, eager page splitting will be performed during
the KVM_CLEAR_DIRTY ioctl, and only for the pages being
cleared.
2022-01-19 23:07:36 +00:00
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.
Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:
(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.
(2) Splitting a huge page may end up re-using an existing lower level
shadow page table. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.
(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.
Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.
This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 15:27:09 -04:00
Eager page splitting is only supported when kvm.tdp_mmu=Y.
2022-01-19 23:07:36 +00:00
Default is Y (on).
2018-03-12 13:12:47 +02:00
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
2019-11-04 12:22:02 +01:00
kvm.nx_huge_pages=
[KVM] Controls the software workaround for the
X86_BUG_ITLB_MULTIHIT bug.
force : Always deploy workaround.
off : Never deploy workaround.
auto : Deploy workaround based on the presence of
X86_BUG_ITLB_MULTIHIT.
Default is 'auto'.
If the software workaround is enabled for the host,
guests do not need to enable it for nested guests.
2019-11-04 20:26:00 +01:00
kvm.nx_huge_pages_recovery_ratio=
[KVM] Controls how many 4KiB pages are periodically zapped
back to huge pages. 0 disables the recovery, otherwise if
the value is N KVM will zap 1/Nth of the 4KiB pages every
2021-10-19 18:06:27 -07:00
period (see below). The default is 60.
kvm.nx_huge_pages_recovery_period_ms=
[KVM] Controls the time period at which KVM zaps 4KiB pages
back to huge pages. If the value is a non-zero N, KVM will
zap a portion (see ratio above) of the pages every N msecs.
If the value is 0 (the default), KVM will pick a period based
on the ratio, such that a page is zapped after 1 hour on average.
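The interaction between the ratio and the period can be worked through with a small Python sketch. It assumes that, when kvm.nx_huge_pages_recovery_period_ms is 0, the period is simply one hour divided by the ratio, which is what "a page is zapped after 1 hour on average" implies; any clamping or rounding in the real implementation is not modeled, and recovery_period_ms is a hypothetical name.

```python
# Sketch of how the two nx_huge_pages recovery parameters interact, assuming
# that with period_ms=0 the period is one hour divided by the ratio.
# Clamping or rounding done by the real implementation is not modeled.

MS_PER_HOUR = 60 * 60 * 1000


def recovery_period_ms(ratio: int, period_ms: int = 0):
    """Return the zap period in milliseconds, or None if recovery is off."""
    if ratio == 0:
        return None              # ratio 0 disables recovery entirely
    if period_ms:
        return period_ms         # an explicit non-zero period is used as-is
    return MS_PER_HOUR // ratio  # 1/ratio of pages per period, ~1 hour total


print(recovery_period_ms(60))    # 60000
```

With the default ratio of 60, this gives a period of 60000 ms: 1/60 of the 4KiB pages are zapped once a minute, so each page is zapped after about 60 minutes.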
2019-11-04 20:26:00 +01:00
2009-07-10 14:20:35 +02:00
kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
2010-09-20 22:16:45 +08:00
Default is 1 (enabled)
2009-07-10 14:20:35 +02:00
kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU)
for all guests.
2011-08-13 12:34:52 -07:00
Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
2009-07-10 14:20:35 +02:00
2020-12-02 18:40:57 +00:00
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
2021-10-01 18:05:53 +01:00
none: Forcefully disable KVM.
2021-02-08 09:57:26 +00:00
nvhe: Standard nVHE-based mode, without support for
protected guests.
2020-12-02 18:40:57 +00:00
protected: nVHE-based mode with support for guests whose
state is kept private from the host.
2023-02-09 17:58:03 +00:00
nested: VHE-based mode with support for nested
virtualization. Requires at least ARMv8.3
hardware.
2021-10-11 16:38:35 +01:00
Defaults to VHE/nVHE based on hardware support. Setting
mode to "protected" will disable kexec and hibernation
2023-02-09 17:58:03 +00:00
for the host. "nested" is experimental and should be
used with extreme caution.
2020-12-02 18:40:57 +00:00
2017-06-09 12:49:46 +01:00
kvm-arm.vgic_v3_group0_trap=
[KVM,ARM] Trap guest accesses to GICv3 group-0
system registers
2017-06-09 12:49:41 +01:00
kvm-arm.vgic_v3_group1_trap=
[KVM,ARM] Trap guest accesses to GICv3 group-1
system registers
2017-06-09 12:49:53 +01:00
kvm-arm.vgic_v3_common_trap=
[KVM,ARM] Trap guest accesses to GICv3 common
system registers
2017-10-27 15:28:54 +01:00
kvm-arm.vgic_v4_enable=
[KVM,ARM] Allow use of GICv4 for direct injection of
LPIs.
2020-09-21 14:32:20 +05:30
kvm_cma_resv_ratio=n [PPC]
Reserves the given percentage of system memory as a
contiguous memory area for KVM hash page table
allocation.
By default it reserves 5% of total system memory.
Format: <integer>
Default: 5
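As a worked example of the reservation size, the Python sketch below applies the percentage to total system memory. It assumes plain integer arithmetic; any alignment or size limits the kernel applies are not modeled, and kvm_cma_reserve_bytes is a hypothetical name.

```python
# Illustrative arithmetic for kvm_cma_resv_ratio=n: reserve n percent of
# total system memory. Alignment or size limits are not modeled.


def kvm_cma_reserve_bytes(total_bytes: int, ratio: int = 5) -> int:
    return total_bytes * ratio // 100


GiB = 1 << 30
print(kvm_cma_reserve_bytes(100 * GiB) // GiB)   # 5
```

So with the default of 5, a system with 100 GiB of memory would set aside about 5 GiB.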
2009-07-10 14:20:35 +02:00
kvm-intel.ept= [KVM,Intel] Disable extended page tables
(virtualized MMU) support on capable Intel chips.
Default is 1 (enabled)
kvm-intel.emulate_invalid_guest_state=
2021-12-07 19:30:05 +00:00
[KVM,Intel] Disable emulation of invalid guest state.
Ignored if kvm-intel.enable_unrestricted_guest=1, as
guest state is never invalid for unrestricted guests.
This param doesn't apply to nested guests (L2), as KVM
never emulates invalid L2 guest state.
Default is 1 (enabled)
2009-07-10 14:20:35 +02:00
kvm-intel.flexpriority=
[KVM,Intel] Disable FlexPriority feature (TPR shadow).
Default is 1 (enabled)
2011-08-09 14:28:35 +03:00
kvm-intel.nested=
[KVM,Intel] Enable VMX nesting (nVMX).
Default is 0 (disabled)
2009-07-10 14:20:35 +02:00
kvm-intel.unrestricted_guest=
[KVM,Intel] Disable unrestricted guest feature
(virtualized real and unpaged mode) on capable
Intel chips. Default is 1 (enabled)
2018-07-02 12:29:30 +02:00
kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
CVE-2018-3620.
Valid arguments: never, cond, always
always: L1D cache flush on every VMENTER.
cond: Flush L1D on VMENTER only when the code between
VMEXIT and VMENTER can leak host memory.
never: Disables the mitigation
Default is cond (do L1 cache flush in specific instances)
2009-07-10 14:20:35 +02:00
kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)
2021-01-08 23:10:56 +11:00
l1d_flush= [X86,INTEL]
Control mitigation for L1D based snooping vulnerability.
Certain CPUs are vulnerable to an exploit against CPU
internal buffers which can forward information to a
disclosure gadget under certain conditions.
In vulnerable processors, the speculatively
forwarded data can be used in a cache side channel
attack, to access data to which the attacker does
not have direct access.
This parameter controls the mitigation. The
options are:
on - enable the interface for the mitigation
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.
The possible values are:
full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.
flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.
flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.
off
Disables hypervisor mitigations and doesn't emit any warnings.
Default is 'flush'.
Let KVM adhere to these semantics, which means:
- 'l1tf=full,force' : Perform L1D flushes. No runtime control
possible.
- 'l1tf=full'
- 'l1tf=flush'
- 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
SMT has been runtime enabled or L1D flushing
has been run-time enabled
- 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.
- 'l1tf=off' : L1D flushes are not performed and no warnings
are emitted.
KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.
This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.
Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
2018-07-13 16:23:25 +02:00
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs
The kernel PTE inversion protection is unconditionally
enabled and cannot be disabled.
full
Provides all available mitigations for the
L1TF vulnerability. Disables SMT and
enables all mitigations in the
hypervisors, i.e. unconditional L1D flush.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
full,force
Same as 'full', but disables SMT and L1D
flush runtime control. Implies the
'nosmt=force' command line option.
(i.e. sysfs control of SMT is disabled.)
flush
Leaves SMT enabled and enables the default
hypervisor mitigation, i.e. conditional
L1D flush.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
flush,nosmt
Disables SMT and enables the default
hypervisor mitigation.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
flush,nowarn
Same as 'flush', but hypervisors will not
warn when a VM is started in a potentially
insecure configuration.
off
Disables hypervisor mitigations and doesn't
emit any warnings.
2018-11-13 19:49:10 +01:00
It also drops the swap size and available
RAM limit restriction on both hypervisor and
bare metal.
2018-07-13 16:23:25 +02:00
Default is 'flush'.
2019-02-19 11:10:49 +01:00
For details see: Documentation/admin-guide/hw-vuln/l1tf.rst
2018-07-13 16:23:25 +02:00
2005-04-16 15:20:36 -07:00
l2cr= [PPC]
l3cr= [PPC]
lapic [X86-32,APIC] Enable the local APIC even if BIOS
disabled it.
lapic= [X86,APIC] Do not use the TSC deadline
value for the LAPIC timer one-shot implementation. Fall
back to the programmable timer unit in the LAPIC.
Format: notscdeadline
lapic_timer_c2_ok [X86,APIC] Trust the local APIC timer
in C2 power state.
libata.dma= [LIBATA] DMA control
libata.dma=0 Disable all PATA and SATA DMA
libata.dma=1 PATA and SATA Disk DMA only
libata.dma=2 ATAPI (CDROM) DMA only
libata.dma=4 Compact Flash DMA only
Combinations also work, so libata.dma=3 enables DMA
for disks and CDROMs, but not CFs.
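The documented values behave like bits (1 = disks, 2 = ATAPI, 4 = CF), so a combined value can be decoded as a mask. Below is a minimal Python sketch that assumes exactly the three bits listed above. The table and function names are hypothetical and are not part of libata.

```python
# Hypothetical decoder for libata.dma=; bit meanings are taken from the list above.
LIBATA_DMA_BITS = {
    1: "PATA/SATA disk",
    2: "ATAPI (CDROM)",
    4: "Compact Flash",
}


def decode_libata_dma(value: int) -> list[str]:
    """Return the device classes that keep DMA enabled for libata.dma=value."""
    if value < 0:
        raise ValueError("libata.dma must be non-negative")
    # Dict order is preserved, so the result follows the bit order 1, 2, 4.
    return [name for bit, name in LIBATA_DMA_BITS.items() if value & bit]


if __name__ == "__main__":
    for v in (0, 1, 3, 7):
        print(f"libata.dma={v}: {decode_libata_dma(v) or ['no DMA']}")
```

With libata.dma=3 the sketch reports disks and ATAPI devices but not CF, which matches the combination described above.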
libata.ignore_hpa= [LIBATA] Ignore HPA limit
libata.ignore_hpa=0 keep BIOS limits (default)
libata.ignore_hpa=1 ignore limits, using full disk
libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume
when set.
Format: <int>
libata.force= [LIBATA] Force configurations. The format is a comma-
separated list of "[ID:]VAL" where ID is PORT[.DEVICE].
PORT and DEVICE are decimal numbers matching port, link
or device. Basically, it matches the ATA ID string
printed on console by libata. If the whole ID part is
omitted, the last PORT and DEVICE values are used. If
ID hasn't been specified yet, the configuration applies
to all ports, links and devices.
If only DEVICE is omitted, the parameter applies to
the port and all links and devices behind it. DEVICE
number of 0 either selects the first device or the
first fan-out link behind PMP device. It does not
select the host link. DEVICE number of 15 selects the
host link and device attached to it.
The VAL specifies the configuration to force. As long
as there is no ambiguity, shortcut notation is allowed.
For example, both 1.5 and 1.5G would work for 1.5Gbps.
The following configurations can be forced.
* Cable type: 40c, 80c, short40c, unk, ign or sata.
Any ID with matching PORT is used.
* SATA link speed limit: 1.5Gbps or 3.0Gbps.
* Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7].
udma[/][16,25,33,44,66,100,133] notation is also
allowed.
* nohrst, nosrst, norst: suppress hard, soft and both
resets.
* rstonce: only attempt one reset during hot-unplug
link recovery.
* [no]dbdelay: Enable or disable the extra 200ms delay
before debouncing a link PHY and device presence
detection.
* [no]ncq: Turn on or off NCQ.
* [no]ncqtrim: Enable or disable queued DSM TRIM.
* [no]ncqati: Enable or disable NCQ trim on ATI chipset.
* [no]trim: Enable or disable (unqueued) TRIM.
* trim_zero: Indicate that TRIM command zeroes data.
* max_trim_128m: Set 128M maximum trim size limit.
* [no]dma: Turn on or off DMA transfers.
* atapi_dmadir: Enable ATAPI DMADIR bridge support.
* atapi_mod16_dma: Enable the use of ATAPI DMA for
commands that are not a multiple of 16 bytes.
* [no]dmalog: Enable or disable the use of the
READ LOG DMA EXT command to access logs.
* [no]iddevlog: Enable or disable access to the
identify device data log.
* [no]logdir: Enable or disable access to the general
purpose log directory.
* max_sec_128: Set transfer size limit to 128 sectors.
* max_sec_1024: Set or clear transfer size limit to
1024 sectors.
* max_sec_lba48: Set or clear transfer size limit to
65535 sectors.
* [no]lpm: Enable or disable link power management.
* [no]setxfer: Indicate if transfer speed mode setting
should be skipped.
* [no]fua: Enable or disable FUA (Force Unit Access)
support for devices supporting this feature.
* dump_id: Dump IDENTIFY data.
* disable: Disable this device.
If there are multiple matching configurations changing
the same attribute, the last one is used.
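The ID rules above (an omitted ID reuses the previous PORT/DEVICE, an omitted DEVICE covers the whole port, the last matching entry wins) can be sketched in Python. Everything here is hypothetical: the ForceEntry type and the helper names are not libata's parser. VAL is not validated, and special cases such as cable types matching on PORT only, or DEVICE 15 selecting the host link, are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ForceEntry:
    port: Optional[int]    # None: no ID given yet, applies to all ports
    device: Optional[int]  # None: the port and everything behind it
    value: str             # VAL, e.g. "udma4", "noncq", "1.5G"


def parse_libata_force(arg: str) -> list[ForceEntry]:
    """Split a libata.force= string into (PORT, DEVICE, VAL) entries."""
    entries: list[ForceEntry] = []
    port: Optional[int] = None
    device: Optional[int] = None
    for item in arg.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" in item:
            ident, value = item.split(":", 1)
            if "." in ident:
                port_s, dev_s = ident.split(".", 1)
                port, device = int(port_s), int(dev_s)
            else:
                # Only DEVICE omitted: applies to the port and all behind it.
                port, device = int(ident), None
        else:
            # Whole ID omitted: reuse the last PORT and DEVICE values.
            value = item
        entries.append(ForceEntry(port, device, value.strip()))
    return entries


def last_match(entries: list[ForceEntry], port: int, device: int,
               attribute_values: set[str]) -> Optional[str]:
    """Return the last VAL from attribute_values that applies to port.device."""
    result = None
    for e in entries:
        applies = e.port is None or (
            e.port == port and (e.device is None or e.device == device))
        if applies and e.value in attribute_values:
            result = e.value  # later matching entries override earlier ones
    return result
```

For example, `libata.force=noncq,3.00:ncq` disables NCQ everywhere except device 3.00, where the later entry wins.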
load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
lockd.nlm_tcpport=N [NFS] Assign TCP port.
Format: <integer>
lockd.nlm_timeout=T [NFS] Assign timeout value.
Format: <integer>
lockd.nlm_udpport=M [NFS] Assign UDP port.
Format: <integer>
lockdown= [SECURITY]
{ integrity | confidentiality }
Enable the kernel lockdown feature. If set to
integrity, kernel features that allow userland to
modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland
to extract confidential information from the kernel
are also disabled.
locktorture.nreaders_stress= [KNL]
Set the number of locking read-acquisition kthreads.
Defaults to being automatically set based on the
number of online CPUs.
locktorture.nwriters_stress= [KNL]
Set the number of locking write-acquisition kthreads.
locktorture.onoff_holdoff= [KNL]
Set time (s) after boot for CPU-hotplug testing.
locktorture.onoff_interval= [KNL]
Set time (s) between CPU-hotplug operations, or
zero to disable CPU-hotplug testing.
locktorture.shuffle_interval= [KNL]
Set task-shuffle interval (jiffies). Shuffling
tasks allows some CPUs to go into dyntick-idle
mode during the locktorture test.
locktorture.shutdown_secs= [KNL]
Set time (s) after boot for system shutdown. This
is useful for hands-off automated testing.
locktorture.stat_interval= [KNL]
Time (s) between statistics printk()s.
locktorture.stutter= [KNL]
Time (s) to stutter testing; for example,
specifying five seconds causes the test to run for
five seconds, wait for five seconds, and so on.
This tests the locking primitive's ability to
transition abruptly to and from idle.
locktorture.torture_type= [KNL]
Specify the locking implementation to test.
locktorture.verbose= [KNL]
Enable additional printk() statements.
logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver
Format: <irq>
loglevel= All kernel messages with a loglevel smaller than the
console loglevel will be printed to the console. It can
also be changed with klogd or other programs. The
loglevels are defined as follows:
0 (KERN_EMERG) system is unusable
1 (KERN_ALERT) action must be taken immediately
2 (KERN_CRIT) critical conditions
3 (KERN_ERR) error conditions
4 (KERN_WARNING) warning conditions
5 (KERN_NOTICE) normal but significant condition
6 (KERN_INFO) informational
7 (KERN_DEBUG) debug-level messages
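The comparison is strict, which is easy to misread. Here is a small Python sketch of the rule as stated above; the helper names are illustrative, not kernel functions.

```python
# Levels 0-7 as listed above, indexed by their numeric value.
KERN_LEVELS = [
    "KERN_EMERG", "KERN_ALERT", "KERN_CRIT", "KERN_ERR",
    "KERN_WARNING", "KERN_NOTICE", "KERN_INFO", "KERN_DEBUG",
]


def printed_on_console(message_level: int, console_loglevel: int) -> bool:
    """True if a message at message_level reaches the console."""
    # Strictly smaller: with loglevel=4, KERN_WARNING (4) is not printed.
    return message_level < console_loglevel


def visible_levels(console_loglevel: int) -> list[str]:
    """Names of the levels that loglevel=console_loglevel lets through."""
    return [name for level, name in enumerate(KERN_LEVELS)
            if printed_on_console(level, console_loglevel)]
```

Because the comparison is strict, KERN_DEBUG (7) only reaches the console with loglevel=8.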
log_buf_len=n[KMG] Sets the size of the printk ring buffer,
in bytes. n must be a power of two and greater
than the minimal size. The minimal size is defined
by the LOG_BUF_SHIFT kernel config parameter. There is
also a CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter
that allows increasing the default size depending on
the number of CPUs. See init/Kconfig for more details.
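A Python sketch of the documented constraints. It parses n[KMG] as binary multiples, requires a power of two, and requires a size strictly greater than 1 << LOG_BUF_SHIFT. The helper names are hypothetical. LOG_BUF_SHIFT=17 in the test is only an example value. The sketch does not claim to reproduce what the kernel itself does with a value that breaks these rules.

```python
def parse_size(text: str) -> int:
    """Parse n[KMG] into bytes (K = 2**10, M = 2**20, G = 2**30)."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    shift = {"K": 10, "M": 20, "G": 30}.get(text[-1].upper(), 0)
    digits = text[:-1] if shift else text
    # Base 0 accepts both decimal ("64") and hex ("0x10000") numbers.
    return int(digits, 0) << shift


def check_log_buf_len(text: str, log_buf_shift: int) -> int:
    """Validate a log_buf_len= value against the rules described above."""
    size = parse_size(text)
    minimal = 1 << log_buf_shift  # minimal size derived from LOG_BUF_SHIFT
    if size == 0 or size & (size - 1):
        raise ValueError(f"{size} is not a power of two")
    if size <= minimal:
        raise ValueError(f"{size} is not greater than the minimal size {minimal}")
    return size
```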
logo.nologo [FB] Disables display of the built-in Linux logo.
This may be used to provide more screen space for
kernel log messages and is useful when debugging
kernel boot problems.
lp=0
lp=port[,port...]
lp=reset
lp=auto
[LP] Specify parallel ports to use, e.g.,
lp=none,parport0 (lp0 not configured, lp1 uses
first parallel port). 'lp=0' disables the
printer driver. 'lp=reset' (which can be
specified in addition to the ports) causes
attached printers to be reset. Using
lp=port1,port2,... specifies the parallel ports
to associate lp devices with, starting with
lp0. A port specification may be 'none' to skip
that lp device, or a parport name such as
'parport0'. Specifying 'lp=auto' instead of a
port specification list means that device IDs
from each port should be examined, to see if
an IEEE 1284-compliant printer is attached; if
so, the driver will manage that printer.
See also header of drivers/char/lp.c.
lpj=n [KNL]
Sets loops_per_jiffy to given constant, thus avoiding
time-consuming boot-time autodetection (up to 250 ms per
CPU). 0 enables autodetection (default). To determine
the correct value for your kernel, boot with normal
autodetection and see what value is printed. Note that
on SMP systems the preset will be applied to all CPUs,
which is likely to cause problems if your CPUs need
significantly divergent settings. An incorrect value
will cause delays in the kernel to be wrong, leading to
unpredictable I/O errors and other breakage. Although
unlikely, in the extreme case this might damage your
hardware.
ltpc= [NET]
Format: <io>,<irq>,<dma>
lsm.debug [SECURITY] Enable LSM initialization debugging output.
lsm=lsm1,...,lsmN
[SECURITY] Choose order of LSM initialization. This
overrides CONFIG_LSM, and the "security=" parameter.
machvec= [IA-64] Force the use of a particular machine-vector
(machvec) in a generic kernel.
Example: machvec=hpzx1
machtype= [Loongson] Share the same kernel image file between
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored.
maxcpus= [SMP] Maximum number of processors that an SMP kernel
will bring up during bootup. maxcpus=n : n >= 0 limits
the kernel to bring up 'n' processors. After bootup,
the remaining plugged CPUs can still be brought up by executing
"echo 1 > /sys/devices/system/cpu/cpuX/online". So maxcpus
only takes effect during system bootup.
n=0 is a special case: it is equivalent to "nosmp",
which also disables the IO APIC.
max_loop= [LOOP] The number of loop block devices that get
(loop.max_loop) unconditionally pre-created at init time. The default
number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead
of statically allocating a predefined number, loop
devices can be requested on-demand with the
/dev/loop-control interface.
mce [X86-32] Machine Check Exception
mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
mdacon= [MDA]
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
mds= [X86,INTEL]
Control mitigation for the Micro-architectural Data
Sampling (MDS) vulnerability.
Certain CPUs are vulnerable to an exploit against CPU
internal buffers which can forward information to a
disclosure gadget under certain conditions.
In vulnerable processors, the speculatively
forwarded data can be used in a cache side channel
attack, to access data to which the attacker does
not have direct access.
This parameter controls the MDS mitigation. The
options are:
full - Enable MDS mitigation on vulnerable CPUs
full,nosmt - Enable MDS mitigation and disable
SMT on vulnerable CPUs
off - Unconditionally disable MDS mitigation
On TAA-affected machines, mds=off can be prevented by
an active TAA mitigation, as both vulnerabilities are
mitigated with the same mechanism. To disable this
mitigation, tsx_async_abort=off must be specified
too.
Not specifying this option is equivalent to
mds=full.
For details see: Documentation/admin-guide/hw-vuln/mds.rst
mem=nn[KMG] [HEXAGON] Set the memory size.
Must be specified, otherwise memory size will be 0.
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory.
Typical cases for limiting the amount of memory used:
1 for testing;
2 when the kernel is not able to see the whole system memory;
3 when memory that lies after the 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests;
4 to limit the memory available for a kdump kernel.
[ARC,MICROBLAZE] - the limit applies only to low memory,
high memory is not affected.
[ARM64] - only limits memory covered by the linear
mapping. The NOMAP regions are not affected.
[X86] Works as a limit on the maximum address. Use together
with memmap= to avoid physical address space collisions.
Without memmap=, PCI devices could be placed at addresses
belonging to unused RAM.
Note that this only takes effect at boot time: in case 3
above, memory may need to be hot-added after boot
if the hypervisor's system memory is not sufficient.
mem=nn[KMG]@ss[KMG]
[ARM,MIPS] - override the memory layout reported by
firmware.
Define a memory region of size nn[KMG] starting at
ss[KMG].
Multiple different regions can be specified with
multiple mem= parameters on the command line.
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
memblock=debug [KNL] Enable memblock debug messages.
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
memhp_default_state=online/offline
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option.
See Documentation/admin-guide/mm/memory-hotplug.rst.
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
[KNL,X86,MIPS,XTENSA] Force usage of a specific region of memory.
Region of memory to be used is from ss to ss+nn.
If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
which limits max address to nn[KMG].
Multiple different regions can be specified,
comma delimited.
Example:
memmap=100M@2G,100M#3G,1G!1024G
memmap=nn[KMG]#ss[KMG]
[KNL,ACPI] Mark specific memory as ACPI data.
Region of memory to be marked is from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
2014-02-06 12:04:19 -08:00
Region of memory to be reserved is from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
or
memmap=0x10000$0x18690000
Some bootloaders, such as GRUB2, may need an escape
character before '$'; otherwise '$' and the following
number will be eaten.
memmap=nn[KMG]!ss[KMG]
[KNL,X86] Mark specific memory as protected.
Region of memory to be used, from ss to ss+nn.
The memory region may be marked as e820 type 12 (0xc)
and is NVDIMM or ADR memory.
memmap=<size>%<offset>-<oldtype>+<newtype>
[KNL,ACPI] Convert memory within the specified region
from <oldtype> to <newtype>. If "-<oldtype>" is left
out, the whole region will be marked as <newtype>,
even if previously unavailable. If "+<newtype>" is left
out, matching memory will be removed. Types are
specified as e820 types, e.g., 1 = RAM, 2 = reserved,
3 = ACPI, 12 = PRAM.
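The separator-based memmap= forms above can be summarized by a small Python parser. It handles only nn[KMG], optionally followed by one of @, #, $ or ! and ss[KMG], in a comma-delimited list. "exactmap" and the %-conversion form are rejected. The kind labels and function names are descriptive inventions for this sketch, not kernel identifiers. The sketch sees the string after the bootloader has handed it over, so any '$' must already have survived escaping.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One size token: decimal or 0x-hex, with an optional K/M/G binary suffix.
_SIZE = r"(?:0[xX][0-9a-fA-F]+|\d+)[KMGkmg]?"
# nn[KMG], optionally followed by one of @ # $ ! and ss[KMG].
_ENTRY = re.compile(rf"({_SIZE})(?:([@#$!])({_SIZE}))?")
_KIND = {"@": "usable RAM", "#": "ACPI data", "$": "reserved", "!": "protected"}


def parse_size(text: str) -> int:
    """Parse nn[KMG] into bytes (K = 2**10, M = 2**20, G = 2**30)."""
    shift = {"K": 10, "M": 20, "G": 30}.get(text[-1].upper(), 0)
    digits = text[:-1] if shift else text
    return int(digits, 0) << shift


@dataclass(frozen=True)
class MemmapRegion:
    kind: str
    size: int
    start: Optional[int] = None  # None when "@ss" is omitted (acts like mem=nn)

    @property
    def last(self) -> Optional[int]:
        """Inclusive last byte, i.e. ss + nn - 1, if a start was given."""
        return None if self.start is None else self.start + self.size - 1


def parse_memmap(arg: str) -> list[MemmapRegion]:
    """Parse the comma-delimited value of memmap= (@, #, $, ! forms only)."""
    regions = []
    for item in arg.split(","):
        m = _ENTRY.fullmatch(item.strip())
        if m is None:
            # exactmap and the %-conversion form are not handled by this sketch.
            raise ValueError(f"unsupported memmap entry: {item!r}")
        size_s, sep, start_s = m.groups()
        if sep is None:
            regions.append(MemmapRegion("limit (like mem=)", parse_size(size_s)))
        else:
            regions.append(
                MemmapRegion(_KIND[sep], parse_size(size_s), parse_size(start_s)))
    return regions
```

Parsing "64K$0x18690000" gives a reserved region whose last byte is 0x1869ffff, the same range as the example given for the '$' form.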
memory_corruption_check=0/1 [X86]
Some BIOSes seem to corrupt the first 64k of
memory when doing things like suspend/resume.
Setting this option will scan the memory
looking for corruption. Enabling this will
both detect corruption and prevent the kernel
from using the memory being corrupted.
However, it's intended as a diagnostic tool; if
repeatable BIOS-originated corruption always
affects the same memory, you can use memmap=
to prevent the kernel from using that memory.
memory_corruption_check_size=size [X86]
By default it checks for corruption in the low
64k, making this memory unavailable for normal
use. Use this parameter to scan for
corruption in more or less memory.
memory_corruption_check_period=seconds [X86]
By default it checks for corruption every 60
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
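Below is a toy Python model of the check. It assumes the scanned area is cleared when checking is set up and later scanned word by word for anything that is no longer zero. This is a simplified illustration, not the x86 implementation. The 8-byte word size and the function name are assumptions, and the periodic rescan controlled by memory_corruption_check_period is left out.

```python
def find_corruption(region: bytes, word_size: int = 8) -> list[int]:
    """Return byte offsets of words in a cleared region that are no longer zero."""
    return [
        off
        for off in range(0, len(region) - word_size + 1, word_size)
        if any(region[off:off + word_size])
    ]


if __name__ == "__main__":
    low_mem = bytearray(64 * 1024)  # stands in for the low 64k, cleared at setup
    low_mem[0x1234] = 0xFF          # simulate a stray write by firmware
    print([hex(off) for off in find_corruption(low_mem)])  # ['0x1230']
```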
memory_hotplug.memmap_on_memory
[KNL,X86,ARM] Boolean flag to enable this feature.
Format: {on | off (default)}
When enabled, runtime hotplugged memory will
allocate its internal metadata (struct pages;
those vmemmap pages cannot be optimized even
if hugetlb_free_vmemmap is enabled) from the
hot-added memory, which allows hot-adding a
lot of memory without requiring additional
memory to do so.
This feature is disabled by default because it
has some implications for large (e.g. GB)
allocations in some configurations (e.g. small
memory blocks).
The state of the flag can be read in
/sys/module/memory_hotplug/parameters/memmap_on_memory.
Note that even when enabled, there are a few cases where
the feature is not effective.
memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
performed. Each pass selects another test
pattern from a given set of patterns. Memtest
fills the memory with this pattern, validates
memory contents and reserves bad memory
regions that are detected.
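The fill-and-verify loop described above can be illustrated with a toy Python model. Each pass writes the next pattern to every word and then reads it back. The pattern list, the ToyRAM stuck-bit fault model and the function names are all hypothetical. The kernel's own pattern set may differ, and it reserves bad regions rather than returning them.

```python
# Hypothetical pattern set for illustration; the kernel's own list may differ.
PATTERNS = [
    0x0000000000000000,
    0xFFFFFFFFFFFFFFFF,
    0x5555555555555555,
    0xAAAAAAAAAAAAAAAA,
]


class ToyRAM:
    """64-bit words where some bits are stuck at 0 (a simple fault model)."""

    def __init__(self, n_words: int, stuck_at_zero: dict[int, int]):
        self.words = [0] * n_words
        self.stuck_at_zero = stuck_at_zero  # word index -> mask of stuck bits

    def write(self, idx: int, value: int) -> None:
        self.words[idx] = value & ~self.stuck_at_zero.get(idx, 0)

    def read(self, idx: int) -> int:
        return self.words[idx]


def memtest(ram: ToyRAM, passes: int) -> set[int]:
    """Run `passes` fill-and-verify passes; return indices of bad words."""
    bad: set[int] = set()
    for i in range(passes):  # passes=0 does nothing, as with memtest=0
        pattern = PATTERNS[i % len(PATTERNS)]
        for idx in range(len(ram.words)):
            ram.write(idx, pattern)
        for idx in range(len(ram.words)):
            if ram.read(idx) != pattern:
                bad.add(idx)
    return bad
```

With a bit stuck at 0, the first (all-zeros) pass finds nothing, and only the second (all-ones) pass exposes it. This is why several passes with different patterns are useful.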
mem_encrypt= [X86-64] AMD Secure Memory Encryption (SME) control
Valid arguments: on, off
Default (depends on kernel configuration option):
on (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
s2idle - Suspend-To-Idle
shallow - Power-On Suspend or equivalent (if supported)
deep - Suspend-To-RAM or equivalent (if supported)
See Documentation/admin-guide/pm/sleep-states.rst.
meye.*= [HW] Set MotionEye Camera parameters
See Documentation/admin-guide/media/meye.rst.
mfgpt_irq= [IA-32] Specify the IRQ to use for the
Multi-Function General Purpose Timers on AMD Geode
platforms.
mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when
the BIOS has incorrectly applied a workaround. TinyBIOS
version 0.98 is known to be affected, 0.99 fixes the
problem by letting the user disable the workaround.
mga= [HW,DRM]
min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
mini2440= [ARM,HW,KNL]
Format: [0..2][b][c][t]
Default: "0tb"
MINI2440 configuration specification:
0 - The attached screen is the 3.5" TFT
1 - The attached screen is the 7" TFT
2 - The VGA Shield is attached (1024x768)
Leaving out the screen size parameter will not load
the TFT driver, and the framebuffer will be left
unconfigured.
b - Enable backlight. The TFT backlight pin will be
linked to the kernel VESA blanking code and a GPIO
LED. This parameter is not necessary when using the
VGA shield.
c - Enable the s3c camera interface.
t - Reserved for enabling touchscreen support. The
touchscreen support is not enabled in the mainstream
kernel as of 2.6.30; a preliminary port can be found
in the "bleeding edge" mini2440 support kernel at
https://repo.or.cz/w/linux-2.6/mini2440.git
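A Python sketch of decoding the mini2440= string. It assumes the optional screen digit comes first and the b/c/t flags may follow in any order (the documented default "0tb" lists t before b). The Mini2440Config type and parse_mini2440 are hypothetical names, not the board code's own parser.

```python
from dataclasses import dataclass
from typing import Optional

SCREENS = {"0": '3.5" TFT', "1": '7" TFT', "2": "VGA shield (1024x768)"}


@dataclass(frozen=True)
class Mini2440Config:
    screen: Optional[str]  # None: no screen digit, TFT driver not loaded
    backlight: bool        # "b"
    camera: bool           # "c"
    touchscreen: bool      # "t" (reserved)


def parse_mini2440(arg: str = "0tb") -> Mini2440Config:
    """Decode a mini2440= string; the default mirrors the documented "0tb"."""
    screen: Optional[str] = None
    flags = arg
    if flags and flags[0] in SCREENS:
        screen, flags = SCREENS[flags[0]], flags[1:]
    unknown = set(flags) - set("bct")
    if unknown:
        raise ValueError(f"unknown mini2440 flags: {''.join(sorted(unknown))}")
    return Mini2440Config(screen, "b" in flags, "c" in flags, "t" in flags)
```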
mitigations=
[X86,PPC,S390,ARM64] Control optional mitigations for
CPU vulnerabilities. This is a set of curated,
arch-independent options, each of which is an
aggregation of existing arch-specific options.
off
Disable all optional CPU mitigations. This
improves system performance, but it may also
expose users to several CPU vulnerabilities.
Equivalent to: nopti [X86,PPC]
if nokaslr then kpti=0 [ARM64]
nospectre_v1 [X86,PPC]
nobp=0 [S390]
nospectre_v2 [X86,PPC,S390,ARM64]
spectre_v2_user=off [X86]
spec_store_bypass_disable=off [X86,PPC]
ssbd=force-off [ARM64]
nospectre_bhb [ARM64]
2019-04-12 15:39:29 -05:00
l1tf=off [X86]
2019-04-17 16:39:02 -05:00
mds=off [X86]
2019-10-23 12:32:55 +02:00
tsx_async_abort=off [X86]
2019-11-04 12:22:02 +01:00
kvm.nx_huge_pages=off [X86]
2022-05-13 18:16:37 +08:00
srbds=off [X86,INTEL]
2020-11-17 16:59:12 +11:00
no_entry_flush [PPC]
2020-11-17 16:59:13 +11:00
no_uaccess_flush [PPC]
2022-05-19 20:29:11 -07:00
mmio_stale_data=off [X86]
2022-07-28 04:39:07 +00:00
retbleed=off [X86]
2019-11-04 12:22:02 +01:00
Exceptions:
This does not have any effect on
kvm.nx_huge_pages when
kvm.nx_huge_pages=force.
2019-04-12 15:39:28 -05:00
auto (default)
Mitigate all CPU vulnerabilities, but leave SMT
enabled, even if it's vulnerable. This is for
users who don't want to be surprised by SMT
getting disabled across kernel upgrades, or who
have other ways of avoiding SMT-based attacks.
2019-04-12 15:39:29 -05:00
Equivalent to: (default behavior)
2019-04-12 15:39:28 -05:00
auto,nosmt
Mitigate all CPU vulnerabilities, disabling SMT
if needed. This is for users who always want to
be fully mitigated, even if it means losing SMT.
2019-04-12 15:39:29 -05:00
Equivalent to: l1tf=flush,nosmt [X86]
2019-04-17 16:39:02 -05:00
mds=full,nosmt [X86]
2019-10-23 12:32:55 +02:00
tsx_async_abort=full,nosmt [X86]
2022-05-19 20:29:11 -07:00
mmio_stale_data=full,nosmt [X86]
2022-07-28 04:39:07 +00:00
retbleed=auto,nosmt [X86]
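The equivalence lists above work like a lookup from the mitigations= value to a set of per-vulnerability options. The sketch below copies only the X86 entries listed in this section. The table name and the expand_mitigations() helper are hypothetical. The only override it models is the documented kvm.nx_huge_pages=force exception.

```python
# X86 entries copied from the equivalence lists above; other
# architectures' entries are left out of this sketch.
MITIGATIONS_X86 = {
    "off": [
        "nospectre_v1", "nospectre_v2", "spectre_v2_user=off",
        "spec_store_bypass_disable=off", "l1tf=off", "mds=off",
        "tsx_async_abort=off", "kvm.nx_huge_pages=off", "srbds=off",
        "mmio_stale_data=off", "retbleed=off",
    ],
    "auto": [],  # default behavior: nothing extra is implied
    "auto,nosmt": [
        "l1tf=flush,nosmt", "mds=full,nosmt", "tsx_async_abort=full,nosmt",
        "mmio_stale_data=full,nosmt", "retbleed=auto,nosmt",
    ],
}


def expand_mitigations(value, cmdline):
    """Return the per-vulnerability options implied by mitigations=<value>."""
    implied = list(MITIGATIONS_X86[value])
    # Documented exception: mitigations=off leaves kvm.nx_huge_pages alone
    # when the command line already contains kvm.nx_huge_pages=force.
    if "kvm.nx_huge_pages=force" in cmdline:
        implied = [o for o in implied if not o.startswith("kvm.nx_huge_pages=")]
    return implied
```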
2019-04-12 15:39:28 -05:00
2008-07-23 21:26:49 -07:00
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
the additional memory initialisation checks. A value
of 0 disables mminit logging and a level of 4 will
log everything. Information is printed at KERN_DEBUG
so loglevel=8 may also need to be specified.
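Two independent filters decide whether an mminit message appears: mminit_loglevel, and the console loglevel. The console prints a message only when its level is numerically lower than the console loglevel. KERN_DEBUG is level 7, which is why loglevel=8 is needed.

The sketch below shows both filters. The mminit gate (`check_level < mminit_loglevel`) is an assumption that is consistent with "0 disables, 4 logs everything". The function name is hypothetical.

```python
KERN_DEBUG = 7  # printk level of the extra mminit messages, per the text above


def mminit_message_visible(check_level, mminit_loglevel, console_loglevel):
    """Hypothetical model of whether an mminit check message is shown."""
    # Assumed gate: a check at check_level is logged only when
    # check_level < mminit_loglevel, so 0 disables logging entirely.
    logged = check_level < mminit_loglevel
    # The console prints a message only when its printk level is
    # numerically lower than the console loglevel (KERN_DEBUG needs 8).
    printed = KERN_DEBUG < console_loglevel
    return logged and printed
```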
2022-05-19 20:29:11 -07:00
mmio_stale_data=
[X86,INTEL] Control mitigation for the Processor
MMIO Stale Data vulnerabilities.
Processor MMIO Stale Data is a class of
vulnerabilities that may expose data after an MMIO
operation. Exposed data could originate or end in
the same CPU buffers as affected by MDS and TAA.
Therefore, similar to MDS and TAA, the mitigation
is to clear the affected CPU buffers.
This parameter controls the mitigation. The
options are:
full - Enable mitigation on vulnerable CPUs
full,nosmt - Enable mitigation and disable SMT on
vulnerable CPUs.
off - Unconditionally disable mitigation
On MDS or TAA affected machines,
mmio_stale_data=off can be prevented by an active
MDS or TAA mitigation, because these vulnerabilities
are mitigated with the same mechanism. To disable
this mitigation, you need to specify mds=off and
tsx_async_abort=off as well.
Not specifying this option is equivalent to
mmio_stale_data=full.
For details see:
Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
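The interaction with MDS and TAA can be sketched as a small function. It assumes a CPU that is vulnerable to MMIO Stale Data. It also assumes that mds= and tsx_async_abort= default to "full" when omitted, just as mmio_stale_data= does. The function and helper names are hypothetical.

```python
def option(cmdline, name):
    # An omitted option is treated as "full" (documented for mmio_stale_data,
    # assumed here for mds and tsx_async_abort as well).
    return cmdline.get(name, "full")


def mmio_clearing_active(cmdline, mds_affected, taa_affected):
    """Hypothetical model: are CPU buffers cleared for MMIO Stale Data?"""
    if option(cmdline, "mmio_stale_data") != "off":
        return True
    # Same buffer-clearing mechanism: an active MDS or TAA mitigation keeps
    # it enabled even when mmio_stale_data=off is given.
    mds_on = mds_affected and option(cmdline, "mds") != "off"
    taa_on = taa_affected and option(cmdline, "tsx_async_abort") != "off"
    return mds_on or taa_on
```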
2022-12-03 17:30:50 -08:00
<module>.async_probe[=<bool>] [KNL]
If no <bool> value is specified or if the value
specified is not a valid <bool>, enable asynchronous
probe on this module. Otherwise, enable/disable
asynchronous probe on this module as indicated by the
<bool> value. See also: module.async_probe
2022-06-03 18:01:00 -07:00
module.async_probe=<bool>
[KNL] When set to true, modules will use async probing
by default. To enable/disable async probing for a
specific module, use the module specific control that
is documented under <module>.async_probe. When both
module.async_probe and <module>.async_probe are
specified, <module>.async_probe takes precedence for
the specific module.
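The precedence between `<module>.async_probe` and `module.async_probe` can be sketched as follows. Several things here are assumptions:

- The `<bool>` parser is a simplified stand-in for the kernel's own parser.
- The ABSENT sentinel and the function names are hypothetical.
- An invalid module.async_probe value is assumed to be ignored.

```python
ABSENT = object()  # option does not appear on the command line at all


def parse_bool(text):
    """Simplified <bool> parser: True/False, or None if the value is invalid."""
    t = text.lower()
    if t in ("on", "off"):
        return t == "on"
    if t and t[0] in "yt1":
        return True
    if t and t[0] in "nf0":
        return False
    return None


def async_probe(module_opt=ABSENT, global_opt=ABSENT):
    """Decide async probing for one module.

    module_opt: value of <module>.async_probe (None = given without a value).
    global_opt: value of module.async_probe.
    """
    if module_opt is not ABSENT:
        # The module-specific setting wins; a missing or invalid value enables.
        if module_opt is None:
            return True
        parsed = parse_bool(module_opt)
        return True if parsed is None else parsed
    if global_opt is not ABSENT and global_opt is not None:
        # Assumption: an invalid global value leaves probing synchronous.
        return parse_bool(global_opt) is True
    return False  # default: synchronous probing
```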
2012-09-26 10:09:40 +01:00
module.sig_enforce
[KNL] When CONFIG_MODULE_SIG is set, this means that
modules without (valid) signatures will fail to load.
2013-03-25 20:42:06 +01:00
Note that if CONFIG_MODULE_SIG_FORCE is set, that
2012-09-26 10:09:40 +01:00
is always true, so this option does nothing.
2016-07-21 15:37:56 +09:30
module_blacklist= [KNL] Do not load a comma-separated list of
modules. Useful for debugging problem modules.
2005-04-16 15:20:36 -07:00
mousedev.tap_time=
[MOUSE] Maximum time between finger touching and
leaving touchpad surface for touch to be considered
a tap and be reported as a left button click (for
touchpads working in absolute mode only).
Format: <msecs>
mousedev.xres= [MOUSE] Horizontal screen resolution, used for devices
reporting absolute coordinates, such as tablets
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
mm, page_alloc: extend kernelcore and movablecore for percent
Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
the system memory capacity to be known when specifying the command line,
however.
This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory. This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example,
as a proportion of a system's memory rather than a hardcoded byte value.
To define the percentage, the final character of the parameter should be
a '%'.
mhocko: "why is anyone using these options nowadays?"
rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example. When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.
[rientjes@google.com: updates]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 16:23:09 -07:00
movablecore= [KNL,X86,IA-64,PPC]
Format: nn[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
allocations. If both kernelcore and movablecore are
specified, then kernelcore will be at *least* the
specified value but may be more. If movablecore on its
own is specified, the administrator must be careful
2009-04-05 15:55:22 -07:00
that the amount of memory usable for all allocations
is not too small.
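The `nn[KMGTPE] | nn%` format can be sketched as a small parser. Each suffix multiplies by a further power of 1024, and a trailing '%' takes a share of total memory. The function name is hypothetical, and rejecting percentages above 100 is an assumption.

```python
SUFFIXES = "KMGTPE"  # each step multiplies by another 1024


def parse_core_size(arg, total_bytes):
    """Turn a kernelcore=/movablecore= value into bytes (sketch)."""
    if arg.endswith("%"):
        pct = int(arg[:-1])
        if not 0 <= pct <= 100:  # assumption: out-of-range values are rejected
            raise ValueError("percentage must be between 0 and 100")
        return total_bytes * pct // 100
    num, suffix = arg, ""
    if arg and arg[-1].upper() in SUFFIXES:
        num, suffix = arg[:-1], arg[-1].upper()
    shift = 10 * (SUFFIXES.index(suffix) + 1) if suffix else 0
    return int(num) << shift
```

For example, on a 16G machine `movablecore=25%` and `movablecore=4G` describe the same amount of memory.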
2017-07-06 15:41:02 -07:00
movable_node [KNL] Boot-time switch to make hotpluggable memory
NUMA nodes to be movable. This means that the memory
of such nodes will be usable only for movable
allocations which rules out almost all kernel
allocations. Use with caution!
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE, so in this setup the
kernel cannot use memory in movable nodes. This degrades NUMA
performance, which may make other users unhappy.
So we need a way to allow users to enable and disable this functionality.
This patch introduces the movable_node boot option, which allows users
to choose not to consume hotpluggable memory at early boot time, so that
it can later be set as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
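The two-step direction switch can be illustrated with a toy allocator. This is a hypothetical model, not memblock itself:

- Free memory is a single range [lo, hi).
- The kernel image is assumed to sit just below lo.
- Bottom-up allocations therefore land near the kernel image.

```python
class ToyMemblock:
    """Hypothetical model of memblock's allocation direction."""

    def __init__(self, lo, hi, bottom_up):
        self.lo, self.hi = lo, hi
        self.bottom_up = bottom_up

    def alloc(self, size):
        if self.hi - self.lo < size:
            raise MemoryError("out of memory")
        if self.bottom_up:
            addr = self.lo       # lowest free address, near the kernel image
            self.lo += size
        else:
            self.hi -= size      # highest free address (default behavior)
            addr = self.hi
        return addr


def early_boot(movable_node):
    mb = ToyMemblock(0x100000, 0x1000000, bottom_up=movable_node)  # step 1
    before_srat = mb.alloc(0x1000)
    mb.bottom_up = False  # step 2: SRAT parsed, back to top-down
    after_srat = mb.alloc(0x1000)
    return [before_srat, after_srat]
```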
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-12 15:08:10 -08:00
2005-04-16 15:20:36 -07:00
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
2005-10-23 12:57:11 -07:00
MTD_Region= [MTD] Format:
<name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>]
2005-04-16 15:20:36 -07:00
mtdparts= [MTD]
2020-02-18 16:02:19 +01:00
See drivers/mtd/parsers/cmdlinepart.c
2005-04-16 15:20:36 -07:00
2008-07-03 11:24:29 +01:00
mtdset= [ARM]
ARM/S3C2412 JIVE boot control
2020-09-11 16:33:41 +02:00
See arch/arm/mach-s3c/mach-jive.c
2008-07-03 11:24:29 +01:00
2005-04-16 15:20:36 -07:00
mtouchusb.raw_coordinates=
2005-10-23 12:57:11 -07:00
[HW] Make the MicroTouch USB driver use raw coordinates
('y', default) or cooked coordinates ('n')
2005-04-16 15:20:36 -07:00
2009-04-05 15:55:22 -07:00
mtrr_chunk_size=nn[KMG] [X86]
2009-04-27 15:06:31 +02:00
Used for mtrr cleanup. It is the largest continuous chunk
2009-04-05 15:55:22 -07:00
that could hold holes (aka UC entries).
mtrr_gran_size=nn[KMG] [X86]
Used for mtrr cleanup. It is the granularity of an mtrr block.
Default is 1.
A large value could prevent small alignment from
using up MTRRs.
mtrr_spare_reg_nr=n [X86]
Format: <integer>
Range: 0,7 : spare reg number
Default : 1
Used for mtrr cleanup. It is the number of spare mtrr entries.
Set to 2 or more if your graphics card needs more.
2022-04-02 22:48:22 -07:00
multitce=off [PPC] This parameter disables the use of the pSeries
firmware feature for updating multiple TCE entries
at a time.
2005-04-16 15:20:36 -07:00
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters
Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
Note that mem_start is often overloaded to mean
something different and driver-specific.
2005-10-23 12:57:11 -07:00
This usage is only documented in each driver source
file if at all.
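The field order of netdev= can be illustrated with a small splitter. The kernel's own parser is not reproduced here, and the function name is hypothetical. Leading numeric fields are taken in order, in decimal or 0x-prefixed hex. The first non-numeric field is taken as the device name. The meaning of each number stays driver-specific, as noted above.

```python
def parse_netdev(arg):
    """Sketch: split netdev=<irq>,<io>,<mem_start>,<mem_end>,<name>."""
    keys = ["irq", "io", "mem_start", "mem_end"]
    fields = arg.split(",")
    result = {}
    i = 0
    while i < len(keys) and i < len(fields):
        try:
            result[keys[i]] = int(fields[i], 0)  # decimal or 0x-prefixed hex
        except ValueError:
            break  # first non-numeric field: treat it as the name
        i += 1
    if i < len(fields):
        result["name"] = fields[i]
    return result
```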
2022-04-02 22:48:22 -07:00
netpoll.carrier_timeout=
[NET] Specifies amount of time (in seconds) that
netpoll should wait for a carrier. By default netpoll
waits 4 seconds.
netfilter: accounting rework: ct_extend + 64bit counters (v4)
Initially netfilter had 64bit counters for conntrack-based accounting, but
it was changed in 2.6.14 to save memory. Unfortunately in-kernel 64bit counters are
still required, for example for the "connbytes" extension. However, 64bit counters
waste a lot of memory and it was not possible to enable/disable them at runtime.
This patch:
- reimplements accounting with respect to the extension infrastructure,
- makes one global version of seq_print_acct() instead of two seq_print_counters(),
- makes it possible to enable it at boot time (for CONFIG_SYSCTL/CONFIG_SYSFS=n),
- makes it possible to enable/disable it at runtime by sysctl or sysfs,
- extends counters from 32bit to 64bit,
- renames ip_conntrack_counter -> nf_conn_counter,
- enables accounting code unconditionally (no longer depends on CONFIG_NF_CT_ACCT),
- set initial accounting enable state based on CONFIG_NF_CT_ACCT
- removes buggy IPCT_COUNTER_FILLING event handling.
If accounting is enabled newly created connections get additional acct extend.
Old connections are not changed as it is not possible to add a ct_extend area
to confirmed conntrack. Accounting is performed for all connections with
acct extend regardless of a current state of "net.netfilter.nf_conntrack_acct".
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:01:34 -07:00
nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting
0 to disable accounting
1 to enable accounting
2010-06-25 14:46:56 +02:00
Default value is 0.
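The commit message above explains that the acct extension can only be attached when a connection is created. It also says that accounting follows the extension, not the current sysctl value. The sketch below models that behavior. The class names are hypothetical, and Python integers stand in for the 64bit counters.

```python
class Conn:
    def __init__(self, acct_enabled_at_creation):
        # The acct extension can only be added when the entry is created.
        self.acct = {"packets": 0, "bytes": 0} if acct_enabled_at_creation else None

    def account(self, nbytes):
        # Counting depends only on the extension being present, not on the
        # current value of net.netfilter.nf_conntrack_acct.
        if self.acct is not None:
            self.acct["packets"] += 1
            self.acct["bytes"] += nbytes


class ConntrackTable:
    def __init__(self, acct):
        self.acct = acct  # nf_conntrack.acct= at boot, or the sysctl later

    def new_conn(self):
        return Conn(self.acct)
```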
2010-09-17 10:54:37 -04:00
nfsaddrs= [NFS] Deprecated. Use ip= instead.
2020-02-12 19:13:32 +01:00
See Documentation/admin-guide/nfs/nfsroot.rst.
2005-04-16 15:20:36 -07:00
nfsroot= [NFS] nfs root filesystem for disk-less boxes.
2020-02-12 19:13:32 +01:00
See Documentation/admin-guide/nfs/nfsroot.rst.
2005-04-16 15:20:36 -07:00
2010-09-17 10:54:37 -04:00
nfsrootdebug [NFS] enable nfsroot debugging messages.
2020-02-12 19:13:32 +01:00
See Documentation/admin-guide/nfs/nfsroot.rst.
2010-09-17 10:54:37 -04:00
2016-08-29 20:03:52 -04:00
nfs.callback_nr_threads=
[NFSv4] set the total number of threads that the
NFS client will assign to service NFSv4 callback
requests.
2006-01-03 09:55:41 +01:00
nfs.callback_tcpport=
[NFS] set the TCP port on which the NFSv4 callback
channel should listen.
2009-08-19 18:12:27 -04:00
nfs.cache_getent=
[NFS] sets the pathname to the program which is used
to update the NFS client cache entries.
nfs.cache_getent_timeout=
[NFS] sets the timeout after which an attempt to
update a cache entry is deemed to have failed.
2006-01-03 09:55:57 +01:00
nfs.idmap_cache_timeout=
[NFS] set the maximum lifetime for idmapper cache
entries.
2007-10-09 12:01:04 -04:00
nfs.enable_ino64=
[NFS] enable 64-bit inode numbers.
If zero, the NFS client will fake up a 32-bit inode
number for the readdir() and stat() syscalls instead
of returning the full 64-bit number.
The default is to return 64-bit inode numbers.
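One way to "fake up" a 32-bit inode number from a 64-bit fileid is to XOR-fold the upper half into the lower half. This keeps some of the high bits' information. The folding rule in this sketch is an assumption, and the function name is hypothetical.

```python
def user_ino(fileid, enable_ino64):
    """Inode number reported to readdir()/stat() for an NFS fileid (sketch)."""
    fileid &= (1 << 64) - 1
    if enable_ino64:
        return fileid
    # Assumed folding: XOR the upper 32 bits into the lower 32 bits.
    return (fileid ^ (fileid >> 32)) & 0xFFFFFFFF
```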
2016-08-29 20:03:52 -04:00
nfs.max_session_cb_slots=
[NFSv4.1] Sets the maximum number of session
slots the client will assign to the callback
channel. This determines the maximum number of
callbacks the client will process in parallel for
a particular server.
2012-02-06 19:50:40 -05:00
nfs.max_session_slots=
[NFSv4.1] Sets the maximum number of session slots
the client will attempt to negotiate with the server.
This limits the number of simultaneous RPC requests
that the client can send to the NFSv4.1 server.
Note that there is little point in setting this
value higher than the max_tcp_slot_table_limit.
2011-02-22 15:44:32 -08:00
nfs.nfs4_disable_idmapping=
2012-01-09 13:46:26 -05:00
[NFSv4] When set to the default of '1', this option
ensures that both the RPC level authentication
scheme and the NFS level operations agree to use
numeric uids/gids if the mount is using the
'sec=sys' security flavour. In effect it is
disabling idmapping, which can make migration from
legacy NFSv2/v3 systems to NFSv4 easier.
Servers that do not support this mode of operation
will be autodetected by the client, and it will fall
back to using the idmapper.
To turn off this behaviour, set the value to '0'.
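The client-side choice described above can be written as a small decision function. The function and parameter names are hypothetical. "sys" stands for the 'sec=sys' mount option. server_accepts_numeric models the autodetection of servers that do not support numeric ids.

```python
def use_numeric_ids(nfs4_disable_idmapping, sec_flavor, server_accepts_numeric):
    """Hypothetical decision: send numeric uids/gids instead of names?"""
    if not nfs4_disable_idmapping or sec_flavor != "sys":
        return False  # use the idmapper
    # Servers that do not support numeric ids are detected and the
    # client falls back to the idmapper.
    return server_accepts_numeric
```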
2012-09-14 17:24:41 -04:00
nfs.nfs4_unique_id=
[NFS4] Specify an additional fixed unique
identification string that NFSv4 clients can insert into
their nfs_client_id4 string. This is typically a
UUID that is generated at system install time.
2011-02-22 15:44:32 -08:00
2012-02-17 15:20:24 -05:00
nfs.send_implementation_id =
[NFSv4.1] Send client implementation identification
information in exchange_id requests.
If zero, no implementation identification information
will be sent.
The default is to send the implementation identification
information.
2016-11-03 12:10:10 +02:00
2013-09-04 10:08:54 -04:00
nfs.recover_lost_locks =
[NFSv4] Attempt to recover locks that were lost due
to a lease timeout on the server. Please note that
doing this risks data corruption, since there are
no guarantees that the file will remain unchanged
after the locks are lost.
If you want to enable the kernel legacy behaviour of
attempting to recover these locks, then set this
parameter to '1'.
The default parameter value of '0' causes the kernel
not to attempt recovery of lost locks.
2012-02-17 15:20:24 -05:00
2015-08-24 20:39:18 -04:00
nfs4.layoutstats_timer =
[NFSv4.2] Change the rate at which the kernel sends
layoutstats to the pNFS metadata server.
Setting this value to 0 causes the kernel to use
whatever value is the default set by the layout
driver. A non-zero value sets the minimum interval
in seconds between layoutstats transmissions.
2021-11-01 15:17:53 -04:00
nfsd.inter_copy_offload_enable =
[NFSv4.2] When set to 1, the server will support
server-to-server copies for which this server is
the destination of the copy.
nfsd.nfsd4_ssc_umount_timeout =
[NFSv4.2] When used as the destination of a
server-to-server copy, knfsd temporarily mounts
the source server. It caches the mount in case
it will be needed again, and discards it if not
used for the number of milliseconds specified by
this parameter.
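The caching behavior of nfsd4_ssc_umount_timeout resembles an idle-timeout cache. The sketch below assumes that a mount idle for at least the timeout is discarded. The class name and the 60000 ms example value are hypothetical and do not reflect the kernel's default.

```python
class MountCache:
    """Keep source-server mounts; drop those idle for timeout_ms or longer."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_used = {}  # source server -> last use time in ms

    def use(self, server, now_ms):
        """Record a copy from server; return True if a cached mount was reused."""
        reused = server in self.last_used
        self.last_used[server] = now_ms
        return reused

    def expire(self, now_ms):
        """Unmount and return the servers whose mounts have been idle too long."""
        stale = [s for s, t in self.last_used.items() if now_ms - t >= self.timeout_ms]
        for s in stale:
            del self.last_used[s]
        return stale
```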
2012-03-22 16:07:18 -04:00
nfsd.nfs4_disable_idmapping=
[NFSv4] When set to the default of '1', the NFSv4
server will return only numeric uids and gids to
clients using auth_sys, and will accept numeric uids
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
2012-02-17 15:20:24 -05:00
2021-11-01 15:17:53 -04:00
2020-07-08 16:25:43 -07:00
nmi_backtrace.backtrace_idle [KNL]
Dump stacks even of idle CPUs in response to an
NMI stack-backtrace request.
2017-02-26 13:17:39 +01:00
nmi_debug= [KNL,SH] Specify one or more actions to take
2007-10-10 14:58:29 +02:00
when an NMI is triggered.
Format: [state][,regs][,debounce][,die]
2009-04-14 14:03:43 +05:30
nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
2011-03-22 16:34:16 -07:00
Format: [panic,][nopanic,][num]
2015-04-14 15:44:13 -07:00
Valid num: 0 or 1
2015-10-10 15:40:42 -04:00
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
2009-04-05 15:55:22 -07:00
When panic is specified, panic when an NMI watchdog
2019-05-21 10:32:08 +08:00
timeout occurs (or 'nopanic' to not panic on an NMI
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set).
To disable both hard and soft lockup detectors,
2015-10-10 15:40:42 -04:00
please see 'nowatchdog'.
2009-04-05 15:55:22 -07:00
This is useful when you use a panic=... timeout and
need the box quickly up again.
2005-04-16 15:20:36 -07:00
2017-12-10 01:48:46 -06:00
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
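The documented `[panic,][nopanic,][num]` format can be parsed as below. This sketch follows the format as written here and is not the kernel's own parser, which may accept less. The panic_default argument stands in for CONFIG_BOOTPARAM_HARDLOCKUP_PANIC. The function name is hypothetical.

```python
def parse_nmi_watchdog(arg, panic_default=False):
    """Return (hardlockup_enabled or None if num is absent, panic_on_timeout)."""
    enabled = None
    panic = panic_default
    for token in arg.split(","):
        if token == "panic":
            panic = True
        elif token == "nopanic":
            panic = False
        elif token in ("0", "1"):
            enabled = token == "1"  # 1 turns the hardlockup detector on
        elif token:
            raise ValueError("unknown nmi_watchdog token: %r" % token)
    return enabled, panic
```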
2007-07-31 00:37:59 -07:00
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
2005-04-16 15:20:36 -07:00
emulation library even if a 387 maths coprocessor
is present.
2018-05-18 13:35:25 +03:00
no5lvl [X86-64] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.
2020-05-28 16:13:58 -04:00
nofsgsbase [X86] Disables FSGSBASE instructions.
2020-05-28 16:13:48 -04:00
2009-04-05 15:55:22 -07:00
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
hibernate operations. Once disabled, debugging
messages can reach various consoles while the rest
of the system is being put to sleep (i.e., while
debugging driver suspend/resume hooks). This may
not work reliably with all consoles, but is known
to work with serial and VGA consoles.
2011-10-31 17:11:27 -07:00
To facilitate more flexible debugging, we also add
console_suspend, a printk module parameter to control
it. Users can use console_suspend (usually
/sys/module/printk/parameters/console_suspend) to
turn it on or off dynamically.
2009-04-05 15:55:22 -07:00
2019-07-16 16:26:39 -07:00
novmcoredd [KNL,KDUMP]
Disable device dump. Device dump allows drivers to
append dump data to vmcore so you can collect driver
specified debug info. Drivers can append the data
without any limit and this data is stored in memory,
so this may cause significant memory stress. Disabling
device dump can help save memory, but the driver debug
data will no longer be available. This parameter
is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
is set.
2007-05-31 00:40:47 -07:00
noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
caches in the slab allocator. Saves per-node memory,
but will impact performance.
2006-12-06 20:32:16 -08:00
2005-10-23 12:57:11 -07:00
noalign [KNL,ARM]
s390: introduce CPU alternatives
Implement CPU alternatives, which allows to optionally patch newer
instructions at runtime, based on CPU facilities availability.
A new kernel boot parameter "noaltinstr" disables patching.
The current implementation is derived from x86 alternatives, except that
ideal instruction padding (when altinstr is longer than oldinstr)
is added at compile time, so no oldinstr nop optimization has to be
done at runtime. A couple of compile time sanity checks are also done:
1. oldinstr and altinstr must be <= 254 bytes long,
2. oldinstr and altinstr must not have an odd length.
alternative(oldinstr, altinstr, facility);
alternative_2(oldinstr, altinstr1, facility1, altinstr2, facility2);
Both compile time and runtime padding consist of either a 6/4/2-byte nop,
or a jump (brcl) + 2-byte nop filler if padding is longer than 6 bytes.
.altinstructions and .altinstr_replacement sections are part of
__init_begin : __init_end region and are freed after initialization.
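The length checks and padding rules above can be sketched in terms of byte counts. The labels "nop2", "nop4", "nop6" and "brcl" are hypothetical names for the instructions involved. The filler layout (a 6-byte brcl followed by 2-byte nops) is an assumed reading of the description. Only the case where altinstr is longer than oldinstr is modeled.

```python
MAX_LEN = 254


def check_lengths(oldinstr, altinstr):
    """Compile-time sanity checks listed above (lengths in bytes)."""
    for n in (oldinstr, altinstr):
        if n > MAX_LEN:
            raise ValueError("%d bytes exceeds %d" % (n, MAX_LEN))
        if n % 2:
            raise ValueError("%d bytes is an odd length" % n)


def padding(oldinstr, altinstr):
    """Padding added to oldinstr when altinstr is longer (sketch)."""
    check_lengths(oldinstr, altinstr)
    pad = altinstr - oldinstr
    if pad <= 0:
        return []
    if pad <= 6:
        return ["nop%d" % pad]  # a single 2-, 4- or 6-byte nop
    # Longer padding: a 6-byte brcl jumping over 2-byte nop filler (assumed).
    return ["brcl"] + ["nop2"] * ((pad - 6) // 2)
```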
Signed-off-by: Vasily Gorbik <gor@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-10-12 13:01:47 +02:00
noaltinstr [S390] Disables alternative instructions patching
(CPU alternatives feature).
2005-04-16 15:20:36 -07:00
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
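Writing a nice level to /proc/<pid>/autogroup sets the group's shares to the weight of a task at that nice level. The sketch below approximates that weight and the group selection rule. It assumes a weight of 1024 at nice 0 that changes by about 1.25x per step. The kernel uses a precomputed table with slightly different rounding. Both function names are hypothetical.

```python
def nice_to_weight(nice):
    """Approximate CFS weight for a nice level in -20..19 (sketch)."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return round(1024 / 1.25 ** nice)


def runqueue_group(task_cgroup, task_autogroup):
    # The autogroup is used only when the task has no cgroup assignment.
    return task_cgroup if task_cgroup is not None else task_autogroup
```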
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 14:18:03 +01:00
noautogroup Disable scheduler automatic task group creation.
2005-04-16 15:20:36 -07:00
nocache [ARM]
2005-10-23 12:57:11 -07:00
2008-09-21 17:14:42 +09:00
nodsp [SH] Disable hardware DSP at boot time.
2014-08-14 17:15:26 +08:00
noefi Disable EFI runtime services support.
2008-01-30 13:32:11 +01:00
2020-11-17 16:59:12 +11:00
no_entry_flush [PPC] Don't flush the L1-D cache when entering the kernel.
2005-04-16 15:20:36 -07:00
noexec [IA-64]
2022-01-27 12:56:23 +01:00
nosmap [PPC]
2012-09-21 12:43:13 -07:00
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by the processor.
2022-01-27 12:56:24 +01:00
nosmep [PPC64s]
2012-09-21 12:43:13 -07:00
Disable SMEP (Supervisor Mode Execution Prevention)
2011-05-11 16:51:05 -07:00
even if it is supported by the processor.
2008-04-12 10:28:25 +02:00
noexec32 [X86-64]
This affects only 32-bit executables.
noexec32=on: enable non-executable mappings (default)
read doesn't imply executable mappings
noexec32=off: disable non-executable mappings
read implies executable mappings
2005-04-16 15:20:36 -07:00
2015-04-03 23:23:34 +01:00
nofpu [MIPS,SH] Disable hardware FPU at boot time.
2008-09-21 17:14:42 +09:00
2007-07-31 00:37:59 -07:00
nofxsr [BUGS=X86-32] Disables x86 floating point extended
2006-03-23 02:59:34 -08:00
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
2005-04-16 15:20:36 -07:00
2020-09-10 20:19:46 +08:00
nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
2015-04-14 15:47:20 -07:00
2022-09-11 12:44:23 +08:00
nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings.
2021-05-03 19:17:55 +10:00
2016-04-05 12:53:38 +02:00
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
2020-08-09 19:49:41 -07:00
[KNL,X86] Disable symmetric multithreading (SMT).
2018-06-29 16:05:47 +02:00
nosmt=force: Force disable SMT, cannot be undone
via the sysfs control file.
powerpc updates for 4.19
Notable changes:
- A fix for a bug in our page table fragment allocator, where a page table page
could be freed and reallocated for something else while still in use, leading
to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for
a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but bring us into
line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer
for reporting many of these.
- A series to improve the hvc driver & related OPAL console code, which have
been seen to cause hardlockups at times. The hvc driver changes in particular
have been in linux-next for about a month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support; neither chip should be in use
anywhere other than as a paperweight.
- An optimised memcmp implementation using Power7-or-later VMX instructions.
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM CPUs
(controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals to bring it
into line with other arches, including showing the offending VMA and dumping
the instructions around the fault.
Thanks to:
Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey
Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan,
Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza,
Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt,
Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian
Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley,
Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus
Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael
Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas
Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap,
Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff,
Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson,
Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao
B, zhong jiang.
Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A fix for a bug in our page table fragment allocator, where a page
table page could be freed and reallocated for something else while
still in use, leading to memory corruption etc. The fix reuses
pt_mm in struct page (x86 only) for a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but
bring us in to line with x86 behaviour and/or fix outright bugs.
Thanks to Florian Weimer for reporting many of these.
- A series to improve the hvc driver & related OPAL console code,
which have been seen to cause hardlockups at times. The hvc driver
changes in particular have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
use anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX
instructions.
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM
CPUs (controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals
to bring it into line with other arches, including showing the
offending VMA and dumping the instructions around the fault.
Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
Rao, zhong jiang"
* tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
powerpc/mm/book3s/radix: Add mapping statistics
powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
powerpc/mm/hash: Remove unnecessary do { } while(0) loop
powerpc/64s: move machine check SLB flushing to mm/slb.c
powerpc/powernv/idle: Fix build error
powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
powerpc/mm: remove warning about ‘type’ being set
powerpc/32: Include setup.h header file to fix warnings
powerpc: Move `path` variable inside DEBUG_PROM
powerpc/powermac: Make some functions static
powerpc/powermac: Remove variable x that's never read
cxl: remove a dead branch
powerpc/powermac: Add missing include of header pmac.h
powerpc/kexec: Use common error handling code in setup_new_fdt()
powerpc/xmon: Add address lookup for percpu symbols
powerpc/mm: remove huge_pte_offset_and_shift() prototype
powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
powerpc/fadump: handle crash memory ranges array index overflow
...
2018-08-17 11:32:50 -07:00
x86/speculation: Enable Spectre v1 swapgs mitigations
The previous commit added macro calls in the entry code which mitigate the
Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
enabled. Enable those features where applicable.
The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
There are different features which can affect the risk of attack:
- When FSGSBASE is enabled, unprivileged users are able to place any
value in GS, using the wrgsbase instruction. This means they can
write a GS value which points to any value in kernel space, which can
be useful with the following gadget in an interrupt/exception/NMI
handler:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
// dependent load or store based on the value of %reg1
// for example: mov (%reg1), %reg2
If an interrupt is coming from user space, and the entry code
speculatively skips the swapgs (due to user branch mistraining), it
may speculatively execute the GS-based load and a subsequent dependent
load or store, exposing the kernel data to an L1 side channel leak.
Note that, on Intel, a similar attack exists in the above gadget when
coming from kernel space, if the swapgs gets speculatively executed to
switch back to the user GS. On AMD, this variant isn't possible
because swapgs is serializing with respect to future GS-based
accesses.
NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
doesn't exist quite yet.
- When FSGSBASE is disabled, the issue is mitigated somewhat because
unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
restricts GS values to user space addresses only. That means the
gadget would need an additional step, since the target kernel address
needs to be read from user space first. Something like:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
mov (%reg1), %reg2
// dependent load or store based on the value of %reg2
// for example: mov (%reg2), %reg3
It's difficult to audit for this gadget in all the handlers, so while
there are no known instances of it, it's entirely possible that it
exists somewhere (or could be introduced in the future). Without
tooling to analyze all such code paths, consider it vulnerable.
Effects of SMAP on the !FSGSBASE case:
- If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
susceptible to Meltdown), the kernel is prevented from speculatively
reading user space memory, even L1 cached values. This effectively
disables the !FSGSBASE attack vector.
- If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
still prevents the kernel from speculatively reading user space
memory. But it does *not* prevent the kernel from reading the
user value from L1, if it has already been cached. This is probably
only a small hurdle for an attacker to overcome.
Thanks to Dave Hansen for contributing the speculative_smap() function.
Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
is serializing on AMD.
[ tglx: Fixed the USER fence decision and polished the comment as suggested
by Dave Hansen ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
2019-07-08 11:52:26 -05:00
nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
(bounds check bypass). With this option data leaks are
possible in the system.
2018-05-29 17:48:27 +02:00
2022-09-19 19:01:36 +02:00
nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for
2019-04-15 16:21:20 -05:00
the Spectre variant 2 (indirect branch prediction)
vulnerability. System may allow data leaks with this
option.
2018-01-11 21:46:26 +00:00
2022-08-26 19:40:50 +08:00
nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch
history injection) vulnerability. System may allow data leaks
with this option.
2018-04-25 22:04:21 -04:00
nospec_store_bypass_disable
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
2020-11-17 16:59:13 +11:00
no_uaccess_flush
[PPC] Don't flush the L1-D cache after accessing user data.
2009-05-22 12:17:45 -07:00
noxsave [BUGS=X86] Disables x86 extended register state save
and restore using xsave. The kernel will fall back to
enabling legacy floating-point and SSE state.
2014-05-29 11:12:31 -07:00
noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
register states. The kernel will fall back to using
xsave to save the states. With this parameter,
saving the states is slower because xsave lacks the
modified optimization that xsaveopt provides on
xsaveopt-enabled systems.
noxsaves [X86] Disables xsaves and xrstors used in saving and
restoring x86 extended register state in the compacted
form of the xsave area. The kernel will fall back to using
xsaveopt and xrstor to save and restore the states
in the standard form of the xsave area. With this
parameter, the per-process xsave area might occupy more
memory on xsaves-enabled systems.
2021-02-09 09:23:48 -08:00
nohlt [ARM,ARM64,MICROBLAZE,SH] Forces the kernel to busy wait
in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
sleep(SH) or wfi(ARM,ARM64) instructions do not work
2023-01-29 15:10:45 -08:00
correctly or when doing power measurements to evaluate
2021-02-09 09:23:48 -08:00
the impact of the sleep instructions. This is also
useful when using a JTAG debugger.
2005-10-23 12:57:11 -07:00
file capabilities: add no_file_caps switch (v4)
Add a no_file_caps boot option when file capabilities are
compiled into the kernel (CONFIG_SECURITY_FILE_CAPABILITIES=y).
This allows distributions to ship a kernel with file capabilities
compiled in, without forcing users to use (and understand and
trust) them.
When no_file_caps is specified at boot, then when a process executes
a file, any file capabilities stored with that file will not be
used in the calculation of the process' new capability sets.
This means that booting with the no_file_caps boot option will
not be the same as booting a kernel with file capabilities
compiled out - in particular a task with CAP_SETPCAP will not
have any chance of passing capabilities to another task (which
isn't "really" possible anyway, and which may soon be killed
altogether by David Howells in any case), and it will instead
be able to put new capabilities in its pI. However since fI
will always be empty and pI is masked with fI, it gains the
task nothing.
We also support the extra prctl options, setting securebits and
dropping capabilities from the per-process bounding set.
The other remaining difference is that killpriv, task_setscheduler,
setioprio, and setnice will continue to be hooked. That will
be noticeable in the case where a root task changed its uid
while keeping some caps, and another task owned by the new uid
tries to change settings for the more privileged task.
Changelog:
Nov 05 2008: (v4) trivial port on top of always-start-\
with-clear-caps patch
Sep 23 2008: nixed file_caps_enabled when file caps are
not compiled in as it isn't used.
Document no_file_caps in kernel-parameters.txt.
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-05 16:08:52 -06:00
no_file_caps Tells the kernel not to honor file capabilities. The
only way then for a file to be executed with privilege
is to be setuid root or executed by root.
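The commit message above argues that a task with CAP_SETPCAP gains nothing by adding capabilities to its pI, because pI is masked with fI and fI is always empty under no_file_caps. A minimal C sketch of that masking, simplified from the exec-time rule in capabilities(7), pP' = (pI & fI) | (fP & X), and ignoring ambient capabilities, the effective bit and setuid-root handling. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical representation of a capability set as a 64-bit mask. */
typedef uint64_t capset_t;

/*
 * Simplified exec-time permitted set (after capabilities(7)):
 *   pP' = (pI & fI) | (fP & X)
 * where X is the bounding set. Ambient caps and root special cases
 * are deliberately left out of this sketch.
 */
capset_t new_permitted(capset_t pI, capset_t fI, capset_t fP,
                       capset_t X, int no_file_caps)
{
    if (no_file_caps) {
        /* File capabilities are not honored: fI and fP act as empty. */
        fI = 0;
        fP = 0;
    }
    return (pI & fI) | (fP & X);
}
```

With no_file_caps set, new_permitted() returns 0 whatever pI holds, which is the "gains the task nothing" case; the setuid-root path mentioned in the parameter text is not modelled here.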
2005-04-16 15:20:36 -07:00
nohalt [IA-64] Tells the kernel not to use the power saving
function PAL_HALT_LIGHT when idle. This increases
power consumption. On the positive side, it reduces
interrupt wake-up latency, which may improve performance
in certain environments such as networked servers or
real-time systems.
2021-02-14 10:13:48 -06:00
no_hash_pointers
Force pointers printed to the console or buffers to be
unhashed. By default, when a pointer is printed via %p
format string, that pointer is "hashed", i.e. obscured
by hashing the pointer value. This is a security feature
that hides actual kernel addresses from unprivileged
users, but it also makes debugging the kernel more
difficult since unequal pointers can no longer be
compared. However, if this command-line option is
specified, then all normal pointers will have their true
2022-02-17 09:49:59 +01:00
value printed. This option should only be specified when
2021-02-14 10:13:48 -06:00
debugging the kernel. Please do not use on production
kernels.
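To illustrate why hashing hampers debugging: equal pointers still hash to equal values, but the distance and ordering between addresses are lost. This is a sketch only. The kernel keys its %p hash with a random per-boot secret and uses siphash, whereas the mixing function and fixed key below are stand-ins chosen for brevity.

```c
#include <stdint.h>

/* Stand-in mixer (splitmix64 finalizer), NOT the kernel's siphash. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Hypothetical fixed key; the real key is random and differs per boot. */
static const uint64_t ptr_key_sketch = 0x0123456789abcdefULL;

/* Value that would be printed for %p, with or without no_hash_pointers. */
uint64_t printed_ptr_value(const void *p, int no_hash_pointers)
{
    uint64_t raw = (uint64_t)(uintptr_t)p;

    if (no_hash_pointers)
        return raw;                     /* true address is shown */
    return mix64(raw ^ ptr_key_sketch); /* obscured, but stable for equal p */
}
```

With hashing on, two neighbouring array elements print as unrelated values; with no_hash_pointers their printed values differ by exactly the element size.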
2014-06-13 13:30:35 -07:00
nohibernate [HIBERNATION] Disable hibernation and resume.
2007-02-16 01:28:03 -08:00
nohz= [KNL] Boot-time enable/disable dynamic ticks
Valid arguments: on, off
Default: on
2017-12-14 19:18:27 +01:00
nohz_full= [KNL,BOOT,SMP,ISOL]
2016-10-11 13:51:35 -07:00
The argument is a cpu list, as described above.
2013-04-12 16:45:34 +02:00
In kernels built with CONFIG_NO_HZ_FULL=y, set
2012-12-18 17:32:19 +01:00
the specified list of CPUs whose tick will be stopped
2013-03-27 02:18:34 +01:00
whenever possible. The boot CPU will be forced outside
2017-06-02 11:26:43 -07:00
the range to maintain timekeeping. Any CPUs
in this list will have their RCU callbacks offloaded,
just as if they had also been called out in the
rcu_nocbs= boot parameter.
2012-12-18 17:32:19 +01:00
2022-04-22 17:52:47 +00:00
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
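A minimal C sketch of turning a nohz_full= cpu list into a CPU mask and then keeping the boot CPU out of it, as described above. It assumes only the plain "N" and "A-B" forms and CPU numbers below 64; the kernel's own list parser accepts more forms. All names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a simplified cpu list such as "1,3-5" into a bitmask.
 * Only single numbers and "a-b" ranges with CPUs 0..63 are handled.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_cpulist(const char *s, uint64_t *mask)
{
    uint64_t m = 0;

    while (*s != '\0') {
        char *end;
        long a = strtol(s, &end, 10);
        long b;

        if (end == s || a < 0 || a > 63)
            return -1;
        b = a;
        s = end;
        if (*s == '-') {            /* range "a-b" */
            s++;
            b = strtol(s, &end, 10);
            if (end == s || b < a || b > 63)
                return -1;
            s = end;
        }
        for (long c = a; c <= b; c++)
            m |= 1ULL << c;
        if (*s == ',')
            s++;
        else if (*s != '\0')
            return -1;              /* unexpected character */
    }
    *mask = m;
    return 0;
}

/* The boot CPU is forced outside the set so it can keep timekeeping. */
uint64_t nohz_full_mask(uint64_t requested, int boot_cpu)
{
    return requested & ~(1ULL << boot_cpu);
}
```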
2009-04-02 12:31:16 +09:00
noiotrap [SH] Disables trapped I/O port accesses.
2007-07-31 00:37:59 -07:00
noirqdebug [X86-32] Disables the code which attempts to detect and
2005-04-16 15:20:36 -07:00
disable unhandled interrupt sources.
2009-04-14 14:03:43 +05:30
no_timer_check [X86,APIC] Disables the code which tests for
2006-12-07 02:14:09 +01:00
broken timer IRQ sources.
2005-04-16 15:20:36 -07:00
noisapnp [ISAPNP] Disables ISA PnP code.
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.
2009-04-17 16:42:15 +08:00
nointremap [X86-64, Intel-IOMMU] Do not enable interrupt
remapping.
2010-07-20 11:06:49 -07:00
[Deprecated - use intremap=off]
2009-04-17 16:42:15 +08:00
2005-04-16 15:20:36 -07:00
nointroute [IA-64]
2016-01-29 11:42:58 -08:00
noinvpcid [X86] Disable the INVPCID cpu feature.
2011-08-13 12:34:52 -07:00
nojitter [IA-64] Disables jitter checking for ITC timers.
2007-07-20 11:22:30 -07:00
2022-12-03 17:30:50 -08:00
nokaslr [KNL]
When CONFIG_RANDOMIZE_BASE is set, this disables
kernel and module base offset ASLR (Address Space
Layout Randomization).
2010-08-16 17:51:20 +02:00
no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
2010-10-14 11:22:51 +02:00
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
2016-10-28 00:54:32 -07:00
no-vmw-sched-clock
[X86,PV_OPS] Disable paravirtualized VMware scheduler
clock and use the default one.
2022-09-02 18:53:14 +10:00
no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
steal time accounting. Steal time is computed, but
won't influence scheduler behaviour.
2011-07-11 15:28:19 -04:00
2007-07-31 00:37:59 -07:00
nolapic [X86-32,APIC] Do not enable or use the local APIC.
2005-04-16 15:20:36 -07:00
2007-07-31 00:37:59 -07:00
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
2007-03-22 00:11:21 -08:00
2006-02-22 09:57:55 +09:00
nomca [IA-64] Disable machine check abort handling
2015-05-16 02:16:43 +09:00
nomce [X86-32] Disable Machine Check Exception
2006-04-01 01:36:09 +02:00
2007-10-12 23:04:06 +02:00
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).
2011-10-13 15:14:27 -04:00
nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to
shut down the other CPUs. Instead use the REBOOT_VECTOR
IRQ.
2022-11-11 14:30:23 +01:00
nomodeset Disable kernel modesetting. Most systems' firmware
sets up a display mode and provides framebuffer memory
for output. With nomodeset, DRM and fbdev drivers will
not load if they could possibly displace the pre-
initialized output. Only the system framebuffer will
be available for use. The respective drivers will not
perform display-mode changes or accelerated rendering.
Useful as error fallback, or for testing and debugging.
2021-11-12 14:32:29 +01:00
2012-02-01 10:33:14 +08:00
nomodule Disable module loading
2010-01-18 17:05:40 +01:00
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.
2017-06-29 08:53:20 -07:00
nopcid [X86-64] Disable the PCID cpu feature.
2022-12-03 17:30:50 -08:00
nopku [X86] Disable Memory Protection Keys CPU feature found
in some Intel CPUs.
nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
Disables the PV optimizations forcing the guest to run
as a generic guest with no PV drivers. Currently supports
XEN HVM, KVM, HYPER_V and VMWARE guests.
nopvspin [X86,XEN,KVM]
Disables the qspinlock slow path using PV optimizations
which allow the hypervisor to 'idle' the guest on lock
contention.
2009-04-05 15:55:22 -07:00
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
2007-07-31 00:37:59 -07:00
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
2007-05-02 19:27:13 +02:00
with UP alternatives
2005-10-23 12:57:11 -07:00
noresume [SWSUSP] Disables resume and restores original swap
space.
2005-04-16 15:20:36 -07:00
no-scroll [VGA] Disables scrollback.
This is required for the Braillex ib80-piezo Braille
reader made by F.H. Papenmeier (Germany).
nosbagart [IA-64]
2020-11-13 00:01:19 +02:00
nosgx [X86-64,SGX] Disables Intel SGX kernel support.
2007-08-16 03:34:22 -04:00
nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. Legacy for "maxcpus=0".
2005-04-16 15:20:36 -07:00
2007-07-15 23:41:05 -07:00
nosoftlockup [KNL] Disable the soft-lockup detector.
2005-04-16 15:20:36 -07:00
nosync [HW,M68K] Disables sync negotiation for all devices.
2015-04-14 15:44:13 -07:00
nowatchdog [KNL] Disable both lockup detectors, i.e.
2018-04-18 20:51:39 +02:00
soft-lockup and NMI watchdog (hard-lockup).
2010-05-07 17:11:44 -04:00
2005-04-16 15:20:36 -07:00
nowb [ARM]
2005-10-23 12:57:11 -07:00
2009-04-17 16:42:12 +08:00
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
2022-08-16 16:19:42 -07:00
NOTE: this parameter will be ignored on systems with the
LEGACY_XAPIC_DISABLED bit set in the
IA32_XAPIC_DISABLE_STATUS MSR.
2018-04-18 20:51:39 +02:00
nps_mtm_hs_ctr= [KNL,ARC]
2017-06-15 11:43:57 +03:00
This parameter sets the maximum duration, in
cycles, each HW thread of the CTOP can run
without interruptions, before HW switches it.
The actual maximum duration is 16 times this
parameter's value.
Format: integer between 1 and 255
Default: 255
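A short sketch of the arithmetic: the effective limit is 16 times the parameter, so the default of 255 gives 16 x 255 = 4080 cycles. Treating an out-of-range value as the default is an assumption of this sketch, and the function and macro names are hypothetical.

```c
/* Hypothetical names; the range and default come from the entry above. */
#define NPS_MTM_HS_CTR_MIN     1u
#define NPS_MTM_HS_CTR_MAX     255u
#define NPS_MTM_HS_CTR_DEFAULT 255u

/* Effective maximum uninterrupted run, in cycles, for one HW thread. */
unsigned int mtm_max_cycles(unsigned int val)
{
    if (val < NPS_MTM_HS_CTR_MIN || val > NPS_MTM_HS_CTR_MAX)
        val = NPS_MTM_HS_CTR_DEFAULT;   /* assumed: ignore invalid input */
    return 16u * val;
}
```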
2011-08-13 12:34:52 -07:00
nptcg= [IA-64] Override max number of concurrent global TLB
2008-03-14 13:57:08 -07:00
purges which is reported from either PAL_VM_SUMMARY or
SAL PALO.
2010-02-10 01:20:37 -08:00
nr_cpus= [SMP] Maximum number of processors that an SMP kernel
could support. nr_cpus=n : n >= 1 limits the kernel to
2016-08-24 13:06:45 +08:00
support 'n' processors. n may be larger than the
number of CPUs already plugged in at boot; later, at
runtime, extra CPUs can be physically added until the
total reaches n. For this reason, some boot-time memory
for per-cpu variables must be pre-allocated for later
physical CPU hotplug.
2010-02-10 01:20:37 -08:00
2009-04-05 15:55:22 -07:00
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
2021-05-24 17:17:15 +12:00
numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA; only
set up a single NUMA node spanning all memory.
2021-03-02 21:41:59 +13:00
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
2012-11-22 11:16:36 +00:00
Allowed values are enable and disable
2007-07-15 23:38:01 -07:00
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
2017-09-06 16:20:13 -07:00
'node', 'default' can be specified
2007-07-15 23:38:01 -07:00
This can be set from sysctl after boot.
2019-04-22 16:48:00 -03:00
See Documentation/admin-guide/sysctl/vm.rst for details.
2007-07-15 23:38:01 -07:00
2009-01-06 14:42:44 -08:00
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
2020-05-01 17:37:50 +02:00
See Documentation/core-api/debugging-via-ohci1394.rst for more
2009-01-06 14:42:44 -08:00
info.
2008-04-29 00:59:53 -07:00
olpc_ec_timeout= [OLPC] ms delay when issuing EC commands
Rather than timing out after 20 ms if an EC
command is not properly ACKed, override the length
of the timeout. We have interrupts disabled while
waiting for the ACK, so if this is set too high
interrupts *may* be lost!
2009-12-11 16:16:32 -08:00
omap_mux= [OMAP] Override bootloader pin multiplexing.
Format: <mux_mode0.mode_name=value>...
For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
2022-04-02 22:48:21 -07:00
onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
boundary - index of last SLC block on Flex-OneNAND.
The remaining blocks are configured as MLC blocks.
lock - Configure if Flex-OneNAND boundary should be locked.
Once locked, the boundary cannot be changed.
1 indicates lock status, 0 indicates unlock status.
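A minimal parsing sketch for the Format line above. It assumes the fields are positional, comma-separated integers, and that omitted trailing fields mean "no boundary given" (-1) and "unlocked" (0). Those defaults and all names are hypothetical, not taken from the MTD driver.

```c
#include <stdlib.h>

/* Hypothetical in-memory form of onenand.bdry= for a two-die chip. */
struct flex_bdry_sketch {
    int boundary[2]; /* index of last SLC block per die, -1 = not given */
    int lock[2];     /* 1 = boundary locked, 0 = unlocked */
};

/* Parse "b0[,l0[,b1[,l1]]]"; returns 0 on success, -1 on bad input. */
int parse_flex_bdry(const char *s, struct flex_bdry_sketch *out)
{
    int vals[4] = { -1, 0, -1, 0 }; /* assumed defaults for omitted fields */
    int n = 0;

    while (*s != '\0' && n < 4) {
        char *end;
        long v = strtol(s, &end, 0);

        if (end == s)
            return -1;          /* not a number */
        vals[n++] = (int)v;
        s = end;
        if (*s == ',')
            s++;
        else if (*s != '\0')
            return -1;          /* unexpected character */
    }
    if (*s != '\0')
        return -1;              /* more than four fields */
    if ((vals[1] != 0 && vals[1] != 1) || (vals[3] != 0 && vals[3] != 1))
        return -1;              /* lock must be 0 or 1 */

    out->boundary[0] = vals[0];
    out->lock[0] = vals[1];
    out->boundary[1] = vals[2];
    out->lock[1] = vals[3];
    return 0;
}
```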
2011-04-04 15:02:24 -07:00
oops=panic Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
2011-03-22 16:34:04 -07:00
This will also cause panics on machine check exceptions.
Useful together with panic=30 to trigger a reboot.
mm: shuffle initial free memory to improve memory-side-cache utilization
Patch series "mm: Randomize free memory", v10.
This patch (of 3):
Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].
Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:
It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software
and this abandoned patch series proposed that for the kernel:
https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com
Dan's patch series doesn't attempt to ensure buffers won't conflict, but
also reduces the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel). That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
measurement to the Node Health Check that follows each job;
nodes are rebooted whenever their measured memory bandwidth
falls below 300 GB/s."
A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability"). With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.
The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.
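The conflicts described above follow from how a direct-mapped cache indexes memory: every line has exactly one slot, so two lines a whole cache size apart evict each other. An illustrative model in C; the line and cache sizes are parameters of the sketch, not figures from the platforms discussed.

```c
#include <stdint.h>

/* Slot a physical address maps to in a direct-mapped cache. */
uint64_t dm_cache_slot(uint64_t phys_addr, uint64_t line_size,
                       uint64_t cache_size)
{
    uint64_t nr_slots = cache_size / line_size;

    return (phys_addr / line_size) % nr_slots;
}

/* Two different lines that share a slot keep evicting each other. */
int dm_cache_conflict(uint64_t a, uint64_t b, uint64_t line_size,
                      uint64_t cache_size)
{
    return (a / line_size != b / line_size) &&
           dm_cache_slot(a, line_size, cache_size) ==
           dm_cache_slot(b, line_size, cache_size);
}
```

In-order allocation tends to hand out physically contiguous runs, so whether two working sets collide depends heavily on where each run starts; shuffling spreads that outcome toward the average.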
Here are some performance impact details of the patches:
1/ An Intel internal synthetic memory bandwidth measurement tool saw a
3X speedup in a contrived case that tries to force cache conflicts.
The contrived case used the numa_emulation capability to force an
instance of the benchmark to be run in two of the near-memory sized
numa nodes. If both instances were placed on the same emulated node they
would fit and cause zero conflicts. On separate emulated nodes
without randomization, they underutilized the cache and conflicted
unnecessarily due to the in-order allocation per node.
2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8%
for the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.
3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.
4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].
While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.
Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator, especially
early in system boot, has predictable first-in-first-out behavior for
physical pages. Pages are freed in physical address order when first
onlined.
Quoting Kees:
"While we already have a base-address randomization
(CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
memory layouts would certainly be using the predictability of
allocation ordering (i.e. for attacks where the base address isn't
important: only the relative positions between allocated memory).
This is common in lots of heap-style attacks. They try to gain
control over ordering by spraying allocations, etc.
I'd really like to see this because it gives us something similar
to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches, it leaves the vast bulk of memory to be allocated predictably, in order.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.
Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time. Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).
The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10
(4MB). This trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and with the
expectation that the security implications of finer granularity
randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.
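A minimal sketch of the Fisher-Yates step described above, applied to an array of order-10 block start PFNs in the order they are first freed. It assumes 4 KiB pages, so one block covers 2^10 x 4 KiB = 4 MiB, and uses a small xorshift generator as a stand-in for the kernel's random source. shuffle_zone() itself works on the free_area lists rather than on an array like this.

```c
#include <stddef.h>
#include <stdint.h>

#define SHUFFLE_ORDER    10      /* MAX_ORDER-1 per the text above */
#define PAGE_SIZE_BYTES  4096ULL /* assumed 4 KiB pages */

/* Bytes covered by one shuffled block: 2^10 * 4 KiB = 4 MiB. */
static const uint64_t shuffle_block_bytes = PAGE_SIZE_BYTES << SHUFFLE_ORDER;

/* xorshift64 stand-in PRNG; not the kernel's entropy source. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fisher-Yates: swap each position with a random earlier-or-equal one. */
void shuffle_blocks(uint64_t *pfn, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;   /* xorshift state must be non-zero */

    for (size_t i = n; i > 1; i--) {
        /* 0 <= j < i; modulo bias ignored for brevity */
        size_t j = (size_t)(xorshift64(&state) % i);
        uint64_t tmp = pfn[i - 1];

        pfn[i - 1] = pfn[j];
        pfn[j] = tmp;
    }
}
```

Every ordering of the blocks is reachable, and the result is a permutation of the input, so no memory is lost or duplicated by the shuffle.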
This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions. It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.
[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309
[dan.j.williams@intel.com: fix shuffle enable]
Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 15:41:28 -07:00
page_alloc.shuffle=
[KNL] Boolean flag to control whether the page allocator
should randomize its free lists. The randomization may
be automatically enabled if the kernel detects it is
running on a platform with a direct-mapped memory-side
cache, and this parameter can be used to
override/disable that behavior. The state of the flag
can be read from sysfs at:
/sys/module/page_alloc/parameters/shuffle.
mm/page_owner: keep track of page owners
This is the page owner tracking code which was introduced long ago. It
has been resident in Andrew's tree, but nobody tried to upstream it, so it
remains as is. Our company uses this feature actively to debug memory leaks
or to find memory hogs, so I decided to upstream this feature.
This functionality helps us to know who allocates the page. When
allocating a page, we store some information about allocation in extra
memory. Later, if we need to know status of all pages, we can get and
analyze it from this stored information.
In previous version of this feature, extra memory is statically defined in
struct page, but, in this version, extra memory is allocated outside of
struct page. It enables us to turn on/off this feature at boottime
without considerable memory waste.
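A minimal sketch of the design choice described above: per-page records live in a separate table indexed by page frame number and are allocated only when the feature is enabled at boot, so struct page does not grow. The structure layout and function names are hypothetical and much simpler than the real page_owner code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-page record kept outside struct page. */
struct page_owner_rec {
    unsigned int order;
    uint32_t gfp_mask;
    uintptr_t caller;   /* e.g. return address of the allocating caller */
};

struct page_owner_table {
    struct page_owner_rec *recs;  /* NULL when the feature is off */
    size_t nr_pages;
};

/* Allocate the side table only if enabled at boot (page_owner=on). */
int page_owner_init(struct page_owner_table *t, size_t nr_pages, int enabled)
{
    t->nr_pages = nr_pages;
    t->recs = enabled ? calloc(nr_pages, sizeof(*t->recs)) : NULL;
    return (enabled && t->recs == NULL) ? -1 : 0;
}

/* Record who allocated a page; a no-op when the feature is disabled. */
void page_owner_record(struct page_owner_table *t, size_t pfn,
                       unsigned int order, uint32_t gfp, uintptr_t caller)
{
    if (t->recs == NULL || pfn >= t->nr_pages)
        return;
    t->recs[pfn] = (struct page_owner_rec){ order, gfp, caller };
}
```

When disabled, the only cost is a NULL check per allocation, which is the "turn on/off at boottime without considerable memory waste" property the commit describes.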
Although we already have tracepoint for tracing page allocation/free,
using it to analyze page owner is rather complex. We need to enlarge the
trace buffer to prevent it from being overwritten before the userspace
program is launched. The launched program must then continually dump the
trace buffer for later analysis, which is more likely to change system
behaviour than just keeping the data in memory, and so is bad for debugging.
Moreover, we can use page_owner feature further for various purposes. For
example, we can use it for fragmentation statistics implemented in this
patch. And, I also plan to implement some CMA failure debugging feature
using this interface.
I'd like to give credit to all developers who contributed to this feature,
but it's not easy because I don't know the exact history. Sorry about that.
Below are the people who have "Signed-off-by" in the patches in Andrew's tree.
Contributor:
Alexander Nyberg <alexn@dsv.su.se>
Mel Gorman <mgorman@suse.de>
Dave Hansen <dave@linux.vnet.ibm.com>
Minchan Kim <minchan@kernel.org>
Michal Nazarewicz <mina86@mina86.com>
Andrew Morton <akpm@linux-foundation.org>
Jungsoo Son <jungsoo.son@lge.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:56:01 -08:00
page_owner= [KNL] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled by default. This switch
turns it on.
on: enable the feature
2016-03-15 14:56:27 -07:00
page_poison= [KNL] Boot-time parameter changing the state of
2018-08-21 21:53:10 -07:00
poisoning on the buddy allocator, available with
CONFIG_PAGE_POISONING=y.
off: turn off poisoning (default)
2016-03-15 14:56:27 -07:00
on: turn on poisoning
2021-06-28 19:35:19 -07:00
page_reporting.page_reporting_order=
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
2023-03-15 14:31:33 +03:00
reporting is disabled when it exceeds MAX_ORDER.
2021-06-28 19:35:19 -07:00
2011-04-04 15:02:24 -07:00
panic= [KNL] Kernel behaviour on panic: delay <timeout>
2011-07-26 16:08:52 -07:00
timeout > 0: seconds before rebooting
timeout = 0: wait forever
timeout < 0: reboot immediately
2005-04-16 15:20:36 -07:00
Format: <timeout>
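The sign of <timeout> selects one of three behaviours. A minimal sketch of that decision, using a hypothetical enum and helper name purely for illustration (this is not the kernel's panic() code):

```c
/* Hypothetical names that mirror the three panic= cases above. */
enum panic_action {
	PANIC_REBOOT_IMMEDIATELY,	/* timeout < 0 */
	PANIC_WAIT_FOREVER,		/* timeout = 0 */
	PANIC_REBOOT_AFTER_DELAY	/* timeout > 0: wait that many seconds */
};

/* Map a panic= value to the documented behaviour. */
static enum panic_action panic_action_for(int timeout)
{
	if (timeout < 0)
		return PANIC_REBOOT_IMMEDIATELY;
	if (timeout == 0)
		return PANIC_WAIT_FOREVER;
	return PANIC_REBOOT_AFTER_DELAY;
}
```

For example, "panic=10" reboots ten seconds after a panic, while "panic=-1" reboots at once.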
2019-01-03 15:28:17 -08:00
panic_print= Bitmask for printing system info when a panic happens.
The user can choose any combination of the following bits:
bit 0: print all tasks info
bit 1: print system memory info
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
2019-05-17 14:31:50 -07:00
bit 5: print all printk messages in buffer
2022-03-23 16:07:06 -07:00
bit 6: print backtraces of all CPUs (if supported by the arch)
panic: move panic_print before kmsg dumpers
The panic_print setting allows users to collect more information in a
panic event, like memory stats, tasks, CPUs backtraces, etc. This is an
interesting debug mechanism, but currently the print event happens *after*
kmsg_dump(), meaning that pstore, for example, cannot collect a dmesg with
the panic_print extra information.
This patch changes that in 2 steps:
(a) The panic_print setting allows replaying the existing kernel log
buffer to the console (bit 5), besides the extra information dump.
This functionality makes sense only at the end of the panic()
function, so we add a new boolean parameter to panic_print_sys_info()
to distinguish the two situations.
(b) With the above change, we can safely call panic_print_sys_info()
before kmsg_dump(), allowing to dump the extra information when using
pstore or other kmsg dumpers.
The additional messages from panic_print could overwrite the oldest
messages when the buffer is full. The only reasonable solution is to use
a large enough log buffer, hence we added advice about that to the
kernel parameters documentation.
Link: https://lkml.kernel.org/r/20220214141308.841525-1-gpiccoli@igalia.com
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23 16:07:09 -07:00
*Be aware* that this option may print a _lot_ of lines,
so there is a risk of losing older messages in the log.
Use this option carefully; it may be worth setting up a
bigger log buffer with "log_buf_len" along with it.
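Because the value is a plain bitmask, it can be assembled by OR-ing the bits listed above. A minimal sketch; the PP_* macro names and the format_panic_print() helper are hypothetical and only mirror the bit numbers documented here:

```c
#include <stdio.h>

/* Hypothetical names for the documented panic_print bits. */
#define PP_TASKS      (1u << 0)	/* bit 0: all tasks info */
#define PP_MEMORY     (1u << 1)	/* bit 1: system memory info */
#define PP_TIMERS     (1u << 2)	/* bit 2: timer info */
#define PP_LOCKS      (1u << 3)	/* bit 3: locks info (CONFIG_LOCKDEP) */
#define PP_FTRACE     (1u << 4)	/* bit 4: ftrace buffer */
#define PP_PRINTK_ALL (1u << 5)	/* bit 5: all printk messages in buffer */
#define PP_CPU_BT     (1u << 6)	/* bit 6: backtraces of all CPUs */

/* Write the command-line fragment for 'mask' (as a decimal value) to buf. */
static int format_panic_print(char *buf, size_t len, unsigned int mask)
{
	return snprintf(buf, len, "panic_print=%u", mask);
}
```

For instance, PP_TASKS | PP_MEMORY | PP_CPU_BT is 1 + 2 + 64 = 67, which yields "panic_print=67".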
2019-01-03 15:28:17 -08:00
2020-06-07 21:40:17 -07:00
panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
Hexadecimal bitmask representing the set of TAINT flags
that will cause the kernel to panic when add_taint() is
called with any of the flags in this set.
The optional switch "nousertaint" can be used to
prevent userspace from forcing crashes by writing a
flag set matching the panic_on_taint bitmask to the
sysctl /proc/sys/kernel/tainted.
See Documentation/admin-guide/tainted-kernels.rst for
extra details on the taint flags that users can pick
to compose the bitmask to assign to panic_on_taint.
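The <hex>[,nousertaint] format can be illustrated with a small userspace parser. This is a sketch under stated assumptions, not the kernel's own parsing code; struct taint_panic_cfg, parse_panic_on_taint() and taint_would_panic() are hypothetical names:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace representation of a panic_on_taint= value. */
struct taint_panic_cfg {
	unsigned long mask;	/* TAINT flags that trigger panic() */
	bool nousertaint;	/* reject matching writes to the sysctl */
};

/* Parse "<hex>[,nousertaint]"; returns 0 on success, -1 on bad input. */
static int parse_panic_on_taint(const char *arg, struct taint_panic_cfg *cfg)
{
	char *end;

	cfg->mask = strtoul(arg, &end, 16);	/* "0x" prefix is optional */
	cfg->nousertaint = false;
	if (end == arg)
		return -1;			/* no hex digits at all */
	if (*end == '\0')
		return 0;
	if (strcmp(end, ",nousertaint") == 0) {
		cfg->nousertaint = true;
		return 0;
	}
	return -1;				/* unexpected trailing text */
}

/* True if setting TAINT flag number 'flag' would panic under 'cfg'. */
static bool taint_would_panic(const struct taint_panic_cfg *cfg,
			      unsigned int flag)
{
	if (flag >= sizeof(cfg->mask) * CHAR_BIT)
		return false;
	return (cfg->mask >> flag) & 1UL;
}
```

Flag 9 (value 512) is "kernel issued warning" in tainted-kernels.rst, so "panic_on_taint=0x200,nousertaint" would panic on the first WARN taint while refusing to let userspace trigger the same panic through the sysctl.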
2014-12-10 15:45:50 -08:00
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
2005-04-16 15:20:36 -07:00
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
parkbd.mode= [HW] Parallel port keyboard adapter mode of operation,
0 for XT, 1 for AT (default is AT).
2005-10-23 12:57:11 -07:00
Format: <mode>
parport= [HW,PPT] Specify parallel ports. 0 disables.
Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] }
Use 'auto' to force the driver to use any
IRQ/DMA settings detected (the default is to
ignore detected IRQ/DMA settings because of
possible conflicts). You can specify the base
address, IRQ, and DMA settings; IRQ and DMA
should be numbers, or 'auto' (for using detected
settings on that particular port), or 'nofifo'
(to avoid using a FIFO even if it is detected).
Parallel ports are assigned in the order they
are specified on the command line, starting
with parport0.
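The 0xBBB[,IRQ[,DMA]] form can be split field by field. A userspace sketch, assuming hypothetical names (struct parport_spec, the PARPORT_FIELD_* sentinels, parse_parport_spec()) and that the caller has already handled the plain "0" and "auto" values; the real parport driver's parsing may differ:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel values for an IRQ or DMA field. */
#define PARPORT_FIELD_UNSET  (-1)	/* field omitted */
#define PARPORT_FIELD_AUTO   (-2)	/* "auto": use detected setting */
#define PARPORT_FIELD_NOFIFO (-3)	/* "nofifo": avoid the FIFO */

struct parport_spec {
	unsigned long base;	/* I/O base address, e.g. 0x378 */
	int irq;
	int dma;
};

/* Parse one IRQ/DMA field of length n: a number, "auto" or "nofifo". */
static int parse_field(const char *s, size_t n, int *out)
{
	char *end;
	long v;

	if (n == 0)
		return -1;
	if (n == 4 && strncmp(s, "auto", 4) == 0) {
		*out = PARPORT_FIELD_AUTO;
		return 0;
	}
	if (n == 6 && strncmp(s, "nofifo", 6) == 0) {
		*out = PARPORT_FIELD_NOFIFO;
		return 0;
	}
	v = strtol(s, &end, 0);
	if (end != s + n || v < 0 || v > INT_MAX)
		return -1;
	*out = (int)v;
	return 0;
}

/* Parse "0xBBB[,IRQ[,DMA]]"; returns 0 on success, -1 on bad input. */
static int parse_parport_spec(const char *arg, struct parport_spec *sp)
{
	int *fields[2] = { &sp->irq, &sp->dma };
	const char *p;
	char *end;
	int i;

	sp->irq = sp->dma = PARPORT_FIELD_UNSET;
	sp->base = strtoul(arg, &end, 16);
	if (end == arg)
		return -1;
	p = end;
	for (i = 0; i < 2 && *p == ','; i++) {
		const char *start = p + 1;
		size_t n = strcspn(start, ",");

		if (parse_field(start, n, fields[i]) != 0)
			return -1;
		p = start + n;
	}
	return *p == '\0' ? 0 : -1;	/* reject a fourth field */
}
```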
parport_init_mode= [HW,PPT]
Configure VIA parallel port to operate in
a specific mode. This is necessary on Pegasos
computers, whose firmware has no option for setting
up the parallel port mode and sets it to spp.
Currently this function supports the 686a and 8231 chips.
2005-04-16 15:20:36 -07:00
Format: [spp|ps2|epp|ecp|ecpepp]
2021-03-21 20:55:22 +01:00
pata_legacy.all= [HW,LIBATA]
Format: <int>
Set to non-zero to probe primary and secondary ISA
port ranges on PCI systems where no PCI PATA device
has been found at either range. Disabled by default.
pata_legacy.autospeed= [HW,LIBATA]
Format: <int>
Set to non-zero if a chip is present that snoops speed
changes. Disabled by default.
pata_legacy.ht6560a= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for HT 6560A on the primary channel,
the secondary channel, or both channels respectively.
Disabled by default.
pata_legacy.ht6560b= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for HT 6560B on the primary channel,
the secondary channel, or both channels respectively.
Disabled by default.
pata_legacy.iordy_mask= [HW,LIBATA]
Format: <int>
IORDY enable mask. Set individual bits to allow IORDY
for the respective channel. Bit 0 is for the first
legacy channel handled by this driver, bit 1 is for
the second channel, and so on. The sequence will often
correspond to the primary legacy channel, the secondary
legacy channel, and so on, but the handling of a PCI
bus and the use of other driver options may interfere
with the sequence. By default IORDY is allowed across
all channels.
pata_legacy.opti82c46x= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for Opti 82c465MV on the primary
channel, the secondary channel, or both channels
respectively. Disabled by default.
pata_legacy.opti82c611a= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for Opti 82c611A on the primary
channel, the secondary channel, or both channels
respectively. Disabled by default.
pata_legacy.pio_mask= [HW,LIBATA]
Format: <int>
PIO mode mask for autospeed devices. Set individual
bits to allow the use of the respective PIO modes.
Bit 0 is for mode 0, bit 1 is for mode 1, and so on.
All modes allowed by default.
pata_legacy.probe_all= [HW,LIBATA]
Format: <int>
Set to non-zero to probe tertiary and further ISA
port ranges on PCI systems. Disabled by default.
pata_legacy: Add `probe_mask' parameter like with ide-generic
Carry the `probe_mask' parameter over from ide-generic to pata_legacy so
that there is a way to prevent random poking at ISA port I/O locations
in an attempt to discover adapter option cards with libata, as with the
old IDE driver. By default all enabled locations are tried; however, this
may interfere with a different kind of hardware responding there.
For example with a plain (E)ISA system the driver tries all the six
possible locations:
scsi host0: pata_legacy
ata1: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14
ata1.00: ATA-4: ST310211A, 3.54, max UDMA/100
ata1.00: 19541088 sectors, multi 16: LBA
ata1.00: configured for PIO
scsi 0:0:0:0: Direct-Access ATA ST310211A 3.54 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:0:0: [sda] 19541088 512-byte logical blocks: (10.0 GB/9.32 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
scsi host1: pata_legacy
ata2: PATA max PIO4 cmd 0x170 ctl 0x376 irq 15
scsi host1: pata_legacy
ata3: PATA max PIO4 cmd 0x1e8 ctl 0x3ee irq 11
scsi host1: pata_legacy
ata4: PATA max PIO4 cmd 0x168 ctl 0x36e irq 10
scsi host1: pata_legacy
ata5: PATA max PIO4 cmd 0x1e0 ctl 0x3e6 irq 8
scsi host1: pata_legacy
ata6: PATA max PIO4 cmd 0x160 ctl 0x366 irq 12
however giving the kernel "pata_legacy.probe_mask=21" makes it try every
other location only:
scsi host0: pata_legacy
ata1: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14
ata1.00: ATA-4: ST310211A, 3.54, max UDMA/100
ata1.00: 19541088 sectors, multi 16: LBA
ata1.00: configured for PIO
scsi 0:0:0:0: Direct-Access ATA ST310211A 3.54 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:0:0: [sda] 19541088 512-byte logical blocks: (10.0 GB/9.32 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
scsi host1: pata_legacy
ata2: PATA max PIO4 cmd 0x1e8 ctl 0x3ee irq 11
scsi host1: pata_legacy
ata3: PATA max PIO4 cmd 0x1e0 ctl 0x3e6 irq 8
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2103211800110.21463@angie.orcam.me.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21 20:55:32 +01:00
pata_legacy.probe_mask= [HW,LIBATA]
Format: <int>
Probe mask for legacy ISA PATA ports. Depending on
platform configuration and the use of other driver
options, up to 6 legacy ports are supported: 0x1f0,
0x170, 0x1e8, 0x168, 0x1e0, 0x160; however, probing
of individual ports can be disabled by clearing the
corresponding bits in the mask (setting them to 0).
Bit 0 is for the first port in the list above (0x1f0),
and so on. By default all supported ports are probed.
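A set bit selects a port for probing, which matches the commit example above where "pata_legacy.probe_mask=21" (binary 010101) tries only 0x1f0, 0x1e8 and 0x1e0. A sketch of that mapping; ports_probed() is a hypothetical helper, and it ignores the other driver options that can remove ports from the list:

```c
#include <stddef.h>

/* Legacy ISA PATA ports in bit order, as listed above. */
static const unsigned int legacy_ports[6] = {
	0x1f0, 0x170, 0x1e8, 0x168, 0x1e0, 0x160
};

/* Store the ports whose bit is set in probe_mask in 'out'; return count. */
static size_t ports_probed(unsigned int probe_mask, unsigned int out[6])
{
	size_t n = 0;

	for (size_t i = 0; i < 6; i++)
		if (probe_mask & (1u << i))
			out[n++] = legacy_ports[i];
	return n;
}
```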
2021-03-21 20:55:22 +01:00
pata_legacy.qdi= [HW,LIBATA]
Format: <int>
Set to non-zero to probe QDI controllers. By default
set to 1 if CONFIG_PATA_QDI_MODULE, 0 otherwise.
pata_legacy.winbond= [HW,LIBATA]
Format: <int>
Set to non-zero to probe Winbond controllers. Use
the standard I/O port (0x130) if 1, otherwise the
value given is the I/O port to use (typically 0x1b0).
By default set to 1 if CONFIG_PATA_WINBOND_VLB_MODULE,
0 otherwise.
2021-03-21 20:55:27 +01:00
pata_platform.pio_mask= [HW,LIBATA]
Format: <int>
Supported PIO mode mask. Set individual bits to allow
the use of the respective PIO modes. Bit 0 is for
mode 0, bit 1 is for mode 1, and so on. By default
only mode 0 is allowed.
2006-03-23 03:00:57 -08:00
pause_on_oops=
Halt all CPUs after the first oops has been printed for
the specified number of seconds. This is to be used if
your oopses keep scrolling off the screen.
2005-04-16 15:20:36 -07:00
pcbit= [HW,ISDN]
2018-07-30 10:18:37 -06:00
pci=option[,option...] [PCI] various PCI subsystem options.
Some options herein operate on a specific device
or a set of devices (<pci_dev>). These are
specified in one of the following formats:
2018-07-30 10:18:38 -06:00
[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
2018-07-30 10:18:37 -06:00
pci:<vendor>:<device>[:<subvendor>:<subdevice>]
Note: the first format specifies a PCI
bus/device/function address which may change
if new hardware is inserted, if motherboard
firmware changes, or due to changes caused
by other kernel parameters. If the
domain is left unspecified, it is
2018-07-30 10:18:38 -06:00
taken to be zero. Optionally, a path
to a device through multiple device/function
addresses can be specified after the base
address (this is more robust against
renumbering issues). The second format
2018-07-30 10:18:37 -06:00
selects devices using IDs from the
configuration space which may match multiple
devices in the system.
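The first <pci_dev> format can be split mechanically. A userspace sketch, assuming the hypothetical struct pci_addr_spec and parse_pci_addr() below plus an arbitrary limit of 8 path hops; it covers only the [<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]* form, not pci:<vendor>:<device>, and is not the kernel's own parser:

```c
#include <stdio.h>

#define MAX_HOPS 8	/* arbitrary limit chosen for this sketch */

struct pci_addr_spec {
	unsigned int domain, bus, dev, fn;
	unsigned int path_len;		/* number of extra /<dev>.<func> hops */
	unsigned int path_dev[MAX_HOPS], path_fn[MAX_HOPS];
};

/* Parse "[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*"; 0 on success. */
static int parse_pci_addr(const char *s, struct pci_addr_spec *a)
{
	int n = 0;

	a->path_len = 0;
	if (sscanf(s, "%x:%x:%x.%x%n", &a->domain, &a->bus, &a->dev,
		   &a->fn, &n) != 4) {
		a->domain = 0;		/* unspecified domain is taken as zero */
		n = 0;
		if (sscanf(s, "%x:%x.%x%n", &a->bus, &a->dev, &a->fn, &n) != 3)
			return -1;
	}
	if (a->bus > 0xff || a->dev > 31 || a->fn > 7)
		return -1;
	s += n;
	while (*s == '/') {		/* optional path below the base address */
		unsigned int d, f;
		int m = 0;

		if (a->path_len == MAX_HOPS ||
		    sscanf(s + 1, "%x.%x%n", &d, &f, &m) != 2)
			return -1;
		if (d > 31 || f > 7)
			return -1;
		a->path_dev[a->path_len] = d;
		a->path_fn[a->path_len] = f;
		a->path_len++;
		s += 1 + m;
	}
	return *s == '\0' ? 0 : -1;
}
```

The fallback to the three-field form is what makes the domain optional: "02:00.1" fails the four-field pattern and is then read as bus 2, device 0, function 1 in domain 0.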
2018-06-04 22:16:09 -04:00
earlydump dump PCI config space before the kernel
2018-04-18 20:51:39 +02:00
changes anything
2008-08-22 09:53:39 +02:00
off [X86] don't probe for the PCI bus
2007-07-31 00:37:59 -07:00
bios [X86-32] force use of PCI BIOS, don't access
2005-10-23 12:57:11 -07:00
the hardware directly. Use this if your machine
has a non-standard PCI host bridge.
2007-07-31 00:37:59 -07:00
nobios [X86-32] disallow use of PCI BIOS, only direct
2005-10-23 12:57:11 -07:00
hardware access methods are allowed. Use this
if you experience crashes upon bootup and you
suspect they are caused by the BIOS.
2016-01-13 16:48:51 +01:00
conf1 [X86] Force use of PCI Configuration Access
Mechanism 1 (config address in IO port 0xCF8,
data in IO port 0xCFC, both 32-bit).
conf2 [X86] Force use of PCI Configuration Access
Mechanism 2 (IO port 0xCF8 is an 8-bit port for
the function, IO port 0xCFA, also 8-bit, sets
bus number. The config space is then accessed
through ports 0xC000-0xCFFF).
See http://wiki.osdev.org/PCI for more info
on the configuration access mechanisms.
2007-10-05 13:17:58 -07:00
noaer [PCIE] If the PCIEAER kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of PCIE advanced error reporting.
2007-10-11 16:57:27 -04:00
nodomains [PCI] Disable support for multiple PCI
root domains (aka PCI segments, in ACPI-speak).
2009-04-14 14:03:43 +05:30
nommconf [X86] Disable use of MMCONFIG for PCI
2006-02-15 15:17:43 -08:00
Configuration
2009-06-07 16:15:16 +02:00
check_enable_amd_mmconf [X86] check for and enable
properly configured MMIO access to PCI
config space on AMD family 10h CPU
2006-03-05 22:33:34 -07:00
nomsi [MSI] If the PCI_MSI kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of MSI interrupts system-wide.
2008-06-11 16:35:14 +02:00
noioapicquirk [APIC] Disable all boot interrupt quirks.
Safety option to keep boot IRQs enabled. This
should never be necessary.
2008-06-11 16:35:15 +02:00
ioapicreroute [APIC] Enable rerouting of boot IRQs to the
primary IO-APIC for bridges that cannot disable
boot IRQs. This fixes a source of spurious IRQs
when the system masks IRQs.
2008-07-15 13:48:55 +02:00
noioapicreroute [APIC] Disable workaround that uses the
boot IRQ equivalent of an IRQ that connects to
a chipset where boot IRQs cannot be disabled.
The opposite of ioapicreroute.
2007-07-31 00:37:59 -07:00
biosirq [X86-32] Use PCI BIOS calls to get the interrupt
2005-10-23 12:57:11 -07:00
routing table. These calls are known to be buggy
on several machines and they hang the machine
when used, but on other computers it's the only
way to get the interrupt routing table. Try
this option if the kernel is unable to allocate
IRQs or discover secondary PCI buses on your
motherboard.
2008-08-22 09:53:39 +02:00
rom [X86] Assign address space to expansion ROMs.
2005-10-23 12:57:11 -07:00
Use with caution as certain devices share
address decoders between ROMs and other
resources.
2008-08-22 09:53:39 +02:00
norom [X86] Do not assign address space to
2008-05-12 13:57:46 -07:00
expansion ROMs that do not already have
BIOS assigned address ranges.
2010-05-12 11:14:32 -07:00
nobar [X86] Do not assign address space to the
BARs that weren't assigned by the BIOS.
2008-08-22 09:53:39 +02:00
irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be
2005-10-23 12:57:11 -07:00
assigned automatically to PCI devices. You can
make the kernel exclude IRQs of your ISA cards
this way.
2008-08-22 09:53:39 +02:00
pirqaddr=0xAAAAA [X86] Specify the physical address
2005-10-23 12:57:11 -07:00
of the PIRQ table (normally generated
by the BIOS) if it is outside the
F0000h-100000h range.
2008-08-22 09:53:39 +02:00
lastbus=N [X86] Scan all buses through bus #N. Can be
2005-10-23 12:57:11 -07:00
useful if the kernel is unable to find your
secondary buses and you want to tell it
explicitly which ones they are.
2008-08-22 09:53:39 +02:00
assign-busses [X86] Always assign all PCI bus
2005-10-23 12:57:11 -07:00
numbers ourselves, overriding
whatever the firmware may have done.
2008-08-22 09:53:39 +02:00
usepirqmask [X86] Honor the possible IRQ mask stored
2005-10-23 12:57:11 -07:00
in the BIOS $PIR table. This is needed on
some systems with broken BIOSes, notably
some HP Pavilion N5400 and Omnibook XE3
notebooks. This will have no effect if ACPI
IRQ routing is enabled.
2008-08-22 09:53:39 +02:00
noacpi [X86] Do not use ACPI for IRQ routing
2005-10-23 12:57:11 -07:00
or for PCI scanning.
2010-02-23 10:24:41 -07:00
use_crs [X86] Use PCI host bridge window information
from ACPI. On BIOSes from 2008 or later, this
is enabled by default. If you need to use this,
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
2018-04-18 20:51:39 +02:00
If you need to use this, please report a bug.
x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
Some firmware supplies PCI host bridge _CRS that includes address space
unusable by PCI devices, e.g., space occupied by host bridge registers or
used by hidden PCI devices.
To avoid this unusable space, Linux currently excludes E820 reserved
regions from _CRS windows; see 4dc2287c1805 ("x86: avoid E820 regions when
allocating address space").
However, this use of E820 reserved regions to clip things out of _CRS is
not supported by ACPI, UEFI, or PCI Firmware specs, and some systems have
E820 reserved regions that cover the entire memory window from _CRS.
4dc2287c1805 clips the entire window, leaving no space for hot-added or
uninitialized PCI devices.
For example, from a Lenovo IdeaPad 3 15IIL 81WE:
BIOS-e820: [mem 0x4bc50000-0xcfffffff] reserved
pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
pci 0000:00:15.0: BAR 0: [mem 0x00000000-0x00000fff 64bit]
pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
Future patches will add quirks to enable/disable E820 clipping
automatically.
Add a "pci=no_e820" kernel command line option to disable clipping with
E820 reserved regions. Also add a matching "pci=use_e820" option to enable
clipping with E820 reserved regions if that has been disabled by default by
further patches in this patch-set.
Both options taint the kernel because they are intended for debugging and
workaround purposes until a quirk can set them automatically.
[bhelgaas: commit log, add printk]
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1868899 Lenovo IdeaPad 3
Link: https://lore.kernel.org/r/20220519152150.6135-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Benoit Grégoire <benoitg@coeus.ca>
Cc: Hui Wang <hui.wang@canonical.com>
2022-05-19 17:21:48 +02:00
use_e820 [X86] Use E820 reservations to exclude parts of
PCI host bridge windows. This is a workaround
for BIOS defects in host bridge _CRS methods.
If you need to use this, please report a bug to
<linux-pci@vger.kernel.org>.
no_e820 [X86] Ignore E820 reservations for PCI host
bridge windows. This is the default on modern
hardware. If you need to use this, please report
a bug to <linux-pci@vger.kernel.org>.
2005-10-23 12:57:11 -07:00
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
for broken drivers that don't call it.
2008-03-27 01:31:18 -07:00
skip_isa_align [X86] Do not align the I/O start address, so
that more PCI cards can be handled.
2006-09-26 10:52:41 +02:00
noearly [X86] Don't do any early type 1 scanning.
This might help on some broken boards which
machine check when some devices' config space
is read. But various workarounds are disabled
and some IOMMU drivers will not work.
PCI: optionally sort device lists breadth-first
Problem:
New Dell PowerEdge servers have 2 embedded ethernet ports, which are
labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
in the printed documentation. Assuming no other add-in ethernet ports
in the system, Linux 2.4 kernels name these eth0 and eth1
respectively. Many people have come to expect this naming. Linux 2.6
kernels name these eth1 and eth0 respectively (backwards from
expectations). I also have reports that various Sun and HP servers
have similar behavior.
Root cause:
Linux 2.4 kernels walk the pci_devices list, which happens to be
sorted in breadth-first order (or pcbios_find_device order on i386,
which most often is breadth-first also). 2.6 kernels have both the
pci_devices list and the pci_bus_type.klist_devices list, the latter
is what is walked at driver load time to match the pci_id tables; this
klist happens to be in depth-first order.
On systems where, for physical routing reasons, NIC1 appears on a
lower bus number than NIC2, but NIC2's bridge is discovered first in
the depth-first ordering, NIC2 will be discovered before NIC1. If the
list were sorted breadth-first, NIC1 would be discovered before NIC2.
A PowerEdge 1955 system has the following topology which easily
exhibits the difference between depth-first and breadth-first device
lists.
-[0000:00]-+-00.0 Intel Corporation 5000P Chipset Memory Controller Hub
+-02.0-[0000:03-08]--+-00.0-[0000:04-07]--+-00.0-[0000:05-06]----00.0-[0000:06]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC2, 2.4 kernel name eth1, 2.6 kernel name eth0)
+-1c.0-[0000:01-02]----00.0-[0000:02]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC1, 2.4 kernel name eth0, 2.6 kernel name eth1)
Other factors, such as device driver load order and the presence of
PCI slots at various points in the bus hierarchy further complicate
this problem; I'm not trying to solve those here, just restore the
device order, and thus basic behavior, that 2.4 kernels had.
Solution:
The solution can come in multiple steps.
Suggested fix #1: kernel
Patch below optionally sorts the two device lists into breadth-first
ordering to maintain compatibility with 2.4 kernels. It adds two new
command line options:
pci=bfsort
pci=nobfsort
to force the sort order, or not, as you wish. It also adds DMI checks
for the specific Dell systems which exhibit "backwards" ordering, to
make them "right".
Suggested fix #2: udev rules from userland
Many people also have the expectation that embedded NICs are always
discovered before add-in NICs (which this patch does not try to do).
Using the PCI IRQ Routing Table provided by system BIOS, it's easy to
determine which PCI devices are embedded, or if add-in, which PCI slot
they're in. I'm working on a tool that would allow udev to name
ethernet devices in ascending embedded, slot 1 .. slot N order,
subsort by PCI bus/dev/fn breadth-first. It'll be possible to use it
independent of udev as well for those distributions that don't use
udev in their installers.
Suggested fix #3: system board routing rules
One can constrain the system board layout to put NIC1 ahead of NIC2
regardless of breadth-first or depth-first discovery order. This adds
a significant level of complexity to board routing, and may not be
possible in all instances (witness the above systems from several
major manufacturers). I don't want to encourage this particular train
of thought too far, at the expense of not doing #1 or #2 above.
Feedback appreciated. Patch tested on a Dell PowerEdge 1955 blade
with 2.6.18.
You'll also note I took some liberty and temporarily break the klist
abstraction to simplify and speed up the sort algorithm. I think
that's both safe and appropriate in this instance.
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-29 15:23:23 -05:00
bfsort Sort PCI devices into breadth-first order.
This sorting is done to get a device
order compatible with older (<= 2.4) kernels.
nobfsort Don't sort PCI devices into breadth-first order.
2013-01-30 09:40:52 +08:00
pcie_bus_tune_off Disable PCIe MPS (Max Payload Size)
tuning and use the BIOS-configured MPS defaults.
pcie_bus_safe Set every device's MPS to the largest value
supported by all devices below the root complex.
pcie_bus_perf Set device MPS to the largest allowable MPS
based on its parent bus. Also set MRRS (Max
Read Request Size) to the largest supported
value (no larger than the MPS that the device
or bus can support) for best performance.
pcie_bus_peer2peer Set every device's MPS to 128B, which
every device is guaranteed to support. This
configuration allows peer-to-peer DMA between
any pair of devices, possibly at the cost of
reduced performance. This also guarantees
that hot-added devices will work.
2007-02-05 16:36:06 -08:00
cbiosize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
cbmemsize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's memory
window. The default value is 64 megabytes.
2009-03-16 17:13:39 +09:00
resource_alignment=
Format:
2018-07-30 10:18:37 -06:00
[<order of align>@]<pci_dev>[; ...]
2009-03-16 17:13:39 +09:00
Specifies alignment and device to reassign
2018-07-30 10:18:37 -06:00
aligned memory resources. How to
specify the device is described above.
2009-03-16 17:13:39 +09:00
If <order of align> is not specified,
PAGE_SIZE is used as alignment.
2019-06-06 13:25:57 +10:00
A PCI-PCI bridge can be specified if resource
2009-03-16 17:13:39 +09:00
windows need to be expanded.
2016-08-09 10:33:31 +02:00
To specify the alignment for several
instances of a device, the PCI vendor,
device, subvendor, and subdevice may be
2019-06-06 13:25:57 +10:00
specified, e.g., 12@pci:8086:9c22:103c:198f
for 4096-byte alignment.
2009-04-22 16:52:09 -06:00
ecrc= Enable/disable PCIe ECRC (transaction layer
2023-01-12 12:51:11 +05:30
end-to-end CRC checking). Only effective if the
OS has native AER control (either granted by
ACPI _OSC or forced via "pcie_ports=native").
2009-04-22 16:52:09 -06:00
bios: Use BIOS/firmware settings. This is
the default.
off: Turn ECRC off
on: Turn ECRC on.
2013-01-23 20:29:06 +08:00
hpiosize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's IO window.
Default size is 256 bytes.
2019-10-23 12:12:29 +00:00
hpmmiosize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's MMIO window.
Default size is 2 megabytes.
hpmmioprefsize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's MMIO_PREF window.
Default size is 2 megabytes.
2013-01-23 20:29:06 +08:00
hpmemsize=nn[KMG] The fixed amount of bus space which is
2019-10-23 12:12:29 +00:00
reserved for hotplug bridge's MMIO and
MMIO_PREF window.
2013-01-23 20:29:06 +08:00
Default size is 2 megabytes.
2016-07-21 21:40:28 -06:00
hpbussize=nn The minimum amount of additional bus numbers
reserved for buses below a hotplug bridge.
Default is 1.
2012-02-23 19:23:30 -08:00
realloc= Enable/disable reallocating PCI bridge resources
if allocations done by BIOS are too small to
accommodate resources required by all child
devices.
off: Turn realloc off
on: Turn realloc on
realloc same as realloc=on
2012-03-01 00:06:33 +01:00
noari do not use PCIe ARI.
2018-05-10 17:56:02 -05:00
noats [PCIE, Intel-IOMMU, AMD-IOMMU]
do not use PCIe ATS (and IOMMU device IOTLB).
2012-04-30 15:21:02 -06:00
pcie_scan_all Scan all possible PCIe devices. Otherwise we
only look for one device below a PCIe downstream
port.
2018-01-11 14:23:29 +01:00
big_root_window Try to add a big 64bit memory window to the PCIe
root complex on AMD CPUs. Some GFX hardware
can resize a BAR to allow access to all VRAM.
Adding the window is slightly risky (it may
conflict with unreported devices), so this
taints the kernel.
2018-07-30 10:18:40 -06:00
disable_acs_redir=<pci_dev>[; ...]
Specify one or more PCI devices (in the format
specified above) separated by semicolons.
Each device specified will have the PCI ACS
redirect capabilities forced off which will
allow P2P traffic between devices through
bridges without forcing it upstream. Note:
this removes isolation between devices and
may put more devices in an IOMMU group.
2019-02-26 16:07:32 +01:00
force_floating [S390] Force usage of floating interrupts.
2019-04-18 21:39:06 +02:00
nomio [S390] Do not use MIO instructions.
2020-04-01 11:12:24 +02:00
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
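The nn[KMG] sizes taken by cbiosize=, cbmemsize=, hpiosize=, hpmmiosize=, hpmmioprefsize= and hpmemsize= are byte counts with an optional binary suffix. A userspace sketch modelled on the kernel's memparse() helper, assuming these options go through a memparse()-style parser; parse_size() is a hypothetical name, and it handles only the K/M/G suffixes used above (memparse() itself also accepts larger ones):

```c
#include <stdlib.h>

/* Parse "nn[KMG]" into bytes; K, M and G are powers of 1024. */
static unsigned long long parse_size(const char *s, const char **endp)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);	/* decimal, 0x or octal */

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;			/* consume the suffix */
		break;
	default:
		break;			/* no suffix: plain bytes */
	}
	if (endp)
		*endp = end;
	return v;
}
```

With this reading, "pci=hpmemsize=2M" reserves 2097152 bytes, the same as the documented default.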
2008-09-24 20:40:34 -04:00
pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
off Disable ASPM.
force Enable ASPM even on devices that claim not to support it.
WARNING: Forcing ASPM on may cause system lockups.
pcie_ports= [PCIE] PCIe port services handling:
native Use native PCIe services (PME, AER, DPC, PCIe hotplug)
even if the platform doesn't give the OS permission to
use them. This may cause conflicts if the platform
also tries to use these services.
dpc-native Use native PCIe service for DPC only. May
cause conflicts if firmware uses AER or DPC.
compat Disable native PCIe services (PME, AER, DPC, PCIe
hotplug).
pcie_port_pm= [PCIE] PCIe port power management handling:
off Disable power management of all PCIe ports
force Forcibly enable power management of all PCIe ports
pcie_pme= [PCIE,PM] Native PCIe PME signaling options:
nomsi Do not use MSI for native PCIe PME signaling (this makes
all PCIe root ports use INTx for all services).
pcmv= [HW,PCMCIA] BadgePAD 4
pd_ignore_unused
[PM]
Keep all power domains already enabled by the bootloader on,
even if no driver has claimed them. This is useful
for debug and development, but should not be
needed on a platform with proper driver support.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
Format: { 0 | 1 }
See arch/parisc/kernel/pdc_chassis.c
percpu_alloc= Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Architectures may support a subset or none of these selections.
See comments in mm/percpu.c for details on each
allocator. This parameter is primarily for debugging
and performance comparison.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
See also Documentation/admin-guide/parport.rst.
pmtmr= [X86] Manual setup of pmtmr I/O Port.
Override the PM timer I/O port with a hex value,
e.g. pmtmr=0x508
pmu_override= [PPC] Override the PMU.
This option takes over the PMU facility, so it is no
longer usable by perf. Setting this option starts the
PMU counters by setting MMCR0 to 0 (the FC bit is
cleared). If a number is given, then MMCR1 is set to
that number, otherwise (e.g., 'pmu_override=on'), MMCR1
remains 0.
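The following is a minimal userspace sketch of how the value above is interpreted. It assumes that numbers may be given in decimal or 0x-prefixed hex, which the entry does not state. struct pmu_override and parse_pmu_override() are hypothetical names, and the sketch does not touch any registers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical holder for the register values the option would program. */
struct pmu_override {
	uint64_t mmcr0;	/* always 0: FC bit cleared, so the counters run */
	uint64_t mmcr1;	/* the given number, or 0 if none was given */
};

struct pmu_override parse_pmu_override(const char *s)
{
	struct pmu_override o = { 0, 0 };
	char *end;
	/* base 0: accept decimal or 0x-prefixed hex (an assumption) */
	unsigned long long v = strtoull(s, &end, 0);

	/* Only a fully numeric value counts as "a number is given". */
	if (end != s && *end == '\0')
		o.mmcr1 = (uint64_t)v;
	return o;	/* e.g. "on": MMCR1 remains 0 */
}
```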
pm_debug_messages [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
pnp.debug=1 [PNP]
Enable PNP debug messages (depends on the
CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time
via /sys/module/pnp/parameters/debug. We always show
current resource usage; turning this on also shows
possible settings and some assignment information.
pnpacpi= [ACPI]
{ off }
pnpbios= [ISAPNP]
{ on | off | curr | res | no-curr | no-res }
pnp_reserve_irq=
[ISAPNP] Exclude IRQs for the autoconfiguration
pnp_reserve_dma=
[ISAPNP] Exclude DMAs for the autoconfiguration
pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration
Ranges are in pairs (I/O port base and size).
pnp_reserve_mem=
[ISAPNP] Exclude memory regions for the
autoconfiguration.
Ranges are in pairs (memory base and size).
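The pair format used by pnp_reserve_io= and pnp_reserve_mem= can be illustrated with a small userspace C sketch. It splits a value such as 0x100,0x10,0x300,8 into (base, size) pairs. The function name, struct, and MAX_RESERVE_PAIRS limit are hypothetical, and the sketch does not reproduce the kernel's own parser.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical limit for this sketch; not the kernel's value. */
#define MAX_RESERVE_PAIRS 8

struct reserve_range {
	unsigned long base;
	unsigned long size;
};

/*
 * Parse "base,size[,base,size...]". Numbers may be decimal or 0x hex.
 * Returns the number of pairs stored in out[], or -EINVAL if the list
 * is malformed, too long, or has an odd number of values.
 */
int parse_reserve_pairs(const char *s, struct reserve_range *out, int max)
{
	unsigned long vals[2 * MAX_RESERVE_PAIRS];
	int n = 0;

	if (max > MAX_RESERVE_PAIRS)
		max = MAX_RESERVE_PAIRS;
	while (*s) {
		char *end;

		if (n == 2 * max)
			return -EINVAL;	/* more values than we can store */
		errno = 0;
		vals[n++] = strtoul(s, &end, 0);
		if (end == s || errno)
			return -EINVAL;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -EINVAL;
		s = end;
	}
	if (n % 2)
		return -EINVAL;	/* a base without its size */
	for (int i = 0; i < n / 2; i++) {
		out[i].base = vals[2 * i];
		out[i].size = vals[2 * i + 1];
	}
	return n / 2;
}
```

For example, pnp_reserve_io=0x100,0x10,0x300,8 would describe two port ranges in this model.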
ports= [IP_VS_FTP] IPVS ftp helper module
Default is 21.
Up to 8 (IP_VS_APP_MAX_PORTS) ports
may be specified.
Format: <port>,<port>...
powersave=off [PPC] This option disables power saving features.
It specifically disables cpuidle and sets the
platform machine description specific power_save
function to NULL. When idle, the CPU just reduces
its execution priority.
ppc_strict_facility_enable
[PPC] This option catches any kernel floating point,
Altivec, VSX and SPE usage outside of regions specifically
allowed (eg kernel_enable_fpu()/kernel_disable_fpu()).
There is some performance impact when enabling this.
ppc_tm= [PPC]
Format: {"off"}
Disable Hardware Transactional Memory
preempt= [KNL]
Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
can be preempted anytime.
print-fatal-signals=
[KNL] debug: print fatal signals
If enabled, warn about various signal handling
related application anomalies: too many signals,
too many POSIX.1 timers, fatal signals causing a
coredump - etc.
If you hit the warning due to signal overflow,
you might want to try "ulimit -i unlimited".
default: off.
printk.always_kmsg_dump=
Trigger kmsg_dump for cases other than kernel oops or
panics
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
default: disabled
printk.console_no_auto_verbose=
Disable console loglevel raise on oops, panic
or lockdep-detected issues (only if lock debug is on).
Except for setups with a low baud rate on the
serial console, keeping this 0 is a good choice
in order to provide more debug information.
Format: <bool>
default: 0 (auto_verbose is enabled)
printk.devkmsg={on,off,ratelimit}
Control writing to /dev/kmsg.
on - unlimited logging to /dev/kmsg from userspace
off - logging to /dev/kmsg disabled
ratelimit - ratelimit the logging
Default: ratelimit
printk.time= Show timing data prefixed to each printk message line
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
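The <bool> format is shared by printk.always_kmsg_dump=, printk.time= and randomize_kstack_offset=. The sketch below covers only the documented spellings, where 1/Y/y enables and 0/N/n disables. parse_bool_param() is a hypothetical name. The kernel's own boolean parser may accept additional spellings (such as "on"/"off"), so this is not a drop-in model of it.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Only the first character is examined. On success, stores the value
 * in *res and returns 0; returns -1 for an unrecognised value.
 */
int parse_bool_param(const char *s, bool *res)
{
	if (s == NULL)
		return -1;
	switch (s[0]) {
	case '1': case 'Y': case 'y':
		*res = true;
		return 0;
	case '0': case 'N': case 'n':
		*res = false;
		return 0;
	default:
		return -1;
	}
}
```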
processor.max_cstate= [HW,ACPI]
Limit processor to maximum C-state
max_cstate=9 overrides any DMI blacklist limit.
processor.nocst [HW,ACPI]
Ignore the _CST method to determine C-states,
instead using the legacy FADT method
profile= [KNL] Enable kernel profiling via /proc/profile
Format: [<profiletype>,]<number>
Param: <profiletype>: "schedule", "sleep", or "kvm"
[defaults to kernel profiling]
Param: "schedule" - profile schedule points.
Param: "sleep" - profile D-state sleeping (millisecs).
Requires CONFIG_SCHEDSTATS
Param: "kvm" - profile VM exits.
Param: <number> - step/bucket size as a power of 2 for
statistical time based profiling.
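The [<profiletype>,]<number> format and the "power of 2" bucket size can be sketched in userspace C as follows. The value profile=schedule,4 would select schedule profiling with 16-byte buckets. Several details are assumptions of this sketch: all names, the mapping of an address to bucket (address - text_start) >> shift, and the clamping of out-of-range addresses into the last bucket.

```c
#include <stdlib.h>
#include <string.h>

enum prof_type { PROF_CPU, PROF_SCHED, PROF_SLEEP, PROF_KVM };

struct prof_cfg {
	enum prof_type type;
	unsigned int shift;	/* bucket size is 1 << shift bytes */
};

/* Parse "[<profiletype>,]<number>"; returns 0, or -1 on bad input. */
int parse_profile(const char *s, struct prof_cfg *cfg)
{
	static const struct { const char *name; enum prof_type type; } types[] = {
		{ "schedule", PROF_SCHED },
		{ "sleep", PROF_SLEEP },
		{ "kvm", PROF_KVM },
	};
	const char *comma = strchr(s, ',');
	char *end;

	cfg->type = PROF_CPU;	/* default: kernel profiling */
	if (comma) {
		size_t len = (size_t)(comma - s);
		int found = 0;

		for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
			if (strlen(types[i].name) == len &&
			    strncmp(s, types[i].name, len) == 0) {
				cfg->type = types[i].type;
				found = 1;
			}
		}
		if (!found)
			return -1;
		s = comma + 1;
	}
	unsigned long shift = strtoul(s, &end, 10);
	if (end == s || *end != '\0' || shift >= 8 * sizeof(unsigned long))
		return -1;
	cfg->shift = (unsigned int)shift;
	return 0;
}

/* Map an address to its histogram bucket (nbuckets must be > 0). */
unsigned long prof_bucket(unsigned long pc, unsigned long text_start,
			  unsigned int shift, unsigned long nbuckets)
{
	unsigned long idx = (pc - text_start) >> shift;

	return idx < nbuckets ? idx : nbuckets - 1;
}
```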
prompt_ramdisk= [RAM] [Deprecated]
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
that).
Format: <bool>
psi= [KNL] Enable or disable pressure stall information
tracking.
Format: <bool>
psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to
probe for; one of (bare|imps|exps|lifebook|any).
psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports
per second.
psmouse.resetafter= [HW,MOUSE]
Try to reset the device after so many bad packets
(0 = never).
psmouse.resolution=
[HW,MOUSE] Set desired mouse resolution, in dpi.
psmouse.smartscroll=
[HW,MOUSE] Controls Logitech smartscroll autorepeat.
0 = disabled, 1 = enabled (default).
pstore.backend= Specify the name of the pstore backend to use
pti= [X86-64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.
on - unconditionally enable
off - unconditionally disable
auto - kernel detects whether your CPU model is
vulnerable to issues that PTI mitigates
Not specifying this option is equivalent to pti=auto.
nopti [X86-64]
Equivalent to pti=off
pty.legacy_count=
[KNL] Number of legacy ptys. Overrides the compiled-in
default number.
quiet [KNL] Disable most log messages
r128= [HW,DRM]
radix_hcall_invalidate=on [PPC/PSERIES]
Disable RADIX GTSE feature and use hcall for TLB
invalidate.
raid= [HW,RAID]
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/admin-guide/blockdev/ramdisk.rst.
ramdisk_start= [RAM] RAM disk image start address
random.trust_cpu=off
[KNL] Disable trusting the use of the CPU's
random number generator (if available) to
initialize the kernel's RNG.
random.trust_bootloader=off
[KNL] Disable trusting the use of a seed
passed by the bootloader (if available) to
initialize the kernel's RNG.
randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset
randomization, which provides roughly 5 bits of
entropy, frustrating memory corruption attacks
that depend on stack address determinism or
cross-syscall address exposures. This is only
available on architectures that have defined
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET.
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
Default is CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
ras=option[,option,...] [KNL] RAS-specific options
cec_disable [X86]
Disable the Correctable Errors Collector,
see CONFIG_RAS_CEC help text.
rcu_nocbs[=cpu-list]
[KNL] The optional argument is a cpu list,
as described above.
In kernels built with CONFIG_RCU_NOCB_CPU=y,
enable the no-callback CPU mode, which prevents
such CPUs' callbacks from being invoked in
softirq context. Invocation of such CPUs' RCU
callbacks will instead be offloaded to "rcuox/N"
kthreads created for that purpose, where "x" is
"p" for RCU-preempt, "s" for RCU-sched, and "g"
for the kthreads that mediate grace periods; and
"N" is the CPU number. This reduces OS jitter on
the offloaded CPUs, which can be useful for HPC
and real-time workloads. It can also improve
energy efficiency for asymmetric multiprocessors.
If a cpulist is passed as an argument, the specified
list of CPUs is set to no-callback mode from boot.
Otherwise, if the '=' sign and the cpulist
arguments are omitted, no CPU will be set to
no-callback mode from boot but the mode may be
toggled at runtime via cpusets.
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
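A simplified userspace sketch of reading a cpu list such as rcu_nocbs=0-3,8,10-11 into a CPU mask follows. It handles only comma-separated CPU numbers and N-M ranges for CPUs 0-63. Any extended cpu-list forms described earlier in this file are not handled. parse_cpulist() is a hypothetical name, not the kernel's parser.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "a,b-c,..." into *mask (bit N set means CPU N is listed).
 * Returns 0 on success, -1 on syntax errors, reversed ranges, or
 * CPU numbers >= 64.
 */
int parse_cpulist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s) {
		char *end;
		unsigned long lo = strtoul(s, &end, 10);
		unsigned long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {	/* range "lo-hi" */
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		if (hi >= 64)
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			m |= UINT64_C(1) << cpu;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	*mask = m;
	return 0;
}
```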
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
(specified by rcu_nocbs= above) explicitly
awaken the corresponding "rcuoN" kthreads,
make these kthreads poll for callbacks.
This improves the real-time response for the
offloaded CPUs by relieving them of the need to
wake up the corresponding kthread, but degrades
energy efficiency by requiring that the kthreads
periodically wake up to do the polling.
rcutree.blimit= [KNL]
Set maximum number of finished RCU callbacks to
process in one batch.
rcutree.dump_tree= [KNL]
Dump the structure of the rcu_node combining tree
out at early boot. This is used for diagnostic
purposes, to verify correct tree setup.
rcutree.gp_cleanup_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period cleanup.
rcutree.gp_init_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period initialization.
rcutree.gp_preinit_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period pre-initialization, that is,
the propagation of recent CPU-hotplug changes up
the rcu_node combining tree.
rcutree.use_softirq= [KNL]
If set to zero, move all RCU_SOFTIRQ processing to
per-CPU rcuc kthreads. Defaults to a non-zero
value, meaning that RCU_SOFTIRQ is used by default.
Specify rcutree.use_softirq=0 to use rcuc kthreads.
But note that CONFIG_PREEMPT_RT=y kernels disable
this kernel boot parameter, forcibly setting it
to zero.
rcutree.rcu_fanout_exact= [KNL]
Disable autobalancing of the rcu_node combining
tree. This is used by rcutorture, and might
possibly be useful for architectures having high
cache-to-cache transfer latencies.
rcutree.rcu_fanout_leaf= [KNL]
Change the number of CPUs assigned to each
leaf rcu_node structure. Useful for very
large systems, which will choose the value 64,
and for NUMA systems with large remote-access
latencies, which will choose a value aligned
with the appropriate hardware boundaries.
rcutree.rcu_min_cached_objs= [KNL]
Minimum number of objects which are cached and
maintained per CPU. Object size is equal
to PAGE_SIZE. The cache reduces pressure on the
page allocator and also makes the whole algorithm
behave better in low-memory conditions.
rcutree.rcu_delay_page_cache_fill_msec= [KNL]
Set the page-cache refill delay (in milliseconds)
in response to low-memory conditions. Permitted
values are in the range 0:100000.
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
Units are jiffies, minimum value is zero,
and maximum value is HZ.
rcutree.jiffies_till_next_fqs= [KNL]
Set delay between subsequent attempts to force
quiescent states. Units are jiffies, minimum
value is one, and maximum value is HZ.
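The documented bounds for the two force-quiescent-state delays are 0..HZ for the first attempt and 1..HZ for later attempts. The sketch below writes them down as C. HZ is set to 250 purely as an assumed value, since the real value depends on the kernel configuration. The entries do not say whether out-of-range values are clamped or rejected, so clamping here is an assumption, and the helper names are hypothetical.

```c
/* Assumed HZ for this sketch only; the real value is config-dependent. */
#define HZ 250

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* rcutree.jiffies_till_first_fqs: minimum 0, maximum HZ. */
unsigned long sanitize_first_fqs(unsigned long j)
{
	return clamp_ul(j, 0, HZ);
}

/* rcutree.jiffies_till_next_fqs: minimum 1, maximum HZ. */
unsigned long sanitize_next_fqs(unsigned long j)
{
	return clamp_ul(j, 1, HZ);
}
```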
rcutree.jiffies_till_sched_qs= [KNL]
Set required age in jiffies for a
given grace period before RCU starts
soliciting quiescent-state help from
rcu_note_context_switch() and cond_resched().
If not specified, the kernel will calculate
a value based on the most recent settings
of rcutree.jiffies_till_first_fqs
and rcutree.jiffies_till_next_fqs.
This calculated value may be viewed in
rcutree.jiffies_to_sched_qs. Any attempt to set
rcutree.jiffies_to_sched_qs will be cheerfully
overwritten.
rcutree.kthread_prio= [KNL,BOOT]
Set the SCHED_FIFO priority of the RCU per-CPU
kthreads (rcuc/N). This value is also used for
the priority of the RCU boost threads (rcub/N)
and for the RCU grace-period kthreads (rcu_bh,
rcu_preempt, and rcu_sched). If RCU_BOOST is
set, valid values are 1-99 and the default is 1
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
When RCU_NOCB_CPU is set, also adjust the
priority of NOCB callback kthreads.
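The sketch below encodes the kthread_prio rules stated above: the valid range is 1-99 with default 1 when RCU_BOOST is set, and 0-99 with default 0 otherwise. Both helper names are hypothetical. Clamping an out-of-range request into the valid range, rather than rejecting it, is an assumption of this sketch.

```c
#include <stdbool.h>

/* Default priority per the rules above. */
int default_kthread_prio(bool rcu_boost)
{
	return rcu_boost ? 1 : 0;
}

/* Bring a requested priority into the documented valid range. */
int sanitize_kthread_prio(int prio, bool rcu_boost)
{
	int min = rcu_boost ? 1 : 0;

	if (prio < min)
		return min;
	if (prio > 99)
		return 99;
	return prio;
}
```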
rcutree.rcu_divisor= [KNL]
Set the shift-right count to use to compute
the callback-invocation batch limit bl from
the number of callbacks queued on this CPU.
The result will be bounded below by the value of
the rcutree.blimit kernel parameter. Every bl
callbacks, the softirq handler will exit in
order to allow the CPU to do other work.
Please note that this callback-invocation batch
limit applies only to non-offloaded callback
invocation. Offloaded callbacks are instead
invoked in the context of an rcuoc kthread, which
the scheduler will preempt as it does any other task.
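The batch-limit rule described above can be written as bl = max(blimit, pending >> rcu_divisor). The sketch below is a hypothetical helper, not the kernel's code. Clamping rcu_divisor into a valid shift range is this sketch's own choice.

```c
/*
 * Callback-invocation batch limit: queue length shifted right by
 * rcu_divisor, bounded below by rcutree.blimit.
 */
long rcu_batch_limit(long pending, int rcu_divisor, long blimit)
{
	const int max_shift = (int)(sizeof(long) * 8 - 2);
	long bl;

	/* Keep the shift count well-defined (sketch's own safeguard). */
	if (rcu_divisor < 0)
		rcu_divisor = 0;
	if (rcu_divisor > max_shift)
		rcu_divisor = max_shift;
	bl = pending >> rcu_divisor;
	return bl > blimit ? bl : blimit;
}
```

With 10000 callbacks queued, rcu_divisor=7 and blimit=10, this gives bl = 10000 >> 7 = 78. With only 100 queued, the shift yields 0 and the blimit floor of 10 applies.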
rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
On callback-offloaded (rcu_nocbs) CPUs,
RCU reduces the lock contention that would
otherwise be caused by callback floods through
use of the ->nocb_bypass list. However, in the
common non-flooded case, RCU queues directly to
the main ->cblist in order to avoid the extra
overhead of the ->nocb_bypass list and its lock.
But if there are too many callbacks queued during
a single jiffy, RCU pre-queues the callbacks into
the ->nocb_bypass queue. The definition of "too
many" is supplied by this kernel boot parameter.
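The per-jiffy decision can be pictured with the following simplified model. It is not the kernel's implementation. struct nocb_state, its fields, and use_bypass() are hypothetical, and treating "too many" as strictly more than the limit is an assumption.

```c
#include <stdbool.h>

/* Hypothetical per-CPU bookkeeping; names are illustrative only. */
struct nocb_state {
	unsigned long cur_jiffy;	/* jiffy the counter applies to */
	long ncbs_this_jiffy;		/* callbacks queued so far in it */
};

/*
 * Return true if a callback arriving at time 'now' (in jiffies) should
 * be pre-queued on the bypass list instead of the main ->cblist.
 */
bool use_bypass(struct nocb_state *s, unsigned long now, long lim_per_jiffy)
{
	if (now != s->cur_jiffy) {	/* new jiffy: restart the count */
		s->cur_jiffy = now;
		s->ncbs_this_jiffy = 0;
	}
	s->ncbs_this_jiffy++;
	return s->ncbs_this_jiffy > lim_per_jiffy;	/* assumed strict */
}
```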
2019-04-02 08:05:55 -07:00
rcutree.rcu_nocb_gp_stride= [KNL]
Set the number of NOCB callback kthreads in
each group, which defaults to the square root
of the number of CPUs. Larger numbers reduce
the wakeup overhead on the global grace-period
kthread, but increase that same overhead on
each group's NOCB grace-period kthread.
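The grouping arithmetic can be sketched as follows. The sketch assumes two things: that the default stride is the integer square root of the CPU count, and that consecutively numbered CPUs share a group. isqrt(), default_gp_stride(), nocb_group_of(), and nocb_ngroups() are hypothetical helpers. On a 4096-CPU system this gives a stride of 64 and 64 groups. On an 80-CPU system it gives a stride of 8 and 10 groups.

```c
/* Integer square root by simple search; adequate for CPU counts. */
int isqrt(int n)
{
	int r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Assumed default stride: the integer square root of the CPU count. */
int default_gp_stride(int ncpus)
{
	int s = isqrt(ncpus);

	return s > 0 ? s : 1;
}

/* Group of a CPU, assuming CPUs 0..stride-1 form group 0, and so on. */
int nocb_group_of(int cpu, int stride)
{
	return cpu / stride;
}

/* Number of groups needed to cover ncpus CPUs. */
int nocb_ngroups(int ncpus, int stride)
{
	return (ncpus + stride - 1) / stride;
}
```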
2014-06-24 09:26:11 -07:00
2013-10-08 20:23:47 -07:00
rcutree.qhimark= [KNL]
2013-10-27 09:44:03 -07:00
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
2006-03-07 21:55:33 -08:00
2013-10-08 20:23:47 -07:00
rcutree.qlowmark= [KNL]
2008-02-03 15:20:26 +02:00
Set threshold of queued RCU callbacks below which
batch limiting is re-enabled.
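Taken together, rcutree.qhimark and rcutree.qlowmark form a hysteresis band. The following minimal sketch illustrates that band. The function name is hypothetical, and strict comparisons at both marks are an assumption.

```c
#include <stdbool.h>

/*
 * Hypothetical hysteresis model: 'limited' is whether batch limiting
 * is currently in effect; the return value is the new state.
 */
bool batch_limiting_enabled(bool limited, long queued,
			    long qhimark, long qlowmark)
{
	if (limited && queued > qhimark)
		return false;	/* flood: stop limiting batches */
	if (!limited && queued < qlowmark)
		return true;	/* drained: resume limiting */
	return limited;		/* between the marks: keep state */
}
```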
2006-03-07 21:55:33 -08:00
2019-10-30 11:56:10 -07:00
rcutree.qovld= [KNL]
Set threshold of queued RCU callbacks beyond which
RCU's force-quiescent-state scan will aggressively
enlist help from cond_resched() and sched IPIs to
help CPUs more quickly reach quiescent states.
Set to less than zero to have this threshold
computed from rcutree.qhimark at boot time, or
to zero to disable the more aggressive help
enlistment.
2017-01-06 15:14:11 -08:00
rcutree.rcu_kick_kthreads= [KNL]
Cause the grace-period kthread to get an extra
wake_up() if it sleeps three times longer than
it should at force-quiescent-state time.
This wake_up() will be accompanied by a
WARN_ONCE() splat and an ftrace_dump().
2020-08-07 13:44:10 -07:00
rcutree.rcu_unlock_delay= [KNL]
In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels,
this specifies an rcu_read_unlock()-time delay
in microseconds. This defaults to zero.
Larger delays increase the probability of
catching RCU pointer leaks, that is, buggy use
of RCU-protected pointers after the relevant
rcu_read_unlock() has completed.
2018-12-12 12:32:06 -08:00
rcutree.sysrq_rcu= [KNL]
Commandeer a sysrq key to dump out Tree RCU's
rcu_node tree with an eye towards determining
why a new grace period has not yet started.
2020-08-11 21:18:12 -07:00
rcuscale.gp_async= [KNL]
2017-04-17 12:47:10 -07:00
Measure performance of asynchronous
grace-period primitives such as call_rcu().
2020-08-11 21:18:12 -07:00
rcuscale.gp_async_max= [KNL]
2017-04-17 12:47:10 -07:00
Specify the maximum number of outstanding
callbacks per writer thread. When a writer
thread exceeds this limit, it invokes the
corresponding flavor of rcu_barrier() to allow
previously posted callbacks to drain.
2020-08-11 21:18:12 -07:00
rcuscale.gp_exp= [KNL]
2016-01-01 13:47:19 -08:00
Measure performance of expedited synchronous
grace-period primitives.
2020-08-11 21:18:12 -07:00
rcuscale.holdoff= [KNL]
2016-01-30 20:56:38 -08:00
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
test until boot completes in order to avoid
interference.
2020-08-11 21:18:12 -07:00
rcuscale.kfree_rcu_test= [KNL]
2019-08-30 12:36:29 -04:00
Set to measure performance of kfree_rcu() flooding.
2021-02-17 19:51:10 +01:00
rcuscale.kfree_rcu_test_double= [KNL]
Test the double-argument variant of kfree_rcu().
If this parameter has the same value as
rcuscale.kfree_rcu_test_single, both the single-
and double-argument variants are tested.
rcuscale.kfree_rcu_test_single= [KNL]
Test the single-argument variant of kfree_rcu().
If this parameter has the same value as
rcuscale.kfree_rcu_test_double, both the single-
and double-argument variants are tested.
2020-08-11 21:18:12 -07:00
rcuscale.kfree_nthreads= [KNL]
2019-08-30 12:36:29 -04:00
The number of threads running loops of kfree_rcu().
2020-08-11 21:18:12 -07:00
rcuscale.kfree_alloc_num= [KNL]
2019-08-30 12:36:29 -04:00
Number of allocations and frees done in an iteration.
2020-08-11 21:18:12 -07:00
rcuscale.kfree_loops= [KNL]
Number of loops doing rcuscale.kfree_alloc_num number
2019-08-30 12:36:29 -04:00
of allocations and frees.
2020-08-11 21:18:12 -07:00
rcuscale.nreaders= [KNL]
2016-01-01 13:47:19 -08:00
Set number of RCU readers. The value -1 selects
N, where N is the number of CPUs. A value
"n" less than -1 selects N+n+1, where N is again
the number of CPUs. For example, -2 selects N-1,
-3 selects N-2, and so on. A value of "n" less
than or equal to -N selects a single reader.
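Here is a minimal sketch of this mapping. It assumes the formula N+n+1 for negative n, which agrees with -1 selecting N and with values of n at or below -N selecting a single reader. rcuscale_nthreads() is a hypothetical name. According to rcuscale.nwriters, the same rule applies to writers.

```c
/*
 * Map an rcuscale.nreaders (or rcuscale.nwriters) value n to a thread
 * count, given ncpus online CPUs (N above). Assumes N+n+1 for n < 0.
 */
int rcuscale_nthreads(int n, int ncpus)
{
	int nr;

	if (n >= 0)
		return n;		/* non-negative values used as-is */
	nr = ncpus + 1 + n;		/* -1 -> N, -2 -> N-1, ... */
	return nr > 0 ? nr : 1;		/* n <= -N: a single thread */
}
```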
2020-08-11 21:18:12 -07:00
rcuscale.nwriters= [KNL]
2016-01-01 13:47:19 -08:00
Set number of RCU writers. The values operate
2020-08-11 21:18:12 -07:00
the same as for rcuscale.nreaders.
2020-08-11 21:18:12 -07:00
rcuscale.perf_type= [KNL]
2017-04-25 15:12:56 -07:00
Specify the RCU implementation to test.
2020-08-11 21:18:12 -07:00
rcuscale.shutdown= [KNL]
2016-01-01 13:47:19 -08:00
Shut the system down after performance tests
complete. This is useful for hands-off automated
testing.
2020-08-11 21:18:12 -07:00
rcuscale.verbose= [KNL]
2016-01-01 13:47:19 -08:00
Enable additional printk() statements.
2020-08-11 21:18:12 -07:00
rcuscale.writer_holdoff= [KNL]
2017-04-25 15:12:56 -07:00
Write-side holdoff between grace periods,
in microseconds. The default of zero says
no holdoff.
2013-10-08 20:23:47 -07:00
rcutorture.fqs_duration= [KNL]
2015-05-14 17:29:51 -07:00
Set duration of force_quiescent_state bursts
in microseconds.
2012-04-23 10:54:45 -07:00
2013-10-08 20:23:47 -07:00
rcutorture.fqs_holdoff= [KNL]
2015-05-14 17:29:51 -07:00
Set holdoff time within force_quiescent_state bursts
in microseconds.
2012-04-23 10:54:45 -07:00
2013-10-08 20:23:47 -07:00
rcutorture.fqs_stutter= [KNL]
2015-05-14 17:29:51 -07:00
Set wait time between force_quiescent_state bursts
in seconds.
2018-10-01 08:38:54 -07:00
rcutorture.fwd_progress= [KNL]
2021-11-22 20:55:18 -08:00
Specifies the number of kthreads to be used
for RCU grace-period forward-progress testing
2018-10-01 08:38:54 -07:00
for the types of RCU supporting this notion.
2021-11-22 20:55:18 -08:00
Defaults to 1 kthread. Values less than zero or
greater than the number of CPUs cause the number
of CPUs to be used.
2018-10-01 08:38:54 -07:00
rcutorture.fwd_progress_div= [KNL]
Specify the fraction of a CPU-stall-warning
period to do tight-loop forward-progress testing.
rcutorture.fwd_progress_holdoff= [KNL]
Number of seconds to wait between successive
forward-progress tests.
rcutorture.fwd_progress_need_resched= [KNL]
Enclose cond_resched() calls within checks for
need_resched() during tight-loop forward-progress
testing.
2015-05-14 17:29:51 -07:00
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
primitives, if available.
2012-04-23 10:54:45 -07:00
2013-10-08 20:23:47 -07:00
rcutorture.gp_exp= [KNL]
2015-05-14 17:29:51 -07:00
Use expedited update-side primitives, if available.
2013-10-08 20:23:47 -07:00
rcutorture.gp_normal= [KNL]
2015-05-14 17:29:51 -07:00
Use normal (non-expedited) asynchronous
update-side primitives, if available.
rcutorture.gp_sync= [KNL]
Use normal (non-expedited) synchronous
update-side primitives, if available. If all
of rcutorture.gp_cond=, rcutorture.gp_exp=,
rcutorture.gp_normal=, and rcutorture.gp_sync=
are zero, rcutorture acts as if
they are all non-zero.
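The all-zero fallback for the four rcutorture.gp_* parameters is sketched below. struct gp_flags and effective_gp_flags() are hypothetical. The sketch ignores whether a given primitive is actually available for the RCU flavor under test.

```c
#include <stdbool.h>

/* Hypothetical set of update-side primitive classes to exercise. */
struct gp_flags {
	bool cond, exp, normal, sync;
};

/* If every flag is zero, behave as though all of them were set. */
struct gp_flags effective_gp_flags(struct gp_flags f)
{
	if (!f.cond && !f.exp && !f.normal && !f.sync)
		f.cond = f.exp = f.normal = f.sync = true;
	return f;
}
```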
2012-04-23 10:54:45 -07:00
2020-08-11 10:33:39 -07:00
rcutorture.irqreader= [KNL]
Run RCU readers from irq handlers, or, more
accurately, from a timer handler. Not all RCU
flavors take kindly to this sort of thing.
rcutorture.leakpointer= [KNL]
Leak an RCU-protected pointer out of the reader.
This can of course result in splats, and is
intended to test the ability of things like
CONFIG_RCU_STRICT_GRACE_PERIOD=y to detect
such leaks.
2013-10-08 20:23:47 -07:00
rcutorture.n_barrier_cbs= [KNL]
2012-04-23 10:54:45 -07:00
Set callbacks/threads for rcu_barrier() testing.
2013-10-08 20:23:47 -07:00
rcutorture.nfakewriters= [KNL]
2012-04-23 10:54:45 -07:00
Set number of concurrent RCU writers. These just
stress RCU; they don't participate in the actual
test, hence the "fake".
2020-09-23 17:39:46 -07:00
rcutorture.nocbs_nthreads= [KNL]
Set number of RCU callback-offload togglers.
Zero (the default) disables toggling.
rcutorture.nocbs_toggle= [KNL]
Set the delay in milliseconds between successive
callback-offload toggling attempts.
2013-10-08 20:23:47 -07:00
rcutorture.nreaders= [KNL]
2015-03-12 13:55:48 -07:00
Set number of RCU readers. The value -1 selects
N-1, where N is the number of CPUs. A value
"n" less than -1 selects N-n-2, where N is again
the number of CPUs. For example, -2 selects N
(the number of CPUs), -3 selects N+1, and so on.
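For negative n, the rcutorture.nreaders mapping above reduces to N-2-n. The sketch below uses a hypothetical function name. The floor of one reader, which only matters on single-CPU systems, is an assumption not stated above.

```c
/*
 * Map an rcutorture.nreaders value n to a reader count for ncpus
 * online CPUs (N above), using N-2-n for negative n.
 */
int rcutorture_nreaders(int n, int ncpus)
{
	int nr;

	if (n >= 0)
		return n;		/* non-negative values used as-is */
	nr = ncpus - 2 - n;		/* -1 -> N-1, -2 -> N, -3 -> N+1 */
	return nr > 0 ? nr : 1;		/* assumed floor of one reader */
}
```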
2012-04-23 10:54:45 -07:00
2013-10-08 20:23:47 -07:00
rcutorture.object_debug= [KNL]
Enable debug-object double-call_rcu() testing.
rcutorture.onoff_holdoff= [KNL]
2012-04-23 10:54:45 -07:00
Set time (s) after boot for CPU-hotplug testing.
2013-10-08 20:23:47 -07:00
rcutorture.onoff_interval= [KNL]
2018-05-08 09:20:34 -07:00
Set time (jiffies) between CPU-hotplug operations,
or zero to disable CPU-hotplug testing.
2012-04-23 10:54:45 -07:00
2020-04-24 11:21:40 -07:00
rcutorture.read_exit= [KNL]
Set the number of read-then-exit kthreads used
to test the interaction of RCU updaters and
task-exit processing.
rcutorture.read_exit_burst= [KNL]
The number of times in a given read-then-exit
episode that a set of read-then-exit kthreads
is spawned.
rcutorture.read_exit_delay= [KNL]
The delay, in seconds, between successive
read-then-exit testing episodes.
2013-10-08 20:23:47 -07:00
rcutorture.shuffle_interval= [KNL]
2012-04-23 10:54:45 -07:00
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
during the rcutorture test.
2013-10-08 20:23:47 -07:00
rcutorture.shutdown_secs= [KNL]
2012-04-23 10:54:45 -07:00
Set time (s) after boot for system shutdown. This
is useful for hands-off automated testing.
2013-10-08 20:23:47 -07:00
rcutorture.stall_cpu= [KNL]
2012-04-23 10:54:45 -07:00
Duration of CPU stall (s) to test RCU CPU stall
warnings, zero to disable.
2020-03-11 17:39:12 -07:00
rcutorture.stall_cpu_block= [KNL]
Sleep while stalling if set. This will result
in warnings from preemptible RCU in addition
to any other stall-related activity.
2013-10-08 20:23:47 -07:00
rcutorture.stall_cpu_holdoff= [KNL]
2012-04-23 10:54:45 -07:00
Time to wait (s) after boot before inducing stall.
2017-08-18 16:11:37 -07:00
rcutorture.stall_cpu_irqsoff= [KNL]
Disable interrupts while stalling if set.
2020-04-01 19:57:52 -07:00
rcutorture.stall_gp_kthread= [KNL]
Duration (s) of forced sleep within RCU
grace-period kthread to test RCU CPU stall
warnings, zero to disable. If both stall_cpu
and stall_gp_kthread are specified, the
kthread is starved first, then the CPU.
2013-10-08 20:23:47 -07:00
rcutorture.stat_interval= [KNL]
2012-04-23 10:54:45 -07:00
Time (s) between statistics printk()s.
2013-10-08 20:23:47 -07:00
rcutorture.stutter= [KNL]
2012-04-23 10:54:45 -07:00
Time (s) to stutter testing. For example, specifying
five seconds causes the test to run for five seconds,
wait for five seconds, and so on. This tests RCU's
ability to transition abruptly to and from idle.
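Here is an idealized sketch of the stutter schedule. It assumes equal run and wait phases of rcutorture.stutter seconds each, starting with a run phase. stutter_running() is hypothetical, and real phase boundaries will not be this exact.

```c
#include <stdbool.h>

/*
 * Idealized model: with rcutorture.stutter=S (S > 0), the test runs
 * for S seconds, waits for S seconds, and so on. Returns true if the
 * test would be running 'elapsed' seconds after it starts.
 */
bool stutter_running(unsigned long elapsed, unsigned long stutter)
{
	return (elapsed / stutter) % 2 == 0;	/* even phases run */
}
```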
2013-10-08 20:23:47 -07:00
rcutorture.test_boost= [KNL]
2012-04-23 10:54:45 -07:00
Test RCU priority boosting? 0=no, 1=maybe, 2=yes.
"Maybe" means test if the RCU implementation
under test supports RCU priority boosting.
2013-10-08 20:23:47 -07:00
rcutorture.test_boost_duration= [KNL]
2012-04-23 10:54:45 -07:00
Duration (s) of each individual boost test.
2013-10-08 20:23:47 -07:00
rcutorture.test_boost_interval= [KNL]
2012-04-23 10:54:45 -07:00
Interval (s) between each boost test.
2013-10-08 20:23:47 -07:00
rcutorture.test_no_idle_hz= [KNL]
2012-04-23 10:54:45 -07:00
Test RCU's dyntick-idle handling. See also the
rcutorture.shuffle_interval parameter.
2013-10-08 20:23:47 -07:00
rcutorture.torture_type= [KNL]
2012-04-23 10:54:45 -07:00
Specify the RCU implementation to test.
2013-10-08 20:23:47 -07:00
rcutorture.verbose= [KNL]
2012-04-23 10:54:45 -07:00
Enable additional printk() statements.
2019-06-13 15:30:49 -07:00
rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
Dump ftrace buffer after reporting RCU CPU
stall warning.
2015-11-24 15:44:06 -08:00
rcupdate.rcu_cpu_stall_suppress= [KNL]
Suppress RCU CPU stall warning messages.
2019-12-05 11:29:01 -08:00
rcupdate.rcu_cpu_stall_suppress_at_boot= [KNL]
Suppress RCU CPU stall warning messages and
rcutorture writer stall warnings that occur
during early boot, that is, during the time
before the init task is spawned.
2015-11-24 15:44:06 -08:00
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.
2022-02-16 14:52:09 +01:00
The value is in seconds and the maximum allowed
value is 300 seconds.
rcupdate.rcu_exp_cpu_stall_timeout= [KNL]
Set timeout for expedited RCU CPU stall warning
messages. The value is in milliseconds
and the maximum allowed value is 21000
milliseconds. Please note that this value is
adjusted to an arch timer tick resolution.
Setting this to zero causes the value from
rcupdate.rcu_cpu_stall_timeout to be used (after
conversion from seconds to milliseconds).
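The fallback from rcupdate.rcu_exp_cpu_stall_timeout to rcupdate.rcu_cpu_stall_timeout can be sketched as follows. exp_stall_timeout_ms() is hypothetical. The sketch does not model rounding to the timer-tick resolution. Clamping explicit values to the documented 21000 ms maximum is an assumption about how out-of-range values are treated.

```c
/*
 * Effective expedited stall timeout in milliseconds. exp_ms is
 * rcupdate.rcu_exp_cpu_stall_timeout (ms); normal_s is
 * rcupdate.rcu_cpu_stall_timeout (s). Tick rounding is ignored.
 */
int exp_stall_timeout_ms(int exp_ms, int normal_s)
{
	if (exp_ms == 0)
		return normal_s * 1000;		/* zero: reuse normal timeout */
	return exp_ms > 21000 ? 21000 : exp_ms;	/* assumed clamp to max */
}
```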
2015-11-24 15:44:06 -08:00
2022-11-19 17:25:06 +08:00
rcupdate.rcu_cpu_stall_cputime= [KNL]
Provide statistics on the cputime and count of
interrupts and tasks during the sampling period. For
multiple continuous RCU stalls, all sampling periods
begin at half of the first RCU stall timeout.
2022-12-19 18:02:20 -08:00
rcupdate.rcu_exp_stall_task_details= [KNL]
Print stack dumps of any tasks blocking the
current expedited RCU grace period during an
expedited RCU CPU stall warning.
2013-10-08 20:23:47 -07:00
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
example, synchronize_rcu_expedited() instead
of synchronize_rcu(). This reduces latency,
but can increase CPU utilization, degrade
real-time latency, and degrade energy efficiency.
2015-12-07 13:09:52 -08:00
No effect on CONFIG_TINY_RCU kernels.
2013-10-08 20:23:47 -07:00
2015-11-24 15:44:06 -08:00
rcupdate.rcu_normal= [KNL]
Use only normal grace-period primitives,
for example, synchronize_rcu() instead of
synchronize_rcu_expedited(). This improves
2015-12-07 13:09:52 -08:00
real-time latency, CPU utilization, and
energy efficiency, but can expose users to
increased grace-period latency. This parameter
overrides rcupdate.rcu_expedited. No effect on
CONFIG_TINY_RCU kernels.
2013-10-08 20:23:47 -07:00
2015-11-25 18:56:00 -08:00
rcupdate.rcu_normal_after_boot= [KNL]
Once boot has completed (that is, after
rcu_end_inkernel_boot() has been invoked), use
2015-12-07 13:09:52 -08:00
only normal grace-period primitives. No effect
on CONFIG_TINY_RCU kernels.
2015-11-25 18:56:00 -08:00
2020-12-15 15:16:47 +01:00
But note that CONFIG_PREEMPT_RT=y kernels enable
this kernel boot parameter, forcibly setting
it to the value one, that is, converting any
post-boot attempt at an expedited RCU grace
period to instead use normal non-expedited
grace-period processing.
2021-11-29 16:52:31 -08:00
rcupdate.rcu_task_collapse_lim= [KNL]
Set the maximum number of callbacks present
at the beginning of a grace period that allows
the RCU Tasks flavors to collapse back to using
a single callback queue. This switching only
occurs when rcupdate.rcu_task_enqueue_lim is
set to the default value of -1.
2021-11-24 15:12:15 -08:00
rcupdate.rcu_task_contend_lim= [KNL]
Set the minimum number of callback-queuing-time
lock-contention events per jiffy required to
cause the RCU Tasks flavors to switch to per-CPU
callback queuing. This switching only occurs
when rcupdate.rcu_task_enqueue_lim is set to
the default value of -1.
2021-11-12 07:33:40 -08:00
rcupdate.rcu_task_enqueue_lim= [KNL]
Set the number of callback queues to use for the
RCU Tasks family of RCU flavors. The default
of -1 allows this to be automatically (and
dynamically) adjusted. This parameter is intended
for use in testing.
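This sketch shows how the three rcupdate.rcu_task_*_lim parameters described above interact. It is a hypothetical, simplified model: struct rcu_tasks_model and its functions are illustrative names, and the values in the usage are arbitrary, not kernel defaults. The kernel may collapse back to a single queue in stages, whereas this model switches immediately.

```c
#include <stdbool.h>

/* Hypothetical model state; field names are illustrative only. */
struct rcu_tasks_model {
	int enqueue_lim;	/* rcupdate.rcu_task_enqueue_lim; -1 = automatic */
	int contend_lim;	/* rcupdate.rcu_task_contend_lim */
	int collapse_lim;	/* rcupdate.rcu_task_collapse_lim */
	bool per_cpu;		/* currently using per-CPU callback queues? */
};

/* Called once per jiffy with that jiffy's lock-contention count. */
void on_contention(struct rcu_tasks_model *m, int events_this_jiffy)
{
	if (m->enqueue_lim == -1 && events_this_jiffy >= m->contend_lim)
		m->per_cpu = true;	/* enough contention: go per-CPU */
}

/* Called at the start of a grace period with the callback count. */
void on_gp_start(struct rcu_tasks_model *m, long ncbs)
{
	if (m->enqueue_lim == -1 && m->per_cpu && ncbs <= m->collapse_lim)
		m->per_cpu = false;	/* few callbacks: single queue */
}
```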
2020-03-17 11:39:26 -07:00
rcupdate.rcu_task_ipi_delay= [KNL]
Set time in jiffies during which RCU tasks will
avoid sending IPIs, starting with the beginning
of a given grace period. Setting a large
number avoids disturbing real-time workloads,
but lengthens grace periods.
2022-02-25 16:01:12 -08:00
rcupdate.rcu_task_stall_info= [KNL]
Set initial timeout in jiffies for RCU task stall
informational messages, which give some indication
of the problem for those not patient enough to
wait for ten minutes. Informational messages are
only printed prior to the stall-warning message
for a given grace period. Disable with a value
less than or equal to zero. Defaults to ten
seconds. A change in value does not take effect
until the beginning of the next grace period.
rcupdate.rcu_task_stall_info_mult= [KNL]
Multiplier for time interval between successive
RCU task stall informational messages for a given
RCU tasks grace period. This value is clamped
to one through ten, inclusive. It defaults to
the value three, so that the first informational
message is printed 10 seconds into the grace
period, the second at 40 seconds, the third at
160 seconds, and then the stall warning at 600
seconds would prevent a fourth at 640 seconds.
2014-07-01 18:16:30 -07:00
rcupdate.rcu_task_stall_timeout= [KNL]
2022-02-25 16:01:12 -08:00
Set timeout in jiffies for RCU task stall
warning messages. Disable with a value less
than or equal to zero. Defaults to ten minutes.
A change in value does not take effect until
the beginning of the next grace period.
2014-07-01 18:16:30 -07:00
2014-09-19 11:34:09 -04:00
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
2005-09-06 15:17:19 -07:00
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
2019-08-19 15:52:35 +00:00
rdrand= [X86]
force - Override the decision by the kernel to hide the
advertisement of RDRAND support (this affects
certain AMD processors because of buggy BIOS
support, specifically around the suspend/resume
path).
2017-08-24 09:26:51 -07:00
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
2017-12-20 14:57:24 -08:00
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
2023-01-13 09:20:31 -06:00
mba, smba, bmec.
2017-08-24 09:26:51 -07:00
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
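The following user-space sketch shows how a list such as "cmt,!mba" can be split into force-on and force-off requests. parse_rdt() and its output arrays are hypothetical, the kernel's own parser may differ, and unknown names are silently ignored here.

```c
#include <stdbool.h>
#include <string.h>

/* Feature names accepted by rdt=, as listed above. */
static const char *const rdt_features[] = {
	"cmt", "mbmtotal", "mbmlocal", "l3cat", "l3cdp",
	"l2cat", "l2cdp", "mba", "smba", "bmec",
};
#define RDT_NFEATURES (sizeof(rdt_features) / sizeof(rdt_features[0]))

/*
 * Parse a comma-separated rdt= list into force_on[i] / force_off[i]
 * for rdt_features[i]. A leading '!' requests "off". 'str' is
 * modified in place by strtok().
 */
void parse_rdt(char *str, bool force_on[], bool force_off[])
{
	for (char *tok = strtok(str, ","); tok; tok = strtok(NULL, ",")) {
		bool off = (*tok == '!');

		if (off)
			tok++;		/* skip the negation prefix */
		for (size_t i = 0; i < RDT_NFEATURES; i++) {
			if (strcmp(tok, rdt_features[i]) != 0)
				continue;
			if (off)
				force_off[i] = true;
			else
				force_on[i] = true;
		}
	}
}
```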
2013-07-08 16:01:42 -07:00
reboot= [KNL]
Format (x86 or x86_64):
2021-05-30 12:24:46 -04:00
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] | d[efault] \
2013-07-08 16:01:42 -07:00
[[,]s[mp]#### \
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
[[,]f[orce]
2019-05-14 15:45:37 -07:00
Where reboot_mode is one of warm (soft) or cold (hard) or gpio
(prefix with 'panic_' to set mode for panic
reboot only),
2013-07-08 16:01:42 -07:00
reboot_type is one of bios, acpi, kbd, triple, efi, or pci,
reboot_force is either force or not specified,
reboot_cpu is s[mp]#### with #### being the processor
to be used for rebooting.
2005-04-16 15:20:36 -07:00
2020-06-17 11:53:53 -07:00
refscale.holdoff= [KNL]
2020-05-29 14:24:03 -07:00
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
test until boot completes in order to avoid
interference.
2020-06-17 11:53:53 -07:00
refscale.loops= [KNL]
2020-05-29 14:24:03 -07:00
Set the number of loops over the synchronization
primitive under test. Increasing this number
reduces noise due to loop start/end overhead,
but the default has already reduced the per-pass
noise to a handful of picoseconds on ca. 2020
x86 laptops.
2020-06-17 11:53:53 -07:00
refscale.nreaders= [KNL]
2020-05-29 14:24:03 -07:00
Set number of readers. The default value of -1
selects N, where N is roughly 75% of the number
of CPUs. A value of zero is an interesting choice.
2020-06-17 11:53:53 -07:00
refscale.nruns= [KNL]
2020-05-29 14:24:03 -07:00
Set number of runs, each of which is dumped onto
the console log.
2020-06-17 11:53:53 -07:00
refscale.readdelay= [KNL]
2020-05-29 14:24:03 -07:00
Set the read-side critical-section duration,
measured in microseconds.
2020-06-17 11:53:53 -07:00
refscale.scale_type= [KNL]
Specify the read-protection implementation to test.
refscale.shutdown= [KNL]
2020-05-29 14:24:03 -07:00
Shut down the system at the end of the performance
test. This defaults to 1 (shut it down) when
2020-08-11 21:18:12 -07:00
refscale is built into the kernel and to 0 (leave
it running) when refscale is built as a module.
2020-05-29 14:24:03 -07:00
2020-06-17 11:53:53 -07:00
refscale.verbose= [KNL]
2020-05-29 14:24:03 -07:00
Enable additional printk() statements.
2020-11-15 10:24:52 -08:00
refscale.verbose_batched= [KNL]
Batch the additional printk() statements. If zero
(the default) or negative, print everything. Otherwise,
print every Nth verbose statement, where N is the value
specified.
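The batching rule can be sketched as a simple predicate. verbose_should_print() is hypothetical. The sketch assumes that "every Nth" means the Nth, 2Nth, and later statements are printed, counting from one.

```c
#include <stdbool.h>

/*
 * Decide whether the k-th verbose statement (counting from 1) is
 * printed for a given refscale.verbose_batched value.
 */
bool verbose_should_print(long k, int batched)
{
	if (batched <= 0)
		return true;		/* zero or negative: print everything */
	return k % batched == 0;	/* otherwise every Nth statement */
}
```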
2008-07-04 10:00:09 -07:00
relax_domain_level=
[KNL, SMP] Set scheduler's default relax_domain_level.
2019-06-27 13:08:35 -03:00
See Documentation/admin-guide/cgroup-v1/cpusets.rst.
2008-07-04 10:00:09 -07:00
2017-12-01 11:50:33 -06:00
reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
Format: <base1>,<size1>[,<base2>,<size2>,...]
Reserve I/O ports or memory so the kernel won't use
them. If <base> is less than 0x10000, the region
is assumed to be I/O ports; otherwise it is memory.
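This user-space sketch walks a reserve= string and applies the I/O-port versus memory rule described above. reserve_region_kind() and print_reserve_regions() are hypothetical. The sketch assumes plain decimal or 0x-prefixed numbers; size suffixes and the kernel's actual parsing are not modeled.

```c
#include <stdio.h>
#include <stdlib.h>

/* Classify one region: bases below 0x10000 are I/O ports. */
const char *reserve_region_kind(unsigned long base)
{
	return base < 0x10000 ? "I/O ports" : "memory";
}

/*
 * Parse "<base1>,<size1>[,<base2>,<size2>,...]" and print each region.
 * Returns the number of complete base/size pairs found.
 */
int print_reserve_regions(const char *s)
{
	int n = 0;
	char *end;

	while (*s) {
		unsigned long base = strtoul(s, &end, 0);

		if (*end != ',')
			break;		/* a base needs a following size */
		unsigned long size = strtoul(end + 1, &end, 0);

		printf("%s: base=%#lx size=%#lx\n",
		       reserve_region_kind(base), base, size);
		n++;
		if (*end != ',')
			break;
		s = end + 1;
	}
	return n;
}
```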
2005-04-16 15:20:36 -07:00
2007-07-31 00:37:59 -07:00
reservetop= [X86-32]
2006-09-25 23:32:25 -07:00
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
address space.
2006-09-27 01:50:44 -07:00
reset_devices [KNL] Force drivers to reset the underlying device
during initialization.
2005-10-23 12:57:11 -07:00
resume= [SWSUSP]
Specify the partition device for software suspend
2012-05-14 21:45:31 +02:00
Format:
{/dev/<dev> | PARTUUID=<uuid> | <int>:<int> | <hex>}
2005-04-16 15:20:36 -07:00
2006-12-06 20:34:13 -08:00
resume_offset= [SWSUSP]
Specify the offset from the beginning of the partition
given by "resume=" at which the swap header is located,
in <PAGE_SIZE> units (needed only for swap files).
2019-06-13 07:10:36 -03:00
See Documentation/power/swsusp-and-swap-files.rst
2006-12-06 20:34:13 -08:00
2011-10-10 23:38:41 +02:00
resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to
read the resume files
2011-10-06 20:34:46 +02:00
resumewait [HIBERNATION] Wait (indefinitely) for resume device to show up.
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
2007-02-10 01:44:33 -08:00
retain_initrd [RAM] Keep initrd memory after extraction
2022-06-14 23:15:50 +02:00
retbleed= [X86] Control mitigation of RETBleed (Arbitrary
Speculative Code Execution with Return Instructions)
vulnerability.
2022-08-08 09:32:33 -05:00
AMD-based UNRET and IBPB mitigations alone do not stop
sibling threads from influencing the predictions of other
sibling threads. For that reason, STIBP is used on
processors that support it, and SMT is mitigated on
processors that don't.
2022-06-14 23:15:51 +02:00
off - no mitigation
auto - automatically select a mitigation
auto,nosmt - automatically select a mitigation,
disabling SMT if necessary for
the full mitigation (only on Zen1
and older without STIBP).
2022-08-08 09:32:33 -05:00
ibpb - On AMD, mitigate short speculation
windows on basic block boundaries too.
Safe, highest perf impact. It also
enables STIBP if present. Not suitable
on Intel.
ibpb,nosmt - Like "ibpb" above but will disable SMT
when STIBP is not available. This is
the alternative for systems which do not
have STIBP.
unret - Force enable untrained return thunks,
only effective on AMD f15h-f17h based
systems.
unret,nosmt - Like unret, but will disable SMT when STIBP
is not available. This is the alternative for
systems which do not have STIBP.
2022-06-14 23:15:50 +02:00
Selecting 'auto' will choose a mitigation method at run
time according to the CPU.
Not specifying this option is equivalent to retbleed=auto.
2015-01-09 20:24:55 +00:00
rfkill.default_state=
0 "airplane mode". All wifi, bluetooth, wimax, gps, fm,
etc. communication is blocked by default.
1 Unblocked.
rfkill.master_switch_mode=
0 The "airplane mode" button does nothing.
1 The "airplane mode" button toggles between everything
blocked and the previous configuration.
2 The "airplane mode" button toggles between everything
blocked and everything unblocked.
2005-04-16 15:20:36 -07:00
rhash_entries= [KNL,NET]
Set number of hash buckets for route cache
2017-01-20 14:22:36 +01:00
ring3mwait=disable
[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
CPUs.
2005-04-16 15:20:36 -07:00
ro [KNL] Mount root device read-only on boot
2016-02-17 14:41:13 -08:00
rodata= [KNL]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
full Mark read-only kernel memory and aliases as read-only
[arm64]
rockchip.usb_uart
Enable the UART passthrough on the designated USB port
on Rockchip SoCs. When active, the signals of the
debug UART get routed to the D+ and D- pins of the USB
port and the regular USB controller gets disabled.
root= [KNL] Root filesystem
See name_to_dev_t comment in init/do_mounts.c.
rootdelay= [KNL] Delay (in seconds) to pause before attempting to
mount the root filesystem
rootflags= [KNL] Set root filesystem mount option string
rootfstype= [KNL] Set root filesystem type
rootwait [KNL] Wait (indefinitely) for root device to show up.
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
rproc_mem=nn[KMG][@address]
[KNL,ARM,CMA] Remoteproc physical memory block.
Memory area to be used by remote processor image,
managed by CMA.
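A minimal parser for the nn[KMG][@address] syntax above, for example to validate a command line before deploying it. It is a sketch under stated assumptions: K, M and G are treated as binary multiples (as the kernel's memparse() does), only decimal or 0x-prefixed hexadecimal numbers are accepted (no octal), the address takes no suffix, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

_SUFFIX = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
_NUM = r"(0[xX][0-9a-fA-F]+|[0-9]+)"
# Groups: 1 = size, 2 = optional K/M/G suffix, 3 = optional @address.
_PATTERN = re.compile(_NUM + r"([KMGkmg]?)(?:@" + _NUM + r")?")


def _to_int(text: str) -> int:
    # Hexadecimal with a 0x prefix, otherwise decimal (assumption: no octal).
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)


def parse_rproc_mem(value: str) -> Tuple[int, Optional[int]]:
    """Parse nn[KMG][@address] into (size_in_bytes, base_address_or_None)."""
    m = _PATTERN.fullmatch(value)
    if m is None:
        raise ValueError(f"malformed rproc_mem= value: {value!r}")
    size = _to_int(m.group(1)) * _SUFFIX[m.group(2).lower()]
    base = _to_int(m.group(3)) if m.group(3) is not None else None
    return size, base
```

For example, "64M@0x80000000" yields a 64 MiB region starting at physical address 0x80000000.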
rw [KNL] Mount root device read-write on boot
S [KNL] Run init in single-user mode
s390_iommu= [HW,S390]
Set s390 IOTLB flushing mode
strict
With strict flushing every unmap operation will result in
an IOTLB flush. Default is lazy flushing before reuse,
which is faster.
s390_iommu_aperture= [KNL,S390]
Specifies the size of the per device DMA address space
accessible through the DMA and IOMMU APIs as a decimal
factor of the size of main memory.
The default is 1 meaning that one can concurrently use
as many DMA addresses as physical memory is installed,
if supported by hardware, and thus map all of memory
once. With a value of 2 one can map all of memory twice
and so on. As a special case a factor of 0 imposes no
restrictions other than those given by hardware at the
cost of significant additional memory use for tables.
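To make the factor concrete: with 4 GiB of main memory, the default factor of 1 gives each device a 4 GiB DMA address space and a factor of 2 gives 8 GiB. The sketch below assumes that `hw_limit` (a hypothetical parameter) is the largest aperture the hardware supports and that the result is capped at it; `dma_aperture_bytes` is likewise a hypothetical name.

```python
def dma_aperture_bytes(factor: int, mem_bytes: int, hw_limit: int) -> int:
    """Per-device DMA address space size for an s390_iommu_aperture factor."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    if factor == 0:
        # Special case: only the hardware limit applies.
        return hw_limit
    # Map all of memory `factor` times, but never beyond what hardware allows.
    return min(factor * mem_bytes, hw_limit)
```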
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
sched_verbose [KNL] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduler statistics.
Allowed values are enable and disable. This feature
incurs a small amount of overhead in the scheduler
but is useful for debugging and performance tuning.
sched_thermal_decay_shift=
[KNL, SMP] Set a decay shift for the scheduler thermal
pressure signal. The thermal pressure signal follows the
default decay period of the other scheduler PELT
signals (usually 32 ms, but configurable). Setting
sched_thermal_decay_shift will left-shift the decay
period for the thermal pressure signal by the shift
value.
For example, with the default PELT decay period of 32 ms:
    sched_thermal_decay_shift    thermal pressure decay period
    0                            32 ms
    1                            64 ms
    2                            128 ms
and so on.
Format: integer between 0 and 10
Default is 0.
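The values listed above are a plain left shift of the PELT decay period. The helper below reproduces them, taking the 32 ms default as a parameter since the period is configurable; the function name is hypothetical.

```python
def thermal_decay_period_ms(shift: int, pelt_period_ms: int = 32) -> int:
    """Thermal pressure decay period for a given sched_thermal_decay_shift."""
    if not 0 <= shift <= 10:
        raise ValueError("sched_thermal_decay_shift must be between 0 and 10")
    # A left shift by n multiplies the period by 2**n.
    return pelt_period_ms << shift
```

At the maximum shift of 10, the 32 ms period becomes 32768 ms, roughly 33 seconds.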
scftorture.holdoff= [KNL]
Number of seconds to hold off before starting
test. Defaults to zero for module insertion and
to 10 seconds for built-in smp_call_function()
tests.
scftorture.longwait= [KNL]
Request ridiculously long waits randomly selected
up to the chosen limit in seconds. Zero (the
default) disables this feature. Please note
that requesting even small non-zero numbers of
seconds can result in RCU CPU stall warnings,
softlockup complaints, and so on.
scftorture.nthreads= [KNL]
Number of kthreads to spawn to invoke the
smp_call_function() family of functions.
The default of -1 specifies a number of kthreads
equal to the number of CPUs.
scftorture.onoff_holdoff= [KNL]
Number of seconds to wait after the start of the
test before initiating CPU-hotplug operations.
scftorture.onoff_interval= [KNL]
Number of seconds to wait between successive
CPU-hotplug operations. Specifying zero (which
is the default) disables CPU-hotplug operations.
scftorture.shutdown_secs= [KNL]
The number of seconds following the start of the
test after which to shut down the system. The
default of zero avoids shutting down the system.
Non-zero values are useful for automated tests.
scftorture.stat_interval= [KNL]
The number of seconds between outputting the
current test statistics to the console. A value
of zero disables statistics output.
scftorture.stutter_cpus= [KNL]
The number of jiffies to wait between each change
to the set of CPUs under test.
scftorture.use_cpus_read_lock= [KNL]
Use cpus_read_lock() instead of the default
preempt_disable() to disable CPU hotplug
while invoking one of the smp_call_function*()
functions.
scftorture.verbose= [KNL]
Enable additional printk() statements.
scftorture.weight_single= [KNL]
The probability weighting to use for the
smp_call_function_single() function with a zero
"wait" parameter. A value of -1 selects the
default if all other weights are -1. However,
if at least one weight has some other value, a
value of -1 will instead select a weight of zero.
scftorture.weight_single_wait= [KNL]
The probability weighting to use for the
smp_call_function_single() function with a
non-zero "wait" parameter. See weight_single.
scftorture.weight_many= [KNL]
The probability weighting to use for the
smp_call_function_many() function with a zero
"wait" parameter. See weight_single.
Note well that setting a high probability for
this weighting can place serious IPI load
on the system.
scftorture.weight_many_wait= [KNL]
The probability weighting to use for the
smp_call_function_many() function with a
non-zero "wait" parameter. See weight_single
and weight_many.
scftorture.weight_all= [KNL]
The probability weighting to use for the
smp_call_function() function with a zero
"wait" parameter. See weight_single and
weight_many.
scftorture.weight_all_wait= [KNL]
The probability weighting to use for the
smp_call_function() function with a
non-zero "wait" parameter. See weight_single
and weight_many.
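The interaction of the -1 values described for the weight_* parameters is easy to misread, so here is a sketch of that rule followed by a weighted random choice among the primitives. The dictionary keys and the default values passed in are hypothetical; the real defaults live in the scftorture module and are not reproduced here.

```python
import random
from typing import Dict


def resolve_weights(weights: Dict[str, int], defaults: Dict[str, int]) -> Dict[str, int]:
    """Apply the documented -1 rule to a set of scftorture.weight_* values."""
    if all(w == -1 for w in weights.values()):
        return dict(defaults)  # every weight is -1: use the defaults
    # Otherwise any remaining -1 means "weight of zero".
    return {name: 0 if w == -1 else w for name, w in weights.items()}


def pick_primitive(resolved: Dict[str, int], rng: random.Random) -> str:
    """Choose one primitive with probability proportional to its weight."""
    names = [n for n, w in resolved.items() if w > 0]
    if not names:
        raise ValueError("at least one weight must be positive")
    return rng.choices(names, weights=[resolved[n] for n in names], k=1)[0]
```

For example, setting only weight_many=5 silently turns every other primitive off, so the test then exercises smp_call_function_many() exclusively.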
skew_tick= [KNL] Offset the periodic timer tick per CPU to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
Format: { "0" | "1" }
0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1")
1 -- enable.
Note: increases power consumption, thus should only be
enabled if running jitter sensitive (HPC/RT) workloads.
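To illustrate what offsetting the tick per CPU means, the sketch below staggers each CPU's tick so that they no longer all fire at the same instant. It assumes the offsets are spread evenly over half a tick period, which is our reading of the current implementation but should be treated as an assumption; `skew_offset_ns` is a hypothetical name.

```python
def skew_offset_ns(cpu: int, nr_cpus: int, tick_period_ns: int) -> int:
    """Offset of one CPU's tick when skew_tick=1 (illustrative model)."""
    if not 0 <= cpu < nr_cpus:
        raise ValueError("cpu out of range")
    # Spread the CPUs evenly across the first half of the tick period.
    return (tick_period_ns // 2) // nr_cpus * cpu
```

With HZ=1000 (a 1,000,000 ns tick) and four CPUs, this model places the ticks 125 us apart.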
security= [SECURITY] Choose a legacy "major" security module to
enable at boot. This has been deprecated by the
"lsm=" parameter.
selinux= [SELINUX] Disable or enable SELinux at boot time.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- disable.
1 -- enable.
Default value is 1.
serialnumber [BUGS=X86-32]
sev=option[,option...] [X86-64] See Documentation/x86/x86_64/boot-options.rst
shapers= [NET]
Maximal number of shapers.
show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
Limit APIC dumping. The parameter defines the maximum
number of local APICs to dump. It may also be set to
"all", meaning no limit.
Format: { 1 (default) | 2 | ... | all }.
The parameter is only valid if apic=debug or
apic=verbose is specified.
Example: apic=debug show_lapic=all
simeth= [IA-64]
simscsi=
slram= [HW,MTD]
slab_merge [MM]
Enable merging of slabs with similar size when the
kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
slab_nomerge [MM]
Disable merging of slabs with similar size. May be
necessary if there is some reason to distinguish
allocs to different slabs, especially in hardened
environments where the risk of heap overflows and
layout control by attackers can usually be
frustrated by disabling merging. This will reduce
most of the exposure of a heap attack to a single
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
For more information see Documentation/mm/slub.rst.
slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. Defaults to 1 for systems with
more than 32MB of RAM, 0 otherwise.
slub_debug[=options[,slabs][;[options[,slabs]]...] [MM, SLUB]
Enabling slub_debug allows one to determine the
culprit if slab objects become corrupted. Enabling
slub_debug can create guard zones around objects and
may poison objects when not in use. It also tracks the
last allocation and free. For more information see
Documentation/mm/slub.rst.
slub_max_order= [MM, SLUB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
Documentation/mm/slub.rst.
slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
increase the slab order up to slub_max_order to
generate a sufficiently large slab able to contain
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
For more information see Documentation/mm/slub.rst.
slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
For more information see Documentation/mm/slub.rst.
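How slub_min_objects, slub_min_order and slub_max_order interact can be sketched as a search for the smallest order whose slab holds enough objects. This is a simplification, assuming 4 KiB pages and a maximum order of 3 by default; the real SLUB calculation also weighs wasted space and per-slab metadata, which this sketch ignores, and `slab_order` is a hypothetical name.

```python
def slab_order(obj_size: int, min_objects: int, min_order: int = 0,
               max_order: int = 3, page_size: int = 4096) -> int:
    """Smallest page order in [min_order, max_order] that fits min_objects objects."""
    if obj_size <= 0 or min_objects <= 0:
        raise ValueError("obj_size and min_objects must be positive")
    if min_order > max_order:
        raise ValueError("slub_min_order is above slub_max_order")
    for order in range(min_order, max_order + 1):
        # A slab of order n spans page_size << n bytes.
        if (page_size << order) // obj_size >= min_objects:
            return order
    # Never exceed slub_max_order, even if min_objects cannot be met.
    return max_order
```

For 256-byte objects, asking for 16 objects fits in an order-0 slab, while asking for 32 pushes the slab to order 1 (8 KiB).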
slub_merge [MM, SLUB]
Same as slab_merge.
slub_nomerge [MM, SLUB]
Same as slab_nomerge. This is supported for legacy
reasons. See slab_nomerge for more information.
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
smp.csd_lock_timeout= [KNL]
Specify the period of time in milliseconds
that smp_call_function() and friends will wait
for a CPU to release the CSD lock. This is
useful when diagnosing bugs involving CPUs
disabling interrupts for extended periods
of time. Defaults to 5,000 milliseconds, and
setting a value of zero disables this feature.
This feature may be more efficiently disabled
using the csdlock_debug- kernel parameter.
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
smsc-ircc2.ircc_fir= [HW] FIR base I/O port
smsc-ircc2.ircc_irq= [HW] IRQ line
smsc-ircc2.ircc_dma= [HW] DMA channel
smsc-ircc2.ircc_transceiver= [HW] Transceiver type:
0: Toshiba Satellite 1800 (GP data pin select)
1: Fast pin select (default)
2: ATC IRMode
smt= [KNL,S390] Set the maximum number of threads (logical
CPUs) to use per physical CPU on systems capable of
symmetric multithreading (SMT). Will be capped to the
actual hardware limit.
Format: <integer>
Default: -1 (no limit)
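The capping rule amounts to a minimum against the hardware thread count. A minimal sketch, assuming only -1 and positive integers are meaningful (the handling of other values is not described here, so the sketch rejects them); the function name is hypothetical.

```python
def effective_smt_threads(smt_param: int, hw_threads_per_core: int) -> int:
    """Threads used per physical CPU for a given smt= value (-1 = no limit)."""
    if smt_param == -1:
        return hw_threads_per_core
    if smt_param < 1:
        raise ValueError("this sketch only accepts -1 or a positive integer")
    # The value is capped to the actual hardware limit.
    return min(smt_param, hw_threads_per_core)
```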
softlockup_panic=
[KNL] Whether the soft-lockup detector should generate panics.
Format: 0 | 1
A value of 1 instructs the soft-lockup detector
to panic the machine when a soft-lockup occurs. It is
also controlled by the kernel.softlockup_panic sysctl
and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the
respective build-time switch to that functionality.
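The sysctl counterpart can be inspected from user space. A sysctl name maps to a file under /proc/sys by replacing dots with slashes, so kernel.softlockup_panic is /proc/sys/kernel/softlockup_panic. The small reader below takes a configurable root so it can be exercised without a live /proc; the helper names are hypothetical.

```python
import os


def sysctl_path(name: str, root: str = "/proc/sys") -> str:
    """Map a dotted sysctl name such as kernel.softlockup_panic to its file."""
    return os.path.join(root, *name.split("."))


def read_sysctl_int(name: str, root: str = "/proc/sys") -> int:
    """Read an integer-valued sysctl, e.g. kernel.softlockup_panic."""
    with open(sysctl_path(name, root)) as f:
        return int(f.read().strip())
```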
softlockup_all_cpu_backtrace=
[KNL] Whether the soft-lockup detector should generate
backtraces on all CPUs.
Format: 0 | 1
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/admin-guide/laptops/sonypi.rst
spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
The default operation protects the kernel from
user space attacks.
on - unconditionally enable, implies
spectre_v2_user=on
off - unconditionally disable, implies
spectre_v2_user=off
auto - kernel detects whether your CPU model is
vulnerable
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
CONFIG_RETPOLINE configuration option, and the
compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
Selecting 'off' will disable both the kernel and
the user space protections.
Specific mitigations can also be selected manually:
retpoline - replace indirect branches
retpoline,generic - Retpolines
retpoline,lfence - LFENCE; indirect branch
retpoline,amd - alias for retpoline,lfence
eibrs - Enhanced/Auto IBRS
eibrs,retpoline - Enhanced/Auto IBRS + Retpolines
eibrs,lfence - Enhanced/Auto IBRS + LFENCE
ibrs - use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
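To check which mitigation was actually selected at boot (for example with spectre_v2=auto), the result is reported in sysfs under /sys/devices/system/cpu/vulnerabilities/. The sketch below reads one entry and classifies it by its leading "Not affected", "Vulnerable" or "Mitigation:" text. The text after the prefix varies between kernel versions, so only the prefix is inspected; the helper names are hypothetical.

```python
import os
from typing import Optional

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_vulnerability(name: str, base: str = VULN_DIR) -> Optional[str]:
    """Return the status line for e.g. "spectre_v2", or None if not reported."""
    try:
        with open(os.path.join(base, name)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def classify(status: Optional[str]) -> str:
    """Reduce a status line to a coarse category by its prefix."""
    if status is None:
        return "unknown"
    if status.startswith("Not affected"):
        return "not affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"
```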
spectre_v2_user=
[X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability between
user space tasks.
on - Unconditionally enable mitigations.
Enforced by spectre_v2=on.
off - Unconditionally disable mitigations.
Enforced by spectre_v2=off.
prctl - Indirect branch speculation is enabled,
but mitigation can be enabled via prctl
per thread. The mitigation control state
is inherited on fork.
prctl,ibpb
- Like "prctl" above, but only STIBP is
controlled per thread. IBPB is always issued
when switching between different user
space processes.
seccomp
- Same as "prctl" above, but all seccomp
threads will enable the mitigation unless
they explicitly opt out.
seccomp,ibpb
- Like "seccomp" above, but only STIBP is
controlled per thread. IBPB is always issued
when switching between different
user space processes.
auto - Kernel selects the mitigation depending on
the available CPU features and vulnerability.
Default mitigation: "prctl"
Not specifying this option is equivalent to
spectre_v2_user=auto.
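Under the "prctl" and "seccomp" modes, a thread opts in with prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0). The sketch below calls prctl() through ctypes (Linux only) and decodes the PR_GET_SPECULATION_CTRL result. The constant values are copied from <linux/prctl.h>; the helper names are hypothetical. Note that PR_SPEC_DISABLE disables the speculation, which is what enables the mitigation.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_INDIRECT_BRANCH = 1
PR_SPEC_PRCTL = 1 << 0          # per-task control is allowed
PR_SPEC_ENABLE = 1 << 1         # speculation enabled, mitigation off
PR_SPEC_DISABLE = 1 << 2        # speculation disabled, mitigation on
PR_SPEC_FORCE_DISABLE = 1 << 3  # like DISABLE, but cannot be undone


def _prctl(option: int, arg2: int, arg3: int = 0) -> int:
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.prctl(option, ctypes.c_ulong(arg2), ctypes.c_ulong(arg3),
                     ctypes.c_ulong(0), ctypes.c_ulong(0))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def describe(ret: int) -> dict:
    """Decode a PR_GET_SPECULATION_CTRL result for indirect branches."""
    if ret == 0:
        return {"affected": False, "per_task": False, "mitigated": False}
    return {
        "affected": True,
        "per_task": bool(ret & PR_SPEC_PRCTL),
        "mitigated": bool(ret & (PR_SPEC_DISABLE | PR_SPEC_FORCE_DISABLE)),
    }


def enable_indirect_branch_mitigation() -> None:
    """Opt the calling thread in; children inherit the state on fork."""
    _prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE)


def indirect_branch_status() -> dict:
    return describe(_prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH))
```

A process about to run untrusted code could, for example, call enable_indirect_branch_mitigation() before forking the worker, since the control state is inherited on fork.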
spec_store_bypass_disable=
[HW] Control Speculative Store Bypass (SSB) Disable mitigation
(Speculative Store Bypass vulnerability)
Certain CPUs are vulnerable to an exploit against
a common industry-wide performance optimization known
as "Speculative Store Bypass" in which recent stores
to the same memory location may not be observed by
later loads during speculative execution. The idea
is that such stores are unlikely and that they can
be detected prior to instruction retirement at the
end of a particular speculative execution window.
In vulnerable processors, the speculatively forwarded
store can be used in a cache side channel attack, for
example to read memory to which the attacker does not
directly have access (e.g. inside sandboxed code).
This parameter controls whether the Speculative Store
Bypass optimization is used.
On x86 the options are:
on - Unconditionally disable Speculative Store Bypass
off - Unconditionally enable Speculative Store Bypass
auto - Kernel detects whether the CPU model contains an
implementation of Speculative Store Bypass and
picks the most appropriate mitigation. If the
CPU is not vulnerable, "off" is selected. If the
CPU is vulnerable the default mitigation is
architecture and Kconfig dependent. See below.
prctl - Control Speculative Store Bypass per thread
via prctl. Speculative Store Bypass is enabled
for a process by default. The state of the control
is inherited on fork.
seccomp - Same as "prctl" above, but all seccomp threads
will disable SSB unless they explicitly opt out.
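The per-task outcome of each option can be summarised as a small decision function. This is a behavioural model of the option descriptions above, not kernel code: `auto_default` is a hypothetical stand-in for the architecture- and Kconfig-dependent default, `cpu_vulnerable` only affects "auto" in this sketch, and the function name is hypothetical.

```python
def ssb_disabled_for_task(mode: str, cpu_vulnerable: bool, *,
                          prctl_disabled: bool = False,
                          in_seccomp: bool = False,
                          seccomp_opt_out: bool = False,
                          auto_default: str = "prctl") -> bool:
    """Return True if Speculative Store Bypass ends up disabled for a task."""
    if mode == "auto":
        # Not vulnerable: "off". Vulnerable: an arch/Kconfig-dependent default.
        mode = auto_default if cpu_vulnerable else "off"
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode == "prctl":
        # SSB stays enabled unless the task opts in via prctl.
        return prctl_disabled
    if mode == "seccomp":
        # Like prctl, plus seccomp threads are mitigated unless they opt out.
        return prctl_disabled or (in_seccomp and not seccomp_opt_out)
    raise ValueError(f"unknown spec_store_bypass_disable= mode: {mode!r}")
```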
Default mitigations:
x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
Switch the kernel default of SSBD and STIBP to the ones with
CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl
spectre_v2_user=prctl) even if CONFIG_SECCOMP=y.
Several motivations listed below:
- If SMT is enabled the seccomp jail can still attack the rest of the
system even with spectre_v2_user=seccomp by using MDS-HT (except on
XEON PHI where MDS can be tamed with SMT left enabled, but that's a
special case). Setting STIBP become a very expensive window dressing
after MDS-HT was discovered.
- The seccomp jail cannot attack the kernel with spectre-v2-HT
regardless (even if STIBP is not set), but with MDS-HT the seccomp
jail can attack the kernel too.
- With spec_store_bypass_disable=prctl the seccomp jail can attack the
other userland (guest or host mode) using spectre-v2-HT, but the
userland attack is already mitigated by both ASLR and pid namespaces
for host userland and through virt isolation with libkrun or
kata. (if something if somebody is worried about spectre-v2-HT it's
best to mount proc with hidepid=2,gid=proc on workstations where not
all apps may run under container runtimes, rather than slowing down
all seccomp jails, but the best is to add pid namespaces to the
seccomp jail). As opposed MDS-HT is not mitigated and the seccomp
jail can still attack all other host and guest userland if SMT is
enabled even with spec_store_bypass_disable=seccomp.
- If full security is required then MDS-HT must also be mitigated with
nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp
would become identical.
- Setting spectre_v2_user=seccomp is overall lower priority than to
setting javascript.options.wasm false in about:config to protect
against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT
and STIBP which again is already statistically well mitigated by
other means in userland and it's fully mitigated in kernel with
retpolines (unlike the wasm assist call with MDS-HT).
- SSBD is needed to prevent reading the JIT memory and the primary
user being the OpenJDK. However the primary user of SSBD wouldn't be
covered by spec_store_bypass_disable=seccomp because it doesn't use
seccomp and the primary user also explicitly declined to set
PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily
could. In fact it would need to set it only when the sandboxing
mechanism is enabled for javaws applets, but it still declined it by
declaring security within the same user address space as an
untenable objective for their JIT, even in the sandboxing case where
performance would be a lesser concern (for the record: I kind of
disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and
I prefer to run javaws through a wrapper that sets
PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that
even if the primary user of SSBD would use seccomp, they would
invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now.
- runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s
and podman have a default json seccomp allowlist that cannot be
slowed down, so for the #1 seccomp user this change is already a
noop.
- systemd/sshd or other apps that use seccomp, if they really need
STIBP or SSBD, they need to explicitly set the
PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind
catch-all approach was done probably initially with a wishful
thinking objective to pretend to have a peace of mind that it could
magically fix it all. That was wishful thinking before MDS-HT was
discovered, but after MDS-HT has been discovered it become just
window dressing.
- For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP
or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's
needed with TCG it should be an opt-in with
PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't
slowdown KVM for nothing). For qemu+KVM STIBP would be even more
window dressing than it is for all other apps, because in the
qemu+KVM case there's not only the MDS attack to worry about with
SMT enabled. Even after disabling SMT, there's still a theoretical
spectre-v2 attack possible within the same thread context from guest
mode to host ring3 that the host kernel retpoline mitigation has no
theoretical chance to mitigate. On some kernels a
ibrs-always/ibrs-retpoline opt-in model is provided that will
enabled IBRS in the qemu host ring3 userland which fixes this
theoretical concern. Only after enabling IBRS in the host userland
it would then make sense to proceed and worry about STIBP and an
attack on the other host userland, but then again SMT would need to
be disabled for full security anyway, so that would render STIBP
again a noop.
- last but not the least: the lack of "spec_store_bypass_disable=prctl
spectre_v2_user=prctl" means the moment a guest boots and
sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR
which will make the guest vmexit forever slower, forcing KVM to
issue a very slow rdmsr instruction at every vmexit. So the end
result is that SPEC_CTRL MSR is only available in GCE. Most other
public cloud providers don't expose SPEC_CTRL, which means that not
only STIBP/SSBD isn't available, but IBPB isn't available either
(which would cause no overhead to the guest or the hypervisor
because it's write only and requires no reading during vmexit). So
the current default already net loss in security (missing IBPB)
which means most public cloud providers cannot achieve a fully
secure guest with nosmt (and nosmt is enough to fully mitigate
MDS-HT). It also means GCE and is unfairly penalized in performance
because it provides the option to enable full security in the guest
as an opt-in (i.e. nosmt and IBPB). So this change will allow all
cloud providers to expose SPEC_CTRL without incurring into any
hypervisor slowdown and at the same time it will remove the unfair
penalization of GCE performance for doing the right thing and it'll
allow to get full security with nosmt with IBPB being available (and
STIBP becoming meaningless).
Example to put things in perspective: the STIBP enabled in seccomp has
never been about protecting apps using seccomp like sshd from an
attack from a malicious userland, but on the contrary it has always
been about protecting the system from an attack from sshd, after a
successful remote network exploit against sshd. In fact initially it
wasn't obvious STIBP would work both ways (STIBP was about preventing
the task that runs with STIBP from being attacked with spectre-v2-HT, but
as it happens, in the STIBP case it also prevents the attack in the
other direction). In the hypothetical case that sshd has been remotely
exploited, the last concern should be whether STIBP is set, because it'll
still be possible to obtain info even from the kernel by using MDS if
nosmt wasn't set (and if it was set, STIBP is a noop in the first
place). By contrast, the kernel cannot leak anything with spectre-v2 HT
because of retpolines, and the userland is mitigated by ASLR already
and ideally PID namespaces too. If anything, it'd be worth checking whether
sshd runs the seccomp thread under PID namespaces too, if available in
the running kernel. SSBD also would be a noop for sshd, since sshd
uses no JIT. If sshd prefers to keep doing the STIBP window dressing
exercise, it still can even after this change of defaults by opting in
with PR_SPEC_INDIRECT_BRANCH.
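Under the prctl default, a task like sshd that still wants STIBP has to ask for it through the speculation-control prctl(). The following is a minimal sketch using ctypes. It assumes x86-64 Linux with glibc. The constants are the values from <linux/prctl.h>. The helper names (describe_spec_ctrl, opt_in_indirect_branch_mitigation, indirect_branch_state) are hypothetical and are not part of sshd or any library. Note that "disabling" indirect branch speculation is what turns the mitigation on.

```python
import ctypes
import ctypes.util
import os

# Constants from <linux/prctl.h>.
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_INDIRECT_BRANCH = 1
PR_SPEC_NOT_AFFECTED = 0
PR_SPEC_PRCTL = 1 << 0
PR_SPEC_ENABLE = 1 << 1
PR_SPEC_DISABLE = 1 << 2
PR_SPEC_FORCE_DISABLE = 1 << 3


def describe_spec_ctrl(value: int) -> str:
    """Turn a PR_GET_SPECULATION_CTRL return value into a short label."""
    if value < 0:
        return "error"
    if value == PR_SPEC_NOT_AFFECTED:
        return "not affected"
    if not (value & PR_SPEC_PRCTL):
        # No per-task control: the system-wide setting applies to every task.
        return "always mitigated" if value & PR_SPEC_DISABLE else "not mitigated"
    if value & PR_SPEC_FORCE_DISABLE:
        return "force-disabled (mitigated)"
    if value & PR_SPEC_DISABLE:
        return "disabled (mitigated)"
    return "enabled (not mitigated)"


def _libc_prctl():
    """Load glibc's prctl() wrapper with explicit argument types."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    prctl = libc.prctl
    prctl.restype = ctypes.c_int
    prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    return prctl


def opt_in_indirect_branch_mitigation() -> None:
    """Disable indirect branch speculation for this task (requests STIBP)."""
    if _libc_prctl()(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                     PR_SPEC_DISABLE, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def indirect_branch_state() -> str:
    """Query and describe this task's indirect branch speculation state."""
    value = _libc_prctl()(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0)
    return describe_spec_ctrl(value)
```

With spectre_v2_user=prctl, indirect_branch_state() would be expected to report "enabled (not mitigated)" until opt_in_indirect_branch_mitigation() is called. The actual result depends on the CPU and on boot parameters.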
Ultimately, setting SSBD and STIBP by default for all seccomp jails is
a poor trade-off and a bad default, with more cons than pros, that ends up
reducing security in the public cloud (by giving a huge incentive to
not expose SPEC_CTRL, which would be needed to get full security with
IBPB after setting nosmt in the guest) and excessively hurting the
performance of more secure apps using seccomp, which end up having to
opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW.
The following is the verified result of the new default with SMT
enabled:
(gdb) print spectre_v2_user_stibp
$1 = SPECTRE_V2_USER_PRCTL
(gdb) print spectre_v2_user_ibpb
$2 = SPECTRE_V2_USER_PRCTL
(gdb) print ssb_mode
$3 = SPEC_STORE_BYPASS_PRCTL
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com
2020-11-04 18:50:54 -05:00
X86: "prctl"
2018-05-03 14:37:54 -07:00
2018-07-10 12:08:36 +10:00
On powerpc the options are:
on,auto - On Power8 and Power9 insert a store-forwarding
barrier on kernel entry and exit. On Power7
perform a software flush on kernel entry and
exit.
off - No action.
Not specifying this option is equivalent to
spec_store_bypass_disable=auto.
2005-04-16 15:20:36 -07:00
spia_io_base= [HW,MTD]
spia_fio_base=
spia_pedr=
spia_peddr=
2020-01-26 12:05:35 -08:00
split_lock_detect=
2021-03-22 13:53:25 +00:00
[X86] Enable split lock detection or bus lock detection
2020-01-26 12:05:35 -08:00
When enabled (and if hardware support is present), atomic
instructions that access data across cache line
2021-03-22 13:53:25 +00:00
boundaries will result in an alignment check exception
for split lock detection or a debug exception for
bus lock detection.
2020-01-26 12:05:35 -08:00
off - not enabled
2021-03-22 13:53:25 +00:00
warn - the kernel will emit rate-limited warnings
2020-01-26 12:05:35 -08:00
about applications triggering the #AC
2021-03-22 13:53:25 +00:00
exception or the #DB exception. This mode is
the default on CPUs that support split lock
detection or bus lock detection. If both
features are enabled in hardware, #AC is
used by default.
2020-01-26 12:05:35 -08:00
fatal - the kernel will send SIGBUS to applications
2021-03-22 13:53:25 +00:00
that trigger the #AC exception or the #DB
exception. If both features are enabled in
hardware, #AC is used by default.
2020-01-26 12:05:35 -08:00
2021-04-19 21:49:57 +00:00
ratelimit:N -
Set system wide rate limit to N bus locks
per second for bus lock detection.
0 < N <= 1000.
N/A for split lock detection.
2020-01-26 12:05:35 -08:00
If an #AC exception is hit in the kernel or in
firmware (i.e. not while executing in user mode)
the kernel will oops in either "warn" or "fatal"
mode.
2021-03-22 13:53:25 +00:00
#DB exception for bus lock is triggered only when
CPL > 0.
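The option grammar above (off, warn, fatal, or ratelimit:N with 0 < N <= 1000) can be sketched as a small validator. A simplified fixed-window model of "N bus locks per second" is included alongside it. Both are hypothetical illustrations. They do not reproduce the kernel's own parser or how it handles invalid values. The throttling the kernel applies to bus locks over the limit is not modeled either.

```python
from typing import Optional, Tuple

MODES = ("off", "warn", "fatal")


def parse_split_lock_detect(arg: str) -> Tuple[str, Optional[int]]:
    """Parse a split_lock_detect= value into (mode, ratelimit)."""
    if arg.startswith("ratelimit:"):
        n = int(arg[len("ratelimit:"):])
        if not 0 < n <= 1000:
            raise ValueError(f"ratelimit must satisfy 0 < N <= 1000, got {n}")
        return "ratelimit", n
    if arg in MODES:
        return arg, None
    raise ValueError(f"unknown split_lock_detect mode: {arg!r}")


class BusLockRateLimiter:
    """Count bus locks per one-second window; report those over the limit."""

    def __init__(self, per_second: int) -> None:
        self.per_second = per_second
        self.window: Optional[int] = None
        self.count = 0

    def within_limit(self, t_seconds: float) -> bool:
        window = int(t_seconds)
        if window != self.window:
            # A new one-second window starts: reset the counter.
            self.window, self.count = window, 0
        self.count += 1
        return self.count <= self.per_second
```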
2020-04-16 17:54:04 +02:00
srbds= [X86,INTEL]
Control the Special Register Buffer Data Sampling
(SRBDS) mitigation.
Certain CPUs are vulnerable to an MDS-like
exploit which can leak bits from the random
number generator.
By default, this issue is mitigated by
microcode. However, the microcode fix can cause
the RDRAND and RDSEED instructions to become
much slower. Among other effects, this will
result in reduced throughput from /dev/urandom.
The microcode mitigation can be disabled with
the following option:
off: Disable mitigation and remove
performance impact on RDRAND and RDSEED
2022-01-31 11:21:30 -08:00
srcutree.big_cpu_lim [KNL]
Specifies the number of CPUs constituting a
large system, such that srcu_struct structures
should immediately allocate an srcu_node array.
This kernel-boot parameter defaults to 128,
but takes effect only when the low-order four
bits of srcutree.convert_to_big are equal to 3
(decide at boot).
2022-01-25 15:41:10 -08:00
srcutree.convert_to_big [KNL]
Specifies under what conditions an SRCU tree
srcu_struct structure will be converted to big
form, that is, with an rcu_node tree:
0: Never.
1: At init_srcu_struct() time.
2: When rcutorture decides to.
2022-01-31 11:21:30 -08:00
3: Decide at boot time (default).
2022-01-27 20:32:05 -08:00
0x1X: Above plus if high contention.
2022-01-25 15:41:10 -08:00
Either way, the srcu_node tree will be sized based
on the actual runtime number of CPUs (nr_cpu_ids)
instead of the compile-time CONFIG_NR_CPUS.
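The parameter packs two things into one value. The low-order four bits select the sizing policy, and the 0x10 bit adds contention-based conversion. Together with srcutree.big_cpu_lim, this gives roughly the decode below. This is a sketch, not the kernel's code. It assumes that a system with at least big_cpu_lim CPUs counts as "large" (the exact boundary is an assumption). The function names are hypothetical.

```python
from typing import Tuple

SIZING = {
    0: "never",
    1: "at init_srcu_struct() time",
    2: "when rcutorture decides to",
    3: "decide at boot time",
}
CONTENTION_BIT = 0x10


def describe_convert_to_big(value: int) -> Tuple[str, bool]:
    """Split the parameter into its low-order policy and the contention flag."""
    return SIZING.get(value & 0xF, "unknown"), bool(value & CONTENTION_BIT)


def big_from_init(value: int, nr_cpu_ids: int, big_cpu_lim: int = 128) -> bool:
    """Would srcu_node arrays be allocated up front, before any contention?"""
    mode = value & 0xF
    if mode == 1:
        return True
    if mode == 3:
        # Assumption: ">=" boundary against srcutree.big_cpu_lim.
        return nr_cpu_ids >= big_cpu_lim
    return False
```

For example, srcutree.convert_to_big=0x13 on a 64-CPU system would start small and convert later only if contention is high.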
srcu: Prevent sdp->srcu_gp_seq_needed counter wrap
If a given CPU never starts an SRCU grace period, the
grace-period sequence counter might wrap. If this CPU were to decide to
finally start a grace period, the state of its sdp->srcu_gp_seq_needed
might make it appear that it has already requested this grace period,
which would prevent starting the grace period. If no other CPU ever started
a grace period again, this would look like a grace-period hang. Even
if some other CPU took pity and started the needed grace period, the
leaf rcu_node structure's ->srcu_data_have_cbs field won't have a record
of the fact that this CPU has a callback pending, which would look like
a very localized grace-period hang.
This might seem very unlikely, but SRCU grace periods can take less than
a microsecond on small systems, which means that overflow can happen
in much less than an hour on a 32-bit embedded system. And embedded
systems are especially likely to have long-term idle CPUs. Therefore,
it makes sense to prevent this scenario from happening.
This commit therefore scans each srcu_data structure occasionally,
with frequency controlled by the srcutree.counter_wrap_check kernel
boot parameter. This parameter can be set to something like 255
in order to exercise the counter-wrap-prevention code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-05-03 15:35:32 -07:00
srcutree.counter_wrap_check [KNL]
Specifies how frequently to check for
grace-period sequence counter wrap for the
srcu_data structure's ->srcu_gp_seq_needed field.
The greater the number of bits set in this kernel
parameter, the less frequently counter wrap will
be checked for. Note that the bottom two bits
are ignored.
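Here is a simplified model of how the mask sets the scan frequency. It assumes two things. First, the scan runs at the end of a grace period whenever the completed sequence number ANDed with srcutree.counter_wrap_check is zero. Second, the sequence counter advances by 4 per grace period because its bottom two bits hold grace-period state, which would explain why those bits are ignored. Both points are assumptions about the implementation. The function is a hypothetical illustration.

```python
RCU_SEQ_CTR_SHIFT = 2  # assumption: the low two bits of the sequence hold GP state


def wrap_checks(counter_wrap_check: int, num_gps: int) -> int:
    """Count how many of num_gps completed grace periods trigger the wrap scan."""
    checks = 0
    for gp in range(1, num_gps + 1):
        gpseq = gp << RCU_SEQ_CTR_SHIFT  # counter value at GP end; state bits zero
        if (gpseq & counter_wrap_check) == 0:
            checks += 1
    return checks
```

Under these assumptions, a value of 255 (as suggested above for exercising the code) scans once every 64 grace periods. A value of 0, or 3, scans after every grace period.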
2017-04-25 14:03:11 -07:00
srcutree.exp_holdoff [KNL]
Specifies how many nanoseconds must elapse
since the end of the last SRCU grace period for
a given srcu_struct until the next normal SRCU
grace period will be considered for automatic
expediting. Set to zero to disable automatic
expediting.
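The time-based part of this decision can be sketched as follows. A grace period becomes a candidate for automatic expediting only once exp_holdoff nanoseconds have elapsed since the last one ended, and zero turns the feature off. This is a hypothetical helper covering only that one condition. The kernel may apply other checks too, and those are not modeled here.

```python
def might_expedite(now_ns: int, last_gp_end_ns: int, exp_holdoff_ns: int) -> bool:
    """True if enough time has passed since the last SRCU GP ended."""
    if exp_holdoff_ns == 0:
        # srcutree.exp_holdoff=0 disables automatic expediting entirely.
        return False
    return now_ns - last_gp_end_ns >= exp_holdoff_ns
```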
srcu: Make expedited RCU grace periods block even less frequently
The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs
and blocking readers from consuming CPU") was to prevent a long
series of never-blocking expedited SRCU grace periods from blocking
kernel-live-patching (KLP) progress. Although it was successful, it also
resulted in excessive boot times on certain embedded workloads running
under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive"
means increasing the boot time up into the three-to-four minute range.
This increase in boot time was due to the more than 6000 back-to-back
invocations of synchronize_rcu_expedited() within the KVM host OS, which
in turn resulted from qemu's emulation of a long series of MMIO accesses.
Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
periods") did not significantly help this particular use case.
Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
of non-sleeping per phase counts on a system with preemption enabled,
and observed the following boot times:
+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE | Boot time (s) |
+──────────────────────────+────────────────+
| 100 | 30.053 |
| 150 | 25.151 |
| 200 | 20.704 |
| 250 | 15.748 |
| 500 | 11.401 |
| 1000 | 11.443 |
| 10000 | 11.258 |
| 1000000 | 11.154 |
+──────────────────────────+────────────────+
Analysis of the experimental results shows additional improvements with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when the number of per-phase iterations was scaled to one jiffy.
This commit therefore scales the per-grace-period-phase number of non-sleeping
polls so that non-sleeping polls extend for about one jiffy. In addition,
the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
replaced with a simple check for an expedited grace period. This change
schedules callback invocation immediately after expedited grace periods
complete, which results in greatly improved boot times. Testing done
by Marc and Zhangfei confirms that this change recovers most of the
performance degradation in boottime; for CONFIG_HZ_250 configuration,
specifically, boot times improve from 3m50s to 41s on Marc's setup;
and from 2m40s to ~9.7s on Zhangfei's setup.
In addition to the changes to default per phase delays, this
change adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays in
order to more quickly react to additional use cases.
Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-01 08:45:45 +05:30
srcutree.srcu_max_nodelay [KNL]
Specifies the number of no-delay instances
per jiffy for which the SRCU grace period
worker thread will be rescheduled with zero
delay. Beyond this limit, the worker thread will
be rescheduled with a sleep delay of one jiffy.
srcutree.srcu_max_nodelay_phase [KNL]
Specifies, for each grace-period phase, the number
of non-sleeping polls of readers. Beyond this limit,
the grace-period worker thread will be rescheduled
with a sleep delay of one jiffy between each
rescan of the readers, for that grace-period phase.
srcutree.srcu_retry_check_delay [KNL]
Specifies the number of microseconds of non-sleeping
delay between each non-sleeping poll of readers.
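The commit above scales the number of non-sleeping polls per phase so that they span about one jiffy. The arithmetic behind that is the jiffy length in microseconds divided by srcutree.srcu_retry_check_delay. The sketch below shows only this arithmetic. The example delay values are illustrative. Any further scaling or clamping the kernel applies to its defaults is not reproduced.

```python
USEC_PER_SEC = 1_000_000


def nodelay_polls_per_jiffy(hz: int, retry_check_delay_us: int) -> int:
    """Number of back-to-back polls whose summed delay fits in about one jiffy."""
    if hz <= 0 or retry_check_delay_us <= 0:
        raise ValueError("hz and retry_check_delay_us must be positive")
    jiffy_us = USEC_PER_SEC // hz  # e.g. 4000 us at HZ=250
    return max(1, jiffy_us // retry_check_delay_us)
```

At HZ=250 with a 5 us delay, this gives 800 polls per phase. At HZ=1000 it gives 200.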
2022-01-27 20:32:05 -08:00
srcutree.small_contention_lim [KNL]
Specifies the number of update-side contention
events per jiffy that will be tolerated before
initiating a conversion of an srcu_struct
structure to big form. Note that the value of
srcutree.convert_to_big must have the 0x10 bit
set for contention-based conversions to occur.
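The interaction between this limit and the 0x10 bit of srcutree.convert_to_big can be modeled as a per-jiffy counter. Conversion triggers once the count within a single jiffy exceeds the limit, and only when the 0x10 bit is set. ContentionTracker is a hypothetical, simplified model. The kernel's actual bookkeeping is not reproduced.

```python
from typing import Optional


class ContentionTracker:
    """Count contention events per jiffy; flag conversion once over the limit."""

    def __init__(self, convert_to_big: int, small_contention_lim: int) -> None:
        self.enabled = bool(convert_to_big & 0x10)  # contention-based sizing on?
        self.limit = small_contention_lim
        self.jiffy: Optional[int] = None
        self.events = 0
        self.converted = False

    def record(self, jiffy: int) -> bool:
        """Record one contention event at `jiffy`; return whether converted."""
        if jiffy != self.jiffy:
            self.jiffy, self.events = jiffy, 0
        self.events += 1
        if self.enabled and not self.converted and self.events > self.limit:
            self.converted = True
        return self.converted
```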
2018-05-29 13:11:09 +01:00
ssbd= [ARM64,HW]
Speculative Store Bypass Disable control
On CPUs that are vulnerable to the Speculative
Store Bypass vulnerability and offer a
firmware based mitigation, this parameter
indicates how the mitigation should be used:
force-on: Unconditionally enable mitigation for
both kernel and userspace
force-off: Unconditionally disable mitigation for
both kernel and userspace
kernel: Always enable mitigation in the
kernel, and offer a prctl interface
to allow userspace to register its
interest in being mitigated too.
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
which is 256kB, or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries combined with the
default of no stack size limit, because those applications can be
tricked into consuming a large portion of the stack, and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack surface a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 04:03:24 -07:00
stack_guard_gap= [MM]
override the default stack gap protection. The value
is in page units and it defines how many pages prior
to (for stacks growing down) or after (for stacks
growing up) the main stack are kept free of any other
mapping. Default value is 256 pages.
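Because the gap is given in pages, its size in bytes depends on the base page size. 256 pages is 1MB with 4k pages but 16MB with 64k pages. The sketch below shows that conversion. It also shows a downward-growing stack's effective start, in the spirit of vm_start_gap() from the commit above. These are hypothetical Python helpers, not the kernel functions. The clamp at zero stands in for the kernel's underflow handling.

```python
PAGE_SIZE_4K = 4096


def stack_guard_gap_bytes(pages: int = 256, page_size: int = PAGE_SIZE_4K) -> int:
    """stack_guard_gap= is given in pages; convert it to bytes."""
    return pages * page_size


def vm_start_gap(vm_start: int, grows_down: bool, gap: int) -> int:
    """Lowest address reserved below a downward-growing stack."""
    if not grows_down:
        return vm_start
    return max(vm_start - gap, 0)  # clamp instead of wrapping below zero


def fits_below_stack(prev_vm_end: int, stack_vm_start: int, gap: int) -> bool:
    """Can a mapping ending at prev_vm_end sit below the stack and its gap?"""
    return prev_vm_end <= vm_start_gap(stack_vm_start, True, gap)
```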
2021-02-25 17:21:27 -08:00
stack_depot_disable= [KNL]
Setting this to true through kernel command line will
disable the stack depot thereby saving the static memory
consumed by the stack hash table. By default this is set
to false.
2008-12-16 23:06:40 -05:00
stacktrace [FTRACE]
Enable the stack tracer on boot up.
2011-12-19 22:01:00 -05:00
stacktrace_filter=[function-list]
[FTRACE] Limit the functions that the stack tracer
2020-12-31 20:08:31 -08:00
will trace at boot up. function-list is a comma-separated
2011-12-19 22:01:00 -05:00
list of functions. This list can be changed at run
time by the stack_trace_filter file in the debugfs
tracing directory. Note, this enables stack tracing
and the stacktrace above is not needed.
2005-04-16 15:20:36 -07:00
sti= [PARISC,HW]
Format: <num>
Set the STI (builtin display/keyboard on the HP-PARISC
machines) console (graphic card) which should be used
as the initial boot-console.
See also comment in drivers/video/console/sticore.c.
sti_font= [HW]
See comment in drivers/video/console/sticore.c.
stifb= [HW]
Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
2021-10-21 15:55:06 -07:00
strict_sas_size=
[X86]
Format: <bool>
Enable or disable strict sigaltstack size checks
against the required signal frame size which
depends on the supported FPU features. This can
be used to filter out binaries which have
not yet been made aware of AT_MINSIGSTKSZ.
2022-12-03 17:30:50 -08:00
stress_hpt [PPC]
Limits the number of kernel HPT entries in the hash
page table to increase the rate of hash page table
faults on kernel addresses.
stress_slb [PPC]
Limits the number of kernel SLB entries, and flushes
them frequently to increase the rate of SLB faults
on kernel addresses.
2009-08-09 15:06:19 -04:00
sunrpc.min_resvport=
sunrpc.max_resvport=
[NFS,SUNRPC]
SunRPC servers often require that client requests
originate from a privileged port (i.e. a port in the
range 0 < portnr < 1024).
An administrator who wishes to reserve some of these
ports for other uses may adjust the range that the
kernel's sunrpc client considers to be privileged
using these two parameters to set the minimum and
maximum port values.
2016-06-24 10:55:50 -04:00
sunrpc.svc_rpc_per_connection_limit=
[NFS,SUNRPC]
Limit the number of requests that the server will
process in parallel from a single connection.
The default value is 0 (no limit).
2007-03-06 01:42:23 -08:00
sunrpc.pool_mode=
[NFS]
Control how the NFS server code allocates CPUs to
service thread pools. Depending on how many NICs
you have and where their interrupts are bound, this
option will affect which CPUs will do NFS serving.
Note: this parameter cannot be changed while the
NFS server is running.
auto the server chooses an appropriate mode
automatically using heuristics
global a single global pool contains all CPUs
percpu one pool for each CPU
pernode one pool for each NUMA node (equivalent
to global on non-NUMA machines)
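The following sketch shows how each mode maps CPUs to service thread pools, given a CPU-to-NUMA-node table. The "auto" heuristic is an assumption, not a statement of the kernel's exact rules. Here it picks pernode when there is more than one node, percpu when there are more than two CPUs, and global otherwise. The function names and the dense pool numbering are hypothetical.

```python
from typing import Dict


def choose_auto_mode(cpu_to_node: Dict[int, int]) -> str:
    """Assumed 'auto' heuristic: pernode on NUMA, percpu on >2 CPUs, else global."""
    if len(set(cpu_to_node.values())) > 1:
        return "pernode"
    if len(cpu_to_node) > 2:
        return "percpu"
    return "global"


def assign_pools(mode: str, cpu_to_node: Dict[int, int]) -> Dict[int, int]:
    """Map each CPU to a thread-pool index under the given pool_mode."""
    if mode == "auto":
        mode = choose_auto_mode(cpu_to_node)
    if mode == "global":
        return {cpu: 0 for cpu in cpu_to_node}
    if mode == "percpu":
        return {cpu: i for i, cpu in enumerate(sorted(cpu_to_node))}
    if mode == "pernode":
        nodes = sorted(set(cpu_to_node.values()))
        node_index = {node: i for i, node in enumerate(nodes)}
        return {cpu: node_index[node] for cpu, node in cpu_to_node.items()}
    raise ValueError(f"unknown pool_mode: {mode!r}")
```

On a single-node machine, "pernode" collapses to one pool, matching the note that it is equivalent to global there.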
2009-08-09 15:06:19 -04:00
sunrpc.tcp_slot_table_entries=
sunrpc.udp_slot_table_entries=
[NFS,SUNRPC]
Sets the upper limit on the number of simultaneous
RPC calls that can be sent from the client to a
server. Increasing these values may allow you to
improve throughput, but will also increase the
amount of memory reserved for use by the client.
PM / sleep: add configurable delay for pm_test
When CONFIG_PM_DEBUG=y, we provide a sysfs file (/sys/power/pm_test) for
selecting one of a few suspend test modes, where rather than entering a
full suspend state, the kernel will perform some subset of suspend
steps, wait 5 seconds, and then resume back to normal operation.
This mode is useful for (among other things) observing the state of the
system just before entering a sleep mode, for debugging or analysis
purposes. However, a constant 5 second wait is not sufficient for some
sorts of analysis; for example, on an SoC, one might want to use
external tools to probe the power states of various on-chip controllers
or clocks.
This patch turns this 5 second delay into a configurable module
parameter, so users can determine how long to wait in this
pseudo-suspend state before resuming the system.
Example (wait 30 seconds):
# echo 30 > /sys/module/suspend/parameters/pm_test_delay
# echo core > /sys/power/pm_test
# time echo mem > /sys/power/state
...
[ 17.583625] suspend debug: Waiting for 30 second(s).
...
real 0m30.381s
user 0m0.017s
sys 0m0.080s
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Kevin Cernekee <cernekee@chromium.org>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-22 21:16:49 -08:00
suspend.pm_test_delay=
[SUSPEND]
Sets the number of seconds to remain in a suspend test
mode before resuming the system (see
/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
is set. Default value is 5.
2019-08-19 23:13:14 -03:00
svm= [PPC]
Format: { on | off | y | n | 1 | 0 }
This parameter controls use of the Protected
Execution Facility on pSeries.
2013-11-27 13:48:09 +01:00
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
2022-07-08 12:15:44 -04:00
Format: { <int> [,<int>] | force | noforce }
2013-11-27 13:48:09 +01:00
<int> -- Number of I/O TLB slabs
2022-07-08 12:15:44 -04:00
<int> -- Second integer after comma. Number of swiotlb
2022-07-21 23:38:46 -04:00
areas with their own lock. Will be rounded up
to a power of 2.
2013-11-27 13:48:09 +01:00
force -- force the use of bounce buffers even if they
wouldn't be automatically used by the kernel
2016-12-16 14:28:42 +01:00
noforce -- Never use bounce buffers (for debugging)
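The documented format can be sketched as a parser. It also rounds the optional area count up to a power of two, as described above. The parser is hypothetical and handles only the forms shown in the Format line. The byte total assumes each I/O TLB slab is 2 KiB (IO_TLB_SHIFT = 11). That assumption is noted in the code.

```python
from typing import Dict, Optional, Union

IO_TLB_SIZE = 2048  # assumption: one slab is 2 KiB (IO_TLB_SHIFT = 11)


def round_up_pow2(n: int) -> int:
    """Smallest power of two >= n."""
    if n < 1:
        raise ValueError("area count must be at least 1")
    return 1 << (n - 1).bit_length()


def parse_swiotlb(arg: str) -> Dict[str, Union[str, int, None]]:
    """Parse swiotlb={ <int>[,<int>] | force | noforce }."""
    if arg in ("force", "noforce"):
        return {"mode": arg}
    parts = arg.split(",")
    if len(parts) > 2:
        raise ValueError(f"unexpected swiotlb= value: {arg!r}")
    slabs = int(parts[0])
    areas: Optional[int] = round_up_pow2(int(parts[1])) if len(parts) == 2 else None
    return {"slabs": slabs, "bytes": slabs * IO_TLB_SIZE, "areas": areas}
```

For example, swiotlb=65536,3 would request 65536 slabs split into 4 areas.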
2005-10-23 12:57:11 -07:00
2005-04-16 15:20:36 -07:00
switches= [HW,M68k]
kernel/sysctl: support setting sysctl parameters from kernel command line
Patch series "support setting sysctl parameters from kernel command line", v3.
This series adds support for something that many people seem to have
always wanted but nobody had added yet: the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1
The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.
I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying levels of success:
- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.
- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value
- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straightforward conversion possible
- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infrastructure, which seems
pointless for a single flag
This patch (of 5):
A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have
general support for passing sysctl parameters via command line.
Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.
Setting the vm_swappiness issue aside (the underlying issue might be
solved in a different way), a quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.
Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using a temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.
The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. Since at that point the init
process can be started and could eventually execute a process writing
to /proc/sys/, it should also be fine to do that from the kernel.
Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.
[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-07 21:40:24 -07:00
sysctl.*= [KNL]
Set a sysctl parameter, right before loading the init
process, as if the value was written to the respective
/proc/sys/... file. Both '.' and '/' are recognized as
separators. Unrecognized parameters and invalid values
are reported in the kernel log. Sysctls registered
later by a loaded module cannot be set this way.
Example: sysctl.vm.swappiness=40
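The name mapping can be sketched as follows. A command-line parameter becomes a write of its value to a /proc/sys path, with either '.' or '/' accepted after the "sysctl" prefix. This hypothetical helper only computes the target path and value and writes nothing. Converting every later '.' into '/' is an assumption of the sketch, so names that themselves contain dots (e.g. VLAN interfaces) are not handled.

```python
from typing import Optional, Tuple


def sysctl_arg_to_write(param: str) -> Optional[Tuple[str, str]]:
    """Map 'sysctl.a.b=v' or 'sysctl/a/b=v' to ('/proc/sys/a/b', 'v')."""
    key, sep, value = param.partition("=")
    if not sep or not key.startswith("sysctl"):
        return None
    rest = key[len("sysctl"):]
    if not rest or rest[0] not in "./":
        return None  # e.g. "sysctlfoo=1" is not a sysctl parameter
    path = rest[1:].replace(".", "/")
    if not path:
        return None
    return "/proc/sys/" + path, value
```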
2010-09-08 16:54:17 +02:00
sysfs.deprecated=0|1 [KNL]
Enable/disable old style sysfs layout for old udev
on older distributions. When this option is enabled
very new udev will not work anymore. When this option
is disabled (or CONFIG_SYSFS_DEPRECATED not compiled),
older udev will not work anymore.
Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in
the kernel configuration.
2006-12-13 00:34:36 -08:00
sysrq_always_enabled
[KNL]
Ignore sysrq setting - this boot parameter will
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
2014-11-06 19:46:50 +01:00
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
ram pages. This is used to specify the TCP metrics
2020-04-28 00:01:49 +02:00
cache size. See Documentation/networking/ip-sysctl.rst
2014-11-06 19:46:50 +01:00
"tcp_no_metrics_save" section for more details.
2005-04-16 15:20:36 -07:00
tdfx= [HW,DRM]
2022-04-02 22:48:20 -07:00
test_suspend= [SUSPEND]
Format: { "mem" | "standby" | "freeze" }[,N]
2008-07-23 21:28:33 -07:00
Specify "mem" (for Suspend-to-RAM) or "standby" (for
2014-09-02 11:54:41 -07:00
standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup, with
the option to repeat it N times.
The system is woken from this state using a
wakeup-capable RTC alarm.
2008-07-23 21:28:33 -07:00
2005-04-16 15:20:36 -07:00
thash_entries= [KNL,NET]
Set number of hash buckets for TCP connection
2007-08-12 00:12:54 -04:00
thermal.act= [HW,ACPI]
-1: disable all active trip points in all thermal zones
<degrees C>: override all lowest active trip points
2007-08-14 15:49:32 -04:00
thermal.crt= [HW,ACPI]
-1: disable all critical trip points in all thermal zones
2008-10-17 02:41:20 -04:00
<degrees C>: override all critical trip points
2007-08-14 15:49:32 -04:00
2007-08-12 00:12:44 -04:00
thermal.nocrt= [HW,ACPI]
Set to disable actions on ACPI thermal zone
critical and hot trip points.
2007-08-12 00:12:17 -04:00
thermal.off= [HW,ACPI]
1: disable ACPI thermal control
2007-08-12 00:12:35 -04:00
thermal.psv= [HW,ACPI]
-1: disable all passive trip points
2008-12-19 10:57:32 -08:00
<degrees C>: override all passive trip points to this
value
2007-08-12 00:12:35 -04:00
ACPI: thermal: expose "thermal.tzp=" to set global polling frequency
Thermal Zone Polling frequency (_TZP) is an optional ACPI object
recommending the rate that the OS should poll the associated thermal zone.
If _TZP is 0, no polling should be used.
If _TZP is non-zero, then the platform recommends that
the OS poll the thermal zone at the specified rate.
The minimum period is 30 seconds.
The maximum period is 5 minutes.
(note _TZP and thermal.tzp units are in deci-seconds,
so _TZP = 300 corresponds to 30 seconds)
If _TZP is not present, ACPI 3.0b recommends that the
thermal zone be polled at an "OS provided default frequency".
However, common industry practice is:
1. The BIOS never specifies any _TZP
2. High volume OS's from this century never poll any thermal zones
I.e., the OS depends on the platform's ability to
provoke thermal events when necessary, and
the "OS provided default frequency" is "never":-)
There is a proposal that ACPI 4.0 be updated to reflect
common industry practice -- ie. no _TZP, no polling.
The Linux kernel already follows this practice --
thermal zones are not polled unless _TZP is present and non-zero.
But thermal zone polling is useful as a workaround for systems
which have ACPI thermal control, but have an issue preventing
thermal events. Indeed, some Linux distributions still
set a non-zero thermal polling frequency for this reason.
But rather than ask the user to write a polling frequency
into all the /proc/acpi/thermal_zone/*/polling_frequency
files, here we simply document and expose the already
existing module parameter to do the same at system level,
to simplify debugging those broken platforms.
Note that thermal.tzp is a module-load time parameter only.
Signed-off-by: Len Brown <len.brown@intel.com>
2007-08-12 00:12:26 -04:00
thermal.tzp= [HW,ACPI]
Specify global default ACPI thermal zone polling rate
<deci-seconds>: poll all thermal zones at this interval
0: no polling (default)
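The value is in deci-seconds, so _TZP = 300 means 30 seconds, as the commit message above notes. The range ACPI recommends, 30 seconds to 5 minutes, corresponds to 300 to 3000. Below is a small hypothetical helper for the unit conversion.

```python
from typing import Optional

# From the _TZP discussion above: 30 s minimum, 5 min maximum.
MIN_DECISECONDS = 300
MAX_DECISECONDS = 3000


def tzp_to_seconds(tzp: int) -> Optional[float]:
    """Convert thermal.tzp (deci-seconds) to seconds; 0 means no polling."""
    if tzp < 0:
        raise ValueError("thermal.tzp must not be negative")
    if tzp == 0:
        return None
    return tzp / 10


def within_acpi_recommendation(tzp: int) -> bool:
    return MIN_DECISECONDS <= tzp <= MAX_DECISECONDS
```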
2011-02-23 23:52:23 +00:00
threadirqs [KNL]
Force threading of all interrupt handlers except those
2012-02-15 00:26:42 +09:00
marked explicitly IRQF_NO_THREAD.
2011-02-23 23:52:23 +00:00
2008-12-25 13:39:23 +01:00
topology= [S390]
Format: {off | on}
Specify if the kernel should make use of the cpu
2011-04-04 15:04:46 -07:00
topology information if the hardware supports this.
The scheduler will make use of this information and
2008-12-25 13:39:23 +01:00
e.g. base its process migration decisions on it.
2010-10-25 16:10:43 +02:00
Default is on.
2008-12-25 13:39:23 +01:00
2014-10-10 09:04:49 -07:00
topology_updates= [KNL, PPC, NUMA]
Format: {off}
Specify if the kernel should ignore (off)
topology updates sent by the hypervisor to this
LPAR.
2019-12-06 15:02:59 -08:00
torture.disable_onoff_at_boot= [KNL]
Prevent the CPU-hotplug component of torturing
until after init has spawned.
2020-06-16 15:38:24 -07:00
torture.ftrace_dump_at_shutdown= [KNL]
Dump the ftrace buffer at torture-test shutdown,
even if there were no errors. This can be a
very costly operation when many torture tests
are running concurrently, especially on systems
with rotating-rust storage.
2020-11-25 13:00:04 -08:00
torture.verbose_sleep_frequency= [KNL]
Specifies how many verbose printk()s should be
emitted between each sleep. The default of zero
disables verbose-printk() sleeping.
torture.verbose_sleep_duration= [KNL]
Duration of each verbose-printk() sleep in jiffies.
2005-04-16 15:20:36 -07:00
tp720= [HW,PS2]
2010-03-25 00:55:32 -03:00
tpm_suspend_pcr=[HW,TPM]
Format: integer pcr id
Specify that at suspend time, the tpm driver
should extend the specified pcr with zeros,
as a workaround for some chips which fail to
flush the last written pcr on TPM_SaveState.
This will guarantee that all the other pcrs
are saved.
2022-04-02 22:48:22 -07:00
tp_printk [FTRACE]
2014-12-12 22:27:10 -05:00
Have the tracepoints sent to printk as well as the
tracing ring buffer. This is useful for early boot up
where the system hangs or reboots and does not give the
option for reading the tracing buffer or performing an
ftrace_dump_on_oops.
To turn off having tracepoints sent to printk,
echo 0 > /proc/sys/kernel/tracepoint_printk
Note, echoing 1 into this file without the
tracepoint_printk kernel cmdline option has no effect.
The tp_printk_stop_on_boot (see below) can also be used
to stop the printing of events to console at
late_initcall_sync.
** CAUTION **
Having tracepoints sent to printk() and activating high
frequency tracepoints such as irq or sched can cause
the system to livelock.
tp_printk_stop_on_boot [FTRACE]
When tp_printk (above) is set, it can cause a lot of noise
on the console. It may be useful to only include the
printing of events during boot up, as user space may
make the system inoperable.
This command line option will stop the printing of events
to console at the late_initcall_sync() time frame.
trace_buf_size=nn[KMG]
[FTRACE] Sets the tracing buffer size on each CPU.
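As an illustration of the nn[KMG] notation, the minimal Python sketch below converts such a value to a byte count, assuming binary multiples (K = 1024) and only the three suffixes listed here. parse_size() is a hypothetical helper, not the kernel's own parser, which may accept more forms.

```python
def parse_size(value: str) -> int:
    """Convert an nn[KMG] string to a byte count (binary multiples)."""
    multipliers = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)  # no suffix: plain bytes


# trace_buf_size=1M asks for about 1 MiB of buffer on each CPU.
print(parse_size("1M"))  # 1048576
```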
trace_clock= [FTRACE] Set the clock used for tracing events
at boot up.
local - Use the per CPU time stamp counter
(converted into nanoseconds). Fast, but
depending on the architecture, may not be
in sync between CPUs.
global - Event time stamps are synchronized across
CPUs. May be slower than the local clock,
but better for some race conditions.
counter - Simple counting of events (1, 2, ..)
note, some counts may be skipped due to the
infrastructure grabbing the clock more than
once per event.
uptime - Use jiffies as the time stamp.
perf - Use the same clock that perf uses.
mono - Use ktime_get_mono_fast_ns() for time stamps.
mono_raw - Use ktime_get_raw_fast_ns() for time
stamps.
boot - Use ktime_get_boot_fast_ns() for time stamps.
Architectures may add more clocks. See
Documentation/trace/ftrace.rst for more details.
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
trace_instance=[instance-info]
[FTRACE] Create a ring buffer instance early in boot up.
This will be listed in:
/sys/kernel/tracing/instances
Events can be enabled at the time the instance is created
via:
trace_instance=<name>,<system1>:<event1>,<system2>:<event2>
Note, the "<system*>:" portion is optional if the event is
unique.
trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall
will enable the "sched_switch" event (note, the "sched:" is optional, and
the same thing would happen if it was left off), the irq_handler_entry
event, and all events under the "initcall" system.
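The instance syntax can be illustrated with a small Python sketch that splits a trace_instance= value into the instance name and its event list. parse_trace_instance() is hypothetical; it keeps a bare token such as "initcall" with no system, because telling a single event apart from a whole system is left to the tracing core.

```python
def parse_trace_instance(value: str):
    """Split "name,sys:event,event,..." into (name, [(system, event), ...])."""
    name, *tokens = value.split(",")
    events = []
    for token in tokens:
        system, sep, event = token.partition(":")
        # Without a "system:" prefix the system is left unspecified (None).
        events.append((system, event) if sep else (None, token))
    return name, events


name, events = parse_trace_instance(
    "foo,sched:sched_switch,irq_handler_entry,initcall")
print(name)  # foo
print(events)
# [('sched', 'sched_switch'), (None, 'irq_handler_entry'), (None, 'initcall')]
```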
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
/sys/kernel/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
trace_options=stacktrace
See also Documentation/trace/ftrace.rst "trace options"
section.
trace_trigger=[trigger-list]
[FTRACE] Add an event trigger on specific events.
Set a trigger on top of a specific event, with an optional
filter.
The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
More than one trigger may be specified, separated by commas.
For example:
trace_trigger="sched_switch.stacktrace if prev_state == 2"
The above will enable the "stacktrace" trigger on the "sched_switch"
event but only trigger it if the "prev_state" of the "sched_switch"
event is "2" (TASK_UNINTERRUPTIBLE).
See also "Event triggers" in Documentation/trace/events.rst
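To make the trigger grammar concrete, here is a hypothetical Python sketch that splits a trace_trigger= value (after any surrounding quotes have been removed) into (event, trigger, filter) tuples. It assumes that no filter expression itself contains a comma.

```python
def parse_trace_triggers(value: str):
    """Split trace_trigger= into (event, trigger, filter-or-None) tuples."""
    result = []
    for entry in value.split(","):
        spec, _, flt = entry.strip().partition(" if ")
        event, _, trigger = spec.partition(".")
        result.append((event, trigger, flt.strip() or None))
    return result


print(parse_trace_triggers("sched_switch.stacktrace if prev_state == 2"))
# [('sched_switch', 'stacktrace', 'prev_state == 2')]
```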
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
be enabled again by echoing '1' into the "tracing_on"
file located in /sys/kernel/tracing/
This option is useful, as it disables the trace before
the WARNING dump is called, which prevents the trace from
being filled with content caused by the warning output.
This option can also be set at run time via the sysctl
option: kernel/traceoff_on_warning
transparent_hugepage=
[KNL]
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/admin-guide/mm/transhuge.rst
for more details.
trusted.source= [KEYS]
Format: <string>
This parameter identifies the trust source as a backend
for trusted keys implementation. Supported trust
sources:
- "tpm"
- "tee"
- "caam"
If not specified, the kernel iterates through the
trust source list starting with TPM and uses the first
trust source that initializes successfully as the
backend.
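The default selection order can be modelled as a first-success loop. In this Python sketch, init() is a hypothetical probe standing in for each backend's initialization, and an explicit trusted.source= value is assumed to restrict the choice to that single backend.

```python
from typing import Callable, Optional, Sequence

# Trust sources in the order given above; TPM is tried first.
TRUST_SOURCES = ("tpm", "tee", "caam")


def pick_trust_source(init: Callable[[str], bool],
                      requested: Optional[str] = None,
                      sources: Sequence[str] = TRUST_SOURCES) -> Optional[str]:
    """Return the first backend whose (hypothetical) init() succeeds."""
    candidates = [requested] if requested is not None else list(sources)
    for name in candidates:
        if init(name):
            return name
    return None  # no usable trust source


# On a machine without a TPM but with a working TEE backend:
print(pick_trust_source(lambda name: name == "tee"))  # tee
```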
trusted.rng= [KEYS]
Format: <string>
The RNG used to generate key material for trusted keys.
Can be one of:
- "kernel"
- the same value as trusted.source: "tpm" or "tee"
- "default"
If not specified, "default" is used. In this case,
the RNG's choice is left to each individual trust source.
tsc= Disable clocksource stability checks for TSC.
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
disables clocksource verification at runtime, as well
as the stability checks done at bootup. Used to enable
high-resolution timer mode on older hardware, and in
virtualized environments.
[x86] noirqtime: Do not use TSC to do irq accounting.
Used to disable IRQ_TIME_ACCOUNTING at run time on
platforms where RDTSC is slow and this accounting
can add overhead.
[x86] unstable: mark the TSC clocksource as unstable, this
marks the TSC unconditionally unstable at bootup and
avoids any further wobbles once the TSC watchdog notices.
[x86] nowatchdog: disable clocksource watchdog. Used
in situations with strict latency requirements (where
interruptions from clocksource watchdog are not
acceptable).
[x86] recalibrate: force recalibration against a HW timer
(HPET or PM timer) on systems whose TSC frequency was
obtained from HW or FW using either an MSR or CPUID(0x15).
Warn if the difference is more than 500 ppm.
[x86] watchdog: Use TSC as the watchdog clocksource with
which to check other HW timers (HPET or PM timer), but
only on systems where TSC has been deemed trustworthy.
This will be suppressed by an earlier tsc=nowatchdog and
can be overridden by a later tsc=nowatchdog. A console
message will flag any such suppression or overriding.
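The 500 ppm limit used by tsc=recalibrate is a relative tolerance. The Python sketch below shows the arithmetic, assuming the HW-timer measurement is the reference value; the function names are illustrative and the kernel's exact computation may differ.

```python
def ppm_difference(reported_khz: float, measured_khz: float) -> float:
    """Relative difference in parts per million, against the measured value."""
    return abs(reported_khz - measured_khz) / measured_khz * 1_000_000


def recalibration_warns(reported_khz: float, measured_khz: float,
                        limit_ppm: float = 500.0) -> bool:
    """True if the two frequencies disagree by more than limit_ppm."""
    return ppm_difference(reported_khz, measured_khz) > limit_ppm


# A TSC reported at 2,401,500 kHz but measured at 2,400,000 kHz
# is off by 625 ppm, which exceeds the 500 ppm limit.
print(round(ppm_difference(2_401_500, 2_400_000)))  # 625
print(recalibration_warns(2_401_500, 2_400_000))    # True
```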
tsc_early_khz= [X86] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
procedure is not reliable, such as on overclocked systems
with CPUID.16h support and partial CPUID.15h support.
Format: <unsigned int>
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
support TSX control.
This parameter controls the TSX feature. The options are:
on - Enable TSX on the system. Although there are
mitigations for all known security vulnerabilities,
TSX has been known to be an accelerator for
several previous speculation-related CVEs, and
so there may be unknown security risks associated
with leaving it enabled.
off - Disable TSX on the system. (Note that this
option takes effect only on newer CPUs which are
not vulnerable to MDS, i.e., have
MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get
the new IA32_TSX_CTRL MSR through a microcode
update. This new MSR allows for the reliable
deactivation of the TSX functionality.)
auto - Disable TSX if X86_BUG_TAA is present,
otherwise enable TSX on the system.
Not specifying this option is equivalent to tsx=off.
See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
for more details.
tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
Abort (TAA) vulnerability.
Similar to Micro-architectural Data Sampling (MDS),
certain CPUs that support Transactional
Synchronization Extensions (TSX) are vulnerable to an
exploit against CPU internal buffers which can forward
information to a disclosure gadget under certain
conditions.
In vulnerable processors, the speculatively forwarded
data can be used in a cache side channel attack, to
access data to which the attacker does not have direct
access.
This parameter controls the TAA mitigation. The
options are:
full - Enable TAA mitigation on vulnerable CPUs
if TSX is enabled.
full,nosmt - Enable TAA mitigation and disable SMT on
vulnerable CPUs. If TSX is disabled, SMT
is not disabled because the CPU is not
vulnerable to cross-thread TAA attacks.
off - Unconditionally disable TAA mitigation
On MDS-affected machines, tsx_async_abort=off can be
prevented by an active MDS mitigation, as both vulnerabilities
are mitigated with the same mechanism; to disable
this mitigation, you need to specify mds=off too.
Not specifying this option is equivalent to
tsx_async_abort=full. On CPUs which are MDS affected
and deploy MDS mitigation, TAA mitigation is not
required and doesn't provide any additional
mitigation.
For details see:
Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
Format:
<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
See also Documentation/input/devices/joystick-parport.rst
udbg-immortal [PPC] When debugging early kernel crashes that
happen after console_init() and before a proper
console driver takes over, this boot option might
help "seeing" what's going on.
uhash_entries= [KNL,NET]
Set number of hash buckets for UDP/UDP-Lite connections
uhci-hcd.ignore_oc=
[USB] Ignore overcurrent events (default N).
Some badly-designed motherboards generate lots of
bogus events, for ports that aren't wired to
anything. Set this parameter to avoid log spamming.
Note that genuine overcurrent events won't be
reported either.
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
usbcore.authorized_default=
[USB] Default USB device authorization:
(default -1 = authorized except for wireless USB,
0 = not authorized, 1 = authorized, 2 = authorized
if device connected to internal port)
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2). This
is the time required before an idle device will be
autosuspended. Devices for which the delay is set
to a negative value won't be autosuspended at all.
usbcore.usbfs_snoop=
[USB] Set to log all usbfs traffic (default 0 = off).
usbcore.usbfs_snoop_max=
[USB] Maximum number of bytes to snoop in each URB
(default = 65536).
usbcore.blinkenlights=
[USB] Set to cycle leds on hubs (default 0 = off).
usbcore.old_scheme_first=
[USB] Start with the old device initialization
scheme (default 0 = off).
usbcore.usbfs_memory_mb=
[USB] Memory limit (in MB) for buffers allocated by
usbfs (default = 16, 0 = max = 2047).
usbcore.use_both_schemes=
[USB] Try the other device initialization scheme
if the first one fails (default 1 = enabled).
usbcore.initial_descriptor_timeout=
[USB] Specifies timeout for the initial 64-byte
USB_REQ_GET_DESCRIPTOR request in milliseconds
(default 5000 = 5.0 seconds).
usbcore.nousb [USB] Disable the USB subsystem
usbcore.quirks=
[USB] A list of quirk entries to augment the built-in
usb core quirk list. List entries are separated by
commas. Each entry has the form
VendorID:ProductID:Flags. The IDs are 4-digit hex
numbers and Flags is a set of letters. Each letter
will change the built-in quirk; setting it if it is
clear and clearing it if it is set. The letters have
the following meanings:
a = USB_QUIRK_STRING_FETCH_255 (string
descriptors must not be fetched using
a 255-byte read);
b = USB_QUIRK_RESET_RESUME (device can't resume
correctly so reset it instead);
c = USB_QUIRK_NO_SET_INTF (device can't handle
Set-Interface requests);
d = USB_QUIRK_CONFIG_INTF_STRINGS (device can't
handle its Configuration or Interface
strings);
e = USB_QUIRK_RESET (device can't be reset
(e.g. morph devices), don't use reset);
f = USB_QUIRK_HONOR_BNUMINTERFACES (device has
more interface descriptions than the
bNumInterfaces count, and can't handle
talking to these interfaces);
g = USB_QUIRK_DELAY_INIT (device needs a pause
during initialization, after we read
the device descriptor);
h = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL (For
high speed and super speed interrupt
endpoints, the USB 2.0 and USB 3.0 spec
require the interval in microframes (1
microframe = 125 microseconds) to be
calculated as interval = 2 ^
(bInterval-1).
Devices with this quirk report their
bInterval as the result of this
calculation instead of the exponent
variable used in the calculation);
i = USB_QUIRK_DEVICE_QUALIFIER (device can't
handle device_qualifier descriptor
requests);
j = USB_QUIRK_IGNORE_REMOTE_WAKEUP (device
generates spurious wakeup, ignore
remote wakeup capability);
k = USB_QUIRK_NO_LPM (device can't handle Link
Power Management);
l = USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL
(Device reports its bInterval as linear
frames instead of the USB 2.0
calculation);
m = USB_QUIRK_DISCONNECT_SUSPEND (Device needs
to be disconnected before suspend to
prevent spurious wakeup);
n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a
pause after every control message);
o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra
delay after resetting its port);
Example: quirks=0781:5580:bk,0a5c:5834:gij
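A hypothetical Python sketch of how a usbcore.quirks= string breaks down, using the letter table above. The toggle rule ("setting it if it is clear and clearing it if it is set") is a symmetric difference with the built-in quirk set. Unknown letters are simply skipped here, which may not match how the kernel treats malformed input.

```python
# Letter -> quirk name, transcribed from the table above.
QUIRK_LETTERS = {
    "a": "USB_QUIRK_STRING_FETCH_255",
    "b": "USB_QUIRK_RESET_RESUME",
    "c": "USB_QUIRK_NO_SET_INTF",
    "d": "USB_QUIRK_CONFIG_INTF_STRINGS",
    "e": "USB_QUIRK_RESET",
    "f": "USB_QUIRK_HONOR_BNUMINTERFACES",
    "g": "USB_QUIRK_DELAY_INIT",
    "h": "USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL",
    "i": "USB_QUIRK_DEVICE_QUALIFIER",
    "j": "USB_QUIRK_IGNORE_REMOTE_WAKEUP",
    "k": "USB_QUIRK_NO_LPM",
    "l": "USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL",
    "m": "USB_QUIRK_DISCONNECT_SUSPEND",
    "n": "USB_QUIRK_DELAY_CTRL_MSG",
    "o": "USB_QUIRK_HUB_SLOW_RESET",
}


def parse_usb_quirks(value: str) -> dict:
    """Map (vendor_id, product_id) -> set of quirk names to toggle."""
    entries = {}
    for entry in value.split(","):
        vid, pid, flags = entry.split(":")
        entries[(int(vid, 16), int(pid, 16))] = {
            QUIRK_LETTERS[ch] for ch in flags if ch in QUIRK_LETTERS
        }
    return entries


def apply_toggles(built_in: set, toggles: set) -> set:
    """Set each listed quirk if it is clear, clear it if it is set."""
    return built_in ^ toggles


quirks = parse_usb_quirks("0781:5580:bk,0a5c:5834:gij")
print(sorted(quirks[(0x0781, 0x5580)]))
# ['USB_QUIRK_NO_LPM', 'USB_QUIRK_RESET_RESUME']
```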
usbhid.mousepoll=
[USBHID] The interval which mice are to be polled at.
usbhid.jspoll=
[USBHID] The interval which joysticks are to be polled at.
usbhid.kbpoll=
[USBHID] The interval which keyboards are to be polled at.
usb-storage.delay_use=
[UMS] The delay in seconds before a new device is
scanned for Logical Units (default 1).
usb-storage.quirks=
[UMS] A list of quirks entries to supplement or
override the built-in unusual_devs list. List
entries are separated by commas. Each entry has
the form VID:PID:Flags where VID and PID are Vendor
and Product ID values (4-digit hex numbers) and
Flags is a set of characters, each corresponding
to a common usb-storage quirk flag as follows:
a = SANE_SENSE (collect more than 18 bytes
of sense data, not on uas);
b = BAD_SENSE (don't collect more than 18
bytes of sense data, not on uas);
c = FIX_CAPACITY (decrease the reported
device capacity by one sector);
d = NO_READ_DISC_INFO (don't use
READ_DISC_INFO command, not on uas);
e = NO_READ_CAPACITY_16 (don't use
READ_CAPACITY_16 command);
f = NO_REPORT_OPCODES (don't use report opcodes
command, uas only);
g = MAX_SECTORS_240 (don't transfer more than
240 sectors at a time, uas only);
h = CAPACITY_HEURISTICS (decrease the
reported device capacity by one
sector if the number is odd);
i = IGNORE_DEVICE (don't bind to this
device);
j = NO_REPORT_LUNS (don't use report luns
command, uas only);
k = NO_SAME (do not use WRITE_SAME, uas only);
l = NOT_LOCKABLE (don't try to lock and
unlock ejectable media, not on uas);
m = MAX_SECTORS_64 (don't transfer more
than 64 sectors = 32 KB at a time,
not on uas);
n = INITIAL_READ10 (force a retry of the
initial READ(10) command, not on uas);
o = CAPACITY_OK (accept the capacity
reported by the device, not on uas);
p = WRITE_CACHE (the device cache is ON
by default, not on uas);
r = IGNORE_RESIDUE (the device reports
bogus residue values, not on uas);
s = SINGLE_LUN (the device has only one
Logical Unit);
t = NO_ATA_1X (don't allow ATA(12) and ATA(16)
commands, uas only);
u = IGNORE_UAS (don't bind to the uas driver);
w = NO_WP_DETECT (don't test whether the
medium is write-protected);
y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE
even if the device claims no cache,
not on uas).
Example: quirks=0419:aaf5:rl,0421:0433:rc
user_debug= [KNL,ARM]
Format: <int>
See arch/arm/Kconfig.debug help text.
1 - undefined instruction events
2 - system calls
4 - invalid data aborts
8 - SIGSEGV faults
16 - SIGBUS faults
Example: user_debug=31
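Since user_debug= is a bitmask, the values above combine with bitwise OR (user_debug=31 sets all five bits). A short Python sketch that decodes a mask; the labels are copied from the list above and decode_user_debug() is a hypothetical helper.

```python
USER_DEBUG_BITS = {
    1: "undefined instruction events",
    2: "system calls",
    4: "invalid data aborts",
    8: "SIGSEGV faults",
    16: "SIGBUS faults",
}


def decode_user_debug(mask: int) -> list:
    """List the event classes enabled by a user_debug= bitmask."""
    return [label for bit, label in USER_DEBUG_BITS.items() if mask & bit]


# Logging only SIGSEGV and SIGBUS faults corresponds to user_debug=24.
print(decode_user_debug(8 | 16))  # ['SIGSEGV faults', 'SIGBUS faults']
```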
userpte=
[X86] Flags controlling user PTE allocations.
nohigh = do not allocate PTE pages in
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default)
vdso=0: disable VDSO mapping
vdso32= [X86] Control the 32-bit vDSO
vdso32=1: enable 32-bit VDSO
vdso32=0 or vdso32=2: disable 32-bit VDSO
See the help text for CONFIG_COMPAT_VDSO for more
details. If CONFIG_COMPAT_VDSO is set, the default is
vdso32=0; otherwise, the default is vdso32=1.
For compatibility with older kernels, vdso32=2 is an
alias for vdso32=0.
Try vdso32=0 if you encounter an error that says:
dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed!
vector= [IA-64,SMP]
vector=percpu: enable percpu vector domain
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst.
video.brightness_switch_enabled= [ACPI]
Format: [0|1]
If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through
the allocated input device. If set to 0, video driver
will only send out the event without touching backlight
brightness level.
default: 1
virtio_mmio.device=
[VMMIO] Memory mapped virtio (platform) device.
<size>@<baseaddr>:<irq>[:<id>]
where:
<size> := size (can use standard suffixes
like K, M and G)
<baseaddr> := physical base address
<irq> := interrupt number (as passed to
request_irq())
<id> := (optional) platform device id
example:
virtio_mmio.device=1K@0x100b0000:48:7
Can be used multiple times for multiple devices.
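The device string can be checked with a hypothetical Python parser like the one below. It assumes a decimal size with an optional K, M or G suffix (binary multiples), a hex or decimal base address, and decimal IRQ and id values; the kernel's own parsing may accept more forms.

```python
import re
from typing import NamedTuple, Optional

_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_PATTERN = re.compile(
    r"^(?P<size>\d+)(?P<suffix>[KMG]?)@(?P<base>0x[0-9a-fA-F]+|\d+)"
    r":(?P<irq>\d+)(?::(?P<id>\d+))?$")


class VirtioMmioDevice(NamedTuple):
    size: int
    base: int
    irq: int
    dev_id: Optional[int]  # None when the optional :<id> is omitted


def parse_virtio_mmio(value: str) -> VirtioMmioDevice:
    """Parse <size>@<baseaddr>:<irq>[:<id>] into its fields."""
    m = _PATTERN.match(value)
    if m is None:
        raise ValueError(f"malformed virtio_mmio.device value: {value!r}")
    dev_id = m.group("id")
    return VirtioMmioDevice(
        size=int(m.group("size")) * _SUFFIX[m.group("suffix")],
        base=int(m.group("base"), 0),  # base 0 accepts "0x..." or decimal
        irq=int(m.group("irq")),
        dev_id=int(dev_id) if dev_id is not None else None,
    )


dev = parse_virtio_mmio("1K@0x100b0000:48:7")
print(dev.size, hex(dev.base), dev.irq, dev.dev_id)  # 1024 0x100b0000 48 7
```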
vga= [BOOT,X86-32] Select a particular video mode
See Documentation/x86/boot.rst and
Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
passed to the kernel using a special protocol.
vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y.
May slow down system boot speed, especially when
enabled on systems with a large amount of memory.
All options are enabled by default, and this
interface is meant to allow for selectively
enabling or disabling specific virtual memory
debugging features.
Available options are:
P Enable page structure init time poisoning
- Disable all of the above options
vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
size of <nn>. This can be used to increase the
minimum size (128MB on x86). It can also be used to
decrease the size and leave more room for directly
mapped kernel RAM.
vmcp_cma=nn[MG] [KNL,S390]
Sets the memory size reserved for contiguous memory
allocations for the vmcp device driver.
vmhalt= [KNL,S390] Perform z/VM CP command after system halt.
Format: <command>
vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic.
Format: <command>
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
vsyscall= [X86-64]
Controls the behavior of vsyscalls (i.e. calls to
fixed addresses of 0xffffffffff600x00 from legacy
code). Most statically-linked binaries and older
versions of glibc use these calls. Because these
functions are at fixed addresses, they make nice
targets for exploits that can control RIP.
emulate Vsyscalls turn into traps and are emulated
reasonably safely. The vsyscall page is
readable.
xonly [default] Vsyscalls turn into traps and are
emulated reasonably safely. The vsyscall
page is not readable.
none Vsyscalls don't work at all. This makes
them quite hard to use for exploits but
might break your system.
vt.color= [VT] Default text color.
Format: 0xYX, X = foreground, Y = background.
Default: 0x07 = light gray on black.
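The 0xYX layout packs the background in the high nibble and the foreground in the low nibble. A minimal Python sketch for splitting and building such values (the function names are hypothetical):

```python
def split_vt_color(value: int):
    """Return (foreground, background) from a 0xYX vt.color value."""
    return value & 0x0F, (value >> 4) & 0x0F


def make_vt_color(foreground: int, background: int) -> int:
    """Pack foreground (X) and background (Y) into 0xYX."""
    return ((background & 0x0F) << 4) | (foreground & 0x0F)


print(split_vt_color(0x07))  # (7, 0): light gray on black
print(hex(make_vt_color(foreground=7, background=2)))  # 0x27
```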
vt.cur_default= [VT] Default cursor shape.
Format: 0xCCBBAA, where AA, BB, and CC are the same as
the parameters of the <Esc>[?A;B;Cc escape sequence;
see VGA-softcursor.txt. Default: 2 = underline.
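The 0xCCBBAA value packs the three escape-sequence parameters into one integer, with AA in the lowest byte. A hypothetical Python sketch that unpacks it and rebuilds the matching <Esc>[?A;B;Cc sequence:

```python
def split_cur_default(value: int):
    """Return (A, B, C) from a vt.cur_default value laid out as 0xCCBBAA."""
    return value & 0xFF, (value >> 8) & 0xFF, (value >> 16) & 0xFF


def cursor_escape(value: int) -> str:
    """Build the <Esc>[?A;B;Cc sequence carrying the same parameters."""
    a, b, c = split_cur_default(value)
    return f"\x1b[?{a};{b};{c}c"


print(split_cur_default(2))  # (2, 0, 0), the default underline cursor
```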
vt.default_blu= [VT]
Format: <blue0>,<blue1>,<blue2>,...,<blue15>
Change the default blue palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_grn= [VT]
Format: <green0>,<green1>,<green2>,...,<green15>
Change the default green palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_red= [VT]
Format: <red0>,<red1>,<red2>,...,<red15>
Change the default red palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_utf8=
[VT]
Format=<0|1>
Set system-wide default UTF-8 mode for all ttys.
Default is 1, i.e. UTF-8 mode is enabled for all
newly opened terminals.
vt.global_cursor_default=
[VT]
Format=<-1|0|1>
Set system-wide default for whether a cursor
is shown on new VTs. Default is -1,
i.e. cursors will be created by default unless
overridden by individual drivers. 0 will hide
cursors, 1 will display them.
vt.italic= [VT] Default color for italic text; 0-15.
Default: 2 = green.
vt.underline= [VT] Default color for underlined text; 0-15.
Default: 3 = cyan.
watchdog timers [HW,WDT] For information on watchdog timers,
see Documentation/watchdog/watchdog-parameters.rst
or other driver-specific files in the
Documentation/watchdog/ directory.
watchdog_thresh=
[KNL]
Set the hard lockup detector stall duration
threshold in seconds. The soft lockup detector
threshold is set to twice the value. A value of 0
disables both lockup detectors. Default is 10
seconds.
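The relationship between the two detector thresholds can be written down directly. This Python sketch models only the arithmetic stated above (the soft threshold is twice the hard one, and 0 disables both); lockup_thresholds() is a hypothetical name.

```python
from typing import NamedTuple


class LockupThresholds(NamedTuple):
    hard_lockup_s: int
    soft_lockup_s: int
    enabled: bool


def lockup_thresholds(watchdog_thresh: int = 10) -> LockupThresholds:
    """Derive both detector thresholds from watchdog_thresh= (default 10)."""
    if watchdog_thresh == 0:
        # 0 disables both the hard and the soft lockup detector.
        return LockupThresholds(0, 0, False)
    return LockupThresholds(watchdog_thresh, 2 * watchdog_thresh, True)


print(lockup_thresholds())
# LockupThresholds(hard_lockup_s=10, soft_lockup_s=20, enabled=True)
```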
workqueue.watchdog_thresh=
If CONFIG_WQ_WATCHDOG is configured, workqueue can
warn about stall conditions and dump internal state to
help debugging. 0 disables workqueue stall
detection; otherwise, it's the stall threshold
duration in seconds. The default value is 30 and
it can be updated at runtime by writing to the
corresponding sysfs file.
workqueue.disable_numa
By default, all work items queued to unbound
workqueues are affine to the NUMA nodes they're
issued on, which results in better behavior in
general. If NUMA affinity needs to be disabled for
whatever reason, this option can be used. Note
that this also can be controlled per-workqueue for
workqueues visible under /sys/bus/workqueue/.
workqueue.power_efficient
Per-cpu workqueues are generally preferred because
they show better performance thanks to cache
locality; unfortunately, per-cpu workqueues tend to
be more power hungry than unbound workqueues.
Enabling this makes the per-cpu workqueues which
were observed to contribute significantly to power
consumption unbound, leading to measurably lower
power usage at the cost of small performance
overhead.
The default value of this parameter is determined by
the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
workqueue.debug_force_rr_cpu
Workqueue used to implicitly guarantee that work
items queued without explicit CPU specified are put
on the local CPU. This guarantee is no longer true:
while the local CPU is still preferred, work items
may be put on foreign CPUs. This debug option
forces round-robin CPU selection to flush out
usages which depend on the now broken guarantee.
When enabled, memory and cache locality will be
impacted.
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
xen_512gb_limit [KNL,X86-64,XEN]
Restricts the kernel running paravirtualized under Xen
to use only up to 512 GB of RAM. The reason to do so is
that crash analysis tools and Xen tools for domain
save/restore/migration must be enabled to handle larger
domains.
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
ide-disks -- unplug primary master IDE devices
aux-ide-disks -- unplug non-primary-master IDE devices
nics -- unplug network devices
all -- unplug all emulated devices (NICs and IDE disks)
unnecessary -- unplugging emulated devices is
unnecessary even if the host did not respond to
the unplug protocol
never -- do not unplug even if version check succeeds
xen_legacy_crash [X86,XEN]
Crash from the Xen panic notifier, without executing late
panic() code such as the dumping handler.
xen_msr_safe= [X86,XEN]
Format: <bool>
Select whether to always use non-faulting (safe) MSR
access functions when running as Xen PV guest. The
default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
xen_nopvspin [X86,XEN]
Disables the qspinlock slowpath using Xen PV optimizations.
This parameter is obsoleted by the "nopvspin" parameter, which
has an equivalent effect on the Xen platform.
xen_nopv [X86]
Disables the PV optimizations, forcing the HVM guest to
run as a generic HVM guest with no PV drivers.
This option is obsoleted by the "nopv" option, which
has an equivalent effect on the Xen platform.
xen_no_vector_callback
[KNL,X86,XEN] Disable the vector callback for Xen
event channel interrupts.
xen_scrub_pages= [XEN]
Boolean option to control scrubbing pages before giving them back
to Xen, for use by other domains. Can also be changed at runtime
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
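For the runtime path mentioned above, here is a minimal sketch of reading and toggling the setting from a script. It assumes that the attribute reads back as "0" or "1" and accepts "0"/"1" on write, and that the caller has root privileges. get_scrub_pages and set_scrub_pages are hypothetical helper names. The path parameter exists so the helpers can be exercised against an ordinary file.

```python
from pathlib import Path

# Runtime knob named in the entry above; writing to it requires root.
SCRUB_PAGES = Path("/sys/devices/system/xen_memory/xen_memory0/scrub_pages")


def get_scrub_pages(path=SCRUB_PAGES):
    """Return True if page scrubbing is enabled (assumes "0"/"1" contents)."""
    return path.read_text().strip() == "1"


def set_scrub_pages(enabled, path=SCRUB_PAGES):
    """Enable or disable scrubbing at runtime by writing "1" or "0"."""
    path.write_text("1\n" if enabled else "0\n")
```

A runtime change made this way does not persist across reboots. The boot-time value still comes from xen_scrub_pages= or CONFIG_XEN_SCRUB_PAGES_DEFAULT.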
xen_timer_slop= [X86-64,XEN]
Set the timer slop (in nanoseconds) for the virtual Xen
timers (default is 100000). This adjusts the minimum
delta of virtualized Xen timers, where lower values
improve timer resolution at the expense of processing
more timer interrupts.
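To make the "minimum delta" wording concrete, the sketch below models the slop as a floor on how soon a virtual timer may be programmed to fire. Shorter requests are rounded up to the slop, which is how a lower slop improves resolution and a higher one reduces interrupts. This is a simplified model, and effective_delta_ns is a hypothetical name. Only the 100000 ns default comes from the entry above.

```python
XEN_TIMER_SLOP_NS = 100_000  # documented default for xen_timer_slop=


def effective_delta_ns(requested_ns, slop_ns=XEN_TIMER_SLOP_NS):
    """Model: a requested delta below the slop is clamped up to the slop."""
    if requested_ns < 0:
        raise ValueError("delta must be non-negative")
    return max(requested_ns, slop_ns)


print(XEN_TIMER_SLOP_NS / 1000)           # 100.0 (microseconds)
print(effective_delta_ns(20_000))          # 100000
print(effective_delta_ns(20_000, 10_000))  # 20000
```

With the default, a 20 us timer effectively becomes a 100 us one. Booting with xen_timer_slop=10000 would let it fire at 20 us, at the cost of more interrupts.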
xen.balloon_boot_timeout= [XEN]
The time (in seconds) to wait before giving up on booting
in case the initial ballooning fails to free enough memory.
Applies only when running as an HVM or PVH guest that was
started with less memory configured than allowed at
max. Default is 180.
xen.event_eoi_delay= [XEN]
How long (in jiffies) to delay EOI handling in case
of event storms. Default is 10.
xen.event_loop_timeout= [XEN]
Time (in jiffies) after which the event handling loop
should start to delay EOI handling. Default is 2.
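Both values are in jiffies, so their wall-clock length depends on the kernel's CONFIG_HZ. The sketch below converts them to milliseconds and models how the two settings interact. It is a simplification, not the event-channel code: it ignores jiffies wraparound, and eoi_deadline is a hypothetical name. HZ=250 is just an example build setting.

```python
EVENT_EOI_DELAY = 10     # documented default, jiffies
EVENT_LOOP_TIMEOUT = 2   # documented default, jiffies


def jiffies_to_ms(jiffies, hz):
    """Convert a jiffies count to milliseconds for a kernel with CONFIG_HZ=hz."""
    return jiffies * 1000 / hz


def eoi_deadline(loop_start, now, loop_timeout=EVENT_LOOP_TIMEOUT,
                 eoi_delay=EVENT_EOI_DELAY):
    """Simplified model: EOI goes out at now while the handling loop is
    within loop_timeout; once the loop has run longer than that, EOI is
    postponed by eoi_delay jiffies. Jiffies wraparound is ignored."""
    if now - loop_start > loop_timeout:
        return now + eoi_delay
    return now


print(jiffies_to_ms(EVENT_EOI_DELAY, 250))     # 40.0
print(jiffies_to_ms(EVENT_LOOP_TIMEOUT, 250))  # 8.0
print(eoi_deadline(loop_start=100, now=103))   # 113
```

On an HZ=1000 kernel, the same defaults correspond to 10 ms and 2 ms.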
xen.fifo_events= [XEN]
Boolean parameter to disable using fifo event handling
even if available. Normally fifo event handling is
preferred over the 2-level event handling, as it is
fairer and the number of possible event channels is
much higher. Default is on (use fifo events).
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
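The format above is purely positional: six fields, followed by up to three optional extra IRQs. The sketch below maps such a value onto named fields following that Format line. Requiring all six leading fields, and accepting only decimal or 0x-prefixed hex integers, are choices made for this sketch. parse_xirc2ps_cs is a hypothetical name, and the sample values are illustrative only.

```python
# Field order as given in the Format line above.
FIELDS = ("irq", "irq_mask", "io", "full_duplex", "do_sound", "lockup_hack")
MAX_EXTRA_IRQS = 3  # optional irq2, irq3, irq4


def parse_xirc2ps_cs(value):
    """Map a positional xirc2ps_cs= value onto named fields."""
    parts = [p.strip() for p in value.split(",")]
    lo, hi = len(FIELDS), len(FIELDS) + MAX_EXTRA_IRQS
    if not lo <= len(parts) <= hi:
        raise ValueError(f"expected {lo} to {hi} fields, got {len(parts)}")
    nums = [int(p, 0) for p in parts]  # decimal or 0x-prefixed hex
    opts = dict(zip(FIELDS, nums))
    opts["extra_irqs"] = nums[len(FIELDS):]
    return opts


print(parse_xirc2ps_cs("5,0xffff,0x2e0,1,0,0")["io"])  # 736
```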
xive= [PPC]
By default on POWER9 and above, the kernel will
natively use the XIVE interrupt controller. This option
allows the fallback firmware mode to be used:
off Fall back to firmware control of XIVE interrupt
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
xive.store-eoi=off [PPC]
By default on POWER10 and above, the kernel will use
stores for EOI handling when the XIVE interrupt mode
is active. This option allows the XIVE driver to use
loads instead, as on POWER9.
xhci-hcd.quirks [USB,KNL]
A hex value specifying a bitmask of supplemental xHCI
host controller quirks. The meaning of each bit can be
found in the header drivers/usb/host/xhci.h.
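Building the hex value is a matter of OR-ing together one bit per wanted quirk. The sketch below does that and also decodes a mask back into bit positions. Bit positions 0 and 3 are placeholders, not real quirk assignments; the actual meanings must be taken from drivers/usb/host/xhci.h. quirks_mask, decode_mask and format_param are hypothetical names.

```python
def quirks_mask(bits):
    """OR together 1 << n for each bit position n."""
    mask = 0
    for n in bits:
        if n < 0:
            raise ValueError(f"negative bit position: {n}")
        mask |= 1 << n
    return mask


def decode_mask(mask):
    """List the bit positions that are set in mask, lowest first."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


def format_param(mask):
    """Render the mask as a hex kernel command-line parameter."""
    return f"xhci-hcd.quirks={mask:#x}"


# Placeholder bit positions; look up real ones in xhci.h.
print(format_param(quirks_mask([0, 3])))  # xhci-hcd.quirks=0x9
```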
xmon [PPC]
Format: { early | on | rw | ro | off }
Controls if xmon debugger is enabled. Default is off.
Passing only "xmon" is equivalent to "xmon=early".
early Call xmon as early as possible on boot; xmon
debugger is called from setup_arch().
on xmon debugger hooks will be installed so xmon
is only called on a kernel crash. Default mode,
i.e. either "ro" or "rw" mode, is controlled
with CONFIG_XMON_DEFAULT_RO_MODE.
rw xmon debugger hooks will be installed so xmon
is called only on a kernel crash; mode is write,
meaning SPR registers, memory, and other data
can be written using xmon commands.
ro same as "rw" option above but SPR registers,
memory, and other data can't be written using
xmon commands.
off xmon is disabled.
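The mode table above can be summarised as a small decision function. In this model, a bare "xmon" maps to "early", "rw" and "ro" fix writability explicitly, and "on" falls back to the build default. The default_ro argument stands in for CONFIG_XMON_DEFAULT_RO_MODE. Applying that default to "early" as well is an assumption of this sketch, since the text only states it for "on". parse_xmon is a hypothetical name.

```python
MODES = ("early", "on", "rw", "ro", "off")


def parse_xmon(arg=None, default_ro=True):
    """Model the documented xmon= modes.

    arg is None for a bare "xmon"; default_ro stands in for
    CONFIG_XMON_DEFAULT_RO_MODE (assumed to apply to "early" and "on").
    """
    mode = "early" if arg is None else arg
    if mode not in MODES:
        raise ValueError(f"unknown xmon mode: {mode!r}")
    if mode == "off":
        return {"enabled": False, "early": False, "writable": False}
    if mode == "rw":
        writable = True
    elif mode == "ro":
        writable = False
    else:
        writable = not default_ro
    return {"enabled": True, "early": mode == "early", "writable": writable}


print(parse_xmon())  # {'enabled': True, 'early': True, 'writable': False}
```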
amd_pstate= [X86]
disable
Do not enable amd_pstate as the default
scaling driver for the supported processors
passive
Use amd_pstate as the scaling driver; the driver requests
a desired performance level on an abstract scale and the
power management firmware translates the requests into
actual hardware states (core frequency, data fabric and
memory clocks, etc.).
active
Use the amd_pstate_epp driver instance as the scaling driver.
The driver provides a hint to the CPPC firmware indicating
whether software wants to bias toward performance (0x0) or
energy efficiency (0xff). The CPPC power algorithm then
calculates the runtime workload and adjusts the core
frequency in real time.