From 4419470191386456e0b8ed4eb06a70b0021798a6 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:26:07 -0700 Subject: [PATCH 001/298] Documentation: Add documentation for Processor MMIO Stale Data Add the admin guide for Processor MMIO stale data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- Documentation/admin-guide/hw-vuln/index.rst | 1 + .../hw-vuln/processor_mmio_stale_data.rst | 246 ++++++++++++++++++ 2 files changed, 247 insertions(+) create mode 100644 Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 8cbc711cda93..4df436e7c417 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -17,3 +17,4 @@ are configurable at compile, boot or run time. special-register-buffer-data-sampling.rst core-scheduling.rst l1d_flush.rst + processor_mmio_stale_data.rst diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst new file mode 100644 index 000000000000..9393c50b5afc --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -0,0 +1,246 @@ +========================================= +Processor MMIO Stale Data Vulnerabilities +========================================= + +Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O +(MMIO) vulnerabilities that can expose data. The sequences of operations for +exposing data range from simple to very complex. Because most of the +vulnerabilities require the attacker to have access to MMIO, many environments +are not affected. System environments using virtualization where MMIO access is +provided to untrusted guests may need mitigation. These vulnerabilities are +not transient execution attacks. However, these vulnerabilities may propagate +stale data into core fill buffers where the data can subsequently be inferred +by an unmitigated transient execution attack. Mitigation for these +vulnerabilities includes a combination of microcode update and software +changes, depending on the platform and usage model. Some of these mitigations +are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or +those used to mitigate Special Register Buffer Data Sampling (SRBDS). + +Data Propagators +================ +Propagators are operations that result in stale data being copied or moved from +one microarchitectural buffer or register to another. Processor MMIO Stale Data +Vulnerabilities are operations that may result in stale data being directly +read into an architectural, software-visible state or sampled from a buffer or +register. + +Fill Buffer Stale Data Propagator (FBSDP) +----------------------------------------- +Stale data may propagate from fill buffers (FB) into the non-coherent portion +of the uncore on some non-coherent writes. Fill buffer propagation by itself +does not make stale data architecturally visible. Stale data must be propagated +to a location where it is subject to reading or sampling. + +Sideband Stale Data Propagator (SSDP) +------------------------------------- +The sideband stale data propagator (SSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. The sideband response buffer is +shared by all client cores. For non-coherent reads that go to sideband +destinations, the uncore logic returns 64 bytes of data to the core, including +both requested data and unrequested stale data, from a transaction buffer and +the sideband response buffer. As a result, stale data from the sideband +response and transaction buffers may now reside in a core fill buffer. + +Primary Stale Data Propagator (PSDP) +------------------------------------ +The primary stale data propagator (PSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. Similar to the sideband response +buffer, the primary response buffer is shared by all client cores. For some +processors, MMIO primary reads will return 64 bytes of data to the core fill +buffer including both requested data and unrequested stale data. This is +similar to the sideband stale data propagator. + +Vulnerabilities +=============== +Device Register Partial Write (DRPW) (CVE-2022-21166) +----------------------------------------------------- +Some endpoint MMIO registers incorrectly handle writes that are smaller than +the register size. Instead of aborting the write or only copying the correct +subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than +specified by the write transaction may be written to the register. On +processors affected by FBSDP, this may expose stale data from the fill buffers +of the core that created the write transaction. + +Shared Buffers Data Sampling (SBDS) (CVE-2022-21125) +---------------------------------------------------- +After propagators may have moved data around the uncore and copied stale data +into client core fill buffers, processors affected by MFBDS can leak data from +the fill buffer. It is limited to the client (including Intel Xeon server E3) +uncore implementation. + +Shared Buffers Data Read (SBDR) (CVE-2022-21123) +------------------------------------------------ +It is similar to Shared Buffer Data Sampling (SBDS) except that the data is +directly read into the architectural software-visible state. It is limited to +the client (including Intel Xeon server E3) uncore implementation. + +Affected Processors +=================== +Not all the CPUs are affected by all the variants. For instance, most +processors for the server market (excluding Intel Xeon E3 processors) are +impacted by only Device Register Partial Write (DRPW). + +Below is the list of affected Intel processors [#f1]_: + + =================== ============ ========= + Common name Family_Model Steppings + =================== ============ ========= + HASWELL_X 06_3FH 2,4 + SKYLAKE_L 06_4EH 3 + BROADWELL_X 06_4FH All + SKYLAKE_X 06_55H 3,4,6,7,11 + BROADWELL_D 06_56H 3,4,5 + SKYLAKE 06_5EH 3 + ICELAKE_X 06_6AH 4,5,6 + ICELAKE_D 06_6CH 1 + ICELAKE_L 06_7EH 5 + ATOM_TREMONT_D 06_86H All + LAKEFIELD 06_8AH 1 + KABYLAKE_L 06_8EH 9 to 12 + ATOM_TREMONT 06_96H 1 + ATOM_TREMONT_L 06_9CH 0 + KABYLAKE 06_9EH 9 to 13 + COMETLAKE 06_A5H 2,3,5 + COMETLAKE_L 06_A6H 0,1 + ROCKETLAKE 06_A7H 1 + =================== ============ ========= + +If a CPU is in the affected processor list, but not affected by a variant, it +is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later +section, mitigation largely remains the same for all the variants, i.e. to +clear the CPU fill buffers via VERW instruction. + +New bits in MSRs +================ +Newer processors and microcode update on existing affected processors added new +bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate +specific variants of Processor MMIO Stale Data vulnerabilities and mitigation +capability. + +MSR IA32_ARCH_CAPABILITIES +-------------------------- +Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the + Shared Buffers Data Read (SBDR) vulnerability or the sideband stale + data propagator (SSDP). +Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer + Stale Data Propagator (FBSDP). +Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data + Propagator (PSDP). +Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer + values as part of MD_CLEAR operations. Processors that do not + enumerate MDS_NO (meaning they are affected by MDS) but that do + enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate + FB_CLEAR as part of their MD_CLEAR support. +Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR + IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS + bit can be set to cause the VERW instruction to not perform the + FB_CLEAR action. Not all processors that support FB_CLEAR will support + FB_CLEAR_CTRL. + +MSR IA32_MCU_OPT_CTRL +--------------------- +Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR +action. This may be useful to reduce the performance impact of FB_CLEAR in +cases where system software deems it warranted (for example, when performance +is more critical, or the untrusted software has no MMIO access). Note that +FB_CLEAR_DIS has no impact on enumeration (for example, it does not change +FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors +that enumerate FB_CLEAR. + +Mitigation +========== +Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the +same mitigation strategy to force the CPU to clear the affected buffers before +an attacker can extract the secrets. + +This is achieved by using the otherwise unused and obsolete VERW instruction in +combination with a microcode update. The microcode clears the affected CPU +buffers when the VERW instruction is executed. + +Kernel reuses the MDS function to invoke the buffer clearing: + + mds_clear_cpu_buffers() + +On MDS affected CPUs, the kernel already invokes CPU buffer clear on +kernel/userspace, hypervisor/guest and C-state (idle) transitions. No +additional mitigation is needed on such CPUs. + +For CPUs not affected by MDS or TAA, mitigation is needed only for the attacker +with MMIO capability. Therefore, VERW is not required for kernel/userspace. For +virtualization case, VERW is only needed at VMENTER for a guest with MMIO +capability. + +Mitigation points +----------------- +Return to user space +^^^^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation +needed. + +C-State transition +^^^^^^^^^^^^^^^^^^ +Control register writes by CPU during C-state transition can propagate data +from fill buffer to uncore buffers. Execute VERW before C-state transition to +clear CPU fill buffers. + +Guest entry point +^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise +execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by +MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO +Stale Data vulnerabilities, so there is no need to execute VERW for such guests. + +Mitigation control on the kernel command line +--------------------------------------------- +The kernel command line allows to control the Processor MMIO Stale Data +mitigations at boot time with the option "mmio_stale_data=". The valid +arguments for this option are: + + ========== ================================================================= + full If the CPU is vulnerable, enable mitigation; CPU buffer clearing + on exit to userspace and when entering a VM. Idle transitions are + protected as well. It does not automatically disable SMT. + full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the + complete mitigation. + off Disables mitigation completely. + ========== ================================================================= + +If the CPU is affected and mmio_stale_data=off is not supplied on the kernel +command line, then the kernel selects the appropriate mitigation. + +Mitigation status information +----------------------------- +The Linux kernel provides a sysfs interface to enumerate the current +vulnerability status of the system: whether the system is vulnerable, and +which mitigations are active. The relevant sysfs file is: + + /sys/devices/system/cpu/vulnerabilities/mmio_stale_data + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable + * - 'Vulnerable' + - The processor is vulnerable, but no mitigation enabled + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The processor is vulnerable, but microcode is not updated. The + mitigation is enabled on a best effort basis. + * - 'Mitigation: Clear CPU buffers' + - The processor is vulnerable and the CPU buffer clearing mitigation is + enabled. + +If the processor is vulnerable then the following information is appended to +the above information: + + ======================== =========================================== + 'SMT vulnerable' SMT is enabled + 'SMT disabled' SMT is disabled + 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown + ======================== =========================================== + +References +---------- +.. [#f1] Affected Processors + https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html From 51802186158c74a0304f51ab963e7c2b3a2b046f Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:27:08 -0700 Subject: [PATCH 002/298] x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For more details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst Add the Processor MMIO Stale Data bug enumeration. A microcode update adds new bits to the MSR IA32_ARCH_CAPABILITIES, define them. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/msr-index.h | 19 +++++++++++ arch/x86/kernel/cpu/common.c | 43 ++++++++++++++++++++++-- tools/arch/x86/include/asm/cpufeatures.h | 1 + tools/arch/x86/include/asm/msr-index.h | 19 +++++++++++ 5 files changed, 81 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 73e643ae94b6..e17de69faa54 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -443,5 +443,6 @@ #define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */ #define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ +#define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index ee15311b6be1..12976405441b 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -114,6 +114,25 @@ * Not susceptible to * TSX Async Abort (TAA) vulnerabilities. */ +#define ARCH_CAP_SBDR_SSDP_NO BIT(13) /* + * Not susceptible to SBDR and SSDP + * variants of Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FBSDP_NO BIT(14) /* + * Not susceptible to FBSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_PSDP_NO BIT(15) /* + * Not susceptible to PSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FB_CLEAR BIT(17) /* + * VERW clears CPU fill buffer + * even on MDS_NO CPUs. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index e342ae4db3c4..f7757409e133 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1237,18 +1237,39 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = { X86_FEATURE_ANY, issues) #define SRBDS BIT(0) +/* CPU is affected by X86_BUG_MMIO_STALE_DATA */ +#define MMIO BIT(1) static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(IVYBRIDGE, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL_L, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL_G, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(HASWELL_X, BIT(2) | BIT(4), MMIO), + VULNBL_INTEL_STEPPINGS(BROADWELL_D, X86_STEPPINGS(0x3, 0x5), MMIO), VULNBL_INTEL_STEPPINGS(BROADWELL_G, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(BROADWELL_X, X86_STEPPING_ANY, MMIO), VULNBL_INTEL_STEPPINGS(BROADWELL, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(SKYLAKE_L, X86_STEPPINGS(0x3, 0x3), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(SKYLAKE_L, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(SKYLAKE_X, BIT(3) | BIT(4) | BIT(6) | + BIT(7) | BIT(0xB), MMIO), + VULNBL_INTEL_STEPPINGS(SKYLAKE, X86_STEPPINGS(0x3, 0x3), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(SKYLAKE, X86_STEPPING_ANY, SRBDS), - VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0xC), SRBDS), - VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0xD), SRBDS), + VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x9, 0xC), SRBDS | MMIO), + VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0x8), SRBDS), + VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x9, 0xD), SRBDS | MMIO), + VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0x8), SRBDS), + VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_D, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_X, X86_STEPPINGS(0x4, 0x6), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ROCKETLAKE, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_D, X86_STEPPING_ANY, MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO), {} }; @@ -1269,6 +1290,13 @@ u64 x86_read_arch_cap_msr(void) return ia32_cap; } +static bool arch_cap_mmio_immune(u64 ia32_cap) +{ + return (ia32_cap & ARCH_CAP_FBSDP_NO && + ia32_cap & ARCH_CAP_PSDP_NO && + ia32_cap & ARCH_CAP_SBDR_SSDP_NO); +} + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = x86_read_arch_cap_msr(); @@ -1328,6 +1356,17 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) cpu_matches(cpu_vuln_blacklist, SRBDS)) setup_force_cpu_bug(X86_BUG_SRBDS); + /* + * Processor MMIO Stale Data bug enumeration + * + * Affected CPU list is generally enough to enumerate the vulnerability, + * but for virtualization case check for ARCH_CAP MSR bits also, VMM may + * not want the guest to enumerate the bug. + */ + if (cpu_matches(cpu_vuln_blacklist, MMIO) && + !arch_cap_mmio_immune(ia32_cap)) + setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA); + if (cpu_matches(cpu_vuln_whitelist, NO_MELTDOWN)) return; diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h index 73e643ae94b6..e17de69faa54 100644 --- a/tools/arch/x86/include/asm/cpufeatures.h +++ b/tools/arch/x86/include/asm/cpufeatures.h @@ -443,5 +443,6 @@ #define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */ #define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ +#define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index ee15311b6be1..12976405441b 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -114,6 +114,25 @@ * Not susceptible to * TSX Async Abort (TAA) vulnerabilities. */ +#define ARCH_CAP_SBDR_SSDP_NO BIT(13) /* + * Not susceptible to SBDR and SSDP + * variants of Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FBSDP_NO BIT(14) /* + * Not susceptible to FBSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_PSDP_NO BIT(15) /* + * Not susceptible to PSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FB_CLEAR BIT(17) /* + * VERW clears CPU fill buffer + * even on MDS_NO CPUs. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* From f52ea6c26953fed339aa4eae717ee5c2133c7ff2 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:28:10 -0700 Subject: [PATCH 003/298] x86/speculation: Add a common function for MD_CLEAR mitigation update Processor MMIO Stale Data mitigation uses similar mitigation as MDS and TAA. In preparation for adding its mitigation, add a common function to update all mitigations that depend on MD_CLEAR. [ bp: Add a newline in md_clear_update_mitigation() to separate statements better. ] Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 59 +++++++++++++++++++++----------------- 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 6296e1ebed1d..e05d207e7ec9 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -41,7 +41,7 @@ static void __init spectre_v2_select_mitigation(void); static void __init ssb_select_mitigation(void); static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); -static void __init mds_print_mitigation(void); +static void __init md_clear_update_mitigation(void); static void __init taa_select_mitigation(void); static void __init srbds_select_mitigation(void); static void __init l1d_flush_select_mitigation(void); @@ -123,10 +123,10 @@ void __init check_bugs(void) l1d_flush_select_mitigation(); /* - * As MDS and TAA mitigations are inter-related, print MDS - * mitigation until after TAA mitigation selection is done. + * As MDS and TAA mitigations are inter-related, update and print their + * mitigation after TAA mitigation selection is done. */ - mds_print_mitigation(); + md_clear_update_mitigation(); arch_smt_update(); @@ -267,14 +267,6 @@ static void __init mds_select_mitigation(void) } } -static void __init mds_print_mitigation(void) -{ - if (!boot_cpu_has_bug(X86_BUG_MDS) || cpu_mitigations_off()) - return; - - pr_info("%s\n", mds_strings[mds_mitigation]); -} - static int __init mds_cmdline(char *str) { if (!boot_cpu_has_bug(X86_BUG_MDS)) @@ -329,7 +321,7 @@ static void __init taa_select_mitigation(void) /* TSX previously disabled by tsx=off */ if (!boot_cpu_has(X86_FEATURE_RTM)) { taa_mitigation = TAA_MITIGATION_TSX_DISABLED; - goto out; + return; } if (cpu_mitigations_off()) { @@ -343,7 +335,7 @@ static void __init taa_select_mitigation(void) */ if (taa_mitigation == TAA_MITIGATION_OFF && mds_mitigation == MDS_MITIGATION_OFF) - goto out; + return; if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) taa_mitigation = TAA_MITIGATION_VERW; @@ -375,18 +367,6 @@ static void __init taa_select_mitigation(void) if (taa_nosmt || cpu_mitigations_auto_nosmt()) cpu_smt_disable(false); - - /* - * Update MDS mitigation, if necessary, as the mds_user_clear is - * now enabled for TAA mitigation. - */ - if (mds_mitigation == MDS_MITIGATION_OFF && - boot_cpu_has_bug(X86_BUG_MDS)) { - mds_mitigation = MDS_MITIGATION_FULL; - mds_select_mitigation(); - } -out: - pr_info("%s\n", taa_strings[taa_mitigation]); } static int __init tsx_async_abort_parse_cmdline(char *str) @@ -410,6 +390,33 @@ static int __init tsx_async_abort_parse_cmdline(char *str) } early_param("tsx_async_abort", tsx_async_abort_parse_cmdline); +#undef pr_fmt +#define pr_fmt(fmt) "" fmt + +static void __init md_clear_update_mitigation(void) +{ + if (cpu_mitigations_off()) + return; + + if (!static_key_enabled(&mds_user_clear)) + goto out; + + /* + * mds_user_clear is now enabled. Update MDS mitigation, if + * necessary. + */ + if (mds_mitigation == MDS_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_MDS)) { + mds_mitigation = MDS_MITIGATION_FULL; + mds_select_mitigation(); + } +out: + if (boot_cpu_has_bug(X86_BUG_MDS)) + pr_info("MDS: %s\n", mds_strings[mds_mitigation]); + if (boot_cpu_has_bug(X86_BUG_TAA)) + pr_info("TAA: %s\n", taa_strings[taa_mitigation]); +} + #undef pr_fmt #define pr_fmt(fmt) "SRBDS: " fmt From 8cb861e9e3c9a55099ad3d08e1a3b653d29c33ca Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:29:11 -0700 Subject: [PATCH 004/298] x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- .../admin-guide/kernel-parameters.txt | 36 ++++++ arch/x86/include/asm/nospec-branch.h | 2 + arch/x86/kernel/cpu/bugs.c | 111 +++++++++++++++++- arch/x86/kvm/vmx/vmx.c | 3 + 4 files changed, 148 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3f1cc5e317ed..c4893782055b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3105,6 +3105,7 @@ kvm.nx_huge_pages=off [X86] no_entry_flush [PPC] no_uaccess_flush [PPC] + mmio_stale_data=off [X86] Exceptions: This does not have any effect on @@ -3126,6 +3127,7 @@ Equivalent to: l1tf=flush,nosmt [X86] mds=full,nosmt [X86] tsx_async_abort=full,nosmt [X86] + mmio_stale_data=full,nosmt [X86] mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this @@ -3135,6 +3137,40 @@ log everything. Information is printed at KERN_DEBUG so loglevel=8 may also need to be specified. + mmio_stale_data= + [X86,INTEL] Control mitigation for the Processor + MMIO Stale Data vulnerabilities. + + Processor MMIO Stale Data is a class of + vulnerabilities that may expose data after an MMIO + operation. Exposed data could originate or end in + the same CPU buffers as affected by MDS and TAA. + Therefore, similar to MDS and TAA, the mitigation + is to clear the affected CPU buffers. + + This parameter controls the mitigation. The + options are: + + full - Enable mitigation on vulnerable CPUs + + full,nosmt - Enable mitigation and disable SMT on + vulnerable CPUs. + + off - Unconditionally disable mitigation + + On MDS or TAA affected machines, + mmio_stale_data=off can be prevented by an active + MDS or TAA mitigation as these vulnerabilities are + mitigated with the same mechanism so in order to + disable this mitigation, you need to specify + mds=off and tsx_async_abort=off too. + + Not specifying this option is equivalent to + mmio_stale_data=full. + + For details see: + Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + module.sig_enforce [KNL] When CONFIG_MODULE_SIG is set, this means that modules without (valid) signatures will fail to load. diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index acbaeaf83b61..da251a5645b0 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -269,6 +269,8 @@ DECLARE_STATIC_KEY_FALSE(mds_idle_clear); DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); +DECLARE_STATIC_KEY_FALSE(mmio_stale_data_clear); + #include /** diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index e05d207e7ec9..7b01ba9bc701 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -43,6 +43,7 @@ static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); static void __init md_clear_update_mitigation(void); static void __init taa_select_mitigation(void); +static void __init mmio_select_mitigation(void); static void __init srbds_select_mitigation(void); static void __init l1d_flush_select_mitigation(void); @@ -85,6 +86,10 @@ EXPORT_SYMBOL_GPL(mds_idle_clear); */ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); +/* Controls CPU Fill buffer clear before KVM guest MMIO accesses */ +DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear); +EXPORT_SYMBOL_GPL(mmio_stale_data_clear); + void __init check_bugs(void) { identify_boot_cpu(); @@ -119,12 +124,14 @@ void __init check_bugs(void) l1tf_select_mitigation(); mds_select_mitigation(); taa_select_mitigation(); + mmio_select_mitigation(); srbds_select_mitigation(); l1d_flush_select_mitigation(); /* - * As MDS and TAA mitigations are inter-related, update and print their - * mitigation after TAA mitigation selection is done. + * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update + * and print their mitigation after MDS, TAA and MMIO Stale Data + * mitigation selection is done. */ md_clear_update_mitigation(); @@ -390,6 +397,90 @@ static int __init tsx_async_abort_parse_cmdline(char *str) } early_param("tsx_async_abort", tsx_async_abort_parse_cmdline); +#undef pr_fmt +#define pr_fmt(fmt) "MMIO Stale Data: " fmt + +enum mmio_mitigations { + MMIO_MITIGATION_OFF, + MMIO_MITIGATION_UCODE_NEEDED, + MMIO_MITIGATION_VERW, +}; + +/* Default mitigation for Processor MMIO Stale Data vulnerabilities */ +static enum mmio_mitigations mmio_mitigation __ro_after_init = MMIO_MITIGATION_VERW; +static bool mmio_nosmt __ro_after_init = false; + +static const char * const mmio_strings[] = { + [MMIO_MITIGATION_OFF] = "Vulnerable", + [MMIO_MITIGATION_UCODE_NEEDED] = "Vulnerable: Clear CPU buffers attempted, no microcode", + [MMIO_MITIGATION_VERW] = "Mitigation: Clear CPU buffers", +}; + +static void __init mmio_select_mitigation(void) +{ + u64 ia32_cap; + + if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA) || + cpu_mitigations_off()) { + mmio_mitigation = MMIO_MITIGATION_OFF; + return; + } + + if (mmio_mitigation == MMIO_MITIGATION_OFF) + return; + + ia32_cap = x86_read_arch_cap_msr(); + + /* + * Enable CPU buffer clear mitigation for host and VMM, if also affected + * by MDS or TAA. Otherwise, enable mitigation for VMM only. + */ + if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) && + boot_cpu_has(X86_FEATURE_RTM))) + static_branch_enable(&mds_user_clear); + else + static_branch_enable(&mmio_stale_data_clear); + + /* + * Check if the system has the right microcode. + * + * CPU Fill buffer clear mitigation is enumerated by either an explicit + * FB_CLEAR or by the presence of both MD_CLEAR and L1D_FLUSH on MDS + * affected systems. + */ + if ((ia32_cap & ARCH_CAP_FB_CLEAR) || + (boot_cpu_has(X86_FEATURE_MD_CLEAR) && + boot_cpu_has(X86_FEATURE_FLUSH_L1D) && + !(ia32_cap & ARCH_CAP_MDS_NO))) + mmio_mitigation = MMIO_MITIGATION_VERW; + else + mmio_mitigation = MMIO_MITIGATION_UCODE_NEEDED; + + if (mmio_nosmt || cpu_mitigations_auto_nosmt()) + cpu_smt_disable(false); +} + +static int __init mmio_stale_data_parse_cmdline(char *str) +{ + if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) + return 0; + + if (!str) + return -EINVAL; + + if (!strcmp(str, "off")) { + mmio_mitigation = MMIO_MITIGATION_OFF; + } else if (!strcmp(str, "full")) { + mmio_mitigation = MMIO_MITIGATION_VERW; + } else if (!strcmp(str, "full,nosmt")) { + mmio_mitigation = MMIO_MITIGATION_VERW; + mmio_nosmt = true; + } + + return 0; +} +early_param("mmio_stale_data", mmio_stale_data_parse_cmdline); + #undef pr_fmt #define pr_fmt(fmt) "" fmt @@ -402,19 +493,31 @@ static void __init md_clear_update_mitigation(void) goto out; /* - * mds_user_clear is now enabled. Update MDS mitigation, if - * necessary. + * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data + * mitigation, if necessary. */ if (mds_mitigation == MDS_MITIGATION_OFF && boot_cpu_has_bug(X86_BUG_MDS)) { mds_mitigation = MDS_MITIGATION_FULL; mds_select_mitigation(); } + if (taa_mitigation == TAA_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_TAA)) { + taa_mitigation = TAA_MITIGATION_VERW; + taa_select_mitigation(); + } + if (mmio_mitigation == MMIO_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) { + mmio_mitigation = MMIO_MITIGATION_VERW; + mmio_select_mitigation(); + } out: if (boot_cpu_has_bug(X86_BUG_MDS)) pr_info("MDS: %s\n", mds_strings[mds_mitigation]); if (boot_cpu_has_bug(X86_BUG_TAA)) pr_info("TAA: %s\n", taa_strings[taa_mitigation]); + if (boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) + pr_info("MMIO Stale Data: %s\n", mmio_strings[mmio_mitigation]); } #undef pr_fmt diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 610355b9ccce..4fa216acadce 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6773,6 +6773,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vmx_l1d_flush(vcpu); else if (static_branch_unlikely(&mds_user_clear)) mds_clear_cpu_buffers(); + else if (static_branch_unlikely(&mmio_stale_data_clear) && + kvm_arch_has_assigned_device(vcpu->kvm)) + mds_clear_cpu_buffers(); if (vcpu->arch.cr2 != native_read_cr2()) native_write_cr2(vcpu->arch.cr2); From e5925fb867290ee924fcf2fe3ca887b792714366 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:30:12 -0700 Subject: [PATCH 005/298] x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations MDS, TAA and Processor MMIO Stale Data mitigations rely on clearing CPU buffers. Moreover, status of these mitigations affects each other. During boot, it is important to maintain the order in which these mitigations are selected. This is especially true for md_clear_update_mitigation() that needs to be called after MDS, TAA and Processor MMIO Stale Data mitigation selection is done. Introduce md_clear_select_mitigation(), and select all these mitigations from there. This reflects relationships between these mitigations and ensures proper ordering. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 7b01ba9bc701..d2cc7dbba5e2 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -42,6 +42,7 @@ static void __init ssb_select_mitigation(void); static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); static void __init md_clear_update_mitigation(void); +static void __init md_clear_select_mitigation(void); static void __init taa_select_mitigation(void); static void __init mmio_select_mitigation(void); static void __init srbds_select_mitigation(void); @@ -122,19 +123,10 @@ void __init check_bugs(void) spectre_v2_select_mitigation(); ssb_select_mitigation(); l1tf_select_mitigation(); - mds_select_mitigation(); - taa_select_mitigation(); - mmio_select_mitigation(); + md_clear_select_mitigation(); srbds_select_mitigation(); l1d_flush_select_mitigation(); - /* - * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update - * and print their mitigation after MDS, TAA and MMIO Stale Data - * mitigation selection is done. - */ - md_clear_update_mitigation(); - arch_smt_update(); #ifdef CONFIG_X86_32 @@ -520,6 +512,20 @@ out: pr_info("MMIO Stale Data: %s\n", mmio_strings[mmio_mitigation]); } +static void __init md_clear_select_mitigation(void) +{ + mds_select_mitigation(); + taa_select_mitigation(); + mmio_select_mitigation(); + + /* + * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update + * and print their mitigation after MDS, TAA and MMIO Stale Data + * mitigation selection is done. + */ + md_clear_update_mitigation(); +} + #undef pr_fmt #define pr_fmt(fmt) "SRBDS: " fmt From 99a83db5a605137424e1efe29dc0573d6a5b6316 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:31:12 -0700 Subject: [PATCH 006/298] x86/speculation/mmio: Enable CPU Fill buffer clearing on idle When the CPU is affected by Processor MMIO Stale Data vulnerabilities, Fill Buffer Stale Data Propagator (FBSDP) can propagate stale data out of Fill buffer to uncore buffer when CPU goes idle. Stale data can then be exploited with other variants using MMIO operations. Mitigate it by clearing the Fill buffer before entering idle state. Signed-off-by: Pawan Gupta Co-developed-by: Josh Poimboeuf Signed-off-by: Josh Poimboeuf Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index d2cc7dbba5e2..56d5dea5e128 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -433,6 +433,14 @@ static void __init mmio_select_mitigation(void) else static_branch_enable(&mmio_stale_data_clear); + /* + * If Processor-MMIO-Stale-Data bug is present and Fill Buffer data can + * be propagated to uncore buffers, clearing the Fill buffers on idle + * is required irrespective of SMT state. + */ + if (!(ia32_cap & ARCH_CAP_FBSDP_NO)) + static_branch_enable(&mds_idle_clear); + /* * Check if the system has the right microcode. * @@ -1225,6 +1233,8 @@ static void update_indir_branch_cond(void) /* Update the static key controlling the MDS CPU buffer clear in idle */ static void update_mds_branch_idle(void) { + u64 ia32_cap = x86_read_arch_cap_msr(); + /* * Enable the idle clearing if SMT is active on CPUs which are * affected only by MSBDS and not any other MDS variant. @@ -1236,10 +1246,12 @@ static void update_mds_branch_idle(void) if (!boot_cpu_has_bug(X86_BUG_MSBDS_ONLY)) return; - if (sched_smt_active()) + if (sched_smt_active()) { static_branch_enable(&mds_idle_clear); - else + } else if (mmio_mitigation == MMIO_MITIGATION_OFF || + (ia32_cap & ARCH_CAP_FBSDP_NO)) { static_branch_disable(&mds_idle_clear); + } } #define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n" From 8d50cdf8b8341770bc6367bce40c0c1bb0e1d5b3 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:32:13 -0700 Subject: [PATCH 007/298] x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data Add the sysfs reporting file for Processor MMIO Stale Data vulnerability. It exposes the vulnerability and mitigation state similar to the existing files for the other hardware vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- .../ABI/testing/sysfs-devices-system-cpu | 1 + arch/x86/kernel/cpu/bugs.c | 22 +++++++++++++++++++ drivers/base/cpu.c | 8 +++++++ include/linux/cpu.h | 3 +++ 4 files changed, 34 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2ad01cad7f1c..bcc974d276dc 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -526,6 +526,7 @@ What: /sys/devices/system/cpu/vulnerabilities /sys/devices/system/cpu/vulnerabilities/srbds /sys/devices/system/cpu/vulnerabilities/tsx_async_abort /sys/devices/system/cpu/vulnerabilities/itlb_multihit + /sys/devices/system/cpu/vulnerabilities/mmio_stale_data Date: January 2018 Contact: Linux kernel mailing list Description: Information about CPU vulnerabilities diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 56d5dea5e128..38853077ca58 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1902,6 +1902,20 @@ static ssize_t tsx_async_abort_show_state(char *buf) sched_smt_active() ? "vulnerable" : "disabled"); } +static ssize_t mmio_stale_data_show_state(char *buf) +{ + if (mmio_mitigation == MMIO_MITIGATION_OFF) + return sysfs_emit(buf, "%s\n", mmio_strings[mmio_mitigation]); + + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + return sysfs_emit(buf, "%s; SMT Host state unknown\n", + mmio_strings[mmio_mitigation]); + } + + return sysfs_emit(buf, "%s; SMT %s\n", mmio_strings[mmio_mitigation], + sched_smt_active() ? "vulnerable" : "disabled"); +} + static char *stibp_state(void) { if (spectre_v2_in_eibrs_mode(spectre_v2_enabled)) @@ -2002,6 +2016,9 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr case X86_BUG_SRBDS: return srbds_show_state(buf); + case X86_BUG_MMIO_STALE_DATA: + return mmio_stale_data_show_state(buf); + default: break; } @@ -2053,4 +2070,9 @@ ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char * { return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS); } + +ssize_t cpu_show_mmio_stale_data(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_STALE_DATA); +} #endif diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 2ef23fce0860..a97776ea9d99 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -564,6 +564,12 @@ ssize_t __weak cpu_show_srbds(struct device *dev, return sysfs_emit(buf, "Not affected\n"); } +ssize_t __weak cpu_show_mmio_stale_data(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "Not affected\n"); +} + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL); static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL); static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL); @@ -573,6 +579,7 @@ static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL); static DEVICE_ATTR(tsx_async_abort, 0444, cpu_show_tsx_async_abort, NULL); static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL); static DEVICE_ATTR(srbds, 0444, cpu_show_srbds, NULL); +static DEVICE_ATTR(mmio_stale_data, 0444, cpu_show_mmio_stale_data, NULL); static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_meltdown.attr, @@ -584,6 +591,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_tsx_async_abort.attr, &dev_attr_itlb_multihit.attr, &dev_attr_srbds.attr, + &dev_attr_mmio_stale_data.attr, NULL }; diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 54dc2f9a2d56..2c7477354744 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -65,6 +65,9 @@ extern ssize_t cpu_show_tsx_async_abort(struct device *dev, extern ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf); extern ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char *buf); +extern ssize_t cpu_show_mmio_stale_data(struct device *dev, + struct device_attribute *attr, + char *buf); extern __printf(4, 5) struct device *cpu_device_create(struct device *parent, void *drvdata, From 22cac9c677c95f3ac5c9244f8ca0afdc7c8afb19 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:33:13 -0700 Subject: [PATCH 008/298] x86/speculation/srbds: Update SRBDS mitigation selection Currently, Linux disables SRBDS mitigation on CPUs not affected by MDS and have the TSX feature disabled. On such CPUs, secrets cannot be extracted from CPU fill buffers using MDS or TAA. Without SRBDS mitigation, Processor MMIO Stale Data vulnerabilities can be used to extract RDRAND, RDSEED, and EGETKEY data. Do not disable SRBDS mitigation by default when CPU is also affected by Processor MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 38853077ca58..ef4749097f42 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -595,11 +595,13 @@ static void __init srbds_select_mitigation(void) return; /* - * Check to see if this is one of the MDS_NO systems supporting - * TSX that are only exposed to SRBDS when TSX is enabled. + * Check to see if this is one of the MDS_NO systems supporting TSX that + * are only exposed to SRBDS when TSX is enabled or when CPU is affected + * by Processor MMIO Stale Data vulnerability. */ ia32_cap = x86_read_arch_cap_msr(); - if ((ia32_cap & ARCH_CAP_MDS_NO) && !boot_cpu_has(X86_FEATURE_RTM)) + if ((ia32_cap & ARCH_CAP_MDS_NO) && !boot_cpu_has(X86_FEATURE_RTM) && + !boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) srbds_mitigation = SRBDS_MITIGATION_TSX_OFF; else if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) srbds_mitigation = SRBDS_MITIGATION_HYPERVISOR; From a992b8a4682f119ae035a01b40d4d0665c4a2875 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:34:14 -0700 Subject: [PATCH 009/298] x86/speculation/mmio: Reuse SRBDS mitigation for SBDS The Shared Buffers Data Sampling (SBDS) variant of Processor MMIO Stale Data vulnerabilities may expose RDRAND, RDSEED and SGX EGETKEY data. Mitigation for this is added by a microcode update. As some of the implications of SBDS are similar to SRBDS, SRBDS mitigation infrastructure can be leveraged by SBDS. Set X86_BUG_SRBDS and use SRBDS mitigation. Mitigation is enabled by default; use srbds=off to opt-out. Mitigation status can be checked from below file: /sys/devices/system/cpu/vulnerabilities/srbds Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/common.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f7757409e133..af5d0c188f7b 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1239,6 +1239,8 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = { #define SRBDS BIT(0) /* CPU is affected by X86_BUG_MMIO_STALE_DATA */ #define MMIO BIT(1) +/* CPU is affected by Shared Buffers Data Sampling (SBDS), a variant of X86_BUG_MMIO_STALE_DATA */ +#define MMIO_SBDS BIT(2) static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(IVYBRIDGE, X86_STEPPING_ANY, SRBDS), @@ -1260,16 +1262,17 @@ static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0x8), SRBDS), VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x9, 0xD), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0x8), SRBDS), - VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ICELAKE_D, X86_STEPPINGS(0x1, 0x1), MMIO), VULNBL_INTEL_STEPPINGS(ICELAKE_X, X86_STEPPINGS(0x4, 0x6), MMIO), - VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO), - VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x1), MMIO), - VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO | MMIO_SBDS), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x0), MMIO), + VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ROCKETLAKE, X86_STEPPINGS(0x1, 0x1), MMIO), - VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_D, X86_STEPPING_ANY, MMIO), - VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO | MMIO_SBDS), {} }; @@ -1350,10 +1353,14 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) /* * SRBDS affects CPUs which support RDRAND or RDSEED and are listed * in the vulnerability blacklist. + * + * Some of the implications and mitigation of Shared Buffers Data + * Sampling (SBDS) are similar to SRBDS. Give SBDS same treatment as + * SRBDS. */ if ((cpu_has(c, X86_FEATURE_RDRAND) || cpu_has(c, X86_FEATURE_RDSEED)) && - cpu_matches(cpu_vuln_blacklist, SRBDS)) + cpu_matches(cpu_vuln_blacklist, SRBDS | MMIO_SBDS)) setup_force_cpu_bug(X86_BUG_SRBDS); /* From 027bbb884be006b05d9c577d6401686053aa789e Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:35:15 -0700 Subject: [PATCH 010/298] KVM: x86/speculation: Disable Fill buffer clear within guests The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an accurate indicator on all CPUs of whether the VERW instruction will overwrite fill buffers. FB_CLEAR enumeration in IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not vulnerable to MDS/TAA, indicating that microcode does overwrite fill buffers. Guests running in VMM environments may not be aware of all the capabilities/vulnerabilities of the host CPU. Specifically, a guest may apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable to MDS/TAA even when the physical CPU is not. On CPUs that enumerate FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS during VMENTER and resetting on VMEXIT. For guests that enumerate FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM will not use FB_CLEAR_DIS. Irrespective of guest state, host overwrites CPU buffers before VMENTER to protect itself from an MMIO capable guest, as part of mitigation for MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/include/asm/msr-index.h | 6 +++ arch/x86/kvm/vmx/vmx.c | 69 ++++++++++++++++++++++++++ arch/x86/kvm/vmx/vmx.h | 2 + arch/x86/kvm/x86.c | 3 ++ tools/arch/x86/include/asm/msr-index.h | 6 +++ 5 files changed, 86 insertions(+) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 12976405441b..4425d6773183 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -133,6 +133,11 @@ * VERW clears CPU fill buffer * even on MDS_NO CPUs. */ +#define ARCH_CAP_FB_CLEAR_CTRL BIT(18) /* + * MSR_IA32_MCU_OPT_CTRL[FB_CLEAR_DIS] + * bit available to control VERW + * behavior. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* @@ -150,6 +155,7 @@ #define MSR_IA32_MCU_OPT_CTRL 0x00000123 #define RNGDS_MITG_DIS BIT(0) /* SRBDS support */ #define RTM_ALLOW BIT(1) /* TSX development mode */ +#define FB_CLEAR_DIS BIT(3) /* CPU Fill buffer clear disable */ #define MSR_IA32_SYSENTER_CS 0x00000174 #define MSR_IA32_SYSENTER_ESP 0x00000175 diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4fa216acadce..6e8fb36bc49a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -229,6 +229,9 @@ static const struct { #define L1D_CACHE_ORDER 4 static void *vmx_l1d_flush_pages; +/* Control for disabling CPU Fill buffer clear */ +static bool __read_mostly vmx_fb_clear_ctrl_available; + static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) { struct page *page; @@ -360,6 +363,60 @@ static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp) return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option); } +static void vmx_setup_fb_clear_ctrl(void) +{ + u64 msr; + + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) && + !boot_cpu_has_bug(X86_BUG_MDS) && + !boot_cpu_has_bug(X86_BUG_TAA)) { + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr); + if (msr & ARCH_CAP_FB_CLEAR_CTRL) + vmx_fb_clear_ctrl_available = true; + } +} + +static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx) +{ + u64 msr; + + if (!vmx->disable_fb_clear) + return; + + rdmsrl(MSR_IA32_MCU_OPT_CTRL, msr); + msr |= FB_CLEAR_DIS; + wrmsrl(MSR_IA32_MCU_OPT_CTRL, msr); + /* Cache the MSR value to avoid reading it later */ + vmx->msr_ia32_mcu_opt_ctrl = msr; +} + +static __always_inline void vmx_enable_fb_clear(struct vcpu_vmx *vmx) +{ + if (!vmx->disable_fb_clear) + return; + + vmx->msr_ia32_mcu_opt_ctrl &= ~FB_CLEAR_DIS; + wrmsrl(MSR_IA32_MCU_OPT_CTRL, vmx->msr_ia32_mcu_opt_ctrl); +} + +static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx) +{ + vmx->disable_fb_clear = vmx_fb_clear_ctrl_available; + + /* + * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS + * at VMEntry. Skip the MSR read/write when a guest has no use case to + * execute VERW. + */ + if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) || + ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO))) + vmx->disable_fb_clear = false; +} + static const struct kernel_param_ops vmentry_l1d_flush_ops = { .set = vmentry_l1d_flush_set, .get = vmentry_l1d_flush_get, @@ -2252,6 +2309,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) ret = kvm_set_msr_common(vcpu, msr_info); } + /* FB_CLEAR may have changed, also update the FB_CLEAR_DIS behavior */ + if (msr_index == MSR_IA32_ARCH_CAPABILITIES) + vmx_update_fb_clear_dis(vcpu, vmx); + return ret; } @@ -4553,6 +4614,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); vpid_sync_context(vmx->vpid); + + vmx_update_fb_clear_dis(vcpu, vmx); } static void vmx_enable_irq_window(struct kvm_vcpu *vcpu) @@ -6777,6 +6840,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, kvm_arch_has_assigned_device(vcpu->kvm)) mds_clear_cpu_buffers(); + vmx_disable_fb_clear(vmx); + if (vcpu->arch.cr2 != native_read_cr2()) native_write_cr2(vcpu->arch.cr2); @@ -6785,6 +6850,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vcpu->arch.cr2 = native_read_cr2(); + vmx_enable_fb_clear(vmx); + guest_state_exit_irqoff(); } @@ -8185,6 +8252,8 @@ static int __init vmx_init(void) return r; } + vmx_setup_fb_clear_ctrl(); + for_each_possible_cpu(cpu) { INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu)); diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index b98c7e96697a..8d2342ede0c5 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -348,6 +348,8 @@ struct vcpu_vmx { u64 msr_ia32_feature_control_valid_bits; /* SGX Launch Control public key hash */ u64 msr_ia32_sgxlepubkeyhash[4]; + u64 msr_ia32_mcu_opt_ctrl; + bool disable_fb_clear; struct pt_desc pt_desc; struct lbr_desc lbr_desc; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4790f0d7d40b..44b72caf2e0b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1587,6 +1587,9 @@ static u64 kvm_get_arch_capabilities(void) */ } + /* Guests don't need to know "Fill buffer clear control" exists */ + data &= ~ARCH_CAP_FB_CLEAR_CTRL; + return data; } diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index 12976405441b..4425d6773183 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -133,6 +133,11 @@ * VERW clears CPU fill buffer * even on MDS_NO CPUs. */ +#define ARCH_CAP_FB_CLEAR_CTRL BIT(18) /* + * MSR_IA32_MCU_OPT_CTRL[FB_CLEAR_DIS] + * bit available to control VERW + * behavior. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* @@ -150,6 +155,7 @@ #define MSR_IA32_MCU_OPT_CTRL 0x00000123 #define RNGDS_MITG_DIS BIT(0) /* SRBDS support */ #define RTM_ALLOW BIT(1) /* TSX development mode */ +#define FB_CLEAR_DIS BIT(3) /* CPU Fill buffer clear disable */ #define MSR_IA32_SYSENTER_CS 0x00000174 #define MSR_IA32_SYSENTER_ESP 0x00000175 From 1dc6ff02c8bf77d71b9b5d11cbc9df77cfb28626 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Mon, 23 May 2022 09:11:49 -0700 Subject: [PATCH 011/298] x86/speculation/mmio: Print SMT warning Similar to MDS and TAA, print a warning if SMT is enabled for the MMIO Stale Data vulnerability. Signed-off-by: Josh Poimboeuf Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/bugs.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index ef4749097f42..a8a9f6406331 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1258,6 +1258,7 @@ static void update_mds_branch_idle(void) #define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n" #define TAA_MSG_SMT "TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.\n" +#define MMIO_MSG_SMT "MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.\n" void cpu_bugs_smt_update(void) { @@ -1302,6 +1303,16 @@ void cpu_bugs_smt_update(void) break; } + switch (mmio_mitigation) { + case MMIO_MITIGATION_VERW: + case MMIO_MITIGATION_UCODE_NEEDED: + if (sched_smt_active()) + pr_warn_once(MMIO_MSG_SMT); + break; + case MMIO_MITIGATION_OFF: + break; + } + mutex_unlock(&spec_ctrl_mutex); } From 5b7419ae1d208cab1e2826d473d8dab045aa75c7 Mon Sep 17 00:00:00 2001 From: Phillip Potter Date: Sat, 21 May 2022 21:47:41 +0100 Subject: [PATCH 012/298] staging: r8188eu: fix rtw_alloc_hwxmits error detection for now In _rtw_init_xmit_priv, we use the res variable to store the error return from the newly converted rtw_alloc_hwxmits function. Sadly, the calling function interprets res using _SUCCESS and _FAIL still, meaning we change the semantics of the variable, even in the success case. This leads to the following on boot: r8188eu 1-2:1.0: _rtw_init_xmit_priv failed In the long term, we should reverse these semantics, but for now, this fixes the driver. Also, inside rtw_alloc_hwxmits remove the if blocks, as HWXMIT_ENTRY is always 4. Fixes: f94b47c6bde6 ("staging: r8188eu: add check for kzalloc") Signed-off-by: Phillip Potter Link: https://lore.kernel.org/r/20220521204741.921-1-phil@philpotter.co.uk Signed-off-by: Greg Kroah-Hartman --- drivers/staging/r8188eu/core/rtw_xmit.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/drivers/staging/r8188eu/core/rtw_xmit.c b/drivers/staging/r8188eu/core/rtw_xmit.c index 3d8e9dea7651..7135d89caac1 100644 --- a/drivers/staging/r8188eu/core/rtw_xmit.c +++ b/drivers/staging/r8188eu/core/rtw_xmit.c @@ -178,8 +178,7 @@ s32 _rtw_init_xmit_priv(struct xmit_priv *pxmitpriv, struct adapter *padapter) pxmitpriv->free_xmit_extbuf_cnt = num_xmit_extbuf; - res = rtw_alloc_hwxmits(padapter); - if (res) { + if (rtw_alloc_hwxmits(padapter)) { res = _FAIL; goto exit; } @@ -1483,19 +1482,10 @@ int rtw_alloc_hwxmits(struct adapter *padapter) hwxmits = pxmitpriv->hwxmits; - if (pxmitpriv->hwxmit_entry == 5) { - hwxmits[0] .sta_queue = &pxmitpriv->bm_pending; - hwxmits[1] .sta_queue = &pxmitpriv->vo_pending; - hwxmits[2] .sta_queue = &pxmitpriv->vi_pending; - hwxmits[3] .sta_queue = &pxmitpriv->bk_pending; - hwxmits[4] .sta_queue = &pxmitpriv->be_pending; - } else if (pxmitpriv->hwxmit_entry == 4) { - hwxmits[0] .sta_queue = &pxmitpriv->vo_pending; - hwxmits[1] .sta_queue = &pxmitpriv->vi_pending; - hwxmits[2] .sta_queue = &pxmitpriv->be_pending; - hwxmits[3] .sta_queue = &pxmitpriv->bk_pending; - } else { - } + hwxmits[0].sta_queue = &pxmitpriv->vo_pending; + hwxmits[1].sta_queue = &pxmitpriv->vi_pending; + hwxmits[2].sta_queue = &pxmitpriv->be_pending; + hwxmits[3].sta_queue = &pxmitpriv->bk_pending; return 0; } From 96f0a54e8e65a765b3a4ad4b53751581f23279f3 Mon Sep 17 00:00:00 2001 From: Larry Finger Date: Mon, 30 May 2022 20:31:03 -0500 Subject: [PATCH 013/298] staging: r8188eu: Fix warning of array overflow in ioctl_linux.c MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Building with -Warray-bounds results in the following warning plus others related to the same problem: CC [M] drivers/staging/r8188eu/os_dep/ioctl_linux.o In function ‘wpa_set_encryption’, inlined from ‘rtw_wx_set_enc_ext’ at drivers/staging/r8188eu/os_dep/ioctl_linux.c:1868:9: drivers/staging/r8188eu/os_dep/ioctl_linux.c:412:41: warning: array subscript ‘struct ndis_802_11_wep[0]’ is partly outside array bounds of ‘void[25]’ [-Warray-bounds] 412 | pwep->KeyLength = wep_key_len; | ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~ In file included from drivers/staging/r8188eu/os_dep/../include/osdep_service.h:19, from drivers/staging/r8188eu/os_dep/ioctl_linux.c:4: In function ‘kmalloc’, inlined from ‘kzalloc’ at ./include/linux/slab.h:733:9, inlined from ‘wpa_set_encryption’ at drivers/staging/r8188eu/os_dep/ioctl_linux.c:408:11, inlined from ‘rtw_wx_set_enc_ext’ at drivers/staging/r8188eu/os_dep/ioctl_linux.c:1868:9: ./include/linux/slab.h:605:16: note: object of size [17, 25] allocated by ‘__kmalloc’ 605 | return __kmalloc(size, flags); | ^~~~~~~~~~~~~~~~~~~~~~ ./include/linux/slab.h:600:24: note: object of size [17, 25] allocated by ‘kmem_cache_alloc_trace’ 600 | return kmem_cache_alloc_trace( | ^~~~~~~~~~~~~~~~~~~~~~~ 601 | kmalloc_caches[kmalloc_type(flags)][index], | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 602 | flags, size); | ~~~~~~~~~~~~ Although it is unlikely that anyone is still using WEP encryption, the size of the allocation needs to be increased just in case. Fixes commit 2b42bd58b321 ("staging: r8188eu: introduce new os_dep dir for RTL8188eu driver") Fixes: 2b42bd58b321 ("staging: r8188eu: introduce new os_dep dir for RTL8188eu driver") Signed-off-by: Larry Finger Cc: Phillip Potter Cc: Dan Carpenter Link: https://lore.kernel.org/r/20220531013103.2175-3-Larry.Finger@lwfinger.net Signed-off-by: Greg Kroah-Hartman --- drivers/staging/r8188eu/os_dep/ioctl_linux.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/staging/r8188eu/os_dep/ioctl_linux.c b/drivers/staging/r8188eu/os_dep/ioctl_linux.c index 1b09462ca908..8dd280e2739a 100644 --- a/drivers/staging/r8188eu/os_dep/ioctl_linux.c +++ b/drivers/staging/r8188eu/os_dep/ioctl_linux.c @@ -403,7 +403,7 @@ static int wpa_set_encryption(struct net_device *dev, struct ieee_param *param, if (wep_key_len > 0) { wep_key_len = wep_key_len <= 5 ? 5 : 13; - wep_total_len = wep_key_len + FIELD_OFFSET(struct ndis_802_11_wep, KeyMaterial); + wep_total_len = wep_key_len + sizeof(*pwep); pwep = kzalloc(wep_total_len, GFP_KERNEL); if (!pwep) goto exit; From fe44fb23d6ccde4c914c44ef74ab8d9d9ba02bea Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Tue, 31 May 2022 11:03:06 -0400 Subject: [PATCH 014/298] pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE If the server tells us that a pNFS layout is not available for a specific file, then we should not keep pounding it with further layoutget requests. Fixes: 183d9e7b112a ("pnfs: rework LAYOUTGET retry handling") Signed-off-by: Trond Myklebust Signed-off-by: Anna Schumaker --- fs/nfs/pnfs.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 68a87be3e6f9..4609e641710e 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -2152,6 +2152,12 @@ lookup_again: case -ERECALLCONFLICT: case -EAGAIN: break; + case -ENODATA: + /* The server returned NFS4ERR_LAYOUTUNAVAILABLE */ + pnfs_layout_set_fail_bit( + lo, pnfs_iomode_to_fail_bit(iomode)); + lseg = NULL; + goto out_put_layout_hdr; default: if (!nfs_error_is_fatal(PTR_ERR(lseg))) { pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode)); From 880265c77ac415090090d1fe72a188fee71cb458 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Tue, 31 May 2022 11:03:07 -0400 Subject: [PATCH 015/298] pNFS: Avoid a live lock condition in pnfs_update_layout() If we're about to send the first layoutget for an empty layout, we want to make sure that we drain out the existing pending layoutget calls first. The reason is that these layouts may have been already implicitly returned to the server by a recall to which the client gave a NFS4ERR_NOMATCHING_LAYOUT response. The problem is that wait_var_event_killable() could in principle see the plh_outstanding count go back to '1' when the first process to wake up starts sending a new layoutget. If it fails to get a layout, then this loop can continue ad infinitum... Fixes: 0b77f97a7e42 ("NFSv4/pnfs: Fix layoutget behaviour after invalidation") Signed-off-by: Trond Myklebust Signed-off-by: Anna Schumaker --- fs/nfs/callback_proc.c | 1 + fs/nfs/pnfs.c | 15 +++++++++------ fs/nfs/pnfs.h | 1 + 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c index c8520284dda7..c1eda73254e1 100644 --- a/fs/nfs/callback_proc.c +++ b/fs/nfs/callback_proc.c @@ -288,6 +288,7 @@ static u32 initiate_file_draining(struct nfs_client *clp, rv = NFS4_OK; break; case -ENOENT: + set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags); /* Embrace your forgetfulness! */ rv = NFS4ERR_NOMATCHING_LAYOUT; diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 4609e641710e..41a9b6b58fb9 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -469,6 +469,7 @@ pnfs_mark_layout_stateid_invalid(struct pnfs_layout_hdr *lo, pnfs_clear_lseg_state(lseg, lseg_list); pnfs_clear_layoutreturn_info(lo); pnfs_free_returned_lsegs(lo, lseg_list, &range, 0); + set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags); if (test_bit(NFS_LAYOUT_RETURN, &lo->plh_flags) && !test_and_set_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) pnfs_clear_layoutreturn_waitbit(lo); @@ -1917,8 +1918,9 @@ static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo) static void nfs_layoutget_end(struct pnfs_layout_hdr *lo) { - if (atomic_dec_and_test(&lo->plh_outstanding)) - wake_up_var(&lo->plh_outstanding); + if (atomic_dec_and_test(&lo->plh_outstanding) && + test_and_clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) + wake_up_bit(&lo->plh_flags, NFS_LAYOUT_DRAIN); } static bool pnfs_is_first_layoutget(struct pnfs_layout_hdr *lo) @@ -2025,11 +2027,11 @@ lookup_again: * If the layout segment list is empty, but there are outstanding * layoutget calls, then they might be subject to a layoutrecall. */ - if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) && + if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags) && atomic_read(&lo->plh_outstanding) != 0) { spin_unlock(&ino->i_lock); - lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding, - !atomic_read(&lo->plh_outstanding))); + lseg = ERR_PTR(wait_on_bit(&lo->plh_flags, NFS_LAYOUT_DRAIN, + TASK_KILLABLE)); if (IS_ERR(lseg)) goto out_put_layout_hdr; pnfs_put_layout_hdr(lo); @@ -2413,7 +2415,8 @@ pnfs_layout_process(struct nfs4_layoutget *lgp) goto out_forget; } - if (!pnfs_layout_is_valid(lo) && !pnfs_is_first_layoutget(lo)) + if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags) && + !pnfs_is_first_layoutget(lo)) goto out_forget; if (nfs4_stateid_match_other(&lo->plh_stateid, &res->stateid)) { diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h index 07f11489e4e9..f331f067691b 100644 --- a/fs/nfs/pnfs.h +++ b/fs/nfs/pnfs.h @@ -105,6 +105,7 @@ enum { NFS_LAYOUT_FIRST_LAYOUTGET, /* Serialize first layoutget */ NFS_LAYOUT_INODE_FREEING, /* The inode is being freed */ NFS_LAYOUT_HASHED, /* The layout visible */ + NFS_LAYOUT_DRAIN, }; enum layoutdriver_policy_flags { From c2f75a43f5ae48b9babeb5b82c9f23fe18d3d144 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Wed, 1 Jun 2022 09:42:12 -0700 Subject: [PATCH 016/298] objtool: Fix obsolete reference to CONFIG_X86_SMAP CONFIG_X86_SMAP no longer exists. For objtool's purposes it has been replaced with CONFIG_HAVE_UACCESS_VALIDATION. Fixes: 03f16cd020eb ("objtool: Add CONFIG_OBJTOOL") Reported-by: Lukas Bulwahn Signed-off-by: Josh Poimboeuf Link: https://lore.kernel.org/r/44c57668768c1ba1b4ba1ff541ec54781636e07c.1654101721.git.jpoimboe@kernel.org --- lib/Kconfig.ubsan | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan index c4fe15d38b60..a9f7eb047768 100644 --- a/lib/Kconfig.ubsan +++ b/lib/Kconfig.ubsan @@ -94,7 +94,7 @@ config UBSAN_UNREACHABLE bool "Perform checking for unreachable code" # objtool already handles unreachable checking and gets angry about # seeing UBSan instrumentation located in unreachable places. - depends on !(OBJTOOL && (STACK_VALIDATION || UNWINDER_ORC || X86_SMAP)) + depends on !(OBJTOOL && (STACK_VALIDATION || UNWINDER_ORC || HAVE_UACCESS_VALIDATION)) depends on $(cc-option,-fsanitize=unreachable) help This option enables -fsanitize=unreachable which checks for control From dcea997beed694cbd8705100ca1a6eb0d886de69 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Wed, 1 Jun 2022 17:42:22 -0700 Subject: [PATCH 017/298] faddr2line: Fix overlapping text section failures, the sequel If a function lives in a section other than .text, but .text also exists in the object, faddr2line may wrongly assume .text. This can result in comically wrong output. For example: $ scripts/faddr2line vmlinux.o enter_from_user_mode+0x1c enter_from_user_mode+0x1c/0x30: find_next_bit at /home/jpoimboe/git/linux/./include/linux/find.h:40 (inlined by) perf_clear_dirty_counters at /home/jpoimboe/git/linux/arch/x86/events/core.c:2504 Fix it by passing the section name to addr2line, unless the object file is vmlinux, in which case the symbol table uses absolute addresses. Fixes: 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section failures") Reported-by: Peter Zijlstra Signed-off-by: Josh Poimboeuf Link: https://lore.kernel.org/r/7d25bc1408bd3a750ac26e60d2f2815a5f4a8363.1654130536.git.jpoimboe@kernel.org --- scripts/faddr2line | 45 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 34 insertions(+), 11 deletions(-) diff --git a/scripts/faddr2line b/scripts/faddr2line index 0e6268d59883..94ed98dd899f 100755 --- a/scripts/faddr2line +++ b/scripts/faddr2line @@ -95,17 +95,25 @@ __faddr2line() { local print_warnings=$4 local sym_name=${func_addr%+*} - local offset=${func_addr#*+} - offset=${offset%/*} + local func_offset=${func_addr#*+} + func_offset=${func_offset%/*} local user_size= + local file_type + local is_vmlinux=0 [[ $func_addr =~ "/" ]] && user_size=${func_addr#*/} - if [[ -z $sym_name ]] || [[ -z $offset ]] || [[ $sym_name = $func_addr ]]; then + if [[ -z $sym_name ]] || [[ -z $func_offset ]] || [[ $sym_name = $func_addr ]]; then warn "bad func+offset $func_addr" DONE=1 return fi + # vmlinux uses absolute addresses in the section table rather than + # section offsets. + local file_type=$(${READELF} --file-header $objfile | + ${AWK} '$1 == "Type:" { print $2; exit }') + [[ $file_type = "EXEC" ]] && is_vmlinux=1 + # Go through each of the object's symbols which match the func name. # In rare cases there might be duplicates, in which case we print all # matches. @@ -114,9 +122,11 @@ __faddr2line() { local sym_addr=0x${fields[1]} local sym_elf_size=${fields[2]} local sym_sec=${fields[6]} + local sec_size + local sec_name # Get the section size: - local sec_size=$(${READELF} --section-headers --wide $objfile | + sec_size=$(${READELF} --section-headers --wide $objfile | sed 's/\[ /\[/' | ${AWK} -v sec=$sym_sec '$1 == "[" sec "]" { print "0x" $6; exit }') @@ -126,6 +136,17 @@ __faddr2line() { return fi + # Get the section name: + sec_name=$(${READELF} --section-headers --wide $objfile | + sed 's/\[ /\[/' | + ${AWK} -v sec=$sym_sec '$1 == "[" sec "]" { print $2; exit }') + + if [[ -z $sec_name ]]; then + warn "bad section name: section: $sym_sec" + DONE=1 + return + fi + # Calculate the symbol size. # # Unfortunately we can't use the ELF size, because kallsyms @@ -174,10 +195,10 @@ __faddr2line() { sym_size=0x$(printf %x $sym_size) - # Calculate the section address from user-supplied offset: - local addr=$(($sym_addr + $offset)) + # Calculate the address from user-supplied offset: + local addr=$(($sym_addr + $func_offset)) if [[ -z $addr ]] || [[ $addr = 0 ]]; then - warn "bad address: $sym_addr + $offset" + warn "bad address: $sym_addr + $func_offset" DONE=1 return fi @@ -191,9 +212,9 @@ __faddr2line() { fi # Make sure the provided offset is within the symbol's range: - if [[ $offset -gt $sym_size ]]; then + if [[ $func_offset -gt $sym_size ]]; then [[ $print_warnings = 1 ]] && - echo "skipping $sym_name address at $addr due to size mismatch ($offset > $sym_size)" + echo "skipping $sym_name address at $addr due to size mismatch ($func_offset > $sym_size)" continue fi @@ -202,11 +223,13 @@ __faddr2line() { [[ $FIRST = 0 ]] && echo FIRST=0 - echo "$sym_name+$offset/$sym_size:" + echo "$sym_name+$func_offset/$sym_size:" # Pass section address to addr2line and strip absolute paths # from the output: - local output=$(${ADDR2LINE} -fpie $objfile $addr | sed "s; $dir_prefix\(\./\)*; ;") + local args="--functions --pretty-print --inlines --exe=$objfile" + [[ $is_vmlinux = 0 ]] && args="$args --section=$sec_name" + local output=$(${ADDR2LINE} $args $addr | sed "s; $dir_prefix\(\./\)*; ;") [[ -z $output ]] && continue # Default output (non --list): From 7b6c7a877cc616bc7dc9cd39646fe454acbed48b Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Fri, 3 Jun 2022 08:04:44 -0700 Subject: [PATCH 018/298] x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage The file-wide OBJECT_FILES_NON_STANDARD annotation is used with CONFIG_FRAME_POINTER to tell objtool to skip the entire file when frame pointers are enabled. However that annotation is now deprecated because it doesn't work with IBT, where objtool runs on vmlinux.o instead of individual translation units. Instead, use more fine-grained function-specific annotations: - The 'save_mcount_regs' macro does funny things with the frame pointer. Use STACK_FRAME_NON_STANDARD_FP to tell objtool to ignore the functions using it. - The return_to_handler() "function" isn't actually a callable function. Instead of being called, it's returned to. The real return address isn't on the stack, so unwinding is already doomed no matter which unwinder is used. So just remove the STT_FUNC annotation, telling objtool to ignore it. That also removes the implicit ANNOTATE_NOENDBR, which now needs to be made explicit. Fixes the following warning: vmlinux.o: warning: objtool: __fentry__+0x16: return with modified stack frame Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Reported-by: kernel test robot Signed-off-by: Josh Poimboeuf Link: https://lore.kernel.org/r/b7a7a42fe306aca37826043dac89e113a1acdbac.1654268610.git.jpoimboe@kernel.org --- arch/x86/kernel/Makefile | 4 ---- arch/x86/kernel/ftrace_64.S | 11 ++++++++--- include/linux/objtool.h | 6 ++++++ tools/include/linux/objtool.h | 6 ++++++ 4 files changed, 20 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 03364dc40d8d..4c8b6ae802ac 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -36,10 +36,6 @@ KCSAN_SANITIZE := n OBJECT_FILES_NON_STANDARD_test_nx.o := y -ifdef CONFIG_FRAME_POINTER -OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o := y -endif - # If instrumentation of this dir is enabled, boot hangs during first second. # Probably could be more selective here, but note that files related to irqs, # boot, dumpstack/stacktrace, etc are either non-interesting or can lead to diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S index 4ec13608d3c6..dfeb227de561 100644 --- a/arch/x86/kernel/ftrace_64.S +++ b/arch/x86/kernel/ftrace_64.S @@ -175,6 +175,7 @@ SYM_INNER_LABEL(ftrace_caller_end, SYM_L_GLOBAL) jmp ftrace_epilogue SYM_FUNC_END(ftrace_caller); +STACK_FRAME_NON_STANDARD_FP(ftrace_caller) SYM_FUNC_START(ftrace_epilogue) /* @@ -282,6 +283,7 @@ SYM_INNER_LABEL(ftrace_regs_caller_end, SYM_L_GLOBAL) jmp ftrace_epilogue SYM_FUNC_END(ftrace_regs_caller) +STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller) #else /* ! CONFIG_DYNAMIC_FTRACE */ @@ -311,10 +313,14 @@ trace: jmp ftrace_stub SYM_FUNC_END(__fentry__) EXPORT_SYMBOL(__fentry__) +STACK_FRAME_NON_STANDARD_FP(__fentry__) + #endif /* CONFIG_DYNAMIC_FTRACE */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER -SYM_FUNC_START(return_to_handler) +SYM_CODE_START(return_to_handler) + UNWIND_HINT_EMPTY + ANNOTATE_NOENDBR subq $16, %rsp /* Save the return values */ @@ -339,7 +345,6 @@ SYM_FUNC_START(return_to_handler) int3 .Ldo_rop: mov %rdi, (%rsp) - UNWIND_HINT_FUNC RET -SYM_FUNC_END(return_to_handler) +SYM_CODE_END(return_to_handler) #endif diff --git a/include/linux/objtool.h b/include/linux/objtool.h index 6491fa8fba6d..15b940ec1eac 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -143,6 +143,12 @@ struct unwind_hint { .popsection .endm +.macro STACK_FRAME_NON_STANDARD_FP func:req +#ifdef CONFIG_FRAME_POINTER + STACK_FRAME_NON_STANDARD \func +#endif +.endm + .macro ANNOTATE_NOENDBR .Lhere_\@: .pushsection .discard.noendbr diff --git a/tools/include/linux/objtool.h b/tools/include/linux/objtool.h index 6491fa8fba6d..15b940ec1eac 100644 --- a/tools/include/linux/objtool.h +++ b/tools/include/linux/objtool.h @@ -143,6 +143,12 @@ struct unwind_hint { .popsection .endm +.macro STACK_FRAME_NON_STANDARD_FP func:req +#ifdef CONFIG_FRAME_POINTER + STACK_FRAME_NON_STANDARD \func +#endif +.endm + .macro ANNOTATE_NOENDBR .Lhere_\@: .pushsection .discard.noendbr From 5e3f89ad8e0cbd75aa3479e9ceb96d9e1c5585b8 Mon Sep 17 00:00:00 2001 From: Rob Herring Date: Mon, 6 Jun 2022 16:22:22 -0500 Subject: [PATCH 019/298] dt-bindings: hwmon: ti,tmp401: Drop 'items' from 'ti,n-factor' property 'ti,n-factor' is a scalar type, so 'items' should not be used as that is for arrays/matrix. A pending meta-schema change will catch future cases. Fixes: bd90c5b93950 ("dt-bindings: hwmon: Add TMP401, TMP411 and TMP43x") Signed-off-by: Rob Herring Reviewed-by: Krzysztof Kozlowski Link: https://lore.kernel.org/r/20220606212223.1360395-1-robh@kernel.org Signed-off-by: Guenter Roeck --- Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml index fe0ac08faa1a..0e8ddf0ad789 100644 --- a/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml +++ b/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml @@ -40,9 +40,8 @@ properties: value to be used for converting remote channel measurements to temperature. $ref: /schemas/types.yaml#/definitions/int32 - items: - minimum: -128 - maximum: 127 + minimum: -128 + maximum: 127 ti,beta-compensation: description: From ac6888ac5a11c0a47d1f1da4b7809c0c595fdc5d Mon Sep 17 00:00:00 2001 From: Eddie James Date: Mon, 6 Jun 2022 13:54:55 -0500 Subject: [PATCH 020/298] hwmon: (occ) Lock mutex in shutdown to prevent race with occ_active Unbinding the driver or removing the parent device at the same time as using the OCC active sysfs file can cause the driver to unregister the hwmon device twice. Prevent this by locking the occ mutex in the shutdown function. Signed-off-by: Eddie James Link: https://lore.kernel.org/r/20220606185455.21126-1-eajames@linux.ibm.com Signed-off-by: Guenter Roeck --- drivers/hwmon/occ/common.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c index d78f4bebc718..ea070b91e5b9 100644 --- a/drivers/hwmon/occ/common.c +++ b/drivers/hwmon/occ/common.c @@ -1228,10 +1228,15 @@ EXPORT_SYMBOL_GPL(occ_setup); void occ_shutdown(struct occ *occ) { + mutex_lock(&occ->lock); + occ_shutdown_sysfs(occ); if (occ->hwmon) hwmon_device_unregister(occ->hwmon); + occ->hwmon = NULL; + + mutex_unlock(&occ->lock); } EXPORT_SYMBOL_GPL(occ_shutdown); From d52d165d67c5aa26c8c89909003c94a66492d23d Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 28 May 2022 12:38:11 +0100 Subject: [PATCH 021/298] KVM: arm64: Always start with clearing SVE flag on load On each vcpu load, we set the KVM_ARM64_HOST_SVE_ENABLED flag if SVE is enabled for EL0 on the host. This is used to restore the correct state on vpcu put. However, it appears that nothing ever clears this flag. Once set, it will stick until the vcpu is destroyed, which has the potential to spuriously enable SVE for userspace. We probably never saw the issue because no VMM uses SVE, but that's still pretty bad. Unconditionally clearing the flag on vcpu load addresses the issue. Fixes: 8383741ab2e7 ("KVM: arm64: Get rid of host SVE tracking/saving") Signed-off-by: Marc Zyngier Cc: stable@vger.kernel.org Reviewed-by: Mark Brown Link: https://lore.kernel.org/r/20220528113829.1043361-2-maz@kernel.org --- arch/arm64/kvm/fpsimd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index 3d251a4d2cf7..8267ff4642d3 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -80,6 +80,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu) vcpu->arch.flags &= ~KVM_ARM64_FP_ENABLED; vcpu->arch.flags |= KVM_ARM64_FP_HOST; + vcpu->arch.flags &= ~KVM_ARM64_HOST_SVE_ENABLED; if (read_sysreg(cpacr_el1) & CPACR_EL1_ZEN_EL0EN) vcpu->arch.flags |= KVM_ARM64_HOST_SVE_ENABLED; From 039f49c4cafb785504c678f28664d088e0108d35 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 28 May 2022 12:38:12 +0100 Subject: [PATCH 022/298] KVM: arm64: Always start with clearing SME flag on load On each vcpu load, we set the KVM_ARM64_HOST_SME_ENABLED flag if SME is enabled for EL0 on the host. This is used to restore the correct state on vpcu put. However, it appears that nothing ever clears this flag. Once set, it will stick until the vcpu is destroyed, which has the potential to spuriously enable SME for userspace. As it turns out, this is due to the SME code being more or less copied from SVE, and inheriting the same shortcomings. We never saw the issue because nothing uses SME, and the amount of testing is probably still pretty low. Fixes: 861262ab8627 ("KVM: arm64: Handle SME host state when running guests") Signed-off-by: Marc Zyngier Reviwed-by: Mark Brown Link: https://lore.kernel.org/r/20220528113829.1043361-3-maz@kernel.org --- arch/arm64/kvm/fpsimd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index 8267ff4642d3..6012b08ecb14 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -94,6 +94,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu) * operations. Do this for ZA as well for now for simplicity. */ if (system_supports_sme()) { + vcpu->arch.flags &= ~KVM_ARM64_HOST_SME_ENABLED; if (read_sysreg(cpacr_el1) & CPACR_EL1_SMEN_EL0EN) vcpu->arch.flags |= KVM_ARM64_HOST_SME_ENABLED; From e3fe65e0d3671ee5ae8a2723e429ee4830a7c89c Mon Sep 17 00:00:00 2001 From: sunliming Date: Thu, 2 Jun 2022 10:48:05 +0800 Subject: [PATCH 023/298] KVM: arm64: Fix inconsistent indenting Fix the following smatch warnings: arch/arm64/kvm/vmid.c:62 flush_context() warn: inconsistent indenting Reported-by: kernel test robot Signed-off-by: sunliming Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220602024805.511457-1-sunliming@kylinos.cn --- arch/arm64/kvm/vmid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vmid.c b/arch/arm64/kvm/vmid.c index 8d5f0506fd87..d78ae63d7c15 100644 --- a/arch/arm64/kvm/vmid.c +++ b/arch/arm64/kvm/vmid.c @@ -66,7 +66,7 @@ static void flush_context(void) * the next context-switch, we broadcast TLB flush + I-cache * invalidation over the inner shareable domain on rollover. */ - kvm_call_hyp(__kvm_flush_vm_context); + kvm_call_hyp(__kvm_flush_vm_context); } static bool check_update_reserved_vmid(u64 vmid, u64 newvmid) From 304791255a2dc1c9be7e7c8a6cbdb31b6847b0e5 Mon Sep 17 00:00:00 2001 From: Scott Mayhew Date: Wed, 1 Jun 2022 13:34:49 -0400 Subject: [PATCH 024/298] sunrpc: set cl_max_connect when cloning an rpc_clnt If the initial attempt at trunking detection using the krb5i auth flavor fails with -EACCES, -NFS4ERR_CLID_INUSE, or -NFS4ERR_WRONGSEC, then the NFS client tries again using auth_sys, cloning the rpc_clnt in the process. If this second attempt at trunking detection succeeds, then the resulting nfs_client->cl_rpcclient winds up having cl_max_connect=0 and subsequent attempts to add additional transport connections to the rpc_clnt will fail with a message similar to the following being logged: [502044.312640] SUNRPC: reached max allowed number (0) did not add transport to server: 192.168.122.3 Signed-off-by: Scott Mayhew Fixes: dc48e0abee24 ("SUNRPC enforce creation of no more than max_connect xprts") Signed-off-by: Anna Schumaker --- net/sunrpc/clnt.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index e2c6eca0271b..b6781ada3aa8 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -651,6 +651,7 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args, new->cl_discrtry = clnt->cl_discrtry; new->cl_chatty = clnt->cl_chatty; new->cl_principal = clnt->cl_principal; + new->cl_max_connect = clnt->cl_max_connect; return new; out_err: From 2cdea19a34c2340b3aa69508804efe4e3750fcec Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:25 +0100 Subject: [PATCH 025/298] KVM: arm64: Don't read a HW interrupt pending state in user context Since 5bfa685e62e9 ("KVM: arm64: vgic: Read HW interrupt pending state from the HW"), we're able to source the pending bit for an interrupt that is stored either on the physical distributor or on a device. However, this state is only available when the vcpu is loaded, and is not intended to be accessed from userspace. Unfortunately, the GICv2 emulation doesn't provide specific userspace accessors, and we fallback with the ones that are intended for the guest, with fatal consequences. Add a new vgic_uaccess_read_pending() accessor for userspace to use, build on top of the existing vgic_mmio_read_pending(). Reported-by: Eric Auger Reviewed-by: Eric Auger Tested-by: Eric Auger Signed-off-by: Marc Zyngier Fixes: 5bfa685e62e9 ("KVM: arm64: vgic: Read HW interrupt pending state from the HW") Link: https://lore.kernel.org/r/20220607131427.1164881-2-maz@kernel.org Cc: stable@vger.kernel.org --- arch/arm64/kvm/vgic/vgic-mmio-v2.c | 4 ++-- arch/arm64/kvm/vgic/vgic-mmio.c | 19 ++++++++++++++++--- arch/arm64/kvm/vgic/vgic-mmio.h | 3 +++ 3 files changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v2.c b/arch/arm64/kvm/vgic/vgic-mmio-v2.c index 77a67e9d3d14..e070cda86e12 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio-v2.c +++ b/arch/arm64/kvm/vgic/vgic-mmio-v2.c @@ -429,11 +429,11 @@ static const struct vgic_register_region vgic_v2_dist_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_SET, vgic_mmio_read_pending, vgic_mmio_write_spending, - NULL, vgic_uaccess_write_spending, 1, + vgic_uaccess_read_pending, vgic_uaccess_write_spending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_CLEAR, vgic_mmio_read_pending, vgic_mmio_write_cpending, - NULL, vgic_uaccess_write_cpending, 1, + vgic_uaccess_read_pending, vgic_uaccess_write_cpending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_SET, vgic_mmio_read_active, vgic_mmio_write_sactive, diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c index 49837d3a3ef5..dc8c52487e47 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.c +++ b/arch/arm64/kvm/vgic/vgic-mmio.c @@ -226,8 +226,9 @@ int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, return 0; } -unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, - gpa_t addr, unsigned int len) +static unsigned long __read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + bool is_user) { u32 intid = VGIC_ADDR_TO_INTID(addr, 1); u32 value = 0; @@ -248,7 +249,7 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, IRQCHIP_STATE_PENDING, &val); WARN_RATELIMIT(err, "IRQ %d", irq->host_irq); - } else if (vgic_irq_is_mapped_level(irq)) { + } else if (!is_user && vgic_irq_is_mapped_level(irq)) { val = vgic_get_phys_line_level(irq); } else { val = irq_is_pending(irq); @@ -263,6 +264,18 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, return value; } +unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len) +{ + return __read_pending(vcpu, addr, len, false); +} + +unsigned long vgic_uaccess_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len) +{ + return __read_pending(vcpu, addr, len, true); +} + static bool is_vgic_v2_sgi(struct kvm_vcpu *vcpu, struct vgic_irq *irq) { return (vgic_irq_is_sgi(irq->intid) && diff --git a/arch/arm64/kvm/vgic/vgic-mmio.h b/arch/arm64/kvm/vgic/vgic-mmio.h index 3fa696f198a3..6082d4b66d39 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.h +++ b/arch/arm64/kvm/vgic/vgic-mmio.h @@ -149,6 +149,9 @@ int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len); +unsigned long vgic_uaccess_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len); + void vgic_mmio_write_spending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len, unsigned long val); From 98432ccdec9f178ba041e1e5f9f32dbd71576504 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:26 +0100 Subject: [PATCH 026/298] KVM: arm64: Replace vgic_v3_uaccess_read_pending with vgic_uaccess_read_pending Now that GICv2 has a proper userspace accessor for the pending state, switch GICv3 over to it, dropping the local version, moving over the specific behaviours that CGIv3 requires (such as the distinction between pending latch and line level which were never enforced with GICv2). We also gain extra locking that isn't really necessary for userspace, but that's a small price to pay for getting rid of superfluous code. Signed-off-by: Marc Zyngier Reviewed-by: Eric Auger Link: https://lore.kernel.org/r/20220607131427.1164881-3-maz@kernel.org --- arch/arm64/kvm/vgic/vgic-mmio-v3.c | 40 ++---------------------------- arch/arm64/kvm/vgic/vgic-mmio.c | 21 +++++++++++++++- 2 files changed, 22 insertions(+), 39 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c index f7aa7bcd6fb8..f15e29cc63ce 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c +++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c @@ -353,42 +353,6 @@ static unsigned long vgic_mmio_read_v3_idregs(struct kvm_vcpu *vcpu, return 0; } -static unsigned long vgic_v3_uaccess_read_pending(struct kvm_vcpu *vcpu, - gpa_t addr, unsigned int len) -{ - u32 intid = VGIC_ADDR_TO_INTID(addr, 1); - u32 value = 0; - int i; - - /* - * pending state of interrupt is latched in pending_latch variable. - * Userspace will save and restore pending state and line_level - * separately. - * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.rst - * for handling of ISPENDR and ICPENDR. - */ - for (i = 0; i < len * 8; i++) { - struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); - bool state = irq->pending_latch; - - if (irq->hw && vgic_irq_is_sgi(irq->intid)) { - int err; - - err = irq_get_irqchip_state(irq->host_irq, - IRQCHIP_STATE_PENDING, - &state); - WARN_ON(err); - } - - if (state) - value |= (1U << i); - - vgic_put_irq(vcpu->kvm, irq); - } - - return value; -} - static int vgic_v3_uaccess_write_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len, unsigned long val) @@ -666,7 +630,7 @@ static const struct vgic_register_region vgic_v3_dist_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISPENDR, vgic_mmio_read_pending, vgic_mmio_write_spending, - vgic_v3_uaccess_read_pending, vgic_v3_uaccess_write_pending, 1, + vgic_uaccess_read_pending, vgic_v3_uaccess_write_pending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ICPENDR, vgic_mmio_read_pending, vgic_mmio_write_cpending, @@ -750,7 +714,7 @@ static const struct vgic_register_region vgic_v3_rd_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ISPENDR0, vgic_mmio_read_pending, vgic_mmio_write_spending, - vgic_v3_uaccess_read_pending, vgic_v3_uaccess_write_pending, 4, + vgic_uaccess_read_pending, vgic_v3_uaccess_write_pending, 4, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ICPENDR0, vgic_mmio_read_pending, vgic_mmio_write_cpending, diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c index dc8c52487e47..997d0fce2088 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.c +++ b/arch/arm64/kvm/vgic/vgic-mmio.c @@ -240,6 +240,15 @@ static unsigned long __read_pending(struct kvm_vcpu *vcpu, unsigned long flags; bool val; + /* + * When used from userspace with a GICv3 model: + * + * Pending state of interrupt is latched in pending_latch + * variable. Userspace will save and restore pending state + * and line_level separately. + * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.rst + * for handling of ISPENDR and ICPENDR. + */ raw_spin_lock_irqsave(&irq->irq_lock, flags); if (irq->hw && vgic_irq_is_sgi(irq->intid)) { int err; @@ -252,7 +261,17 @@ static unsigned long __read_pending(struct kvm_vcpu *vcpu, } else if (!is_user && vgic_irq_is_mapped_level(irq)) { val = vgic_get_phys_line_level(irq); } else { - val = irq_is_pending(irq); + switch (vcpu->kvm->arch.vgic.vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V3: + if (is_user) { + val = irq->pending_latch; + break; + } + fallthrough; + default: + val = irq_is_pending(irq); + break; + } } value |= ((u32)val << i); From efedd01de475e126e43a07d0b1221bb65e497163 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:27 +0100 Subject: [PATCH 027/298] KVM: arm64: Warn if accessing timer pending state outside of vcpu context A recurrent bug in the KVM/arm64 code base consists in trying to access the timer pending state outside of the vcpu context, which makes zero sense (the pending state only exists when the vcpu is loaded). In order to avoid more embarassing crashes and catch the offenders red-handed, add a warning to kvm_arch_timer_get_input_level() and return the state as non-pending. This avoids taking the system down, and still helps tracking down silly bugs. Reviewed-by: Eric Auger Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220607131427.1164881-4-maz@kernel.org --- arch/arm64/kvm/arch_timer.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 4e39ace073af..3b8d062e30ea 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -1230,6 +1230,9 @@ bool kvm_arch_timer_get_input_level(int vintid) struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); struct arch_timer_context *timer; + if (WARN(!vcpu, "No vcpu context!\n")) + return false; + if (vintid == vcpu_vtimer(vcpu)->irq.irq) timer = vcpu_vtimer(vcpu); else if (vintid == vcpu_ptimer(vcpu)->irq.irq) From 6640b5df1a38801be6d0595c8cd2177d968d7ee0 Mon Sep 17 00:00:00 2001 From: Saurabh Sengar Date: Fri, 27 May 2022 00:43:59 -0700 Subject: [PATCH 028/298] Drivers: hv: vmbus: Don't assign VMbus channel interrupts to isolated CPUs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When initially assigning a VMbus channel interrupt to a CPU, don’t choose a managed IRQ isolated CPU (as specified on the kernel boot line with parameter 'isolcpus=managed_irq,<#cpu>'). Also, when using sysfs to change the CPU that a VMbus channel will interrupt, don't allow changing to a managed IRQ isolated CPU. Signed-off-by: Saurabh Sengar Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1653637439-23060-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Wei Liu --- drivers/hv/channel_mgmt.c | 17 ++++++++++++----- drivers/hv/vmbus_drv.c | 4 ++++ 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index b60f13481bdc..280b52927758 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "hyperv_vmbus.h" @@ -728,16 +729,20 @@ static void init_vp_index(struct vmbus_channel *channel) u32 i, ncpu = num_online_cpus(); cpumask_var_t available_mask; struct cpumask *allocated_mask; + const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ); u32 target_cpu; int numa_node; if (!perf_chn || - !alloc_cpumask_var(&available_mask, GFP_KERNEL)) { + !alloc_cpumask_var(&available_mask, GFP_KERNEL) || + cpumask_empty(hk_mask)) { /* * If the channel is not a performance critical * channel, bind it to VMBUS_CONNECT_CPU. * In case alloc_cpumask_var() fails, bind it to * VMBUS_CONNECT_CPU. + * If all the cpus are isolated, bind it to + * VMBUS_CONNECT_CPU. */ channel->target_cpu = VMBUS_CONNECT_CPU; if (perf_chn) @@ -758,17 +763,19 @@ static void init_vp_index(struct vmbus_channel *channel) } allocated_mask = &hv_context.hv_numa_map[numa_node]; - if (cpumask_equal(allocated_mask, cpumask_of_node(numa_node))) { +retry: + cpumask_xor(available_mask, allocated_mask, cpumask_of_node(numa_node)); + cpumask_and(available_mask, available_mask, hk_mask); + + if (cpumask_empty(available_mask)) { /* * We have cycled through all the CPUs in the node; * reset the allocated map. */ cpumask_clear(allocated_mask); + goto retry; } - cpumask_xor(available_mask, allocated_mask, - cpumask_of_node(numa_node)); - target_cpu = cpumask_first(available_mask); cpumask_set_cpu(target_cpu, allocated_mask); diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 714d549b7b46..547ae334e5cd 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -1770,6 +1771,9 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel, if (target_cpu >= nr_cpumask_bits) return -EINVAL; + if (!cpumask_test_cpu(target_cpu, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ))) + return -EINVAL; + /* No CPUs should come up or down during this. */ cpus_read_lock(); From 92ec746bcea0c51cd29fb46e510fb71fe15282df Mon Sep 17 00:00:00 2001 From: Xiang wangx Date: Sun, 5 Jun 2022 16:55:24 +0800 Subject: [PATCH 029/298] Drivers: hv: Fix syntax errors in comments Delete the redundant word 'in'. Signed-off-by: Xiang wangx Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20220605085524.11289-1-wangxiang@cdjrlc.com Signed-off-by: Wei Liu --- drivers/hv/hv_kvp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c index c698592b83e4..d35b60c06114 100644 --- a/drivers/hv/hv_kvp.c +++ b/drivers/hv/hv_kvp.c @@ -394,7 +394,7 @@ kvp_send_key(struct work_struct *dummy) in_msg = kvp_transaction.kvp_msg; /* - * The key/value strings sent from the host are encoded in + * The key/value strings sent from the host are encoded * in utf16; convert it to utf8 strings. * The host assures us that the utf16 strings will not exceed * the max lengths specified. We will however, reserve room From 245b993d8f6c4e25f19191edfbd8080b645e12b1 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Jun 2022 14:02:38 +0900 Subject: [PATCH 030/298] clocksource: hyper-v: unexport __init-annotated hv_init_clocksource() EXPORT_SYMBOL and __init is a bad combination because the .init.text section is freed up after the initialization. Hence, modules cannot use symbols annotated __init. The access to a freed symbol may end up with kernel panic. modpost used to detect it, but it has been broken for a decade. Recently, I fixed modpost so it started to warn it again, then this showed up in linux-next builds. There are two ways to fix it: - Remove __init - Remove EXPORT_SYMBOL I chose the latter for this case because the only in-tree call-site, arch/x86/kernel/cpu/mshyperv.c is never compiled as modular. (CONFIG_HYPERVISOR_GUEST is boolean) Fixes: dd2cb348613b ("clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic") Reported-by: Stephen Rothwell Signed-off-by: Masahiro Yamada Reviewed-by: Vitaly Kuznetsov Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20220606050238.4162200-1-masahiroy@kernel.org Signed-off-by: Wei Liu --- drivers/clocksource/hyperv_timer.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c index ff188ab68496..bb47610bbd1c 100644 --- a/drivers/clocksource/hyperv_timer.c +++ b/drivers/clocksource/hyperv_timer.c @@ -565,4 +565,3 @@ void __init hv_init_clocksource(void) hv_sched_clock_offset = hv_read_reference_counter(); hv_setup_sched_clock(read_hv_sched_clock_msr); } -EXPORT_SYMBOL_GPL(hv_init_clocksource); From f5f93d7f5a5cbfef02609dead21e7056e83f4fab Mon Sep 17 00:00:00 2001 From: Michael Kelley Date: Tue, 7 Jun 2022 20:49:37 -0700 Subject: [PATCH 031/298] HID: hyperv: Correctly access fields declared as __le16 Add the use of le16_to_cpu() for fields declared as __le16. Because Hyper-V only runs in Little Endian mode, there's no actual bug. The change is made in the interest of general correctness in addition to making sparse happy. No functional change. Reported-by: kernel test robot Signed-off-by: Michael Kelley Link: https://lore.kernel.org/r/1654660177-115463-1-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu --- drivers/hid/hid-hyperv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/hid/hid-hyperv.c b/drivers/hid/hid-hyperv.c index 978ee2aab2d4..e0bc73124196 100644 --- a/drivers/hid/hid-hyperv.c +++ b/drivers/hid/hid-hyperv.c @@ -199,7 +199,8 @@ static void mousevsc_on_receive_device_info(struct mousevsc_dev *input_device, if (!input_device->hid_desc) goto cleanup; - input_device->report_desc_size = desc->desc[0].wDescriptorLength; + input_device->report_desc_size = le16_to_cpu( + desc->desc[0].wDescriptorLength); if (input_device->report_desc_size == 0) { input_device->dev_info_status = -EINVAL; goto cleanup; @@ -217,7 +218,7 @@ static void mousevsc_on_receive_device_info(struct mousevsc_dev *input_device, memcpy(input_device->report_desc, ((unsigned char *)desc) + desc->bLength, - desc->desc[0].wDescriptorLength); + le16_to_cpu(desc->desc[0].wDescriptorLength)); /* Send the ack */ memset(&ack, 0, sizeof(struct mousevsc_prt_msg)); From 8c4811e7a5a60443139369a623ca504bad9e3675 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 30 May 2022 15:02:47 +0300 Subject: [PATCH 032/298] MAINTAINERS: Update Synopsys DesignWare I2C to Supported The actual status of the code is Supported (from x86 perspective). Reported-by: dave.hansen@linux.intel.com Signed-off-by: Andy Shevchenko Acked-by: Jarkko Nikula [wsa: fixed "DesignWare" spelling] Signed-off-by: Wolfram Sang --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index a6d3bd9d2a8d..cb2342ce3b55 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19288,7 +19288,7 @@ R: Andy Shevchenko R: Mika Westerberg R: Jan Dabros L: linux-i2c@vger.kernel.org -S: Maintained +S: Supported F: drivers/i2c/busses/i2c-designware-* SYNOPSYS DESIGNWARE MMC/SD/SDIO DRIVER From 6ba12b56b9b844b83ed54fb7ed59fb0eb41e4045 Mon Sep 17 00:00:00 2001 From: Jiasheng Jiang Date: Thu, 26 May 2022 17:41:00 +0800 Subject: [PATCH 033/298] i2c: npcm7xx: Add check for platform_driver_register As platform_driver_register() could fail, it should be better to deal with the return value in order to maintain the code consisitency. Fixes: 56a1485b102e ("i2c: npcm7xx: Add Nuvoton NPCM I2C controller driver") Signed-off-by: Jiasheng Jiang Acked-by: Tali Perry Signed-off-by: Wolfram Sang --- drivers/i2c/busses/i2c-npcm7xx.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/i2c/busses/i2c-npcm7xx.c b/drivers/i2c/busses/i2c-npcm7xx.c index 5960ccde6574..aede9d551130 100644 --- a/drivers/i2c/busses/i2c-npcm7xx.c +++ b/drivers/i2c/busses/i2c-npcm7xx.c @@ -2372,8 +2372,7 @@ static struct platform_driver npcm_i2c_bus_driver = { static int __init npcm_i2c_init(void) { npcm_i2c_debugfs_dir = debugfs_create_dir("npcm_i2c", NULL); - platform_driver_register(&npcm_i2c_bus_driver); - return 0; + return platform_driver_register(&npcm_i2c_bus_driver); } module_init(npcm_i2c_init); From ea6c1213217dec65a8f9f396752b4d8bbcf226ea Mon Sep 17 00:00:00 2001 From: Julia Lawall Date: Thu, 9 Jun 2022 09:18:15 +0530 Subject: [PATCH 034/298] RISC-V: KVM: fix typos in comments Various spelling mistakes in comments. Detected with the help of Coccinelle. Signed-off-by: Julia Lawall Signed-off-by: Anup Patel --- arch/riscv/kvm/vmid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c index 9f764df125db..6cd93995fb65 100644 --- a/arch/riscv/kvm/vmid.c +++ b/arch/riscv/kvm/vmid.c @@ -97,7 +97,7 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu) * We ran out of VMIDs so we increment vmid_version and * start assigning VMIDs from 1. * - * This also means existing VMIDs assignement to all Guest + * This also means existing VMIDs assignment to all Guest * instances is invalid and we have force VMID re-assignement * for all Guest instances. The Guest instances that were not * running will automatically pick-up new VMIDs because will From 1a12b25274b9e54b0d2d59e21620f8cf13b268cb Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Thu, 9 Jun 2022 09:18:22 +0530 Subject: [PATCH 035/298] MAINTAINERS: Limit KVM RISC-V entry to existing selftests Commit fed9b26b2501 ("MAINTAINERS: Update KVM RISC-V entry to cover selftests support") optimistically adds a file entry for tools/testing/selftests/kvm/riscv/, but this directory does not exist. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a broken reference. The script is very useful to keep MAINTAINERS up to date and MAINTAINERS can be kept in a state where the script emits no warning. So, just drop the non-matching file entry rather than starting to collect exceptions of entries that may match in some close or distant future. Fixes: fed9b26b2501 ("MAINTAINERS: Update KVM RISC-V entry to cover selftests support") Signed-off-by: Lukas Bulwahn Signed-off-by: Anup Patel --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index a6d3bd9d2a8d..e549a84e21c8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10863,7 +10863,6 @@ F: arch/riscv/include/asm/kvm* F: arch/riscv/include/uapi/asm/kvm* F: arch/riscv/kvm/ F: tools/testing/selftests/kvm/*/riscv/ -F: tools/testing/selftests/kvm/riscv/ KERNEL VIRTUAL MACHINE for s390 (KVM/s390) M: Christian Borntraeger From b6c8cd80ace30f308aeec0ecf946f55dec60cc68 Mon Sep 17 00:00:00 2001 From: Guenter Roeck Date: Fri, 3 Jun 2022 06:14:19 -0700 Subject: [PATCH 036/298] watchdog: gxp: Add missing MODULE_LICENSE The build system says: ERROR: modpost: missing MODULE_LICENSE() in drivers/watchdog/gxp-wdt.o Add the missing MODULE_LICENSE. Signed-off-by: Nick Hawkins Signed-off-by: Arnd Bergmann Link: https://lore.kernel.org/all/20220603131419.2948578-1-linux@roeck-us.net/ Signed-off-by: Guenter Roeck Signed-off-by: Wim Van Sebroeck --- drivers/watchdog/gxp-wdt.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/watchdog/gxp-wdt.c b/drivers/watchdog/gxp-wdt.c index b0b2d7a6fdde..2fd85be88278 100644 --- a/drivers/watchdog/gxp-wdt.c +++ b/drivers/watchdog/gxp-wdt.c @@ -172,3 +172,4 @@ module_platform_driver(gxp_wdt_driver); MODULE_AUTHOR("Nick Hawkins "); MODULE_AUTHOR("Jean-Marie Verdun "); MODULE_DESCRIPTION("Driver for GXP watchdog timer"); +MODULE_LICENSE("GPL"); From 908e698f2149c3d6a67d9ae15c75545a3f392559 Mon Sep 17 00:00:00 2001 From: Robert Eckelmann Date: Sat, 21 May 2022 23:08:08 +0900 Subject: [PATCH 037/298] USB: serial: io_ti: add Agilent E5805A support Add support for Agilent E5805A (rebranded ION Edgeport/4) to io_ti. Signed-off-by: Robert Eckelmann Link: https://lore.kernel.org/r/20220521230808.30931eca@octoberrain Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold --- drivers/usb/serial/io_ti.c | 2 ++ drivers/usb/serial/io_usbvend.h | 1 + 2 files changed, 3 insertions(+) diff --git a/drivers/usb/serial/io_ti.c b/drivers/usb/serial/io_ti.c index a7b3c15957ba..feba2a8d1233 100644 --- a/drivers/usb/serial/io_ti.c +++ b/drivers/usb/serial/io_ti.c @@ -166,6 +166,7 @@ static const struct usb_device_id edgeport_2port_id_table[] = { { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_8S) }, { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_416) }, { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_416B) }, + { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_E5805A) }, { } }; @@ -204,6 +205,7 @@ static const struct usb_device_id id_table_combined[] = { { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_8S) }, { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_416) }, { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_TI_EDGEPORT_416B) }, + { USB_DEVICE(USB_VENDOR_ID_ION, ION_DEVICE_ID_E5805A) }, { } }; diff --git a/drivers/usb/serial/io_usbvend.h b/drivers/usb/serial/io_usbvend.h index 52cbc353051f..9a6f742ad3ab 100644 --- a/drivers/usb/serial/io_usbvend.h +++ b/drivers/usb/serial/io_usbvend.h @@ -212,6 +212,7 @@ // // Definitions for other product IDs #define ION_DEVICE_ID_MT4X56USB 0x1403 // OEM device +#define ION_DEVICE_ID_E5805A 0x1A01 // OEM device (rebranded Edgeport/4) #define GENERATION_ID_FROM_USB_PRODUCT_ID(ProductId) \ From ae187fec75aa670a551d9662f83e3947d3f02a69 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:18 +0100 Subject: [PATCH 038/298] KVM: arm64: Return error from kvm_arch_init_vm() on allocation failure If we fail to allocate the 'supported_cpus' cpumask in kvm_arch_init_vm() then be sure to return -ENOMEM instead of success (0) on the failure path. Reviewed-by: Alexandru Elisei Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-2-will@kernel.org --- arch/arm64/kvm/arm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 400bb0fe2745..0da0f06037db 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -150,8 +150,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) if (ret) goto out_free_stage2_pgd; - if (!zalloc_cpumask_var(&kvm->arch.supported_cpus, GFP_KERNEL)) + if (!zalloc_cpumask_var(&kvm->arch.supported_cpus, GFP_KERNEL)) { + ret = -ENOMEM; goto out_free_stage2_pgd; + } cpumask_copy(kvm->arch.supported_cpus, cpu_possible_mask); kvm_vgic_early_init(kvm); From fa7a17214488ef7df347dcd1a5594f69ea17f4dc Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Thu, 9 Jun 2022 13:12:19 +0100 Subject: [PATCH 039/298] KVM: arm64: Handle all ID registers trapped for a protected VM A protected VM accessing ID_AA64ISAR2_EL1 gets punished with an UNDEF, while it really should only get a zero back if the register is not handled by the hypervisor emulation (as mandated by the architecture). Introduce all the missing ID registers (including the unallocated ones), and have them to return 0. Reported-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-3-will@kernel.org --- arch/arm64/kvm/hyp/nvhe/sys_regs.c | 42 ++++++++++++++++++++++++------ 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c index b6d86e423319..35a4331ba5f3 100644 --- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c +++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c @@ -243,15 +243,9 @@ u64 pvm_read_id_reg(const struct kvm_vcpu *vcpu, u32 id) case SYS_ID_AA64MMFR2_EL1: return get_pvm_id_aa64mmfr2(vcpu); default: - /* - * Should never happen because all cases are covered in - * pvm_sys_reg_descs[]. - */ - WARN_ON(1); - break; + /* Unhandled ID register, RAZ */ + return 0; } - - return 0; } static u64 read_id_reg(const struct kvm_vcpu *vcpu, @@ -332,6 +326,16 @@ static bool pvm_gic_read_sre(struct kvm_vcpu *vcpu, /* Mark the specified system register as an AArch64 feature id register. */ #define AARCH64(REG) { SYS_DESC(REG), .access = pvm_access_id_aarch64 } +/* + * sys_reg_desc initialiser for architecturally unallocated cpufeature ID + * register with encoding Op0=3, Op1=0, CRn=0, CRm=crm, Op2=op2 + * (1 <= crm < 8, 0 <= Op2 < 8). + */ +#define ID_UNALLOCATED(crm, op2) { \ + Op0(3), Op1(0), CRn(0), CRm(crm), Op2(op2), \ + .access = pvm_access_id_aarch64, \ +} + /* Mark the specified system register as Read-As-Zero/Write-Ignored */ #define RAZ_WI(REG) { SYS_DESC(REG), .access = pvm_access_raz_wi } @@ -375,24 +379,46 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = { AARCH32(SYS_MVFR0_EL1), AARCH32(SYS_MVFR1_EL1), AARCH32(SYS_MVFR2_EL1), + ID_UNALLOCATED(3,3), AARCH32(SYS_ID_PFR2_EL1), AARCH32(SYS_ID_DFR1_EL1), AARCH32(SYS_ID_MMFR5_EL1), + ID_UNALLOCATED(3,7), /* AArch64 ID registers */ /* CRm=4 */ AARCH64(SYS_ID_AA64PFR0_EL1), AARCH64(SYS_ID_AA64PFR1_EL1), + ID_UNALLOCATED(4,2), + ID_UNALLOCATED(4,3), AARCH64(SYS_ID_AA64ZFR0_EL1), + ID_UNALLOCATED(4,5), + ID_UNALLOCATED(4,6), + ID_UNALLOCATED(4,7), AARCH64(SYS_ID_AA64DFR0_EL1), AARCH64(SYS_ID_AA64DFR1_EL1), + ID_UNALLOCATED(5,2), + ID_UNALLOCATED(5,3), AARCH64(SYS_ID_AA64AFR0_EL1), AARCH64(SYS_ID_AA64AFR1_EL1), + ID_UNALLOCATED(5,6), + ID_UNALLOCATED(5,7), AARCH64(SYS_ID_AA64ISAR0_EL1), AARCH64(SYS_ID_AA64ISAR1_EL1), + AARCH64(SYS_ID_AA64ISAR2_EL1), + ID_UNALLOCATED(6,3), + ID_UNALLOCATED(6,4), + ID_UNALLOCATED(6,5), + ID_UNALLOCATED(6,6), + ID_UNALLOCATED(6,7), AARCH64(SYS_ID_AA64MMFR0_EL1), AARCH64(SYS_ID_AA64MMFR1_EL1), AARCH64(SYS_ID_AA64MMFR2_EL1), + ID_UNALLOCATED(7,3), + ID_UNALLOCATED(7,4), + ID_UNALLOCATED(7,5), + ID_UNALLOCATED(7,6), + ID_UNALLOCATED(7,7), /* Scalable Vector Registers are restricted. */ From cde5042adf11b0a30a6ce0ec3d071afcf8d2efaf Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:20 +0100 Subject: [PATCH 040/298] KVM: arm64: Ignore 'kvm-arm.mode=protected' when using VHE Ignore 'kvm-arm.mode=protected' when using VHE so that kvm_get_mode() only returns KVM_MODE_PROTECTED on systems where the feature is available. Cc: David Brazdil Acked-by: Mark Rutland Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-4-will@kernel.org --- Documentation/admin-guide/kernel-parameters.txt | 1 - arch/arm64/kernel/cpufeature.c | 10 +--------- arch/arm64/kvm/arm.c | 6 +++++- 3 files changed, 6 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 8090130b544b..97c16aa2f53f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2469,7 +2469,6 @@ protected: nVHE-based mode with support for guests whose state is kept private from the host. - Not valid if the kernel is running in EL2. Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 42ea2bd856c6..79fac13ab2ef 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1974,15 +1974,7 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap) #ifdef CONFIG_KVM static bool is_kvm_protected_mode(const struct arm64_cpu_capabilities *entry, int __unused) { - if (kvm_get_mode() != KVM_MODE_PROTECTED) - return false; - - if (is_kernel_in_hyp_mode()) { - pr_warn("Protected KVM not available with VHE\n"); - return false; - } - - return true; + return kvm_get_mode() == KVM_MODE_PROTECTED; } #endif /* CONFIG_KVM */ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 0da0f06037db..a0188144a122 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -2273,7 +2273,11 @@ static int __init early_kvm_mode_cfg(char *arg) return -EINVAL; if (strcmp(arg, "protected") == 0) { - kvm_mode = KVM_MODE_PROTECTED; + if (!is_kernel_in_hyp_mode()) + kvm_mode = KVM_MODE_PROTECTED; + else + pr_warn_once("Protected KVM not available with VHE\n"); + return 0; } From 112f3bab41113dc53b4f35e9034b2208245bc002 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:21 +0100 Subject: [PATCH 041/298] KVM: arm64: Extend comment in has_vhe() has_vhe() expands to a compile-time constant when evaluated from the VHE or nVHE code, alternatively checking a static key when called from elsewhere in the kernel. On face value, this looks like a case of premature optimization, but in fact this allows symbol references on VHE-specific code paths to be dropped from the nVHE object. Expand the comment in has_vhe() to make this clearer, hopefully discouraging anybody from simplifying the code. Cc: David Brazdil Acked-by: Mark Rutland Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-5-will@kernel.org --- arch/arm64/include/asm/virt.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h index 3c8af033a997..0e80db4327b6 100644 --- a/arch/arm64/include/asm/virt.h +++ b/arch/arm64/include/asm/virt.h @@ -113,6 +113,9 @@ static __always_inline bool has_vhe(void) /* * Code only run in VHE/NVHE hyp context can assume VHE is present or * absent. Otherwise fall back to caps. + * This allows the compiler to discard VHE-specific code from the + * nVHE object, reducing the number of external symbol references + * needed to link. */ if (is_vhe_hyp_code()) return true; From 5879c97f37022ff22a3f13174c24fcf2807fdbc0 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:22 +0100 Subject: [PATCH 042/298] KVM: arm64: Remove redundant hyp_assert_lock_held() assertions host_stage2_try() asserts that the KVM host lock is held, so there's no need to duplicate the assertion in its wrappers. Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-6-will@kernel.org --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 78edf077fa3b..1e78acf9662e 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -314,15 +314,11 @@ static int host_stage2_adjust_range(u64 addr, struct kvm_mem_range *range) int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot) { - hyp_assert_lock_held(&host_kvm.lock); - return host_stage2_try(__host_stage2_idmap, addr, addr + size, prot); } int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) { - hyp_assert_lock_held(&host_kvm.lock); - return host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt, addr, size, &host_s2_pool, owner_id); } From bcbfb588cf323929ac46767dd14e392016bbce04 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Thu, 9 Jun 2022 13:12:23 +0100 Subject: [PATCH 043/298] KVM: arm64: Drop stale comment The layout of 'struct kvm_vcpu_arch' has evolved significantly since the initial port of KVM/arm64, so remove the stale comment suggesting that a prefix of the structure is used exclusively from assembly code. Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-7-will@kernel.org --- arch/arm64/include/asm/kvm_host.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 47a1e25e25bb..de32152cea04 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -362,11 +362,6 @@ struct kvm_vcpu_arch { struct arch_timer_cpu timer_cpu; struct kvm_pmu pmu; - /* - * Anything that is not used directly from assembly code goes - * here. - */ - /* * Guest registers we preserve during guest debugging. * From 158f7585bfcea4aae0ad4128d032a80fec550df1 Mon Sep 17 00:00:00 2001 From: Slark Xiao Date: Wed, 1 Jun 2022 11:47:40 +0800 Subject: [PATCH 044/298] USB: serial: option: add support for Cinterion MV31 with new baseline Adding support for Cinterion device MV31 with Qualcomm new baseline. Use different PIDs to separate it from previous base line products. All interfaces settings keep same as previous. Below is test evidence: T: Bus=03 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 6 Spd=480 MxCh= 0 D: Ver= 2.10 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1e2d ProdID=00b8 Rev=04.14 S: Manufacturer=Cinterion S: Product=Cinterion PID 0x00B8 USB Mobile Broadband S: SerialNumber=90418e79 C: #Ifs= 6 Cfg#= 1 Atr=a0 MxPwr=500mA I: If#=0x0 Alt= 0 #EPs= 1 Cls=02(commc) Sub=0e Prot=00 Driver=cdc_mbim I: If#=0x1 Alt= 1 #EPs= 2 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim I: If#=0x2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option I: If#=0x3 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none) I: If#=0x4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=60 Driver=option I: If#=0x5 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option T: Bus=03 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 7 Spd=480 MxCh= 0 D: Ver= 2.10 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1e2d ProdID=00b9 Rev=04.14 S: Manufacturer=Cinterion S: Product=Cinterion PID 0x00B9 USB Mobile Broadband S: SerialNumber=90418e79 C: #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA I: If#=0x0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan I: If#=0x1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option I: If#=0x2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=60 Driver=option I: If#=0x3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option For PID 00b8, interface 3 is GNSS port which don't use serial driver. Signed-off-by: Slark Xiao Link: https://lore.kernel.org/r/20220601034740.5438-1-slark_xiao@163.com [ johan: rename defines using a "2" infix ] Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold --- drivers/usb/serial/option.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c index e60425bbf537..ed1e50d83cca 100644 --- a/drivers/usb/serial/option.c +++ b/drivers/usb/serial/option.c @@ -432,6 +432,8 @@ static void option_instat_callback(struct urb *urb); #define CINTERION_PRODUCT_CLS8 0x00b0 #define CINTERION_PRODUCT_MV31_MBIM 0x00b3 #define CINTERION_PRODUCT_MV31_RMNET 0x00b7 +#define CINTERION_PRODUCT_MV31_2_MBIM 0x00b8 +#define CINTERION_PRODUCT_MV31_2_RMNET 0x00b9 #define CINTERION_PRODUCT_MV32_WA 0x00f1 #define CINTERION_PRODUCT_MV32_WB 0x00f2 @@ -1979,6 +1981,10 @@ static const struct usb_device_id option_ids[] = { .driver_info = RSVD(3)}, { USB_DEVICE_INTERFACE_CLASS(CINTERION_VENDOR_ID, CINTERION_PRODUCT_MV31_RMNET, 0xff), .driver_info = RSVD(0)}, + { USB_DEVICE_INTERFACE_CLASS(CINTERION_VENDOR_ID, CINTERION_PRODUCT_MV31_2_MBIM, 0xff), + .driver_info = RSVD(3)}, + { USB_DEVICE_INTERFACE_CLASS(CINTERION_VENDOR_ID, CINTERION_PRODUCT_MV31_2_RMNET, 0xff), + .driver_info = RSVD(0)}, { USB_DEVICE_INTERFACE_CLASS(CINTERION_VENDOR_ID, CINTERION_PRODUCT_MV32_WA, 0xff), .driver_info = RSVD(3)}, { USB_DEVICE_INTERFACE_CLASS(CINTERION_VENDOR_ID, CINTERION_PRODUCT_MV32_WB, 0xff), From 4527d47bb63a134c4483a1a478d0ff5874b466c7 Mon Sep 17 00:00:00 2001 From: "GONG, Ruiqi" Date: Tue, 7 Jun 2022 19:08:48 +0800 Subject: [PATCH 045/298] drm/atomic: fix warning of unused variable Fix the `unused-but-set-variable` warning as how other iteration wrappers do. Link: https://lore.kernel.org/all/202206071049.pofHsRih-lkp@intel.com/ Reported-by: kernel test robot Signed-off-by: GONG, Ruiqi Signed-off-by: Maxime Ripard Link: https://patchwork.freedesktop.org/patch/msgid/20220607110848.941486-1-gongruiqi1@huawei.com --- include/drm/drm_atomic.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h index 0777725085df..10b1990bc1f6 100644 --- a/include/drm/drm_atomic.h +++ b/include/drm/drm_atomic.h @@ -1022,6 +1022,7 @@ void drm_state_dump(struct drm_device *dev, struct drm_printer *p); for ((__i) = 0; \ (__i) < (__state)->num_private_objs && \ ((obj) = (__state)->private_objs[__i].ptr, \ + (void)(obj) /* Only to avoid unused-but-set-variable warning */, \ (new_obj_state) = (__state)->private_objs[__i].new_state, 1); \ (__i)++) From d2263de1372a452cb64666990043b8be5c40b2a1 Mon Sep 17 00:00:00 2001 From: Yuan Yao Date: Wed, 8 Jun 2022 09:20:15 +0800 Subject: [PATCH 046/298] KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e54f1ff244ac ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao Reviewed-by: Kai Huang Message-Id: <20220608012015.19566-1-yuan.yao@intel.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e826ee9138fa..17252f39bd7c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3411,7 +3411,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i << 30, PT32_ROOT_LEVEL, true); mmu->pae_root[i] = root | PT_PRESENT_MASK | - shadow_me_mask; + shadow_me_value; } mmu->root.hpa = __pa(mmu->pae_root); } else { From a9603ae0e4ee6e7de0184801d4abe5925f43b49c Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:23 +0300 Subject: [PATCH 047/298] KVM: x86: document AVIC/APICv inhibit reasons These days there are too many AVIC/APICv inhibit reasons, and it doesn't hurt to have some documentation for them. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 65 ++++++++++++++++++++++++++++++--- 1 file changed, 60 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3a240a64ac68..1f9e47b895cf 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1047,14 +1047,69 @@ struct kvm_x86_msr_filter { }; enum kvm_apicv_inhibit { + + /********************************************************************/ + /* INHIBITs that are relevant to both Intel's APICv and AMD's AVIC. */ + /********************************************************************/ + + /* + * APIC acceleration is disabled by a module parameter + * and/or not supported in hardware. + */ APICV_INHIBIT_REASON_DISABLE, + + /* + * APIC acceleration is inhibited because AutoEOI feature is + * being used by a HyperV guest. + */ APICV_INHIBIT_REASON_HYPERV, - APICV_INHIBIT_REASON_NESTED, - APICV_INHIBIT_REASON_IRQWIN, - APICV_INHIBIT_REASON_PIT_REINJ, - APICV_INHIBIT_REASON_X2APIC, - APICV_INHIBIT_REASON_BLOCKIRQ, + + /* + * APIC acceleration is inhibited because the userspace didn't yet + * enable the kernel/split irqchip. + */ APICV_INHIBIT_REASON_ABSENT, + + /* APIC acceleration is inhibited because KVM_GUESTDBG_BLOCKIRQ + * (out of band, debug measure of blocking all interrupts on this vCPU) + * was enabled, to avoid AVIC/APICv bypassing it. + */ + APICV_INHIBIT_REASON_BLOCKIRQ, + + /******************************************************/ + /* INHIBITs that are relevant only to the AMD's AVIC. */ + /******************************************************/ + + /* + * AVIC is inhibited on a vCPU because it runs a nested guest. + * + * This is needed because unlike APICv, the peers of this vCPU + * cannot use the doorbell mechanism to signal interrupts via AVIC when + * a vCPU runs nested. + */ + APICV_INHIBIT_REASON_NESTED, + + /* + * On SVM, the wait for the IRQ window is implemented with pending vIRQ, + * which cannot be injected when the AVIC is enabled, thus AVIC + * is inhibited while KVM waits for IRQ window. + */ + APICV_INHIBIT_REASON_IRQWIN, + + /* + * PIT (i8254) 're-inject' mode, relies on EOI intercept, + * which AVIC doesn't support for edge triggered interrupts. + */ + APICV_INHIBIT_REASON_PIT_REINJ, + + /* + * AVIC is inhibited because the guest has x2apic in its CPUID. + */ + APICV_INHIBIT_REASON_X2APIC, + + /* + * AVIC is disabled because SEV doesn't support it. + */ APICV_INHIBIT_REASON_SEV, }; From 3743c2f0251743b8ae968329708bbbeefff244cf Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:24 +0300 Subject: [PATCH 048/298] KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base Neither of these settings should be changed by the guest and it is a burden to support it in the acceleration code, so just inhibit this code instead. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 8 ++++++++ arch/x86/kvm/lapic.c | 27 +++++++++++++++++++++++---- arch/x86/kvm/svm/avic.c | 4 +++- arch/x86/kvm/vmx/vmx.c | 4 +++- 4 files changed, 37 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1f9e47b895cf..9217bd6cf0d1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1076,6 +1076,14 @@ enum kvm_apicv_inhibit { */ APICV_INHIBIT_REASON_BLOCKIRQ, + /* + * For simplicity, the APIC acceleration is inhibited + * first time either APIC ID or APIC base are changed by the guest + * from their reset values. + */ + APICV_INHIBIT_REASON_APIC_ID_MODIFIED, + APICV_INHIBIT_REASON_APIC_BASE_MODIFIED, + /******************************************************/ /* INHIBITs that are relevant only to the AMD's AVIC. */ /******************************************************/ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index f1bdac3f5aa8..0e68b4c937fc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2039,6 +2039,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val) } } +static void kvm_lapic_xapic_id_updated(struct kvm_lapic *apic) +{ + struct kvm *kvm = apic->vcpu->kvm; + + if (KVM_BUG_ON(apic_x2apic_mode(apic), kvm)) + return; + + if (kvm_xapic_id(apic) == apic->vcpu->vcpu_id) + return; + + kvm_set_apicv_inhibit(apic->vcpu->kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED); +} + static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) { int ret = 0; @@ -2047,10 +2060,12 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) switch (reg) { case APIC_ID: /* Local APIC ID */ - if (!apic_x2apic_mode(apic)) + if (!apic_x2apic_mode(apic)) { kvm_apic_set_xapic_id(apic, val >> 24); - else + kvm_lapic_xapic_id_updated(apic); + } else { ret = 1; + } break; case APIC_TASKPRI: @@ -2336,8 +2351,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) MSR_IA32_APICBASE_BASE; if ((value & MSR_IA32_APICBASE_ENABLE) && - apic->base_address != APIC_DEFAULT_PHYS_BASE) - pr_warn_once("APIC base relocation is unsupported by KVM"); + apic->base_address != APIC_DEFAULT_PHYS_BASE) { + kvm_set_apicv_inhibit(apic->vcpu->kvm, + APICV_INHIBIT_REASON_APIC_BASE_MODIFIED); + } } void kvm_apic_update_apicv(struct kvm_vcpu *vcpu) @@ -2648,6 +2665,8 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu, icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR); __kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32); } + } else { + kvm_lapic_xapic_id_updated(vcpu->arch.apic); } return 0; diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 54fe03714f8a..8dffd67f6086 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -910,7 +910,9 @@ bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason) BIT(APICV_INHIBIT_REASON_PIT_REINJ) | BIT(APICV_INHIBIT_REASON_X2APIC) | BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | - BIT(APICV_INHIBIT_REASON_SEV); + BIT(APICV_INHIBIT_REASON_SEV | + BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED)); return supported & BIT(reason); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9bd86ecccdab..553dd2317b9c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7709,7 +7709,9 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason) ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | BIT(APICV_INHIBIT_REASON_ABSENT) | BIT(APICV_INHIBIT_REASON_HYPERV) | - BIT(APICV_INHIBIT_REASON_BLOCKIRQ); + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | + BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED); return supported & BIT(reason); } From f5f9089f76ddc882b915c5d78e4beeb48dcabd1b Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:25 +0300 Subject: [PATCH 049/298] KVM: x86: SVM: remove avic's broken code that updated APIC ID AVIC is now inhibited if the guest changes the apic id, and therefore this code is no longer needed. There are several ways this code was broken, including: 1. a vCPU was only allowed to change its apic id to an apic id of an existing vCPU. 2. After such change, the vCPU whose apic id entry was overwritten, could not correctly change its own apic id, because its own entry is already overwritten. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 35 ----------------------------------- 1 file changed, 35 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 8dffd67f6086..072e2c8cc66a 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -508,35 +508,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) return ret; } -static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) -{ - u64 *old, *new; - struct vcpu_svm *svm = to_svm(vcpu); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (vcpu->vcpu_id == id) - return 0; - - old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); - new = avic_get_physical_id_entry(vcpu, id); - if (!new || !old) - return 1; - - /* We need to move physical_id_entry to new offset */ - *new = *old; - *old = 0ULL; - to_svm(vcpu)->avic_physical_id_cache = new; - - /* - * Also update the guest physical APIC ID in the logical - * APIC ID table entry if already setup the LDR. - */ - if (svm->ldr_reg) - avic_handle_ldr_update(vcpu); - - return 0; -} - static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -555,10 +526,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu) AVIC_UNACCEL_ACCESS_OFFSET_MASK; switch (offset) { - case APIC_ID: - if (avic_handle_apic_id_update(vcpu)) - return 0; - break; case APIC_LDR: if (avic_handle_ldr_update(vcpu)) return 0; @@ -650,8 +617,6 @@ int avic_init_vcpu(struct vcpu_svm *svm) void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) { - if (avic_handle_apic_id_update(vcpu) != 0) - return; avic_handle_dfr_update(vcpu); avic_handle_ldr_update(vcpu); } From 603ccef42ce9f07840cf4c0448f3261413460b07 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:26 +0300 Subject: [PATCH 050/298] KVM: x86: SVM: fix avic_kick_target_vcpus_fast MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are two issues in avic_kick_target_vcpus_fast 1. It is legal to issue an IPI request with APIC_DEST_NOSHORT and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic), which must be treated as a broadcast destination. Fix this by explicitly checking for it. Also don’t use ‘index’ in this case as it gives no new information. 2. It is legal to issue a logical IPI request to more than one target. Index field only provides index in physical id table of first such target and therefore can't be used before we are sure that only a single target was addressed. Instead, parse the ICRL/ICRH, double check that a unicast interrupt was requested, and use that info to figure out the physical id of the target vCPU. At that point there is no need to use the index field as well. In addition to fixing the above issues, also skip the call to kvm_apic_match_dest. It is possible to do this now, because now as long as AVIC is not inhibited, it is guaranteed that none of the vCPUs changed their apic id from its default value. This fixes boot of windows guest with AVIC enabled because it uses IPI with 0xFF destination and no destination shorthand. Fixes: 7223fd2d5338 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 111 ++++++++++++++++++++++++++-------------- 1 file changed, 72 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 072e2c8cc66a..5d98ac575ded 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -291,58 +291,91 @@ void avic_ring_doorbell(struct kvm_vcpu *vcpu) static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source, u32 icrl, u32 icrh, u32 index) { - u32 dest, apic_id; - struct kvm_vcpu *vcpu; + u32 l1_physical_id, dest; + struct kvm_vcpu *target_vcpu; int dest_mode = icrl & APIC_DEST_MASK; int shorthand = icrl & APIC_SHORT_MASK; struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - u32 *avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page); if (shorthand != APIC_DEST_NOSHORT) return -EINVAL; - /* - * The AVIC incomplete IPI #vmexit info provides index into - * the physical APIC ID table, which can be used to derive - * guest physical APIC ID. - */ - if (dest_mode == APIC_DEST_PHYSICAL) { - apic_id = index; - } else { - if (!apic_x2apic_mode(source)) { - /* For xAPIC logical mode, the index is for logical APIC table. */ - apic_id = avic_logical_id_table[index] & 0x1ff; - } else { - return -EINVAL; - } - } - - /* - * Assuming vcpu ID is the same as physical apic ID, - * and use it to retrieve the target vCPU. - */ - vcpu = kvm_get_vcpu_by_id(kvm, apic_id); - if (!vcpu) - return -EINVAL; - - if (apic_x2apic_mode(vcpu->arch.apic)) + if (apic_x2apic_mode(source)) dest = icrh; else dest = GET_APIC_DEST_FIELD(icrh); - /* - * Try matching the destination APIC ID with the vCPU. - */ - if (kvm_apic_match_dest(vcpu, source, shorthand, dest, dest_mode)) { - vcpu->arch.apic->irr_pending = true; - svm_complete_interrupt_delivery(vcpu, - icrl & APIC_MODE_MASK, - icrl & APIC_INT_LEVELTRIG, - icrl & APIC_VECTOR_MASK); - return 0; + if (dest_mode == APIC_DEST_PHYSICAL) { + /* broadcast destination, use slow path */ + if (apic_x2apic_mode(source) && dest == X2APIC_BROADCAST) + return -EINVAL; + if (!apic_x2apic_mode(source) && dest == APIC_BROADCAST) + return -EINVAL; + + l1_physical_id = dest; + + if (WARN_ON_ONCE(l1_physical_id != index)) + return -EINVAL; + + } else { + u32 bitmap, cluster; + int logid_index; + + if (apic_x2apic_mode(source)) { + /* 16 bit dest mask, 16 bit cluster id */ + bitmap = dest & 0xFFFF0000; + cluster = (dest >> 16) << 4; + } else if (kvm_lapic_get_reg(source, APIC_DFR) == APIC_DFR_FLAT) { + /* 8 bit dest mask*/ + bitmap = dest; + cluster = 0; + } else { + /* 4 bit desk mask, 4 bit cluster id */ + bitmap = dest & 0xF; + cluster = (dest >> 4) << 2; + } + + if (unlikely(!bitmap)) + /* guest bug: nobody to send the logical interrupt to */ + return 0; + + if (!is_power_of_2(bitmap)) + /* multiple logical destinations, use slow path */ + return -EINVAL; + + logid_index = cluster + __ffs(bitmap); + + if (apic_x2apic_mode(source)) { + l1_physical_id = logid_index; + } else { + u32 *avic_logical_id_table = + page_address(kvm_svm->avic_logical_id_table_page); + + u32 logid_entry = avic_logical_id_table[logid_index]; + + if (WARN_ON_ONCE(index != logid_index)) + return -EINVAL; + + /* guest bug: non existing/reserved logical destination */ + if (unlikely(!(logid_entry & AVIC_LOGICAL_ID_ENTRY_VALID_MASK))) + return 0; + + l1_physical_id = logid_entry & + AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; + } } - return -EINVAL; + target_vcpu = kvm_get_vcpu_by_id(kvm, l1_physical_id); + if (unlikely(!target_vcpu)) + /* guest bug: non existing vCPU is a target of this IPI*/ + return 0; + + target_vcpu->arch.apic->irr_pending = true; + svm_complete_interrupt_delivery(target_vcpu, + icrl & APIC_MODE_MASK, + icrl & APIC_INT_LEVELTRIG, + icrl & APIC_VECTOR_MASK); + return 0; } static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source, From 66c768d30e64e1280520f34dbef83419f55f3459 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:27 +0300 Subject: [PATCH 051/298] KVM: x86: disable preemption while updating apicv inhibition Currently nothing prevents preemption in kvm_vcpu_update_apicv. On SVM, If the preemption happens after we update the vcpu->arch.apicv_active, the preemption itself will 'update' the inhibition since the AVIC will be first disabled on vCPU unload and then enabled, when the current task is loaded again. Then we will try to update it again, which will lead to a warning in __avic_vcpu_load, that the AVIC is already enabled. Fix this by disabling preemption in this code. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03fbfbbec460..158b2e135efc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9850,6 +9850,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) return; down_read(&vcpu->kvm->arch.apicv_update_lock); + preempt_disable(); activate = kvm_vcpu_apicv_activated(vcpu); @@ -9870,6 +9871,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) kvm_make_request(KVM_REQ_EVENT, vcpu); out: + preempt_enable(); up_read(&vcpu->kvm->arch.apicv_update_lock); } EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); From 18869f26df1a11ed11031dfb7392bc7d774062e8 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:28 +0300 Subject: [PATCH 052/298] KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking On SVM, if preemption happens right after the call to finish_rcuwait but before call to kvm_arch_vcpu_unblocking on SVM/AVIC, it itself will re-enable AVIC, and then we will try to re-enable it again in kvm_arch_vcpu_unblocking which will lead to a warning in __avic_vcpu_load. The same problem can happen if the vCPU is preempted right after the call to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait and in this case, we will end up with AVIC enabled during sleep - Ooops. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- virt/kvm/kvm_main.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 44c47670447a..a49df8988cd6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3328,9 +3328,11 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu) vcpu->stat.generic.blocking = 1; + preempt_disable(); kvm_arch_vcpu_blocking(vcpu); - prepare_to_rcuwait(wait); + preempt_enable(); + for (;;) { set_current_state(TASK_INTERRUPTIBLE); @@ -3340,9 +3342,11 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu) waited = true; schedule(); } - finish_rcuwait(wait); + preempt_disable(); + finish_rcuwait(wait); kvm_arch_vcpu_unblocking(vcpu); + preempt_enable(); vcpu->stat.generic.blocking = 0; From ba8ec273240a7a67819b5957c8d06a267ec54db7 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:29 +0300 Subject: [PATCH 053/298] KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put Now that these functions are always called with preemption disabled, remove the preempt_disable()/preempt_enable() pair inside them. No functional change intended. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 27 ++++----------------------- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 4 ++-- 3 files changed, 8 insertions(+), 27 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 5d98ac575ded..5542d8959e11 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -946,7 +946,7 @@ out: return ret; } -void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { u64 entry; int h_physical_id = kvm_cpu_get_apicid(cpu); @@ -978,7 +978,7 @@ void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true); } -void __avic_vcpu_put(struct kvm_vcpu *vcpu) +void avic_vcpu_put(struct kvm_vcpu *vcpu) { u64 entry; struct vcpu_svm *svm = to_svm(vcpu); @@ -997,25 +997,6 @@ void __avic_vcpu_put(struct kvm_vcpu *vcpu) WRITE_ONCE(*(svm->avic_physical_id_cache), entry); } -static void avic_vcpu_load(struct kvm_vcpu *vcpu) -{ - int cpu = get_cpu(); - - WARN_ON(cpu != vcpu->cpu); - - __avic_vcpu_load(vcpu, cpu); - - put_cpu(); -} - -static void avic_vcpu_put(struct kvm_vcpu *vcpu) -{ - preempt_disable(); - - __avic_vcpu_put(vcpu); - - preempt_enable(); -} void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { @@ -1042,7 +1023,7 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) vmcb_mark_dirty(vmcb, VMCB_AVIC); if (activated) - avic_vcpu_load(vcpu); + avic_vcpu_load(vcpu, vcpu->cpu); else avic_vcpu_put(vcpu); @@ -1075,5 +1056,5 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu) if (!kvm_vcpu_apicv_active(vcpu)) return; - avic_vcpu_load(vcpu); + avic_vcpu_load(vcpu, vcpu->cpu); } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1dc02cdf6960..1ac66fbceaa1 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1400,13 +1400,13 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) indirect_branch_prediction_barrier(); } if (kvm_vcpu_apicv_active(vcpu)) - __avic_vcpu_load(vcpu, cpu); + avic_vcpu_load(vcpu, cpu); } static void svm_vcpu_put(struct kvm_vcpu *vcpu) { if (kvm_vcpu_apicv_active(vcpu)) - __avic_vcpu_put(vcpu); + avic_vcpu_put(vcpu); svm_prepare_host_switch(vcpu); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 500348c1cb35..1bddd336a27e 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -610,8 +610,8 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb); int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu); int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu); int avic_init_vcpu(struct vcpu_svm *svm); -void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu); -void __avic_vcpu_put(struct kvm_vcpu *vcpu); +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu); +void avic_vcpu_put(struct kvm_vcpu *vcpu); void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu); void avic_set_virtual_apic_mode(struct kvm_vcpu *vcpu); void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu); From e3cdaab5ff022874e65df80ae8b8382ccc0a4fe0 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 31 May 2022 13:57:32 -0400 Subject: [PATCH 054/298] KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE Commit 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") introduced passthrough support for nested pause filtering, (when the host doesn't intercept PAUSE) (either disabled with kvm module param, or disabled with '-overcommit cpu-pm=on') Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards, the feature was exposed as supported by KVM cpuid unconditionally, thus if L1 could try to use it even when the L0 KVM can't really support it. In this case the fallback caused KVM to intercept each PAUSE instruction; in some cases, such intercept can slow down the nested guest so much that it can fail to boot. Instead, before the problematic commit KVM was already setting both thresholds to 0 in vmcb02, but after the first userspace VM exit shrink_ple_window was called and would reset the pause_filter_count to the default value. To fix this, change the fallback strategy - ignore the guest threshold values, but use/update the host threshold values unless the guest specifically requests disabling PAUSE filtering (either simple or advanced). Also fix a minor bug: on nested VM exit, when PAUSE filter counter were copied back to vmcb01, a dirty bit was not set. Thanks a lot to Suravee Suthikulpanit for debugging this! Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Reported-by: Suravee Suthikulpanit Tested-by: Suravee Suthikulpanit Co-developed-by: Maxim Levitsky Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/nested.c | 39 +++++++++++++++++++++------------------ arch/x86/kvm/svm/svm.c | 4 ++-- 2 files changed, 23 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3361258640a2..ba7cd26f438f 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -616,6 +616,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) struct kvm_vcpu *vcpu = &svm->vcpu; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + u32 pause_count12; + u32 pause_thresh12; /* * Filled at exit: exit_code, exit_code_hi, exit_info_1, exit_info_2, @@ -671,27 +673,25 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (!nested_vmcb_needs_vls_intercept(svm)) vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + pause_count12 = svm->pause_filter_enabled ? svm->nested.ctl.pause_filter_count : 0; + pause_thresh12 = svm->pause_threshold_enabled ? svm->nested.ctl.pause_filter_thresh : 0; if (kvm_pause_in_guest(svm->vcpu.kvm)) { - /* use guest values since host doesn't use them */ - vmcb02->control.pause_filter_count = - svm->pause_filter_enabled ? - svm->nested.ctl.pause_filter_count : 0; + /* use guest values since host doesn't intercept PAUSE */ + vmcb02->control.pause_filter_count = pause_count12; + vmcb02->control.pause_filter_thresh = pause_thresh12; - vmcb02->control.pause_filter_thresh = - svm->pause_threshold_enabled ? - svm->nested.ctl.pause_filter_thresh : 0; - - } else if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { - /* use host values when guest doesn't use them */ + } else { + /* start from host values otherwise */ vmcb02->control.pause_filter_count = vmcb01->control.pause_filter_count; vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh; - } else { - /* - * Intercept every PAUSE otherwise and - * ignore both host and guest values - */ - vmcb02->control.pause_filter_count = 0; - vmcb02->control.pause_filter_thresh = 0; + + /* ... but ensure filtering is disabled if so requested. */ + if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { + if (!pause_count12) + vmcb02->control.pause_filter_count = 0; + if (!pause_thresh12) + vmcb02->control.pause_filter_thresh = 0; + } } nested_svm_transition_tlb_flush(vcpu); @@ -951,8 +951,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->control.event_inj = svm->nested.ctl.event_inj; vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; - if (!kvm_pause_in_guest(vcpu->kvm) && vmcb02->control.pause_filter_count) + if (!kvm_pause_in_guest(vcpu->kvm)) { vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count; + vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS); + + } nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1ac66fbceaa1..87da90360bc7 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -921,7 +921,7 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) struct vmcb_control_area *control = &svm->vmcb->control; int old = control->pause_filter_count; - if (kvm_pause_in_guest(vcpu->kvm) || !old) + if (kvm_pause_in_guest(vcpu->kvm)) return; control->pause_filter_count = __grow_ple_window(old, @@ -942,7 +942,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) struct vmcb_control_area *control = &svm->vmcb->control; int old = control->pause_filter_count; - if (kvm_pause_in_guest(vcpu->kvm) || !old) + if (kvm_pause_in_guest(vcpu->kvm)) return; control->pause_filter_count = From 4ee602e78d706e740a48be9b6ddb239df4a113b5 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:39 +0000 Subject: [PATCH 055/298] KVM: selftests: Replace x86_page_size with PG_LEVEL_XX x86_page_size is an enum used to communicate the desired page size with which to map a range of memory. Under the hood they just encode the desired level at which to map the page. This ends up being clunky in a few ways: - The name suggests it encodes the size of the page rather than the level. - In other places in x86_64/processor.c we just use a raw int to encode the level. Simplify this by adopting the kernel style of PG_LEVEL_XX enums and pass around raw ints when referring to the level. This makes the code easier to understand since these macros are very common in KVM MMU code. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- .../selftests/kvm/include/x86_64/processor.h | 18 +++++++---- .../selftests/kvm/lib/x86_64/processor.c | 31 +++++++++---------- .../selftests/kvm/max_guest_memory_test.c | 2 +- .../selftests/kvm/x86_64/mmu_role_test.c | 2 +- 4 files changed, 29 insertions(+), 24 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index d0d51adec76e..273c70e91647 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -482,13 +482,19 @@ void vcpu_set_hv_cpuid(struct kvm_vm *vm, uint32_t vcpuid); struct kvm_cpuid2 *vcpu_get_supported_hv_cpuid(struct kvm_vm *vm, uint32_t vcpuid); void vm_xsave_req_perm(int bit); -enum x86_page_size { - X86_PAGE_SIZE_4K = 0, - X86_PAGE_SIZE_2M, - X86_PAGE_SIZE_1G, +enum pg_level { + PG_LEVEL_NONE, + PG_LEVEL_4K, + PG_LEVEL_2M, + PG_LEVEL_1G, + PG_LEVEL_512G, + PG_LEVEL_NUM }; -void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, - enum x86_page_size page_size); + +#define PG_LEVEL_SHIFT(_level) ((_level - 1) * 9 + 12) +#define PG_LEVEL_SIZE(_level) (1ull << PG_LEVEL_SHIFT(_level)) + +void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level); /* * Basic CPU control in CR0 diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c index 33ea5e9955d9..ead7011ee8f6 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/processor.c +++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c @@ -158,7 +158,7 @@ static void *virt_get_pte(struct kvm_vm *vm, uint64_t pt_pfn, uint64_t vaddr, int level) { uint64_t *page_table = addr_gpa2hva(vm, pt_pfn << vm->page_shift); - int index = vaddr >> (vm->page_shift + level * 9) & 0x1ffu; + int index = (vaddr >> PG_LEVEL_SHIFT(level)) & 0x1ffu; return &page_table[index]; } @@ -167,14 +167,14 @@ static uint64_t *virt_create_upper_pte(struct kvm_vm *vm, uint64_t pt_pfn, uint64_t vaddr, uint64_t paddr, - int level, - enum x86_page_size page_size) + int current_level, + int target_level) { - uint64_t *pte = virt_get_pte(vm, pt_pfn, vaddr, level); + uint64_t *pte = virt_get_pte(vm, pt_pfn, vaddr, current_level); if (!(*pte & PTE_PRESENT_MASK)) { *pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK; - if (level == page_size) + if (current_level == target_level) *pte |= PTE_LARGE_MASK | (paddr & PHYSICAL_PAGE_MASK); else *pte |= vm_alloc_page_table(vm) & PHYSICAL_PAGE_MASK; @@ -184,20 +184,19 @@ static uint64_t *virt_create_upper_pte(struct kvm_vm *vm, * a hugepage at this level, and that there isn't a hugepage at * this level. */ - TEST_ASSERT(level != page_size, + TEST_ASSERT(current_level != target_level, "Cannot create hugepage at level: %u, vaddr: 0x%lx\n", - page_size, vaddr); + current_level, vaddr); TEST_ASSERT(!(*pte & PTE_LARGE_MASK), "Cannot create page table at level: %u, vaddr: 0x%lx\n", - level, vaddr); + current_level, vaddr); } return pte; } -void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, - enum x86_page_size page_size) +void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level) { - const uint64_t pg_size = 1ull << ((page_size * 9) + 12); + const uint64_t pg_size = PG_LEVEL_SIZE(level); uint64_t *pml4e, *pdpe, *pde; uint64_t *pte; @@ -222,20 +221,20 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, * early if a hugepage was created. */ pml4e = virt_create_upper_pte(vm, vm->pgd >> vm->page_shift, - vaddr, paddr, 3, page_size); + vaddr, paddr, PG_LEVEL_512G, level); if (*pml4e & PTE_LARGE_MASK) return; - pdpe = virt_create_upper_pte(vm, PTE_GET_PFN(*pml4e), vaddr, paddr, 2, page_size); + pdpe = virt_create_upper_pte(vm, PTE_GET_PFN(*pml4e), vaddr, paddr, PG_LEVEL_1G, level); if (*pdpe & PTE_LARGE_MASK) return; - pde = virt_create_upper_pte(vm, PTE_GET_PFN(*pdpe), vaddr, paddr, 1, page_size); + pde = virt_create_upper_pte(vm, PTE_GET_PFN(*pdpe), vaddr, paddr, PG_LEVEL_2M, level); if (*pde & PTE_LARGE_MASK) return; /* Fill in page table entry. */ - pte = virt_get_pte(vm, PTE_GET_PFN(*pde), vaddr, 0); + pte = virt_get_pte(vm, PTE_GET_PFN(*pde), vaddr, PG_LEVEL_4K); TEST_ASSERT(!(*pte & PTE_PRESENT_MASK), "PTE already present for 4k page at vaddr: 0x%lx\n", vaddr); *pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK | (paddr & PHYSICAL_PAGE_MASK); @@ -243,7 +242,7 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) { - __virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K); + __virt_pg_map(vm, vaddr, paddr, PG_LEVEL_4K); } static uint64_t *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid, diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c index 3875c4b23a04..15f046e19cb2 100644 --- a/tools/testing/selftests/kvm/max_guest_memory_test.c +++ b/tools/testing/selftests/kvm/max_guest_memory_test.c @@ -244,7 +244,7 @@ int main(int argc, char *argv[]) #ifdef __x86_64__ /* Identity map memory in the guest using 1gb pages. */ for (i = 0; i < slot_size; i += size_1gb) - __virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G); + __virt_pg_map(vm, gpa + i, gpa + i, PG_LEVEL_1G); #else for (i = 0; i < slot_size; i += vm_get_page_size(vm)) virt_pg_map(vm, gpa + i, gpa + i); diff --git a/tools/testing/selftests/kvm/x86_64/mmu_role_test.c b/tools/testing/selftests/kvm/x86_64/mmu_role_test.c index da2325fcad87..bdecd532f935 100644 --- a/tools/testing/selftests/kvm/x86_64/mmu_role_test.c +++ b/tools/testing/selftests/kvm/x86_64/mmu_role_test.c @@ -35,7 +35,7 @@ static void mmu_role_test(u32 *cpuid_reg, u32 evil_cpuid_val) run = vcpu_state(vm, VCPU_ID); /* Map 1gb page without a backing memlot. */ - __virt_pg_map(vm, MMIO_GPA, MMIO_GPA, X86_PAGE_SIZE_1G); + __virt_pg_map(vm, MMIO_GPA, MMIO_GPA, PG_LEVEL_1G); r = _vcpu_run(vm, VCPU_ID); From c5a0ccec4cb4edde8e5b7e369dbe4d169b111e42 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:40 +0000 Subject: [PATCH 056/298] KVM: selftests: Add option to create 2M and 1G EPT mappings The current EPT mapping code in the selftests only supports mapping 4K pages. This commit extends that support with an option to map at 2M or 1G. This will be used in a future commit to create large page mappings to test eager page splitting. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 112 ++++++++++--------- 1 file changed, 61 insertions(+), 51 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index d089d8b850b5..fdc1e6deb922 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -392,80 +392,90 @@ void nested_vmx_check_supported(void) } } -void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, - uint64_t nested_paddr, uint64_t paddr) +static void nested_create_pte(struct kvm_vm *vm, + struct eptPageTableEntry *pte, + uint64_t nested_paddr, + uint64_t paddr, + int current_level, + int target_level) { - uint16_t index[4]; - struct eptPageTableEntry *pml4e; + if (!pte->readable) { + pte->writable = true; + pte->readable = true; + pte->executable = true; + pte->page_size = (current_level == target_level); + if (pte->page_size) + pte->address = paddr >> vm->page_shift; + else + pte->address = vm_alloc_page_table(vm) >> vm->page_shift; + } else { + /* + * Entry already present. Assert that the caller doesn't want + * a hugepage at this level, and that there isn't a hugepage at + * this level. + */ + TEST_ASSERT(current_level != target_level, + "Cannot create hugepage at level: %u, nested_paddr: 0x%lx\n", + current_level, nested_paddr); + TEST_ASSERT(!pte->page_size, + "Cannot create page table at level: %u, nested_paddr: 0x%lx\n", + current_level, nested_paddr); + } +} + + +void __nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, int target_level) +{ + const uint64_t page_size = PG_LEVEL_SIZE(target_level); + struct eptPageTableEntry *pt = vmx->eptp_hva, *pte; + uint16_t index; TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " "unknown or unsupported guest mode, mode: 0x%x", vm->mode); - TEST_ASSERT((nested_paddr % vm->page_size) == 0, + TEST_ASSERT((nested_paddr % page_size) == 0, "Nested physical address not on page boundary,\n" - " nested_paddr: 0x%lx vm->page_size: 0x%x", - nested_paddr, vm->page_size); + " nested_paddr: 0x%lx page_size: 0x%lx", + nested_paddr, page_size); TEST_ASSERT((nested_paddr >> vm->page_shift) <= vm->max_gfn, "Physical address beyond beyond maximum supported,\n" " nested_paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x", paddr, vm->max_gfn, vm->page_size); - TEST_ASSERT((paddr % vm->page_size) == 0, + TEST_ASSERT((paddr % page_size) == 0, "Physical address not on page boundary,\n" - " paddr: 0x%lx vm->page_size: 0x%x", - paddr, vm->page_size); + " paddr: 0x%lx page_size: 0x%lx", + paddr, page_size); TEST_ASSERT((paddr >> vm->page_shift) <= vm->max_gfn, "Physical address beyond beyond maximum supported,\n" " paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x", paddr, vm->max_gfn, vm->page_size); - index[0] = (nested_paddr >> 12) & 0x1ffu; - index[1] = (nested_paddr >> 21) & 0x1ffu; - index[2] = (nested_paddr >> 30) & 0x1ffu; - index[3] = (nested_paddr >> 39) & 0x1ffu; + for (int level = PG_LEVEL_512G; level >= PG_LEVEL_4K; level--) { + index = (nested_paddr >> PG_LEVEL_SHIFT(level)) & 0x1ffu; + pte = &pt[index]; - /* Allocate page directory pointer table if not present. */ - pml4e = vmx->eptp_hva; - if (!pml4e[index[3]].readable) { - pml4e[index[3]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pml4e[index[3]].writable = true; - pml4e[index[3]].readable = true; - pml4e[index[3]].executable = true; + nested_create_pte(vm, pte, nested_paddr, paddr, level, target_level); + + if (pte->page_size) + break; + + pt = addr_gpa2hva(vm, pte->address * vm->page_size); } - /* Allocate page directory table if not present. */ - struct eptPageTableEntry *pdpe; - pdpe = addr_gpa2hva(vm, pml4e[index[3]].address * vm->page_size); - if (!pdpe[index[2]].readable) { - pdpe[index[2]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pdpe[index[2]].writable = true; - pdpe[index[2]].readable = true; - pdpe[index[2]].executable = true; - } - - /* Allocate page table if not present. */ - struct eptPageTableEntry *pde; - pde = addr_gpa2hva(vm, pdpe[index[2]].address * vm->page_size); - if (!pde[index[1]].readable) { - pde[index[1]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pde[index[1]].writable = true; - pde[index[1]].readable = true; - pde[index[1]].executable = true; - } - - /* Fill in page table entry. */ - struct eptPageTableEntry *pte; - pte = addr_gpa2hva(vm, pde[index[1]].address * vm->page_size); - pte[index[0]].address = paddr >> vm->page_shift; - pte[index[0]].writable = true; - pte[index[0]].readable = true; - pte[index[0]].executable = true; - /* * For now mark these as accessed and dirty because the only * testcase we have needs that. Can be reconsidered later. */ - pte[index[0]].accessed = true; - pte[index[0]].dirty = true; + pte->accessed = true; + pte->dirty = true; + +} + +void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr) +{ + __nested_pg_map(vmx, vm, nested_paddr, paddr, PG_LEVEL_4K); } /* From b8ca01ea19068b54938ebb4ebc06814a89dee8ea Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:41 +0000 Subject: [PATCH 057/298] KVM: selftests: Drop stale function parameter comment for nested_map() nested_map() does not take a parameter named eptp_memslot. Drop the comment referring to it. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index fdc1e6deb922..baeaa35de113 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -486,7 +486,6 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * nested_paddr - Nested guest physical address to map * paddr - VM Physical Address * size - The size of the range to map - * eptp_memslot - Memory region slot for new virtual translation tables * * Output Args: None * From ce690e9c17d27486af879defc506679cbbb14777 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:42 +0000 Subject: [PATCH 058/298] KVM: selftests: Refactor nested_map() to specify target level Refactor nested_map() to specify that it explicityl wants 4K mappings (the existing behavior) and push the implementation down into __nested_map(), which can be used in subsequent commits to create huge page mappings. No function change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index baeaa35de113..b8cfe4914a3a 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -486,6 +486,7 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * nested_paddr - Nested guest physical address to map * paddr - VM Physical Address * size - The size of the range to map + * level - The level at which to map the range * * Output Args: None * @@ -494,22 +495,29 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * Within the VM given by vm, creates a nested guest translation for the * page range starting at nested_paddr to the page range starting at paddr. */ -void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, - uint64_t nested_paddr, uint64_t paddr, uint64_t size) +void __nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, uint64_t size, + int level) { - size_t page_size = vm->page_size; + size_t page_size = PG_LEVEL_SIZE(level); size_t npages = size / page_size; TEST_ASSERT(nested_paddr + size > nested_paddr, "Vaddr overflow"); TEST_ASSERT(paddr + size > paddr, "Paddr overflow"); while (npages--) { - nested_pg_map(vmx, vm, nested_paddr, paddr); + __nested_pg_map(vmx, vm, nested_paddr, paddr, level); nested_paddr += page_size; paddr += page_size; } } +void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, uint64_t size) +{ + __nested_map(vmx, vm, nested_paddr, paddr, size, PG_LEVEL_4K); +} + /* Prepare an identity extended page table that maps all the * physical pages in VM. */ From b6c086d04c0a1ba356145cdba5b46bd6cea2b9bd Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:43 +0000 Subject: [PATCH 059/298] KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h This is a VMX-related macro so move it to vmx.h. While here, open code the mask like the rest of the VMX bitmask macros. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/include/x86_64/processor.h | 3 --- tools/testing/selftests/kvm/include/x86_64/vmx.h | 2 ++ 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index 273c70e91647..d78f97f502b5 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -511,9 +511,6 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level) #define X86_CR0_CD (1UL<<30) /* Cache Disable */ #define X86_CR0_PG (1UL<<31) /* Paging */ -/* VMX_EPT_VPID_CAP bits */ -#define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21) - #define XSTATE_XTILE_CFG_BIT 17 #define XSTATE_XTILE_DATA_BIT 18 diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h index 583ceb0d1457..3b1794baa97c 100644 --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h @@ -96,6 +96,8 @@ #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK 0x0000001f #define VMX_MISC_SAVE_EFER_LMA 0x00000020 +#define VMX_EPT_VPID_CAP_AD_BITS 0x00200000 + #define EXIT_REASON_FAILED_VMENTRY 0x80000000 #define EXIT_REASON_EXCEPTION_NMI 0 #define EXIT_REASON_EXTERNAL_INTERRUPT 1 From c363d95986b1b930947305e2372665141721d15f Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:44 +0000 Subject: [PATCH 060/298] KVM: selftests: Add a helper to check EPT/VPID capabilities Create a small helper function to check if a given EPT/VPID capability is supported. This will be re-used in a follow-up commit to check for 1G page support. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index b8cfe4914a3a..5bf169179455 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -198,6 +198,11 @@ bool load_vmcs(struct vmx_pages *vmx) return true; } +static bool ept_vpid_cap_supported(uint64_t mask) +{ + return rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & mask; +} + /* * Initialize the control fields to the most basic settings possible. */ @@ -215,7 +220,7 @@ static inline void init_vmcs_control_fields(struct vmx_pages *vmx) struct eptPageTablePointer eptp = { .memory_type = VMX_BASIC_MEM_TYPE_WB, .page_walk_length = 3, /* + 1 */ - .ad_enabled = !!(rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & VMX_EPT_VPID_CAP_AD_BITS), + .ad_enabled = ept_vpid_cap_supported(VMX_EPT_VPID_CAP_AD_BITS), .address = vmx->eptp_gpa >> PAGE_SHIFT_4K, }; From acf57736e755ba5c467fc6fa85e4a0750cc36150 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:45 +0000 Subject: [PATCH 061/298] KVM: selftests: Drop unnecessary rule for STATIC_LIBS Drop the "all: $(STATIC_LIBS)" rule. The KVM selftests already depend on $(STATIC_LIBS), so there is no reason to have an extra "all" rule. Suggested-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 81470a99ed1c..e7d65e04b16a 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -192,7 +192,6 @@ $(OUTPUT)/libkvm.a: $(LIBKVM_OBJS) $(AR) crs $@ $^ x := $(shell mkdir -p $(sort $(dir $(TEST_GEN_PROGS)))) -all: $(STATIC_LIBS) $(TEST_GEN_PROGS): $(STATIC_LIBS) cscope: include_paths = $(LINUX_TOOL_INCLUDE) $(LINUX_HDR_PATH) include lib .. From cdc979dae265cc77a035b736f78f58e4c7309bb2 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:46 +0000 Subject: [PATCH 062/298] KVM: selftests: Link selftests directly with lib object files The linker does obey strong/weak symbols when linking static libraries, it simply resolves an undefined symbol to the first-encountered symbol. This means that defining __weak arch-generic functions and then defining arch-specific strong functions to override them in libkvm will not always work. More specifically, if we have: lib/generic.c: void __weak foo(void) { pr_info("weak\n"); } void bar(void) { foo(); } lib/x86_64/arch.c: void foo(void) { pr_info("strong\n"); } And a selftest that calls bar(), it will print "weak". Now if you make generic.o explicitly depend on arch.o (e.g. add function to arch.c that is called directly from generic.c) it will print "strong". In other words, it seems that the linker is free to throw out arch.o when linking because generic.o does not explicitly depend on it, which causes the linker to lose the strong symbol. One solution is to link libkvm.a with --whole-archive so that the linker doesn't throw away object files it thinks are unnecessary. However that is a bit difficult to plumb since we are using the common selftests makefile rules. An easier solution is to drop libkvm.a just link selftests with all the .o files that were originally in libkvm.a. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index e7d65e04b16a..804bf927618a 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -173,12 +173,13 @@ LDFLAGS += -pthread $(no-pie-option) $(pgste-option) # $(TEST_GEN_PROGS) starts with $(OUTPUT)/ include ../lib.mk -STATIC_LIBS := $(OUTPUT)/libkvm.a LIBKVM_C := $(filter %.c,$(LIBKVM)) LIBKVM_S := $(filter %.S,$(LIBKVM)) LIBKVM_C_OBJ := $(patsubst %.c, $(OUTPUT)/%.o, $(LIBKVM_C)) LIBKVM_S_OBJ := $(patsubst %.S, $(OUTPUT)/%.o, $(LIBKVM_S)) -EXTRA_CLEAN += $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) $(STATIC_LIBS) cscope.* +LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) + +EXTRA_CLEAN += $(LIBKVM_OBJS) cscope.* x := $(shell mkdir -p $(sort $(dir $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ)))) $(LIBKVM_C_OBJ): $(OUTPUT)/%.o: %.c @@ -187,12 +188,8 @@ $(LIBKVM_C_OBJ): $(OUTPUT)/%.o: %.c $(LIBKVM_S_OBJ): $(OUTPUT)/%.o: %.S $(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c $< -o $@ -LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) -$(OUTPUT)/libkvm.a: $(LIBKVM_OBJS) - $(AR) crs $@ $^ - x := $(shell mkdir -p $(sort $(dir $(TEST_GEN_PROGS)))) -$(TEST_GEN_PROGS): $(STATIC_LIBS) +$(TEST_GEN_PROGS): $(LIBKVM_OBJS) cscope: include_paths = $(LINUX_TOOL_INCLUDE) $(LINUX_HDR_PATH) include lib .. cscope: From cf97d5e99f69f876dc310ea21b5f97c3a493a18a Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:47 +0000 Subject: [PATCH 063/298] KVM: selftests: Clean up LIBKVM files in Makefile Break up the long lines for LIBKVM and alphabetize each architecture. This makes reading the Makefile easier, and will make reading diffs to LIBKVM easier. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 36 ++++++++++++++++++++++++---- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 804bf927618a..14566c0a330d 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -37,11 +37,37 @@ ifeq ($(ARCH),riscv) UNAME_M := riscv endif -LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/rbtree.c lib/sparsebit.c lib/test_util.c lib/guest_modes.c lib/perf_test_util.c -LIBKVM_x86_64 = lib/x86_64/apic.c lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/svm.c lib/x86_64/ucall.c lib/x86_64/handlers.S -LIBKVM_aarch64 = lib/aarch64/processor.c lib/aarch64/ucall.c lib/aarch64/handlers.S lib/aarch64/spinlock.c lib/aarch64/gic.c lib/aarch64/gic_v3.c lib/aarch64/vgic.c -LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c lib/s390x/diag318_test_handler.c -LIBKVM_riscv = lib/riscv/processor.c lib/riscv/ucall.c +LIBKVM += lib/assert.c +LIBKVM += lib/elf.c +LIBKVM += lib/guest_modes.c +LIBKVM += lib/io.c +LIBKVM += lib/kvm_util.c +LIBKVM += lib/perf_test_util.c +LIBKVM += lib/rbtree.c +LIBKVM += lib/sparsebit.c +LIBKVM += lib/test_util.c + +LIBKVM_x86_64 += lib/x86_64/apic.c +LIBKVM_x86_64 += lib/x86_64/handlers.S +LIBKVM_x86_64 += lib/x86_64/processor.c +LIBKVM_x86_64 += lib/x86_64/svm.c +LIBKVM_x86_64 += lib/x86_64/ucall.c +LIBKVM_x86_64 += lib/x86_64/vmx.c + +LIBKVM_aarch64 += lib/aarch64/gic.c +LIBKVM_aarch64 += lib/aarch64/gic_v3.c +LIBKVM_aarch64 += lib/aarch64/handlers.S +LIBKVM_aarch64 += lib/aarch64/processor.c +LIBKVM_aarch64 += lib/aarch64/spinlock.c +LIBKVM_aarch64 += lib/aarch64/ucall.c +LIBKVM_aarch64 += lib/aarch64/vgic.c + +LIBKVM_s390x += lib/s390x/diag318_test_handler.c +LIBKVM_s390x += lib/s390x/processor.c +LIBKVM_s390x += lib/s390x/ucall.c + +LIBKVM_riscv += lib/riscv/processor.c +LIBKVM_riscv += lib/riscv/ucall.c TEST_GEN_PROGS_x86_64 = x86_64/cpuid_test TEST_GEN_PROGS_x86_64 += x86_64/cr4_cpuid_sync_test From 71d489661904fcc3ec31b343acd5c0dac84b5410 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:48 +0000 Subject: [PATCH 064/298] KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 Add an option to dirty_log_perf_test that configures the vCPUs to run in L2 instead of L1. This makes it possible to benchmark the dirty logging performance of nested virtualization, which is particularly interesting because KVM must shadow L1's EPT/NPT tables. For now this support only works on x86_64 CPUs with VMX. Otherwise passing -n results in the test being skipped. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/dirty_log_perf_test.c | 10 +- .../selftests/kvm/include/perf_test_util.h | 9 ++ .../selftests/kvm/include/x86_64/processor.h | 4 + .../selftests/kvm/include/x86_64/vmx.h | 4 + .../selftests/kvm/lib/perf_test_util.c | 35 +++++- .../selftests/kvm/lib/x86_64/perf_test_util.c | 112 ++++++++++++++++++ tools/testing/selftests/kvm/lib/x86_64/vmx.c | 15 +++ 8 files changed, 182 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 14566c0a330d..22423c871ed6 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -49,6 +49,7 @@ LIBKVM += lib/test_util.c LIBKVM_x86_64 += lib/x86_64/apic.c LIBKVM_x86_64 += lib/x86_64/handlers.S +LIBKVM_x86_64 += lib/x86_64/perf_test_util.c LIBKVM_x86_64 += lib/x86_64/processor.c LIBKVM_x86_64 += lib/x86_64/svm.c LIBKVM_x86_64 += lib/x86_64/ucall.c diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 7b47ae4f952e..d60a34cdfaee 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -336,8 +336,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) static void help(char *name) { puts(""); - printf("usage: %s [-h] [-i iterations] [-p offset] [-g]" - "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]" + printf("usage: %s [-h] [-i iterations] [-p offset] [-g] " + "[-m mode] [-n] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]" "[-x memslots]\n", name); puts(""); printf(" -i: specify iteration counts (default: %"PRIu64")\n", @@ -351,6 +351,7 @@ static void help(char *name) printf(" -p: specify guest physical test memory offset\n" " Warning: a low offset can conflict with the loaded test code.\n"); guest_modes_help(); + printf(" -n: Run the vCPUs in nested mode (L2)\n"); printf(" -b: specify the size of the memory region which should be\n" " dirtied by each vCPU. e.g. 10M or 3G.\n" " (default: 1G)\n"); @@ -387,7 +388,7 @@ int main(int argc, char *argv[]) guest_modes_append_default(); - while ((opt = getopt(argc, argv, "ghi:p:m:b:f:v:os:x:")) != -1) { + while ((opt = getopt(argc, argv, "ghi:p:m:nb:f:v:os:x:")) != -1) { switch (opt) { case 'g': dirty_log_manual_caps = 0; @@ -401,6 +402,9 @@ int main(int argc, char *argv[]) case 'm': guest_modes_cmdline(optarg); break; + case 'n': + perf_test_args.nested = true; + break; case 'b': guest_percpu_mem_size = parse_size(optarg); break; diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h index a86f953d8d36..d822cb670f1c 100644 --- a/tools/testing/selftests/kvm/include/perf_test_util.h +++ b/tools/testing/selftests/kvm/include/perf_test_util.h @@ -30,10 +30,15 @@ struct perf_test_vcpu_args { struct perf_test_args { struct kvm_vm *vm; + /* The starting address and size of the guest test region. */ uint64_t gpa; + uint64_t size; uint64_t guest_page_size; int wr_fract; + /* Run vCPUs in L2 instead of L1, if the architecture supports it. */ + bool nested; + struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS]; }; @@ -49,5 +54,9 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract); void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *)); void perf_test_join_vcpu_threads(int vcpus); +void perf_test_guest_code(uint32_t vcpu_id); + +uint64_t perf_test_nested_pages(int nr_vcpus); +void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus); #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */ diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index d78f97f502b5..6ce185449259 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -494,6 +494,10 @@ enum pg_level { #define PG_LEVEL_SHIFT(_level) ((_level - 1) * 9 + 12) #define PG_LEVEL_SIZE(_level) (1ull << PG_LEVEL_SHIFT(_level)) +#define PG_SIZE_4K PG_LEVEL_SIZE(PG_LEVEL_4K) +#define PG_SIZE_2M PG_LEVEL_SIZE(PG_LEVEL_2M) +#define PG_SIZE_1G PG_LEVEL_SIZE(PG_LEVEL_1G) + void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level); /* diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h index 3b1794baa97c..cc3604f8f1d3 100644 --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h @@ -96,6 +96,7 @@ #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK 0x0000001f #define VMX_MISC_SAVE_EFER_LMA 0x00000020 +#define VMX_EPT_VPID_CAP_1G_PAGES 0x00020000 #define VMX_EPT_VPID_CAP_AD_BITS 0x00200000 #define EXIT_REASON_FAILED_VMENTRY 0x80000000 @@ -608,6 +609,7 @@ bool load_vmcs(struct vmx_pages *vmx); bool nested_vmx_supported(void); void nested_vmx_check_supported(void); +bool ept_1g_pages_supported(void); void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, uint64_t nested_paddr, uint64_t paddr); @@ -615,6 +617,8 @@ void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, uint64_t nested_paddr, uint64_t paddr, uint64_t size); void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t memslot); +void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t addr, uint64_t size); void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t eptp_memslot); void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm); diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index 722df3a28791..b2ff2cee2e51 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -40,7 +40,7 @@ static bool all_vcpu_threads_running; * Continuously write to the first 8 bytes of each page in the * specified region. */ -static void guest_code(uint32_t vcpu_id) +void perf_test_guest_code(uint32_t vcpu_id) { struct perf_test_args *pta = &perf_test_args; struct perf_test_vcpu_args *vcpu_args = &pta->vcpu_args[vcpu_id]; @@ -108,7 +108,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, { struct perf_test_args *pta = &perf_test_args; struct kvm_vm *vm; - uint64_t guest_num_pages; + uint64_t guest_num_pages, slot0_pages = DEFAULT_GUEST_PHY_PAGES; uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); int i; @@ -134,13 +134,20 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, "Guest memory cannot be evenly divided into %d slots.", slots); + /* + * If using nested, allocate extra pages for the nested page tables and + * in-memory data structures. + */ + if (pta->nested) + slot0_pages += perf_test_nested_pages(vcpus); + /* * Pass guest_num_pages to populate the page tables for test memory. * The memory is also added to memslot 0, but that's a benign side * effect as KVM allows aliasing HVAs in meslots. */ - vm = vm_create_with_vcpus(mode, vcpus, DEFAULT_GUEST_PHY_PAGES, - guest_num_pages, 0, guest_code, NULL); + vm = vm_create_with_vcpus(mode, vcpus, slot0_pages, guest_num_pages, 0, + perf_test_guest_code, NULL); pta->vm = vm; @@ -161,7 +168,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, /* Align to 1M (segment size) */ pta->gpa = align_down(pta->gpa, 1 << 20); #endif - pr_info("guest physical test memory offset: 0x%lx\n", pta->gpa); + pta->size = guest_num_pages * pta->guest_page_size; + pr_info("guest physical test memory: [0x%lx, 0x%lx)\n", + pta->gpa, pta->gpa + pta->size); /* Add extra memory slots for testing */ for (i = 0; i < slots; i++) { @@ -178,6 +187,11 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, perf_test_setup_vcpus(vm, vcpus, vcpu_memory_bytes, partition_vcpu_memory_access); + if (pta->nested) { + pr_info("Configuring vCPUs to run in L2 (nested).\n"); + perf_test_setup_nested(vm, vcpus); + } + ucall_init(vm, NULL); /* Export the shared variables to the guest. */ @@ -198,6 +212,17 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract) sync_global_to_guest(vm, perf_test_args); } +uint64_t __weak perf_test_nested_pages(int nr_vcpus) +{ + return 0; +} + +void __weak perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus) +{ + pr_info("%s() not support on this architecture, skipping.\n", __func__); + exit(KSFT_SKIP); +} + static void *vcpu_thread_main(void *data) { struct vcpu_thread *vcpu = data; diff --git a/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c new file mode 100644 index 000000000000..e258524435a0 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c @@ -0,0 +1,112 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * x86_64-specific extensions to perf_test_util.c. + * + * Copyright (C) 2022, Google, Inc. + */ +#include +#include +#include +#include + +#include "test_util.h" +#include "kvm_util.h" +#include "perf_test_util.h" +#include "../kvm_util_internal.h" +#include "processor.h" +#include "vmx.h" + +void perf_test_l2_guest_code(uint64_t vcpu_id) +{ + perf_test_guest_code(vcpu_id); + vmcall(); +} + +extern char perf_test_l2_guest_entry[]; +__asm__( +"perf_test_l2_guest_entry:" +" mov (%rsp), %rdi;" +" call perf_test_l2_guest_code;" +" ud2;" +); + +static void perf_test_l1_guest_code(struct vmx_pages *vmx, uint64_t vcpu_id) +{ +#define L2_GUEST_STACK_SIZE 64 + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + unsigned long *rsp; + + GUEST_ASSERT(vmx->vmcs_gpa); + GUEST_ASSERT(prepare_for_vmx_operation(vmx)); + GUEST_ASSERT(load_vmcs(vmx)); + GUEST_ASSERT(ept_1g_pages_supported()); + + rsp = &l2_guest_stack[L2_GUEST_STACK_SIZE - 1]; + *rsp = vcpu_id; + prepare_vmcs(vmx, perf_test_l2_guest_entry, rsp); + + GUEST_ASSERT(!vmlaunch()); + GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL); + GUEST_DONE(); +} + +uint64_t perf_test_nested_pages(int nr_vcpus) +{ + /* + * 513 page tables is enough to identity-map 256 TiB of L2 with 1G + * pages and 4-level paging, plus a few pages per-vCPU for data + * structures such as the VMCS. + */ + return 513 + 10 * nr_vcpus; +} + +void perf_test_setup_ept(struct vmx_pages *vmx, struct kvm_vm *vm) +{ + uint64_t start, end; + + prepare_eptp(vmx, vm, 0); + + /* + * Identity map the first 4G and the test region with 1G pages so that + * KVM can shadow the EPT12 with the maximum huge page size supported + * by the backing source. + */ + nested_identity_map_1g(vmx, vm, 0, 0x100000000ULL); + + start = align_down(perf_test_args.gpa, PG_SIZE_1G); + end = align_up(perf_test_args.gpa + perf_test_args.size, PG_SIZE_1G); + nested_identity_map_1g(vmx, vm, start, end - start); +} + +void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus) +{ + struct vmx_pages *vmx, *vmx0 = NULL; + struct kvm_regs regs; + vm_vaddr_t vmx_gva; + int vcpu_id; + + nested_vmx_check_supported(); + + for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { + vmx = vcpu_alloc_vmx(vm, &vmx_gva); + + if (vcpu_id == 0) { + perf_test_setup_ept(vmx, vm); + vmx0 = vmx; + } else { + /* Share the same EPT table across all vCPUs. */ + vmx->eptp = vmx0->eptp; + vmx->eptp_hva = vmx0->eptp_hva; + vmx->eptp_gpa = vmx0->eptp_gpa; + } + + /* + * Override the vCPU to run perf_test_l1_guest_code() which will + * bounce it into L2 before calling perf_test_guest_code(). + */ + vcpu_regs_get(vm, vcpu_id, ®s); + regs.rip = (unsigned long) perf_test_l1_guest_code; + vcpu_regs_set(vm, vcpu_id, ®s); + vcpu_args_set(vm, vcpu_id, 2, vmx_gva, vcpu_id); + } +} diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index 5bf169179455..b77a01d0a271 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -203,6 +203,11 @@ static bool ept_vpid_cap_supported(uint64_t mask) return rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & mask; } +bool ept_1g_pages_supported(void) +{ + return ept_vpid_cap_supported(VMX_EPT_VPID_CAP_1G_PAGES); +} + /* * Initialize the control fields to the most basic settings possible. */ @@ -439,6 +444,9 @@ void __nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " "unknown or unsupported guest mode, mode: 0x%x", vm->mode); + TEST_ASSERT((nested_paddr >> 48) == 0, + "Nested physical address 0x%lx requires 5-level paging", + nested_paddr); TEST_ASSERT((nested_paddr % page_size) == 0, "Nested physical address not on page boundary,\n" " nested_paddr: 0x%lx page_size: 0x%lx", @@ -547,6 +555,13 @@ void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm, } } +/* Identity map a region with 1GiB Pages. */ +void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t addr, uint64_t size) +{ + __nested_map(vmx, vm, addr, addr, size, PG_LEVEL_1G); +} + void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t eptp_memslot) { From e0f3f46e42064a51573914766897b4ab95d943e3 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:49 +0000 Subject: [PATCH 065/298] KVM: selftests: Restrict test region to 48-bit physical addresses when using nested The selftests nested code only supports 4-level paging at the moment. This means it cannot map nested guest physical addresses with more than 48 bits. Allow perf_test_util nested mode to work on hosts with more than 48 physical addresses by restricting the guest test region to 48-bits. While here, opportunistically fix an off-by-one error when dealing with vm_get_max_gfn(). perf_test_util.c was treating this as the maximum number of GFNs, rather than the maximum allowed GFN. This didn't result in any correctness issues, but it did end up shifting the test region down slightly when using huge pages. Suggested-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- .../testing/selftests/kvm/lib/perf_test_util.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index b2ff2cee2e51..f989ff91f022 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -110,6 +110,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, struct kvm_vm *vm; uint64_t guest_num_pages, slot0_pages = DEFAULT_GUEST_PHY_PAGES; uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); + uint64_t region_end_gfn; int i; pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); @@ -151,18 +152,29 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, pta->vm = vm; + /* Put the test region at the top guest physical memory. */ + region_end_gfn = vm_get_max_gfn(vm) + 1; + +#ifdef __x86_64__ + /* + * When running vCPUs in L2, restrict the test region to 48 bits to + * avoid needing 5-level page tables to identity map L2. + */ + if (pta->nested) + region_end_gfn = min(region_end_gfn, (1UL << 48) / pta->guest_page_size); +#endif /* * If there should be more memory in the guest test region than there * can be pages in the guest, it will definitely cause problems. */ - TEST_ASSERT(guest_num_pages < vm_get_max_gfn(vm), + TEST_ASSERT(guest_num_pages < region_end_gfn, "Requested more guest memory than address space allows.\n" " guest pages: %" PRIx64 " max gfn: %" PRIx64 " vcpus: %d wss: %" PRIx64 "]\n", - guest_num_pages, vm_get_max_gfn(vm), vcpus, + guest_num_pages, region_end_gfn - 1, vcpus, vcpu_memory_bytes); - pta->gpa = (vm_get_max_gfn(vm) - guest_num_pages) * pta->guest_page_size; + pta->gpa = (region_end_gfn - guest_num_pages) * pta->guest_page_size; pta->gpa = align_down(pta->gpa, backing_src_pagesz); #ifdef __s390x__ /* Align to 1M (segment size) */ From 668a9fe5c6a1bcac6b65d5e9b91a9eca86f782a3 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 8 Jun 2022 14:45:35 +0100 Subject: [PATCH 066/298] genirq: PM: Use runtime PM for chained interrupts When requesting an interrupt, we correctly call into the runtime PM framework to guarantee that the underlying interrupt controller is up and running. However, we fail to do so for chained interrupt controllers, as the mux interrupt is not requested along the same path. Augment __irq_do_set_handler() to call into the runtime PM code in this case, making sure the PM flow is the same for all interrupts. Reported-by: Lucas Stach Tested-by: Liu Ying Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/26973cddee5f527ea17184c0f3fccb70bc8969a0.camel@pengutronix.de --- kernel/irq/chip.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index e6b8e564b37f..886789dcee43 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -1006,8 +1006,10 @@ __irq_do_set_handler(struct irq_desc *desc, irq_flow_handler_t handle, if (desc->irq_data.chip != &no_irq_chip) mask_ack_irq(desc); irq_state_set_disabled(desc); - if (is_chained) + if (is_chained) { desc->action = NULL; + WARN_ON(irq_chip_pm_put(irq_desc_get_irq_data(desc))); + } desc->depth = 1; } desc->handle_irq = handle; @@ -1033,6 +1035,7 @@ __irq_do_set_handler(struct irq_desc *desc, irq_flow_handler_t handle, irq_settings_set_norequest(desc); irq_settings_set_nothread(desc); desc->action = &chained_action; + WARN_ON(irq_chip_pm_get(irq_desc_get_irq_data(desc))); irq_activate_and_startup(desc, IRQ_RESEND); } } From c3238d36c3a2be0a29a9d848d6c51e1b14be6692 Mon Sep 17 00:00:00 2001 From: Grzegorz Szczurek Date: Fri, 29 Apr 2022 14:27:08 +0200 Subject: [PATCH 067/298] i40e: Fix adding ADQ filter to TC0 Procedure of configure tc flower filters erroneously allows to create filters on TC0 where unfiltered packets are also directed by default. Issue was caused by insufficient checks of hw_tc parameter specifying the hardware traffic class to pass matching packets to. Fix checking hw_tc parameter which blocks creation of filters on TC0. Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower") Signed-off-by: Grzegorz Szczurek Signed-off-by: Jedrzej Jagielski Tested-by: Bharathi Sreenivas Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 332a608dbaa6..72576bb3e94d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -8542,6 +8542,11 @@ static int i40e_configure_clsflower(struct i40e_vsi *vsi, return -EOPNOTSUPP; } + if (!tc) { + dev_err(&pf->pdev->dev, "Unable to add filter because of invalid destination"); + return -EINVAL; + } + if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) || test_bit(__I40E_RESET_INTR_RECEIVED, pf->state)) return -EBUSY; From 0bb050670ac90a167ecfa3f9590f92966c9a3677 Mon Sep 17 00:00:00 2001 From: Grzegorz Szczurek Date: Fri, 29 Apr 2022 14:40:23 +0200 Subject: [PATCH 068/298] i40e: Fix calculating the number of queue pairs If ADQ is enabled for a VF, then actual number of queue pair is a number of currently available traffic classes for this VF. Without this change the configuration of the Rx/Tx queues fails with error. Fixes: d29e0d233e0d ("i40e: missing input validation on VF message handling by the PF") Signed-off-by: Grzegorz Szczurek Signed-off-by: Jedrzej Jagielski Tested-by: Bharathi Sreenivas Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 2606e8f0f19b..033ea71763e3 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2282,7 +2282,7 @@ static int i40e_vc_config_queues_msg(struct i40e_vf *vf, u8 *msg) } if (vf->adq_enabled) { - for (i = 0; i < I40E_MAX_VF_VSI; i++) + for (i = 0; i < vf->num_tc; i++) num_qps_all += vf->ch[i].num_qps; if (num_qps_all != qci->num_queue_pairs) { aq_ret = I40E_ERR_PARAM; From fd5855e6b1358e816710afee68a1d2bc685176ca Mon Sep 17 00:00:00 2001 From: Aleksandr Loktionov Date: Thu, 19 May 2022 16:01:45 +0200 Subject: [PATCH 069/298] i40e: Fix call trace in setup_tx_descriptors After PF reset and ethtool -t there was call trace in dmesg sometimes leading to panic. When there was some time, around 5 seconds, between reset and test there were no errors. Problem was that pf reset calls i40e_vsi_close in prep_for_reset and ethtool -t calls i40e_vsi_close in diag_test. If there was not enough time between those commands the second i40e_vsi_close starts before previous i40e_vsi_close was done which leads to crash. Add check to diag_test if pf is in reset and don't start offline tests if it is true. Add netif_info("testing failed") into unhappy path of i40e_diag_test() Fixes: e17bc411aea8 ("i40e: Disable offline diagnostics if VFs are enabled") Fixes: 510efb2682b3 ("i40e: Fix ethtool offline diagnostic with netqueues") Signed-off-by: Michal Jaron Signed-off-by: Aleksandr Loktionov Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- .../net/ethernet/intel/i40e/i40e_ethtool.c | 25 +++++++++++++------ 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 610f00cbaff9..19704f5c8291 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -2586,15 +2586,16 @@ static void i40e_diag_test(struct net_device *netdev, set_bit(__I40E_TESTING, pf->state); + if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) || + test_bit(__I40E_RESET_INTR_RECEIVED, pf->state)) { + dev_warn(&pf->pdev->dev, + "Cannot start offline testing when PF is in reset state.\n"); + goto skip_ol_tests; + } + if (i40e_active_vfs(pf) || i40e_active_vmdqs(pf)) { dev_warn(&pf->pdev->dev, "Please take active VFs and Netqueues offline and restart the adapter before running NIC diagnostics\n"); - data[I40E_ETH_TEST_REG] = 1; - data[I40E_ETH_TEST_EEPROM] = 1; - data[I40E_ETH_TEST_INTR] = 1; - data[I40E_ETH_TEST_LINK] = 1; - eth_test->flags |= ETH_TEST_FL_FAILED; - clear_bit(__I40E_TESTING, pf->state); goto skip_ol_tests; } @@ -2641,9 +2642,17 @@ static void i40e_diag_test(struct net_device *netdev, data[I40E_ETH_TEST_INTR] = 0; } -skip_ol_tests: - netif_info(pf, drv, netdev, "testing finished\n"); + return; + +skip_ol_tests: + data[I40E_ETH_TEST_REG] = 1; + data[I40E_ETH_TEST_EEPROM] = 1; + data[I40E_ETH_TEST_INTR] = 1; + data[I40E_ETH_TEST_LINK] = 1; + eth_test->flags |= ETH_TEST_FL_FAILED; + clear_bit(__I40E_TESTING, pf->state); + netif_info(pf, drv, netdev, "testing failed\n"); } static void i40e_get_wol(struct net_device *netdev, From 645603844270b69175899268be68b871295764fe Mon Sep 17 00:00:00 2001 From: Michal Wilczynski Date: Fri, 20 May 2022 13:19:27 +0200 Subject: [PATCH 070/298] iavf: Fix issue with MAC address of VF shown as zero After reinitialization of iavf, ice driver gets VIRTCHNL_OP_ADD_ETH_ADDR message with incorrectly set type of MAC address. Hardware address should have is_primary flag set as true. This way ice driver knows what it has to set as a MAC address. Check if the address is primary in iavf_add_filter function and set flag accordingly. To test set all-zero MAC on a VF. This triggers iavf re-initialization and VIRTCHNL_OP_ADD_ETH_ADDR message gets sent to PF. For example: ip link set dev ens785 vf 0 mac 00:00:00:00:00:00 This triggers re-initialization of iavf. New MAC should be assigned. Now check if MAC is non-zero: ip link show dev ens785 Fixes: a3e839d539e0 ("iavf: Add usage of new virtchnl format to set default MAC") Signed-off-by: Michal Wilczynski Reviewed-by: Maciej Fijalkowski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index 7dfcf78b57fb..f3ecb3bca33d 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -984,7 +984,7 @@ struct iavf_mac_filter *iavf_add_filter(struct iavf_adapter *adapter, list_add_tail(&f->list, &adapter->mac_filter_list); f->add = true; f->is_new_mac = true; - f->is_primary = false; + f->is_primary = ether_addr_equal(macaddr, adapter->hw.mac.addr); adapter->aq_required |= IAVF_FLAG_AQ_ADD_MAC_FILTER; } else { f->remove = false; From b84dc7f0e3646d480b6972c5f25586215c5f33e2 Mon Sep 17 00:00:00 2001 From: Jamie Iles Date: Mon, 6 Jun 2022 22:39:52 +0100 Subject: [PATCH 071/298] irqchip/xilinx: Remove microblaze+zynq dependency The Xilinx IRQ controller doesn't really have any architecture dependencies - it's a generic AXI component that can be used for any FPGA core from Zynq hard processor systems to microblaze+riscv soft cores and more. Signed-off-by: Jamie Iles Acked-by: Michal Simek Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606213952.298686-1-jamie@jamieiles.com --- drivers/irqchip/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 4ab1038b5482..1f23a6be7d88 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -298,7 +298,7 @@ config XTENSA_MX config XILINX_INTC bool "Xilinx Interrupt Controller IP" - depends on MICROBLAZE || ARCH_ZYNQ || ARCH_ZYNQMP + depends on OF select IRQ_DOMAIN help Support for the Xilinx Interrupt Controller IP core. From f4b98e314888cc51486421bcf6d52852452ea48b Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:25 +0400 Subject: [PATCH 072/298] irqchip/gic/realview: Fix refcount leak in realview_gic_of_init of_find_matching_node_and_match() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. Add missing of_node_put() to avoid refcount leak. Fixes: 82b0a434b436 ("irqchip/gic/realview: Support more RealView DCC variants") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-2-linmq006@gmail.com --- drivers/irqchip/irq-gic-realview.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/irqchip/irq-gic-realview.c b/drivers/irqchip/irq-gic-realview.c index b4c1924f0255..38fab02ffe9d 100644 --- a/drivers/irqchip/irq-gic-realview.c +++ b/drivers/irqchip/irq-gic-realview.c @@ -57,6 +57,7 @@ realview_gic_of_init(struct device_node *node, struct device_node *parent) /* The PB11MPCore GIC needs to be configured in the syscon */ map = syscon_node_to_regmap(np); + of_node_put(np); if (!IS_ERR(map)) { /* new irq mode with no DCC */ regmap_write(map, REALVIEW_SYS_LOCK_OFFSET, From b1ac803f47cb1615468f35cf1ccb553c52087301 Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:26 +0400 Subject: [PATCH 073/298] irqchip/apple-aic: Fix refcount leak in build_fiq_affinity of_find_node_by_phandle() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. Add missing of_node_put() to avoid refcount leak. Fixes: a5e8801202b3 ("irqchip/apple-aic: Parse FIQ affinities from device-tree") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-3-linmq006@gmail.com --- drivers/irqchip/irq-apple-aic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/irqchip/irq-apple-aic.c b/drivers/irqchip/irq-apple-aic.c index 12dd48727a15..478d0af16d9f 100644 --- a/drivers/irqchip/irq-apple-aic.c +++ b/drivers/irqchip/irq-apple-aic.c @@ -1035,6 +1035,7 @@ static void build_fiq_affinity(struct aic_irq_chip *ic, struct device_node *aff) continue; cpu = of_cpu_node_to_id(cpu_node); + of_node_put(cpu_node); if (WARN_ON(cpu < 0)) continue; From 3d45670fab3c25a7452721e4588cc95c51cda134 Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:27 +0400 Subject: [PATCH 074/298] irqchip/apple-aic: Fix refcount leak in aic_of_ic_init of_get_child_by_name() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. Add missing of_node_put() to avoid refcount leak. Fixes: a5e8801202b3 ("irqchip/apple-aic: Parse FIQ affinities from device-tree") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-4-linmq006@gmail.com --- drivers/irqchip/irq-apple-aic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/irqchip/irq-apple-aic.c b/drivers/irqchip/irq-apple-aic.c index 478d0af16d9f..5ac83185ff47 100644 --- a/drivers/irqchip/irq-apple-aic.c +++ b/drivers/irqchip/irq-apple-aic.c @@ -1144,6 +1144,7 @@ static int __init aic_of_ic_init(struct device_node *node, struct device_node *p for_each_child_of_node(affs, chld) build_fiq_affinity(irqc, chld); } + of_node_put(affs); set_handle_irq(aic_handle_irq); set_handle_fiq(aic_handle_fiq); From ec8401a429ffee34ccf38cebf3443f8d5ae6cb0d Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:28 +0400 Subject: [PATCH 075/298] irqchip/gic-v3: Fix error handling in gic_populate_ppi_partitions of_get_child_by_name() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. When kcalloc fails, it missing of_node_put() and results in refcount leak. Fix this by goto out_put_node label. Fixes: 52085d3f2028 ("irqchip/gic-v3: Dynamically allocate PPI partition descriptors") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-5-linmq006@gmail.com --- drivers/irqchip/irq-gic-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 2be8dea6b6b0..1d5b4755a27e 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1932,7 +1932,7 @@ static void __init gic_populate_ppi_partitions(struct device_node *gic_node) gic_data.ppi_descs = kcalloc(gic_data.ppi_nr, sizeof(*gic_data.ppi_descs), GFP_KERNEL); if (!gic_data.ppi_descs) - return; + goto out_put_node; nr_parts = of_get_child_count(parts_node); From fa1ad9d4cc47ca2470cd904ad4519f05d7e43a2b Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:29 +0400 Subject: [PATCH 076/298] irqchip/gic-v3: Fix refcount leak in gic_populate_ppi_partitions of_find_node_by_phandle() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. Add missing of_node_put() to avoid refcount leak. Fixes: e3825ba1af3a ("irqchip/gic-v3: Add support for partitioned PPIs") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-6-linmq006@gmail.com --- drivers/irqchip/irq-gic-v3.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 1d5b4755a27e..5c1cf907ee68 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1973,12 +1973,15 @@ static void __init gic_populate_ppi_partitions(struct device_node *gic_node) continue; cpu = of_cpu_node_to_id(cpu_node); - if (WARN_ON(cpu < 0)) + if (WARN_ON(cpu < 0)) { + of_node_put(cpu_node); continue; + } pr_cont("%pOF[%d] ", cpu_node, cpu); cpumask_set_cpu(cpu, &part->mask); + of_node_put(cpu_node); } pr_cont("}\n"); From eff4780f83d0ae3e5b6c02ff5d999dc4c1c5c8ce Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 12:09:30 +0400 Subject: [PATCH 077/298] irqchip/realtek-rtl: Fix refcount leak in map_interrupts of_find_node_by_phandle() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. This function doesn't call of_node_put() in error path. Call of_node_put() directly after of_property_read_u32() to cover both normal path and error path. Fixes: 9f3a0f34b84a ("irqchip: Add support for Realtek RTL838x/RTL839x interrupt controller") Signed-off-by: Miaoqian Lin Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220601080930.31005-7-linmq006@gmail.com --- drivers/irqchip/irq-realtek-rtl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-realtek-rtl.c b/drivers/irqchip/irq-realtek-rtl.c index 50a56820c99b..56bf502d9c67 100644 --- a/drivers/irqchip/irq-realtek-rtl.c +++ b/drivers/irqchip/irq-realtek-rtl.c @@ -134,9 +134,9 @@ static int __init map_interrupts(struct device_node *node, struct irq_domain *do if (!cpu_ictl) return -EINVAL; ret = of_property_read_u32(cpu_ictl, "#interrupt-cells", &tmp); + of_node_put(cpu_ictl); if (ret || tmp != 1) return -EINVAL; - of_node_put(cpu_ictl); cpu_int = be32_to_cpup(imap + 2); if (cpu_int > 7 || cpu_int < 2) From df089e6f07e3c94cb7a330dc74f5041db800009c Mon Sep 17 00:00:00 2001 From: Kunihiko Hayashi Date: Fri, 20 May 2022 14:17:01 +0900 Subject: [PATCH 078/298] dt-bindings: interrupt-controller/uniphier-aidet: Add bindings for NX1 SoC Update uniphier-aidet binding document for UniPhier NX1 SoC. Signed-off-by: Kunihiko Hayashi Acked-by: Rob Herring Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1653023822-19229-2-git-send-email-hayashi.kunihiko@socionext.com --- .../bindings/interrupt-controller/socionext,uniphier-aidet.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.yaml b/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.yaml index f89ebde76dab..de7c5e59bae1 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.yaml @@ -30,6 +30,7 @@ properties: - socionext,uniphier-ld11-aidet - socionext,uniphier-ld20-aidet - socionext,uniphier-pxs3-aidet + - socionext,uniphier-nx1-aidet reg: maxItems: 1 From e3f056a7aafabe4ac3ad4b7465ba821b44a7e639 Mon Sep 17 00:00:00 2001 From: Kunihiko Hayashi Date: Fri, 20 May 2022 14:17:02 +0900 Subject: [PATCH 079/298] irqchip/uniphier-aidet: Add compatible string for NX1 SoC Add the compatible string to support UniPhier NX1 SoC, which has the same kinds of controls as the other UniPhier SoCs. Signed-off-by: Kunihiko Hayashi Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1653023822-19229-3-git-send-email-hayashi.kunihiko@socionext.com --- drivers/irqchip/irq-uniphier-aidet.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/irqchip/irq-uniphier-aidet.c b/drivers/irqchip/irq-uniphier-aidet.c index 89121b39be26..716b1bb88bf2 100644 --- a/drivers/irqchip/irq-uniphier-aidet.c +++ b/drivers/irqchip/irq-uniphier-aidet.c @@ -237,6 +237,7 @@ static const struct of_device_id uniphier_aidet_match[] = { { .compatible = "socionext,uniphier-ld11-aidet" }, { .compatible = "socionext,uniphier-ld20-aidet" }, { .compatible = "socionext,uniphier-pxs3-aidet" }, + { .compatible = "socionext,uniphier-nx1-aidet" }, { /* sentinel */ } }; From de0952f267ffe9d4ecbfeab7c476f7e29e028b3e Mon Sep 17 00:00:00 2001 From: Javier Martinez Canillas Date: Fri, 10 Jun 2022 00:34:24 +0200 Subject: [PATCH 080/298] staging: olpc_dcon: mark driver as broken The commit eecb3e4e5d9d ("staging: olpc_dcon: add OLPC display controller (DCON) support") added this driver in 2010, and has been in staging since then. It was marked as broken at some point because it didn't even build but that got removed once the build issues were addressed. But it seems that the work to move this driver out of staging has stalled, the last non-trivial change to fix one of the items mentioned in its todo file was commit e40219d5e4b2 ("staging: olpc_dcon: allow simultaneous XO-1 and XO-1.5 support") in 2019. And even if work to destage the driver is resumed, the fbdev subsystem has been deprecated for a long time and instead it should be ported to DRM. Now this driver is preventing to land a kernel wide change, that makes the num_registered_fb symbol to be private to the fbmem.c file. So let's just mark the driver as broken. Someone can then work on making it not depend on the num_registered_fb symbol, allowing to drop the broken dependency again. Suggested-by: Sam Ravnborg Acked-by: Thomas Zimmermann Signed-off-by: Javier Martinez Canillas Link: https://lore.kernel.org/r/20220609223424.907174-1-javierm@redhat.com Signed-off-by: Greg Kroah-Hartman --- drivers/staging/olpc_dcon/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/staging/olpc_dcon/Kconfig b/drivers/staging/olpc_dcon/Kconfig index d1a0dea09ef0..d0ba34cc32f7 100644 --- a/drivers/staging/olpc_dcon/Kconfig +++ b/drivers/staging/olpc_dcon/Kconfig @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 config FB_OLPC_DCON tristate "One Laptop Per Child Display CONtroller support" - depends on OLPC && FB + depends on OLPC && FB && BROKEN depends on I2C depends on GPIO_CS5535 && ACPI select BACKLIGHT_CLASS_DEVICE From 67ea0a2adbf667cd6da4965fbcfd0da741035084 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 8 Jun 2022 14:55:12 -0700 Subject: [PATCH 081/298] staging: rtl8723bs: Allocate full pwep structure The pwep allocation was always being allocated smaller than the true structure size. Avoid this by always allocating the full structure. Found with GCC 12 and -Warray-bounds: ../drivers/staging/rtl8723bs/os_dep/ioctl_linux.c: In function 'rtw_set_encryption': ../drivers/staging/rtl8723bs/os_dep/ioctl_linux.c:591:29: warning: array subscript 'struct ndis_802_11_wep[0]' is partly outside array bounds of 'void[25]' [-Warray-bounds] 591 | pwep->length = wep_total_len; | ^~ Cc: Greg Kroah-Hartman Cc: Fabio Aiuto Cc: Hans de Goede Cc: linux-staging@lists.linux.dev Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220608215512.1070847-1-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman --- drivers/staging/rtl8723bs/os_dep/ioctl_linux.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c b/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c index ece97e37ac91..30374a820496 100644 --- a/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c +++ b/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c @@ -90,7 +90,8 @@ static int wpa_set_encryption(struct net_device *dev, struct ieee_param *param, if (wep_key_len > 0) { wep_key_len = wep_key_len <= 5 ? 5 : 13; wep_total_len = wep_key_len + FIELD_OFFSET(struct ndis_802_11_wep, key_material); - pwep = kzalloc(wep_total_len, GFP_KERNEL); + /* Allocate a full structure to avoid potentially running off the end. */ + pwep = kzalloc(sizeof(*pwep), GFP_KERNEL); if (!pwep) { ret = -ENOMEM; goto exit; @@ -582,7 +583,8 @@ static int rtw_set_encryption(struct net_device *dev, struct ieee_param *param, if (wep_key_len > 0) { wep_key_len = wep_key_len <= 5 ? 5 : 13; wep_total_len = wep_key_len + FIELD_OFFSET(struct ndis_802_11_wep, key_material); - pwep = kzalloc(wep_total_len, GFP_KERNEL); + /* Allocate a full structure to avoid potentially running off the end. */ + pwep = kzalloc(sizeof(*pwep), GFP_KERNEL); if (!pwep) goto exit; From 6fac824f40987a54a08dfbcc36145869d02e45b1 Mon Sep 17 00:00:00 2001 From: Jiaxun Yang Date: Thu, 9 Jun 2022 18:52:41 +0100 Subject: [PATCH 082/298] irqchip/loongson-liointc: Use architecture register to get coreid fa84f89395e0 ("irqchip/loongson-liointc: Fix build error for LoongArch") replaced get_ebase_cpunum with physical processor id from SMP facilities. However that breaks MIPS non-SMP build and makes booting from other cores inpossible on non-SMP kernel. Thus we revert get_ebase_cpunum back and use get_csr_cpuid for LoongArch. Fixes: fa84f89395e0 ("irqchip/loongson-liointc: Fix build error for LoongArch") Signed-off-by: Jiaxun Yang Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609175242.977-1-jiaxun.yang@flygoat.com --- drivers/irqchip/irq-loongson-liointc.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-loongson-liointc.c b/drivers/irqchip/irq-loongson-liointc.c index aed88857d90f..8d05d8bcf56f 100644 --- a/drivers/irqchip/irq-loongson-liointc.c +++ b/drivers/irqchip/irq-loongson-liointc.c @@ -39,6 +39,12 @@ #define LIOINTC_ERRATA_IRQ 10 +#if defined(CONFIG_MIPS) +#define liointc_core_id get_ebase_cpunum() +#else +#define liointc_core_id get_csr_cpuid() +#endif + struct liointc_handler_data { struct liointc_priv *priv; u32 parent_int_map; @@ -57,7 +63,7 @@ static void liointc_chained_handle_irq(struct irq_desc *desc) struct liointc_handler_data *handler = irq_desc_get_handler_data(desc); struct irq_chip *chip = irq_desc_get_chip(desc); struct irq_chip_generic *gc = handler->priv->gc; - int core = cpu_logical_map(smp_processor_id()) % LIOINTC_NUM_CORES; + int core = liointc_core_id % LIOINTC_NUM_CORES; u32 pending; chained_irq_enter(chip, desc); From 656c5ba50b7172a0ea25dc1b37606bd51d01fe8d Mon Sep 17 00:00:00 2001 From: Saurabh Sengar Date: Thu, 9 Jun 2022 10:16:36 -0700 Subject: [PATCH 083/298] Drivers: hv: vmbus: Release cpu lock in error case In case of invalid sub channel, release cpu lock before returning. Fixes: a949e86c0d780 ("Drivers: hv: vmbus: Resolve race between init_vp_index() and CPU hotplug") Signed-off-by: Saurabh Sengar Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1654794996-13244-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Wei Liu --- drivers/hv/channel_mgmt.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 280b52927758..5b120402d405 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -639,6 +639,7 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) */ if (newchannel->offermsg.offer.sub_channel_index == 0) { mutex_unlock(&vmbus_connection.channel_mutex); + cpus_read_unlock(); /* * Don't call free_channel(), because newchannel->kobj * is not initialized yet. From 9c1e916960c1192e746bf615e4dae25423473a64 Mon Sep 17 00:00:00 2001 From: Wesley Cheng Date: Mon, 23 May 2022 14:39:48 -0700 Subject: [PATCH 084/298] usb: dwc3: gadget: Fix IN endpoint max packet size allocation The current logic to assign the max packet limit for IN endpoints attempts to take the default HW value and apply the optimal endpoint settings based on it. However, if the default value reports a TxFIFO size large enough for only one max packet, it will divide the value and assign a smaller ep max packet limit. For example, if the default TxFIFO size fits 1024B, current logic will assign 1024/3 = 341B to ep max packet size. If function drivers attempt to request for an endpoint with a wMaxPacketSize of 1024B (SS BULK max packet size) then it will fail, as the gadget is unable to find an endpoint which can fit the requested size. Functionally, if the TxFIFO has enough space to fit one max packet, it will be sufficient, at least when initializing the endpoints. Fixes: d94ea5319813 ("usb: dwc3: gadget: Properly set maxpacket limit") Cc: stable Signed-off-by: Wesley Cheng Link: https://lore.kernel.org/r/20220523213948.22142-1-quic_wcheng@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc3/gadget.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 00427d108ab9..8716bece1072 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2976,6 +2976,7 @@ static int dwc3_gadget_init_in_endpoint(struct dwc3_ep *dep) struct dwc3 *dwc = dep->dwc; u32 mdwidth; int size; + int maxpacket; mdwidth = dwc3_mdwidth(dwc); @@ -2988,21 +2989,24 @@ static int dwc3_gadget_init_in_endpoint(struct dwc3_ep *dep) else size = DWC31_GTXFIFOSIZ_TXFDEP(size); - /* FIFO Depth is in MDWDITH bytes. Multiply */ - size *= mdwidth; - /* - * To meet performance requirement, a minimum TxFIFO size of 3x - * MaxPacketSize is recommended for endpoints that support burst and a - * minimum TxFIFO size of 2x MaxPacketSize for endpoints that don't - * support burst. Use those numbers and we can calculate the max packet - * limit as below. + * maxpacket size is determined as part of the following, after assuming + * a mult value of one maxpacket: + * DWC3 revision 280A and prior: + * fifo_size = mult * (max_packet / mdwidth) + 1; + * maxpacket = mdwidth * (fifo_size - 1); + * + * DWC3 revision 290A and onwards: + * fifo_size = mult * ((max_packet + mdwidth)/mdwidth + 1) + 1 + * maxpacket = mdwidth * ((fifo_size - 1) - 1) - mdwidth; */ - if (dwc->maximum_speed >= USB_SPEED_SUPER) - size /= 3; + if (DWC3_VER_IS_PRIOR(DWC3, 290A)) + maxpacket = mdwidth * (size - 1); else - size /= 2; + maxpacket = mdwidth * ((size - 1) - 1) - mdwidth; + /* Functionally, space for one max packet is sufficient */ + size = min_t(int, maxpacket, 1024); usb_ep_set_maxpacket_limit(&dep->endpoint, size); dep->endpoint.max_streams = 16; From 7ddda2614d62ef7fdef7fd85f5151cdf665b22d8 Mon Sep 17 00:00:00 2001 From: Stephan Gerhold Date: Sat, 28 May 2022 19:09:13 +0200 Subject: [PATCH 085/298] usb: dwc3: pci: Restore line lost in merge conflict resolution Commit 582ab24e096f ("usb: dwc3: pci: Set "linux,phy_charger_detect" property on some Bay Trail boards") added a new swnode similar to the existing ones for boards where the PHY handles charger detection. Unfortunately, the "linux,sysdev_is_parent" property got lost in the merge conflict resolution of commit ca9400ef7f67 ("Merge 5.17-rc6 into usb-next"). Now dwc3_pci_intel_phy_charger_detect_properties is the only swnode in dwc3-pci that is missing "linux,sysdev_is_parent". It does not seem to cause any obvious functional issues, but it's certainly unintended so restore the line to make the properties consistent again. Fixes: ca9400ef7f67 ("Merge 5.17-rc6 into usb-next") Cc: stable@vger.kernel.org Signed-off-by: Stephan Gerhold Link: https://lore.kernel.org/r/20220528170913.9240-1-stephan@gerhold.net Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc3/dwc3-pci.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/usb/dwc3/dwc3-pci.c b/drivers/usb/dwc3/dwc3-pci.c index ba51de7dd760..6b018048fe2e 100644 --- a/drivers/usb/dwc3/dwc3-pci.c +++ b/drivers/usb/dwc3/dwc3-pci.c @@ -127,6 +127,7 @@ static const struct property_entry dwc3_pci_intel_phy_charger_detect_properties[ PROPERTY_ENTRY_STRING("dr_mode", "peripheral"), PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"), PROPERTY_ENTRY_BOOL("linux,phy_charger_detect"), + PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"), {} }; From 3755278f078460b021cd0384562977bf2039a57a Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Mon, 30 May 2022 12:54:12 +0400 Subject: [PATCH 086/298] usb: dwc2: Fix memory leak in dwc2_hcd_init usb_create_hcd will alloc memory for hcd, and we should call usb_put_hcd to free it when platform_get_resource() fails to prevent memory leak. goto error2 label instead error1 to fix this. Fixes: 856e6e8e0f93 ("usb: dwc2: check return value after calling platform_get_resource()") Cc: stable Acked-by: Minas Harutyunyan Signed-off-by: Miaoqian Lin Link: https://lore.kernel.org/r/20220530085413.44068-1-linmq006@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc2/hcd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c index f63a27d11fac..3f107a06817d 100644 --- a/drivers/usb/dwc2/hcd.c +++ b/drivers/usb/dwc2/hcd.c @@ -5190,7 +5190,7 @@ int dwc2_hcd_init(struct dwc2_hsotg *hsotg) res = platform_get_resource(pdev, IORESOURCE_MEM, 0); if (!res) { retval = -EINVAL; - goto error1; + goto error2; } hcd->rsrc_start = res->start; hcd->rsrc_len = resource_size(res); From 4757c9ade34178b351580133771f510b5ffcf9c8 Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Fri, 3 Jun 2022 18:02:44 +0400 Subject: [PATCH 087/298] usb: gadget: lpc32xx_udc: Fix refcount leak in lpc32xx_udc_probe of_parse_phandle() returns a node pointer with refcount incremented, we should use of_node_put() on it when not need anymore. Add missing of_node_put() to avoid refcount leak. of_node_put() will check NULL pointer. Fixes: 24a28e428351 ("USB: gadget driver for LPC32xx") Cc: stable Signed-off-by: Miaoqian Lin Link: https://lore.kernel.org/r/20220603140246.64529-1-linmq006@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/gadget/udc/lpc32xx_udc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/usb/gadget/udc/lpc32xx_udc.c b/drivers/usb/gadget/udc/lpc32xx_udc.c index 6117ae8e7242..cea10cdb83ae 100644 --- a/drivers/usb/gadget/udc/lpc32xx_udc.c +++ b/drivers/usb/gadget/udc/lpc32xx_udc.c @@ -3016,6 +3016,7 @@ static int lpc32xx_udc_probe(struct platform_device *pdev) } udc->isp1301_i2c_client = isp1301_get_client(isp1301_node); + of_node_put(isp1301_node); if (!udc->isp1301_i2c_client) { return -EPROBE_DEFER; } From b337af3a4d6147000b7ca6b3438bf5c820849b37 Mon Sep 17 00:00:00 2001 From: Marian Postevca Date: Fri, 3 Jun 2022 18:34:59 +0300 Subject: [PATCH 088/298] usb: gadget: u_ether: fix regression in setting fixed MAC address In systemd systems setting a fixed MAC address through the "dev_addr" module argument fails systematically. When checking the MAC address after the interface is created it always has the same but different MAC address to the one supplied as argument. This is partially caused by systemd which by default will set an internally generated permanent MAC address for interfaces that are marked as having a randomly generated address. Commit 890d5b40908bfd1a ("usb: gadget: u_ether: fix race in setting MAC address in setup phase") didn't take into account the fact that the interface must be marked as having a set MAC address when it's set as module argument. Fixed by marking the interface with NET_ADDR_SET when the "dev_addr" module argument is supplied. Fixes: 890d5b40908bfd1a ("usb: gadget: u_ether: fix race in setting MAC address in setup phase") Cc: stable@vger.kernel.org Signed-off-by: Marian Postevca Link: https://lore.kernel.org/r/20220603153459.32722-1-posteuca@mutex.one Signed-off-by: Greg Kroah-Hartman --- drivers/usb/gadget/function/u_ether.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/usb/gadget/function/u_ether.c b/drivers/usb/gadget/function/u_ether.c index 6f5d45ef2e39..f51694f29de9 100644 --- a/drivers/usb/gadget/function/u_ether.c +++ b/drivers/usb/gadget/function/u_ether.c @@ -775,9 +775,13 @@ struct eth_dev *gether_setup_name(struct usb_gadget *g, dev->qmult = qmult; snprintf(net->name, sizeof(net->name), "%s%%d", netname); - if (get_ether_addr(dev_addr, addr)) + if (get_ether_addr(dev_addr, addr)) { + net->addr_assign_type = NET_ADDR_RANDOM; dev_warn(&g->dev, "using random %s ethernet address\n", "self"); + } else { + net->addr_assign_type = NET_ADDR_SET; + } eth_hw_addr_set(net, addr); if (get_ether_addr(host_addr, dev->host_mac)) dev_warn(&g->dev, @@ -844,6 +848,10 @@ struct net_device *gether_setup_name_default(const char *netname) eth_random_addr(dev->dev_mac); pr_warn("using random %s ethernet address\n", "self"); + + /* by default we always have a random MAC address */ + net->addr_assign_type = NET_ADDR_RANDOM; + eth_random_addr(dev->host_mac); pr_warn("using random %s ethernet address\n", "host"); @@ -871,7 +879,6 @@ int gether_register_netdev(struct net_device *net) dev = netdev_priv(net); g = dev->gadget; - net->addr_assign_type = NET_ADDR_RANDOM; eth_hw_addr_set(net, dev->dev_mac); status = register_netdev(net); @@ -912,6 +919,7 @@ int gether_set_dev_addr(struct net_device *net, const char *dev_addr) if (get_ether_addr(dev_addr, new_addr)) return -EINVAL; memcpy(dev->dev_mac, new_addr, ETH_ALEN); + net->addr_assign_type = NET_ADDR_SET; return 0; } EXPORT_SYMBOL_GPL(gether_set_dev_addr); From 5c7578c39c3fffe85b7d15ca1cf8cf7ac38ec0c1 Mon Sep 17 00:00:00 2001 From: Jing Leng Date: Thu, 9 Jun 2022 10:11:34 +0800 Subject: [PATCH 089/298] usb: cdnsp: Fixed setting last_trb incorrectly When ZLP occurs in bulk transmission, currently cdnsp will set last_trb for the last two TRBs, it will trigger an error "ERROR Transfer event TRB DMA ptr not part of current TD ...". Fixes: e913aada0683 ("usb: cdnsp: Fixed issue with ZLP") Cc: stable Acked-by: Pawel Laszczak Signed-off-by: Jing Leng Link: https://lore.kernel.org/r/20220609021134.1606-1-3090101217@zju.edu.cn Signed-off-by: Greg Kroah-Hartman --- drivers/usb/cdns3/cdnsp-ring.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/drivers/usb/cdns3/cdnsp-ring.c b/drivers/usb/cdns3/cdnsp-ring.c index e45c3d6e1536..794e413800ae 100644 --- a/drivers/usb/cdns3/cdnsp-ring.c +++ b/drivers/usb/cdns3/cdnsp-ring.c @@ -1941,13 +1941,16 @@ int cdnsp_queue_bulk_tx(struct cdnsp_device *pdev, struct cdnsp_request *preq) } if (enqd_len + trb_buff_len >= full_len) { - if (need_zero_pkt) - zero_len_trb = !zero_len_trb; - - field &= ~TRB_CHAIN; - field |= TRB_IOC; - more_trbs_coming = false; - preq->td.last_trb = ring->enqueue; + if (need_zero_pkt && !zero_len_trb) { + zero_len_trb = true; + } else { + zero_len_trb = false; + field &= ~TRB_CHAIN; + field |= TRB_IOC; + more_trbs_coming = false; + need_zero_pkt = false; + preq->td.last_trb = ring->enqueue; + } } /* Only set interrupt on short packet for OUT endpoints. */ @@ -1962,7 +1965,7 @@ int cdnsp_queue_bulk_tx(struct cdnsp_device *pdev, struct cdnsp_request *preq) length_field = TRB_LEN(trb_buff_len) | TRB_TD_SIZE(remainder) | TRB_INTR_TARGET(0); - cdnsp_queue_trb(pdev, ring, more_trbs_coming | zero_len_trb, + cdnsp_queue_trb(pdev, ring, more_trbs_coming, lower_32_bits(send_addr), upper_32_bits(send_addr), length_field, From 8bd6b8c4b1009d7d2662138d6bdc6fe58a9274c5 Mon Sep 17 00:00:00 2001 From: Stephen Rothwell Date: Tue, 26 Apr 2022 15:27:39 +1000 Subject: [PATCH 090/298] USB: fixup for merge issue with "usb: dwc3: Don't switch OTG -> peripheral if extcon is present" Today's linux-next merge of the extcon tree got a conflict in: drivers/usb/dwc3/drd.c between commit: 0f0101719138 ("usb: dwc3: Don't switch OTG -> peripheral if extcon is present") from the usb tree and commit: 88490c7f43c4 ("extcon: Fix extcon_get_extcon_dev() error handling") from the extcon tree. I fixed it up (the former moved the code modified by the latter, so I used the former version of this files and added the following merge fix patch) and can carry the fix as necessary. Signed-off-by: Stephen Rothwell Link: https://lore.kernel.org/r/20220426152739.62f6836e@canb.auug.org.au Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc3/core.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c index e027c0420dc3..573421984948 100644 --- a/drivers/usb/dwc3/core.c +++ b/drivers/usb/dwc3/core.c @@ -1644,13 +1644,8 @@ static struct extcon_dev *dwc3_get_extcon(struct dwc3 *dwc) * This device property is for kernel internal use only and * is expected to be set by the glue code. */ - if (device_property_read_string(dev, "linux,extcon-name", &name) == 0) { - edev = extcon_get_extcon_dev(name); - if (!edev) - return ERR_PTR(-EPROBE_DEFER); - - return edev; - } + if (device_property_read_string(dev, "linux,extcon-name", &name) == 0) + return extcon_get_extcon_dev(name); /* * Try to get an extcon device from the USB PHY controller's "port" From 81b0d0e4f811553cbe2d58c8a495c124fb626432 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christian=20K=C3=B6nig?= Date: Fri, 3 Jun 2022 12:43:50 +0200 Subject: [PATCH 091/298] drm/ttm: fix missing NULL check in ttm_device_swapout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Resources about to be destructed are not tied to BOs any more. Signed-off-by: Christian König Reviewed-by: Felix Kuehling Fixes: 6a9b02899402 ("drm/ttm: move the LRU into resource handling v4") Link: https://patchwork.freedesktop.org/patch/msgid/20220603104604.456991-1-christian.koenig@amd.com --- drivers/gpu/drm/ttm/ttm_device.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c index a0562ab386f5..e7147e304637 100644 --- a/drivers/gpu/drm/ttm/ttm_device.c +++ b/drivers/gpu/drm/ttm/ttm_device.c @@ -156,8 +156,12 @@ int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx, ttm_resource_manager_for_each_res(man, &cursor, res) { struct ttm_buffer_object *bo = res->bo; - uint32_t num_pages = PFN_UP(bo->base.size); + uint32_t num_pages; + if (!bo) + continue; + + num_pages = PFN_UP(bo->base.size); ret = ttm_bo_swapout(bo, ctx, gfp_flags); /* ttm_bo_swapout has dropped the lru_lock */ if (!ret) From e74024b2eccbb784824a0f9feaeaaa3b47514b79 Mon Sep 17 00:00:00 2001 From: Tony Lindgren Date: Mon, 23 May 2022 18:50:52 +0300 Subject: [PATCH 092/298] tty: n_gsm: Debug output allocation must use GFP_ATOMIC Dan Carpenter reported the following Smatch warning: drivers/tty/n_gsm.c:720 gsm_data_kick() warn: sleeping in atomic context This is because gsm_control_message() is holding a spin lock so gsm_hex_dump_bytes() needs to use GFP_ATOMIC instead of GFP_KERNEL. Fixes: 925ea0fa5277 ("tty: n_gsm: Fix packet data hex dump output") Cc: stable Reported-by: Dan Carpenter Reviewed-by: Gregory CLEMENT Signed-off-by: Tony Lindgren Link: https://lore.kernel.org/r/20220523155052.57129-1-tony@atomide.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/n_gsm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c index 137eebdcfda9..fd4d24f61c46 100644 --- a/drivers/tty/n_gsm.c +++ b/drivers/tty/n_gsm.c @@ -455,7 +455,7 @@ static void gsm_hex_dump_bytes(const char *fname, const u8 *data, return; } - prefix = kasprintf(GFP_KERNEL, "%s: ", fname); + prefix = kasprintf(GFP_ATOMIC, "%s: ", fname); if (!prefix) return; print_hex_dump(KERN_INFO, prefix, DUMP_PREFIX_OFFSET, 16, 1, data, len, From cfab87c2c2715763dc7e43d9968bdaa01cde4bc3 Mon Sep 17 00:00:00 2001 From: Vijaya Krishna Nivarthi Date: Wed, 8 Jun 2022 00:22:44 +0530 Subject: [PATCH 093/298] serial: core: Introduce callback for start_rx and do stop_rx in suspend only if this callback implementation is present. In suspend sequence there is a need to perform stop_rx during suspend sequence to prevent any asynchronous data over rx line. However this can cause problem to drivers which dont do re-start_rx during set_termios. Add new callback start_rx and perform stop_rx only when implementation of start_rx is present. Also add call to start_rx in resume sequence so that drivers who come across this problem can make use of this framework. Fixes: c9d2325cdb92 ("serial: core: Do stop_rx in suspend path for console if console_suspend is disabled") Reviewed-by: Douglas Anderson Signed-off-by: Vijaya Krishna Nivarthi Link: https://lore.kernel.org/r/1654627965-1461-2-git-send-email-quic_vnivarth@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/serial/serial_core.c | 9 ++++++--- include/linux/serial_core.h | 1 + 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c index 9a85b41caa0a..338ebadfd44b 100644 --- a/drivers/tty/serial/serial_core.c +++ b/drivers/tty/serial/serial_core.c @@ -2214,11 +2214,12 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport) /* * Nothing to do if the console is not suspending * except stop_rx to prevent any asynchronous data - * over RX line. Re-start_rx, when required, is - * done by set_termios in resume sequence + * over RX line. However ensure that we will be + * able to Re-start_rx later. */ if (!console_suspend_enabled && uart_console(uport)) { - uport->ops->stop_rx(uport); + if (uport->ops->start_rx) + uport->ops->stop_rx(uport); goto unlock; } @@ -2310,6 +2311,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport) if (console_suspend_enabled) uart_change_pm(state, UART_PM_STATE_ON); uport->ops->set_termios(uport, &termios, NULL); + if (!console_suspend_enabled && uport->ops->start_rx) + uport->ops->start_rx(uport); if (console_suspend_enabled) console_start(uport->cons); } diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h index cbd5070bc87f..657a0fc68a3f 100644 --- a/include/linux/serial_core.h +++ b/include/linux/serial_core.h @@ -45,6 +45,7 @@ struct uart_ops { void (*unthrottle)(struct uart_port *); void (*send_xchar)(struct uart_port *, char ch); void (*stop_rx)(struct uart_port *); + void (*start_rx)(struct uart_port *); void (*enable_ms)(struct uart_port *); void (*break_ctl)(struct uart_port *, int ctl); int (*startup)(struct uart_port *); From 654a8d6c93e77ecff2256ca3ab2cd98967821f0a Mon Sep 17 00:00:00 2001 From: Vijaya Krishna Nivarthi Date: Wed, 8 Jun 2022 00:22:45 +0530 Subject: [PATCH 094/298] tty: serial: qcom-geni-serial: Implement start_rx callback In suspend sequence stop_rx will be performed only if implementation for start_rx callback is present. Set qcom_geni_serial_start_rx as callback for start_rx so that stop_rx is performed. Fixes: c9d2325cdb92 ("serial: core: Do stop_rx in suspend path for console if console_suspend is disabled") Reviewed-by: Douglas Anderson Signed-off-by: Vijaya Krishna Nivarthi Link: https://lore.kernel.org/r/1654627965-1461-3-git-send-email-quic_vnivarth@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/serial/qcom_geni_serial.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c index 4733a233bd0c..f8f950641ad9 100644 --- a/drivers/tty/serial/qcom_geni_serial.c +++ b/drivers/tty/serial/qcom_geni_serial.c @@ -1306,6 +1306,7 @@ static const struct uart_ops qcom_geni_console_pops = { .stop_tx = qcom_geni_serial_stop_tx, .start_tx = qcom_geni_serial_start_tx, .stop_rx = qcom_geni_serial_stop_rx, + .start_rx = qcom_geni_serial_start_rx, .set_termios = qcom_geni_serial_set_termios, .startup = qcom_geni_serial_startup, .request_port = qcom_geni_serial_request_port, From 499e13aac6c762e1e828172b0f0f5275651d6512 Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Thu, 9 Jun 2022 16:17:04 +0200 Subject: [PATCH 095/298] tty: goldfish: Fix free_irq() on remove Pass the correct dev_id to free_irq() to fix this splat when the driver is unbound: WARNING: CPU: 0 PID: 30 at kernel/irq/manage.c:1895 free_irq Trying to free already-free IRQ 65 Call Trace: warn_slowpath_fmt free_irq goldfish_tty_remove platform_remove device_remove device_release_driver_internal device_driver_detach unbind_store drv_attr_store ... Fixes: 465893e18878e119 ("tty: goldfish: support platform_device with id -1") Signed-off-by: Vincent Whitchurch Link: https://lore.kernel.org/r/20220609141704.1080024-1-vincent.whitchurch@axis.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/goldfish.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/tty/goldfish.c b/drivers/tty/goldfish.c index c7968aecd870..d02de3f0326f 100644 --- a/drivers/tty/goldfish.c +++ b/drivers/tty/goldfish.c @@ -426,7 +426,7 @@ static int goldfish_tty_remove(struct platform_device *pdev) tty_unregister_device(goldfish_tty_driver, qtty->console.index); iounmap(qtty->base); qtty->base = NULL; - free_irq(qtty->irq, pdev); + free_irq(qtty->irq, qtty); tty_port_destroy(&qtty->port); goldfish_tty_current_line_count--; if (goldfish_tty_current_line_count == 0) From be03b0651ffd8bab69dfd574c6818b446c0753ce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= Date: Fri, 20 May 2022 13:35:41 +0300 Subject: [PATCH 096/298] serial: 8250: Store to lsr_save_flags after lsr read MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Not all LSR register flags are preserved across reads. Therefore, LSR readers must store the non-preserved bits into lsr_save_flags. This fix was initially mixed into feature commit f6f586102add ("serial: 8250: Handle UART without interrupt on TEMT using em485"). However, that feature change had a flaw and it was reverted to make room for simpler approach providing the same feature. The embedded fix got reverted with the feature change. Re-add the lsr_save_flags fix and properly mark it's a fix. Link: https://lore.kernel.org/all/1d6c31d-d194-9e6a-ddf9-5f29af829f3@linux.intel.com/T/#m1737eef986bd20cf19593e344cebd7b0244945fc Fixes: e490c9144cfa ("tty: Add software emulated RS485 support for 8250") Cc: stable Acked-by: Uwe Kleine-König Signed-off-by: Uwe Kleine-König Signed-off-by: Ilpo Järvinen Link: https://lore.kernel.org/r/f4d774be-1437-a550-8334-19d8722ab98c@linux.intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/serial/8250/8250_port.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c index 78b6dedc43e6..8f32fe9e149e 100644 --- a/drivers/tty/serial/8250/8250_port.c +++ b/drivers/tty/serial/8250/8250_port.c @@ -1517,6 +1517,8 @@ static inline void __stop_tx(struct uart_8250_port *p) unsigned char lsr = serial_in(p, UART_LSR); u64 stop_delay = 0; + p->lsr_saved_flags |= lsr & LSR_SAVE_FLAGS; + if (!(lsr & UART_LSR_THRE)) return; /* From 802dcafc420af536fcde1b44ac51ca211f4ec673 Mon Sep 17 00:00:00 2001 From: Mathias Nyman Date: Fri, 10 Jun 2022 14:53:38 +0300 Subject: [PATCH 097/298] xhci: Fix null pointer dereference in resume if xhci has only one roothub In the re-init path xhci_resume() passes 'hcd->primary_hcd' to hci_init(), however this field isn't initialized by __usb_create_hcd() for a HCD without secondary controller. xhci_resume() is called once per xHC device, not per hcd, so the extra checking for primary hcd can be removed. Fixes: e0fe986972f5 ("usb: host: xhci-plat: prepare operation w/o shared hcd") Reported-by: Matthias Kaehlcke Tested-by: Matthias Kaehlcke Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20220610115338.863152-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/host/xhci.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index f0ab63138016..9ac56e9ffc64 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -1107,7 +1107,6 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated) { u32 command, temp = 0; struct usb_hcd *hcd = xhci_to_hcd(xhci); - struct usb_hcd *secondary_hcd; int retval = 0; bool comp_timer_running = false; bool pending_portevent = false; @@ -1214,23 +1213,19 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated) * first with the primary HCD, and then with the secondary HCD. * If we don't do the same, the host will never be started. */ - if (!usb_hcd_is_primary_hcd(hcd)) - secondary_hcd = hcd; - else - secondary_hcd = xhci->shared_hcd; - xhci_dbg(xhci, "Initialize the xhci_hcd\n"); - retval = xhci_init(hcd->primary_hcd); + retval = xhci_init(hcd); if (retval) return retval; comp_timer_running = true; xhci_dbg(xhci, "Start the primary HCD\n"); - retval = xhci_run(hcd->primary_hcd); - if (!retval && secondary_hcd) { + retval = xhci_run(hcd); + if (!retval && xhci->shared_hcd) { xhci_dbg(xhci, "Start the secondary HCD\n"); - retval = xhci_run(secondary_hcd); + retval = xhci_run(xhci->shared_hcd); } + hcd->state = HC_STATE_SUSPENDED; if (xhci->shared_hcd) xhci->shared_hcd->state = HC_STATE_SUSPENDED; From fb1f16d74e263baa4ad11e31e28b68f144aa55ed Mon Sep 17 00:00:00 2001 From: Linyu Yuan Date: Fri, 10 Jun 2022 20:17:57 +0800 Subject: [PATCH 098/298] usb: gadget: f_fs: change ep->status safe in ffs_epfile_io() If a task read/write data in blocking mode, it will wait the completion in ffs_epfile_io(), if function unbind occurs, ffs_func_unbind() will kfree ffs ep, once the task wake up, it still dereference the ffs ep to obtain the request status. Fix it by moving the request status to io_data which is stack-safe. Cc: # 5.15 Reported-by: Michael Wu Tested-by: Michael Wu Reviewed-by: John Keeping Signed-off-by: Linyu Yuan Link: https://lore.kernel.org/r/1654863478-26228-2-git-send-email-quic_linyyuan@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/gadget/function/f_fs.c | 34 +++++++++++++++++------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index 4585ee3a444a..e1fcd8bc80a1 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -122,8 +122,6 @@ struct ffs_ep { struct usb_endpoint_descriptor *descs[3]; u8 num; - - int status; /* P: epfile->mutex */ }; struct ffs_epfile { @@ -227,6 +225,9 @@ struct ffs_io_data { bool use_sg; struct ffs_data *ffs; + + int status; + struct completion done; }; struct ffs_desc_helper { @@ -707,12 +708,15 @@ static const struct file_operations ffs_ep0_operations = { static void ffs_epfile_io_complete(struct usb_ep *_ep, struct usb_request *req) { + struct ffs_io_data *io_data = req->context; + ENTER(); - if (req->context) { - struct ffs_ep *ep = _ep->driver_data; - ep->status = req->status ? req->status : req->actual; - complete(req->context); - } + if (req->status) + io_data->status = req->status; + else + io_data->status = req->actual; + + complete(&io_data->done); } static ssize_t ffs_copy_to_iter(void *data, int data_len, struct iov_iter *iter) @@ -1050,7 +1054,6 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) WARN(1, "%s: data_len == -EINVAL\n", __func__); ret = -EINVAL; } else if (!io_data->aio) { - DECLARE_COMPLETION_ONSTACK(done); bool interrupted = false; req = ep->req; @@ -1066,7 +1069,8 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) io_data->buf = data; - req->context = &done; + init_completion(&io_data->done); + req->context = io_data; req->complete = ffs_epfile_io_complete; ret = usb_ep_queue(ep->ep, req, GFP_ATOMIC); @@ -1075,7 +1079,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) spin_unlock_irq(&epfile->ffs->eps_lock); - if (wait_for_completion_interruptible(&done)) { + if (wait_for_completion_interruptible(&io_data->done)) { /* * To avoid race condition with ffs_epfile_io_complete, * dequeue the request first then check @@ -1083,17 +1087,17 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) * condition with req->complete callback. */ usb_ep_dequeue(ep->ep, req); - wait_for_completion(&done); - interrupted = ep->status < 0; + wait_for_completion(&io_data->done); + interrupted = io_data->status < 0; } if (interrupted) ret = -EINTR; - else if (io_data->read && ep->status > 0) - ret = __ffs_epfile_read_data(epfile, data, ep->status, + else if (io_data->read && io_data->status > 0) + ret = __ffs_epfile_read_data(epfile, data, io_data->status, &io_data->data); else - ret = ep->status; + ret = io_data->status; goto error_mutex; } else if (!(req = usb_ep_alloc_request(ep->ep, GFP_ATOMIC))) { ret = -ENOMEM; From 0698f0209d8032e8869525aeb68f65ee7fde12ad Mon Sep 17 00:00:00 2001 From: Linyu Yuan Date: Fri, 10 Jun 2022 20:17:58 +0800 Subject: [PATCH 099/298] usb: gadget: f_fs: change ep->ep safe in ffs_epfile_io() In ffs_epfile_io(), when read/write data in blocking mode, it will wait the completion in interruptible mode, if task receive a signal, it will terminate the wait, at same time, if function unbind occurs, ffs_func_unbind() will kfree all eps, ffs_epfile_io() still try to dequeue request by dereferencing ep which may become invalid. Fix it by add ep spinlock and will not dereference ep if it is not valid. Cc: # 5.15 Reported-by: Michael Wu Tested-by: Michael Wu Reviewed-by: John Keeping Signed-off-by: Linyu Yuan Link: https://lore.kernel.org/r/1654863478-26228-3-git-send-email-quic_linyyuan@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/gadget/function/f_fs.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index e1fcd8bc80a1..e0fa4b186ec6 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -1080,6 +1080,11 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) spin_unlock_irq(&epfile->ffs->eps_lock); if (wait_for_completion_interruptible(&io_data->done)) { + spin_lock_irq(&epfile->ffs->eps_lock); + if (epfile->ep != ep) { + ret = -ESHUTDOWN; + goto error_lock; + } /* * To avoid race condition with ffs_epfile_io_complete, * dequeue the request first then check @@ -1087,6 +1092,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data) * condition with req->complete callback. */ usb_ep_dequeue(ep->ep, req); + spin_unlock_irq(&epfile->ffs->eps_lock); wait_for_completion(&io_data->done); interrupted = io_data->status < 0; } From 242439f7e279d86b3f73b5de724bc67b2f8aeb07 Mon Sep 17 00:00:00 2001 From: Ian Abbott Date: Tue, 7 Jun 2022 18:18:19 +0100 Subject: [PATCH 100/298] comedi: vmk80xx: fix expression for tx buffer size The expression for setting the size of the allocated bulk TX buffer (`devpriv->usb_tx_buf`) is calling `usb_endpoint_maxp(devpriv->ep_rx)`, which is using the wrong endpoint (should be `devpriv->ep_tx`). Fix it. Fixes: a23461c47482 ("comedi: vmk80xx: fix transfer-buffer overflow") Cc: Johan Hovold Cc: stable@vger.kernel.org # 4.9+ Reviewed-by: Johan Hovold Signed-off-by: Ian Abbott Link: https://lore.kernel.org/r/20220607171819.4121-1-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman --- drivers/comedi/drivers/vmk80xx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/comedi/drivers/vmk80xx.c b/drivers/comedi/drivers/vmk80xx.c index 46023adc5395..4536ed43f65b 100644 --- a/drivers/comedi/drivers/vmk80xx.c +++ b/drivers/comedi/drivers/vmk80xx.c @@ -684,7 +684,7 @@ static int vmk80xx_alloc_usb_buffers(struct comedi_device *dev) if (!devpriv->usb_rx_buf) return -ENOMEM; - size = max(usb_endpoint_maxp(devpriv->ep_rx), MIN_BUF_SIZE); + size = max(usb_endpoint_maxp(devpriv->ep_tx), MIN_BUF_SIZE); devpriv->usb_tx_buf = kzalloc(size, GFP_KERNEL); if (!devpriv->usb_tx_buf) return -ENOMEM; From bd476c1306ea989d6d9eb65295572e98d93edeb6 Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Mon, 23 May 2022 08:05:22 -0700 Subject: [PATCH 101/298] misc: rtsx: Fix clang -Wsometimes-uninitialized in rts5261_init_from_hw() Clang warns: drivers/misc/cardreader/rts5261.c:406:13: error: variable 'setting_reg2' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] } else if (efuse_valid == 0) { ^~~~~~~~~~~~~~~~ drivers/misc/cardreader/rts5261.c:412:30: note: uninitialized use occurs here pci_read_config_dword(pdev, setting_reg2, &lval2); ^~~~~~~~~~~~ efuse_valid == 1 is not a valid value so just return early from the function to avoid using setting_reg2 uninitialized. Fixes: b1c5f3085149 ("misc: rtsx: add rts5261 efuse function") Reported-by: Dan Carpenter Reported-by: kernel test robot Reported-by: Tom Rix Suggested-by: Ricky WU Acked-by: Arnd Bergmann Signed-off-by: Nathan Chancellor Link: https://lore.kernel.org/r/20220523150521.2947108-1-nathan@kernel.org Signed-off-by: Greg Kroah-Hartman --- drivers/misc/cardreader/rts5261.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/misc/cardreader/rts5261.c b/drivers/misc/cardreader/rts5261.c index 749cc5a46d13..b1e76030cafd 100644 --- a/drivers/misc/cardreader/rts5261.c +++ b/drivers/misc/cardreader/rts5261.c @@ -407,6 +407,8 @@ static void rts5261_init_from_hw(struct rtsx_pcr *pcr) // default setting_reg1 = PCR_SETTING_REG1; setting_reg2 = PCR_SETTING_REG2; + } else { + return; } pci_read_config_dword(pdev, setting_reg2, &lval2); From 6497e7776441e0567c02b9c12b133d2ba51918df Mon Sep 17 00:00:00 2001 From: Shreenidhi Shedi Date: Fri, 3 Jun 2022 18:30:40 +0530 Subject: [PATCH 102/298] char: lp: remove redundant initialization of err err is getting assigned with an appropriate value before returning, hence this initialization is unnecessary. Signed-off-by: Shreenidhi Shedi Link: https://lore.kernel.org/r/20220603130040.601673-2-sshedi@vmware.com Signed-off-by: Greg Kroah-Hartman --- drivers/char/lp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/char/lp.c b/drivers/char/lp.c index 0e22e3b0a04e..38aad99ebb61 100644 --- a/drivers/char/lp.c +++ b/drivers/char/lp.c @@ -1019,7 +1019,7 @@ static struct parport_driver lp_driver = { static int __init lp_init(void) { - int i, err = 0; + int i, err; if (parport_nr[0] == LP_PARPORT_OFF) return 0; From 1c245358ce0b13669f6d1625f7a4e05c41f28980 Mon Sep 17 00:00:00 2001 From: Miaoqian Lin Date: Wed, 1 Jun 2022 16:30:26 +0400 Subject: [PATCH 103/298] misc: atmel-ssc: Fix IRQ check in ssc_probe platform_get_irq() returns negative error number instead 0 on failure. And the doc of platform_get_irq() provides a usage example: int irq = platform_get_irq(pdev, 0); if (irq < 0) return irq; Fix the check of return value to catch errors correctly. Fixes: eb1f2930609b ("Driver for the Atmel on-chip SSC on AT32AP and AT91") Reviewed-by: Claudiu Beznea Signed-off-by: Miaoqian Lin Link: https://lore.kernel.org/r/20220601123026.7119-1-linmq006@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/atmel-ssc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/misc/atmel-ssc.c b/drivers/misc/atmel-ssc.c index d6cd5537126c..69f9b0336410 100644 --- a/drivers/misc/atmel-ssc.c +++ b/drivers/misc/atmel-ssc.c @@ -232,9 +232,9 @@ static int ssc_probe(struct platform_device *pdev) clk_disable_unprepare(ssc->clk); ssc->irq = platform_get_irq(pdev, 0); - if (!ssc->irq) { + if (ssc->irq < 0) { dev_dbg(&pdev->dev, "could not get irq\n"); - return -ENXIO; + return ssc->irq; } mutex_lock(&user_lock); From cd756dafd86ee3a4969906086f3c2537e0c6d9d0 Mon Sep 17 00:00:00 2001 From: Peter Robinson Date: Mon, 6 Jun 2022 14:22:00 +0100 Subject: [PATCH 104/298] staging: Also remove the Unisys visorbus.h The commit that removed the Unisys s-Par and visorbus drivers left around the include/linux/visorbus.h file mentioned in the MAINTAINERS entry, we can also remove that too. Fixes: e5f45b011e4a ("staging: Remove the drivers for the Unisys s-Par") Reviewed-by: Fabio M. De Francesco Signed-off-by: Peter Robinson Link: https://lore.kernel.org/r/20220606132200.2873243-1-pbrobinson@gmail.com Signed-off-by: Greg Kroah-Hartman --- include/linux/visorbus.h | 344 --------------------------------------- 1 file changed, 344 deletions(-) delete mode 100644 include/linux/visorbus.h diff --git a/include/linux/visorbus.h b/include/linux/visorbus.h deleted file mode 100644 index 0d8bd6769b13..000000000000 --- a/include/linux/visorbus.h +++ /dev/null @@ -1,344 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0+ -/* - * Copyright (C) 2010 - 2013 UNISYS CORPORATION - * All rights reserved. - */ - -/* - * This header file is to be included by other kernel mode components that - * implement a particular kind of visor_device. Each of these other kernel - * mode components is called a visor device driver. Refer to visortemplate - * for a minimal sample visor device driver. - * - * There should be nothing in this file that is private to the visorbus - * bus implementation itself. - */ - -#ifndef __VISORBUS_H__ -#define __VISORBUS_H__ - -#include - -#define VISOR_CHANNEL_SIGNATURE ('L' << 24 | 'N' << 16 | 'C' << 8 | 'E') - -/* - * enum channel_serverstate - * @CHANNELSRV_UNINITIALIZED: Channel is in an undefined state. - * @CHANNELSRV_READY: Channel has been initialized by server. - */ -enum channel_serverstate { - CHANNELSRV_UNINITIALIZED = 0, - CHANNELSRV_READY = 1 -}; - -/* - * enum channel_clientstate - * @CHANNELCLI_DETACHED: - * @CHANNELCLI_DISABLED: Client can see channel but is NOT allowed to use it - * unless given TBD* explicit request - * (should actually be < DETACHED). - * @CHANNELCLI_ATTACHING: Legacy EFI client request for EFI server to attach. - * @CHANNELCLI_ATTACHED: Idle, but client may want to use channel any time. - * @CHANNELCLI_BUSY: Client either wants to use or is using channel. - * @CHANNELCLI_OWNED: "No worries" state - client can access channel - * anytime. - */ -enum channel_clientstate { - CHANNELCLI_DETACHED = 0, - CHANNELCLI_DISABLED = 1, - CHANNELCLI_ATTACHING = 2, - CHANNELCLI_ATTACHED = 3, - CHANNELCLI_BUSY = 4, - CHANNELCLI_OWNED = 5 -}; - -/* - * Values for VISOR_CHANNEL_PROTOCOL.Features: This define exists so that - * a guest can look at the FeatureFlags in the io channel, and configure the - * driver to use interrupts or not based on this setting. All feature bits for - * all channels should be defined here. The io channel feature bits are defined - * below. - */ -#define VISOR_DRIVER_ENABLES_INTS (0x1ULL << 1) -#define VISOR_CHANNEL_IS_POLLING (0x1ULL << 3) -#define VISOR_IOVM_OK_DRIVER_DISABLING_INTS (0x1ULL << 4) -#define VISOR_DRIVER_DISABLES_INTS (0x1ULL << 5) -#define VISOR_DRIVER_ENHANCED_RCVBUF_CHECKING (0x1ULL << 6) - -/* - * struct channel_header - Common Channel Header - * @signature: Signature. - * @legacy_state: DEPRECATED - being replaced by. - * @header_size: sizeof(struct channel_header). - * @size: Total size of this channel in bytes. - * @features: Flags to modify behavior. - * @chtype: Channel type: data, bus, control, etc.. - * @partition_handle: ID of guest partition. - * @handle: Device number of this channel in client. - * @ch_space_offset: Offset in bytes to channel specific area. - * @version_id: Struct channel_header Version ID. - * @partition_index: Index of guest partition. - * @zone_uuid: Guid of Channel's zone. - * @cli_str_offset: Offset from channel header to null-terminated - * ClientString (0 if ClientString not present). - * @cli_state_boot: CHANNEL_CLIENTSTATE of pre-boot EFI client of this - * channel. - * @cmd_state_cli: CHANNEL_COMMANDSTATE (overloaded in Windows drivers, see - * ServerStateUp, ServerStateDown, etc). - * @cli_state_os: CHANNEL_CLIENTSTATE of Guest OS client of this channel. - * @ch_characteristic: CHANNEL_CHARACTERISTIC_. - * @cmd_state_srv: CHANNEL_COMMANDSTATE (overloaded in Windows drivers, see - * ServerStateUp, ServerStateDown, etc). - * @srv_state: CHANNEL_SERVERSTATE. - * @cli_error_boot: Bits to indicate err states for boot clients, so err - * messages can be throttled. - * @cli_error_os: Bits to indicate err states for OS clients, so err - * messages can be throttled. - * @filler: Pad out to 128 byte cacheline. - * @recover_channel: Please add all new single-byte values below here. - */ -struct channel_header { - u64 signature; - u32 legacy_state; - /* SrvState, CliStateBoot, and CliStateOS below */ - u32 header_size; - u64 size; - u64 features; - guid_t chtype; - u64 partition_handle; - u64 handle; - u64 ch_space_offset; - u32 version_id; - u32 partition_index; - guid_t zone_guid; - u32 cli_str_offset; - u32 cli_state_boot; - u32 cmd_state_cli; - u32 cli_state_os; - u32 ch_characteristic; - u32 cmd_state_srv; - u32 srv_state; - u8 cli_error_boot; - u8 cli_error_os; - u8 filler[1]; - u8 recover_channel; -} __packed; - -#define VISOR_CHANNEL_ENABLE_INTS (0x1ULL << 0) - -/* - * struct signal_queue_header - Subheader for the Signal Type variation of the - * Common Channel. - * @version: SIGNAL_QUEUE_HEADER Version ID. - * @chtype: Queue type: storage, network. - * @size: Total size of this queue in bytes. - * @sig_base_offset: Offset to signal queue area. - * @features: Flags to modify behavior. - * @num_sent: Total # of signals placed in this queue. - * @num_overflows: Total # of inserts failed due to full queue. - * @signal_size: Total size of a signal for this queue. - * @max_slots: Max # of slots in queue, 1 slot is always empty. - * @max_signals: Max # of signals in queue (MaxSignalSlots-1). - * @head: Queue head signal #. - * @num_received: Total # of signals removed from this queue. - * @tail: Queue tail signal. - * @reserved1: Reserved field. - * @reserved2: Reserved field. - * @client_queue: - * @num_irq_received: Total # of Interrupts received. This is incremented by the - * ISR in the guest windows driver. - * @num_empty: Number of times that visor_signal_remove is called and - * returned Empty Status. - * @errorflags: Error bits set during SignalReinit to denote trouble with - * client's fields. - * @filler: Pad out to 64 byte cacheline. - */ -struct signal_queue_header { - /* 1st cache line */ - u32 version; - u32 chtype; - u64 size; - u64 sig_base_offset; - u64 features; - u64 num_sent; - u64 num_overflows; - u32 signal_size; - u32 max_slots; - u32 max_signals; - u32 head; - /* 2nd cache line */ - u64 num_received; - u32 tail; - u32 reserved1; - u64 reserved2; - u64 client_queue; - u64 num_irq_received; - u64 num_empty; - u32 errorflags; - u8 filler[12]; -} __packed; - -/* VISORCHANNEL Guids */ -/* {414815ed-c58c-11da-95a9-00e08161165f} */ -#define VISOR_VHBA_CHANNEL_GUID \ - GUID_INIT(0x414815ed, 0xc58c, 0x11da, \ - 0x95, 0xa9, 0x0, 0xe0, 0x81, 0x61, 0x16, 0x5f) -#define VISOR_VHBA_CHANNEL_GUID_STR \ - "414815ed-c58c-11da-95a9-00e08161165f" -struct visorchipset_state { - u32 created:1; - u32 attached:1; - u32 configured:1; - u32 running:1; - /* Remaining bits in this 32-bit word are reserved. */ -}; - -/** - * struct visor_device - A device type for things "plugged" into the visorbus - * bus - * @visorchannel: Points to the channel that the device is - * associated with. - * @channel_type_guid: Identifies the channel type to the bus driver. - * @device: Device struct meant for use by the bus driver - * only. - * @list_all: Used by the bus driver to enumerate devices. - * @timer: Timer fired periodically to do interrupt-type - * activity. - * @being_removed: Indicates that the device is being removed from - * the bus. Private bus driver use only. - * @visordriver_callback_lock: Used by the bus driver to lock when adding and - * removing devices. - * @pausing: Indicates that a change towards a paused state. - * is in progress. Only modified by the bus driver. - * @resuming: Indicates that a change towards a running state - * is in progress. Only modified by the bus driver. - * @chipset_bus_no: Private field used by the bus driver. - * @chipset_dev_no: Private field used the bus driver. - * @state: Used to indicate the current state of the - * device. - * @inst: Unique GUID for this instance of the device. - * @name: Name of the device. - * @pending_msg_hdr: For private use by bus driver to respond to - * hypervisor requests. - * @vbus_hdr_info: A pointer to header info. Private use by bus - * driver. - * @partition_guid: Indicates client partion id. This should be the - * same across all visor_devices in the current - * guest. Private use by bus driver only. - */ -struct visor_device { - struct visorchannel *visorchannel; - guid_t channel_type_guid; - /* These fields are for private use by the bus driver only. */ - struct device device; - struct list_head list_all; - struct timer_list timer; - bool timer_active; - bool being_removed; - struct mutex visordriver_callback_lock; /* synchronize probe/remove */ - bool pausing; - bool resuming; - u32 chipset_bus_no; - u32 chipset_dev_no; - struct visorchipset_state state; - guid_t inst; - u8 *name; - struct controlvm_message_header *pending_msg_hdr; - void *vbus_hdr_info; - guid_t partition_guid; - struct dentry *debugfs_dir; - struct dentry *debugfs_bus_info; -}; - -#define to_visor_device(x) container_of(x, struct visor_device, device) - -typedef void (*visorbus_state_complete_func) (struct visor_device *dev, - int status); - -/* - * This struct describes a specific visor channel, by providing its GUID, name, - * and sizes. - */ -struct visor_channeltype_descriptor { - const guid_t guid; - const char *name; - u64 min_bytes; - u32 version; -}; - -/** - * struct visor_driver - Information provided by each visor driver when it - * registers with the visorbus driver - * @name: Name of the visor driver. - * @owner: The module owner. - * @channel_types: Types of channels handled by this driver, ending with - * a zero GUID. Our specialized BUS.match() method knows - * about this list, and uses it to determine whether this - * driver will in fact handle a new device that it has - * detected. - * @probe: Called when a new device comes online, by our probe() - * function specified by driver.probe() (triggered - * ultimately by some call to driver_register(), - * bus_add_driver(), or driver_attach()). - * @remove: Called when a new device is removed, by our remove() - * function specified by driver.remove() (triggered - * ultimately by some call to device_release_driver()). - * @channel_interrupt: Called periodically, whenever there is a possiblity - * that "something interesting" may have happened to the - * channel. - * @pause: Called to initiate a change of the device's state. If - * the return valu`e is < 0, there was an error and the - * state transition will NOT occur. If the return value - * is >= 0, then the state transition was INITIATED - * successfully, and complete_func() will be called (or - * was just called) with the final status when either the - * state transition fails or completes successfully. - * @resume: Behaves similar to pause. - * @driver: Private reference to the device driver. For use by bus - * driver only. - */ -struct visor_driver { - const char *name; - struct module *owner; - struct visor_channeltype_descriptor *channel_types; - int (*probe)(struct visor_device *dev); - void (*remove)(struct visor_device *dev); - void (*channel_interrupt)(struct visor_device *dev); - int (*pause)(struct visor_device *dev, - visorbus_state_complete_func complete_func); - int (*resume)(struct visor_device *dev, - visorbus_state_complete_func complete_func); - - /* These fields are for private use by the bus driver only. */ - struct device_driver driver; -}; - -#define to_visor_driver(x) (container_of(x, struct visor_driver, driver)) - -int visor_check_channel(struct channel_header *ch, struct device *dev, - const guid_t *expected_uuid, char *chname, - u64 expected_min_bytes, u32 expected_version, - u64 expected_signature); - -int visorbus_register_visor_driver(struct visor_driver *drv); -void visorbus_unregister_visor_driver(struct visor_driver *drv); -int visorbus_read_channel(struct visor_device *dev, - unsigned long offset, void *dest, - unsigned long nbytes); -int visorbus_write_channel(struct visor_device *dev, - unsigned long offset, void *src, - unsigned long nbytes); -int visorbus_enable_channel_interrupts(struct visor_device *dev); -void visorbus_disable_channel_interrupts(struct visor_device *dev); - -int visorchannel_signalremove(struct visorchannel *channel, u32 queue, - void *msg); -int visorchannel_signalinsert(struct visorchannel *channel, u32 queue, - void *msg); -bool visorchannel_signalempty(struct visorchannel *channel, u32 queue); -const guid_t *visorchannel_get_guid(struct visorchannel *channel); - -#define BUS_ROOT_DEVICE UINT_MAX -struct visor_device *visorbus_get_device_by_id(u32 bus_no, u32 dev_no, - struct visor_device *from); -#endif From 9f4639373e6756e1ccf0029f861f1061db3c3616 Mon Sep 17 00:00:00 2001 From: Alexander Usyskin Date: Mon, 6 Jun 2022 17:42:23 +0300 Subject: [PATCH 105/298] mei: me: set internal pg flag to off on hardware reset Link reset flow is always performed in the runtime resumed state. The internal PG state may be left as ON after the suspend and will not be updated upon the resume if the D0i3 is not supported. Ensure that the internal PG state is set to the right value on the flow entrance in case the firmware does not support D0i3. Signed-off-by: Alexander Usyskin Signed-off-by: Tomas Winkler Link: https://lore.kernel.org/r/20220606144225.282375-1-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/mei/hw-me.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/misc/mei/hw-me.c b/drivers/misc/mei/hw-me.c index 9870bf717979..befa491e3344 100644 --- a/drivers/misc/mei/hw-me.c +++ b/drivers/misc/mei/hw-me.c @@ -1154,6 +1154,8 @@ static int mei_me_hw_reset(struct mei_device *dev, bool intr_enable) ret = mei_me_d0i3_exit_sync(dev); if (ret) return ret; + } else { + hw->pg_state = MEI_PG_OFF; } } From 68553650bc9c57c7e530c84e5b2945e9dfe1a560 Mon Sep 17 00:00:00 2001 From: Alexander Usyskin Date: Mon, 6 Jun 2022 17:42:24 +0300 Subject: [PATCH 106/298] mei: hbm: drop capability response on early shutdown Drop HBM responses also in the early shutdown phase where the usual traffic is allowed. Extend the rule that drop HBM responses received during the shutdown phase by also in MEI_DEV_POWERING_DOWN state. This resolves the stall if the driver is stopping in the middle of the link initialization or link reset. Drop the capabilities response on early shutdown. Fixes: 6d7163f2c49f ("mei: hbm: drop hbm responses on early shutdown") Cc: Signed-off-by: Alexander Usyskin Signed-off-by: Tomas Winkler Link: https://lore.kernel.org/r/20220606144225.282375-2-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/mei/hbm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/misc/mei/hbm.c b/drivers/misc/mei/hbm.c index cebcca6d6d3e..cf2b8261da14 100644 --- a/drivers/misc/mei/hbm.c +++ b/drivers/misc/mei/hbm.c @@ -1351,7 +1351,8 @@ int mei_hbm_dispatch(struct mei_device *dev, struct mei_msg_hdr *hdr) if (dev->dev_state != MEI_DEV_INIT_CLIENTS || dev->hbm_state != MEI_HBM_CAP_SETUP) { - if (dev->dev_state == MEI_DEV_POWER_DOWN) { + if (dev->dev_state == MEI_DEV_POWER_DOWN || + dev->dev_state == MEI_DEV_POWERING_DOWN) { dev_dbg(dev->dev, "hbm: capabilities response: on shutdown, ignoring\n"); return 0; } From 3ed8c7d39cfef831fe508fc1308f146912fa72e6 Mon Sep 17 00:00:00 2001 From: Alexander Usyskin Date: Mon, 6 Jun 2022 17:42:25 +0300 Subject: [PATCH 107/298] mei: me: add raptor lake point S DID Add Raptor (Point) Lake S device id. Cc: Signed-off-by: Alexander Usyskin Signed-off-by: Tomas Winkler Link: https://lore.kernel.org/r/20220606144225.282375-3-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/mei/hw-me-regs.h | 2 ++ drivers/misc/mei/pci-me.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/drivers/misc/mei/hw-me-regs.h b/drivers/misc/mei/hw-me-regs.h index 64ce3f830262..15e8e2b322b1 100644 --- a/drivers/misc/mei/hw-me-regs.h +++ b/drivers/misc/mei/hw-me-regs.h @@ -109,6 +109,8 @@ #define MEI_DEV_ID_ADP_P 0x51E0 /* Alder Lake Point P */ #define MEI_DEV_ID_ADP_N 0x54E0 /* Alder Lake Point N */ +#define MEI_DEV_ID_RPL_S 0x7A68 /* Raptor Lake Point S */ + /* * MEI HW Section */ diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c index 33e58821e478..5435604327a7 100644 --- a/drivers/misc/mei/pci-me.c +++ b/drivers/misc/mei/pci-me.c @@ -116,6 +116,8 @@ static const struct pci_device_id mei_me_pci_tbl[] = { {MEI_PCI_DEVICE(MEI_DEV_ID_ADP_P, MEI_ME_PCH15_CFG)}, {MEI_PCI_DEVICE(MEI_DEV_ID_ADP_N, MEI_ME_PCH15_CFG)}, + {MEI_PCI_DEVICE(MEI_DEV_ID_RPL_S, MEI_ME_PCH15_CFG)}, + /* required last entry */ {0, } }; From 928ea98252ad75118950941683893cf904541da9 Mon Sep 17 00:00:00 2001 From: Shin'ichiro Kawasaki Date: Wed, 1 Jun 2022 19:51:59 +0900 Subject: [PATCH 108/298] bus: fsl-mc-bus: fix KASAN use-after-free in fsl_mc_bus_remove() In fsl_mc_bus_remove(), mc->root_mc_bus_dev->mc_io is passed to fsl_destroy_mc_io(). However, mc->root_mc_bus_dev is already freed in fsl_mc_device_remove(). Then reference to mc->root_mc_bus_dev->mc_io triggers KASAN use-after-free. To avoid the use-after-free, keep the reference to mc->root_mc_bus_dev->mc_io in a local variable and pass to fsl_destroy_mc_io(). This patch needs rework to apply to kernels older than v5.15. Fixes: f93627146f0e ("staging: fsl-mc: fix asymmetry in destroy of mc_io") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Shin'ichiro Kawasaki Link: https://lore.kernel.org/r/20220601105159.87752-1-shinichiro.kawasaki@wdc.com Signed-off-by: Greg Kroah-Hartman --- drivers/bus/fsl-mc/fsl-mc-bus.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c index e81a9700cfd0..6143dbf31f31 100644 --- a/drivers/bus/fsl-mc/fsl-mc-bus.c +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c @@ -1239,14 +1239,14 @@ error_cleanup_mc_io: static int fsl_mc_bus_remove(struct platform_device *pdev) { struct fsl_mc *mc = platform_get_drvdata(pdev); + struct fsl_mc_io *mc_io; if (!fsl_mc_is_root_dprc(&mc->root_mc_bus_dev->dev)) return -EINVAL; + mc_io = mc->root_mc_bus_dev->mc_io; fsl_mc_device_remove(mc->root_mc_bus_dev); - - fsl_destroy_mc_io(mc->root_mc_bus_dev->mc_io); - mc->root_mc_bus_dev->mc_io = NULL; + fsl_destroy_mc_io(mc_io); bus_unregister_notifier(&fsl_mc_bus_type, &fsl_mc_nb); From 0a35780c755ccec097d15c6b4ff8b246a89f1689 Mon Sep 17 00:00:00 2001 From: Brad Bishop Date: Tue, 24 May 2022 16:51:42 -0500 Subject: [PATCH 109/298] eeprom: at25: Split reads into chunks and cap write size Make use of spi_max_transfer_size to avoid requesting transfers that are too large for some spi controllers. Signed-off-by: Brad Bishop Signed-off-by: Eddie James Signed-off-by: Joel Stanley Link: https://lore.kernel.org/r/20220524215142.60047-1-eajames@linux.ibm.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/eeprom/at25.c | 97 +++++++++++++++++++++----------------- 1 file changed, 55 insertions(+), 42 deletions(-) diff --git a/drivers/misc/eeprom/at25.c b/drivers/misc/eeprom/at25.c index 8d169a35cf13..c9c56fd194c1 100644 --- a/drivers/misc/eeprom/at25.c +++ b/drivers/misc/eeprom/at25.c @@ -79,6 +79,11 @@ static int at25_ee_read(void *priv, unsigned int offset, { struct at25_data *at25 = priv; char *buf = val; + size_t max_chunk = spi_max_transfer_size(at25->spi); + size_t num_msgs = DIV_ROUND_UP(count, max_chunk); + size_t nr_bytes = 0; + unsigned int msg_offset; + size_t msg_count; u8 *cp; ssize_t status; struct spi_transfer t[2]; @@ -92,54 +97,59 @@ static int at25_ee_read(void *priv, unsigned int offset, if (unlikely(!count)) return -EINVAL; - cp = at25->command; + msg_offset = (unsigned int)offset; + msg_count = min(count, max_chunk); + while (num_msgs) { + cp = at25->command; - instr = AT25_READ; - if (at25->chip.flags & EE_INSTR_BIT3_IS_ADDR) - if (offset >= BIT(at25->addrlen * 8)) - instr |= AT25_INSTR_BIT3; + instr = AT25_READ; + if (at25->chip.flags & EE_INSTR_BIT3_IS_ADDR) + if (msg_offset >= BIT(at25->addrlen * 8)) + instr |= AT25_INSTR_BIT3; - mutex_lock(&at25->lock); + mutex_lock(&at25->lock); - *cp++ = instr; + *cp++ = instr; - /* 8/16/24-bit address is written MSB first */ - switch (at25->addrlen) { - default: /* case 3 */ - *cp++ = offset >> 16; - fallthrough; - case 2: - *cp++ = offset >> 8; - fallthrough; - case 1: - case 0: /* can't happen: for better code generation */ - *cp++ = offset >> 0; + /* 8/16/24-bit address is written MSB first */ + switch (at25->addrlen) { + default: /* case 3 */ + *cp++ = msg_offset >> 16; + fallthrough; + case 2: + *cp++ = msg_offset >> 8; + fallthrough; + case 1: + case 0: /* can't happen: for better code generation */ + *cp++ = msg_offset >> 0; + } + + spi_message_init(&m); + memset(t, 0, sizeof(t)); + + t[0].tx_buf = at25->command; + t[0].len = at25->addrlen + 1; + spi_message_add_tail(&t[0], &m); + + t[1].rx_buf = buf + nr_bytes; + t[1].len = msg_count; + spi_message_add_tail(&t[1], &m); + + status = spi_sync(at25->spi, &m); + + mutex_unlock(&at25->lock); + + if (status) + return status; + + --num_msgs; + msg_offset += msg_count; + nr_bytes += msg_count; } - spi_message_init(&m); - memset(t, 0, sizeof(t)); - - t[0].tx_buf = at25->command; - t[0].len = at25->addrlen + 1; - spi_message_add_tail(&t[0], &m); - - t[1].rx_buf = buf; - t[1].len = count; - spi_message_add_tail(&t[1], &m); - - /* - * Read it all at once. - * - * REVISIT that's potentially a problem with large chips, if - * other devices on the bus need to be accessed regularly or - * this chip is clocked very slowly. - */ - status = spi_sync(at25->spi, &m); - dev_dbg(&at25->spi->dev, "read %zu bytes at %d --> %zd\n", - count, offset, status); - - mutex_unlock(&at25->lock); - return status; + dev_dbg(&at25->spi->dev, "read %zu bytes at %d\n", + count, offset); + return 0; } /* Read extra registers as ID or serial number */ @@ -190,6 +200,7 @@ ATTRIBUTE_GROUPS(sernum); static int at25_ee_write(void *priv, unsigned int off, void *val, size_t count) { struct at25_data *at25 = priv; + size_t maxsz = spi_max_transfer_size(at25->spi); const char *buf = val; int status = 0; unsigned buf_size; @@ -253,6 +264,8 @@ static int at25_ee_write(void *priv, unsigned int off, void *val, size_t count) segment = buf_size - (offset % buf_size); if (segment > count) segment = count; + if (segment > maxsz) + segment = maxsz; memcpy(cp, buf, segment); status = spi_write(at25->spi, bounce, segment + at25->addrlen + 1); From c349ae5f831cb9817a45e4e36705d2f7d47c7bc3 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:13 -0400 Subject: [PATCH 110/298] Documentation: add description for net.sctp.reconf_enable Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: c0d8bab6ae51 ("sctp: add get and set sockopt for reconf_enable") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 04216564a03c..aebf87b19fba 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2925,6 +2925,17 @@ plpmtud_probe_interval - INTEGER Default: 0 +reconf_enable - BOOLEAN + Enable or disable extension of Stream Reconfiguration functionality + specified in RFC6525. This extension provides the ability to "reset" + a stream, and it includes the Parameters of "Outgoing/Incoming SSN + Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams". + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + ``/proc/sys/net/core/*`` ======================== From e65775fdd389e4f47eb1972ef6372e20c6c2cc05 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:14 -0400 Subject: [PATCH 111/298] Documentation: add description for net.sctp.intl_enable Describe it in networking/ip-sysctl.rst like other SCTP options. We need to document this especially as when using the feature of User Message Interleaving, some socket options also needs to be set. Fixes: 463118c34a35 ("sctp: support sysctl to allow users to use stream interleave") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index aebf87b19fba..5fe837505d43 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2936,6 +2936,20 @@ reconf_enable - BOOLEAN Default: 0 +intl_enable - BOOLEAN + Enable or disable extension of User Message Interleaving functionality + specified in RFC8260. This extension allows the interleaving of user + messages sent on different streams. With this feature enabled, I-DATA + chunk will replace DATA chunk to carry user messages if also supported + by the peer. Note that to use this feature, one needs to set this option + to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2 + and SCTP_INTERLEAVING_SUPPORTED to 1. + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + ``/proc/sys/net/core/*`` ======================== From 249eddaf651fda7cb32e9ebae4c6d5904b390d81 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:15 -0400 Subject: [PATCH 112/298] Documentation: add description for net.sctp.ecn_enable Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: 2f5268a9249b ("sctp: allow users to set netns ecn flag with sysctl") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 5fe837505d43..9f41961d11d5 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2950,6 +2950,18 @@ intl_enable - BOOLEAN Default: 0 +ecn_enable - BOOLEAN + Control use of Explicit Congestion Notification (ECN) by SCTP. + Like in TCP, ECN is used only when both ends of the SCTP connection + indicate support for it. This feature is useful in avoiding losses + due to congestion by allowing supporting routers to signal congestion + before having to drop packets. + + 1: Enable ecn. + 0: Disable ecn. + + Default: 1 + ``/proc/sys/net/core/*`` ======================== From abfed87e2a12bd246047d78c01d81eb9529f1d06 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Sat, 28 May 2022 12:24:29 +0200 Subject: [PATCH 113/298] crypto: memneq - move into lib/ This is used by code that doesn't need CONFIG_CRYPTO, so move this into lib/ with a Kconfig option so that it can be selected by whatever needs it. This fixes a linker error Zheng pointed out when CRYPTO_MANAGER_DISABLE_TESTS!=y and CRYPTO=m: lib/crypto/curve25519-selftest.o: In function `curve25519_selftest': curve25519-selftest.c:(.init.text+0x60): undefined reference to `__crypto_memneq' curve25519-selftest.c:(.init.text+0xec): undefined reference to `__crypto_memneq' curve25519-selftest.c:(.init.text+0x114): undefined reference to `__crypto_memneq' curve25519-selftest.c:(.init.text+0x154): undefined reference to `__crypto_memneq' Reported-by: Zheng Bin Cc: Eric Biggers Cc: stable@vger.kernel.org Fixes: aa127963f1ca ("crypto: lib/curve25519 - re-add selftests") Signed-off-by: Jason A. Donenfeld Reviewed-by: Eric Biggers Signed-off-by: Herbert Xu --- crypto/Kconfig | 1 + crypto/Makefile | 2 +- lib/Kconfig | 3 +++ lib/Makefile | 1 + lib/crypto/Kconfig | 1 + {crypto => lib}/memneq.c | 0 6 files changed, 7 insertions(+), 1 deletion(-) rename {crypto => lib}/memneq.c (100%) diff --git a/crypto/Kconfig b/crypto/Kconfig index 19197469cfab..1d44893a997b 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -15,6 +15,7 @@ source "crypto/async_tx/Kconfig" # menuconfig CRYPTO tristate "Cryptographic API" + select LIB_MEMNEQ help This option provides the core Cryptographic API. diff --git a/crypto/Makefile b/crypto/Makefile index 43bc33e247d1..ceaaa9f34145 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -4,7 +4,7 @@ # obj-$(CONFIG_CRYPTO) += crypto.o -crypto-y := api.o cipher.o compress.o memneq.o +crypto-y := api.o cipher.o compress.o obj-$(CONFIG_CRYPTO_ENGINE) += crypto_engine.o obj-$(CONFIG_CRYPTO_FIPS) += fips.o diff --git a/lib/Kconfig b/lib/Kconfig index 6a843639814f..eaaad4d85bf2 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -120,6 +120,9 @@ config INDIRECT_IOMEM_FALLBACK source "lib/crypto/Kconfig" +config LIB_MEMNEQ + bool + config CRC_CCITT tristate "CRC-CCITT functions" help diff --git a/lib/Makefile b/lib/Makefile index ea54294d73bf..f99bf61f8bbc 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -251,6 +251,7 @@ obj-$(CONFIG_DIMLIB) += dim/ obj-$(CONFIG_SIGNATURE) += digsig.o lib-$(CONFIG_CLZ_TAB) += clz_tab.o +lib-$(CONFIG_LIB_MEMNEQ) += memneq.o obj-$(CONFIG_GENERIC_STRNCPY_FROM_USER) += strncpy_from_user.o obj-$(CONFIG_GENERIC_STRNLEN_USER) += strnlen_user.o diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig index 9856e291f414..2082af43d51f 100644 --- a/lib/crypto/Kconfig +++ b/lib/crypto/Kconfig @@ -71,6 +71,7 @@ config CRYPTO_LIB_CURVE25519 tristate "Curve25519 scalar multiplication library" depends on CRYPTO_ARCH_HAVE_LIB_CURVE25519 || !CRYPTO_ARCH_HAVE_LIB_CURVE25519 select CRYPTO_LIB_CURVE25519_GENERIC if CRYPTO_ARCH_HAVE_LIB_CURVE25519=n + select LIB_MEMNEQ help Enable the Curve25519 library interface. This interface may be fulfilled by either the generic implementation or an arch-specific diff --git a/crypto/memneq.c b/lib/memneq.c similarity index 100% rename from crypto/memneq.c rename to lib/memneq.c From 5e757deddd918edb8cb2fdb56eb79656ffc6dade Mon Sep 17 00:00:00 2001 From: Conor Dooley Date: Fri, 3 Jun 2022 09:38:26 +0100 Subject: [PATCH 114/298] riscv: dts: microchip: re-add pdma to mpfs device tree PolarFire SoC /does/ have a SiFive pdma, despite what I suggested as a conflict resolution to Zong. Somehow the entry fell through the cracks between versions of my dt patches, so re-add it with Zong's updated compatible & dma-channels property. Fixes: c5094f371008 ("riscv: dts: microchip: refactor icicle kit device tree") Signed-off-by: Conor Dooley --- arch/riscv/boot/dts/microchip/mpfs.dtsi | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi b/arch/riscv/boot/dts/microchip/mpfs.dtsi index 8c3259134194..3095d08453a1 100644 --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi @@ -192,6 +192,15 @@ riscv,ndev = <186>; }; + pdma: dma-controller@3000000 { + compatible = "sifive,fu540-c000-pdma", "sifive,pdma0"; + reg = <0x0 0x3000000 0x0 0x8000>; + interrupt-parent = <&plic>; + interrupts = <5 6>, <7 8>, <9 10>, <11 12>; + dma-channels = <4>; + #dma-cells = <1>; + }; + clkcfg: clkcfg@20002000 { compatible = "microchip,mpfs-clkcfg"; reg = <0x0 0x20002000 0x0 0x1000>, <0x0 0x3E001000 0x0 0x1000>; From e32683c6f7d22ba624e0bfc58b02cf3348bdca63 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Thu, 9 Jun 2022 00:17:32 -0700 Subject: [PATCH 115/298] x86/mm: Fix RESERVE_BRK() for older binutils With binutils 2.26, RESERVE_BRK() causes a build failure: /tmp/ccnGOKZ5.s: Assembler messages: /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: junk at end of line, first unrecognized character is `U' The problem is this line: RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE) Specifically, the INIT_PGT_BUF_SIZE macro which (via PAGE_SIZE's use _AC()) has a "1UL", which makes older versions of the assembler unhappy. Unfortunately the _AC() macro doesn't work for inline asm. Inline asm was only needed here to convince the toolchain to add the STT_NOBITS flag. However, if a C variable is placed in a section whose name is prefixed with ".bss", GCC and Clang automatically set STT_NOBITS. In fact, ".bss..page_aligned" already relies on this trick. So fix the build failure (and simplify the macro) by allocating the variable in C. Also, add NOLOAD to the ".brk" output section clause in the linker script. This is a failsafe in case the ".bss" prefix magic trick ever stops working somehow. If there's a section type mismatch, the GNU linker will force the ".brk" output section to be STT_NOBITS. The LLVM linker will fail with a "section type mismatch" error. Note this also changes the name of the variable from .brk.##name to __brk_##name. The variable names aren't actually used anywhere, so it's harmless. Fixes: a1e2c031ec39 ("x86/mm: Simplify RESERVE_BRK()") Reported-by: Joe Damato Reported-by: Byungchul Park Signed-off-by: Josh Poimboeuf Signed-off-by: Peter Zijlstra (Intel) Tested-by: Joe Damato Link: https://lore.kernel.org/r/22d07a44c80d8e8e1e82b9a806ddc8c6bbb2606e.1654759036.git.jpoimboe@kernel.org --- arch/x86/include/asm/setup.h | 38 +++++++++++++++++++---------------- arch/x86/kernel/setup.c | 5 ----- arch/x86/kernel/vmlinux.lds.S | 4 ++-- 3 files changed, 23 insertions(+), 24 deletions(-) diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h index 7590ac2570b9..f8b9ee97a891 100644 --- a/arch/x86/include/asm/setup.h +++ b/arch/x86/include/asm/setup.h @@ -108,19 +108,16 @@ extern unsigned long _brk_end; void *extend_brk(size_t size, size_t align); /* - * Reserve space in the brk section. The name must be unique within the file, - * and somewhat descriptive. The size is in bytes. + * Reserve space in the .brk section, which is a block of memory from which the + * caller is allowed to allocate very early (before even memblock is available) + * by calling extend_brk(). All allocated memory will be eventually converted + * to memblock. Any leftover unallocated memory will be freed. * - * The allocation is done using inline asm (rather than using a section - * attribute on a normal variable) in order to allow the use of @nobits, so - * that it doesn't take up any space in the vmlinux file. + * The size is in bytes. */ -#define RESERVE_BRK(name, size) \ - asm(".pushsection .brk_reservation,\"aw\",@nobits\n\t" \ - ".brk." #name ":\n\t" \ - ".skip " __stringify(size) "\n\t" \ - ".size .brk." #name ", " __stringify(size) "\n\t" \ - ".popsection\n\t") +#define RESERVE_BRK(name, size) \ + __section(".bss..brk") __aligned(1) __used \ + static char __brk_##name[size] extern void probe_roms(void); #ifdef __i386__ @@ -133,12 +130,19 @@ asmlinkage void __init x86_64_start_reservations(char *real_mode_data); #endif /* __i386__ */ #endif /* _SETUP */ -#else -#define RESERVE_BRK(name,sz) \ - .pushsection .brk_reservation,"aw",@nobits; \ -.brk.name: \ -1: .skip sz; \ - .size .brk.name,.-1b; \ + +#else /* __ASSEMBLY */ + +.macro __RESERVE_BRK name, size + .pushsection .bss..brk, "aw" +SYM_DATA_START(__brk_\name) + .skip \size +SYM_DATA_END(__brk_\name) .popsection +.endm + +#define RESERVE_BRK(name, size) __RESERVE_BRK name, size + #endif /* __ASSEMBLY__ */ + #endif /* _ASM_X86_SETUP_H */ diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 3ebb85327edb..bd6c6fd373ae 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -67,11 +67,6 @@ RESERVE_BRK(dmi_alloc, 65536); #endif -/* - * Range of the BSS area. The size of the BSS area is determined - * at link time, with RESERVE_BRK() facility reserving additional - * chunks. - */ unsigned long _brk_start = (unsigned long)__brk_base; unsigned long _brk_end = (unsigned long)__brk_base; diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index f5f6dc2e8007..81aba718ecd5 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -385,10 +385,10 @@ SECTIONS __end_of_kernel_reserve = .; . = ALIGN(PAGE_SIZE); - .brk : AT(ADDR(.brk) - LOAD_OFFSET) { + .brk (NOLOAD) : AT(ADDR(.brk) - LOAD_OFFSET) { __brk_base = .; . += 64 * 1024; /* 64k alignment slop space */ - *(.brk_reservation) /* areas brk users have reserved */ + *(.bss..brk) /* areas brk users have reserved */ __brk_limit = .; } From 04193d590b390ec7a0592630f46d559ec6564ba1 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 7 Jun 2022 22:41:55 +0200 Subject: [PATCH 116/298] sched: Fix balance_push() vs __sched_setscheduler() The purpose of balance_push() is to act as a filter on task selection in the case of CPU hotplug, specifically when taking the CPU out. It does this by (ab)using the balance callback infrastructure, with the express purpose of keeping all the unlikely/odd cases in a single place. In order to serve its purpose, the balance_push_callback needs to be (exclusively) on the callback list at all times (noting that the callback always places itself back on the list the moment it runs, also noting that when the CPU goes down, regular balancing concerns are moot, so ignoring them is fine). And here-in lies the problem, __sched_setscheduler()'s use of splice_balance_callbacks() takes the callbacks off the list across a lock-break, making it possible for, an interleaving, __schedule() to see an empty list and not get filtered. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Jing-Ting Wu Signed-off-by: Peter Zijlstra (Intel) Tested-by: Jing-Ting Wu Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net --- kernel/sched/core.c | 36 +++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 5 +++++ 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bfa7452ca92e..da0bf6fe9ecd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4798,25 +4798,55 @@ static void do_balance_callbacks(struct rq *rq, struct callback_head *head) static void balance_push(struct rq *rq); +/* + * balance_push_callback is a right abuse of the callback interface and plays + * by significantly different rules. + * + * Where the normal balance_callback's purpose is to be ran in the same context + * that queued it (only later, when it's safe to drop rq->lock again), + * balance_push_callback is specifically targeted at __schedule(). + * + * This abuse is tolerated because it places all the unlikely/odd cases behind + * a single test, namely: rq->balance_callback == NULL. + */ struct callback_head balance_push_callback = { .next = NULL, .func = (void (*)(struct callback_head *))balance_push, }; -static inline struct callback_head *splice_balance_callbacks(struct rq *rq) +static inline struct callback_head * +__splice_balance_callbacks(struct rq *rq, bool split) { struct callback_head *head = rq->balance_callback; + if (likely(!head)) + return NULL; + lockdep_assert_rq_held(rq); - if (head) + /* + * Must not take balance_push_callback off the list when + * splice_balance_callbacks() and balance_callbacks() are not + * in the same rq->lock section. + * + * In that case it would be possible for __schedule() to interleave + * and observe the list empty. + */ + if (split && head == &balance_push_callback) + head = NULL; + else rq->balance_callback = NULL; return head; } +static inline struct callback_head *splice_balance_callbacks(struct rq *rq) +{ + return __splice_balance_callbacks(rq, true); +} + static void __balance_callbacks(struct rq *rq) { - do_balance_callbacks(rq, splice_balance_callbacks(rq)); + do_balance_callbacks(rq, __splice_balance_callbacks(rq, false)); } static inline void balance_callbacks(struct rq *rq, struct callback_head *head) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 01259611beb9..47b89a0fc6e5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1693,6 +1693,11 @@ queue_balance_callback(struct rq *rq, { lockdep_assert_rq_held(rq); + /* + * Don't (re)queue an already queued item; nor queue anything when + * balance_push() is active, see the comment with + * balance_push_callback. + */ if (unlikely(head->next || rq->balance_callback == &balance_push_callback)) return; From 4051a81774d6d8e28192742c26999d6f29bc0e68 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 17 May 2022 11:16:14 +0200 Subject: [PATCH 117/298] locking/lockdep: Use sched_clock() for random numbers Since the rewrote of prandom_u32(), in the commit mentioned below, the function uses sleeping locks which extracing random numbers and filling the batch. This breaks lockdep on PREEMPT_RT because lock_pin_lock() disables interrupts while calling __lock_pin_lock(). This can't be moved earlier because the main user of the function (rq_pin_lock()) invokes that function after disabling interrupts in order to acquire the lock. The cookie does not require random numbers as its goal is to provide a random value in order to notice unexpected "unlock + lock" sites. Use sched_clock() to provide random numbers. Fixes: a0103f4d86f88 ("random32: use real rng for non-deterministic randomness") Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/YoNn3pTkm5+QzE5k@linutronix.de --- kernel/locking/lockdep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 81e87280513e..f06b91ca6482 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -5432,7 +5432,7 @@ static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock) * be guessable and still allows some pin nesting in * our u32 pin_count. */ - cookie.val = 1 + (prandom_u32() >> 16); + cookie.val = 1 + (sched_clock() & 0xffff); hlock->pin_count += cookie.val; return cookie; } From b0380bf6dad4601d92025841e2b7a135d566c6e3 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Mon, 13 Jun 2022 06:32:44 +0100 Subject: [PATCH 118/298] io_uring: fix races with file table unregister Fixed file table quiesce might unlock ->uring_lock, potentially letting new requests to be submitted, don't allow those requests to use the table as they will race with unregistration. Reported-and-tested-by: van fantasy Fixes: 05f3fb3c53975 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index ed3416a7b2e9..00d266746916 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -9768,11 +9768,19 @@ static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) static int io_sqe_files_unregister(struct io_ring_ctx *ctx) { + unsigned nr = ctx->nr_user_files; int ret; if (!ctx->file_data) return -ENXIO; + + /* + * Quiesce may unlock ->uring_lock, and while it's not held + * prevent new requests using the table. + */ + ctx->nr_user_files = 0; ret = io_rsrc_ref_quiesce(ctx->file_data, ctx); + ctx->nr_user_files = nr; if (!ret) __io_sqe_files_unregister(ctx); return ret; From d11d31fc5d8a96f707facee0babdcffaafa38de2 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Mon, 13 Jun 2022 06:30:06 +0100 Subject: [PATCH 119/298] io_uring: fix races with buffer table unregister Fixed buffer table quiesce might unlock ->uring_lock, potentially letting new requests to be submitted, don't allow those requests to use the table as they will race with unregistration. Reported-and-tested-by: van fantasy Fixes: bd54b6fe3316ec ("io_uring: implement fixed buffers registration similar to fixed files") Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index 00d266746916..be05f375a776 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -10680,12 +10680,19 @@ static void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx) static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx) { + unsigned nr = ctx->nr_user_bufs; int ret; if (!ctx->buf_data) return -ENXIO; + /* + * Quiesce may unlock ->uring_lock, and while it's not held + * prevent new requests using the table. + */ + ctx->nr_user_bufs = 0; ret = io_rsrc_ref_quiesce(ctx->buf_data, ctx); + ctx->nr_user_bufs = nr; if (!ret) __io_sqe_buffers_unregister(ctx); return ret; From 05b538c1765f8d14a71ccf5f85258dcbeaf189f7 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Thu, 9 Jun 2022 08:34:35 +0100 Subject: [PATCH 120/298] io_uring: fix not locked access to fixed buf table We can look inside the fixed buffer table only while holding ->uring_lock, however in some cases we don't do the right async prep for IORING_OP_{WRITE,READ}_FIXED ending up with NULL req->imu forcing making an io-wq worker to try to resolve the fixed buffer without proper locking. Move req->imu setup into early req init paths, i.e. io_prep_rw(), which is called unconditionally for rw requests and under uring_lock. Fixes: 634d00df5e1cf ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index be05f375a776..fd8a1ffe6a1a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3636,6 +3636,20 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe) int ret; kiocb->ki_pos = READ_ONCE(sqe->off); + /* used for fixed read/write too - just read unconditionally */ + req->buf_index = READ_ONCE(sqe->buf_index); + + if (req->opcode == IORING_OP_READ_FIXED || + req->opcode == IORING_OP_WRITE_FIXED) { + struct io_ring_ctx *ctx = req->ctx; + u16 index; + + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs); + req->imu = ctx->user_bufs[index]; + io_req_set_rsrc_node(req, ctx, 0); + } ioprio = READ_ONCE(sqe->ioprio); if (ioprio) { @@ -3648,12 +3662,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe) kiocb->ki_ioprio = get_current_ioprio(); } - req->imu = NULL; req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); req->rw.flags = READ_ONCE(sqe->rw_flags); - /* used for fixed read/write too - just read unconditionally */ - req->buf_index = READ_ONCE(sqe->buf_index); return 0; } @@ -3785,20 +3796,9 @@ static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter, unsigned int issue_flags) { - struct io_mapped_ubuf *imu = req->imu; - u16 index, buf_index = req->buf_index; - - if (likely(!imu)) { - struct io_ring_ctx *ctx = req->ctx; - - if (unlikely(buf_index >= ctx->nr_user_bufs)) - return -EFAULT; - io_req_set_rsrc_node(req, ctx, issue_flags); - index = array_index_nospec(buf_index, ctx->nr_user_bufs); - imu = READ_ONCE(ctx->user_bufs[index]); - req->imu = imu; - } - return __io_import_fixed(req, rw, iter, imu); + if (WARN_ON_ONCE(!req->imu)) + return -EFAULT; + return __io_import_fixed(req, rw, iter, req->imu); } static int io_buffer_add_list(struct io_ring_ctx *ctx, From c9b576d0c7bf55aeae1a736da7974fa202c4394d Mon Sep 17 00:00:00 2001 From: Alan Previn Date: Thu, 10 Mar 2022 16:43:11 -0800 Subject: [PATCH 121/298] drm/i915/reset: Fix error_state_read ptr + offset use Fix our pointer offset usage in error_state_read when there is no i915_gpu_coredump but buf offset is non-zero. This fixes a kernel page fault can happen when multiple tests are running concurrently in a loop and one is producing engine resets and consuming the i915 error_state dump while the other is forcing full GT resets. (takes a while to trigger). The dmesg call trace: [ 5590.803000] BUG: unable to handle page fault for address: ffffffffa0b0e000 [ 5590.803009] #PF: supervisor read access in kernel mode [ 5590.803013] #PF: error_code(0x0000) - not-present page [ 5590.803016] PGD 5814067 P4D 5814067 PUD 5815063 PMD 109de4067 PTE 0 [ 5590.803022] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 5590.803026] CPU: 5 PID: 13656 Comm: i915_hangman Tainted: G U 5.17.0-rc5-ups69-guc-err-capt-rev6+ #136 [ 5590.803033] Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-M LP4x RVP, BIOS ADLPFWI1.R00. 3031.A02.2201171222 01/17/2022 [ 5590.803039] RIP: 0010:memcpy_erms+0x6/0x10 [ 5590.803045] Code: fe ff ff cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe [ 5590.803054] RSP: 0018:ffffc90003a8fdf0 EFLAGS: 00010282 [ 5590.803057] RAX: ffff888107ee9000 RBX: ffff888108cb1a00 RCX: 0000000000000f8f [ 5590.803061] RDX: 0000000000001000 RSI: ffffffffa0b0e000 RDI: ffff888107ee9071 [ 5590.803065] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 5590.803069] R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000019 [ 5590.803073] R13: 0000000000174fff R14: 0000000000001000 R15: ffff888107ee9000 [ 5590.803077] FS: 00007f62a99bee80(0000) GS:ffff88849f880000(0000) knlGS:0000000000000000 [ 5590.803082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5590.803085] CR2: ffffffffa0b0e000 CR3: 000000010a1a8004 CR4: 0000000000770ee0 [ 5590.803089] PKRU: 55555554 [ 5590.803091] Call Trace: [ 5590.803093] [ 5590.803096] error_state_read+0xa1/0xd0 [i915] [ 5590.803175] kernfs_fop_read_iter+0xb2/0x1b0 [ 5590.803180] new_sync_read+0x116/0x1a0 [ 5590.803185] vfs_read+0x114/0x1b0 [ 5590.803189] ksys_read+0x63/0xe0 [ 5590.803193] do_syscall_64+0x38/0xc0 [ 5590.803197] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 5590.803201] RIP: 0033:0x7f62aaea5912 [ 5590.803204] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 5a b9 0c 00 e8 05 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 [ 5590.803213] RSP: 002b:00007fff5b659ae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 5590.803218] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f62aaea5912 [ 5590.803221] RDX: 000000000008b000 RSI: 00007f62a8c4000f RDI: 0000000000000006 [ 5590.803225] RBP: 00007f62a8bcb00f R08: 0000000000200010 R09: 0000000000101000 [ 5590.803229] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000006 [ 5590.803233] R13: 0000000000075000 R14: 00007f62a8acb010 R15: 0000000000200000 [ 5590.803238] [ 5590.803240] Modules linked in: i915 ttm drm_buddy drm_dp_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops prime_numbers nfnetlink br_netfilter overlay mei_pxp mei_hdcp x86_pkg_temp_thermal coretemp kvm_intel snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei fuse ip_tables x_tables crct10dif_pclmul e1000e crc32_pclmul ptp i2c_i801 ghash_clmulni_intel i2c_smbus pps_core [last unloa ded: ttm] [ 5590.803277] CR2: ffffffffa0b0e000 [ 5590.803280] ---[ end trace 0000000000000000 ]--- Fixes: 0e39037b3165 ("drm/i915: Cache the error string") Signed-off-by: Alan Previn Reviewed-by: John Harrison Signed-off-by: John Harrison Link: https://patchwork.freedesktop.org/patch/msgid/20220311004311.514198-2-alan.previn.teres.alexis@intel.com (cherry picked from commit 3304033a1e69cd81a2044b4422f0d7e593afb4e6) Signed-off-by: Jani Nikula --- drivers/gpu/drm/i915/i915_sysfs.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c index 8521daba212a..a4e3b6dbb231 100644 --- a/drivers/gpu/drm/i915/i915_sysfs.c +++ b/drivers/gpu/drm/i915/i915_sysfs.c @@ -166,7 +166,14 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj, struct device *kdev = kobj_to_dev(kobj); struct drm_i915_private *i915 = kdev_minor_to_i915(kdev); struct i915_gpu_coredump *gpu; - ssize_t ret; + ssize_t ret = 0; + + /* + * FIXME: Concurrent clients triggering resets and reading + clearing + * dumps can cause inconsistent sysfs reads when a user calls in with a + * non-zero offset to complete a prior partial read but the + * gpu_coredump has been cleared or replaced. + */ gpu = i915_first_error_state(i915); if (IS_ERR(gpu)) { @@ -178,8 +185,10 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj, const char *str = "No error state collected\n"; size_t len = strlen(str); - ret = min_t(size_t, count, len - off); - memcpy(buf, str + off, ret); + if (off < len) { + ret = min_t(size_t, count, len - off); + memcpy(buf, str + off, ret); + } } return ret; From 6e3f3c239ee547c5b55a85f467c92a6ba7eee83a Mon Sep 17 00:00:00 2001 From: Ashutosh Dixit Date: Wed, 25 May 2022 06:19:20 -0700 Subject: [PATCH 122/298] drm/i915/gt: Fix memory leaks in per-gt sysfs All kmalloc'd kobjects need a kobject_put() to free memory. For example in previous code, kobj_gt_release() never gets called. The requirement of kobject_put() now results in a slightly different code organization. v2: s/gtn/gt/ (Andi) Fixes: b770bcfae9ad ("drm/i915/gt: create per-tile sysfs interface") Signed-off-by: Ashutosh Dixit Reviewed-by: Andi Shyti Acked-by: Andrzej Hajda Signed-off-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/a6f6686517c85fba61a0c45097f5bb4fe7e257fb.1653484574.git.ashutosh.dixit@intel.com (cherry picked from commit 69d6bf5c3754ffc491896632438417d1cedc2c68) Signed-off-by: Jani Nikula --- drivers/gpu/drm/i915/gt/intel_gt.c | 1 + drivers/gpu/drm/i915/gt/intel_gt_sysfs.c | 29 ++++++++++-------------- drivers/gpu/drm/i915/gt/intel_gt_sysfs.h | 6 +---- drivers/gpu/drm/i915/gt/intel_gt_types.h | 3 +++ drivers/gpu/drm/i915/i915_sysfs.c | 2 ++ 5 files changed, 19 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 53307ca0eed0..51a0fe60c050 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -785,6 +785,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt) { intel_wakeref_t wakeref; + intel_gt_sysfs_unregister(gt); intel_rps_driver_unregister(>->rps); intel_gsc_fini(>->gsc); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_sysfs.c b/drivers/gpu/drm/i915/gt/intel_gt_sysfs.c index 8ec8bc660c8c..9e4ebf53379b 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_sysfs.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_sysfs.c @@ -24,7 +24,7 @@ bool is_object_gt(struct kobject *kobj) static struct intel_gt *kobj_to_gt(struct kobject *kobj) { - return container_of(kobj, struct kobj_gt, base)->gt; + return container_of(kobj, struct intel_gt, sysfs_gt); } struct intel_gt *intel_gt_sysfs_get_drvdata(struct device *dev, @@ -72,9 +72,9 @@ static struct attribute *id_attrs[] = { }; ATTRIBUTE_GROUPS(id); +/* A kobject needs a release() method even if it does nothing */ static void kobj_gt_release(struct kobject *kobj) { - kfree(kobj); } static struct kobj_type kobj_gt_type = { @@ -85,8 +85,6 @@ static struct kobj_type kobj_gt_type = { void intel_gt_sysfs_register(struct intel_gt *gt) { - struct kobj_gt *kg; - /* * We need to make things right with the * ABI compatibility. The files were originally @@ -98,25 +96,22 @@ void intel_gt_sysfs_register(struct intel_gt *gt) if (gt_is_root(gt)) intel_gt_sysfs_pm_init(gt, gt_get_parent_obj(gt)); - kg = kzalloc(sizeof(*kg), GFP_KERNEL); - if (!kg) + /* init and xfer ownership to sysfs tree */ + if (kobject_init_and_add(>->sysfs_gt, &kobj_gt_type, + gt->i915->sysfs_gt, "gt%d", gt->info.id)) goto exit_fail; - kobject_init(&kg->base, &kobj_gt_type); - kg->gt = gt; - - /* xfer ownership to sysfs tree */ - if (kobject_add(&kg->base, gt->i915->sysfs_gt, "gt%d", gt->info.id)) - goto exit_kobj_put; - - intel_gt_sysfs_pm_init(gt, &kg->base); + intel_gt_sysfs_pm_init(gt, >->sysfs_gt); return; -exit_kobj_put: - kobject_put(&kg->base); - exit_fail: + kobject_put(>->sysfs_gt); drm_warn(>->i915->drm, "failed to initialize gt%d sysfs root\n", gt->info.id); } + +void intel_gt_sysfs_unregister(struct intel_gt *gt) +{ + kobject_put(>->sysfs_gt); +} diff --git a/drivers/gpu/drm/i915/gt/intel_gt_sysfs.h b/drivers/gpu/drm/i915/gt/intel_gt_sysfs.h index 9471b26752cf..a99aa7e8b01a 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_sysfs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_sysfs.h @@ -13,11 +13,6 @@ struct intel_gt; -struct kobj_gt { - struct kobject base; - struct intel_gt *gt; -}; - bool is_object_gt(struct kobject *kobj); struct drm_i915_private *kobj_to_i915(struct kobject *kobj); @@ -28,6 +23,7 @@ intel_gt_create_kobj(struct intel_gt *gt, const char *name); void intel_gt_sysfs_register(struct intel_gt *gt); +void intel_gt_sysfs_unregister(struct intel_gt *gt); struct intel_gt *intel_gt_sysfs_get_drvdata(struct device *dev, const char *name); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h index b06611c1d4ad..edd7a3cf5f5f 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h @@ -224,6 +224,9 @@ struct intel_gt { } mocs; struct intel_pxp pxp; + + /* gt/gtN sysfs */ + struct kobject sysfs_gt; }; enum intel_gt_scratch_field { diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c index a4e3b6dbb231..1e2750210831 100644 --- a/drivers/gpu/drm/i915/i915_sysfs.c +++ b/drivers/gpu/drm/i915/i915_sysfs.c @@ -268,4 +268,6 @@ void i915_teardown_sysfs(struct drm_i915_private *dev_priv) device_remove_bin_file(kdev, &dpf_attrs_1); device_remove_bin_file(kdev, &dpf_attrs); + + kobject_put(dev_priv->sysfs_gt); } From 842d9346b2fdda4d2fb8ccb5b87faef1ac01ab51 Mon Sep 17 00:00:00 2001 From: Nirmoy Das Date: Wed, 25 May 2022 11:59:55 +0200 Subject: [PATCH 123/298] drm/i915: Individualize fences before adding to dma_resv obj _i915_vma_move_to_active() can receive > 1 fences for multiple batch buffers submission. Because dma_resv_add_fence() can only accept one fence at a time, change _i915_vma_move_to_active() to be aware of multiple fences so that it can add individual fences to the dma resv object. v6: fix multi-line comment. v5: remove double fence reservation for batch VMAs. v4: Reserve fences for composite_fence on multi-batch contexts and also reserve fence slots to composite_fence for each VMAs. v3: dma_resv_reserve_fences is not cumulative so pass num_fences. v2: make sure to reserve enough fence slots before adding. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5614 Fixes: 544460c33821 ("drm/i915: Multi-BB execbuf") Cc: # v5.16+ Signed-off-by: Nirmoy Das Reviewed-by: Matthew Auld Reviewed-by: Andrzej Hajda Signed-off-by: Matthew Auld Link: https://patchwork.freedesktop.org/patch/msgid/20220525095955.15371-1-nirmoy.das@intel.com (cherry picked from commit 420a07b841d03f6a436d8c06571c69aa5c783897) Signed-off-by: Jani Nikula --- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 3 +- drivers/gpu/drm/i915/i915_vma.c | 48 +++++++++++-------- 2 files changed, 30 insertions(+), 21 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index c326bd2b444f..30fe847c6664 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -999,7 +999,8 @@ static int eb_validate_vmas(struct i915_execbuffer *eb) } } - err = dma_resv_reserve_fences(vma->obj->base.resv, 1); + /* Reserve enough slots to accommodate composite fences */ + err = dma_resv_reserve_fences(vma->obj->base.resv, eb->num_batches); if (err) return err; diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 4f6db539571a..0bffb70b3c5f 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -23,6 +23,7 @@ */ #include +#include #include #include "display/intel_frontbuffer.h" @@ -1823,6 +1824,21 @@ int _i915_vma_move_to_active(struct i915_vma *vma, if (unlikely(err)) return err; + /* + * Reserve fences slot early to prevent an allocation after preparing + * the workload and associating fences with dma_resv. + */ + if (fence && !(flags & __EXEC_OBJECT_NO_RESERVE)) { + struct dma_fence *curr; + int idx; + + dma_fence_array_for_each(curr, idx, fence) + ; + err = dma_resv_reserve_fences(vma->obj->base.resv, idx); + if (unlikely(err)) + return err; + } + if (flags & EXEC_OBJECT_WRITE) { struct intel_frontbuffer *front; @@ -1832,31 +1848,23 @@ int _i915_vma_move_to_active(struct i915_vma *vma, i915_active_add_request(&front->write, rq); intel_frontbuffer_put(front); } + } - if (!(flags & __EXEC_OBJECT_NO_RESERVE)) { - err = dma_resv_reserve_fences(vma->obj->base.resv, 1); - if (unlikely(err)) - return err; - } + if (fence) { + struct dma_fence *curr; + enum dma_resv_usage usage; + int idx; - if (fence) { - dma_resv_add_fence(vma->obj->base.resv, fence, - DMA_RESV_USAGE_WRITE); + obj->read_domains = 0; + if (flags & EXEC_OBJECT_WRITE) { + usage = DMA_RESV_USAGE_WRITE; obj->write_domain = I915_GEM_DOMAIN_RENDER; - obj->read_domains = 0; - } - } else { - if (!(flags & __EXEC_OBJECT_NO_RESERVE)) { - err = dma_resv_reserve_fences(vma->obj->base.resv, 1); - if (unlikely(err)) - return err; + } else { + usage = DMA_RESV_USAGE_READ; } - if (fence) { - dma_resv_add_fence(vma->obj->base.resv, fence, - DMA_RESV_USAGE_READ); - obj->write_domain = 0; - } + dma_fence_array_for_each(curr, idx, fence) + dma_resv_add_fence(vma->obj->base.resv, curr, usage); } if (flags & EXEC_OBJECT_NEEDS_FENCE && vma->fence) From e71d7c56dd69f720169c1675f87a1d22d8167767 Mon Sep 17 00:00:00 2001 From: Hao Xu Date: Sat, 11 Jun 2022 20:22:20 +0800 Subject: [PATCH 124/298] io_uring: openclose: fix bug of closing wrong fixed file Don't update ret until fixed file is closed, otherwise the file slot becomes the error code. Fixes: a7c41b4687f5 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots") Signed-off-by: Hao Xu [pavel: 5.19 rebase] Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index fd8a1ffe6a1a..e6d8cafdd28e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8035,8 +8035,8 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req, if (ret < 0) break; if (copy_to_user(&fds[done], &ret, sizeof(ret))) { - ret = -EFAULT; __io_close_fixed(req, issue_flags, ret); + ret = -EFAULT; break; } } From 42db0c00e275877eb92480beaa16b33507dc3bda Mon Sep 17 00:00:00 2001 From: Hao Xu Date: Sat, 11 Jun 2022 20:29:52 +0800 Subject: [PATCH 125/298] io_uring: kbuf: fix bug of not consuming ring buffer in partial io case When we use ring-mapped provided buffer, we should consume it before arm poll if partial io has been done. Otherwise the buffer may be used by other requests and thus we lost the data. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Hao Xu [pavel: 5.19 rebase] Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e6d8cafdd28e..84b45ed91b2d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1729,9 +1729,16 @@ static void io_kbuf_recycle(struct io_kiocb *req, unsigned issue_flags) if (!(req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING))) return; - /* don't recycle if we already did IO to this buffer */ - if (req->flags & REQ_F_PARTIAL_IO) + /* + * For legacy provided buffer mode, don't recycle if we already did + * IO to this buffer. For ring-mapped provided buffer mode, we should + * increment ring->head to explicitly monopolize the buffer to avoid + * multiple use. + */ + if ((req->flags & REQ_F_BUFFER_SELECTED) && + (req->flags & REQ_F_PARTIAL_IO)) return; + /* * We don't need to recycle for REQ_F_BUFFER_RING, we can just clear * the flag and hence ensure that bl->head doesn't get incremented. @@ -1739,8 +1746,13 @@ static void io_kbuf_recycle(struct io_kiocb *req, unsigned issue_flags) */ if (req->flags & REQ_F_BUFFER_RING) { if (req->buf_list) { - req->buf_index = req->buf_list->bgid; - req->flags &= ~REQ_F_BUFFER_RING; + if (req->flags & REQ_F_PARTIAL_IO) { + req->buf_list->head++; + req->buf_list = NULL; + } else { + req->buf_index = req->buf_list->bgid; + req->flags &= ~REQ_F_BUFFER_RING; + } } return; } From fc9375e3f763b06c3c90c5f5b2b84d3e07c1f4c2 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Sun, 12 Jun 2022 14:31:38 +0100 Subject: [PATCH 126/298] io_uring: fix double unlock for pbuf select io_buffer_select(), which is the only caller of io_ring_buffer_select(), fully handles locking, mutex unlock in io_ring_buffer_select() will lead to double unlock. Fixes: c7fb19428d67d ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Pavel Begunkov --- fs/io_uring.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 84b45ed91b2d..4719eaee3b45 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3849,10 +3849,8 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, struct io_uring_buf *buf; __u32 head = bl->head; - if (unlikely(smp_load_acquire(&br->tail) == head)) { - io_ring_submit_unlock(req->ctx, issue_flags); + if (unlikely(smp_load_acquire(&br->tail) == head)) return NULL; - } head &= bl->mask; if (head < IO_BUFFER_LIST_BUF_PER_PAGE) { From 2636e008112465ca54559ac4898da5a2515e118a Mon Sep 17 00:00:00 2001 From: Jani Nikula Date: Wed, 11 May 2022 12:46:19 +0300 Subject: [PATCH 127/298] drm/i915/uc: remove accidental static from a local variable The arrays are static const, but the pointer shouldn't be static. Fixes: 3d832f370d16 ("drm/i915/uc: Allow platforms to have GuC but not HuC") Cc: John Harrison Cc: Lucas De Marchi Cc: Daniele Ceraolo Spurio Signed-off-by: Jani Nikula Reviewed-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/20220511094619.27889-1-jani.nikula@intel.com (cherry picked from commit 5821a0bbb4c39960975d29d6b58ae290088db0ed) --- drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c index d078f884b5e3..f0d7b57b741e 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c @@ -156,7 +156,7 @@ __uc_fw_auto_select(struct drm_i915_private *i915, struct intel_uc_fw *uc_fw) [INTEL_UC_FW_TYPE_GUC] = { blobs_guc, ARRAY_SIZE(blobs_guc) }, [INTEL_UC_FW_TYPE_HUC] = { blobs_huc, ARRAY_SIZE(blobs_huc) }, }; - static const struct uc_fw_platform_requirement *fw_blobs; + const struct uc_fw_platform_requirement *fw_blobs; enum intel_platform p = INTEL_INFO(i915)->platform; u32 fw_count; u8 rev = INTEL_REVID(i915); From 9eda7d8bcbdb6909f202edeedff51948f1cad1e5 Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:24 +0800 Subject: [PATCH 128/298] net: hns3: set port base vlan tbl_sta to false before removing old vlan When modify port base vlan, the port base vlan tbl_sta needs to set to false before removing old vlan, to indicate this operation is not finish. Fixes: c0f46de30c96 ("net: hns3: fix port base vlan add fail when concurrent with reset") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 1ebad0e50e6a..fc0265b63331 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -10117,6 +10117,7 @@ static int hclge_modify_port_base_vlan_tag(struct hclge_vport *vport, if (ret) return ret; + vport->port_base_vlan_cfg.tbl_sta = false; /* remove old VLAN tag */ if (old_info->vlan_tag == 0) ret = hclge_set_vf_vlan_common(hdev, vport->vport_id, From 283847e3ef6dbf79bf67083b5ce7b8033e8b6f34 Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Sat, 11 Jun 2022 20:25:25 +0800 Subject: [PATCH 129/298] net: hns3: don't push link state to VF if unalive It's unnecessary to push link state to unalive VF, and the VF will query link state from PF when it being start works. Fixes: 18b6e31f8bf4 ("net: hns3: PF add support for pushing link status to VFs") Signed-off-by: Jian Shen Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index fc0265b63331..2e891b837c51 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3376,6 +3376,12 @@ static int hclge_set_vf_link_state(struct hnae3_handle *handle, int vf, link_state_old = vport->vf_info.link_state; vport->vf_info.link_state = link_state; + /* return success directly if the VF is unalive, VF will + * query link state itself when it starts work. + */ + if (!test_bit(HCLGE_VPORT_STATE_ALIVE, &vport->state)) + return 0; + ret = hclge_push_vf_link_status(vport); if (ret) { vport->vf_info.link_state = link_state_old; From cfd80687a5388e731b3db65ad6a557ede9b45905 Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Sat, 11 Jun 2022 20:25:26 +0800 Subject: [PATCH 130/298] net: hns3: modify the ring param print info Currently tx push is also a ring param. So the original ring param print info in hns3_is_ringparam_changed should be adjusted. Fixes: 07fdc163ac88 ("net: hns3: refactor hns3_set_ringparam()") Signed-off-by: Jie Wang Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 6d20974519fe..4c7988e308a2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1129,7 +1129,7 @@ hns3_is_ringparam_changed(struct net_device *ndev, if (old_ringparam->tx_desc_num == new_ringparam->tx_desc_num && old_ringparam->rx_desc_num == new_ringparam->rx_desc_num && old_ringparam->rx_buf_len == new_ringparam->rx_buf_len) { - netdev_info(ndev, "ringparam not changed\n"); + netdev_info(ndev, "descriptor number and rx buffer length not changed\n"); return false; } From e93530ae0e5d8fcf2d908933d206e0c93bc3c09b Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:27 +0800 Subject: [PATCH 131/298] net: hns3: restore tm priority/qset to default settings when tc disabled Currently, settings parameters of schedule mode, dwrr, shaper of tm priority or qset of one tc are only be set when tc is enabled, they are not restored to the default settings when tc is disabled. It confuses users when they cat tm_priority or tm_qset files of debugfs. So this patch fixes it. Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 95 +++++++++++++------ 2 files changed, 65 insertions(+), 31 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 8a3a446219f7..94f80e1c4020 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -769,6 +769,7 @@ struct hnae3_tc_info { u8 prio_tc[HNAE3_MAX_USER_PRIO]; /* TC indexed by prio */ u16 tqp_count[HNAE3_MAX_TC]; u16 tqp_offset[HNAE3_MAX_TC]; + u8 max_tc; /* Total number of TCs */ u8 num_tc; /* Total number of enabled TCs */ bool mqprio_active; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index 1f87a8a3fe32..ad53a3447322 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -282,8 +282,8 @@ static int hclge_tm_pg_to_pri_map_cfg(struct hclge_dev *hdev, return hclge_cmd_send(&hdev->hw, &desc, 1); } -static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, - u16 qs_id, u8 pri) +static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, u16 qs_id, u8 pri, + bool link_vld) { struct hclge_qs_to_pri_link_cmd *map; struct hclge_desc desc; @@ -294,7 +294,7 @@ static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, map->qs_id = cpu_to_le16(qs_id); map->priority = pri; - map->link_vld = HCLGE_TM_QS_PRI_LINK_VLD_MSK; + map->link_vld = link_vld ? HCLGE_TM_QS_PRI_LINK_VLD_MSK : 0; return hclge_cmd_send(&hdev->hw, &desc, 1); } @@ -642,11 +642,13 @@ static void hclge_tm_update_kinfo_rss_size(struct hclge_vport *vport) * one tc for VF for simplicity. VF's vport_id is non zero. */ if (vport->vport_id) { + kinfo->tc_info.max_tc = 1; kinfo->tc_info.num_tc = 1; vport->qs_offset = HNAE3_MAX_TC + vport->vport_id - HCLGE_VF_VPORT_START_NUM; vport_max_rss_size = hdev->vf_rss_size_max; } else { + kinfo->tc_info.max_tc = hdev->tc_max; kinfo->tc_info.num_tc = min_t(u16, vport->alloc_tqps, hdev->tm_info.num_tc); vport->qs_offset = 0; @@ -714,14 +716,22 @@ static void hclge_tm_vport_info_update(struct hclge_dev *hdev) static void hclge_tm_tc_info_init(struct hclge_dev *hdev) { - u8 i; + u8 i, tc_sch_mode; + u32 bw_limit; + + for (i = 0; i < hdev->tc_max; i++) { + if (i < hdev->tm_info.num_tc) { + tc_sch_mode = HCLGE_SCH_MODE_DWRR; + bw_limit = hdev->tm_info.pg_info[0].bw_limit; + } else { + tc_sch_mode = HCLGE_SCH_MODE_SP; + bw_limit = 0; + } - for (i = 0; i < hdev->tm_info.num_tc; i++) { hdev->tm_info.tc_info[i].tc_id = i; - hdev->tm_info.tc_info[i].tc_sch_mode = HCLGE_SCH_MODE_DWRR; + hdev->tm_info.tc_info[i].tc_sch_mode = tc_sch_mode; hdev->tm_info.tc_info[i].pgid = 0; - hdev->tm_info.tc_info[i].bw_limit = - hdev->tm_info.pg_info[0].bw_limit; + hdev->tm_info.tc_info[i].bw_limit = bw_limit; } for (i = 0; i < HNAE3_MAX_USER_PRIO; i++) @@ -926,10 +936,13 @@ static int hclge_tm_pri_q_qs_cfg_tc_base(struct hclge_dev *hdev) for (k = 0; k < hdev->num_alloc_vport; k++) { struct hnae3_knic_private_info *kinfo = &vport[k].nic.kinfo; - for (i = 0; i < kinfo->tc_info.num_tc; i++) { + for (i = 0; i < kinfo->tc_info.max_tc; i++) { + u8 pri = i < kinfo->tc_info.num_tc ? i : 0; + bool link_vld = i < kinfo->tc_info.num_tc; + ret = hclge_tm_qs_to_pri_map_cfg(hdev, vport[k].qs_offset + i, - i); + pri, link_vld); if (ret) return ret; } @@ -949,7 +962,7 @@ static int hclge_tm_pri_q_qs_cfg_vnet_base(struct hclge_dev *hdev) for (i = 0; i < HNAE3_MAX_TC; i++) { ret = hclge_tm_qs_to_pri_map_cfg(hdev, vport[k].qs_offset + i, - k); + k, true); if (ret) return ret; } @@ -989,33 +1002,39 @@ static int hclge_tm_pri_tc_base_shaper_cfg(struct hclge_dev *hdev) { u32 max_tm_rate = hdev->ae_dev->dev_specs.max_tm_rate; struct hclge_shaper_ir_para ir_para; - u32 shaper_para; + u32 shaper_para_c, shaper_para_p; int ret; u32 i; - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { u32 rate = hdev->tm_info.tc_info[i].bw_limit; - ret = hclge_shaper_para_calc(rate, HCLGE_SHAPER_LVL_PRI, - &ir_para, max_tm_rate); - if (ret) - return ret; + if (rate) { + ret = hclge_shaper_para_calc(rate, HCLGE_SHAPER_LVL_PRI, + &ir_para, max_tm_rate); + if (ret) + return ret; + + shaper_para_c = hclge_tm_get_shapping_para(0, 0, 0, + HCLGE_SHAPER_BS_U_DEF, + HCLGE_SHAPER_BS_S_DEF); + shaper_para_p = hclge_tm_get_shapping_para(ir_para.ir_b, + ir_para.ir_u, + ir_para.ir_s, + HCLGE_SHAPER_BS_U_DEF, + HCLGE_SHAPER_BS_S_DEF); + } else { + shaper_para_c = 0; + shaper_para_p = 0; + } - shaper_para = hclge_tm_get_shapping_para(0, 0, 0, - HCLGE_SHAPER_BS_U_DEF, - HCLGE_SHAPER_BS_S_DEF); ret = hclge_tm_pri_shapping_cfg(hdev, HCLGE_TM_SHAP_C_BUCKET, i, - shaper_para, rate); + shaper_para_c, rate); if (ret) return ret; - shaper_para = hclge_tm_get_shapping_para(ir_para.ir_b, - ir_para.ir_u, - ir_para.ir_s, - HCLGE_SHAPER_BS_U_DEF, - HCLGE_SHAPER_BS_S_DEF); ret = hclge_tm_pri_shapping_cfg(hdev, HCLGE_TM_SHAP_P_BUCKET, i, - shaper_para, rate); + shaper_para_p, rate); if (ret) return ret; } @@ -1125,7 +1144,7 @@ static int hclge_tm_pri_tc_base_dwrr_cfg(struct hclge_dev *hdev) int ret; u32 i, k; - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { pg_info = &hdev->tm_info.pg_info[hdev->tm_info.tc_info[i].pgid]; dwrr = pg_info->tc_dwrr[i]; @@ -1135,9 +1154,15 @@ static int hclge_tm_pri_tc_base_dwrr_cfg(struct hclge_dev *hdev) return ret; for (k = 0; k < hdev->num_alloc_vport; k++) { + struct hnae3_knic_private_info *kinfo = &vport[k].nic.kinfo; + + if (i >= kinfo->tc_info.max_tc) + continue; + + dwrr = i < kinfo->tc_info.num_tc ? vport[k].dwrr : 0; ret = hclge_tm_qs_weight_cfg( hdev, vport[k].qs_offset + i, - vport[k].dwrr); + dwrr); if (ret) return ret; } @@ -1303,6 +1328,7 @@ static int hclge_tm_schd_mode_tc_base_cfg(struct hclge_dev *hdev, u8 pri_id) { struct hclge_vport *vport = hdev->vport; int ret; + u8 mode; u16 i; ret = hclge_tm_pri_schd_mode_cfg(hdev, pri_id); @@ -1310,9 +1336,16 @@ static int hclge_tm_schd_mode_tc_base_cfg(struct hclge_dev *hdev, u8 pri_id) return ret; for (i = 0; i < hdev->num_alloc_vport; i++) { + struct hnae3_knic_private_info *kinfo = &vport[i].nic.kinfo; + + if (pri_id >= kinfo->tc_info.max_tc) + continue; + + mode = pri_id < kinfo->tc_info.num_tc ? HCLGE_SCH_MODE_DWRR : + HCLGE_SCH_MODE_SP; ret = hclge_tm_qs_schd_mode_cfg(hdev, vport[i].qs_offset + pri_id, - HCLGE_SCH_MODE_DWRR); + mode); if (ret) return ret; } @@ -1353,7 +1386,7 @@ static int hclge_tm_lvl34_schd_mode_cfg(struct hclge_dev *hdev) u8 i; if (hdev->tx_sch_mode == HCLGE_FLAG_TC_BASE_SCH_MODE) { - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { ret = hclge_tm_schd_mode_tc_base_cfg(hdev, i); if (ret) return ret; From 71b215f36dca1a3d5d1c576b2099e6d7ea03047e Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Sat, 11 Jun 2022 20:25:28 +0800 Subject: [PATCH 132/298] net: hns3: fix PF rss size initialization bug Currently hns3 driver misuses the VF rss size to initialize the PF rss size in hclge_tm_vport_tc_info_update. So this patch fix it by checking the vport id before initialization. Fixes: 7347255ea389 ("net: hns3: refactor PF rss get APIs with new common rss get APIs") Signed-off-by: Jie Wang Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index ad53a3447322..f5296ff60694 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -681,7 +681,9 @@ static void hclge_tm_vport_tc_info_update(struct hclge_vport *vport) kinfo->num_tqps = hclge_vport_get_tqp_num(vport); vport->dwrr = 100; /* 100 percent as init */ vport->bw_limit = hdev->tm_info.pg_info[0].bw_limit; - hdev->rss_cfg.rss_size = kinfo->rss_size; + + if (vport->vport_id == PF_VPORT_ID) + hdev->rss_cfg.rss_size = kinfo->rss_size; /* when enable mqprio, the tc_info has been updated. */ if (kinfo->tc_info.mqprio_active) From 12a3670887725df364cc3e030cf3bede6f13b364 Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:29 +0800 Subject: [PATCH 133/298] net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization Currently in driver initialization process, driver will set shapping parameters of tm port to default speed read from firmware. However, the speed of SFP module may not be default speed, so shapping parameters of tm port may be incorrect. To fix this problem, driver sets new shapping parameters for tm port after getting exact speed of SFP module in this case. Fixes: 88d10bd6f730 ("net: hns3: add support for multiple media type") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 11 ++++++++--- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h | 1 + 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 2e891b837c51..fae79764dc44 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3268,7 +3268,7 @@ static int hclge_tp_port_init(struct hclge_dev *hdev) static int hclge_update_port_info(struct hclge_dev *hdev) { struct hclge_mac *mac = &hdev->hw.mac; - int speed = HCLGE_MAC_SPEED_UNKNOWN; + int speed; int ret; /* get the port info from SFP cmd if not copper port */ @@ -3279,10 +3279,13 @@ static int hclge_update_port_info(struct hclge_dev *hdev) if (!hdev->support_sfp_query) return 0; - if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) + if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) { + speed = mac->speed; ret = hclge_get_sfp_info(hdev, mac); - else + } else { + speed = HCLGE_MAC_SPEED_UNKNOWN; ret = hclge_get_sfp_speed(hdev, &speed); + } if (ret == -EOPNOTSUPP) { hdev->support_sfp_query = false; @@ -3294,6 +3297,8 @@ static int hclge_update_port_info(struct hclge_dev *hdev) if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) { if (mac->speed_type == QUERY_ACTIVE_SPEED) { hclge_update_port_capability(hdev, mac); + if (mac->speed != speed) + (void)hclge_tm_port_shaper_cfg(hdev); return 0; } return hclge_cfg_mac_speed_dup(hdev, mac->speed, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index f5296ff60694..2f33b036a47a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -420,7 +420,7 @@ static int hclge_tm_pg_shapping_cfg(struct hclge_dev *hdev, return hclge_cmd_send(&hdev->hw, &desc, 1); } -static int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev) +int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev) { struct hclge_port_shapping_cmd *shap_cfg_cmd; struct hclge_shaper_ir_para ir_para; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h index 619cc30a2dfc..d943943912f7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h @@ -237,6 +237,7 @@ int hclge_pause_addr_cfg(struct hclge_dev *hdev, const u8 *mac_addr); void hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats); void hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats); int hclge_tm_qs_shaper_cfg(struct hclge_vport *vport, int max_tx_rate); +int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev); int hclge_tm_get_qset_num(struct hclge_dev *hdev, u16 *qset_num); int hclge_tm_get_pri_num(struct hclge_dev *hdev, u8 *pri_num); int hclge_tm_get_qset_map_pri(struct hclge_dev *hdev, u16 qset_id, u8 *priority, From 97da4a537924d87e2261773f3ac9365abb191fc9 Mon Sep 17 00:00:00 2001 From: Dylan Yudaken Date: Mon, 13 Jun 2022 03:11:55 -0700 Subject: [PATCH 134/298] io_uring: fix index calculation When indexing into a provided buffer ring, do not subtract 1 from the index. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken Link: https://lore.kernel.org/r/20220613101157.3687-2-dylany@fb.com Reviewed-by: Hao Xu Signed-off-by: Jens Axboe --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 3aab4182fd89..9cf9aff51b70 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3888,7 +3888,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, buf = &br->bufs[head]; } else { int off = head & (IO_BUFFER_LIST_BUF_PER_PAGE - 1); - int index = head / IO_BUFFER_LIST_BUF_PER_PAGE - 1; + int index = head / IO_BUFFER_LIST_BUF_PER_PAGE; buf = page_address(bl->buf_pages[index]); buf += off; } From c6e9fa5c0ab811f4bec36a96337f4b1bb77d142c Mon Sep 17 00:00:00 2001 From: Dylan Yudaken Date: Mon, 13 Jun 2022 03:11:56 -0700 Subject: [PATCH 135/298] io_uring: fix types in provided buffer ring The type of head needs to match that of tail in order for rollover and comparisons to work correctly. Without this change the comparison of tail to head might incorrectly allow io_uring to use a buffer that userspace had not given it. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken Link: https://lore.kernel.org/r/20220613101157.3687-3-dylany@fb.com Reviewed-by: Hao Xu Signed-off-by: Jens Axboe --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 9cf9aff51b70..6eea18e8330c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -298,8 +298,8 @@ struct io_buffer_list { /* below is for ring provided buffers */ __u16 buf_nr_pages; __u16 nr_entries; - __u32 head; - __u32 mask; + __u16 head; + __u16 mask; }; struct io_buffer { @@ -3876,7 +3876,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, { struct io_uring_buf_ring *br = bl->buf_ring; struct io_uring_buf *buf; - __u32 head = bl->head; + __u16 head = bl->head; if (unlikely(smp_load_acquire(&br->tail) == head)) { io_ring_submit_unlock(req->ctx, issue_flags); From f9437ac0f851cea2374d53594f52fbbefdd977bd Mon Sep 17 00:00:00 2001 From: Dylan Yudaken Date: Mon, 13 Jun 2022 03:11:57 -0700 Subject: [PATCH 136/298] io_uring: limit size of provided buffer ring The type of head and tail do not allow more than 2^15 entries in a provided buffer ring, so do not allow this. At 2^16 while each entry can be indexed, there is no way to disambiguate full vs empty. Signed-off-by: Dylan Yudaken Link: https://lore.kernel.org/r/20220613101157.3687-4-dylany@fb.com Reviewed-by: Hao Xu Signed-off-by: Jens Axboe --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6eea18e8330c..85b116ddfd2a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -13002,6 +13002,10 @@ static int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg) if (!is_power_of_2(reg.ring_entries)) return -EINVAL; + /* cannot disambiguate full vs empty due to head/tail size */ + if (reg.ring_entries >= 65536) + return -EINVAL; + if (unlikely(reg.bgid < BGID_ARRAY && !ctx->io_bl)) { int ret = io_init_bl_list(ctx); if (ret) From 00be43a74ca262267ceb96c0c5e3f51d3a56342e Mon Sep 17 00:00:00 2001 From: Andy Chiu Date: Mon, 13 Jun 2022 11:42:01 +0800 Subject: [PATCH 137/298] net: axienet: make the 64b addresable DMA depends on 64b archectures Currently it is not safe to config the IP as 64-bit addressable on 32-bit archectures, which cannot perform a double-word store on its descriptor pointers. The pointer is 64-bit wide if the IP is configured as 64-bit, and the device would process the partially updated pointer on some states if the pointer was updated via two store-words. To prevent such condition, we force a probe fail if we discover that the IP has 64-bit capability but it is not running on a 64-Bit kernel. This is a series of patch (1/2). The next patch must be applied in order to make 64b DMA safe on 64b archectures. Signed-off-by: Andy Chiu Reported-by: Max Hsu Reviewed-by: Greentime Hu Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 36 +++++++++++++++++++ .../net/ethernet/xilinx/xilinx_axienet_main.c | 28 +++------------ 2 files changed, 40 insertions(+), 24 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 4225efbeda3d..6c95676ba172 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -547,6 +547,42 @@ static inline void axienet_iow(struct axienet_local *lp, off_t offset, iowrite32(value, lp->regs + offset); } +/** + * axienet_dma_out32 - Memory mapped Axi DMA register write. + * @lp: Pointer to axienet local structure + * @reg: Address offset from the base address of the Axi DMA core + * @value: Value to be written into the Axi DMA register + * + * This function writes the desired value into the corresponding Axi DMA + * register. + */ + +static inline void axienet_dma_out32(struct axienet_local *lp, + off_t reg, u32 value) +{ + iowrite32(value, lp->dma_regs + reg); +} + +#ifdef CONFIG_64BIT +static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) +{ + axienet_dma_out32(lp, reg, lower_32_bits(addr)); + + if (lp->features & XAE_FEATURE_DMA_64BIT) + axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); +} + +#else /* CONFIG_64BIT */ + +static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) +{ + axienet_dma_out32(lp, reg, lower_32_bits(addr)); +} + +#endif /* CONFIG_64BIT */ + /* Function prototypes visible in xilinx_axienet_mdio.c for other files */ int axienet_mdio_enable(struct axienet_local *lp); void axienet_mdio_disable(struct axienet_local *lp); diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index 93c9f305bba4..fa7bcd2c1892 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -133,30 +133,6 @@ static inline u32 axienet_dma_in32(struct axienet_local *lp, off_t reg) return ioread32(lp->dma_regs + reg); } -/** - * axienet_dma_out32 - Memory mapped Axi DMA register write. - * @lp: Pointer to axienet local structure - * @reg: Address offset from the base address of the Axi DMA core - * @value: Value to be written into the Axi DMA register - * - * This function writes the desired value into the corresponding Axi DMA - * register. - */ -static inline void axienet_dma_out32(struct axienet_local *lp, - off_t reg, u32 value) -{ - iowrite32(value, lp->dma_regs + reg); -} - -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, - dma_addr_t addr) -{ - axienet_dma_out32(lp, reg, lower_32_bits(addr)); - - if (lp->features & XAE_FEATURE_DMA_64BIT) - axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); -} - static void desc_set_phys_addr(struct axienet_local *lp, dma_addr_t addr, struct axidma_bd *desc) { @@ -2061,6 +2037,10 @@ static int axienet_probe(struct platform_device *pdev) iowrite32(0x0, desc); } } + if (!IS_ENABLED(CONFIG_64BIT) && lp->features & XAE_FEATURE_DMA_64BIT) { + dev_err(&pdev->dev, "64-bit addressable DMA is not compatible with 32-bit archecture\n"); + goto cleanup_clk; + } ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(addr_width)); if (ret) { From b690f8df6497b654c2c871871e0a598e9750c0eb Mon Sep 17 00:00:00 2001 From: Andy Chiu Date: Mon, 13 Jun 2022 11:42:02 +0800 Subject: [PATCH 138/298] net: axienet: Use iowrite64 to write all 64b descriptor pointers According to commit f735c40ed93c ("net: axienet: Autodetect 64-bit DMA capability") and AXI-DMA spec (pg021), on 64-bit capable dma, only writing MSB part of tail descriptor pointer causes DMA engine to start fetching descriptors. However, we found that it is true only if dma is in idle state. In other words, dma would use a tailp even if it only has LSB updated, when the dma is running. The non-atomicity of this behavior could be problematic if enough delay were introduced in between the 2 writes. For example, if an interrupt comes right after the LSB write and the cpu spends long enough time in the handler for the dma to get back into idle state by completing descriptors, then the seconcd write to MSB would treat dma to start fetching descriptors again. Since the descriptor next to the one pointed by current tail pointer is not filled by the kernel yet, fetching a null descriptor here causes a dma internal error and halt the dma engine down. We suggest that the dma engine should start process a 64-bit MMIO write to the descriptor pointer only if ONE 32-bit part of it is written on all states. Or we should restrict the use of 64-bit addressable dma on 32-bit platforms, since those devices have no instruction to guarantee the write to LSB and MSB part of tail pointer occurs atomically to the dma. initial condition: curp = x-3; tailp = x-2; LSB = x; MSB = 0; cpu: |dma: iowrite32(LSB, tailp) | completes #(x-3) desc, curp = x-3 ... | tailp updated => irq | completes #(x-2) desc, curp = x-2 ... | completes #(x-1) desc, curp = x-1 ... | ... ... | completes #x desc, curp = tailp = x <= irqreturn | reaches tailp == curp = x, idle iowrite32(MSB, tailp + 4) | ... | tailp updated, starts fetching... | fetches #(x + 1) desc, sees cntrl = 0 | post Tx error, halt Signed-off-by: Andy Chiu Reported-by: Max Hsu Reviewed-by: Greentime Hu Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 21 +++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 6c95676ba172..97ddc0273b8a 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -564,13 +564,28 @@ static inline void axienet_dma_out32(struct axienet_local *lp, } #ifdef CONFIG_64BIT +/** + * axienet_dma_out64 - Memory mapped Axi DMA register write. + * @lp: Pointer to axienet local structure + * @reg: Address offset from the base address of the Axi DMA core + * @value: Value to be written into the Axi DMA register + * + * This function writes the desired value into the corresponding Axi DMA + * register. + */ +static inline void axienet_dma_out64(struct axienet_local *lp, + off_t reg, u64 value) +{ + iowrite64(value, lp->dma_regs + reg); +} + static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, dma_addr_t addr) { - axienet_dma_out32(lp, reg, lower_32_bits(addr)); - if (lp->features & XAE_FEATURE_DMA_64BIT) - axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); + axienet_dma_out64(lp, reg, addr); + else + axienet_dma_out32(lp, reg, lower_32_bits(addr)); } #else /* CONFIG_64BIT */ From 5f7b84151a89f6f3a8d1db4db2bc4f5b270d66ee Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Mon, 13 Jun 2022 12:49:21 +0100 Subject: [PATCH 139/298] xilinx: Fix build on x86. CONFIG_64BIT is not sufficient for checking for availability of iowrite64() and friends. Also, the out_addr helpers need to be inline. Fixes: b690f8df6497 ("net: axienet: Use iowrite64 to write all 64b descriptor pointers") Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 97ddc0273b8a..f2e2261b4b7d 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -563,7 +563,7 @@ static inline void axienet_dma_out32(struct axienet_local *lp, iowrite32(value, lp->dma_regs + reg); } -#ifdef CONFIG_64BIT +#if defined(CONFIG_64BIT) && defined(iowrite64) /** * axienet_dma_out64 - Memory mapped Axi DMA register write. * @lp: Pointer to axienet local structure @@ -579,8 +579,8 @@ static inline void axienet_dma_out64(struct axienet_local *lp, iowrite64(value, lp->dma_regs + reg); } -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, - dma_addr_t addr) +static inline void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) { if (lp->features & XAE_FEATURE_DMA_64BIT) axienet_dma_out64(lp, reg, addr); @@ -590,7 +590,7 @@ static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, #else /* CONFIG_64BIT */ -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, +static inline void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, dma_addr_t addr) { axienet_dma_out32(lp, reg, lower_32_bits(addr)); From 619c010a65391d06bc96e79fa0e7725790e5d1a9 Mon Sep 17 00:00:00 2001 From: Suman Ghosh Date: Sun, 12 Jun 2022 23:15:36 +0530 Subject: [PATCH 140/298] octeontx2-vf: Add support for adaptive interrupt coalescing Fixes: 6e144b47f560 (octeontx2-pf: Add support for adaptive interrupt coalescing) Added support for VF interfaces as well. Signed-off-by: Suman Ghosh Signed-off-by: David S. Miller --- drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c index bc614a4def9e..3f60a80e34c8 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c @@ -1390,7 +1390,8 @@ static int otx2vf_get_link_ksettings(struct net_device *netdev, static const struct ethtool_ops otx2vf_ethtool_ops = { .supported_coalesce_params = ETHTOOL_COALESCE_USECS | - ETHTOOL_COALESCE_MAX_FRAMES, + ETHTOOL_COALESCE_MAX_FRAMES | + ETHTOOL_COALESCE_USE_ADAPTIVE, .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN | ETHTOOL_RING_USE_CQE_SIZE, .get_link = otx2_get_link, From 6e21408774da49b34fbe258d161e6329a43fcbe8 Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Mon, 13 Jun 2022 13:46:14 +0200 Subject: [PATCH 141/298] MAINTAINERS: add include/dt-bindings/i2c to I2C SUBSYSTEM HOST DRIVERS Maintainers of the directory Documentation/devicetree/bindings/i2c are also the maintainers of the corresponding directory include/dt-bindings/i2c. Add the file entry for include/dt-bindings/i2c to the appropriate section in MAINTAINERS. Signed-off-by: Lukas Bulwahn Signed-off-by: Wolfram Sang --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index cb2342ce3b55..d6de26c5bd5d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9283,6 +9283,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux.git F: Documentation/devicetree/bindings/i2c/ F: drivers/i2c/algos/ F: drivers/i2c/busses/ +F: include/dt-bindings/i2c/ I2C-TAOS-EVM DRIVER M: Jean Delvare From 5edc99f0c5b753eb34defad1cdb164824056a487 Mon Sep 17 00:00:00 2001 From: Wolfram Sang Date: Mon, 13 Jun 2022 16:45:19 +0200 Subject: [PATCH 142/298] MAINTAINERS: core DT include belongs to core Signed-off-by: Wolfram Sang --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index d6de26c5bd5d..c512a083d659 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9268,6 +9268,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux.git F: Documentation/devicetree/bindings/i2c/i2c.txt F: Documentation/i2c/ F: drivers/i2c/* +F: include/dt-bindings/i2c/i2c.h F: include/linux/i2c-dev.h F: include/linux/i2c-smbus.h F: include/linux/i2c.h From 27071b5cbca59d8e8f8750c199a6cbf8c9799963 Mon Sep 17 00:00:00 2001 From: Serge Semin Date: Fri, 10 Jun 2022 10:42:33 +0300 Subject: [PATCH 143/298] i2c: designware: Use standard optional ref clock implementation Even though the DW I2C controller reference clock source is requested by the method devm_clk_get() with non-optional clock requirement the way the clock handler is used afterwards has a pure optional clock semantic (though in some circumstances we can get a warning about the clock missing printed in the system console). There is no point in reimplementing that functionality seeing the kernel clock framework already supports the optional interface from scratch. Thus let's convert the platform driver to using it. Note by providing this commit we get to fix two problems. The first one was introduced in commit c62ebb3d5f0d ("i2c: designware: Add support for an interface clock"). It causes not having the interface clock (pclk) enabled/disabled in case if the reference clock isn't provided. The second problem was first introduced in commit b33af11de236 ("i2c: designware: Do not require clock when SSCN and FFCN are provided"). Since that modification the deferred probe procedure has been unsupported in case if the interface clock isn't ready. Fixes: c62ebb3d5f0d ("i2c: designware: Add support for an interface clock") Fixes: b33af11de236 ("i2c: designware: Do not require clock when SSCN and FFCN are provided") Signed-off-by: Serge Semin Reviewed-by: Andy Shevchenko Acked-by: Jarkko Nikula Signed-off-by: Wolfram Sang --- drivers/i2c/busses/i2c-designware-common.c | 3 --- drivers/i2c/busses/i2c-designware-platdrv.c | 13 +++++++++++-- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/i2c/busses/i2c-designware-common.c b/drivers/i2c/busses/i2c-designware-common.c index e7d316b1401a..c023b691441e 100644 --- a/drivers/i2c/busses/i2c-designware-common.c +++ b/drivers/i2c/busses/i2c-designware-common.c @@ -477,9 +477,6 @@ int i2c_dw_prepare_clk(struct dw_i2c_dev *dev, bool prepare) { int ret; - if (IS_ERR(dev->clk)) - return PTR_ERR(dev->clk); - if (prepare) { /* Optional interface clock */ ret = clk_prepare_enable(dev->pclk); diff --git a/drivers/i2c/busses/i2c-designware-platdrv.c b/drivers/i2c/busses/i2c-designware-platdrv.c index 70ade5306e45..ba043b547393 100644 --- a/drivers/i2c/busses/i2c-designware-platdrv.c +++ b/drivers/i2c/busses/i2c-designware-platdrv.c @@ -320,8 +320,17 @@ static int dw_i2c_plat_probe(struct platform_device *pdev) goto exit_reset; } - dev->clk = devm_clk_get(&pdev->dev, NULL); - if (!i2c_dw_prepare_clk(dev, true)) { + dev->clk = devm_clk_get_optional(&pdev->dev, NULL); + if (IS_ERR(dev->clk)) { + ret = PTR_ERR(dev->clk); + goto exit_reset; + } + + ret = i2c_dw_prepare_clk(dev, true); + if (ret) + goto exit_reset; + + if (dev->clk) { u64 clk_khz; dev->get_clk_rate_khz = i2c_dw_get_clk_rate_khz; From 57cd6d157eb479f0a8e820fd36b7240845c8a937 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Tue, 31 May 2022 10:59:10 -0700 Subject: [PATCH 144/298] cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle RCU_NONIDLE usage during __cfi_slowpath_diag can result in an invalid RCU state in the cpuidle code path: WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter+0xe4/0x138 ... Call trace: rcu_eqs_enter+0xe4/0x138 rcu_idle_enter+0xa8/0x100 cpuidle_enter_state+0x154/0x3a8 cpuidle_enter+0x3c/0x58 do_idle.llvm.6590768638138871020+0x1f4/0x2ec cpu_startup_entry+0x28/0x2c secondary_start_kernel+0x1b8/0x220 __secondary_switched+0x94/0x98 Instead, call rcu_irq_enter/exit to wake up RCU only when needed and disable interrupts for the entire CFI shadow/module check when we do. Signed-off-by: Sami Tolvanen Link: https://lore.kernel.org/r/20220531175910.890307-1-samitolvanen@google.com Fixes: cf68fffb66d6 ("add support for Clang CFI") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook --- kernel/cfi.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/kernel/cfi.c b/kernel/cfi.c index 9594cfd1cf2c..08102d19ec15 100644 --- a/kernel/cfi.c +++ b/kernel/cfi.c @@ -281,6 +281,8 @@ static inline cfi_check_fn find_module_check_fn(unsigned long ptr) static inline cfi_check_fn find_check_fn(unsigned long ptr) { cfi_check_fn fn = NULL; + unsigned long flags; + bool rcu_idle; if (is_kernel_text(ptr)) return __cfi_check; @@ -290,13 +292,21 @@ static inline cfi_check_fn find_check_fn(unsigned long ptr) * the shadow and __module_address use RCU, so we need to wake it * up if necessary. */ - RCU_NONIDLE({ - if (IS_ENABLED(CONFIG_CFI_CLANG_SHADOW)) - fn = find_shadow_check_fn(ptr); + rcu_idle = !rcu_is_watching(); + if (rcu_idle) { + local_irq_save(flags); + rcu_irq_enter(); + } - if (!fn) - fn = find_module_check_fn(ptr); - }); + if (IS_ENABLED(CONFIG_CFI_CLANG_SHADOW)) + fn = find_shadow_check_fn(ptr); + if (!fn) + fn = find_module_check_fn(ptr); + + if (rcu_idle) { + rcu_irq_exit(); + local_irq_restore(flags); + } return fn; } From 993d0b287e2ef7bee2e8b13b0ce4d2b5066f278e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:25 +0100 Subject: [PATCH 145/298] usercopy: Handle vm_map_ram() areas vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes us to deny usercopies from those areas. This affects XFS which uses vm_map_ram() for its directories. Fix this by calling find_vmap_area() instead of find_vm_area(). Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org --- include/linux/vmalloc.h | 1 + mm/usercopy.c | 10 ++++------ mm/vmalloc.c | 2 +- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index b159c2789961..096d48aa3437 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -215,6 +215,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size, void free_vm_area(struct vm_struct *area); extern struct vm_struct *remove_vm_area(const void *addr); extern struct vm_struct *find_vm_area(const void *addr); +struct vmap_area *find_vmap_area(unsigned long addr); static inline bool is_vm_area_hugepages(const void *addr) { diff --git a/mm/usercopy.c b/mm/usercopy.c index baeacc735b83..cd4b41d9bf76 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -173,16 +173,14 @@ static inline void check_heap_object(const void *ptr, unsigned long n, } if (is_vmalloc_addr(ptr)) { - struct vm_struct *area = find_vm_area(ptr); + struct vmap_area *area = find_vmap_area((unsigned long)ptr); unsigned long offset; - if (!area) { + if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - return; - } - offset = ptr - area->addr; - if (offset + n > get_vm_area_size(area)) + offset = (unsigned long)ptr - area->va_start; + if ((unsigned long)ptr + n > area->va_end) usercopy_abort("vmalloc", NULL, to_user, offset, n); return; } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 07db42455dd4..effd1ff6a4b4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1798,7 +1798,7 @@ static void free_unmap_vmap_area(struct vmap_area *va) free_vmap_area_noflush(va); } -static struct vmap_area *find_vmap_area(unsigned long addr) +struct vmap_area *find_vmap_area(unsigned long addr) { struct vmap_area *va; From 35fb9ae4aa2e838b234323e6f7cf6336ff019e5a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:26 +0100 Subject: [PATCH 146/298] usercopy: Cast pointer to an integer once Get rid of a lot of annoying casts by setting 'addr' once at the top of the function. Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-3-willy@infradead.org --- mm/usercopy.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index cd4b41d9bf76..30a4db3cb1df 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -161,26 +161,27 @@ static inline void check_bogus_address(const unsigned long ptr, unsigned long n, static inline void check_heap_object(const void *ptr, unsigned long n, bool to_user) { + uintptr_t addr = (uintptr_t)ptr; struct folio *folio; if (is_kmap_addr(ptr)) { - unsigned long page_end = (unsigned long)ptr | (PAGE_SIZE - 1); + unsigned long page_end = addr | (PAGE_SIZE - 1); - if ((unsigned long)ptr + n - 1 > page_end) + if (addr + n - 1 > page_end) usercopy_abort("kmap", NULL, to_user, offset_in_page(ptr), n); return; } if (is_vmalloc_addr(ptr)) { - struct vmap_area *area = find_vmap_area((unsigned long)ptr); + struct vmap_area *area = find_vmap_area(addr); unsigned long offset; if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - offset = (unsigned long)ptr - area->va_start; - if ((unsigned long)ptr + n > area->va_end) + offset = addr - area->va_start; + if (addr + n > area->va_end) usercopy_abort("vmalloc", NULL, to_user, offset, n); return; } From 1dfbe9fcda4afc957f0e371e207ae3cb7e8f3b0e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:27 +0100 Subject: [PATCH 147/298] usercopy: Make usercopy resilient against ridiculously large copies If 'n' is so large that it's negative, we might wrap around and mistakenly think that the copy is OK when it's not. Such a copy would probably crash, but just doing the arithmetic in a more simple way lets us detect and refuse this case. Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-4-willy@infradead.org --- mm/usercopy.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index 30a4db3cb1df..4e1da708699b 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -162,27 +162,26 @@ static inline void check_heap_object(const void *ptr, unsigned long n, bool to_user) { uintptr_t addr = (uintptr_t)ptr; + unsigned long offset; struct folio *folio; if (is_kmap_addr(ptr)) { - unsigned long page_end = addr | (PAGE_SIZE - 1); - - if (addr + n - 1 > page_end) - usercopy_abort("kmap", NULL, to_user, - offset_in_page(ptr), n); + offset = offset_in_page(ptr); + if (n > PAGE_SIZE - offset) + usercopy_abort("kmap", NULL, to_user, offset, n); return; } if (is_vmalloc_addr(ptr)) { struct vmap_area *area = find_vmap_area(addr); - unsigned long offset; if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - offset = addr - area->va_start; - if (addr + n > area->va_end) + if (n > area->va_end - addr) { + offset = addr - area->va_start; usercopy_abort("vmalloc", NULL, to_user, offset, n); + } return; } @@ -195,8 +194,8 @@ static inline void check_heap_object(const void *ptr, unsigned long n, /* Check slab allocator for flags and size. */ __check_heap_object(ptr, n, folio_slab(folio), to_user); } else if (folio_test_large(folio)) { - unsigned long offset = ptr - folio_address(folio); - if (offset + n > folio_size(folio)) + offset = ptr - folio_address(folio); + if (n > folio_size(folio) - offset) usercopy_abort("page alloc", NULL, to_user, offset, n); } } From 1fc766b5c08417248e0008bca14c3572ac0f1c26 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Thomas=20Wei=C3=9Fschuh?= Date: Tue, 7 Jun 2022 17:55:55 +0200 Subject: [PATCH 148/298] nvme: add device name to warning in uuid_show() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This provides more context to users. Old message: [ 00.000000] No UUID available providing old NGUID New message: [ 00.000000] block nvme0n1: No UUID available providing old NGUID Fixes: d934f9848a77 ("nvme: provide UUID value to userspace") Signed-off-by: Thomas Weißschuh Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 24165daee3c8..9409b8843872 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3285,8 +3285,8 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr, * we have no UUID set */ if (uuid_is_null(&ids->uuid)) { - printk_ratelimited(KERN_WARNING - "No UUID available providing old NGUID\n"); + dev_warn_ratelimited(dev, + "No UUID available providing old NGUID\n"); return sysfs_emit(buf, "%pU\n", ids->nguid); } return sysfs_emit(buf, "%pU\n", &ids->uuid); From 2f0dad1719cbbd690e916a42d937b7605ee63964 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Tue, 7 Jun 2022 08:30:29 -0700 Subject: [PATCH 149/298] nvme: add bug report info for global duplicate id The recent global id check is finding poorly implemented devices in the wild. Include relavant device information in the output to help quicken an appropriate quirk patch. Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 1 + drivers/nvme/host/nvme.h | 28 ++++++++++++++++++++++++++++ drivers/nvme/host/pci.c | 16 ++++++++++++++++ 3 files changed, 45 insertions(+) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 9409b8843872..3ab2cfd254a4 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3863,6 +3863,7 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid, if (ret) { dev_err(ctrl->device, "globally duplicate IDs for nsid %d\n", nsid); + nvme_print_device_info(ctrl); return ret; } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 9b72b6ecf33c..0da94b233fed 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -503,6 +503,7 @@ struct nvme_ctrl_ops { void (*submit_async_event)(struct nvme_ctrl *ctrl); void (*delete_ctrl)(struct nvme_ctrl *ctrl); int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size); + void (*print_device_info)(struct nvme_ctrl *ctrl); }; /* @@ -548,6 +549,33 @@ static inline struct request *nvme_cid_to_rq(struct blk_mq_tags *tags, return blk_mq_tag_to_rq(tags, nvme_tag_from_cid(command_id)); } +/* + * Return the length of the string without the space padding + */ +static inline int nvme_strlen(char *s, int len) +{ + while (s[len - 1] == ' ') + len--; + return len; +} + +static inline void nvme_print_device_info(struct nvme_ctrl *ctrl) +{ + struct nvme_subsystem *subsys = ctrl->subsys; + + if (ctrl->ops->print_device_info) { + ctrl->ops->print_device_info(ctrl); + return; + } + + dev_err(ctrl->device, + "VID:%04x model:%.*s firmware:%.*s\n", subsys->vendor_id, + nvme_strlen(subsys->model, sizeof(subsys->model)), + subsys->model, nvme_strlen(subsys->firmware_rev, + sizeof(subsys->firmware_rev)), + subsys->firmware_rev); +} + #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj, const char *dev_name); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 48f4f6eb877b..96579d48002a 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2976,6 +2976,21 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size) return snprintf(buf, size, "%s\n", dev_name(&pdev->dev)); } + +static void nvme_pci_print_device_info(struct nvme_ctrl *ctrl) +{ + struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev); + struct nvme_subsystem *subsys = ctrl->subsys; + + dev_err(ctrl->device, + "VID:DID %04x:%04x model:%.*s firmware:%.*s\n", + pdev->vendor, pdev->device, + nvme_strlen(subsys->model, sizeof(subsys->model)), + subsys->model, nvme_strlen(subsys->firmware_rev, + sizeof(subsys->firmware_rev)), + subsys->firmware_rev); +} + static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = { .name = "pcie", .module = THIS_MODULE, @@ -2987,6 +3002,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = { .free_ctrl = nvme_pci_free_ctrl, .submit_async_event = nvme_pci_submit_async_event, .get_address = nvme_pci_get_address, + .print_device_info = nvme_pci_print_device_info, }; static int nvme_dev_map(struct nvme_dev *dev) From 4641a8e6e145f595059e695f0f8dbbe608134086 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Mon, 6 Jun 2022 09:53:17 -0700 Subject: [PATCH 150/298] nvme-pci: add trouble shooting steps for timeouts Many users have encountered IO timeouts with a CSTS value of 0xffffffff, which indicates a failure to read the register. While there are various potential causes for this observation, faulty NVMe APST has been the culprit quite frequently. Add the recommended troubleshooting steps in the error output when this condition occurs. Signed-off-by: Keith Busch Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 96579d48002a..be053d943731 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1334,6 +1334,14 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts) dev_warn(dev->ctrl.device, "controller is down; will reset: CSTS=0x%x, PCI_STATUS read failed (%d)\n", csts, result); + + if (csts != ~0) + return; + + dev_warn(dev->ctrl.device, + "Does your device have a faulty power saving mode enabled?\n"); + dev_warn(dev->ctrl.device, + "Try \"nvme_core.default_ps_max_latency_us=0 pcie_aspm=off\" and report a bug\n"); } static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) From 3765fad508964f433ac111c127d6bedd19bdfa04 Mon Sep 17 00:00:00 2001 From: Stefan Reiter Date: Mon, 6 Jun 2022 13:01:29 +0000 Subject: [PATCH 151/298] nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50 ADATA XPG GAMMIX S50 drives report bogus eui64 values that appear to be the same across drives in one system. Quirk them out so they are not marked as "non globally unique" duplicates. Signed-off-by: Stefan Reiter Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index be053d943731..631add46e439 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3487,6 +3487,8 @@ static const struct pci_device_id nvme_id_table[] = { .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x1e4B, 0x1202), /* MAXIO MAP1202 */ .driver_data = NVME_QUIRK_BOGUS_NID, }, + { PCI_DEVICE(0x1cc1, 0x5350), /* ADATA XPG GAMMIX S50 */ + .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0061), .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, }, { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0065), From 2cf7a77ed5f8903606f4f7833d02d67b08650442 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Mon, 13 Jun 2022 07:45:47 -0700 Subject: [PATCH 152/298] nvme-pci: phison e12 has bogus namespace ids Add the quirk. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216049 Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 631add46e439..156723d3af83 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3461,6 +3461,8 @@ static const struct pci_device_id nvme_id_table[] = { .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY | NVME_QUIRK_DISABLE_WRITE_ZEROES| NVME_QUIRK_IGNORE_DEV_SUBNQN, }, + { PCI_DEVICE(0x1987, 0x5012), /* Phison E12 */ + .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x1987, 0x5016), /* Phison E16 */ .driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, }, { PCI_DEVICE(0x1b4b, 0x1092), /* Lexar 256 GB SSD */ From c98a879312caf775c9768faed25ce1c013b4df04 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Mon, 13 Jun 2022 07:45:48 -0700 Subject: [PATCH 153/298] nvme-pci: smi has bogus namespace ids Add the quirk. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216096 Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 156723d3af83..0d2113468523 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3445,7 +3445,8 @@ static const struct pci_device_id nvme_id_table[] = { { PCI_VDEVICE(REDHAT, 0x0010), /* Qemu emulated controller */ .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x126f, 0x2263), /* Silicon Motion unidentified */ - .driver_data = NVME_QUIRK_NO_NS_DESC_LIST, }, + .driver_data = NVME_QUIRK_NO_NS_DESC_LIST | + NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x1bb1, 0x0100), /* Seagate Nytro Flash Storage */ .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY | NVME_QUIRK_NO_NS_DESC_LIST, }, From c4f01a776b28378f4f61b53f8cb0e358f4fa3721 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Mon, 13 Jun 2022 07:45:49 -0700 Subject: [PATCH 154/298] nvme-pci: sk hynix p31 has bogus namespace ids Add the quirk. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216049 Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 0d2113468523..82b4daa9bf95 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3476,6 +3476,8 @@ static const struct pci_device_id nvme_id_table[] = { NVME_QUIRK_IGNORE_DEV_SUBNQN, }, { PCI_DEVICE(0x1c5c, 0x1504), /* SK Hynix PC400 */ .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x1c5c, 0x174a), /* SK Hynix P31 SSD */ + .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x15b7, 0x2001), /* Sandisk Skyhawk */ .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, { PCI_DEVICE(0x1d97, 0x2263), /* SPCC */ From 6b961bce50e489186232cef51036ddb8d672bc3b Mon Sep 17 00:00:00 2001 From: Ning Wang Date: Sun, 5 Jun 2022 20:36:48 +0000 Subject: [PATCH 155/298] nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs When ZHITAI TiPro7000 SSDs entered deepest power state(ps4) it has the same APST sleep problem as Kingston A2000. by chance the system crashes and displays the same dmesg info: https://bugzilla.kernel.org/show_bug.cgi?id=195039#c65 As the Archlinux wiki suggest (enlat + exlat) < 25000 is fine and my testing shows no system crashes ever since. Therefore disabling the deepest power state will fix the APST sleep issue. https://wiki.archlinux.org/title/Solid_state_drive/NVMe This is the APST data from 'nvme id-ctrl /dev/nvme1' NVME Identify Controller: vid : 0x1e49 ssvid : 0x1e49 sn : [...] mn : ZHITAI TiPro7000 1TB fr : ZTA32F3Y [...] ps 0 : mp:3.50W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:3.30W operational enlat:50 exlat:100 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:2.80W operational enlat:50 exlat:200 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.1500W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0200W non-operational enlat:2000 exlat:60000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- Signed-off-by: Ning Wang Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 82b4daa9bf95..b6f536f9ee78 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3494,6 +3494,8 @@ static const struct pci_device_id nvme_id_table[] = { .driver_data = NVME_QUIRK_BOGUS_NID, }, { PCI_DEVICE(0x1cc1, 0x5350), /* ADATA XPG GAMMIX S50 */ .driver_data = NVME_QUIRK_BOGUS_NID, }, + { PCI_DEVICE(0x1e49, 0x0041), /* ZHITAI TiPro7000 NVMe SSD */ + .driver_data = NVME_QUIRK_NO_DEEPEST_PS, }, { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0061), .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, }, { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0065), From 43047e082b90ead395c44b0e8497bc853bd13845 Mon Sep 17 00:00:00 2001 From: "rasheed.hsueh" Date: Fri, 10 Jun 2022 14:27:34 +0800 Subject: [PATCH 156/298] nvme-pci: disable write zeros support on UMIC and Samsung SSDs Like commit 5611ec2b9814 ("nvme-pci: prevent SK hynix PC400 from using Write Zeroes command"), UMIS and Samsung has the same issue: [ 6305.633887] blk_update_request: operation not supported error, dev nvme0n1, sector 340812032 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0 So also disable Write Zeroes command on UMIS and Samsung. Signed-off-by: rasheed.hsueh Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index b6f536f9ee78..c7012e85d035 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3482,6 +3482,14 @@ static const struct pci_device_id nvme_id_table[] = { .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, { PCI_DEVICE(0x1d97, 0x2263), /* SPCC */ .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */ + .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x144d, 0xa809), /* Samsung MZALQ256HBJD 256G */ + .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x1cc4, 0x6303), /* UMIS RPJTJ512MGE1QDY 512G */ + .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, + { PCI_DEVICE(0x1cc4, 0x6302), /* UMIS RPJTJ256MGE1QDY 256G */ + .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, }, { PCI_DEVICE(0x2646, 0x2262), /* KINGSTON SKC2000 NVMe SSD */ .driver_data = NVME_QUIRK_NO_DEEPEST_PS, }, { PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */ From 884c65e4daf3eab8730b2bbd5abc5a2c0403b3f3 Mon Sep 17 00:00:00 2001 From: Jean-Philippe Brucker Date: Thu, 9 Jun 2022 17:14:59 +0100 Subject: [PATCH 157/298] amd-xgbe: Use platform_irq_count() The AMD XGbE driver currently counts the number of interrupts assigned to the device by inspecting the pdev->resource array. Since commit a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") removed IRQs from this array, the driver now attempts to get all interrupts from 1 to -1U and gives up probing once it reaches an invalid interrupt index. Obtain the number of IRQs with platform_irq_count() instead. Fixes: a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") Signed-off-by: Jean-Philippe Brucker Acked-by: Rob Herring Acked-by: Tom Lendacky Link: https://lore.kernel.org/r/20220609161457.69614-1-jean-philippe@linaro.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/amd/xgbe/xgbe-platform.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-platform.c b/drivers/net/ethernet/amd/xgbe/xgbe-platform.c index 4ebd2410185a..4d790a89fe77 100644 --- a/drivers/net/ethernet/amd/xgbe/xgbe-platform.c +++ b/drivers/net/ethernet/amd/xgbe/xgbe-platform.c @@ -338,7 +338,7 @@ static int xgbe_platform_probe(struct platform_device *pdev) * the PHY resources listed last */ phy_memnum = xgbe_resource_count(pdev, IORESOURCE_MEM) - 3; - phy_irqnum = xgbe_resource_count(pdev, IORESOURCE_IRQ) - 1; + phy_irqnum = platform_irq_count(pdev) - 1; dma_irqnum = 1; dma_irqend = phy_irqnum; } else { @@ -348,7 +348,7 @@ static int xgbe_platform_probe(struct platform_device *pdev) phy_memnum = 0; phy_irqnum = 0; dma_irqnum = 1; - dma_irqend = xgbe_resource_count(pdev, IORESOURCE_IRQ); + dma_irqend = platform_irq_count(pdev); } /* Obtain the mmio areas for the device */ From 9cc8ea99bf7ae6f5a5a305bb14a6f1e3f18f5f54 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jonathan=20Neusch=C3=A4fer?= Date: Fri, 10 Jun 2022 09:28:08 +0200 Subject: [PATCH 158/298] docs: networking: phy: Fix a typo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Write "to be operated" instead of "to be operate". Signed-off-by: Jonathan Neuschäfer Reviewed-by: Andrew Lunn Link: https://lore.kernel.org/r/20220610072809.352962-1-j.neuschaefer@gmx.net Signed-off-by: Jakub Kicinski --- Documentation/networking/phy.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index d43da709bf40..704f31da5167 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons: * PHY device drivers in PHYLIB being reusable by nature, being able to configure correctly a specified delay enables more designs with similar delay - requirements to be operate correctly + requirements to be operated correctly For cases where the PHY is not capable of providing this delay, but the Ethernet MAC driver is capable of doing so, the correct phy_interface_t value From 0f9cd1ea10d307cad221d6693b648a8956e812b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christian=20K=C3=B6nig?= Date: Mon, 13 Jun 2022 09:37:03 +0200 Subject: [PATCH 159/298] drm/ttm: fix bulk move handling v2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The resource must be on the LRU before ttm_lru_bulk_move_add() is called and we need to check if the BO is pinned or not before adding it. Additional to that we missed taking the LRU spinlock in ttm_bo_unpin(). Signed-off-by: Christian König Reviewed-by: Arunpravin Paneer Selvam Acked-by: Luben Tuikov Link: https://patchwork.freedesktop.org/patch/msgid/20220613080816.4965-1-christian.koenig@amd.com Fixes: fee2ede15542 ("drm/ttm: rework bulk move handling v5") --- drivers/gpu/drm/ttm/ttm_bo.c | 22 ++++++++----- drivers/gpu/drm/ttm/ttm_resource.c | 52 +++++++++++++++++++++--------- include/drm/ttm/ttm_resource.h | 8 ++--- 3 files changed, 54 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 75d308ec173d..406e9c324e76 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -109,11 +109,11 @@ void ttm_bo_set_bulk_move(struct ttm_buffer_object *bo, return; spin_lock(&bo->bdev->lru_lock); - if (bo->bulk_move && bo->resource) - ttm_lru_bulk_move_del(bo->bulk_move, bo->resource); + if (bo->resource) + ttm_resource_del_bulk_move(bo->resource, bo); bo->bulk_move = bulk; - if (bo->bulk_move && bo->resource) - ttm_lru_bulk_move_add(bo->bulk_move, bo->resource); + if (bo->resource) + ttm_resource_add_bulk_move(bo->resource, bo); spin_unlock(&bo->bdev->lru_lock); } EXPORT_SYMBOL(ttm_bo_set_bulk_move); @@ -689,8 +689,11 @@ void ttm_bo_pin(struct ttm_buffer_object *bo) { dma_resv_assert_held(bo->base.resv); WARN_ON_ONCE(!kref_read(&bo->kref)); - if (!(bo->pin_count++) && bo->bulk_move && bo->resource) - ttm_lru_bulk_move_del(bo->bulk_move, bo->resource); + spin_lock(&bo->bdev->lru_lock); + if (bo->resource) + ttm_resource_del_bulk_move(bo->resource, bo); + ++bo->pin_count; + spin_unlock(&bo->bdev->lru_lock); } EXPORT_SYMBOL(ttm_bo_pin); @@ -707,8 +710,11 @@ void ttm_bo_unpin(struct ttm_buffer_object *bo) if (WARN_ON_ONCE(!bo->pin_count)) return; - if (!(--bo->pin_count) && bo->bulk_move && bo->resource) - ttm_lru_bulk_move_add(bo->bulk_move, bo->resource); + spin_lock(&bo->bdev->lru_lock); + --bo->pin_count; + if (bo->resource) + ttm_resource_add_bulk_move(bo->resource, bo); + spin_unlock(&bo->bdev->lru_lock); } EXPORT_SYMBOL(ttm_bo_unpin); diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c index 65889b3caf50..20f9adcc3235 100644 --- a/drivers/gpu/drm/ttm/ttm_resource.c +++ b/drivers/gpu/drm/ttm/ttm_resource.c @@ -91,8 +91,8 @@ static void ttm_lru_bulk_move_pos_tail(struct ttm_lru_bulk_move_pos *pos, } /* Add the resource to a bulk_move cursor */ -void ttm_lru_bulk_move_add(struct ttm_lru_bulk_move *bulk, - struct ttm_resource *res) +static void ttm_lru_bulk_move_add(struct ttm_lru_bulk_move *bulk, + struct ttm_resource *res) { struct ttm_lru_bulk_move_pos *pos = ttm_lru_bulk_move_pos(bulk, res); @@ -105,8 +105,8 @@ void ttm_lru_bulk_move_add(struct ttm_lru_bulk_move *bulk, } /* Remove the resource from a bulk_move range */ -void ttm_lru_bulk_move_del(struct ttm_lru_bulk_move *bulk, - struct ttm_resource *res) +static void ttm_lru_bulk_move_del(struct ttm_lru_bulk_move *bulk, + struct ttm_resource *res) { struct ttm_lru_bulk_move_pos *pos = ttm_lru_bulk_move_pos(bulk, res); @@ -122,6 +122,22 @@ void ttm_lru_bulk_move_del(struct ttm_lru_bulk_move *bulk, } } +/* Add the resource to a bulk move if the BO is configured for it */ +void ttm_resource_add_bulk_move(struct ttm_resource *res, + struct ttm_buffer_object *bo) +{ + if (bo->bulk_move && !bo->pin_count) + ttm_lru_bulk_move_add(bo->bulk_move, res); +} + +/* Remove the resource from a bulk move if the BO is configured for it */ +void ttm_resource_del_bulk_move(struct ttm_resource *res, + struct ttm_buffer_object *bo) +{ + if (bo->bulk_move && !bo->pin_count) + ttm_lru_bulk_move_del(bo->bulk_move, res); +} + /* Move a resource to the LRU or bulk tail */ void ttm_resource_move_to_lru_tail(struct ttm_resource *res) { @@ -169,15 +185,14 @@ void ttm_resource_init(struct ttm_buffer_object *bo, res->bus.is_iomem = false; res->bus.caching = ttm_cached; res->bo = bo; - INIT_LIST_HEAD(&res->lru); man = ttm_manager_type(bo->bdev, place->mem_type); spin_lock(&bo->bdev->lru_lock); - man->usage += res->num_pages << PAGE_SHIFT; - if (bo->bulk_move) - ttm_lru_bulk_move_add(bo->bulk_move, res); + if (bo->pin_count) + list_add_tail(&res->lru, &bo->bdev->pinned); else - ttm_resource_move_to_lru_tail(res); + list_add_tail(&res->lru, &man->lru[bo->priority]); + man->usage += res->num_pages << PAGE_SHIFT; spin_unlock(&bo->bdev->lru_lock); } EXPORT_SYMBOL(ttm_resource_init); @@ -210,8 +225,16 @@ int ttm_resource_alloc(struct ttm_buffer_object *bo, { struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, place->mem_type); + int ret; - return man->func->alloc(man, bo, place, res_ptr); + ret = man->func->alloc(man, bo, place, res_ptr); + if (ret) + return ret; + + spin_lock(&bo->bdev->lru_lock); + ttm_resource_add_bulk_move(*res_ptr, bo); + spin_unlock(&bo->bdev->lru_lock); + return 0; } void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res) @@ -221,12 +244,9 @@ void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res) if (!*res) return; - if (bo->bulk_move) { - spin_lock(&bo->bdev->lru_lock); - ttm_lru_bulk_move_del(bo->bulk_move, *res); - spin_unlock(&bo->bdev->lru_lock); - } - + spin_lock(&bo->bdev->lru_lock); + ttm_resource_del_bulk_move(*res, bo); + spin_unlock(&bo->bdev->lru_lock); man = ttm_manager_type(bo->bdev, (*res)->mem_type); man->func->free(man, *res); *res = NULL; diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h index 441653693970..ca89a48c2460 100644 --- a/include/drm/ttm/ttm_resource.h +++ b/include/drm/ttm/ttm_resource.h @@ -311,12 +311,12 @@ ttm_resource_manager_cleanup(struct ttm_resource_manager *man) } void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk); -void ttm_lru_bulk_move_add(struct ttm_lru_bulk_move *bulk, - struct ttm_resource *res); -void ttm_lru_bulk_move_del(struct ttm_lru_bulk_move *bulk, - struct ttm_resource *res); void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk); +void ttm_resource_add_bulk_move(struct ttm_resource *res, + struct ttm_buffer_object *bo); +void ttm_resource_del_bulk_move(struct ttm_resource *res, + struct ttm_buffer_object *bo); void ttm_resource_move_to_lru_tail(struct ttm_resource *res); void ttm_resource_init(struct ttm_buffer_object *bo, From 168f912893407a5acb798a4a58613b5f1f98c717 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 13 Jun 2022 13:15:17 +0200 Subject: [PATCH 160/298] fs: account for group membership When calling setattr_prepare() to determine the validity of the attributes the ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id. This is exactly the same for idmapped and non-idmapped mounts and allows callers to pass in the values they want to see written to inode->i_{g,u}id. When group ownership is changed a caller whose fsuid owns the inode can change the group of the inode to any group they are a member of. When searching through the caller's groups we need to use the gid mapped according to the idmapped mount otherwise we will fail to change ownership for unprivileged users. Consider a caller running with fsuid and fsgid 1000 using an idmapped mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the idmapped mount. The caller now requests the gid of the file to be changed to 1000 going through the idmapped mount. In the vfs we will immediately map the requested gid to the value that will need to be written to inode->i_gid and place it in attr->ia_gid. Since this idmapped mount maps 65534 to 1000 we place 65534 in attr->ia_gid. When we check whether the caller is allowed to change group ownership we first validate that their fsuid matches the inode's uid. The inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount. Since the caller's fsuid is 1000 we pass the check. We now check whether the caller is allowed to change inode->i_gid to the requested gid by calling in_group_p(). This will compare the passed in gid to the caller's fsgid and search the caller's additional groups. Since we're dealing with an idmapped mount we need to pass in the gid mapped according to the idmapped mount. This is akin to checking whether a caller is privileged over the future group the inode is owned by. And that needs to take the idmapped mount into account. Note, all helpers are nops without idmapped mounts. New regression test sent to xfstests. Link: https://github.com/lxc/lxd/issues/10537 Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts") Cc: Seth Forshee Cc: Christoph Hellwig Cc: Aleksa Sarai Cc: Al Viro Cc: stable@vger.kernel.org # 5.15+ CC: linux-fsdevel@vger.kernel.org Reviewed-by: Seth Forshee Signed-off-by: Christian Brauner (Microsoft) --- fs/attr.c | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 66899b6e9bd8..dbe996b0dedf 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -61,9 +61,15 @@ static bool chgrp_ok(struct user_namespace *mnt_userns, const struct inode *inode, kgid_t gid) { kgid_t kgid = i_gid_into_mnt(mnt_userns, inode); - if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)) && - (in_group_p(gid) || gid_eq(gid, inode->i_gid))) - return true; + if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) { + kgid_t mapped_gid; + + if (gid_eq(gid, inode->i_gid)) + return true; + mapped_gid = mapped_kgid_fs(mnt_userns, i_user_ns(inode), gid); + if (in_group_p(mapped_gid)) + return true; + } if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_CHOWN)) return true; if (gid_eq(kgid, INVALID_GID) && @@ -123,12 +129,20 @@ int setattr_prepare(struct user_namespace *mnt_userns, struct dentry *dentry, /* Make sure a caller can chmod. */ if (ia_valid & ATTR_MODE) { + kgid_t mapped_gid; + if (!inode_owner_or_capable(mnt_userns, inode)) return -EPERM; + + if (ia_valid & ATTR_GID) + mapped_gid = mapped_kgid_fs(mnt_userns, + i_user_ns(inode), attr->ia_gid); + else + mapped_gid = i_gid_into_mnt(mnt_userns, inode); + /* Also check the setgid bit! */ - if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid : - i_gid_into_mnt(mnt_userns, inode)) && - !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) + if (!in_group_p(mapped_gid) && + !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) attr->ia_mode &= ~S_ISGID; } From 5c2b745173347ba21e3995d815f26925c91c517d Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Fri, 8 Apr 2022 13:21:34 +0300 Subject: [PATCH 161/298] drm/exynos: fix IS_ERR() vs NULL check in probe The of_drm_find_bridge() does not return error pointers, it returns NULL on error. Fixes: dd8b6803bc49 ("exynos: drm: dsi: Attach in_bridge in MIC driver") Signed-off-by: Dan Carpenter Signed-off-by: Inki Dae --- drivers/gpu/drm/exynos/exynos_drm_mic.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_mic.c b/drivers/gpu/drm/exynos/exynos_drm_mic.c index 9e06f8e2a863..07e04ceb2476 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_mic.c +++ b/drivers/gpu/drm/exynos/exynos_drm_mic.c @@ -434,9 +434,9 @@ static int exynos_mic_probe(struct platform_device *pdev) remote = of_graph_get_remote_node(dev->of_node, 1, 0); mic->next_bridge = of_drm_find_bridge(remote); - if (IS_ERR(mic->next_bridge)) { + if (!mic->next_bridge) { DRM_DEV_ERROR(dev, "mic: Failed to find next bridge\n"); - ret = PTR_ERR(mic->next_bridge); + ret = -EPROBE_DEFER; goto err; } From 7d787184a18f0f84e996de8ff007e4395c1978ea Mon Sep 17 00:00:00 2001 From: Marek Szyprowski Date: Fri, 13 May 2022 10:31:05 +0200 Subject: [PATCH 162/298] drm/exynos: mic: Rework initialization Commit dd8b6803bc49 ("exynos: drm: dsi: Attach in_bridge in MIC driver") moved Exynos MIC attaching from DSI to MIC driver. However the method proposed there is incomplete and cannot really work. To properly attach it to the bridge chain, access to the respective encoder is needed. The Exynos MIC driver always attaches to the encoder created by the Exynos DSI driver, so grab it via available helpers for getting access to the CRTC and encoders. This also requires to change the order of driver component binding to let DSI to be bound before MIC. Fixes: dd8b6803bc49 ("exynos: drm: dsi: Attach in_bridge in MIC driver") Signed-off-by: Marek Szyprowski Fixed merge conflict. Signed-off-by: Inki Dae --- drivers/gpu/drm/exynos/exynos_drm_drv.c | 6 ++-- drivers/gpu/drm/exynos/exynos_drm_mic.c | 42 +++++++------------------ 2 files changed, 15 insertions(+), 33 deletions(-) diff --git a/drivers/gpu/drm/exynos/exynos_drm_drv.c b/drivers/gpu/drm/exynos/exynos_drm_drv.c index 424ea23eec32..16c539657f73 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_drv.c +++ b/drivers/gpu/drm/exynos/exynos_drm_drv.c @@ -176,15 +176,15 @@ static struct exynos_drm_driver_info exynos_drm_drivers[] = { }, { DRV_PTR(mixer_driver, CONFIG_DRM_EXYNOS_MIXER), DRM_COMPONENT_DRIVER - }, { - DRV_PTR(mic_driver, CONFIG_DRM_EXYNOS_MIC), - DRM_COMPONENT_DRIVER }, { DRV_PTR(dp_driver, CONFIG_DRM_EXYNOS_DP), DRM_COMPONENT_DRIVER }, { DRV_PTR(dsi_driver, CONFIG_DRM_EXYNOS_DSI), DRM_COMPONENT_DRIVER + }, { + DRV_PTR(mic_driver, CONFIG_DRM_EXYNOS_MIC), + DRM_COMPONENT_DRIVER }, { DRV_PTR(hdmi_driver, CONFIG_DRM_EXYNOS_HDMI), DRM_COMPONENT_DRIVER diff --git a/drivers/gpu/drm/exynos/exynos_drm_mic.c b/drivers/gpu/drm/exynos/exynos_drm_mic.c index 07e04ceb2476..09ce28ee08d9 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_mic.c +++ b/drivers/gpu/drm/exynos/exynos_drm_mic.c @@ -26,6 +26,7 @@ #include #include "exynos_drm_drv.h" +#include "exynos_drm_crtc.h" /* Sysreg registers for MIC */ #define DSD_CFG_MUX 0x1004 @@ -100,9 +101,7 @@ struct exynos_mic { bool i80_mode; struct videomode vm; - struct drm_encoder *encoder; struct drm_bridge bridge; - struct drm_bridge *next_bridge; bool enabled; }; @@ -229,8 +228,6 @@ static void mic_set_reg_on(struct exynos_mic *mic, bool enable) writel(reg, mic->reg + MIC_OP); } -static void mic_disable(struct drm_bridge *bridge) { } - static void mic_post_disable(struct drm_bridge *bridge) { struct exynos_mic *mic = bridge->driver_private; @@ -297,34 +294,30 @@ unlock: mutex_unlock(&mic_mutex); } -static void mic_enable(struct drm_bridge *bridge) { } - -static int mic_attach(struct drm_bridge *bridge, - enum drm_bridge_attach_flags flags) -{ - struct exynos_mic *mic = bridge->driver_private; - - return drm_bridge_attach(bridge->encoder, mic->next_bridge, - &mic->bridge, flags); -} - static const struct drm_bridge_funcs mic_bridge_funcs = { - .disable = mic_disable, .post_disable = mic_post_disable, .mode_set = mic_mode_set, .pre_enable = mic_pre_enable, - .enable = mic_enable, - .attach = mic_attach, }; static int exynos_mic_bind(struct device *dev, struct device *master, void *data) { struct exynos_mic *mic = dev_get_drvdata(dev); + struct drm_device *drm_dev = data; + struct exynos_drm_crtc *crtc = exynos_drm_crtc_get_by_type(drm_dev, + EXYNOS_DISPLAY_TYPE_LCD); + struct drm_encoder *e, *encoder = NULL; + + drm_for_each_encoder(e, drm_dev) + if (e->possible_crtcs == drm_crtc_mask(&crtc->base)) + encoder = e; + if (!encoder) + return -ENODEV; mic->bridge.driver_private = mic; - return 0; + return drm_bridge_attach(encoder, &mic->bridge, NULL, 0); } static void exynos_mic_unbind(struct device *dev, struct device *master, @@ -388,7 +381,6 @@ static int exynos_mic_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; struct exynos_mic *mic; - struct device_node *remote; struct resource res; int ret, i; @@ -432,16 +424,6 @@ static int exynos_mic_probe(struct platform_device *pdev) } } - remote = of_graph_get_remote_node(dev->of_node, 1, 0); - mic->next_bridge = of_drm_find_bridge(remote); - if (!mic->next_bridge) { - DRM_DEV_ERROR(dev, "mic: Failed to find next bridge\n"); - ret = -EPROBE_DEFER; - goto err; - } - - of_node_put(remote); - platform_set_drvdata(pdev, mic); mic->bridge.funcs = &mic_bridge_funcs; From 4b7a632ac4e7101ceefee8484d5c2ca505d347b3 Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Mon, 13 Jun 2022 15:50:17 +0300 Subject: [PATCH 163/298] mlxsw: spectrum_cnt: Reorder counter pools Both RIF and ACL flow counters use a 24-bit SW-managed counter address to communicate which counter they want to bind. In a number of Spectrum FW releases, binding a RIF counter is broken and slices the counter index to 16 bits. As a result, on Spectrum-2 and above, no more than about 410 RIF counters can be effectively used. This translates to 205 netdevices for which L3 HW stats can be enabled. (This does not happen on Spectrum-1, because there are fewer counters available overall and the counter index never exceeds 16 bits.) Binding counters to ACLs does not have this issue. Therefore reorder the counter allocation scheme so that RIF counters come first and therefore get lower indices that are below the 16-bit barrier. Fixes: 98e60dce4da1 ("Merge branch 'mlxsw-Introduce-initial-Spectrum-2-support'") Reported-by: Maksym Yaremchuk Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel Link: https://lore.kernel.org/r/20220613125017.2018162-1-idosch@nvidia.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h index a68d931090dd..15c8d4de8350 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h @@ -8,8 +8,8 @@ #include "spectrum.h" enum mlxsw_sp_counter_sub_pool_id { - MLXSW_SP_COUNTER_SUB_POOL_FLOW, MLXSW_SP_COUNTER_SUB_POOL_RIF, + MLXSW_SP_COUNTER_SUB_POOL_FLOW, }; int mlxsw_sp_counter_alloc(struct mlxsw_sp *mlxsw_sp, From 71a579f0d3777a704355e6f1572dfba92a9b58b2 Mon Sep 17 00:00:00 2001 From: Michal Michalik Date: Tue, 10 May 2022 13:03:43 +0200 Subject: [PATCH 164/298] ice: Fix PTP TX timestamp offset calculation The offset was being incorrectly calculated for E822 - that led to collisions in choosing TX timestamp register location when more than one port was trying to use timestamping mechanism. In E822 one quad is being logically split between ports, so quad 0 is having trackers for ports 0-3, quad 1 ports 4-7 etc. Each port should have separate memory location for tracking timestamps. Due to error for example ports 1 and 2 had been assigned to quad 0 with same offset (0), while port 1 should have offset 0 and 1 offset 16. Fix it by correctly calculating quad offset. Fixes: 3a7496234d17 ("ice: implement basic E822 PTP support") Signed-off-by: Michal Michalik Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_ptp.c | 2 +- drivers/net/ethernet/intel/ice/ice_ptp.h | 31 ++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c index 662947c882e8..ef9344ef0d8e 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c @@ -2271,7 +2271,7 @@ static int ice_ptp_init_tx_e822(struct ice_pf *pf, struct ice_ptp_tx *tx, u8 port) { tx->quad = port / ICE_PORTS_PER_QUAD; - tx->quad_offset = tx->quad * INDEX_PER_PORT; + tx->quad_offset = (port % ICE_PORTS_PER_QUAD) * INDEX_PER_PORT; tx->len = INDEX_PER_PORT; return ice_ptp_alloc_tx_tracker(tx); diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h index afd048d69959..10e396abf130 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp.h @@ -49,6 +49,37 @@ struct ice_perout_channel { * To allow multiple ports to access the shared register block independently, * the blocks are split up so that indexes are assigned to each port based on * hardware logical port number. + * + * The timestamp blocks are handled differently for E810- and E822-based + * devices. In E810 devices, each port has its own block of timestamps, while in + * E822 there is a need to logically break the block of registers into smaller + * chunks based on the port number to avoid collisions. + * + * Example for port 5 in E810: + * +--------+--------+--------+--------+--------+--------+--------+--------+ + * |register|register|register|register|register|register|register|register| + * | block | block | block | block | block | block | block | block | + * | for | for | for | for | for | for | for | for | + * | port 0 | port 1 | port 2 | port 3 | port 4 | port 5 | port 6 | port 7 | + * +--------+--------+--------+--------+--------+--------+--------+--------+ + * ^^ + * || + * |--- quad offset is always 0 + * ---- quad number + * + * Example for port 5 in E822: + * +-----------------------------+-----------------------------+ + * | register block for quad 0 | register block for quad 1 | + * |+------+------+------+------+|+------+------+------+------+| + * ||port 0|port 1|port 2|port 3|||port 0|port 1|port 2|port 3|| + * |+------+------+------+------+|+------+------+------+------+| + * +-----------------------------+-------^---------------------+ + * ^ | + * | --- quad offset* + * ---- quad number + * + * * PHY port 5 is port 1 in quad 1 + * */ /** From 9542ef4fba8c73e176b8aa18a8adf04aecb889e5 Mon Sep 17 00:00:00 2001 From: Roman Storozhenko Date: Tue, 7 Jun 2022 08:54:57 +0200 Subject: [PATCH 165/298] ice: Sync VLAN filtering features for DVM VLAN filtering features, that is C-Tag and S-Tag, in DVM mode must be both enabled or disabled. In case of turning off/on only one of the features, another feature must be turned off/on automatically with issuing an appropriate message to the kernel log. Fixes: 1babaf77f49d ("ice: Advertise 802.1ad VLAN filtering and offloads for PF netdev") Signed-off-by: Roman Storozhenko Co-developed-by: Anatolii Gerasymenko Signed-off-by: Anatolii Gerasymenko Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_main.c | 43 +++++++++++++++-------- 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index e1cae253412c..c1ac2f746714 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -5763,25 +5763,38 @@ static netdev_features_t ice_fix_features(struct net_device *netdev, netdev_features_t features) { struct ice_netdev_priv *np = netdev_priv(netdev); - netdev_features_t supported_vlan_filtering; - netdev_features_t requested_vlan_filtering; - struct ice_vsi *vsi = np->vsi; + netdev_features_t req_vlan_fltr, cur_vlan_fltr; + bool cur_ctag, cur_stag, req_ctag, req_stag; - requested_vlan_filtering = features & NETIF_VLAN_FILTERING_FEATURES; + cur_vlan_fltr = netdev->features & NETIF_VLAN_FILTERING_FEATURES; + cur_ctag = cur_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER; + cur_stag = cur_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER; - /* make sure supported_vlan_filtering works for both SVM and DVM */ - supported_vlan_filtering = NETIF_F_HW_VLAN_CTAG_FILTER; - if (ice_is_dvm_ena(&vsi->back->hw)) - supported_vlan_filtering |= NETIF_F_HW_VLAN_STAG_FILTER; + req_vlan_fltr = features & NETIF_VLAN_FILTERING_FEATURES; + req_ctag = req_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER; + req_stag = req_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER; - if (requested_vlan_filtering && - requested_vlan_filtering != supported_vlan_filtering) { - if (requested_vlan_filtering & NETIF_F_HW_VLAN_CTAG_FILTER) { - netdev_warn(netdev, "cannot support requested VLAN filtering settings, enabling all supported VLAN filtering settings\n"); - features |= supported_vlan_filtering; + if (req_vlan_fltr != cur_vlan_fltr) { + if (ice_is_dvm_ena(&np->vsi->back->hw)) { + if (req_ctag && req_stag) { + features |= NETIF_VLAN_FILTERING_FEATURES; + } else if (!req_ctag && !req_stag) { + features &= ~NETIF_VLAN_FILTERING_FEATURES; + } else if ((!cur_ctag && req_ctag && !cur_stag) || + (!cur_stag && req_stag && !cur_ctag)) { + features |= NETIF_VLAN_FILTERING_FEATURES; + netdev_warn(netdev, "802.1Q and 802.1ad VLAN filtering must be either both on or both off. VLAN filtering has been enabled for both types.\n"); + } else if ((cur_ctag && !req_ctag && cur_stag) || + (cur_stag && !req_stag && cur_ctag)) { + features &= ~NETIF_VLAN_FILTERING_FEATURES; + netdev_warn(netdev, "802.1Q and 802.1ad VLAN filtering must be either both on or both off. VLAN filtering has been disabled for both types.\n"); + } } else { - netdev_warn(netdev, "cannot support requested VLAN filtering settings, clearing all supported VLAN filtering settings\n"); - features &= ~supported_vlan_filtering; + if (req_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER) + netdev_warn(netdev, "cannot support requested 802.1ad filtering setting in SVM mode\n"); + + if (req_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER) + features |= NETIF_F_HW_VLAN_CTAG_FILTER; } } From be2af71496a54a7195ac62caba6fab49cfe5006c Mon Sep 17 00:00:00 2001 From: Przemyslaw Patynowski Date: Thu, 2 Jun 2022 12:09:04 +0200 Subject: [PATCH 166/298] ice: Fix queue config fail handling Disable VF's RX/TX queues, when VIRTCHNL_OP_CONFIG_VSI_QUEUES fail. Not disabling them might lead to scenario, where PF driver leaves VF queues enabled, when VF's VSI failed queue config. In this scenario VF should not have RX/TX queues enabled. If PF failed to set up VF's queues, VF will reset due to TX timeouts in VF driver. Initialize iterator 'i' to -1, so if error happens prior to configuring queues then error path code will not disable queue 0. Loop that configures queues will is using same iterator, so error path code will only disable queues that were configured. Fixes: 77ca27c41705 ("ice: add support for virtchnl_queue_select.[tx|rx]_queues bitmap") Suggested-by: Slawomir Laba Signed-off-by: Przemyslaw Patynowski Signed-off-by: Mateusz Palczewski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_virtchnl.c | 55 +++++++++---------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c index 1d9b84c3937a..4547bc1f7cee 100644 --- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c +++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c @@ -1569,35 +1569,27 @@ error_param: */ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) { - enum virtchnl_status_code v_ret = VIRTCHNL_STATUS_SUCCESS; struct virtchnl_vsi_queue_config_info *qci = (struct virtchnl_vsi_queue_config_info *)msg; struct virtchnl_queue_pair_info *qpi; struct ice_pf *pf = vf->pf; struct ice_vsi *vsi; - int i, q_idx; + int i = -1, q_idx; - if (!test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states)) goto error_param; - } - if (!ice_vc_isvalid_vsi_id(vf, qci->vsi_id)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!ice_vc_isvalid_vsi_id(vf, qci->vsi_id)) goto error_param; - } vsi = ice_get_vf_vsi(vf); - if (!vsi) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!vsi) goto error_param; - } if (qci->num_queue_pairs > ICE_MAX_RSS_QS_PER_VF || qci->num_queue_pairs > min_t(u16, vsi->alloc_txq, vsi->alloc_rxq)) { dev_err(ice_pf_to_dev(pf), "VF-%d requesting more than supported number of queues: %d\n", vf->vf_id, min_t(u16, vsi->alloc_txq, vsi->alloc_rxq)); - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1610,7 +1602,6 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) !ice_vc_isvalid_ring_len(qpi->txq.ring_len) || !ice_vc_isvalid_ring_len(qpi->rxq.ring_len) || !ice_vc_isvalid_q_id(vf, qci->vsi_id, qpi->txq.queue_id)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1620,7 +1611,6 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) * for selected "vsi" */ if (q_idx >= vsi->alloc_txq || q_idx >= vsi->alloc_rxq) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1630,14 +1620,13 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) vsi->tx_rings[i]->count = qpi->txq.ring_len; /* Disable any existing queue first */ - if (ice_vf_vsi_dis_single_txq(vf, vsi, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (ice_vf_vsi_dis_single_txq(vf, vsi, q_idx)) goto error_param; - } /* Configure a queue with the requested settings */ if (ice_vsi_cfg_single_txq(vsi, vsi->tx_rings, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + dev_warn(ice_pf_to_dev(pf), "VF-%d failed to configure TX queue %d\n", + vf->vf_id, i); goto error_param; } } @@ -1651,17 +1640,13 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) if (qpi->rxq.databuffer_size != 0 && (qpi->rxq.databuffer_size > ((16 * 1024) - 128) || - qpi->rxq.databuffer_size < 1024)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + qpi->rxq.databuffer_size < 1024)) goto error_param; - } vsi->rx_buf_len = qpi->rxq.databuffer_size; vsi->rx_rings[i]->rx_buf_len = vsi->rx_buf_len; if (qpi->rxq.max_pkt_size > max_frame_size || - qpi->rxq.max_pkt_size < 64) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + qpi->rxq.max_pkt_size < 64) goto error_param; - } vsi->max_frame = qpi->rxq.max_pkt_size; /* add space for the port VLAN since the VF driver is @@ -1672,16 +1657,30 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) vsi->max_frame += VLAN_HLEN; if (ice_vsi_cfg_single_rxq(vsi, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + dev_warn(ice_pf_to_dev(pf), "VF-%d failed to configure RX queue %d\n", + vf->vf_id, i); goto error_param; } } } -error_param: /* send the response to the VF */ - return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, v_ret, - NULL, 0); + return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, + VIRTCHNL_STATUS_SUCCESS, NULL, 0); +error_param: + /* disable whatever we can */ + for (; i >= 0; i--) { + if (ice_vsi_ctrl_one_rx_ring(vsi, false, i, true)) + dev_err(ice_pf_to_dev(pf), "VF-%d could not disable RX queue %d\n", + vf->vf_id, i); + if (ice_vf_vsi_dis_single_txq(vf, vsi, i)) + dev_err(ice_pf_to_dev(pf), "VF-%d could not disable TX queue %d\n", + vf->vf_id, i); + } + + /* send the response to the VF */ + return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, + VIRTCHNL_STATUS_ERR_PARAM, NULL, 0); } /** From efe41860008e57fb6b69855b4b93fdf34bc42798 Mon Sep 17 00:00:00 2001 From: Przemyslaw Patynowski Date: Thu, 2 Jun 2022 12:09:17 +0200 Subject: [PATCH 167/298] ice: Fix memory corruption in VF driver Disable VF's RX/TX queues, when it's disabled. VF can have queues enabled, when it requests a reset. If PF driver assumes that VF is disabled, while VF still has queues configured, VF may unmap DMA resources. In such scenario device still can map packets to memory, which ends up silently corrupting it. Previously, VF driver could experience memory corruption, which lead to crash: [ 5119.170157] BUG: unable to handle kernel paging request at 00001b9780003237 [ 5119.170166] PGD 0 P4D 0 [ 5119.170173] Oops: 0002 [#1] PREEMPT_RT SMP PTI [ 5119.170181] CPU: 30 PID: 427592 Comm: kworker/u96:2 Kdump: loaded Tainted: G W I --------- - - 4.18.0-372.9.1.rt7.166.el8.x86_64 #1 [ 5119.170189] Hardware name: Dell Inc. PowerEdge R740/014X06, BIOS 2.3.10 08/15/2019 [ 5119.170193] Workqueue: iavf iavf_adminq_task [iavf] [ 5119.170219] RIP: 0010:__page_frag_cache_drain+0x5/0x30 [ 5119.170238] Code: 0f 0f b6 77 51 85 f6 74 07 31 d2 e9 05 df ff ff e9 90 fe ff ff 48 8b 05 49 db 33 01 eb b4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 29 77 34 74 01 c3 48 8b 07 f6 c4 80 74 0f 0f b6 77 51 85 f6 74 [ 5119.170244] RSP: 0018:ffffa43b0bdcfd78 EFLAGS: 00010282 [ 5119.170250] RAX: ffffffff896b3e40 RBX: ffff8fb282524000 RCX: 0000000000000002 [ 5119.170254] RDX: 0000000049000000 RSI: 0000000000000000 RDI: 00001b9780003203 [ 5119.170259] RBP: ffff8fb248217b00 R08: 0000000000000022 R09: 0000000000000009 [ 5119.170262] R10: 2b849d6300000000 R11: 0000000000000020 R12: 0000000000000000 [ 5119.170265] R13: 0000000000001000 R14: 0000000000000009 R15: 0000000000000000 [ 5119.170269] FS: 0000000000000000(0000) GS:ffff8fb1201c0000(0000) knlGS:0000000000000000 [ 5119.170274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5119.170279] CR2: 00001b9780003237 CR3: 00000008f3e1a003 CR4: 00000000007726e0 [ 5119.170283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5119.170286] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5119.170290] PKRU: 55555554 [ 5119.170292] Call Trace: [ 5119.170298] iavf_clean_rx_ring+0xad/0x110 [iavf] [ 5119.170324] iavf_free_rx_resources+0xe/0x50 [iavf] [ 5119.170342] iavf_free_all_rx_resources.part.51+0x30/0x40 [iavf] [ 5119.170358] iavf_virtchnl_completion+0xd8a/0x15b0 [iavf] [ 5119.170377] ? iavf_clean_arq_element+0x210/0x280 [iavf] [ 5119.170397] iavf_adminq_task+0x126/0x2e0 [iavf] [ 5119.170416] process_one_work+0x18f/0x420 [ 5119.170429] worker_thread+0x30/0x370 [ 5119.170437] ? process_one_work+0x420/0x420 [ 5119.170445] kthread+0x151/0x170 [ 5119.170452] ? set_kthread_struct+0x40/0x40 [ 5119.170460] ret_from_fork+0x35/0x40 [ 5119.170477] Modules linked in: iavf sctp ip6_udp_tunnel udp_tunnel mlx4_en mlx4_core nfp tls vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr iTCO_wdt iTCO_vendor_support dell_smbios wmi_bmof dell_wmi_descriptor dcdbas kvm_intel kvm irqbypass intel_rapl_common isst_if_common skx_edac irdma nfit libnvdimm x86_pkg_temp_thermal i40e intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ib_uverbs rapl ipmi_ssif intel_cstate intel_uncore mei_me pcspkr acpi_ipmi ib_core mei lpc_ich i2c_i801 ipmi_si ipmi_devintf wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ice ahci drm libahci crc32c_intel libata tg3 megaraid_sas [ 5119.170613] i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: iavf] [ 5119.170627] CR2: 00001b9780003237 Fixes: ec4f5a436bdf ("ice: Check if VF is disabled for Opcode and other operations") Signed-off-by: Przemyslaw Patynowski Co-developed-by: Slawomir Laba Signed-off-by: Slawomir Laba Signed-off-by: Mateusz Palczewski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_vf_lib.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c index cd8e6b50968c..7adf9ddf129e 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c @@ -504,6 +504,11 @@ int ice_reset_vf(struct ice_vf *vf, u32 flags) } if (ice_is_vf_disabled(vf)) { + vsi = ice_get_vf_vsi(vf); + if (WARN_ON(!vsi)) + return -EINVAL; + ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, vf->vf_id); + ice_vsi_stop_all_rx_rings(vsi); dev_dbg(dev, "VF is already disabled, there is no need for resetting it, telling VM, all is fine %d\n", vf->vf_id); return 0; From 8899ce4b2f7364a90e3b9cf332dfd9993c61f46c Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Tue, 14 Jun 2022 17:51:16 +0100 Subject: [PATCH 168/298] Revert "io_uring: support CQE32 for nop operation" This reverts commit 2bb04df7c2af9dad5d28771c723bc39b01cf7df4. CQE32 nops were used for debugging and benchmarking but it doesn't target any real use case. Revert it, we can return it back if someone finds a good way to use it. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/5ff623d84ccb4b3f3b92a3ea41cdcfa612f3d96f.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 21 +-------------------- 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index ca6170a66e62..bf556f77d4ab 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -784,12 +784,6 @@ struct io_msg { u32 len; }; -struct io_nop { - struct file *file; - u64 extra1; - u64 extra2; -}; - struct io_async_connect { struct sockaddr_storage address; }; @@ -994,7 +988,6 @@ struct io_kiocb { struct io_msg msg; struct io_xattr xattr; struct io_socket sock; - struct io_nop nop; struct io_uring_cmd uring_cmd; }; @@ -5268,14 +5261,6 @@ done: static int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - /* - * If the ring is setup with CQE32, relay back addr/addr - */ - if (req->ctx->flags & IORING_SETUP_CQE32) { - req->nop.extra1 = READ_ONCE(sqe->addr); - req->nop.extra2 = READ_ONCE(sqe->addr2); - } - return 0; } @@ -5296,11 +5281,7 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags) } cflags = io_put_kbuf(req, issue_flags); - if (!(req->ctx->flags & IORING_SETUP_CQE32)) - __io_req_complete(req, issue_flags, 0, cflags); - else - __io_req_complete32(req, issue_flags, 0, cflags, - req->nop.extra1, req->nop.extra2); + __io_req_complete(req, issue_flags, 0, cflags); return 0; } From aa165d6d2bb55f8b1bb5047fd634311681316fa2 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Tue, 14 Jun 2022 17:51:17 +0100 Subject: [PATCH 169/298] Revert "io_uring: add buffer selection support to IORING_OP_NOP" This reverts commit 3d200242a6c968af321913b635fc4014b238cba4. Buffer selection with nops was used for debugging and benchmarking but is useless in real life. Let's revert it before it's released. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/c5012098ca6b51dfbdcb190f8c4e3c0bf1c965dc.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index bf556f77d4ab..1b95c6750a81 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1114,7 +1114,6 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_NOP] = { .audit_skip = 1, .iopoll = 1, - .buffer_select = 1, }, [IORING_OP_READV] = { .needs_file = 1, @@ -5269,19 +5268,7 @@ static int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ static int io_nop(struct io_kiocb *req, unsigned int issue_flags) { - unsigned int cflags; - void __user *buf; - - if (req->flags & REQ_F_BUFFER_SELECT) { - size_t len = 1; - - buf = io_buffer_select(req, &len, issue_flags); - if (!buf) - return -ENOBUFS; - } - - cflags = io_put_kbuf(req, issue_flags); - __io_req_complete(req, issue_flags, 0, cflags); + __io_req_complete(req, issue_flags, 0, 0); return 0; } From d884b6498d2f022098502e106d5a45ab635f2e9a Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Tue, 14 Jun 2022 17:51:18 +0100 Subject: [PATCH 170/298] io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOT This partially reverts a7c41b4687f5902af70cd559806990930c8a307b Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some users, but it tries to do two things at a time and it's not clear how to handle errors and what to return in a single result field when one part fails and another completes well. Kill it for now. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 12 +++--------- include/uapi/linux/io_uring.h | 6 ------ 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 1b95c6750a81..1b0b6099e717 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -576,7 +576,6 @@ struct io_close { struct file *file; int fd; u32 file_slot; - u32 flags; }; struct io_timeout_data { @@ -5966,18 +5965,14 @@ static int io_statx(struct io_kiocb *req, unsigned int issue_flags) static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - if (sqe->off || sqe->addr || sqe->len || sqe->buf_index) + if (sqe->off || sqe->addr || sqe->len || sqe->rw_flags || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) return -EBADF; req->close.fd = READ_ONCE(sqe->fd); req->close.file_slot = READ_ONCE(sqe->file_index); - req->close.flags = READ_ONCE(sqe->close_flags); - if (req->close.flags & ~IORING_CLOSE_FD_AND_FILE_SLOT) - return -EINVAL; - if (!(req->close.flags & IORING_CLOSE_FD_AND_FILE_SLOT) && - req->close.file_slot && req->close.fd) + if (req->close.file_slot && req->close.fd) return -EINVAL; return 0; @@ -5993,8 +5988,7 @@ static int io_close(struct io_kiocb *req, unsigned int issue_flags) if (req->close.file_slot) { ret = io_close_fixed(req, issue_flags); - if (ret || !(req->close.flags & IORING_CLOSE_FD_AND_FILE_SLOT)) - goto err; + goto err; } spin_lock(&files->file_lock); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 776e0278f9dd..53e7dae92e42 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -47,7 +47,6 @@ struct io_uring_sqe { __u32 unlink_flags; __u32 hardlink_flags; __u32 xattr_flags; - __u32 close_flags; }; __u64 user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ @@ -259,11 +258,6 @@ enum io_uring_op { */ #define IORING_ACCEPT_MULTISHOT (1U << 0) -/* - * close flags, store in sqe->close_flags - */ -#define IORING_CLOSE_FD_AND_FILE_SLOT (1U << 0) - /* * IO completion data structure (Completion Queue Entry) */ From c904e3acbab3fd97649cd4ab1ff7f1521ad3a255 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michel=20D=C3=A4nzer?= Date: Fri, 10 Jun 2022 15:54:26 +0200 Subject: [PATCH 171/298] drm/amdgpu: Fix GTT size reporting in amdgpu_ioctl MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The commit below changed the TTM manager size unit from pages to bytes, but failed to adjust the corresponding calculations in amdgpu_ioctl. Fixes: dfa714b88eb0 ("drm/amdgpu: remove GTT accounting v2") Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1930 Bug: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6642 Tested-by: Martin Roukala Tested-by: Mike Lothian Reviewed-by: Christian König Signed-off-by: Michel Dänzer Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 5.18.x --- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 801f6fa692e9..6de63ea6687e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -642,7 +642,6 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) atomic64_read(&adev->visible_pin_size), vram_gtt.vram_size); vram_gtt.gtt_size = ttm_manager_type(&adev->mman.bdev, TTM_PL_TT)->size; - vram_gtt.gtt_size *= PAGE_SIZE; vram_gtt.gtt_size -= atomic64_read(&adev->gart_pin_size); return copy_to_user(out, &vram_gtt, min((size_t)size, sizeof(vram_gtt))) ? -EFAULT : 0; @@ -675,7 +674,6 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) mem.cpu_accessible_vram.usable_heap_size * 3 / 4; mem.gtt.total_heap_size = gtt_man->size; - mem.gtt.total_heap_size *= PAGE_SIZE; mem.gtt.usable_heap_size = mem.gtt.total_heap_size - atomic64_read(&adev->gart_pin_size); mem.gtt.heap_usage = ttm_resource_manager_usage(gtt_man); From 4fd17f2ac0aa4e48823ac2ede5b050fb70300bf4 Mon Sep 17 00:00:00 2001 From: Roman Li Date: Thu, 19 May 2022 14:41:16 -0400 Subject: [PATCH 172/298] drm/amd/display: Cap OLED brightness per max frame-average luminance [Why] For OLED eDP the Display Manager uses max_cll value as a limit for brightness control. max_cll defines the content light luminance for individual pixel. Whereas max_fall defines frame-average level luminance. The user may not observe the difference in brightness in between max_fall and max_cll. That negatively impacts the user experience. [How] Use max_fall value instead of max_cll as a limit for brightness control. Reviewed-by: Rodrigo Siqueira Acked-by: Hamza Mahfooz Signed-off-by: Roman Li Tested-by: Daniel Wheeler Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 70be67a56673..39b425d83bb1 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -2812,7 +2812,7 @@ static struct drm_mode_config_helper_funcs amdgpu_dm_mode_config_helperfuncs = { static void update_connector_ext_caps(struct amdgpu_dm_connector *aconnector) { - u32 max_cll, min_cll, max, min, q, r; + u32 max_avg, min_cll, max, min, q, r; struct amdgpu_dm_backlight_caps *caps; struct amdgpu_display_manager *dm; struct drm_connector *conn_base; @@ -2842,7 +2842,7 @@ static void update_connector_ext_caps(struct amdgpu_dm_connector *aconnector) caps = &dm->backlight_caps[i]; caps->ext_caps = &aconnector->dc_link->dpcd_sink_ext_caps; caps->aux_support = false; - max_cll = conn_base->hdr_sink_metadata.hdmi_type1.max_cll; + max_avg = conn_base->hdr_sink_metadata.hdmi_type1.max_fall; min_cll = conn_base->hdr_sink_metadata.hdmi_type1.min_cll; if (caps->ext_caps->bits.oled == 1 /*|| @@ -2870,8 +2870,8 @@ static void update_connector_ext_caps(struct amdgpu_dm_connector *aconnector) * The results of the above expressions can be verified at * pre_computed_values. */ - q = max_cll >> 5; - r = max_cll % 32; + q = max_avg >> 5; + r = max_avg % 32; max = (1 << q) * pre_computed_values[r]; // min luminance: maxLum * (CV/255)^2 / 100 From 018ab4fabddd94f1c96f3b59e180691b9e88d5d8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 14 Jun 2022 10:36:11 -0700 Subject: [PATCH 173/298] netfs: fix up netfs_inode_init() docbook comment Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") changed the argument types and names, and actually updated the comment too (although that was thanks to David Howells, not me: my original patch only changed the code). But the comment fixup didn't go quite far enough, and didn't change the argument name in the comment, resulting in include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init' include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init' during htmldoc generation. Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") Reported-by: Stephen Rothwell Signed-off-by: Linus Torvalds --- include/linux/netfs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 097cdd644665..1773e5df8e65 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -304,7 +304,7 @@ static inline struct netfs_inode *netfs_inode(struct inode *inode) /** * netfs_inode_init - Initialise a netfslib inode context - * @inode: The netfs inode to initialise + * @ctx: The netfs inode to initialise * @ops: The netfs's operations list * * Initialise the netfs library context struct. This is expected to follow on From de87b603b0919e31578c8fa312a3541f1fb37e1c Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Sun, 22 May 2022 14:22:07 +0200 Subject: [PATCH 174/298] i2c: mediatek: Fix an error handling path in mtk_i2c_probe() The clsk are prepared, enabled, then disabled. So if an error occurs after the disable step, they are still prepared. Add an error handling path to unprepare the clks in such a case, as already done in the .remove function. Fixes: 8b4fc246c3ff ("i2c: mediatek: Optimize master_xfer() and avoid circular locking") Signed-off-by: Christophe JAILLET Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: Qii Wang Signed-off-by: Wolfram Sang --- drivers/i2c/busses/i2c-mt65xx.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/i2c/busses/i2c-mt65xx.c b/drivers/i2c/busses/i2c-mt65xx.c index bdecb78bfc26..8e6985354fd5 100644 --- a/drivers/i2c/busses/i2c-mt65xx.c +++ b/drivers/i2c/busses/i2c-mt65xx.c @@ -1420,17 +1420,22 @@ static int mtk_i2c_probe(struct platform_device *pdev) if (ret < 0) { dev_err(&pdev->dev, "Request I2C IRQ %d fail\n", irq); - return ret; + goto err_bulk_unprepare; } i2c_set_adapdata(&i2c->adap, i2c); ret = i2c_add_adapter(&i2c->adap); if (ret) - return ret; + goto err_bulk_unprepare; platform_set_drvdata(pdev, i2c); return 0; + +err_bulk_unprepare: + clk_bulk_unprepare(I2C_MT65XX_CLK_MAX, i2c->clocks); + + return ret; } static int mtk_i2c_remove(struct platform_device *pdev) From d7dd6eccfbc95ac47a12396f84e7e1b361db654b Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Mon, 13 Jun 2022 22:53:50 +0200 Subject: [PATCH 175/298] net: bgmac: Fix an erroneous kfree() in bgmac_remove() 'bgmac' is part of a managed resource allocated with bgmac_alloc(). It should not be freed explicitly. Remove the erroneous kfree() from the .remove() function. Fixes: 34a5102c3235 ("net: bgmac: allocate struct bgmac just once & don't copy it") Signed-off-by: Christophe JAILLET Reviewed-by: Florian Fainelli Link: https://lore.kernel.org/r/a026153108dd21239036a032b95c25b5cece253b.1655153616.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/broadcom/bgmac-bcma.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bgmac-bcma.c b/drivers/net/ethernet/broadcom/bgmac-bcma.c index e6f48786949c..02bd3cf9a260 100644 --- a/drivers/net/ethernet/broadcom/bgmac-bcma.c +++ b/drivers/net/ethernet/broadcom/bgmac-bcma.c @@ -332,7 +332,6 @@ static void bgmac_remove(struct bcma_device *core) bcma_mdio_mii_unregister(bgmac->mii_bus); bgmac_enet_remove(bgmac); bcma_set_drvdata(core, NULL); - kfree(bgmac); } static struct bcma_driver bgmac_bcma_driver = { From 56315b6bf7fc63d2b26c37869d2753f765849bd6 Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Fri, 10 Jun 2022 10:16:21 +0200 Subject: [PATCH 176/298] ARM: dts: at91: ksz9477_evb: fix port/phy validation Latest drivers version requires phy-mode to be set. Otherwise we will use "NA" mode and the switch driver will invalidate this port mode. Fixes: 65ac79e18120 ("net: dsa: microchip: add the phylink get_caps") Signed-off-by: Oleksij Rempel Link: https://lore.kernel.org/r/20220610081621.584393-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts b/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts index 443e8b022897..14af1fd6d247 100644 --- a/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts +++ b/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts @@ -120,26 +120,31 @@ port@0 { reg = <0>; label = "lan1"; + phy-mode = "internal"; }; port@1 { reg = <1>; label = "lan2"; + phy-mode = "internal"; }; port@2 { reg = <2>; label = "lan3"; + phy-mode = "internal"; }; port@3 { reg = <3>; label = "lan4"; + phy-mode = "internal"; }; port@4 { reg = <4>; label = "lan5"; + phy-mode = "internal"; }; port@5 { From b60377de779052bf00b34a62f0bae03c92b88776 Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Mon, 13 Jun 2022 14:18:26 +0200 Subject: [PATCH 177/298] MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS Maintainers of the directory Documentation/devicetree/bindings/net are also the maintainers of the corresponding directory include/dt-bindings/net. Add the file entry for include/dt-bindings/net to the appropriate section in MAINTAINERS. Signed-off-by: Lukas Bulwahn Link: https://lore.kernel.org/r/20220613121826.11484-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 96158b337b40..0c3847fb2bbc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13799,6 +13799,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git F: Documentation/devicetree/bindings/net/ F: drivers/connector/ F: drivers/net/ +F: include/dt-bindings/net/ F: include/linux/etherdevice.h F: include/linux/fcdevice.h F: include/linux/fddidevice.h From 36a15e1cb134c0395261ba1940762703f778438c Mon Sep 17 00:00:00 2001 From: Jose Alonso Date: Mon, 13 Jun 2022 15:32:44 -0300 Subject: [PATCH 178/298] net: usb: ax88179_178a needs FLAG_SEND_ZLP The extra byte inserted by usbnet.c when (length % dev->maxpacket == 0) is causing problems to device. This patch sets FLAG_SEND_ZLP to avoid this. Tested with: 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet Problems observed: ====================================================================== 1) Using ssh/sshfs. The remote sshd daemon can abort with the message: "message authentication code incorrect" This happens because the tcp message sent is corrupted during the USB "Bulk out". The device calculate the tcp checksum and send a valid tcp message to the remote sshd. Then the encryption detects the error and aborts. 2) NETDEV WATCHDOG: ... (ax88179_178a): transmit queue 0 timed out 3) Stop normal work without any log message. The "Bulk in" continue receiving packets normally. The host sends "Bulk out" and the device responds with -ECONNRESET. (The netusb.c code tx_complete ignore -ECONNRESET) Under normal conditions these errors take days to happen and in intense usage take hours. A test with ping gives packet loss, showing that something is wrong: ping -4 -s 462 {destination} # 462 = 512 - 42 - 8 Not all packets fail. My guess is that the device tries to find another packet starting at the extra byte and will fail or not depending on the next bytes (old buffer content). ====================================================================== Signed-off-by: Jose Alonso Signed-off-by: David S. Miller --- drivers/net/usb/ax88179_178a.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index 7a8c11a26eb5..4704ed6f00ef 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1750,7 +1750,7 @@ static const struct driver_info ax88179_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1763,7 +1763,7 @@ static const struct driver_info ax88178a_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1776,7 +1776,7 @@ static const struct driver_info cypress_GX3_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1789,7 +1789,7 @@ static const struct driver_info dlink_dub1312_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1802,7 +1802,7 @@ static const struct driver_info sitecom_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1815,7 +1815,7 @@ static const struct driver_info samsung_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1828,7 +1828,7 @@ static const struct driver_info lenovo_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1841,7 +1841,7 @@ static const struct driver_info belkin_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1854,7 +1854,7 @@ static const struct driver_info toshiba_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1867,7 +1867,7 @@ static const struct driver_info mct_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1880,7 +1880,7 @@ static const struct driver_info at_umc2000_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1893,7 +1893,7 @@ static const struct driver_info at_umc200_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1906,7 +1906,7 @@ static const struct driver_info at_umc2000sp_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; From 91ef75a7db0d0855284b78d60d3fcec5c353ec5a Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:02 +0100 Subject: [PATCH 179/298] io_uring: get rid of __io_fill_cqe{32}_req() There are too many cqe filling helpers, kill __io_fill_cqe{32}_req(), use __io_fill_cqe{32}_req_filled() instead, and then rename it. It'll simplify fixing in following patches. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/c18e0d191014fb574f24721245e4e3fddd0b6917.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 70 ++++++++++++++++----------------------------------- 1 file changed, 21 insertions(+), 49 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 1b0b6099e717..654c2f897497 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2464,8 +2464,8 @@ static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data, return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); } -static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, + struct io_kiocb *req) { struct io_uring_cqe *cqe; @@ -2486,8 +2486,8 @@ static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx, req->cqe.res, req->cqe.flags, 0, 0); } -static inline bool __io_fill_cqe32_req_filled(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline bool __io_fill_cqe32_req(struct io_ring_ctx *ctx, + struct io_kiocb *req) { struct io_uring_cqe *cqe; u64 extra1 = req->extra1; @@ -2513,44 +2513,6 @@ static inline bool __io_fill_cqe32_req_filled(struct io_ring_ctx *ctx, req->cqe.flags, extra1, extra2); } -static inline bool __io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags) -{ - trace_io_uring_complete(req->ctx, req, req->cqe.user_data, res, cflags, 0, 0); - return __io_fill_cqe(req->ctx, req->cqe.user_data, res, cflags); -} - -static inline void __io_fill_cqe32_req(struct io_kiocb *req, s32 res, u32 cflags, - u64 extra1, u64 extra2) -{ - struct io_ring_ctx *ctx = req->ctx; - struct io_uring_cqe *cqe; - - if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32))) - return; - if (req->flags & REQ_F_CQE_SKIP) - return; - - trace_io_uring_complete(ctx, req, req->cqe.user_data, res, cflags, - extra1, extra2); - - /* - * If we can't get a cq entry, userspace overflowed the - * submission (by quite a lot). Increment the overflow count in - * the ring. - */ - cqe = io_get_cqe(ctx); - if (likely(cqe)) { - WRITE_ONCE(cqe->user_data, req->cqe.user_data); - WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, cflags); - WRITE_ONCE(cqe->big_cqe[0], extra1); - WRITE_ONCE(cqe->big_cqe[1], extra2); - return; - } - - io_cqring_event_overflow(ctx, req->cqe.user_data, res, cflags, extra1, extra2); -} - static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) { @@ -2593,16 +2555,24 @@ static void __io_req_complete_put(struct io_kiocb *req) static void __io_req_complete_post(struct io_kiocb *req, s32 res, u32 cflags) { - if (!(req->flags & REQ_F_CQE_SKIP)) - __io_fill_cqe_req(req, res, cflags); + if (!(req->flags & REQ_F_CQE_SKIP)) { + req->cqe.res = res; + req->cqe.flags = cflags; + __io_fill_cqe_req(req->ctx, req); + } __io_req_complete_put(req); } static void __io_req_complete_post32(struct io_kiocb *req, s32 res, u32 cflags, u64 extra1, u64 extra2) { - if (!(req->flags & REQ_F_CQE_SKIP)) - __io_fill_cqe32_req(req, res, cflags, extra1, extra2); + if (!(req->flags & REQ_F_CQE_SKIP)) { + req->cqe.res = res; + req->cqe.flags = cflags; + req->extra1 = extra1; + req->extra2 = extra2; + __io_fill_cqe32_req(req->ctx, req); + } __io_req_complete_put(req); } @@ -3207,9 +3177,9 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx) if (!(req->flags & REQ_F_CQE_SKIP)) { if (!(ctx->flags & IORING_SETUP_CQE32)) - __io_fill_cqe_req_filled(ctx, req); + __io_fill_cqe_req(ctx, req); else - __io_fill_cqe32_req_filled(ctx, req); + __io_fill_cqe32_req(ctx, req); } } @@ -3329,7 +3299,9 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) nr_events++; if (unlikely(req->flags & REQ_F_CQE_SKIP)) continue; - __io_fill_cqe_req(req, req->cqe.res, io_put_kbuf(req, 0)); + + req->cqe.flags = io_put_kbuf(req, 0); + __io_fill_cqe_req(req->ctx, req); } if (unlikely(!nr_events)) From f43de1f88841d59f27f761219b6550bd6ce3dcc1 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:03 +0100 Subject: [PATCH 180/298] io_uring: unite fill_cqe and the 32B version We want just one function that will handle both normal cqes and 32B cqes. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still not entirely correct yet, but saves us from cases when we fill an CQE of a wrong size. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/8085c5b2f74141520f60decd45334f87e389b718.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 63 +++++++++++++++++++++++++++++++++++---------------- 1 file changed, 43 insertions(+), 20 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 654c2f897497..eb858cf92af9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2469,21 +2469,48 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, { struct io_uring_cqe *cqe; - trace_io_uring_complete(req->ctx, req, req->cqe.user_data, - req->cqe.res, req->cqe.flags, 0, 0); - - /* - * If we can't get a cq entry, userspace overflowed the - * submission (by quite a lot). Increment the overflow count in - * the ring. - */ - cqe = io_get_cqe(ctx); - if (likely(cqe)) { - memcpy(cqe, &req->cqe, sizeof(*cqe)); - return true; - } - return io_cqring_event_overflow(ctx, req->cqe.user_data, + if (!(ctx->flags & IORING_SETUP_CQE32)) { + trace_io_uring_complete(req->ctx, req, req->cqe.user_data, req->cqe.res, req->cqe.flags, 0, 0); + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_get_cqe(ctx); + if (likely(cqe)) { + memcpy(cqe, &req->cqe, sizeof(*cqe)); + return true; + } + + return io_cqring_event_overflow(ctx, req->cqe.user_data, + req->cqe.res, req->cqe.flags, + 0, 0); + } else { + u64 extra1 = req->extra1; + u64 extra2 = req->extra2; + + trace_io_uring_complete(req->ctx, req, req->cqe.user_data, + req->cqe.res, req->cqe.flags, extra1, extra2); + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_get_cqe(ctx); + if (likely(cqe)) { + memcpy(cqe, &req->cqe, sizeof(struct io_uring_cqe)); + WRITE_ONCE(cqe->big_cqe[0], extra1); + WRITE_ONCE(cqe->big_cqe[1], extra2); + return true; + } + + return io_cqring_event_overflow(ctx, req->cqe.user_data, + req->cqe.res, req->cqe.flags, + extra1, extra2); + } } static inline bool __io_fill_cqe32_req(struct io_ring_ctx *ctx, @@ -3175,12 +3202,8 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx) struct io_kiocb *req = container_of(node, struct io_kiocb, comp_list); - if (!(req->flags & REQ_F_CQE_SKIP)) { - if (!(ctx->flags & IORING_SETUP_CQE32)) - __io_fill_cqe_req(ctx, req); - else - __io_fill_cqe32_req(ctx, req); - } + if (!(req->flags & REQ_F_CQE_SKIP)) + __io_fill_cqe_req(ctx, req); } io_commit_cqring(ctx); From 29ede2014c87576d2fc83680aa4c1d7403db0dfe Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:04 +0100 Subject: [PATCH 181/298] io_uring: fill extra big cqe fields from req The only user of io_req_complete32()-like functions is cmd requests. Instead of keeping the whole complete32 family, remove them and provide the extras in already added for inline completions req->extra{1,2}. When fill_cqe_res() finds CQE32 option enabled it'll use those fields to fill a 32B cqe. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/af1319eb661b1f9a0abceb51cbbf72b8002e019d.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 78 +++++++-------------------------------------------- 1 file changed, 10 insertions(+), 68 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index eb858cf92af9..10901db93f7e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2513,33 +2513,6 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, } } -static inline bool __io_fill_cqe32_req(struct io_ring_ctx *ctx, - struct io_kiocb *req) -{ - struct io_uring_cqe *cqe; - u64 extra1 = req->extra1; - u64 extra2 = req->extra2; - - trace_io_uring_complete(req->ctx, req, req->cqe.user_data, - req->cqe.res, req->cqe.flags, extra1, extra2); - - /* - * If we can't get a cq entry, userspace overflowed the - * submission (by quite a lot). Increment the overflow count in - * the ring. - */ - cqe = io_get_cqe(ctx); - if (likely(cqe)) { - memcpy(cqe, &req->cqe, sizeof(struct io_uring_cqe)); - cqe->big_cqe[0] = extra1; - cqe->big_cqe[1] = extra2; - return true; - } - - return io_cqring_event_overflow(ctx, req->cqe.user_data, req->cqe.res, - req->cqe.flags, extra1, extra2); -} - static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) { @@ -2590,19 +2563,6 @@ static void __io_req_complete_post(struct io_kiocb *req, s32 res, __io_req_complete_put(req); } -static void __io_req_complete_post32(struct io_kiocb *req, s32 res, - u32 cflags, u64 extra1, u64 extra2) -{ - if (!(req->flags & REQ_F_CQE_SKIP)) { - req->cqe.res = res; - req->cqe.flags = cflags; - req->extra1 = extra1; - req->extra2 = extra2; - __io_fill_cqe32_req(req->ctx, req); - } - __io_req_complete_put(req); -} - static void io_req_complete_post(struct io_kiocb *req, s32 res, u32 cflags) { struct io_ring_ctx *ctx = req->ctx; @@ -2614,18 +2574,6 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res, u32 cflags) io_cqring_ev_posted(ctx); } -static void io_req_complete_post32(struct io_kiocb *req, s32 res, - u32 cflags, u64 extra1, u64 extra2) -{ - struct io_ring_ctx *ctx = req->ctx; - - spin_lock(&ctx->completion_lock); - __io_req_complete_post32(req, res, cflags, extra1, extra2); - io_commit_cqring(ctx); - spin_unlock(&ctx->completion_lock); - io_cqring_ev_posted(ctx); -} - static inline void io_req_complete_state(struct io_kiocb *req, s32 res, u32 cflags) { @@ -2643,19 +2591,6 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags, io_req_complete_post(req, res, cflags); } -static inline void __io_req_complete32(struct io_kiocb *req, - unsigned int issue_flags, s32 res, - u32 cflags, u64 extra1, u64 extra2) -{ - if (issue_flags & IO_URING_F_COMPLETE_DEFER) { - io_req_complete_state(req, res, cflags); - req->extra1 = extra1; - req->extra2 = extra2; - } else { - io_req_complete_post32(req, res, cflags, extra1, extra2); - } -} - static inline void io_req_complete(struct io_kiocb *req, s32 res) { if (res < 0) @@ -5079,6 +5014,13 @@ void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, } EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); +static inline void io_req_set_cqe32_extra(struct io_kiocb *req, + u64 extra1, u64 extra2) +{ + req->extra1 = extra1; + req->extra2 = extra2; +} + /* * Called by consumers of io_uring_cmd, if they originally returned * -EIOCBQUEUED upon receiving the command. @@ -5089,10 +5031,10 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2) if (ret < 0) req_set_fail(req); + if (req->ctx->flags & IORING_SETUP_CQE32) - __io_req_complete32(req, 0, ret, 0, res2, 0); - else - io_req_complete(req, ret); + io_req_set_cqe32_extra(req, res2, 0); + io_req_complete(req, ret); } EXPORT_SYMBOL_GPL(io_uring_cmd_done); From 2caf9822f0507463168a9e83f93c75b3e3fac971 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:05 +0100 Subject: [PATCH 182/298] io_uring: fix ->extra{1,2} misuse We don't really know the state of req->extra{1,2] fields in __io_fill_cqe_req(), if an opcode handler is not aware of CQE32 option, it never sets them up properly. Track the state of those fields with a request flag. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/4b3e5be512fbf4debec7270fd485b8a3b014d464.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 10901db93f7e..808b7f4ace0b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -844,6 +844,7 @@ enum { REQ_F_SINGLE_POLL_BIT, REQ_F_DOUBLE_POLL_BIT, REQ_F_PARTIAL_IO_BIT, + REQ_F_CQE32_INIT_BIT, REQ_F_APOLL_MULTISHOT_BIT, /* keep async read/write and isreg together and in order */ REQ_F_SUPPORT_NOWAIT_BIT, @@ -913,6 +914,8 @@ enum { REQ_F_PARTIAL_IO = BIT(REQ_F_PARTIAL_IO_BIT), /* fast poll multishot mode */ REQ_F_APOLL_MULTISHOT = BIT(REQ_F_APOLL_MULTISHOT_BIT), + /* ->extra1 and ->extra2 are initialised */ + REQ_F_CQE32_INIT = BIT(REQ_F_CQE32_INIT_BIT), }; struct async_poll { @@ -2488,8 +2491,12 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, req->cqe.res, req->cqe.flags, 0, 0); } else { - u64 extra1 = req->extra1; - u64 extra2 = req->extra2; + u64 extra1 = 0, extra2 = 0; + + if (req->flags & REQ_F_CQE32_INIT) { + extra1 = req->extra1; + extra2 = req->extra2; + } trace_io_uring_complete(req->ctx, req, req->cqe.user_data, req->cqe.res, req->cqe.flags, extra1, extra2); @@ -5019,6 +5026,7 @@ static inline void io_req_set_cqe32_extra(struct io_kiocb *req, { req->extra1 = extra1; req->extra2 = extra2; + req->flags |= REQ_F_CQE32_INIT; } /* From cd94903d3ba50d7ae797c603f68996af8d1ba1a1 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:06 +0100 Subject: [PATCH 183/298] io_uring: remove __io_fill_cqe() helper In preparation for the following patch, inline __io_fill_cqe(), there is only one user. Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/71dab9afc3cde3f8b64d26f20d3b60bdc40726ff.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 37 ++++++++++++++++--------------------- 1 file changed, 16 insertions(+), 21 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 808b7f4ace0b..792e9c95d217 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2447,26 +2447,6 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, return true; } -static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags) -{ - struct io_uring_cqe *cqe; - - /* - * If we can't get a cq entry, userspace overflowed the - * submission (by quite a lot). Increment the overflow count in - * the ring. - */ - cqe = io_get_cqe(ctx); - if (likely(cqe)) { - WRITE_ONCE(cqe->user_data, user_data); - WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, cflags); - return true; - } - return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); -} - static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req) { @@ -2523,9 +2503,24 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) { + struct io_uring_cqe *cqe; + ctx->cq_extra++; trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0); - return __io_fill_cqe(ctx, user_data, res, cflags); + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_get_cqe(ctx); + if (likely(cqe)) { + WRITE_ONCE(cqe->user_data, user_data); + WRITE_ONCE(cqe->res, res); + WRITE_ONCE(cqe->flags, cflags); + return true; + } + return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); } static void __io_req_complete_put(struct io_kiocb *req) From c5595975b53a487bf329eeba65b5c5f34605a4c0 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov Date: Wed, 15 Jun 2022 11:23:07 +0100 Subject: [PATCH 184/298] io_uring: make io_fill_cqe_aux honour CQE32 Don't let io_fill_cqe_aux() post 16B cqes for CQE32 rings, neither the kernel nor the userspace expect this to happen. Fixes: 76c68fbf1a1f9 ("io_uring: enable CQE32") Signed-off-by: Pavel Begunkov Link: https://lore.kernel.org/r/64fae669fae1b7083aa15d0cd807f692b0880b9a.1655287457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index 792e9c95d217..5d479428d8e5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2518,6 +2518,11 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, WRITE_ONCE(cqe->user_data, user_data); WRITE_ONCE(cqe->res, res); WRITE_ONCE(cqe->flags, cflags); + + if (ctx->flags & IORING_SETUP_CQE32) { + WRITE_ONCE(cqe->big_cqe[0], 0); + WRITE_ONCE(cqe->big_cqe[1], 0); + } return true; } return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); From 219b51a6f040fa5367adadd7d58c4dda0896a01d Mon Sep 17 00:00:00 2001 From: Duoming Zhou Date: Tue, 14 Jun 2022 17:25:57 +0800 Subject: [PATCH 185/298] net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg The skb_recv_datagram() in ax25_recvmsg() will hold lock_sock and block until it receives a packet from the remote. If the client doesn`t connect to server and calls read() directly, it will not receive any packets forever. As a result, the deadlock will happen. The fail log caused by deadlock is shown below: [ 369.606973] INFO: task ax25_deadlock:157 blocked for more than 245 seconds. [ 369.608919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.613058] Call Trace: [ 369.613315] [ 369.614072] __schedule+0x2f9/0xb20 [ 369.615029] schedule+0x49/0xb0 [ 369.615734] __lock_sock+0x92/0x100 [ 369.616763] ? destroy_sched_domains_rcu+0x20/0x20 [ 369.617941] lock_sock_nested+0x6e/0x70 [ 369.618809] ax25_bind+0xaa/0x210 [ 369.619736] __sys_bind+0xca/0xf0 [ 369.620039] ? do_futex+0xae/0x1b0 [ 369.620387] ? __x64_sys_futex+0x7c/0x1c0 [ 369.620601] ? fpregs_assert_state_consistent+0x19/0x40 [ 369.620613] __x64_sys_bind+0x11/0x20 [ 369.621791] do_syscall_64+0x3b/0x90 [ 369.622423] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 369.623319] RIP: 0033:0x7f43c8aa8af7 [ 369.624301] RSP: 002b:00007f43c8197ef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 [ 369.625756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f43c8aa8af7 [ 369.626724] RDX: 0000000000000010 RSI: 000055768e2021d0 RDI: 0000000000000005 [ 369.628569] RBP: 00007f43c8197f00 R08: 0000000000000011 R09: 00007f43c8198700 [ 369.630208] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff845e6afe [ 369.632240] R13: 00007fff845e6aff R14: 00007f43c8197fc0 R15: 00007f43c8198700 This patch replaces skb_recv_datagram() with an open-coded variant of it releasing the socket lock before the __skb_wait_for_more_packets() call and re-acquiring it after such call in order that other functions that need socket lock could be executed. what's more, the socket lock will be released only when recvmsg() will block and that should produce nicer overall behavior. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Thomas Osterried Signed-off-by: Duoming Zhou Reported-by: Thomas Habets Acked-by: Paolo Abeni Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ax25/af_ax25.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c index 95393bb2760b..4c7030ed8d33 100644 --- a/net/ax25/af_ax25.c +++ b/net/ax25/af_ax25.c @@ -1661,9 +1661,12 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk = sock->sk; - struct sk_buff *skb; + struct sk_buff *skb, *last; + struct sk_buff_head *sk_queue; int copied; int err = 0; + int off = 0; + long timeo; lock_sock(sk); /* @@ -1675,10 +1678,29 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, goto out; } - /* Now we can treat all alike */ - skb = skb_recv_datagram(sk, flags, &err); - if (skb == NULL) - goto out; + /* We need support for non-blocking reads. */ + sk_queue = &sk->sk_receive_queue; + skb = __skb_try_recv_datagram(sk, sk_queue, flags, &off, &err, &last); + /* If no packet is available, release_sock(sk) and try again. */ + if (!skb) { + if (err != -EAGAIN) + goto out; + release_sock(sk); + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (timeo && !__skb_wait_for_more_packets(sk, sk_queue, &err, + &timeo, last)) { + skb = __skb_try_recv_datagram(sk, sk_queue, flags, &off, + &err, &last); + if (skb) + break; + + if (err != -EAGAIN) + goto done; + } + if (!skb) + goto done; + lock_sock(sk); + } if (!sk_to_ax25(sk)->pidincl) skb_pull(skb, 1); /* Remove PID */ @@ -1725,6 +1747,7 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, out: release_sock(sk); +done: return err; } From 27d8fa207835fa5c7cd6f969c6cc94d1123951ee Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Wed, 15 Jun 2022 14:22:38 +0100 Subject: [PATCH 186/298] Revert "arm64: Initialize jump labels before setup_machine_fdt()" This reverts commit 73e2d827a501d48dceeb5b9b267a4cd283d6b1ae. The reverted patch was needed as a fix after commit f5bda35fba61 ("random: use static branch for crng_ready()"). However, this was already fixed by 60e5b2886b92 ("random: do not use jump labels before they are initialized") and hence no longer necessary to initialise jump labels before setup_machine_fdt(). Signed-off-by: Catalin Marinas --- arch/arm64/kernel/setup.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index cf3a759f10d4..fea3223704b6 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -303,14 +303,13 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p) early_fixmap_init(); early_ioremap_init(); - /* - * Initialise the static keys early as they may be enabled by the - * cpufeature code, early parameters, and DT setup. - */ - jump_label_init(); - setup_machine_fdt(__fdt_pointer); + /* + * Initialise the static keys early as they may be enabled by the + * cpufeature code and early parameters. + */ + jump_label_init(); parse_early_param(); /* From ec41c6d82056cbbd7ec8f44eed6d86fea50acf4e Mon Sep 17 00:00:00 2001 From: Michael Carns Date: Wed, 15 Jun 2022 14:25:44 +0200 Subject: [PATCH 187/298] hwmon: (asus-ec-sensors) add missing comma in board name list. This fixes a regression where coma lead to concatenating board names and broke module loading for C8H. Fixes: 5b4285c57b6f ("hwmon: (asus-ec-sensors) fix Formula VIII definition") Signed-off-by: Michael Carns Signed-off-by: Eugene Shalygin Link: https://lore.kernel.org/r/20220615122544.140340-1-eugene.shalygin@gmail.com Signed-off-by: Guenter Roeck --- drivers/hwmon/asus-ec-sensors.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hwmon/asus-ec-sensors.c b/drivers/hwmon/asus-ec-sensors.c index 57e11b2bab74..3633ab691662 100644 --- a/drivers/hwmon/asus-ec-sensors.c +++ b/drivers/hwmon/asus-ec-sensors.c @@ -259,7 +259,7 @@ static const struct ec_board_info board_info[] = { }, { .board_names = { - "ROG CROSSHAIR VIII FORMULA" + "ROG CROSSHAIR VIII FORMULA", "ROG CROSSHAIR VIII HERO", "ROG CROSSHAIR VIII HERO (WI-FI)", }, From 3eefdf9d1e406f3da47470b2854347009ffcb6fa Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 14 Jun 2022 09:09:42 +0100 Subject: [PATCH 188/298] arm64: ftrace: fix branch range checks The branch range checks in ftrace_make_call() and ftrace_make_nop() are incorrect, erroneously permitting a forwards branch of 128M and erroneously rejecting a backwards branch of 128M. This is because both functions calculate the offset backwards, calculating the offset *from* the target *to* the branch, rather than the other way around as the later comparisons expect. If an out-of-range branch were erroeously permitted, this would later be rejected by aarch64_insn_gen_branch_imm() as branch_imm_common() checks the bounds correctly, resulting in warnings and the placement of a BRK instruction. Note that this can only happen for a forwards branch of exactly 128M, and so the caller would need to be exactly 128M bytes below the relevant ftrace trampoline. If an in-range branch were erroeously rejected, then: * For modules when CONFIG_ARM64_MODULE_PLTS=y, this would result in the use of a PLT entry, which is benign. Note that this is the common case, as this is selected by CONFIG_RANDOMIZE_BASE (and therefore RANDOMIZE_MODULE_REGION_FULL), which distributions typically seelct. This is also selected by CONFIG_ARM64_ERRATUM_843419. * For modules when CONFIG_ARM64_MODULE_PLTS=n, this would result in internal ftrace failures. * For core kernel text, this would result in internal ftrace failues. Note that for this to happen, the kernel text would need to be at least 128M bytes in size, and typical configurations are smaller tha this. Fix this by calculating the offset *from* the branch *to* the target in both functions. Fixes: f8af0b364e24 ("arm64: ftrace: don't validate branch via PLT in ftrace_make_nop()") Fixes: e71a4e1bebaf ("arm64: ftrace: add support for far branches to dynamic ftrace") Signed-off-by: Mark Rutland Cc: Ard Biesheuvel Cc: Will Deacon Tested-by: "Ivan T. Ivanov" Reviewed-by: Chengming Zhou Reviewed-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20220614080944.1349146-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/kernel/ftrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c index f447c4a36f69..e1c88234b882 100644 --- a/arch/arm64/kernel/ftrace.c +++ b/arch/arm64/kernel/ftrace.c @@ -84,7 +84,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) { unsigned long pc = rec->ip; u32 old, new; - long offset = (long)pc - (long)addr; + long offset = (long)addr - (long)pc; if (offset < -SZ_128M || offset >= SZ_128M) { struct module *mod; @@ -183,7 +183,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long pc = rec->ip; bool validate = true; u32 old = 0, new; - long offset = (long)pc - (long)addr; + long offset = (long)addr - (long)pc; if (offset < -SZ_128M || offset >= SZ_128M) { u32 replaced; From a6253579977e4c6f7818eeb05bf2bc65678a7187 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 14 Jun 2022 09:09:43 +0100 Subject: [PATCH 189/298] arm64: ftrace: consistently handle PLTs. Sometimes it is necessary to use a PLT entry to call an ftrace trampoline. This is handled by ftrace_make_call() and ftrace_make_nop(), with each having *almost* identical logic, but this is not handled by ftrace_modify_call() since its introduction in commit: 3b23e4991fb66f6d ("arm64: implement ftrace with regs") Due to this, if we ever were to call ftrace_modify_call() for a callsite which requires a PLT entry for a trampoline, then either: a) If the old addr requires a trampoline, ftrace_modify_call() will use an out-of-range address to generate the 'old' branch instruction. This will result in warnings from aarch64_insn_gen_branch_imm() and ftrace_modify_code(), and no instructions will be modified. As ftrace_modify_call() will return an error, this will result in subsequent internal ftrace errors. b) If the old addr does not require a trampoline, but the new addr does, ftrace_modify_call() will use an out-of-range address to generate the 'new' branch instruction. This will result in warnings from aarch64_insn_gen_branch_imm(), and ftrace_modify_code() will replace the 'old' branch with a BRK. This will result in a kernel panic when this BRK is later executed. Practically speaking, case (a) is vastly more likely than case (b), and typically this will result in internal ftrace errors that don't necessarily affect the rest of the system. This can be demonstrated with an out-of-tree test module which triggers ftrace_modify_call(), e.g. | # insmod test_ftrace.ko | test_ftrace: Function test_function raw=0xffffb3749399201c, callsite=0xffffb37493992024 | branch_imm_common: offset out of range | branch_imm_common: offset out of range | ------------[ ftrace bug ]------------ | ftrace failed to modify | [] test_function+0x8/0x38 [test_ftrace] | actual: 1d:00:00:94 | Updating ftrace call site to call a different ftrace function | ftrace record flags: e0000002 | (2) R | expected tramp: ffffb374ae42ed54 | ------------[ cut here ]------------ | WARNING: CPU: 0 PID: 165 at kernel/trace/ftrace.c:2085 ftrace_bug+0x280/0x2b0 | Modules linked in: test_ftrace(+) | CPU: 0 PID: 165 Comm: insmod Not tainted 5.19.0-rc2-00002-g4d9ead8b45ce #13 | Hardware name: linux,dummy-virt (DT) | pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : ftrace_bug+0x280/0x2b0 | lr : ftrace_bug+0x280/0x2b0 | sp : ffff80000839ba00 | x29: ffff80000839ba00 x28: 0000000000000000 x27: ffff80000839bcf0 | x26: ffffb37493994180 x25: ffffb374b0991c28 x24: ffffb374b0d70000 | x23: 00000000ffffffea x22: ffffb374afcc33b0 x21: ffffb374b08f9cc8 | x20: ffff572b8462c000 x19: ffffb374b08f9000 x18: ffffffffffffffff | x17: 6c6c6163202c6331 x16: ffffb374ae5ad110 x15: ffffb374b0d51ee4 | x14: 0000000000000000 x13: 3435646532346561 x12: 3437336266666666 | x11: 203a706d61727420 x10: 6465746365707865 x9 : ffffb374ae5149e8 | x8 : 336266666666203a x7 : 706d617274206465 x6 : 00000000fffff167 | x5 : ffff572bffbc4a08 x4 : 00000000fffff167 x3 : 0000000000000000 | x2 : 0000000000000000 x1 : ffff572b84461e00 x0 : 0000000000000022 | Call trace: | ftrace_bug+0x280/0x2b0 | ftrace_replace_code+0x98/0xa0 | ftrace_modify_all_code+0xe0/0x144 | arch_ftrace_update_code+0x14/0x20 | ftrace_startup+0xf8/0x1b0 | register_ftrace_function+0x38/0x90 | test_ftrace_init+0xd0/0x1000 [test_ftrace] | do_one_initcall+0x50/0x2b0 | do_init_module+0x50/0x1f0 | load_module+0x17c8/0x1d64 | __do_sys_finit_module+0xa8/0x100 | __arm64_sys_finit_module+0x2c/0x3c | invoke_syscall+0x50/0x120 | el0_svc_common.constprop.0+0xdc/0x100 | do_el0_svc+0x3c/0xd0 | el0_svc+0x34/0xb0 | el0t_64_sync_handler+0xbc/0x140 | el0t_64_sync+0x18c/0x190 | ---[ end trace 0000000000000000 ]--- We can solve this by consistently determining whether to use a PLT entry for an address. Note that since (the earlier) commit: f1a54ae9af0da4d7 ("arm64: module/ftrace: intialize PLT at load time") ... we can consistently determine the PLT address that a given callsite will use, and therefore ftrace_make_nop() does not need to skip validation when a PLT is in use. This patch factors the existing logic out of ftrace_make_call() and ftrace_make_nop() into a common ftrace_find_callable_addr() helper function, which is used by ftrace_make_call(), ftrace_make_nop(), and ftrace_modify_call(). In ftrace_make_nop() the patching is consistently validated by ftrace_modify_code() as we can always determine what the old instruction should have been. Fixes: 3b23e4991fb6 ("arm64: implement ftrace with regs") Signed-off-by: Mark Rutland Cc: Ard Biesheuvel Cc: Will Deacon Tested-by: "Ivan T. Ivanov" Reviewed-by: Chengming Zhou Reviewed-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20220614080944.1349146-3-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/kernel/ftrace.c | 147 ++++++++++++++++++------------------- 1 file changed, 71 insertions(+), 76 deletions(-) diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c index e1c88234b882..ea5dc7c90f46 100644 --- a/arch/arm64/kernel/ftrace.c +++ b/arch/arm64/kernel/ftrace.c @@ -77,6 +77,66 @@ static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr) return NULL; } +/* + * Find the address the callsite must branch to in order to reach '*addr'. + * + * Due to the limited range of 'BL' instructions, modules may be placed too far + * away to branch directly and must use a PLT. + * + * Returns true when '*addr' contains a reachable target address, or has been + * modified to contain a PLT address. Returns false otherwise. + */ +static bool ftrace_find_callable_addr(struct dyn_ftrace *rec, + struct module *mod, + unsigned long *addr) +{ + unsigned long pc = rec->ip; + long offset = (long)*addr - (long)pc; + struct plt_entry *plt; + + /* + * When the target is within range of the 'BL' instruction, use 'addr' + * as-is and branch to that directly. + */ + if (offset >= -SZ_128M && offset < SZ_128M) + return true; + + /* + * When the target is outside of the range of a 'BL' instruction, we + * must use a PLT to reach it. We can only place PLTs for modules, and + * only when module PLT support is built-in. + */ + if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) + return false; + + /* + * 'mod' is only set at module load time, but if we end up + * dealing with an out-of-range condition, we can assume it + * is due to a module being loaded far away from the kernel. + * + * NOTE: __module_text_address() must be called with preemption + * disabled, but we can rely on ftrace_lock to ensure that 'mod' + * retains its validity throughout the remainder of this code. + */ + if (!mod) { + preempt_disable(); + mod = __module_text_address(pc); + preempt_enable(); + } + + if (WARN_ON(!mod)) + return false; + + plt = get_ftrace_plt(mod, *addr); + if (!plt) { + pr_err("ftrace: no module PLT for %ps\n", (void *)*addr); + return false; + } + + *addr = (unsigned long)plt; + return true; +} + /* * Turn on the call to ftrace_caller() in instrumented function */ @@ -84,40 +144,9 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) { unsigned long pc = rec->ip; u32 old, new; - long offset = (long)addr - (long)pc; - if (offset < -SZ_128M || offset >= SZ_128M) { - struct module *mod; - struct plt_entry *plt; - - if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) - return -EINVAL; - - /* - * On kernels that support module PLTs, the offset between the - * branch instruction and its target may legally exceed the - * range of an ordinary relative 'bl' opcode. In this case, we - * need to branch via a trampoline in the module. - * - * NOTE: __module_text_address() must be called with preemption - * disabled, but we can rely on ftrace_lock to ensure that 'mod' - * retains its validity throughout the remainder of this code. - */ - preempt_disable(); - mod = __module_text_address(pc); - preempt_enable(); - - if (WARN_ON(!mod)) - return -EINVAL; - - plt = get_ftrace_plt(mod, addr); - if (!plt) { - pr_err("ftrace: no module PLT for %ps\n", (void *)addr); - return -EINVAL; - } - - addr = (unsigned long)plt; - } + if (!ftrace_find_callable_addr(rec, NULL, &addr)) + return -EINVAL; old = aarch64_insn_gen_nop(); new = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK); @@ -132,6 +161,11 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, unsigned long pc = rec->ip; u32 old, new; + if (!ftrace_find_callable_addr(rec, NULL, &old_addr)) + return -EINVAL; + if (!ftrace_find_callable_addr(rec, NULL, &addr)) + return -EINVAL; + old = aarch64_insn_gen_branch_imm(pc, old_addr, AARCH64_INSN_BRANCH_LINK); new = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK); @@ -181,54 +215,15 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr) { unsigned long pc = rec->ip; - bool validate = true; u32 old = 0, new; - long offset = (long)addr - (long)pc; - if (offset < -SZ_128M || offset >= SZ_128M) { - u32 replaced; - - if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) - return -EINVAL; - - /* - * 'mod' is only set at module load time, but if we end up - * dealing with an out-of-range condition, we can assume it - * is due to a module being loaded far away from the kernel. - */ - if (!mod) { - preempt_disable(); - mod = __module_text_address(pc); - preempt_enable(); - - if (WARN_ON(!mod)) - return -EINVAL; - } - - /* - * The instruction we are about to patch may be a branch and - * link instruction that was redirected via a PLT entry. In - * this case, the normal validation will fail, but we can at - * least check that we are dealing with a branch and link - * instruction that points into the right module. - */ - if (aarch64_insn_read((void *)pc, &replaced)) - return -EFAULT; - - if (!aarch64_insn_is_bl(replaced) || - !within_module(pc + aarch64_get_branch_offset(replaced), - mod)) - return -EINVAL; - - validate = false; - } else { - old = aarch64_insn_gen_branch_imm(pc, addr, - AARCH64_INSN_BRANCH_LINK); - } + if (!ftrace_find_callable_addr(rec, mod, &addr)) + return -EINVAL; + old = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK); new = aarch64_insn_gen_nop(); - return ftrace_modify_code(pc, old, new, validate); + return ftrace_modify_code(pc, old, new, true); } void arch_ftrace_update_code(int command) From 0d8116ccd83b7e5384cf04de570ae19771e8a3d0 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 14 Jun 2022 09:09:44 +0100 Subject: [PATCH 190/298] arm64: ftrace: remove redundant label Since commit: c4a0ebf87cebbfa2 ("arm64/ftrace: Make function graph use ftrace directly") The 'ftrace_common_return' label has been unused. Remove it. Signed-off-by: Mark Rutland Cc: Chengming Zhou Cc: Will Deacon Tested-by: "Ivan T. Ivanov" Reviewed-by: Chengming Zhou Reviewed-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20220614080944.1349146-4-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/kernel/entry-ftrace.S | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S index d42a205ef625..bd5df50e4643 100644 --- a/arch/arm64/kernel/entry-ftrace.S +++ b/arch/arm64/kernel/entry-ftrace.S @@ -102,7 +102,6 @@ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) * x19-x29 per the AAPCS, and we created frame records upon entry, so we need * to restore x0-x8, x29, and x30. */ -ftrace_common_return: /* Restore function arguments */ ldp x0, x1, [sp] ldp x2, x3, [sp, #S_X2] From 10eb3a0d517fcc83eeea4242c149461205675eb4 Mon Sep 17 00:00:00 2001 From: Benjamin Marzinski Date: Tue, 14 Jun 2022 11:10:28 -0500 Subject: [PATCH 191/298] dm: fix race in dm_start_io_acct After commit 82f6cdcc3676c ("dm: switch dm_io booleans over to proper flags") dm_start_io_acct stopped atomically checking and setting was_accounted, which turned into the DM_IO_ACCOUNTED flag. This opened the possibility for a race where IO accounting is started twice for duplicate bios. To remove the race, check the flag while holding the io->lock. Fixes: 82f6cdcc3676c ("dm: switch dm_io booleans over to proper flags") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski Signed-off-by: Mike Snitzer --- drivers/md/dm.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index d8f16183bf27..d5e6d33700e5 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -555,6 +555,10 @@ static void dm_start_io_acct(struct dm_io *io, struct bio *clone) unsigned long flags; /* Can afford locking given DM_TIO_IS_DUPLICATE_BIO */ spin_lock_irqsave(&io->lock, flags); + if (dm_io_flagged(io, DM_IO_ACCOUNTED)) { + spin_unlock_irqrestore(&io->lock, flags); + return; + } dm_io_set_flag(io, DM_IO_ACCOUNTED); spin_unlock_irqrestore(&io->lock, flags); } From d0a180341fe00cd0bd1cc259d196dc255c13f229 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 7 Jun 2022 10:03:56 +0800 Subject: [PATCH 192/298] Revert "md: don't unregister sync_thread with reconfig_mutex held" The 07reshape5intr test is broke because of below path. md_reap_sync_thread -> mddev_unlock -> md_unregister_thread(&mddev->sync_thread) And md_check_recovery is triggered by, mddev_unlock -> md_wakeup_thread(mddev->thread) then mddev->reshape_position is set to MaxSector in raid5_finish_reshape since MD_RECOVERY_INTR is cleared in md_check_recovery, which means feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's reshape_position can't be updated accordingly. Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") Reported-by: Logan Gunthorpe Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/dm-raid.c | 2 +- drivers/md/md.c | 14 +++++--------- drivers/md/md.h | 2 +- 3 files changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 5e41fbae3f6b..9526ccbedafb 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3725,7 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv, if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) { if (mddev->sync_thread) { set_bit(MD_RECOVERY_INTR, &mddev->recovery); - md_reap_sync_thread(mddev, false); + md_reap_sync_thread(mddev); } } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle) return -EBUSY; diff --git a/drivers/md/md.c b/drivers/md/md.c index 8273ac5eef06..c7ecb0bffda0 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4831,7 +4831,7 @@ action_store(struct mddev *mddev, const char *page, size_t len) flush_workqueue(md_misc_wq); if (mddev->sync_thread) { set_bit(MD_RECOVERY_INTR, &mddev->recovery); - md_reap_sync_thread(mddev, true); + md_reap_sync_thread(mddev); } mddev_unlock(mddev); } @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev) flush_workqueue(md_misc_wq); if (mddev->sync_thread) { set_bit(MD_RECOVERY_INTR, &mddev->recovery); - md_reap_sync_thread(mddev, true); + md_reap_sync_thread(mddev); } del_timer_sync(&mddev->safemode_timer); @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev) * ->spare_active and clear saved_raid_disk */ set_bit(MD_RECOVERY_INTR, &mddev->recovery); - md_reap_sync_thread(mddev, true); + md_reap_sync_thread(mddev); clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags); @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev) goto unlock; } if (mddev->sync_thread) { - md_reap_sync_thread(mddev, true); + md_reap_sync_thread(mddev); goto unlock; } /* Set RUNNING before clearing NEEDED to avoid @@ -9411,18 +9411,14 @@ void md_check_recovery(struct mddev *mddev) } EXPORT_SYMBOL(md_check_recovery); -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held) +void md_reap_sync_thread(struct mddev *mddev) { struct md_rdev *rdev; sector_t old_dev_sectors = mddev->dev_sectors; bool is_reshaped = false; - if (reconfig_mutex_held) - mddev_unlock(mddev); /* resync has finished, collect result */ md_unregister_thread(&mddev->sync_thread); - if (reconfig_mutex_held) - mddev_lock_nointr(mddev); if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) && !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) && mddev->degraded != mddev->raid_disks) { diff --git a/drivers/md/md.h b/drivers/md/md.h index 5f62c46ac2d3..cf2cbb17acbd 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -719,7 +719,7 @@ extern struct md_thread *md_register_thread( extern void md_unregister_thread(struct md_thread **threadp); extern void md_wakeup_thread(struct md_thread *thread); extern void md_check_recovery(struct mddev *mddev); -extern void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held); +extern void md_reap_sync_thread(struct mddev *mddev); extern int mddev_init_writes_pending(struct mddev *mddev); extern bool md_write_start(struct mddev *mddev, struct bio *bi); extern void md_write_inc(struct mddev *mddev, struct bio *bi); From f34fdcd4a0e7a0b92340ad7e48e7bcff9393fab5 Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Wed, 8 Jun 2022 10:27:46 -0600 Subject: [PATCH 193/298] md/raid5-ppl: Fix argument order in bio_alloc_bioset() bio_alloc_bioset() takes a block device, number of vectors, the OP flags, the GFP mask and the bio set. However when the prototype was changed, the callisite in ppl_do_flush() had the OP flags and the GFP flags reversed. This introduced some sparse error: drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3 (different base types) drivers/md/raid5-ppl.c:632:57: expected unsigned int opf drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype] drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4 (different base types) drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype] gfp_mask drivers/md/raid5-ppl.c:633:61: got unsigned long long The sparse error introduction may not have been reported correctly by 0day due to other work that was cleaning up other sparse errors in this area. Fixes: 609be1066731 ("block: pass a block_device and opf to bio_alloc_bioset") Cc: stable@vger.kernel.org # 5.18+ Signed-off-by: Logan Gunthorpe Reviewed-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/raid5-ppl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c index 973e2e06f19c..0a2e4806b1ec 100644 --- a/drivers/md/raid5-ppl.c +++ b/drivers/md/raid5-ppl.c @@ -629,9 +629,9 @@ static void ppl_do_flush(struct ppl_io_unit *io) if (bdev) { struct bio *bio; - bio = bio_alloc_bioset(bdev, 0, GFP_NOIO, + bio = bio_alloc_bioset(bdev, 0, REQ_OP_WRITE | REQ_PREFLUSH, - &ppl_conf->flush_bs); + GFP_NOIO, &ppl_conf->flush_bs); bio->bi_private = io; bio->bi_end_io = ppl_flush_endio; From 60428d8bc27f52e8f1540f98e1b6ef0156d43f0d Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Tue, 14 Jun 2022 15:01:33 +0300 Subject: [PATCH 194/298] x86/tdx: Fix early #VE handling tdx_early_handle_ve() does not increment RIP after successfully handling the exception. That leads to infinite loop of exceptions. Move RIP when exceptions are successfully handled. [ dhansen: make problem statement more clear ] Fixes: 32e72854fa5f ("x86/tdx: Port I/O: Add early boot support") Signed-off-by: Kirill A. Shutemov Signed-off-by: Dave Hansen Reviewed-by: Kuppuswamy Sathyanarayanan Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com --- arch/x86/coco/tdx/tdx.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 03deb4d6920d..faae53f8d559 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -447,13 +447,17 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual) __init bool tdx_early_handle_ve(struct pt_regs *regs) { struct ve_info ve; + bool ret; tdx_get_ve_info(&ve); if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION) return false; - return handle_io(regs, ve.exit_qual); + ret = handle_io(regs, ve.exit_qual); + if (ret) + regs->ip += ve.instr_len; + return ret; } void tdx_get_ve_info(struct ve_info *ve) From cdd85786f4b3b9273e4376e69aa95a2d71722764 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Tue, 14 Jun 2022 15:01:34 +0300 Subject: [PATCH 195/298] x86/tdx: Clarify RIP adjustments in #VE handler After successful #VE handling, tdx_handle_virt_exception() has to move RIP to the next instruction. The handler needs to know the length of the instruction. If the #VE happened due to instruction execution, the GET_VEINFO TDX module call provides info on the instruction in R10, including its length. For #VE due to EPT violation, the info in R10 is not populand and the kernel must decode the instruction manually to find out its length. Restructure the code to make it explicit that the instruction length depends on the type of #VE. Make individual #VE handlers return the instruction length on success or -errno on failure. [ dhansen: fix up changelog and comments ] Suggested-by: Dave Hansen Signed-off-by: Kirill A. Shutemov Signed-off-by: Dave Hansen Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com --- arch/x86/coco/tdx/tdx.c | 178 +++++++++++++++++++++++++++------------- 1 file changed, 123 insertions(+), 55 deletions(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index faae53f8d559..c8d44f463283 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -124,6 +124,51 @@ static u64 get_cc_mask(void) return BIT_ULL(gpa_width - 1); } +/* + * The TDX module spec states that #VE may be injected for a limited set of + * reasons: + * + * - Emulation of the architectural #VE injection on EPT violation; + * + * - As a result of guest TD execution of a disallowed instruction, + * a disallowed MSR access, or CPUID virtualization; + * + * - A notification to the guest TD about anomalous behavior; + * + * The last one is opt-in and is not used by the kernel. + * + * The Intel Software Developer's Manual describes cases when instruction + * length field can be used in section "Information for VM Exits Due to + * Instruction Execution". + * + * For TDX, it ultimately means GET_VEINFO provides reliable instruction length + * information if #VE occurred due to instruction execution, but not for EPT + * violations. + */ +static int ve_instr_len(struct ve_info *ve) +{ + switch (ve->exit_reason) { + case EXIT_REASON_HLT: + case EXIT_REASON_MSR_READ: + case EXIT_REASON_MSR_WRITE: + case EXIT_REASON_CPUID: + case EXIT_REASON_IO_INSTRUCTION: + /* It is safe to use ve->instr_len for #VE due instructions */ + return ve->instr_len; + case EXIT_REASON_EPT_VIOLATION: + /* + * For EPT violations, ve->insn_len is not defined. For those, + * the kernel must decode instructions manually and should not + * be using this function. + */ + WARN_ONCE(1, "ve->instr_len is not defined for EPT violations"); + return 0; + default: + WARN_ONCE(1, "Unexpected #VE-type: %lld\n", ve->exit_reason); + return ve->instr_len; + } +} + static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) { struct tdx_hypercall_args args = { @@ -147,7 +192,7 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0); } -static bool handle_halt(void) +static int handle_halt(struct ve_info *ve) { /* * Since non safe halt is mainly used in CPU offlining @@ -158,9 +203,9 @@ static bool handle_halt(void) const bool do_sti = false; if (__halt(irq_disabled, do_sti)) - return false; + return -EIO; - return true; + return ve_instr_len(ve); } void __cpuidle tdx_safe_halt(void) @@ -180,7 +225,7 @@ void __cpuidle tdx_safe_halt(void) WARN_ONCE(1, "HLT instruction emulation failed\n"); } -static bool read_msr(struct pt_regs *regs) +static int read_msr(struct pt_regs *regs, struct ve_info *ve) { struct tdx_hypercall_args args = { .r10 = TDX_HYPERCALL_STANDARD, @@ -194,14 +239,14 @@ static bool read_msr(struct pt_regs *regs) * (GHCI), section titled "TDG.VP.VMCALL". */ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) - return false; + return -EIO; regs->ax = lower_32_bits(args.r11); regs->dx = upper_32_bits(args.r11); - return true; + return ve_instr_len(ve); } -static bool write_msr(struct pt_regs *regs) +static int write_msr(struct pt_regs *regs, struct ve_info *ve) { struct tdx_hypercall_args args = { .r10 = TDX_HYPERCALL_STANDARD, @@ -215,10 +260,13 @@ static bool write_msr(struct pt_regs *regs) * can be found in TDX Guest-Host-Communication Interface * (GHCI) section titled "TDG.VP.VMCALL". */ - return !__tdx_hypercall(&args, 0); + if (__tdx_hypercall(&args, 0)) + return -EIO; + + return ve_instr_len(ve); } -static bool handle_cpuid(struct pt_regs *regs) +static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve) { struct tdx_hypercall_args args = { .r10 = TDX_HYPERCALL_STANDARD, @@ -236,7 +284,7 @@ static bool handle_cpuid(struct pt_regs *regs) */ if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) { regs->ax = regs->bx = regs->cx = regs->dx = 0; - return true; + return ve_instr_len(ve); } /* @@ -245,7 +293,7 @@ static bool handle_cpuid(struct pt_regs *regs) * (GHCI), section titled "VP.VMCALL". */ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) - return false; + return -EIO; /* * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of @@ -257,7 +305,7 @@ static bool handle_cpuid(struct pt_regs *regs) regs->cx = args.r14; regs->dx = args.r15; - return true; + return ve_instr_len(ve); } static bool mmio_read(int size, unsigned long addr, unsigned long *val) @@ -283,7 +331,7 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val) EPT_WRITE, addr, val); } -static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) +static int handle_mmio(struct pt_regs *regs, struct ve_info *ve) { char buffer[MAX_INSN_SIZE]; unsigned long *reg, val; @@ -294,34 +342,36 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) /* Only in-kernel MMIO is supported */ if (WARN_ON_ONCE(user_mode(regs))) - return false; + return -EFAULT; if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE)) - return false; + return -EFAULT; if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64)) - return false; + return -EINVAL; mmio = insn_decode_mmio(&insn, &size); if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED)) - return false; + return -EINVAL; if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) { reg = insn_get_modrm_reg_ptr(&insn, regs); if (!reg) - return false; + return -EINVAL; } - ve->instr_len = insn.length; - /* Handle writes first */ switch (mmio) { case MMIO_WRITE: memcpy(&val, reg, size); - return mmio_write(size, ve->gpa, val); + if (!mmio_write(size, ve->gpa, val)) + return -EIO; + return insn.length; case MMIO_WRITE_IMM: val = insn.immediate.value; - return mmio_write(size, ve->gpa, val); + if (!mmio_write(size, ve->gpa, val)) + return -EIO; + return insn.length; case MMIO_READ: case MMIO_READ_ZERO_EXTEND: case MMIO_READ_SIGN_EXTEND: @@ -334,15 +384,15 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) * decoded or handled properly. It was likely not using io.h * helpers or accessed MMIO accidentally. */ - return false; + return -EINVAL; default: WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?"); - return false; + return -EINVAL; } /* Handle reads */ if (!mmio_read(size, ve->gpa, &val)) - return false; + return -EIO; switch (mmio) { case MMIO_READ: @@ -364,13 +414,13 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) default: /* All other cases has to be covered with the first switch() */ WARN_ON_ONCE(1); - return false; + return -EINVAL; } if (extend_size) memset(reg, extend_val, extend_size); memcpy(reg, &val, size); - return true; + return insn.length; } static bool handle_in(struct pt_regs *regs, int size, int port) @@ -421,13 +471,14 @@ static bool handle_out(struct pt_regs *regs, int size, int port) * * Return True on success or False on failure. */ -static bool handle_io(struct pt_regs *regs, u32 exit_qual) +static int handle_io(struct pt_regs *regs, struct ve_info *ve) { + u32 exit_qual = ve->exit_qual; int size, port; - bool in; + bool in, ret; if (VE_IS_IO_STRING(exit_qual)) - return false; + return -EIO; in = VE_IS_IO_IN(exit_qual); size = VE_GET_IO_SIZE(exit_qual); @@ -435,9 +486,13 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual) if (in) - return handle_in(regs, size, port); + ret = handle_in(regs, size, port); else - return handle_out(regs, size, port); + ret = handle_out(regs, size, port); + if (!ret) + return -EIO; + + return ve_instr_len(ve); } /* @@ -447,17 +502,19 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual) __init bool tdx_early_handle_ve(struct pt_regs *regs) { struct ve_info ve; - bool ret; + int insn_len; tdx_get_ve_info(&ve); if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION) return false; - ret = handle_io(regs, ve.exit_qual); - if (ret) - regs->ip += ve.instr_len; - return ret; + insn_len = handle_io(regs, &ve); + if (insn_len < 0) + return false; + + regs->ip += insn_len; + return true; } void tdx_get_ve_info(struct ve_info *ve) @@ -490,54 +547,65 @@ void tdx_get_ve_info(struct ve_info *ve) ve->instr_info = upper_32_bits(out.r10); } -/* Handle the user initiated #VE */ -static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve) +/* + * Handle the user initiated #VE. + * + * On success, returns the number of bytes RIP should be incremented (>=0) + * or -errno on error. + */ +static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve) { switch (ve->exit_reason) { case EXIT_REASON_CPUID: - return handle_cpuid(regs); + return handle_cpuid(regs, ve); default: pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); - return false; + return -EIO; } } -/* Handle the kernel #VE */ -static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) +/* + * Handle the kernel #VE. + * + * On success, returns the number of bytes RIP should be incremented (>=0) + * or -errno on error. + */ +static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) { switch (ve->exit_reason) { case EXIT_REASON_HLT: - return handle_halt(); + return handle_halt(ve); case EXIT_REASON_MSR_READ: - return read_msr(regs); + return read_msr(regs, ve); case EXIT_REASON_MSR_WRITE: - return write_msr(regs); + return write_msr(regs, ve); case EXIT_REASON_CPUID: - return handle_cpuid(regs); + return handle_cpuid(regs, ve); case EXIT_REASON_EPT_VIOLATION: return handle_mmio(regs, ve); case EXIT_REASON_IO_INSTRUCTION: - return handle_io(regs, ve->exit_qual); + return handle_io(regs, ve); default: pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); - return false; + return -EIO; } } bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve) { - bool ret; + int insn_len; if (user_mode(regs)) - ret = virt_exception_user(regs, ve); + insn_len = virt_exception_user(regs, ve); else - ret = virt_exception_kernel(regs, ve); + insn_len = virt_exception_kernel(regs, ve); + if (insn_len < 0) + return false; /* After successful #VE handling, move the IP */ - if (ret) - regs->ip += ve->instr_len; + regs->ip += insn_len; - return ret; + return true; } static bool tdx_tlb_flush_required(bool private) From 49d6a3c062a1026a5ba957c46f3603c372288ab6 Mon Sep 17 00:00:00 2001 From: Tianyu Lan Date: Mon, 13 Jun 2022 21:45:53 -0400 Subject: [PATCH 196/298] x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM Hyper-V Isolation VM current code uses sev_es_ghcb_hv_call() to read/write MSR via GHCB page and depends on the sev code. This may cause regression when sev code changes interface design. The latest SEV-ES code requires to negotiate GHCB version before reading/writing MSR via GHCB page and sev_es_ghcb_hv_call() doesn't work for Hyper-V Isolation VM. Add Hyper-V ghcb related implementation to decouple SEV and Hyper-V code. Negotiate GHCB version in the hyperv_init() and use the version to communicate with Hyper-V in the ghcb hv call function. Fixes: 2ea29c5abbc2 ("x86/sev: Save the negotiated GHCB version") Signed-off-by: Tianyu Lan Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com Signed-off-by: Wei Liu --- arch/x86/hyperv/hv_init.c | 6 +++ arch/x86/hyperv/ivm.c | 84 ++++++++++++++++++++++++++++++--- arch/x86/include/asm/mshyperv.h | 4 ++ 3 files changed, 88 insertions(+), 6 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 8b392b6b7b93..3de6d8b53367 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -405,6 +406,11 @@ void __init hyperv_init(void) } if (hv_isolation_type_snp()) { + /* Negotiate GHCB Version. */ + if (!hv_ghcb_negotiate_protocol()) + hv_ghcb_terminate(SEV_TERM_SET_GEN, + GHCB_SEV_ES_PROT_UNSUPPORTED); + hv_ghcb_pg = alloc_percpu(union hv_ghcb *); if (!hv_ghcb_pg) goto free_vp_assist_page; diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c index 2b994117581e..1dbcbd9da74d 100644 --- a/arch/x86/hyperv/ivm.c +++ b/arch/x86/hyperv/ivm.c @@ -53,6 +53,8 @@ union hv_ghcb { } hypercall; } __packed __aligned(HV_HYP_PAGE_SIZE); +static u16 hv_ghcb_version __ro_after_init; + u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size) { union hv_ghcb *hv_ghcb; @@ -96,12 +98,85 @@ u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size) return status; } +static inline u64 rd_ghcb_msr(void) +{ + return __rdmsr(MSR_AMD64_SEV_ES_GHCB); +} + +static inline void wr_ghcb_msr(u64 val) +{ + native_wrmsrl(MSR_AMD64_SEV_ES_GHCB, val); +} + +static enum es_result hv_ghcb_hv_call(struct ghcb *ghcb, u64 exit_code, + u64 exit_info_1, u64 exit_info_2) +{ + /* Fill in protocol and format specifiers */ + ghcb->protocol_version = hv_ghcb_version; + ghcb->ghcb_usage = GHCB_DEFAULT_USAGE; + + ghcb_set_sw_exit_code(ghcb, exit_code); + ghcb_set_sw_exit_info_1(ghcb, exit_info_1); + ghcb_set_sw_exit_info_2(ghcb, exit_info_2); + + VMGEXIT(); + + if (ghcb->save.sw_exit_info_1 & GENMASK_ULL(31, 0)) + return ES_VMM_ERROR; + else + return ES_OK; +} + +void hv_ghcb_terminate(unsigned int set, unsigned int reason) +{ + u64 val = GHCB_MSR_TERM_REQ; + + /* Tell the hypervisor what went wrong. */ + val |= GHCB_SEV_TERM_REASON(set, reason); + + /* Request Guest Termination from Hypvervisor */ + wr_ghcb_msr(val); + VMGEXIT(); + + while (true) + asm volatile("hlt\n" : : : "memory"); +} + +bool hv_ghcb_negotiate_protocol(void) +{ + u64 ghcb_gpa; + u64 val; + + /* Save ghcb page gpa. */ + ghcb_gpa = rd_ghcb_msr(); + + /* Do the GHCB protocol version negotiation */ + wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ); + VMGEXIT(); + val = rd_ghcb_msr(); + + if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP) + return false; + + if (GHCB_MSR_PROTO_MAX(val) < GHCB_PROTOCOL_MIN || + GHCB_MSR_PROTO_MIN(val) > GHCB_PROTOCOL_MAX) + return false; + + hv_ghcb_version = min_t(size_t, GHCB_MSR_PROTO_MAX(val), + GHCB_PROTOCOL_MAX); + + /* Write ghcb page back after negotiating protocol. */ + wr_ghcb_msr(ghcb_gpa); + VMGEXIT(); + + return true; +} + void hv_ghcb_msr_write(u64 msr, u64 value) { union hv_ghcb *hv_ghcb; void **ghcb_base; unsigned long flags; - struct es_em_ctxt ctxt; if (!hv_ghcb_pg) return; @@ -120,8 +195,7 @@ void hv_ghcb_msr_write(u64 msr, u64 value) ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value)); ghcb_set_rdx(&hv_ghcb->ghcb, upper_32_bits(value)); - if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, false, &ctxt, - SVM_EXIT_MSR, 1, 0)) + if (hv_ghcb_hv_call(&hv_ghcb->ghcb, SVM_EXIT_MSR, 1, 0)) pr_warn("Fail to write msr via ghcb %llx.\n", msr); local_irq_restore(flags); @@ -133,7 +207,6 @@ void hv_ghcb_msr_read(u64 msr, u64 *value) union hv_ghcb *hv_ghcb; void **ghcb_base; unsigned long flags; - struct es_em_ctxt ctxt; /* Check size of union hv_ghcb here. */ BUILD_BUG_ON(sizeof(union hv_ghcb) != HV_HYP_PAGE_SIZE); @@ -152,8 +225,7 @@ void hv_ghcb_msr_read(u64 msr, u64 *value) } ghcb_set_rcx(&hv_ghcb->ghcb, msr); - if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, false, &ctxt, - SVM_EXIT_MSR, 0, 0)) + if (hv_ghcb_hv_call(&hv_ghcb->ghcb, SVM_EXIT_MSR, 0, 0)) pr_warn("Fail to read msr via ghcb %llx.\n", msr); else *value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax) diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index a82f603d4312..61f0c206bff0 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -179,9 +179,13 @@ int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible); #ifdef CONFIG_AMD_MEM_ENCRYPT void hv_ghcb_msr_write(u64 msr, u64 value); void hv_ghcb_msr_read(u64 msr, u64 *value); +bool hv_ghcb_negotiate_protocol(void); +void hv_ghcb_terminate(unsigned int set, unsigned int reason); #else static inline void hv_ghcb_msr_write(u64 msr, u64 value) {} static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {} +static inline bool hv_ghcb_negotiate_protocol(void) { return false; } +static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {} #endif extern bool hv_isolation_type_snp(void); From 6a1c3767d82ed8233de1263aa7da81595e176087 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 12 Jun 2022 02:22:30 +0900 Subject: [PATCH 197/298] certs/blacklist_hashes.c: fix const confusion in certs blacklist MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This file fails to compile as follows: CC certs/blacklist_hashes.o certs/blacklist_hashes.c:4:1: error: ignoring attribute ‘section (".init.data")’ because it conflicts with previous ‘section (".init.rodata")’ [-Werror=attributes] 4 | const char __initdata *const blacklist_hashes[] = { | ^~~~~ In file included from certs/blacklist_hashes.c:2: certs/blacklist.h:5:38: note: previous declaration here 5 | extern const char __initconst *const blacklist_hashes[]; | ^~~~~~~~~~~~~~~~ Apply the same fix as commit 2be04df5668d ("certs/blacklist_nohashes.c: fix const confusion in certs blacklist"). Fixes: 734114f8782f ("KEYS: Add a system blacklist keyring") Signed-off-by: Masahiro Yamada Reviewed-by: Jarkko Sakkinen Reviewed-by: Mickaël Salaün Signed-off-by: Jarkko Sakkinen --- certs/blacklist_hashes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/certs/blacklist_hashes.c b/certs/blacklist_hashes.c index 344892337be0..d5961aa3d338 100644 --- a/certs/blacklist_hashes.c +++ b/certs/blacklist_hashes.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include "blacklist.h" -const char __initdata *const blacklist_hashes[] = { +const char __initconst *const blacklist_hashes[] = { #include CONFIG_SYSTEM_BLACKLIST_HASH_LIST , NULL }; From 27b5b22d252c6d71a2a37a4bdf18d0be6d25ee5a Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 12 Jun 2022 02:22:31 +0900 Subject: [PATCH 198/298] certs: fix and refactor CONFIG_SYSTEM_BLACKLIST_HASH_LIST build MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit addf466389d9 ("certs: Check that builtin blacklist hashes are valid") was applied 8 months after the submission. In the meantime, the base code had been removed by commit b8c96a6b466c ("certs: simplify $(srctree)/ handling and remove config_filename macro"). Fix the Makefile. Create a local copy of $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST). It is included from certs/blacklist_hashes.c and also works as a timestamp. Send error messages from check-blacklist-hashes.awk to stderr instead of stdout. Fixes: addf466389d9 ("certs: Check that builtin blacklist hashes are valid") Signed-off-by: Masahiro Yamada Reviewed-by: Jarkko Sakkinen Reviewed-by: Mickaël Salaün Signed-off-by: Jarkko Sakkinen --- certs/.gitignore | 2 +- certs/Makefile | 20 ++++++++++---------- certs/blacklist_hashes.c | 2 +- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/certs/.gitignore b/certs/.gitignore index 56637aceaf81..cec5465f31c1 100644 --- a/certs/.gitignore +++ b/certs/.gitignore @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only -/blacklist_hashes_checked +/blacklist_hash_list /extract-cert /x509_certificate_list /x509_revocation_list diff --git a/certs/Makefile b/certs/Makefile index cb1a9da3fc58..a8d628fd5f7b 100644 --- a/certs/Makefile +++ b/certs/Makefile @@ -7,22 +7,22 @@ obj-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += system_keyring.o system_certificates.o c obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist.o common.o obj-$(CONFIG_SYSTEM_REVOCATION_LIST) += revocation_certificates.o ifneq ($(CONFIG_SYSTEM_BLACKLIST_HASH_LIST),) -quiet_cmd_check_blacklist_hashes = CHECK $(patsubst "%",%,$(2)) - cmd_check_blacklist_hashes = $(AWK) -f $(srctree)/scripts/check-blacklist-hashes.awk $(2); touch $@ -$(eval $(call config_filename,SYSTEM_BLACKLIST_HASH_LIST)) +$(obj)/blacklist_hashes.o: $(obj)/blacklist_hash_list +CFLAGS_blacklist_hashes.o := -I $(obj) -$(obj)/blacklist_hashes.o: $(obj)/blacklist_hashes_checked +quiet_cmd_check_and_copy_blacklist_hash_list = GEN $@ + cmd_check_and_copy_blacklist_hash_list = \ + $(AWK) -f $(srctree)/scripts/check-blacklist-hashes.awk $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) >&2; \ + cat $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) > $@ -CFLAGS_blacklist_hashes.o += -I$(srctree) - -targets += blacklist_hashes_checked -$(obj)/blacklist_hashes_checked: $(SYSTEM_BLACKLIST_HASH_LIST_SRCPREFIX)$(SYSTEM_BLACKLIST_HASH_LIST_FILENAME) scripts/check-blacklist-hashes.awk FORCE - $(call if_changed,check_blacklist_hashes,$(SYSTEM_BLACKLIST_HASH_LIST_SRCPREFIX)$(CONFIG_SYSTEM_BLACKLIST_HASH_LIST)) +$(obj)/blacklist_hash_list: $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) FORCE + $(call if_changed,check_and_copy_blacklist_hash_list) obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist_hashes.o else obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist_nohashes.o endif +targets += blacklist_hash_list quiet_cmd_extract_certs = CERT $@ cmd_extract_certs = $(obj)/extract-cert $(extract-cert-in) $@ @@ -33,7 +33,7 @@ $(obj)/system_certificates.o: $(obj)/x509_certificate_list $(obj)/x509_certificate_list: $(CONFIG_SYSTEM_TRUSTED_KEYS) $(obj)/extract-cert FORCE $(call if_changed,extract_certs) -targets += x509_certificate_list blacklist_hashes_checked +targets += x509_certificate_list # If module signing is requested, say by allyesconfig, but a key has not been # supplied, then one will need to be generated to make sure the build does not diff --git a/certs/blacklist_hashes.c b/certs/blacklist_hashes.c index d5961aa3d338..86d66fe11348 100644 --- a/certs/blacklist_hashes.c +++ b/certs/blacklist_hashes.c @@ -2,6 +2,6 @@ #include "blacklist.h" const char __initconst *const blacklist_hashes[] = { -#include CONFIG_SYSTEM_BLACKLIST_HASH_LIST +#include "blacklist_hash_list" , NULL }; From 5ee3d10f84d0a32fc11a55c70c204b6d81fd9ef6 Mon Sep 17 00:00:00 2001 From: Dave Wysochanski Date: Thu, 9 Jun 2022 20:46:29 -0400 Subject: [PATCH 199/298] NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file Commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add it for NFSv4.x. This causes direct io on NFSv4.x to fail open with EINVAL: mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4 dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct dd: failed to open '/mnt/nfs4/file.bin': Invalid argument dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct dd: failed to open '/mnt/dir1/file1.bin': Invalid argument Fixes: a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") Signed-off-by: Dave Wysochanski Signed-off-by: Anna Schumaker --- fs/nfs/dir.c | 1 + fs/nfs/nfs4file.c | 1 + 2 files changed, 2 insertions(+) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index a8ecdd527662..0c4e8dd6aa96 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2124,6 +2124,7 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry, } goto out; } + file->f_mode |= FMODE_CAN_ODIRECT; err = nfs_finish_open(ctx, ctx->dentry, file, open_flags); trace_nfs_atomic_open_exit(dir, ctx, open_flags, err); diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index 03d3a270eff4..e88f6b18445e 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -93,6 +93,7 @@ nfs4_file_open(struct inode *inode, struct file *filp) nfs_file_set_open_context(filp, ctx); nfs_fscache_open_file(inode, filp); err = 0; + filp->f_mode |= FMODE_CAN_ODIRECT; out_put_ctx: put_nfs_open_context(ctx); From c3230283e2819a69dad2cf7a63143fde8bab8b5c Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 15 Jun 2022 18:28:04 +0200 Subject: [PATCH 200/298] printk: Block console kthreads when direct printing will be required There are known situations when the console kthreads are not reliable or does not work in principle, for example, early boot, panic, shutdown. For these situations there is the direct (legacy) mode when printk() tries to get console_lock() and flush the messages directly. It works very well during the early boot when the console kthreads are not available at all. It gets more complicated in the other situations when console kthreads might be actively printing and block console_trylock() in printk(). The same problem is in the legacy code as well. Any console_lock() owner could block console_trylock() in printk(). It is solved by a trick that the current console_lock() owner is responsible for printing all pending messages. It is actually the reason why there is the risk of softlockups and why the console kthreads were introduced. The console kthreads use the same approach. They are responsible for printing the messages by definition. So that they handle the messages anytime when they are awake and see new ones. The global console_lock is available when there is nothing to do. It should work well when the problematic context is correctly detected and printk() switches to the direct mode. But it seems that it is not enough in practice. There are reports that the messages are not printed during panic() or shutdown() even though printk() tries to use the direct mode here. The problem seems to be that console kthreads become active in these situation as well. They steel the job before other CPUs are stopped. Then they are stopped in the middle of the job and block the global console_lock. First part of the solution is to block console kthreads when the system is in a problematic state and requires the direct printk() mode. Link: https://lore.kernel.org/r/20220610205038.GA3050413@paulmck-ThinkPad-P17-Gen-1 Link: https://lore.kernel.org/r/CAMdYzYpF4FNTBPZsEFeWRuEwSies36QM_As8osPWZSr2q-viEA@mail.gmail.com Suggested-by: John Ogness Tested-by: Paul E. McKenney Signed-off-by: Petr Mladek Link: https://lore.kernel.org/r/20220615162805.27962-2-pmladek@suse.com --- kernel/printk/printk.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index ea3dd55709e7..45c6c2b0b104 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -3729,7 +3729,9 @@ static bool printer_should_wake(struct console *con, u64 seq) return true; if (con->blocked || - console_kthreads_atomically_blocked()) { + console_kthreads_atomically_blocked() || + system_state > SYSTEM_RUNNING || + oops_in_progress) { return false; } From b87f02307d3cfbda768520f0687c51ca77e14fc3 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Wed, 15 Jun 2022 18:28:05 +0200 Subject: [PATCH 201/298] printk: Wait for the global console lock when the system is going down There are reports that the console kthreads block the global console lock when the system is going down, for example, reboot, panic. First part of the solution was to block kthreads in these problematic system states so they stopped handling newly added messages. Second part of the solution is to wait when for the kthreads when they are actively printing. It solves the problem when a message was printed before the system entered the problematic state and the kthreads managed to step in. A busy waiting has to be used because panic() can be called in any context and in an unknown state of the scheduler. There must be a timeout because the kthread might get stuck or sleeping and never release the lock. The timeout 10s is an arbitrary value inspired by the softlockup timeout. Link: https://lore.kernel.org/r/20220610205038.GA3050413@paulmck-ThinkPad-P17-Gen-1 Link: https://lore.kernel.org/r/CAMdYzYpF4FNTBPZsEFeWRuEwSies36QM_As8osPWZSr2q-viEA@mail.gmail.com Signed-off-by: Petr Mladek Tested-by: Paul E. McKenney Link: https://lore.kernel.org/r/20220615162805.27962-3-pmladek@suse.com --- include/linux/printk.h | 5 +++++ kernel/panic.c | 2 ++ kernel/printk/internal.h | 2 ++ kernel/printk/printk.c | 4 ++++ kernel/printk/printk_safe.c | 32 ++++++++++++++++++++++++++++++++ kernel/reboot.c | 2 ++ 6 files changed, 47 insertions(+) diff --git a/include/linux/printk.h b/include/linux/printk.h index cd26aab0ab2a..c1e07c0652c7 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -174,6 +174,7 @@ extern void printk_prefer_direct_enter(void); extern void printk_prefer_direct_exit(void); extern bool pr_flush(int timeout_ms, bool reset_on_progress); +extern void try_block_console_kthreads(int timeout_ms); /* * Please don't use printk_ratelimit(), because it shares ratelimiting state @@ -238,6 +239,10 @@ static inline bool pr_flush(int timeout_ms, bool reset_on_progress) return true; } +static inline void try_block_console_kthreads(int timeout_ms) +{ +} + static inline int printk_ratelimit(void) { return 0; diff --git a/kernel/panic.c b/kernel/panic.c index 6737b2332275..fe73d18ecdf0 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -273,6 +273,7 @@ void panic(const char *fmt, ...) * unfortunately means it may not be hardened to work in a * panic situation. */ + try_block_console_kthreads(10000); smp_send_stop(); } else { /* @@ -280,6 +281,7 @@ void panic(const char *fmt, ...) * kmsg_dump, we will need architecture dependent extra * works in addition to stopping other CPUs. */ + try_block_console_kthreads(10000); crash_smp_send_stop(); } diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index d947ca6c84f9..e7d8578860ad 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -20,6 +20,8 @@ enum printk_info_flags { LOG_CONT = 8, /* text is a fragment of a continuation line */ }; +extern bool block_console_kthreads; + __printf(4, 0) int vprintk_store(int facility, int level, const struct dev_printk_info *dev_info, diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 45c6c2b0b104..b095fb5f5f61 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -250,6 +250,9 @@ static atomic_t console_kthreads_active = ATOMIC_INIT(0); #define console_kthread_printing_exit() \ atomic_dec(&console_kthreads_active) +/* Block console kthreads to avoid processing new messages. */ +bool block_console_kthreads; + /* * Helper macros to handle lockdep when locking/unlocking console_sem. We use * macros instead of functions so that _RET_IP_ contains useful information. @@ -3730,6 +3733,7 @@ static bool printer_should_wake(struct console *con, u64 seq) if (con->blocked || console_kthreads_atomically_blocked() || + block_console_kthreads || system_state > SYSTEM_RUNNING || oops_in_progress) { return false; diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c index ef0f9a2044da..caac4de1ea59 100644 --- a/kernel/printk/printk_safe.c +++ b/kernel/printk/printk_safe.c @@ -8,7 +8,9 @@ #include #include #include +#include #include +#include #include "internal.h" @@ -50,3 +52,33 @@ asmlinkage int vprintk(const char *fmt, va_list args) return vprintk_default(fmt, args); } EXPORT_SYMBOL(vprintk); + +/** + * try_block_console_kthreads() - Try to block console kthreads and + * make the global console_lock() avaialble + * + * @timeout_ms: The maximum time (in ms) to wait. + * + * Prevent console kthreads from starting processing new messages. Wait + * until the global console_lock() become available. + * + * Context: Can be called in any context. + */ +void try_block_console_kthreads(int timeout_ms) +{ + block_console_kthreads = true; + + /* Do not wait when the console lock could not be safely taken. */ + if (this_cpu_read(printk_context) || in_nmi()) + return; + + while (timeout_ms > 0) { + if (console_trylock()) { + console_unlock(); + return; + } + + udelay(1000); + timeout_ms -= 1; + } +} diff --git a/kernel/reboot.c b/kernel/reboot.c index 4177645e74d6..310363685502 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -74,6 +74,7 @@ void kernel_restart_prepare(char *cmd) { blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd); system_state = SYSTEM_RESTART; + try_block_console_kthreads(10000); usermodehelper_disable(); device_shutdown(); } @@ -262,6 +263,7 @@ static void kernel_shutdown_prepare(enum system_states state) blocking_notifier_call_chain(&reboot_notifier_list, (state == SYSTEM_HALT) ? SYS_HALT : SYS_POWER_OFF, NULL); system_state = state; + try_block_console_kthreads(10000); usermodehelper_disable(); device_shutdown(); } From ef79c396c664be99d0c5660dc75fe863c1e20315 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christian=20G=C3=B6ttsche?= Date: Wed, 15 Jun 2022 17:44:31 +0200 Subject: [PATCH 202/298] audit: free module name MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reset the type of the record last as the helper `audit_free_module()` depends on it. unreferenced object 0xffff888153b707f0 (size 16): comm "modprobe", pid 1319, jiffies 4295110033 (age 1083.016s) hex dump (first 16 bytes): 62 69 6e 66 6d 74 5f 6d 69 73 63 00 6b 6b 6b a5 binfmt_misc.kkk. backtrace: [] kstrdup+0x2b/0x50 [] __audit_log_kern_module+0x4d/0xf0 [] load_module+0x9d4/0x2e10 [] __do_sys_finit_module+0x114/0x1b0 [] do_syscall_64+0x34/0x80 [] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Cc: stable@vger.kernel.org Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Christian Göttsche Signed-off-by: Paul Moore --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f3a2abd6d1a1..3a8c9d744800 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1014,10 +1014,10 @@ static void audit_reset_context(struct audit_context *ctx) ctx->target_comm[0] = '\0'; unroll_tree_refs(ctx, NULL, 0); WARN_ON(!list_empty(&ctx->killed_trees)); - ctx->type = 0; audit_free_module(ctx); ctx->fds[0] = -1; audit_proctitle_free(ctx); + ctx->type = 0; /* reset last for audit_free_*() */ } static inline struct audit_context *audit_alloc_context(enum audit_state state) From cad140d00899e7a9cb6fe93b282051df589e671c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christian=20G=C3=B6ttsche?= Date: Wed, 15 Jun 2022 17:38:39 +0200 Subject: [PATCH 203/298] selinux: free contexts previously transferred in selinux_add_opt() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `selinux_add_opt()` stopped taking ownership of the passed context since commit 70f4169ab421 ("selinux: parse contexts for mount options early"). unreferenced object 0xffff888114dfd140 (size 64): comm "mount", pid 15182, jiffies 4295687028 (age 796.340s) hex dump (first 32 bytes): 73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_ 72 3a 74 65 73 74 5f 66 69 6c 65 73 79 73 74 65 r:test_filesyste backtrace: [] kmemdup_nul+0x24/0x80 [] selinux_sb_eat_lsm_opts+0x293/0x560 [] security_sb_eat_lsm_opts+0x58/0x80 [] generic_parse_monolithic+0x82/0x180 [] do_new_mount+0x1f5/0x550 [] path_mount+0x2ab/0x1570 [] __x64_sys_mount+0x20e/0x280 [] do_syscall_64+0x34/0x80 [] entry_SYSCALL_64_after_hwframe+0x46/0xb0 unreferenced object 0xffff888108e71640 (size 64): comm "fsmount", pid 7607, jiffies 4295044974 (age 1601.016s) hex dump (first 32 bytes): 73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_ 72 3a 74 65 73 74 5f 66 69 6c 65 73 79 73 74 65 r:test_filesyste backtrace: [] memdup_user+0x21/0x90 [] strndup_user+0x47/0xa0 [] __do_sys_fsconfig+0x485/0x9f0 [] do_syscall_64+0x34/0x80 [] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Cc: stable@vger.kernel.org Fixes: 70f4169ab421 ("selinux: parse contexts for mount options early") Signed-off-by: Christian Göttsche Signed-off-by: Paul Moore --- security/selinux/hooks.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index beceb89f68d9..1bbd53321d13 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2600,8 +2600,9 @@ static int selinux_sb_eat_lsm_opts(char *options, void **mnt_opts) } } rc = selinux_add_opt(token, arg, mnt_opts); + kfree(arg); + arg = NULL; if (unlikely(rc)) { - kfree(arg); goto free_opt; } } else { @@ -2792,17 +2793,13 @@ static int selinux_fs_context_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct fs_parse_result result; - int opt, rc; + int opt; opt = fs_parse(fc, selinux_fs_parameters, param, &result); if (opt < 0) return opt; - rc = selinux_add_opt(opt, param->string, &fc->security); - if (!rc) - param->string = NULL; - - return rc; + return selinux_add_opt(opt, param->string, &fc->security); } /* inode security operations */ From f4288f01820e2d57722d21874c1fda661003c9b9 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Sun, 5 Jun 2022 18:51:22 -0700 Subject: [PATCH 204/298] xfs: fix TOCTOU race involving the new logged xattrs control knob I found a race involving the larp control knob, aka the debugging knob that lets developers enable logging of extended attribute updates: Thread 1 Thread 2 echo 0 > /sys/fs/xfs/debug/larp setxattr(REPLACE) xfs_has_larp (returns false) xfs_attr_set echo 1 > /sys/fs/xfs/debug/larp xfs_attr_defer_replace xfs_attr_init_replace_state xfs_has_larp (returns true) xfs_attr_init_remove_state This isn't a particularly severe problem right now because xattr logging is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know what they're doing. However, the eventual intent is that callers should be able to ask for the assistance of the log in persisting xattr updates. This capability might not be required for /all/ callers, which means that dynamic control must work correctly. Once an xattr update has decided whether or not to use logged xattrs, it needs to stay in that mode until the end of the operation regardless of what subsequent parallel operations might do. Therefore, it is an error to continue sampling xfs_globals.larp once xfs_attr_change has made a decision about larp, and it was not correct for me to have told Allison that ->create_intent functions can sample the global log incompat feature bitfield to decide to elide a log item. Instead, create a new op flag for the xfs_da_args structure, and convert all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within the attr update state machine to look for the operations flag. Signed-off-by: Darrick J. Wong Reviewed-by: Allison Henderson --- fs/xfs/libxfs/xfs_attr.c | 6 ++++-- fs/xfs/libxfs/xfs_attr.h | 12 +----------- fs/xfs/libxfs/xfs_attr_leaf.c | 2 +- fs/xfs/libxfs/xfs_da_btree.h | 4 +++- fs/xfs/xfs_attr_item.c | 15 +++++++++------ fs/xfs/xfs_xattr.c | 17 ++++++++++++++++- 6 files changed, 34 insertions(+), 22 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 836ab1b8ed7b..0847b4e16237 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -997,9 +997,11 @@ xfs_attr_set( /* * We have no control over the attribute names that userspace passes us * to remove, so we have to allow the name lookup prior to attribute - * removal to fail as well. + * removal to fail as well. Preserve the logged flag, since we need + * to pass that through to the logging code. */ - args->op_flags = XFS_DA_OP_OKNOENT; + args->op_flags = XFS_DA_OP_OKNOENT | + (args->op_flags & XFS_DA_OP_LOGGED); if (args->value) { XFS_STATS_INC(mp, xs_attr_set); diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h index e329da3e7afa..b4a2fc77017e 100644 --- a/fs/xfs/libxfs/xfs_attr.h +++ b/fs/xfs/libxfs/xfs_attr.h @@ -28,16 +28,6 @@ struct xfs_attr_list_context; */ #define ATTR_MAX_VALUELEN (64*1024) /* max length of a value */ -static inline bool xfs_has_larp(struct xfs_mount *mp) -{ -#ifdef DEBUG - /* Logged xattrs require a V5 super for log_incompat */ - return xfs_has_crc(mp) && xfs_globals.larp; -#else - return false; -#endif -} - /* * Kernel-internal version of the attrlist cursor. */ @@ -624,7 +614,7 @@ static inline enum xfs_delattr_state xfs_attr_init_replace_state(struct xfs_da_args *args) { args->op_flags |= XFS_DA_OP_ADDNAME | XFS_DA_OP_REPLACE; - if (xfs_has_larp(args->dp->i_mount)) + if (args->op_flags & XFS_DA_OP_LOGGED) return xfs_attr_init_remove_state(args); return xfs_attr_init_add_state(args); } diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index 15a990409463..37e7c33f6283 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -1530,7 +1530,7 @@ xfs_attr3_leaf_add_work( if (tmp) entry->flags |= XFS_ATTR_LOCAL; if (args->op_flags & XFS_DA_OP_REPLACE) { - if (!xfs_has_larp(mp)) + if (!(args->op_flags & XFS_DA_OP_LOGGED)) entry->flags |= XFS_ATTR_INCOMPLETE; if ((args->blkno2 == args->blkno) && (args->index2 <= args->index)) { diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index d33b7686a0b3..ffa3df5b2893 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -92,6 +92,7 @@ typedef struct xfs_da_args { #define XFS_DA_OP_NOTIME (1u << 5) /* don't update inode timestamps */ #define XFS_DA_OP_REMOVE (1u << 6) /* this is a remove operation */ #define XFS_DA_OP_RECOVERY (1u << 7) /* Log recovery operation */ +#define XFS_DA_OP_LOGGED (1u << 8) /* Use intent items to track op */ #define XFS_DA_OP_FLAGS \ { XFS_DA_OP_JUSTCHECK, "JUSTCHECK" }, \ @@ -101,7 +102,8 @@ typedef struct xfs_da_args { { XFS_DA_OP_CILOOKUP, "CILOOKUP" }, \ { XFS_DA_OP_NOTIME, "NOTIME" }, \ { XFS_DA_OP_REMOVE, "REMOVE" }, \ - { XFS_DA_OP_RECOVERY, "RECOVERY" } + { XFS_DA_OP_RECOVERY, "RECOVERY" }, \ + { XFS_DA_OP_LOGGED, "LOGGED" } /* * Storage for holding state during Btree searches and split/join ops. diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c index 4a28c2d77070..135d44133477 100644 --- a/fs/xfs/xfs_attr_item.c +++ b/fs/xfs/xfs_attr_item.c @@ -413,18 +413,20 @@ xfs_attr_create_intent( struct xfs_mount *mp = tp->t_mountp; struct xfs_attri_log_item *attrip; struct xfs_attr_intent *attr; + struct xfs_da_args *args; ASSERT(count == 1); - if (!xfs_sb_version_haslogxattrs(&mp->m_sb)) - return NULL; - /* * Each attr item only performs one attribute operation at a time, so * this is a list of one */ attr = list_first_entry_or_null(items, struct xfs_attr_intent, xattri_list); + args = attr->xattri_da_args; + + if (!(args->op_flags & XFS_DA_OP_LOGGED)) + return NULL; /* * Create a buffer to store the attribute name and value. This buffer @@ -432,8 +434,6 @@ xfs_attr_create_intent( * and the lower level xattr log items. */ if (!attr->xattri_nameval) { - struct xfs_da_args *args = attr->xattri_da_args; - /* * Transfer our reference to the name/value buffer to the * deferred work state structure. @@ -617,7 +617,10 @@ xfs_attri_item_recover( args->namelen = nv->name.i_len; args->hashval = xfs_da_hashname(args->name, args->namelen); args->attr_filter = attrp->alfi_attr_filter & XFS_ATTRI_FILTER_MASK; - args->op_flags = XFS_DA_OP_RECOVERY | XFS_DA_OP_OKNOENT; + args->op_flags = XFS_DA_OP_RECOVERY | XFS_DA_OP_OKNOENT | + XFS_DA_OP_LOGGED; + + ASSERT(xfs_sb_version_haslogxattrs(&mp->m_sb)); switch (attr->xattri_op_flags) { case XFS_ATTRI_OP_FLAGS_SET: diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c index 35e13e125ec6..c325a28b89a8 100644 --- a/fs/xfs/xfs_xattr.c +++ b/fs/xfs/xfs_xattr.c @@ -68,6 +68,18 @@ xfs_attr_rele_log_assist( xlog_drop_incompat_feat(mp->m_log); } +static inline bool +xfs_attr_want_log_assist( + struct xfs_mount *mp) +{ +#ifdef DEBUG + /* Logged xattrs require a V5 super for log_incompat */ + return xfs_has_crc(mp) && xfs_globals.larp; +#else + return false; +#endif +} + /* * Set or remove an xattr, having grabbed the appropriate logging resources * prior to calling libxfs. @@ -80,11 +92,14 @@ xfs_attr_change( bool use_logging = false; int error; - if (xfs_has_larp(mp)) { + ASSERT(!(args->op_flags & XFS_DA_OP_LOGGED)); + + if (xfs_attr_want_log_assist(mp)) { error = xfs_attr_grab_log_assist(mp); if (error) return error; + args->op_flags |= XFS_DA_OP_LOGGED; use_logging = true; } From 10930b254d5be1cb4350fb7a456ccd5ea7e3cbd9 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Sun, 5 Jun 2022 18:51:22 -0700 Subject: [PATCH 205/298] xfs: fix variable state usage MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The variable @args is fed to a tracepoint, and that's the only place it's used. This is fine for the kernel, but for userspace, tracepoints are #define'd out of existence, which results in this warning on gcc 11.2: xfs_attr.c: In function ‘xfs_attr_node_try_addname’: xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable] 1440 | struct xfs_da_args *args = attr->xattri_da_args; | ^~~~ Clean this up. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Allison Henderson --- fs/xfs/libxfs/xfs_attr.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 0847b4e16237..1824f61621a2 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -1441,12 +1441,11 @@ static int xfs_attr_node_try_addname( struct xfs_attr_intent *attr) { - struct xfs_da_args *args = attr->xattri_da_args; struct xfs_da_state *state = attr->xattri_da_state; struct xfs_da_state_blk *blk; int error; - trace_xfs_attr_node_addname(args); + trace_xfs_attr_node_addname(state->args); blk = &state->path.blk[state->path.active-1]; ASSERT(blk->magic == XFS_ATTR_LEAF_MAGIC); From e89ab76d7e2564c65986add3d634cc5cf5bacf14 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Sun, 5 Jun 2022 18:51:23 -0700 Subject: [PATCH 206/298] xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes It is vitally important that we preserve the state of the NREXT64 inode flag when we're changing the other flags2 fields. Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Allison Henderson --- fs/xfs/xfs_ioctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 5a364a7d58fd..0d67ff8a8961 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1096,7 +1096,8 @@ xfs_flags2diflags2( { uint64_t di_flags2 = (ip->i_diflags2 & (XFS_DIFLAG2_REFLINK | - XFS_DIFLAG2_BIGTIME)); + XFS_DIFLAG2_BIGTIME | + XFS_DIFLAG2_NREXT64)); if (xflags & FS_XFLAG_DAX) di_flags2 |= XFS_DIFLAG2_DAX; From 27cfa258951a465e3eae63ee1e715e902cd45578 Mon Sep 17 00:00:00 2001 From: Ye Bin Date: Wed, 15 Jun 2022 17:00:10 +0800 Subject: [PATCH 207/298] ext2: fix fs corruption when trying to remove a non-empty directory with IO error We got issue as follows: [home]# mount /dev/sdd test [home]# cd test [test]# ls dir1 lost+found [test]# rmdir dir1 ext2_empty_dir: inject fault [test]# ls lost+found [test]# cd .. [home]# umount test [home]# fsck.ext2 -fn /dev/sdd e2fsck 1.42.9 (28-Dec-2013) Pass 1: Checking inodes, blocks, and sizes Inode 4065, i_size is 0, should be 1024. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Unconnected directory inode 4065 (/???) Connect to /lost+found? no '..' in ... (4065) is / (2), should be (0). Fix? no Pass 4: Checking reference counts Inode 2 ref count is 3, should be 4. Fix? no Inode 4065 ref count is 2, should be 3. Fix? no Pass 5: Checking group summary information /dev/sdd: ********** WARNING: Filesystem still has errors ********** /dev/sdd: 14/128016 files (0.0% non-contiguous), 18477/512000 blocks Reason is same with commit 7aab5c84a0f6. We can't assume directory is empty when read directory entry failed. Link: https://lore.kernel.org/r/20220615090010.1544152-1-yebin10@huawei.com Signed-off-by: Ye Bin Signed-off-by: Jan Kara --- fs/ext2/dir.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 2c2f179b6977..43de293cef56 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -672,17 +672,14 @@ int ext2_empty_dir (struct inode * inode) void *page_addr = NULL; struct page *page = NULL; unsigned long i, npages = dir_pages(inode); - int dir_has_error = 0; for (i = 0; i < npages; i++) { char *kaddr; ext2_dirent * de; - page = ext2_get_page(inode, i, dir_has_error, &page_addr); + page = ext2_get_page(inode, i, 0, &page_addr); - if (IS_ERR(page)) { - dir_has_error = 1; - continue; - } + if (IS_ERR(page)) + goto not_empty; kaddr = page_addr; de = (ext2_dirent *)kaddr; From 4bca7e80b6455772b4bf3f536dcbc19aac424d6a Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 15 Jun 2022 15:22:29 +0200 Subject: [PATCH 208/298] init: Initialize noop_backing_dev_info early noop_backing_dev_info is used by superblocks of various pseudofilesystems such as kdevtmpfs. After commit 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error") this broke because __mark_inode_dirty() started to access more fields from noop_backing_dev_info and this led to crashes inside locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty(). Fix the problem by initializing noop_backing_dev_info before the filesystems get mounted. Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error") Reported-and-tested-by: Suzuki K Poulose Reported-and-tested-by: Alexandru Elisei Reported-and-tested-by: Guenter Roeck Reviewed-by: Christoph Hellwig Signed-off-by: Jan Kara --- drivers/base/init.c | 2 ++ include/linux/backing-dev.h | 2 ++ mm/backing-dev.c | 11 ++--------- 3 files changed, 6 insertions(+), 9 deletions(-) diff --git a/drivers/base/init.c b/drivers/base/init.c index d8d0fe687111..397eb9880cec 100644 --- a/drivers/base/init.c +++ b/drivers/base/init.c @@ -8,6 +8,7 @@ #include #include #include +#include #include "base.h" @@ -20,6 +21,7 @@ void __init driver_init(void) { /* These are the core pieces */ + bdi_init(&noop_backing_dev_info); devtmpfs_init(); devices_init(); buses_init(); diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 2bd073fa6bb5..d452071db572 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -119,6 +119,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); extern struct backing_dev_info noop_backing_dev_info; +int bdi_init(struct backing_dev_info *bdi); + /** * writeback_in_progress - determine whether there is writeback in progress * @wb: bdi_writeback of interest diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ff60bd7d74e0..95550b8fa7fe 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -231,20 +231,13 @@ static __init int bdi_class_init(void) } postcore_initcall(bdi_class_init); -static int bdi_init(struct backing_dev_info *bdi); - static int __init default_bdi_init(void) { - int err; - bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_SYSFS, 0); if (!bdi_wq) return -ENOMEM; - - err = bdi_init(&noop_backing_dev_info); - - return err; + return 0; } subsys_initcall(default_bdi_init); @@ -781,7 +774,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) #endif /* CONFIG_CGROUP_WRITEBACK */ -static int bdi_init(struct backing_dev_info *bdi) +int bdi_init(struct backing_dev_info *bdi) { int ret; From a76c0b31eef50fdb8b21d53a6d050f59241fb88e Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 15 Jun 2022 19:51:11 -0600 Subject: [PATCH 209/298] io_uring: commit non-pollable provided mapped buffers upfront For recv/recvmsg, IO either completes immediately or gets queued for a retry. This isn't the case for read/readv, if eg a normal file or a block device is used. Here, an operation can get queued with the block layer. If this happens, ring mapped buffers must get committed immediately to avoid that the next read can consume the same buffer. Check if we're dealing with pollable file, when getting a new ring mapped provided buffer. If it's not, commit it immediately rather than wait post issue. If we don't wait, we can race with completions coming in, or just plain buffer reuse by committing after a retry where others could have grabbed the same buffer. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Reviewed-by: Hao Xu Signed-off-by: Jens Axboe --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 5d479428d8e5..b6e75f69c6b1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3836,7 +3836,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, req->buf_list = bl; req->buf_index = buf->bid; - if (issue_flags & IO_URING_F_UNLOCKED) { + if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) { /* * If we came in unlocked, we have no choice but to consume the * buffer here. This does mean it'll be pinned until the IO From 4f5bf12732fd78e225fc62b7c5c84d9032f8048a Mon Sep 17 00:00:00 2001 From: Yang Li Date: Thu, 12 May 2022 15:54:32 +0800 Subject: [PATCH 210/298] fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment Add the description of @folio and remove @page in function kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/jbd2/transaction.c:2149: warning: Function parameter or member 'folio' not described in 'jbd2_journal_try_to_free_buffers' fs/jbd2/transaction.c:2149: warning: Excess function parameter 'page' description in 'jbd2_journal_try_to_free_buffers' Reported-by: Abaci Robot Signed-off-by: Yang Li Link: https://lore.kernel.org/r/20220512075432.31763-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o --- fs/jbd2/transaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index e49bb0938376..e9c308ae475f 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -2114,7 +2114,7 @@ out: /** * jbd2_journal_try_to_free_buffers() - try to free page buffers. * @journal: journal for operation - * @page: to try and free + * @folio: Folio to detach data from. * * For all the buffers on this page, * if they are fully written out ordered data, move them onto BUF_CLEAN From 48e02e6113825db81e4aacc035933c0d0e4e68ce Mon Sep 17 00:00:00 2001 From: Wang Jianjian Date: Fri, 20 May 2022 10:22:54 +0800 Subject: [PATCH 211/298] ext4: fix incorrect comment in ext4_bio_write_page() Signed-off-by: Wang Jianjian Link: https://lore.kernel.org/r/20220520022255.2120576-1-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/page-io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 14695e2b5042..97fa7b4c645f 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -465,7 +465,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io, /* * In the first loop we prepare and mark buffers to submit. We have to * mark all buffers in the page before submitting so that - * end_page_writeback() cannot be called from ext4_bio_end_io() when IO + * end_page_writeback() cannot be called from ext4_end_bio() when IO * on the first buffer finishes and we are still working on submitting * the second buffer. */ From 3103084afcf2341e12b0ee2c7b2ed570164f44a2 Mon Sep 17 00:00:00 2001 From: Wang Jianjian Date: Fri, 20 May 2022 10:22:55 +0800 Subject: [PATCH 212/298] ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o --- Documentation/filesystems/ext4/attributes.rst | 68 +-- Documentation/filesystems/ext4/bigalloc.rst | 2 +- Documentation/filesystems/ext4/bitmaps.rst | 6 +- Documentation/filesystems/ext4/blockgroup.rst | 30 +- Documentation/filesystems/ext4/blockmap.rst | 2 +- Documentation/filesystems/ext4/checksums.rst | 26 +- Documentation/filesystems/ext4/directory.rst | 166 +++--- Documentation/filesystems/ext4/eainode.rst | 10 +- .../filesystems/ext4/group_descr.rst | 126 ++-- Documentation/filesystems/ext4/ifork.rst | 60 +- Documentation/filesystems/ext4/inlinedata.rst | 8 +- Documentation/filesystems/ext4/inodes.rst | 306 +++++----- Documentation/filesystems/ext4/journal.rst | 214 +++---- Documentation/filesystems/ext4/mmp.rst | 36 +- Documentation/filesystems/ext4/overview.rst | 2 +- .../filesystems/ext4/special_inodes.rst | 8 +- Documentation/filesystems/ext4/super.rst | 550 +++++++++--------- 17 files changed, 810 insertions(+), 810 deletions(-) diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/filesystems/ext4/attributes.rst index 871d2da7a0a9..87814696a65b 100644 --- a/Documentation/filesystems/ext4/attributes.rst +++ b/Documentation/filesystems/ext4/attributes.rst @@ -13,8 +13,8 @@ disappeared as of Linux 3.0. There are two places where extended attributes can be found. The first place is between the end of each inode entry and the beginning of the -next inode entry. For example, if inode.i\_extra\_isize = 28 and -sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes +next inode entry. For example, if inode.i_extra_isize = 28 and +sb.inode_size = 256, then there are 256 - (128 + 28) = 100 bytes available for in-inode extended attribute storage. The second place where extended attributes can be found is in the block pointed to by ``inode.i_file_acl``. As of Linux 3.11, it is not possible for this @@ -38,8 +38,8 @@ Extended attributes, when stored after the inode, have a header - Name - Description * - 0x0 - - \_\_le32 - - h\_magic + - __le32 + - h_magic - Magic number for identification, 0xEA020000. This value is set by the Linux driver, though e2fsprogs doesn't seem to check it(?) @@ -55,28 +55,28 @@ The beginning of an extended attribute block is in - Name - Description * - 0x0 - - \_\_le32 - - h\_magic + - __le32 + - h_magic - Magic number for identification, 0xEA020000. * - 0x4 - - \_\_le32 - - h\_refcount + - __le32 + - h_refcount - Reference count. * - 0x8 - - \_\_le32 - - h\_blocks + - __le32 + - h_blocks - Number of disk blocks used. * - 0xC - - \_\_le32 - - h\_hash + - __le32 + - h_hash - Hash value of all attributes. * - 0x10 - - \_\_le32 - - h\_checksum + - __le32 + - h_checksum - Checksum of the extended attribute block. * - 0x14 - - \_\_u32 - - h\_reserved[3] + - __u32 + - h_reserved[3] - Zero. The checksum is calculated against the FS UUID, the 64-bit block number @@ -100,46 +100,46 @@ Attributes stored inside an inode do not need be stored in sorted order. - Name - Description * - 0x0 - - \_\_u8 - - e\_name\_len + - __u8 + - e_name_len - Length of name. * - 0x1 - - \_\_u8 - - e\_name\_index + - __u8 + - e_name_index - Attribute name index. There is a discussion of this below. * - 0x2 - - \_\_le16 - - e\_value\_offs + - __le16 + - e_value_offs - Location of this attribute's value on the disk block where it is stored. Multiple attributes can share the same value. For an inode attribute this value is relative to the start of the first entry; for a block this value is relative to the start of the block (i.e. the header). * - 0x4 - - \_\_le32 - - e\_value\_inum + - __le32 + - e_value_inum - The inode where the value is stored. Zero indicates the value is in the same block as this entry. This field is only used if the - INCOMPAT\_EA\_INODE feature is enabled. + INCOMPAT_EA_INODE feature is enabled. * - 0x8 - - \_\_le32 - - e\_value\_size + - __le32 + - e_value_size - Length of attribute value. * - 0xC - - \_\_le32 - - e\_hash + - __le32 + - e_hash - Hash value of attribute name and attribute value. The kernel doesn't update the hash for in-inode attributes, so for that case this value must be zero, because e2fsck validates any non-zero hash regardless of where the xattr lives. * - 0x10 - char - - e\_name[e\_name\_len] + - e_name[e_name_len] - Attribute name. Does not include trailing NULL. Attribute values can follow the end of the entry table. There appears to be a requirement that they be aligned to 4-byte boundaries. The values are stored starting at the end of the block and grow towards the -xattr\_header/xattr\_entry table. When the two collide, the overflow is +xattr_header/xattr_entry table. When the two collide, the overflow is put into a separate disk block. If the disk block fills up, the filesystem returns -ENOSPC. @@ -167,15 +167,15 @@ the key name. Here is a map of name index values to key prefixes: * - 1 - “user.” * - 2 - - “system.posix\_acl\_access” + - “system.posix_acl_access” * - 3 - - “system.posix\_acl\_default” + - “system.posix_acl_default” * - 4 - “trusted.” * - 6 - “security.” * - 7 - - “system.” (inline\_data only?) + - “system.” (inline_data only?) * - 8 - “system.richacl” (SuSE kernels only?) diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst index 72075aa608e4..976a180b209c 100644 --- a/Documentation/filesystems/ext4/bigalloc.rst +++ b/Documentation/filesystems/ext4/bigalloc.rst @@ -23,7 +23,7 @@ means that a block group addresses 32 gigabytes instead of 128 megabytes, also shrinking the amount of file system overhead for metadata. The administrator can set a block cluster size at mkfs time (which is -stored in the s\_log\_cluster\_size field in the superblock); from then +stored in the s_log_cluster_size field in the superblock); from then on, the block bitmaps track clusters, not individual blocks. This means that block groups can be several gigabytes in size (instead of just 128MiB); however, the minimum allocation unit becomes a cluster, not a diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst index c7546dbc197a..91c45d86e9bb 100644 --- a/Documentation/filesystems/ext4/bitmaps.rst +++ b/Documentation/filesystems/ext4/bitmaps.rst @@ -9,15 +9,15 @@ group. The inode bitmap records which entries in the inode table are in use. As with most bitmaps, one bit represents the usage status of one data -block or inode table entry. This implies a block group size of 8 \* -number\_of\_bytes\_in\_a\_logical\_block. +block or inode table entry. This implies a block group size of 8 * +number_of_bytes_in_a_logical_block. NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts of the kernel and e2fsprogs code pretends that the block bitmap contains zeros (i.e. all blocks in the group are free). However, it is not necessarily the case that no blocks are in use -- if ``meta_bg`` is set, the bitmaps and group descriptor live inside the group. Unfortunately, -ext2fs\_test\_block\_bitmap2() will return '0' for those locations, +ext2fs_test_block_bitmap2() will return '0' for those locations, which produces confusing debugfs output. Inode Table diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst index d5d652addce5..46d78f860623 100644 --- a/Documentation/filesystems/ext4/blockgroup.rst +++ b/Documentation/filesystems/ext4/blockgroup.rst @@ -56,39 +56,39 @@ established that the super block and the group descriptor table, if present, will be at the beginning of the block group. The bitmaps and the inode table can be anywhere, and it is quite possible for the bitmaps to come after the inode table, or for both to be in different -groups (flex\_bg). Leftover space is used for file data blocks, indirect +groups (flex_bg). Leftover space is used for file data blocks, indirect block maps, extent tree blocks, and extended attributes. Flexible Block Groups --------------------- Starting in ext4, there is a new feature called flexible block groups -(flex\_bg). In a flex\_bg, several block groups are tied together as one +(flex_bg). In a flex_bg, several block groups are tied together as one logical block group; the bitmap spaces and the inode table space in the -first block group of the flex\_bg are expanded to include the bitmaps -and inode tables of all other block groups in the flex\_bg. For example, -if the flex\_bg size is 4, then group 0 will contain (in order) the +first block group of the flex_bg are expanded to include the bitmaps +and inode tables of all other block groups in the flex_bg. For example, +if the flex_bg size is 4, then group 0 will contain (in order) the superblock, group descriptors, data block bitmaps for groups 0-3, inode bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining space in group 0 is for file data. The effect of this is to group the block group metadata close together for faster loading, and to enable large files to be continuous on disk. Backup copies of the superblock and group descriptors are always at the beginning of block groups, even -if flex\_bg is enabled. The number of block groups that make up a -flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. +if flex_bg is enabled. The number of block groups that make up a +flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. Meta Block Groups ----------------- -Without the option META\_BG, for safety concerns, all block group +Without the option META_BG, for safety concerns, all block group descriptors copies are kept in the first block group. Given the default 128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4 can have at most 2^27/64 = 2^21 block groups. This limits the entire filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB. The solution to this problem is to use the metablock group feature -(META\_BG), which is already in ext3 for all 2.6 releases. With the -META\_BG feature, ext4 filesystems are partitioned into many metablock +(META_BG), which is already in ext3 for all 2.6 releases. With the +META_BG feature, ext4 filesystems are partitioned into many metablock groups. Each metablock group is a cluster of block groups whose group descriptor structures can be stored in a single disk block. For ext4 filesystems with 4 KB block size, a single metablock group partition @@ -110,7 +110,7 @@ bytes, a meta-block group contains 32 block groups for filesystems with a 1KB block size, and 128 block groups for filesystems with a 4KB blocksize. Filesystems can either be created using this new block group descriptor layout, or existing filesystems can be resized on-line, and -the field s\_first\_meta\_bg in the superblock will indicate the first +the field s_first_meta_bg in the superblock will indicate the first block group using this new layout. Please see an important note about ``BLOCK_UNINIT`` in the section about @@ -121,15 +121,15 @@ Lazy Block Group Initialization A new feature for ext4 are three block group descriptor flags that enable mkfs to skip initializing other parts of the block group -metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean +metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean that the inode and block bitmaps for that group can be calculated and therefore the on-disk bitmap blocks are not initialized. This is generally the case for an empty block group or a block group containing -only fixed-location block group metadata. The INODE\_ZEROED flag means +only fixed-location block group metadata. The INODE_ZEROED flag means that the inode table has been initialized; mkfs will unset this flag and rely on the kernel to initialize the inode tables in the background. By not writing zeroes to the bitmaps and inode table, mkfs time is -reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM, -but the dumpe2fs output prints this as “uninit\_bg”. They are the same +reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM, +but the dumpe2fs output prints this as “uninit_bg”. They are the same thing. diff --git a/Documentation/filesystems/ext4/blockmap.rst b/Documentation/filesystems/ext4/blockmap.rst index 30e25750d88a..2bd990402a5c 100644 --- a/Documentation/filesystems/ext4/blockmap.rst +++ b/Documentation/filesystems/ext4/blockmap.rst @@ -1,7 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| i.i\_block Offset | Where It Points | +| i.i_block Offset | Where It Points | +=====================+==============================================================================================================================================================================================================================+ | 0 to 11 | Direct map to file blocks 0 to 11. | +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/Documentation/filesystems/ext4/checksums.rst b/Documentation/filesystems/ext4/checksums.rst index 5519e253810d..e232749daf5f 100644 --- a/Documentation/filesystems/ext4/checksums.rst +++ b/Documentation/filesystems/ext4/checksums.rst @@ -4,7 +4,7 @@ Checksums --------- Starting in early 2012, metadata checksums were added to all major ext4 -and jbd2 data structures. The associated feature flag is metadata\_csum. +and jbd2 data structures. The associated feature flag is metadata_csum. The desired checksum algorithm is indicated in the superblock, though as of October 2012 the only supported algorithm is crc32c. Some data structures did not have space to fit a full 32-bit checksum, so only the @@ -20,7 +20,7 @@ encounters directory blocks that lack sufficient empty space to add a checksum, it will request that you run ``e2fsck -D`` to have the directories rebuilt with checksums. This has the added benefit of removing slack space from the directory files and rebalancing the htree -indexes. If you \_ignore\_ this step, your directories will not be +indexes. If you _ignore_ this step, your directories will not be protected by a checksum! The following table describes the data elements that go into each type @@ -35,39 +35,39 @@ of checksum. The checksum function is whatever the superblock describes - Length - Ingredients * - Superblock - - \_\_le32 + - __le32 - The entire superblock up to the checksum field. The UUID lives inside the superblock. * - MMP - - \_\_le32 + - __le32 - UUID + the entire MMP block up to the checksum field. * - Extended Attributes - - \_\_le32 + - __le32 - UUID + the entire extended attribute block. The checksum field is set to zero. * - Directory Entries - - \_\_le32 + - __le32 - UUID + inode number + inode generation + the directory block up to the fake entry enclosing the checksum field. * - HTREE Nodes - - \_\_le32 + - __le32 - UUID + inode number + inode generation + all valid extents + HTREE tail. The checksum field is set to zero. * - Extents - - \_\_le32 + - __le32 - UUID + inode number + inode generation + the entire extent block up to the checksum field. * - Bitmaps - - \_\_le32 or \_\_le16 + - __le32 or __le16 - UUID + the entire bitmap. Checksums are stored in the group descriptor, and truncated if the group descriptor size is 32 bytes (i.e. ^64bit) * - Inodes - - \_\_le32 + - __le32 - UUID + inode number + inode generation + the entire inode. The checksum field is set to zero. Each inode has its own checksum. * - Group Descriptors - - \_\_le16 - - If metadata\_csum, then UUID + group number + the entire descriptor; - else if gdt\_csum, then crc16(UUID + group number + the entire + - __le16 + - If metadata_csum, then UUID + group number + the entire descriptor; + else if gdt_csum, then crc16(UUID + group number + the entire descriptor). In all cases, only the lower 16 bits are stored. diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst index 55f618b37144..6eece8e31df8 100644 --- a/Documentation/filesystems/ext4/directory.rst +++ b/Documentation/filesystems/ext4/directory.rst @@ -42,24 +42,24 @@ is at most 263 bytes long, though on disk you'll need to reference - Name - Description * - 0x0 - - \_\_le32 + - __le32 - inode - Number of the inode that this directory entry points to. * - 0x4 - - \_\_le16 - - rec\_len + - __le16 + - rec_len - Length of this directory entry. Must be a multiple of 4. * - 0x6 - - \_\_le16 - - name\_len + - __le16 + - name_len - Length of the file name. * - 0x8 - char - - name[EXT4\_NAME\_LEN] + - name[EXT4_NAME_LEN] - File name. Since file names cannot be longer than 255 bytes, the new directory -entry format shortens the name\_len field and uses the space for a file +entry format shortens the name_len field and uses the space for a file type flag, probably to avoid having to load every inode during directory tree traversal. This format is ``ext4_dir_entry_2``, which is at most 263 bytes long, though on disk you'll need to reference @@ -74,24 +74,24 @@ tree traversal. This format is ``ext4_dir_entry_2``, which is at most - Name - Description * - 0x0 - - \_\_le32 + - __le32 - inode - Number of the inode that this directory entry points to. * - 0x4 - - \_\_le16 - - rec\_len + - __le16 + - rec_len - Length of this directory entry. * - 0x6 - - \_\_u8 - - name\_len + - __u8 + - name_len - Length of the file name. * - 0x7 - - \_\_u8 - - file\_type + - __u8 + - file_type - File type code, see ftype_ table below. * - 0x8 - char - - name[EXT4\_NAME\_LEN] + - name[EXT4_NAME_LEN] - File name. .. _ftype: @@ -137,19 +137,19 @@ entry uses this extension, it may be up to 271 bytes. - Name - Description * - 0x0 - - \_\_le32 + - __le32 - hash - The hash of the directory name * - 0x4 - - \_\_le32 - - minor\_hash + - __le32 + - minor_hash - The minor hash of the directory name In order to add checksums to these classic directory blocks, a phony ``struct ext4_dir_entry`` is placed at the end of each leaf block to hold the checksum. The directory entry is 12 bytes long. The inode -number and name\_len fields are set to zero to fool old software into +number and name_len fields are set to zero to fool old software into ignoring an apparently empty directory entry, and the checksum is stored in the place where the name normally goes. The structure is ``struct ext4_dir_entry_tail``: @@ -163,24 +163,24 @@ in the place where the name normally goes. The structure is - Name - Description * - 0x0 - - \_\_le32 - - det\_reserved\_zero1 + - __le32 + - det_reserved_zero1 - Inode number, which must be zero. * - 0x4 - - \_\_le16 - - det\_rec\_len + - __le16 + - det_rec_len - Length of this directory entry, which must be 12. * - 0x6 - - \_\_u8 - - det\_reserved\_zero2 + - __u8 + - det_reserved_zero2 - Length of the file name, which must be zero. * - 0x7 - - \_\_u8 - - det\_reserved\_ft + - __u8 + - det_reserved_ft - File type, which must be 0xDE. * - 0x8 - - \_\_le32 - - det\_checksum + - __le32 + - det_checksum - Directory leaf block checksum. The leaf directory block checksum is calculated against the FS UUID, the @@ -194,7 +194,7 @@ Hash Tree Directories A linear array of directory entries isn't great for performance, so a new feature was added to ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name. If the -EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a +EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a hashed btree (htree) to organize and find directory entries. For backwards read-only compatibility with ext2, this tree is actually hidden inside the directory file, masquerading as “empty” directory data @@ -206,14 +206,14 @@ rest of the directory block is empty so that it moves on. The root of the tree always lives in the first data block of the directory. By ext2 custom, the '.' and '..' entries must appear at the beginning of this first block, so they are put here as two -``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of +``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of the root node contains metadata about the tree and finally a hash->block map to find nodes that are lower in the htree. If ``dx_root.info.indirect_levels`` is non-zero then the htree has two levels; the data block pointed to by the root node's map is an interior node, which is indexed by a minor hash. Interior nodes in this tree contains a zeroed out ``struct ext4_dir_entry_2`` followed by a -minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear +minor_hash->block map to find leafe nodes. Leaf nodes contain a linear array of all ``struct ext4_dir_entry_2``; all of these entries (presumably) hash to the same value. If there is an overflow, the entries simply overflow into the next leaf node, and the @@ -245,83 +245,83 @@ of a data block: - Name - Description * - 0x0 - - \_\_le32 + - __le32 - dot.inode - inode number of this directory. * - 0x4 - - \_\_le16 - - dot.rec\_len + - __le16 + - dot.rec_len - Length of this record, 12. * - 0x6 - u8 - - dot.name\_len + - dot.name_len - Length of the name, 1. * - 0x7 - u8 - - dot.file\_type + - dot.file_type - File type of this entry, 0x2 (directory) (if the feature flag is set). * - 0x8 - char - dot.name[4] - - “.\\0\\0\\0” + - “.\0\0\0” * - 0xC - - \_\_le32 + - __le32 - dotdot.inode - inode number of parent directory. * - 0x10 - - \_\_le16 - - dotdot.rec\_len - - block\_size - 12. The record length is long enough to cover all htree + - __le16 + - dotdot.rec_len + - block_size - 12. The record length is long enough to cover all htree data. * - 0x12 - u8 - - dotdot.name\_len + - dotdot.name_len - Length of the name, 2. * - 0x13 - u8 - - dotdot.file\_type + - dotdot.file_type - File type of this entry, 0x2 (directory) (if the feature flag is set). * - 0x14 - char - - dotdot\_name[4] - - “..\\0\\0” + - dotdot_name[4] + - “..\0\0” * - 0x18 - - \_\_le32 - - struct dx\_root\_info.reserved\_zero + - __le32 + - struct dx_root_info.reserved_zero - Zero. * - 0x1C - u8 - - struct dx\_root\_info.hash\_version + - struct dx_root_info.hash_version - Hash type, see dirhash_ table below. * - 0x1D - u8 - - struct dx\_root\_info.info\_length + - struct dx_root_info.info_length - Length of the tree information, 0x8. * - 0x1E - u8 - - struct dx\_root\_info.indirect\_levels - - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR + - struct dx_root_info.indirect_levels + - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR feature is set; cannot be larger than 2 otherwise. * - 0x1F - u8 - - struct dx\_root\_info.unused\_flags + - struct dx_root_info.unused_flags - * - 0x20 - - \_\_le16 + - __le16 - limit - - Maximum number of dx\_entries that can follow this header, plus 1 for + - Maximum number of dx_entries that can follow this header, plus 1 for the header itself. * - 0x22 - - \_\_le16 + - __le16 - count - - Actual number of dx\_entries that follow this header, plus 1 for the + - Actual number of dx_entries that follow this header, plus 1 for the header itself. * - 0x24 - - \_\_le32 + - __le32 - block - The block number (within the directory file) that goes with hash=0. * - 0x28 - - struct dx\_entry + - struct dx_entry - entries[0] - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. @@ -362,38 +362,38 @@ also the full length of a data block: - Name - Description * - 0x0 - - \_\_le32 + - __le32 - fake.inode - Zero, to make it look like this entry is not in use. * - 0x4 - - \_\_le16 - - fake.rec\_len - - The size of the block, in order to hide all of the dx\_node data. + - __le16 + - fake.rec_len + - The size of the block, in order to hide all of the dx_node data. * - 0x6 - u8 - - name\_len + - name_len - Zero. There is no name for this “unused” directory entry. * - 0x7 - u8 - - file\_type + - file_type - Zero. There is no file type for this “unused” directory entry. * - 0x8 - - \_\_le16 + - __le16 - limit - - Maximum number of dx\_entries that can follow this header, plus 1 for + - Maximum number of dx_entries that can follow this header, plus 1 for the header itself. * - 0xA - - \_\_le16 + - __le16 - count - - Actual number of dx\_entries that follow this header, plus 1 for the + - Actual number of dx_entries that follow this header, plus 1 for the header itself. * - 0xE - - \_\_le32 + - __le32 - block - The block number (within the directory file) that goes with the lowest hash value of this block. This value is stored in the parent block. * - 0x12 - - struct dx\_entry + - struct dx_entry - entries[0] - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. @@ -410,11 +410,11 @@ long: - Name - Description * - 0x0 - - \_\_le32 + - __le32 - hash - Hash code. * - 0x4 - - \_\_le32 + - __le32 - block - Block number (within the directory file, not filesystem blocks) of the next node in the htree. @@ -423,13 +423,13 @@ long: author.) If metadata checksums are enabled, the last 8 bytes of the directory -block (precisely the length of one dx\_entry) are used to store a +block (precisely the length of one dx_entry) are used to store a ``struct dx_tail``, which contains the checksum. The ``limit`` and -``count`` entries in the dx\_root/dx\_node structures are adjusted as -necessary to fit the dx\_tail into the block. If there is no space for -the dx\_tail, the user is notified to run e2fsck -D to rebuild the +``count`` entries in the dx_root/dx_node structures are adjusted as +necessary to fit the dx_tail into the block. If there is no space for +the dx_tail, the user is notified to run e2fsck -D to rebuild the directory index (which will ensure that there's space for the checksum. -The dx\_tail structure is 8 bytes long and looks like this: +The dx_tail structure is 8 bytes long and looks like this: .. list-table:: :widths: 8 8 24 40 @@ -441,13 +441,13 @@ The dx\_tail structure is 8 bytes long and looks like this: - Description * - 0x0 - u32 - - dt\_reserved + - dt_reserved - Zero. * - 0x4 - - \_\_le32 - - dt\_checksum + - __le32 + - dt_checksum - Checksum of the htree directory block. The checksum is calculated against the FS UUID, the htree index header -(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in -use, and the tail block (dx\_tail). +(dx_root or dx_node), all of the htree indices (dx_entry) that are in +use, and the tail block (dx_tail). diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst index ecc0d01a0a72..7a2ef26b064a 100644 --- a/Documentation/filesystems/ext4/eainode.rst +++ b/Documentation/filesystems/ext4/eainode.rst @@ -5,14 +5,14 @@ Large Extended Attribute Values To enable ext4 to store extended attribute values that do not fit in the inode or in the single extended attribute block attached to an inode, -the EA\_INODE feature allows us to store the value in the data blocks of +the EA_INODE feature allows us to store the value in the data blocks of a regular file inode. This “EA inode” is linked only from the extended attribute name index and must not appear in a directory entry. The -inode's i\_atime field is used to store a checksum of the xattr value; -and i\_ctime/i\_version store a 64-bit reference count, which enables +inode's i_atime field is used to store a checksum of the xattr value; +and i_ctime/i_version store a 64-bit reference count, which enables sharing of large xattr values between multiple owning inodes. For backward compatibility with older versions of this feature, the -i\_mtime/i\_generation *may* store a back-reference to the inode number -and i\_generation of the **one** owning inode (in cases where the EA +i_mtime/i_generation *may* store a back-reference to the inode number +and i_generation of the **one** owning inode (in cases where the EA inode is not referenced by multiple inodes) to verify that the EA inode is the correct one being accessed. diff --git a/Documentation/filesystems/ext4/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst index 7ba6114e7f5c..392ec44f8fb0 100644 --- a/Documentation/filesystems/ext4/group_descr.rst +++ b/Documentation/filesystems/ext4/group_descr.rst @@ -7,34 +7,34 @@ Each block group on the filesystem has one of these descriptors associated with it. As noted in the Layout section above, the group descriptors (if present) are the second item in the block group. The standard configuration is for each block group to contain a full copy of -the block group descriptor table unless the sparse\_super feature flag +the block group descriptor table unless the sparse_super feature flag is set. Notice how the group descriptor records the location of both bitmaps and the inode table (i.e. they can float). This means that within a block group, the only data structures with fixed locations are the superblock -and the group descriptor table. The flex\_bg mechanism uses this +and the group descriptor table. The flex_bg mechanism uses this property to group several block groups into a flex group and lay out all of the groups' bitmaps and inode tables into one long run in the first group of the flex group. -If the meta\_bg feature flag is set, then several block groups are -grouped together into a meta group. Note that in the meta\_bg case, +If the meta_bg feature flag is set, then several block groups are +grouped together into a meta group. Note that in the meta_bg case, however, the first and last two block groups within the larger meta group contain only group descriptors for the groups inside the meta group. -flex\_bg and meta\_bg do not appear to be mutually exclusive features. +flex_bg and meta_bg do not appear to be mutually exclusive features. In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the block group descriptor was only 32 bytes long and therefore ends at -bg\_checksum. On an ext4 filesystem with the 64bit feature enabled, the +bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the block group descriptor expands to at least the 64 bytes described below; the size is stored in the superblock. -If gdt\_csum is set and metadata\_csum is not set, the block group +If gdt_csum is set and metadata_csum is not set, the block group checksum is the crc16 of the FS UUID, the group number, and the group -descriptor structure. If metadata\_csum is set, then the block group +descriptor structure. If metadata_csum is set, then the block group checksum is the lower 16 bits of the checksum of the FS UUID, the group number, and the group descriptor structure. Both block and inode bitmap checksums are calculated against the FS UUID, the group number, and the @@ -51,59 +51,59 @@ The block group descriptor is laid out in ``struct ext4_group_desc``. - Name - Description * - 0x0 - - \_\_le32 - - bg\_block\_bitmap\_lo + - __le32 + - bg_block_bitmap_lo - Lower 32-bits of location of block bitmap. * - 0x4 - - \_\_le32 - - bg\_inode\_bitmap\_lo + - __le32 + - bg_inode_bitmap_lo - Lower 32-bits of location of inode bitmap. * - 0x8 - - \_\_le32 - - bg\_inode\_table\_lo + - __le32 + - bg_inode_table_lo - Lower 32-bits of location of inode table. * - 0xC - - \_\_le16 - - bg\_free\_blocks\_count\_lo + - __le16 + - bg_free_blocks_count_lo - Lower 16-bits of free block count. * - 0xE - - \_\_le16 - - bg\_free\_inodes\_count\_lo + - __le16 + - bg_free_inodes_count_lo - Lower 16-bits of free inode count. * - 0x10 - - \_\_le16 - - bg\_used\_dirs\_count\_lo + - __le16 + - bg_used_dirs_count_lo - Lower 16-bits of directory count. * - 0x12 - - \_\_le16 - - bg\_flags + - __le16 + - bg_flags - Block group flags. See the bgflags_ table below. * - 0x14 - - \_\_le32 - - bg\_exclude\_bitmap\_lo + - __le32 + - bg_exclude_bitmap_lo - Lower 32-bits of location of snapshot exclusion bitmap. * - 0x18 - - \_\_le16 - - bg\_block\_bitmap\_csum\_lo + - __le16 + - bg_block_bitmap_csum_lo - Lower 16-bits of the block bitmap checksum. * - 0x1A - - \_\_le16 - - bg\_inode\_bitmap\_csum\_lo + - __le16 + - bg_inode_bitmap_csum_lo - Lower 16-bits of the inode bitmap checksum. * - 0x1C - - \_\_le16 - - bg\_itable\_unused\_lo + - __le16 + - bg_itable_unused_lo - Lower 16-bits of unused inode count. If set, we needn't scan past the - ``(sb.s_inodes_per_group - gdt.bg_itable_unused)``\ th entry in the + ``(sb.s_inodes_per_group - gdt.bg_itable_unused)`` th entry in the inode table for this group. * - 0x1E - - \_\_le16 - - bg\_checksum - - Group descriptor checksum; crc16(sb\_uuid+group\_num+bg\_desc) if the - RO\_COMPAT\_GDT\_CSUM feature is set, or - crc32c(sb\_uuid+group\_num+bg\_desc) & 0xFFFF if the - RO\_COMPAT\_METADATA\_CSUM feature is set. The bg\_checksum - field in bg\_desc is skipped when calculating crc16 checksum, + - __le16 + - bg_checksum + - Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the + RO_COMPAT_GDT_CSUM feature is set, or + crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the + RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum + field in bg_desc is skipped when calculating crc16 checksum, and set to zero if crc32c checksum is used. * - - @@ -111,48 +111,48 @@ The block group descriptor is laid out in ``struct ext4_group_desc``. - These fields only exist if the 64bit feature is enabled and s_desc_size > 32. * - 0x20 - - \_\_le32 - - bg\_block\_bitmap\_hi + - __le32 + - bg_block_bitmap_hi - Upper 32-bits of location of block bitmap. * - 0x24 - - \_\_le32 - - bg\_inode\_bitmap\_hi + - __le32 + - bg_inode_bitmap_hi - Upper 32-bits of location of inodes bitmap. * - 0x28 - - \_\_le32 - - bg\_inode\_table\_hi + - __le32 + - bg_inode_table_hi - Upper 32-bits of location of inodes table. * - 0x2C - - \_\_le16 - - bg\_free\_blocks\_count\_hi + - __le16 + - bg_free_blocks_count_hi - Upper 16-bits of free block count. * - 0x2E - - \_\_le16 - - bg\_free\_inodes\_count\_hi + - __le16 + - bg_free_inodes_count_hi - Upper 16-bits of free inode count. * - 0x30 - - \_\_le16 - - bg\_used\_dirs\_count\_hi + - __le16 + - bg_used_dirs_count_hi - Upper 16-bits of directory count. * - 0x32 - - \_\_le16 - - bg\_itable\_unused\_hi + - __le16 + - bg_itable_unused_hi - Upper 16-bits of unused inode count. * - 0x34 - - \_\_le32 - - bg\_exclude\_bitmap\_hi + - __le32 + - bg_exclude_bitmap_hi - Upper 32-bits of location of snapshot exclusion bitmap. * - 0x38 - - \_\_le16 - - bg\_block\_bitmap\_csum\_hi + - __le16 + - bg_block_bitmap_csum_hi - Upper 16-bits of the block bitmap checksum. * - 0x3A - - \_\_le16 - - bg\_inode\_bitmap\_csum\_hi + - __le16 + - bg_inode_bitmap_csum_hi - Upper 16-bits of the inode bitmap checksum. * - 0x3C - - \_\_u32 - - bg\_reserved + - __u32 + - bg_reserved - Padding to 64 bytes. .. _bgflags: @@ -166,8 +166,8 @@ Block group flags can be any combination of the following: * - Value - Description * - 0x1 - - inode table and bitmap are not initialized (EXT4\_BG\_INODE\_UNINIT). + - inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT). * - 0x2 - - block bitmap is not initialized (EXT4\_BG\_BLOCK\_UNINIT). + - block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT). * - 0x4 - - inode table is zeroed (EXT4\_BG\_INODE\_ZEROED). + - inode table is zeroed (EXT4_BG_INODE_ZEROED). diff --git a/Documentation/filesystems/ext4/ifork.rst b/Documentation/filesystems/ext4/ifork.rst index b9816d5a896b..dc31f505e6c8 100644 --- a/Documentation/filesystems/ext4/ifork.rst +++ b/Documentation/filesystems/ext4/ifork.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -The Contents of inode.i\_block +The Contents of inode.i_block ------------------------------ Depending on the type of file an inode describes, the 60 bytes of @@ -47,7 +47,7 @@ In ext4, the file to logical block map has been replaced with an extent tree. Under the old scheme, allocating a contiguous run of 1,000 blocks requires an indirect block to map all 1,000 entries; with extents, the mapping is reduced to a single ``struct ext4_extent`` with -``ee_len = 1000``. If flex\_bg is enabled, it is possible to allocate +``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate very large files with a single extent, at a considerable reduction in metadata block use, and some improvement in disk efficiency. The inode must have the extents flag (0x80000) flag set for this feature to be in @@ -76,28 +76,28 @@ which is 12 bytes long: - Name - Description * - 0x0 - - \_\_le16 - - eh\_magic + - __le16 + - eh_magic - Magic number, 0xF30A. * - 0x2 - - \_\_le16 - - eh\_entries + - __le16 + - eh_entries - Number of valid entries following the header. * - 0x4 - - \_\_le16 - - eh\_max + - __le16 + - eh_max - Maximum number of entries that could follow the header. * - 0x6 - - \_\_le16 - - eh\_depth + - __le16 + - eh_depth - Depth of this extent node in the extent tree. 0 = this extent node points to data blocks; otherwise, this extent node points to other extent nodes. The extent tree can be at most 5 levels deep: a logical block number can be at most ``2^32``, and the smallest ``n`` that satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5. * - 0x8 - - \_\_le32 - - eh\_generation + - __le32 + - eh_generation - Generation of the tree. (Used by Lustre, but not standard ext4). Internal nodes of the extent tree, also known as index nodes, are @@ -112,22 +112,22 @@ recorded as ``struct ext4_extent_idx``, and are 12 bytes long: - Name - Description * - 0x0 - - \_\_le32 - - ei\_block + - __le32 + - ei_block - This index node covers file blocks from 'block' onward. * - 0x4 - - \_\_le32 - - ei\_leaf\_lo + - __le32 + - ei_leaf_lo - Lower 32-bits of the block number of the extent node that is the next level lower in the tree. The tree node pointed to can be either another internal node or a leaf node, described below. * - 0x8 - - \_\_le16 - - ei\_leaf\_hi + - __le16 + - ei_leaf_hi - Upper 16-bits of the previous field. * - 0xA - - \_\_u16 - - ei\_unused + - __u16 + - ei_unused - Leaf nodes of the extent tree are recorded as ``struct ext4_extent``, @@ -142,24 +142,24 @@ and are also 12 bytes long: - Name - Description * - 0x0 - - \_\_le32 - - ee\_block + - __le32 + - ee_block - First file block number that this extent covers. * - 0x4 - - \_\_le16 - - ee\_len + - __le16 + - ee_len - Number of blocks covered by extent. If the value of this field is <= 32768, the extent is initialized. If the value of the field is > 32768, the extent is uninitialized and the actual extent length is ``ee_len`` - 32768. Therefore, the maximum length of a initialized extent is 32768 blocks, and the maximum length of an uninitialized extent is 32767. * - 0x6 - - \_\_le16 - - ee\_start\_hi + - __le16 + - ee_start_hi - Upper 16-bits of the block number to which this extent points. * - 0x8 - - \_\_le32 - - ee\_start\_lo + - __le32 + - ee_start_lo - Lower 32-bits of the block number to which this extent points. Prior to the introduction of metadata checksums, the extent header + @@ -182,8 +182,8 @@ including) the checksum itself. - Name - Description * - 0x0 - - \_\_le32 - - eb\_checksum + - __le32 + - eb_checksum - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) Inline Data diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst index d1075178ce0b..a728af0d2fd0 100644 --- a/Documentation/filesystems/ext4/inlinedata.rst +++ b/Documentation/filesystems/ext4/inlinedata.rst @@ -11,12 +11,12 @@ file is smaller than 60 bytes, then the data are stored inline in attribute space, then it might be found as an extended attribute “system.data” within the inode body (“ibody EA”). This of course constrains the amount of extended attributes one can attach to an inode. -If the data size increases beyond i\_block + ibody EA, a regular block +If the data size increases beyond i_block + ibody EA, a regular block is allocated and the contents moved to that block. Pending a change to compact the extended attribute key used to store inline data, one ought to be able to store 160 bytes of data in a -256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to +256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to that, the limit was 156 bytes due to inefficient use of inode space. The inline data feature requires the presence of an extended attribute @@ -25,12 +25,12 @@ for “system.data”, even if the attribute value is zero length. Inline Directories ~~~~~~~~~~~~~~~~~~ -The first four bytes of i\_block are the inode number of the parent +The first four bytes of i_block are the inode number of the parent directory. Following that is a 56-byte space for an array of directory entries; see ``struct ext4_dir_entry``. If there is a “system.data” attribute in the inode body, the EA value is an array of ``struct ext4_dir_entry`` as well. Note that for inline directories, the -i\_block and EA space are treated as separate dirent blocks; directory +i_block and EA space are treated as separate dirent blocks; directory entries cannot span the two. Inline directory entries are not checksummed, as the inode checksum diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst index 6c5ce666e63f..cfc6c1659931 100644 --- a/Documentation/filesystems/ext4/inodes.rst +++ b/Documentation/filesystems/ext4/inodes.rst @@ -38,138 +38,138 @@ The inode table entry is laid out in ``struct ext4_inode``. - Name - Description * - 0x0 - - \_\_le16 - - i\_mode + - __le16 + - i_mode - File mode. See the table i_mode_ below. * - 0x2 - - \_\_le16 - - i\_uid + - __le16 + - i_uid - Lower 16-bits of Owner UID. * - 0x4 - - \_\_le32 - - i\_size\_lo + - __le32 + - i_size_lo - Lower 32-bits of size in bytes. * - 0x8 - - \_\_le32 - - i\_atime - - Last access time, in seconds since the epoch. However, if the EA\_INODE + - __le32 + - i_atime + - Last access time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the checksum of the value. * - 0xC - - \_\_le32 - - i\_ctime + - __le32 + - i_ctime - Last inode change time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute + EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the lower 32 bits of the attribute value's reference count. * - 0x10 - - \_\_le32 - - i\_mtime + - __le32 + - i_mtime - Last data modification time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute + EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the number of the inode that owns the extended attribute. * - 0x14 - - \_\_le32 - - i\_dtime + - __le32 + - i_dtime - Deletion Time, in seconds since the epoch. * - 0x18 - - \_\_le16 - - i\_gid + - __le16 + - i_gid - Lower 16-bits of GID. * - 0x1A - - \_\_le16 - - i\_links\_count + - __le16 + - i_links_count - Hard link count. Normally, ext4 does not permit an inode to have more than 65,000 hard links. This applies to files as well as directories, which means that there cannot be more than 64,998 subdirectories in a directory (each subdirectory's '..' entry counts as a hard link, as does - the '.' entry in the directory itself). With the DIR\_NLINK feature + the '.' entry in the directory itself). With the DIR_NLINK feature enabled, ext4 supports more than 64,998 subdirectories by setting this field to 1 to indicate that the number of hard links is not known. * - 0x1C - - \_\_le32 - - i\_blocks\_lo - - Lower 32-bits of “block” count. If the huge\_file feature flag is not + - __le32 + - i_blocks_lo + - Lower 32-bits of “block” count. If the huge_file feature flag is not set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks - on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in + on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi - << 32)`` 512-byte blocks on disk. If huge\_file is set and - EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file + << 32)`` 512-byte blocks on disk. If huge_file is set and + EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on disk. * - 0x20 - - \_\_le32 - - i\_flags + - __le32 + - i_flags - Inode flags. See the table i_flags_ below. * - 0x24 - 4 bytes - - i\_osd1 + - i_osd1 - See the table i_osd1_ for more details. * - 0x28 - 60 bytes - - i\_block[EXT4\_N\_BLOCKS=15] - - Block map or extent tree. See the section “The Contents of inode.i\_block”. + - i_block[EXT4_N_BLOCKS=15] + - Block map or extent tree. See the section “The Contents of inode.i_block”. * - 0x64 - - \_\_le32 - - i\_generation + - __le32 + - i_generation - File version (for NFS). * - 0x68 - - \_\_le32 - - i\_file\_acl\_lo + - __le32 + - i_file_acl_lo - Lower 32-bits of extended attribute block. ACLs are of course one of many possible extended attributes; I think the name of this field is a result of the first use of extended attributes being for ACLs. * - 0x6C - - \_\_le32 - - i\_size\_high / i\_dir\_acl + - __le32 + - i_size_high / i_dir_acl - Upper 32-bits of file/directory size. In ext2/3 this field was named - i\_dir\_acl, though it was usually set to zero and never used. + i_dir_acl, though it was usually set to zero and never used. * - 0x70 - - \_\_le32 - - i\_obso\_faddr + - __le32 + - i_obso_faddr - (Obsolete) fragment address. * - 0x74 - 12 bytes - - i\_osd2 + - i_osd2 - See the table i_osd2_ for more details. * - 0x80 - - \_\_le16 - - i\_extra\_isize + - __le16 + - i_extra_isize - Size of this inode - 128. Alternately, the size of the extended inode fields beyond the original ext2 inode, including this field. * - 0x82 - - \_\_le16 - - i\_checksum\_hi + - __le16 + - i_checksum_hi - Upper 16-bits of the inode checksum. * - 0x84 - - \_\_le32 - - i\_ctime\_extra + - __le32 + - i_ctime_extra - Extra change time bits. This provides sub-second precision. See Inode Timestamps section. * - 0x88 - - \_\_le32 - - i\_mtime\_extra + - __le32 + - i_mtime_extra - Extra modification time bits. This provides sub-second precision. * - 0x8C - - \_\_le32 - - i\_atime\_extra + - __le32 + - i_atime_extra - Extra access time bits. This provides sub-second precision. * - 0x90 - - \_\_le32 - - i\_crtime + - __le32 + - i_crtime - File creation time, in seconds since the epoch. * - 0x94 - - \_\_le32 - - i\_crtime\_extra + - __le32 + - i_crtime_extra - Extra file creation time bits. This provides sub-second precision. * - 0x98 - - \_\_le32 - - i\_version\_hi + - __le32 + - i_version_hi - Upper 32-bits for version number. * - 0x9C - - \_\_le32 - - i\_projid + - __le32 + - i_projid - Project ID. .. _i_mode: @@ -183,45 +183,45 @@ The ``i_mode`` value is a combination of the following flags: * - Value - Description * - 0x1 - - S\_IXOTH (Others may execute) + - S_IXOTH (Others may execute) * - 0x2 - - S\_IWOTH (Others may write) + - S_IWOTH (Others may write) * - 0x4 - - S\_IROTH (Others may read) + - S_IROTH (Others may read) * - 0x8 - - S\_IXGRP (Group members may execute) + - S_IXGRP (Group members may execute) * - 0x10 - - S\_IWGRP (Group members may write) + - S_IWGRP (Group members may write) * - 0x20 - - S\_IRGRP (Group members may read) + - S_IRGRP (Group members may read) * - 0x40 - - S\_IXUSR (Owner may execute) + - S_IXUSR (Owner may execute) * - 0x80 - - S\_IWUSR (Owner may write) + - S_IWUSR (Owner may write) * - 0x100 - - S\_IRUSR (Owner may read) + - S_IRUSR (Owner may read) * - 0x200 - - S\_ISVTX (Sticky bit) + - S_ISVTX (Sticky bit) * - 0x400 - - S\_ISGID (Set GID) + - S_ISGID (Set GID) * - 0x800 - - S\_ISUID (Set UID) + - S_ISUID (Set UID) * - - These are mutually-exclusive file types: * - 0x1000 - - S\_IFIFO (FIFO) + - S_IFIFO (FIFO) * - 0x2000 - - S\_IFCHR (Character device) + - S_IFCHR (Character device) * - 0x4000 - - S\_IFDIR (Directory) + - S_IFDIR (Directory) * - 0x6000 - - S\_IFBLK (Block device) + - S_IFBLK (Block device) * - 0x8000 - - S\_IFREG (Regular file) + - S_IFREG (Regular file) * - 0xA000 - - S\_IFLNK (Symbolic link) + - S_IFLNK (Symbolic link) * - 0xC000 - - S\_IFSOCK (Socket) + - S_IFSOCK (Socket) .. _i_flags: @@ -234,56 +234,56 @@ The ``i_flags`` field is a combination of these values: * - Value - Description * - 0x1 - - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented) + - This file requires secure deletion (EXT4_SECRM_FL). (not implemented) * - 0x2 - This file should be preserved, should undeletion be desired - (EXT4\_UNRM\_FL). (not implemented) + (EXT4_UNRM_FL). (not implemented) * - 0x4 - - File is compressed (EXT4\_COMPR\_FL). (not really implemented) + - File is compressed (EXT4_COMPR_FL). (not really implemented) * - 0x8 - - All writes to the file must be synchronous (EXT4\_SYNC\_FL). + - All writes to the file must be synchronous (EXT4_SYNC_FL). * - 0x10 - - File is immutable (EXT4\_IMMUTABLE\_FL). + - File is immutable (EXT4_IMMUTABLE_FL). * - 0x20 - - File can only be appended (EXT4\_APPEND\_FL). + - File can only be appended (EXT4_APPEND_FL). * - 0x40 - - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL). + - The dump(1) utility should not dump this file (EXT4_NODUMP_FL). * - 0x80 - - Do not update access time (EXT4\_NOATIME\_FL). + - Do not update access time (EXT4_NOATIME_FL). * - 0x100 - - Dirty compressed file (EXT4\_DIRTY\_FL). (not used) + - Dirty compressed file (EXT4_DIRTY_FL). (not used) * - 0x200 - - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used) + - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used) * - 0x400 - - Do not compress file (EXT4\_NOCOMPR\_FL). (not used) + - Do not compress file (EXT4_NOCOMPR_FL). (not used) * - 0x800 - - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was - EXT4\_ECOMPR\_FL (compression error), which was never used. + - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was + EXT4_ECOMPR_FL (compression error), which was never used. * - 0x1000 - - Directory has hashed indexes (EXT4\_INDEX\_FL). + - Directory has hashed indexes (EXT4_INDEX_FL). * - 0x2000 - - AFS magic directory (EXT4\_IMAGIC\_FL). + - AFS magic directory (EXT4_IMAGIC_FL). * - 0x4000 - File data must always be written through the journal - (EXT4\_JOURNAL\_DATA\_FL). + (EXT4_JOURNAL_DATA_FL). * - 0x8000 - - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4) + - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) * - 0x10000 - All directory entry data should be written synchronously (see - ``dirsync``) (EXT4\_DIRSYNC\_FL). + ``dirsync``) (EXT4_DIRSYNC_FL). * - 0x20000 - - Top of directory hierarchy (EXT4\_TOPDIR\_FL). + - Top of directory hierarchy (EXT4_TOPDIR_FL). * - 0x40000 - - This is a huge file (EXT4\_HUGE\_FILE\_FL). + - This is a huge file (EXT4_HUGE_FILE_FL). * - 0x80000 - - Inode uses extents (EXT4\_EXTENTS\_FL). + - Inode uses extents (EXT4_EXTENTS_FL). * - 0x100000 - - Verity protected file (EXT4\_VERITY\_FL). + - Verity protected file (EXT4_VERITY_FL). * - 0x200000 - Inode stores a large extended attribute value in its data blocks - (EXT4\_EA\_INODE\_FL). + (EXT4_EA_INODE_FL). * - 0x400000 - - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL). + - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). (deprecated) * - 0x01000000 - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) @@ -294,21 +294,21 @@ The ``i_flags`` field is a combination of these values: - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in mainline) * - 0x10000000 - - Inode has inline data (EXT4\_INLINE\_DATA\_FL). + - Inode has inline data (EXT4_INLINE_DATA_FL). * - 0x20000000 - - Create children with the same project ID (EXT4\_PROJINHERIT\_FL). + - Create children with the same project ID (EXT4_PROJINHERIT_FL). * - 0x80000000 - - Reserved for ext4 library (EXT4\_RESERVED\_FL). + - Reserved for ext4 library (EXT4_RESERVED_FL). * - - Aggregate flags: * - 0x705BDFFF - User-visible flags. * - 0x604BC0FF - - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and - EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's - EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of + - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and + EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's + EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of these flags in a special manner and they are masked out of the set of - flags that are saved directly to i\_flags. + flags that are saved directly to i_flags. .. _i_osd1: @@ -325,9 +325,9 @@ Linux: - Name - Description * - 0x0 - - \_\_le32 - - l\_i\_version - - Inode version. However, if the EA\_INODE inode flag is set, this inode + - __le32 + - l_i_version + - Inode version. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the upper 32 bits of the attribute value's reference count. @@ -342,8 +342,8 @@ Hurd: - Name - Description * - 0x0 - - \_\_le32 - - h\_i\_translator + - __le32 + - h_i_translator - ?? Masix: @@ -357,8 +357,8 @@ Masix: - Name - Description * - 0x0 - - \_\_le32 - - m\_i\_reserved + - __le32 + - m_i_reserved - ?? .. _i_osd2: @@ -376,30 +376,30 @@ Linux: - Name - Description * - 0x0 - - \_\_le16 - - l\_i\_blocks\_high + - __le16 + - l_i_blocks_high - Upper 16-bits of the block count. Please see the note attached to - i\_blocks\_lo. + i_blocks_lo. * - 0x2 - - \_\_le16 - - l\_i\_file\_acl\_high + - __le16 + - l_i_file_acl_high - Upper 16-bits of the extended attribute block (historically, the file ACL location). See the Extended Attributes section below. * - 0x4 - - \_\_le16 - - l\_i\_uid\_high + - __le16 + - l_i_uid_high - Upper 16-bits of the Owner UID. * - 0x6 - - \_\_le16 - - l\_i\_gid\_high + - __le16 + - l_i_gid_high - Upper 16-bits of the GID. * - 0x8 - - \_\_le16 - - l\_i\_checksum\_lo + - __le16 + - l_i_checksum_lo - Lower 16-bits of the inode checksum. * - 0xA - - \_\_le16 - - l\_i\_reserved + - __le16 + - l_i_reserved - Unused. Hurd: @@ -413,24 +413,24 @@ Hurd: - Name - Description * - 0x0 - - \_\_le16 - - h\_i\_reserved1 + - __le16 + - h_i_reserved1 - ?? * - 0x2 - - \_\_u16 - - h\_i\_mode\_high + - __u16 + - h_i_mode_high - Upper 16-bits of the file mode. * - 0x4 - - \_\_le16 - - h\_i\_uid\_high + - __le16 + - h_i_uid_high - Upper 16-bits of the Owner UID. * - 0x6 - - \_\_le16 - - h\_i\_gid\_high + - __le16 + - h_i_gid_high - Upper 16-bits of the GID. * - 0x8 - - \_\_u32 - - h\_i\_author + - __u32 + - h_i_author - Author code? Masix: @@ -444,17 +444,17 @@ Masix: - Name - Description * - 0x0 - - \_\_le16 - - h\_i\_reserved1 + - __le16 + - h_i_reserved1 - ?? * - 0x2 - - \_\_u16 - - m\_i\_file\_acl\_high + - __u16 + - m_i_file_acl_high - Upper 16-bits of the extended attribute block (historically, the file ACL location). * - 0x4 - - \_\_u32 - - m\_i\_reserved2[2] + - __u32 + - m_i_reserved2[2] - ?? Inode Size @@ -466,11 +466,11 @@ In ext2 and ext3, the inode structure size was fixed at 128 bytes on-disk inode at format time for all inodes in the filesystem to provide space beyond the end of the original ext2 inode. The on-disk inode record size is recorded in the superblock as ``s_inode_size``. The -number of bytes actually used by struct ext4\_inode beyond the original +number of bytes actually used by struct ext4_inode beyond the original 128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each -inode, which allows struct ext4\_inode to grow for a new kernel without +inode, which allows struct ext4_inode to grow for a new kernel without having to upgrade all of the on-disk inodes. Access to fields beyond -EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within +EXT2_GOOD_OLD_INODE_SIZE should be verified to be within ``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as of August 2019) the inode structure is 160 bytes (``i_extra_isize = 32``). The extra space between the end of the inode @@ -516,7 +516,7 @@ creation time (crtime); this field is 64-bits wide and decoded in the same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible through the regular stat() interface, though debugfs will report them. -We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)). +We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). In other words: .. list-table:: @@ -525,8 +525,8 @@ In other words: * - Extra epoch bits - MSB of 32-bit time - - Adjustment for signed 32-bit to 64-bit tv\_sec - - Decoded 64-bit tv\_sec + - Adjustment for signed 32-bit to 64-bit tv_sec + - Decoded 64-bit tv_sec - valid time range * - 0 0 - 1 diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst index 5fad38860f17..a6bef5293a60 100644 --- a/Documentation/filesystems/ext4/journal.rst +++ b/Documentation/filesystems/ext4/journal.rst @@ -63,8 +63,8 @@ Generally speaking, the journal has this format: :header-rows: 1 * - Superblock - - descriptor\_block (data\_blocks or revocation\_block) [more data or - revocations] commmit\_block + - descriptor_block (data_blocks or revocation_block) [more data or + revocations] commmit_block - [more transactions...] * - - One transaction @@ -93,8 +93,8 @@ superblock. * - 1024 bytes of padding - ext4 Superblock - Journal Superblock - - descriptor\_block (data\_blocks or revocation\_block) [more data or - revocations] commmit\_block + - descriptor_block (data_blocks or revocation_block) [more data or + revocations] commmit_block - [more transactions...] * - - @@ -117,17 +117,17 @@ Every block in the journal starts with a common 12-byte header - Name - Description * - 0x0 - - \_\_be32 - - h\_magic + - __be32 + - h_magic - jbd2 magic number, 0xC03B3998. * - 0x4 - - \_\_be32 - - h\_blocktype + - __be32 + - h_blocktype - Description of what this block contains. See the jbd2_blocktype_ table below. * - 0x8 - - \_\_be32 - - h\_sequence + - __be32 + - h_sequence - The transaction ID that goes with this block. .. _jbd2_blocktype: @@ -177,99 +177,99 @@ which is 1024 bytes long: - - Static information describing the journal. * - 0x0 - - journal\_header\_t (12 bytes) - - s\_header + - journal_header_t (12 bytes) + - s_header - Common header identifying this as a superblock. * - 0xC - - \_\_be32 - - s\_blocksize + - __be32 + - s_blocksize - Journal device block size. * - 0x10 - - \_\_be32 - - s\_maxlen + - __be32 + - s_maxlen - Total number of blocks in this journal. * - 0x14 - - \_\_be32 - - s\_first + - __be32 + - s_first - First block of log information. * - - - - Dynamic information describing the current state of the log. * - 0x18 - - \_\_be32 - - s\_sequence + - __be32 + - s_sequence - First commit ID expected in log. * - 0x1C - - \_\_be32 - - s\_start + - __be32 + - s_start - Block number of the start of log. Contrary to the comments, this field being zero does not imply that the journal is clean! * - 0x20 - - \_\_be32 - - s\_errno - - Error value, as set by jbd2\_journal\_abort(). + - __be32 + - s_errno + - Error value, as set by jbd2_journal_abort(). * - - - - The remaining fields are only valid in a v2 superblock. * - 0x24 - - \_\_be32 - - s\_feature\_compat; + - __be32 + - s_feature_compat; - Compatible feature set. See the table jbd2_compat_ below. * - 0x28 - - \_\_be32 - - s\_feature\_incompat + - __be32 + - s_feature_incompat - Incompatible feature set. See the table jbd2_incompat_ below. * - 0x2C - - \_\_be32 - - s\_feature\_ro\_compat + - __be32 + - s_feature_ro_compat - Read-only compatible feature set. There aren't any of these currently. * - 0x30 - - \_\_u8 - - s\_uuid[16] + - __u8 + - s_uuid[16] - 128-bit uuid for journal. This is compared against the copy in the ext4 super block at mount time. * - 0x40 - - \_\_be32 - - s\_nr\_users + - __be32 + - s_nr_users - Number of file systems sharing this journal. * - 0x44 - - \_\_be32 - - s\_dynsuper + - __be32 + - s_dynsuper - Location of dynamic super block copy. (Not used?) * - 0x48 - - \_\_be32 - - s\_max\_transaction + - __be32 + - s_max_transaction - Limit of journal blocks per transaction. (Not used?) * - 0x4C - - \_\_be32 - - s\_max\_trans\_data + - __be32 + - s_max_trans_data - Limit of data blocks per transaction. (Not used?) * - 0x50 - - \_\_u8 - - s\_checksum\_type + - __u8 + - s_checksum_type - Checksum algorithm used for the journal. See jbd2_checksum_type_ for more info. * - 0x51 - - \_\_u8[3] - - s\_padding2 + - __u8[3] + - s_padding2 - * - 0x54 - - \_\_be32 - - s\_num\_fc\_blocks + - __be32 + - s_num_fc_blocks - Number of fast commit blocks in the journal. * - 0x58 - - \_\_u32 - - s\_padding[42] + - __u32 + - s_padding[42] - * - 0xFC - - \_\_be32 - - s\_checksum + - __be32 + - s_checksum - Checksum of the entire superblock, with this field set to zero. * - 0x100 - - \_\_u8 - - s\_users[16\*48] + - __u8 + - s_users[16*48] - ids of all file systems sharing the log. e2fsprogs/Linux don't allow shared external journals, but I imagine Lustre (or ocfs2?), which use the jbd2 code, might. @@ -286,7 +286,7 @@ The journal compat features are any combination of the following: - Description * - 0x1 - Journal maintains checksums on the data blocks. - (JBD2\_FEATURE\_COMPAT\_CHECKSUM) + (JBD2_FEATURE_COMPAT_CHECKSUM) .. _jbd2_incompat: @@ -299,23 +299,23 @@ The journal incompat features are any combination of the following: * - Value - Description * - 0x1 - - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE) + - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE) * - 0x2 - Journal can deal with 64-bit block numbers. - (JBD2\_FEATURE\_INCOMPAT\_64BIT) + (JBD2_FEATURE_INCOMPAT_64BIT) * - 0x4 - - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT) + - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) * - 0x8 - This journal uses v2 of the checksum on-disk format. Each journal metadata block gets its own checksum, and the block tags in the descriptor table contain checksums for each of the data blocks in the - journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2) + journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2) * - 0x10 - This journal uses v3 of the checksum on-disk format. This is the same as v2, but the journal block tag size is fixed regardless of the size of - block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3) + block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3) * - 0x20 - - Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT) + - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) .. _jbd2_checksum_type: @@ -355,11 +355,11 @@ Descriptor blocks consume at least 36 bytes, but use a full block: - Name - Descriptor * - 0x0 - - journal\_header\_t + - journal_header_t - (open coded) - Common block header. * - 0xC - - struct journal\_block\_tag\_s + - struct journal_block_tag_s - open coded array[] - Enough tags either to fill up the block or to describe all the data blocks that follow this descriptor block. @@ -367,7 +367,7 @@ Descriptor blocks consume at least 36 bytes, but use a full block: Journal block tags have any of the following formats, depending on which journal feature and block tag flags are set. -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is +If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is defined as ``struct journal_block_tag3_s``, which looks like the following. The size is 16 or 32 bytes. @@ -380,24 +380,24 @@ following. The size is 16 or 32 bytes. - Name - Descriptor * - 0x0 - - \_\_be32 - - t\_blocknr + - __be32 + - t_blocknr - Lower 32-bits of the location of where the corresponding data block should end up on disk. * - 0x4 - - \_\_be32 - - t\_flags + - __be32 + - t_flags - Flags that go with the descriptor. See the table jbd2_tag_flags_ for more info. * - 0x8 - - \_\_be32 - - t\_blocknr\_high + - __be32 + - t_blocknr_high - Upper 32-bits of the location of where the corresponding data block - should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is + should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is not enabled. * - 0xC - - \_\_be32 - - t\_checksum + - __be32 + - t_checksum - Checksum of the journal UUID, the sequence number, and the data block. * - - @@ -433,7 +433,7 @@ The journal tag flags are any combination of the following: * - 0x8 - This is the last tag in this descriptor block. -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag +If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag is defined as ``struct journal_block_tag_s``, which looks like the following. The size is 8, 12, 24, or 28 bytes: @@ -446,18 +446,18 @@ following. The size is 8, 12, 24, or 28 bytes: - Name - Descriptor * - 0x0 - - \_\_be32 - - t\_blocknr + - __be32 + - t_blocknr - Lower 32-bits of the location of where the corresponding data block should end up on disk. * - 0x4 - - \_\_be16 - - t\_checksum + - __be16 + - t_checksum - Checksum of the journal UUID, the sequence number, and the data block. Note that only the lower 16 bits are stored. * - 0x6 - - \_\_be16 - - t\_flags + - __be16 + - t_flags - Flags that go with the descriptor. See the table jbd2_tag_flags_ for more info. * - @@ -466,8 +466,8 @@ following. The size is 8, 12, 24, or 28 bytes: - This next field is only present if the super block indicates support for 64-bit block numbers. * - 0x8 - - \_\_be32 - - t\_blocknr\_high + - __be32 + - t_blocknr_high - Upper 32-bits of the location of where the corresponding data block should end up on disk. * - @@ -483,8 +483,8 @@ following. The size is 8, 12, 24, or 28 bytes: ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that field. -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or -JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a +If JBD2_FEATURE_INCOMPAT_CSUM_V2 or +JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a ``struct jbd2_journal_block_tail``, which looks like this: .. list-table:: @@ -496,8 +496,8 @@ JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a - Name - Descriptor * - 0x0 - - \_\_be32 - - t\_checksum + - __be32 + - t_checksum - Checksum of the journal UUID + the descriptor block, with this field set to zero. @@ -538,25 +538,25 @@ length, but use a full block: - Name - Description * - 0x0 - - journal\_header\_t - - r\_header + - journal_header_t + - r_header - Common block header. * - 0xC - - \_\_be32 - - r\_count + - __be32 + - r_count - Number of bytes used in this block. * - 0x10 - - \_\_be32 or \_\_be64 + - __be32 or __be64 - blocks[0] - Blocks to revoke. -After r\_count is a linear array of block numbers that are effectively +After r_count is a linear array of block numbers that are effectively revoked by this transaction. The size of each block number is 8 bytes if the superblock advertises 64-bit block number support, or 4 bytes otherwise. -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or -JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation +If JBD2_FEATURE_INCOMPAT_CSUM_V2 or +JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation block is a ``struct jbd2_journal_revoke_tail``, which has this format: .. list-table:: @@ -568,8 +568,8 @@ block is a ``struct jbd2_journal_revoke_tail``, which has this format: - Name - Description * - 0x0 - - \_\_be32 - - r\_checksum + - __be32 + - r_checksum - Checksum of the journal UUID + revocation block Commit Block @@ -592,38 +592,38 @@ bytes long (but uses a full block): - Name - Descriptor * - 0x0 - - journal\_header\_s + - journal_header_s - (open coded) - Common block header. * - 0xC - unsigned char - - h\_chksum\_type + - h_chksum_type - The type of checksum to use to verify the integrity of the data blocks in the transaction. See jbd2_checksum_type_ for more info. * - 0xD - unsigned char - - h\_chksum\_size + - h_chksum_size - The number of bytes used by the checksum. Most likely 4. * - 0xE - unsigned char - - h\_padding[2] + - h_padding[2] - * - 0x10 - - \_\_be32 - - h\_chksum[JBD2\_CHECKSUM\_BYTES] + - __be32 + - h_chksum[JBD2_CHECKSUM_BYTES] - 32 bytes of space to store checksums. If - JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 + JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the first ``__be32`` is the checksum of the journal UUID and the entire commit block, with this field zeroed. If - JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the + JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the crc32 of all the blocks already written to the transaction. * - 0x30 - - \_\_be64 - - h\_commit\_sec + - __be64 + - h_commit_sec - The time that the transaction was committed, in seconds since the epoch. * - 0x38 - - \_\_be32 - - h\_commit\_nsec + - __be32 + - h_commit_nsec - Nanoseconds component of the above timestamp. Fast commits diff --git a/Documentation/filesystems/ext4/mmp.rst b/Documentation/filesystems/ext4/mmp.rst index 25660981d93c..174dd6538737 100644 --- a/Documentation/filesystems/ext4/mmp.rst +++ b/Documentation/filesystems/ext4/mmp.rst @@ -7,8 +7,8 @@ Multiple mount protection (MMP) is a feature that protects the filesystem against multiple hosts trying to use the filesystem simultaneously. When a filesystem is opened (for mounting, or fsck, etc.), the MMP code running on the node (call it node A) checks a -sequence number. If the sequence number is EXT4\_MMP\_SEQ\_CLEAN, the -open continues. If the sequence number is EXT4\_MMP\_SEQ\_FSCK, then +sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the +open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then fsck is (hopefully) running, and open fails immediately. Otherwise, the open code will wait for twice the specified MMP check interval and check the sequence number again. If the sequence number has changed, then the @@ -40,38 +40,38 @@ The MMP structure (``struct mmp_struct``) is as follows: - Name - Description * - 0x0 - - \_\_le32 - - mmp\_magic + - __le32 + - mmp_magic - Magic number for MMP, 0x004D4D50 (“MMP”). * - 0x4 - - \_\_le32 - - mmp\_seq + - __le32 + - mmp_seq - Sequence number, updated periodically. * - 0x8 - - \_\_le64 - - mmp\_time + - __le64 + - mmp_time - Time that the MMP block was last updated. * - 0x10 - char[64] - - mmp\_nodename + - mmp_nodename - Hostname of the node that opened the filesystem. * - 0x50 - char[32] - - mmp\_bdevname + - mmp_bdevname - Block device name of the filesystem. * - 0x70 - - \_\_le16 - - mmp\_check\_interval + - __le16 + - mmp_check_interval - The MMP re-check interval, in seconds. * - 0x72 - - \_\_le16 - - mmp\_pad1 + - __le16 + - mmp_pad1 - Zero. * - 0x74 - - \_\_le32[226] - - mmp\_pad2 + - __le32[226] + - mmp_pad2 - Zero. * - 0x3FC - - \_\_le32 - - mmp\_checksum + - __le32 + - mmp_checksum - Checksum of the MMP block. diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst index 123ebfde47ee..0fad6eda6e15 100644 --- a/Documentation/filesystems/ext4/overview.rst +++ b/Documentation/filesystems/ext4/overview.rst @@ -7,7 +7,7 @@ An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in -``sb.s_blocks_per_group`` blocks, though it can also calculated as 8 \* +``sb.s_blocks_per_group`` blocks, though it can also calculated as 8 * ``block_size_in_bytes``. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group. diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst index 94f304e3a0a7..fc0636901fa0 100644 --- a/Documentation/filesystems/ext4/special_inodes.rst +++ b/Documentation/filesystems/ext4/special_inodes.rst @@ -34,7 +34,7 @@ ext4 reserves some inode for special features, as follows: * - 10 - Replica inode, used for some non-upstream feature? * - 11 - - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock. + - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock. Note that there are also some inodes allocated from non-reserved inode numbers for other filesystem features which are not referenced from standard directory @@ -47,9 +47,9 @@ hierarchy. These are generally reference from the superblock. They are: * - Superblock field - Description - * - s\_lpf\_ino + * - s_lpf_ino - Inode number of lost+found directory. - * - s\_prj\_quota\_inum + * - s_prj_quota_inum - Inode number of quota file tracking project quotas - * - s\_orphan\_file\_inum + * - s_orphan_file_inum - Inode number of file tracking orphan inodes. diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst index f6a548e957bb..268888522e35 100644 --- a/Documentation/filesystems/ext4/super.rst +++ b/Documentation/filesystems/ext4/super.rst @@ -7,7 +7,7 @@ The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more. -If the sparse\_super feature flag is set, redundant copies of the +If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups. @@ -27,107 +27,107 @@ The ext4 superblock is laid out as follows in - Name - Description * - 0x0 - - \_\_le32 - - s\_inodes\_count + - __le32 + - s_inodes_count - Total inode count. * - 0x4 - - \_\_le32 - - s\_blocks\_count\_lo + - __le32 + - s_blocks_count_lo - Total block count. * - 0x8 - - \_\_le32 - - s\_r\_blocks\_count\_lo + - __le32 + - s_r_blocks_count_lo - This number of blocks can only be allocated by the super-user. * - 0xC - - \_\_le32 - - s\_free\_blocks\_count\_lo + - __le32 + - s_free_blocks_count_lo - Free block count. * - 0x10 - - \_\_le32 - - s\_free\_inodes\_count + - __le32 + - s_free_inodes_count - Free inode count. * - 0x14 - - \_\_le32 - - s\_first\_data\_block + - __le32 + - s_first_data_block - First data block. This must be at least 1 for 1k-block filesystems and is typically 0 for all other block sizes. * - 0x18 - - \_\_le32 - - s\_log\_block\_size - - Block size is 2 ^ (10 + s\_log\_block\_size). + - __le32 + - s_log_block_size + - Block size is 2 ^ (10 + s_log_block_size). * - 0x1C - - \_\_le32 - - s\_log\_cluster\_size - - Cluster size is 2 ^ (10 + s\_log\_cluster\_size) blocks if bigalloc is - enabled. Otherwise s\_log\_cluster\_size must equal s\_log\_block\_size. + - __le32 + - s_log_cluster_size + - Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is + enabled. Otherwise s_log_cluster_size must equal s_log_block_size. * - 0x20 - - \_\_le32 - - s\_blocks\_per\_group + - __le32 + - s_blocks_per_group - Blocks per group. * - 0x24 - - \_\_le32 - - s\_clusters\_per\_group + - __le32 + - s_clusters_per_group - Clusters per group, if bigalloc is enabled. Otherwise - s\_clusters\_per\_group must equal s\_blocks\_per\_group. + s_clusters_per_group must equal s_blocks_per_group. * - 0x28 - - \_\_le32 - - s\_inodes\_per\_group + - __le32 + - s_inodes_per_group - Inodes per group. * - 0x2C - - \_\_le32 - - s\_mtime + - __le32 + - s_mtime - Mount time, in seconds since the epoch. * - 0x30 - - \_\_le32 - - s\_wtime + - __le32 + - s_wtime - Write time, in seconds since the epoch. * - 0x34 - - \_\_le16 - - s\_mnt\_count + - __le16 + - s_mnt_count - Number of mounts since the last fsck. * - 0x36 - - \_\_le16 - - s\_max\_mnt\_count + - __le16 + - s_max_mnt_count - Number of mounts beyond which a fsck is needed. * - 0x38 - - \_\_le16 - - s\_magic + - __le16 + - s_magic - Magic signature, 0xEF53 * - 0x3A - - \_\_le16 - - s\_state + - __le16 + - s_state - File system state. See super_state_ for more info. * - 0x3C - - \_\_le16 - - s\_errors + - __le16 + - s_errors - Behaviour when detecting errors. See super_errors_ for more info. * - 0x3E - - \_\_le16 - - s\_minor\_rev\_level + - __le16 + - s_minor_rev_level - Minor revision level. * - 0x40 - - \_\_le32 - - s\_lastcheck + - __le32 + - s_lastcheck - Time of last check, in seconds since the epoch. * - 0x44 - - \_\_le32 - - s\_checkinterval + - __le32 + - s_checkinterval - Maximum time between checks, in seconds. * - 0x48 - - \_\_le32 - - s\_creator\_os + - __le32 + - s_creator_os - Creator OS. See the table super_creator_ for more info. * - 0x4C - - \_\_le32 - - s\_rev\_level + - __le32 + - s_rev_level - Revision level. See the table super_revision_ for more info. * - 0x50 - - \_\_le16 - - s\_def\_resuid + - __le16 + - s_def_resuid - Default uid for reserved blocks. * - 0x52 - - \_\_le16 - - s\_def\_resgid + - __le16 + - s_def_resgid - Default gid for reserved blocks. * - - @@ -143,50 +143,50 @@ The ext4 superblock is laid out as follows in about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn't understand... * - 0x54 - - \_\_le32 - - s\_first\_ino + - __le32 + - s_first_ino - First non-reserved inode. * - 0x58 - - \_\_le16 - - s\_inode\_size + - __le16 + - s_inode_size - Size of inode structure, in bytes. * - 0x5A - - \_\_le16 - - s\_block\_group\_nr + - __le16 + - s_block_group_nr - Block group # of this superblock. * - 0x5C - - \_\_le32 - - s\_feature\_compat + - __le32 + - s_feature_compat - Compatible feature set flags. Kernel can still read/write this fs even if it doesn't understand a flag; fsck should not do that. See the super_compat_ table for more info. * - 0x60 - - \_\_le32 - - s\_feature\_incompat + - __le32 + - s_feature_incompat - Incompatible feature set. If the kernel or fsck doesn't understand one of these bits, it should stop. See the super_incompat_ table for more info. * - 0x64 - - \_\_le32 - - s\_feature\_ro\_compat + - __le32 + - s_feature_ro_compat - Readonly-compatible feature set. If the kernel doesn't understand one of these bits, it can still mount read-only. See the super_rocompat_ table for more info. * - 0x68 - - \_\_u8 - - s\_uuid[16] + - __u8 + - s_uuid[16] - 128-bit UUID for volume. * - 0x78 - char - - s\_volume\_name[16] + - s_volume_name[16] - Volume label. * - 0x88 - char - - s\_last\_mounted[64] + - s_last_mounted[64] - Directory where filesystem was last mounted. * - 0xC8 - - \_\_le32 - - s\_algorithm\_usage\_bitmap + - __le32 + - s_algorithm_usage_bitmap - For compression (Not used in e2fsprogs/Linux) * - - @@ -194,18 +194,18 @@ The ext4 superblock is laid out as follows in - Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. * - 0xCC - - \_\_u8 - - s\_prealloc\_blocks + - __u8 + - s_prealloc_blocks - #. of blocks to try to preallocate for ... files? (Not used in e2fsprogs/Linux) * - 0xCD - - \_\_u8 - - s\_prealloc\_dir\_blocks + - __u8 + - s_prealloc_dir_blocks - #. of blocks to preallocate for directories. (Not used in e2fsprogs/Linux) * - 0xCE - - \_\_le16 - - s\_reserved\_gdt\_blocks + - __le16 + - s_reserved_gdt_blocks - Number of reserved GDT entries for future filesystem expansion. * - - @@ -213,281 +213,281 @@ The ext4 superblock is laid out as follows in - Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is set. * - 0xD0 - - \_\_u8 - - s\_journal\_uuid[16] + - __u8 + - s_journal_uuid[16] - UUID of journal superblock * - 0xE0 - - \_\_le32 - - s\_journal\_inum + - __le32 + - s_journal_inum - inode number of journal file. * - 0xE4 - - \_\_le32 - - s\_journal\_dev + - __le32 + - s_journal_dev - Device number of journal file, if the external journal feature flag is set. * - 0xE8 - - \_\_le32 - - s\_last\_orphan + - __le32 + - s_last_orphan - Start of list of orphaned inodes to delete. * - 0xEC - - \_\_le32 - - s\_hash\_seed[4] + - __le32 + - s_hash_seed[4] - HTREE hash seed. * - 0xFC - - \_\_u8 - - s\_def\_hash\_version + - __u8 + - s_def_hash_version - Default hash algorithm to use for directory hashes. See super_def_hash_ for more info. * - 0xFD - - \_\_u8 - - s\_jnl\_backup\_type - - If this value is 0 or EXT3\_JNL\_BACKUP\_BLOCKS (1), then the + - __u8 + - s_jnl_backup_type + - If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the ``s_jnl_blocks`` field contains a duplicate copy of the inode's ``i_block[]`` array and ``i_size``. * - 0xFE - - \_\_le16 - - s\_desc\_size + - __le16 + - s_desc_size - Size of group descriptors, in bytes, if the 64bit incompat feature flag is set. * - 0x100 - - \_\_le32 - - s\_default\_mount\_opts + - __le32 + - s_default_mount_opts - Default mount options. See the super_mountopts_ table for more info. * - 0x104 - - \_\_le32 - - s\_first\_meta\_bg - - First metablock block group, if the meta\_bg feature is enabled. + - __le32 + - s_first_meta_bg + - First metablock block group, if the meta_bg feature is enabled. * - 0x108 - - \_\_le32 - - s\_mkfs\_time + - __le32 + - s_mkfs_time - When the filesystem was created, in seconds since the epoch. * - 0x10C - - \_\_le32 - - s\_jnl\_blocks[17] + - __le32 + - s_jnl_blocks[17] - Backup copy of the journal inode's ``i_block[]`` array in the first 15 - elements and i\_size\_high and i\_size in the 16th and 17th elements, + elements and i_size_high and i_size in the 16th and 17th elements, respectively. * - - - - 64bit support is valid only if EXT4_FEATURE_COMPAT_64BIT is set. * - 0x150 - - \_\_le32 - - s\_blocks\_count\_hi + - __le32 + - s_blocks_count_hi - High 32-bits of the block count. * - 0x154 - - \_\_le32 - - s\_r\_blocks\_count\_hi + - __le32 + - s_r_blocks_count_hi - High 32-bits of the reserved block count. * - 0x158 - - \_\_le32 - - s\_free\_blocks\_count\_hi + - __le32 + - s_free_blocks_count_hi - High 32-bits of the free block count. * - 0x15C - - \_\_le16 - - s\_min\_extra\_isize + - __le16 + - s_min_extra_isize - All inodes have at least # bytes. * - 0x15E - - \_\_le16 - - s\_want\_extra\_isize + - __le16 + - s_want_extra_isize - New inodes should reserve # bytes. * - 0x160 - - \_\_le32 - - s\_flags + - __le32 + - s_flags - Miscellaneous flags. See the super_flags_ table for more info. * - 0x164 - - \_\_le16 - - s\_raid\_stride + - __le16 + - s_raid_stride - RAID stride. This is the number of logical blocks read from or written to the disk before moving to the next disk. This affects the placement of filesystem metadata, which will hopefully make RAID storage faster. * - 0x166 - - \_\_le16 - - s\_mmp\_interval + - __le16 + - s_mmp_interval - #. seconds to wait in multi-mount prevention (MMP) checking. In theory, MMP is a mechanism to record in the superblock which host and device have mounted the filesystem, in order to prevent multiple mounts. This feature does not seem to be implemented... * - 0x168 - - \_\_le64 - - s\_mmp\_block + - __le64 + - s_mmp_block - Block # for multi-mount protection data. * - 0x170 - - \_\_le32 - - s\_raid\_stripe\_width + - __le32 + - s_raid_stripe_width - RAID stripe width. This is the number of logical blocks read from or written to the disk before coming back to the current disk. This is used by the block allocator to try to reduce the number of read-modify-write operations in a RAID5/6. * - 0x174 - - \_\_u8 - - s\_log\_groups\_per\_flex + - __u8 + - s_log_groups_per_flex - Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``. * - 0x175 - - \_\_u8 - - s\_checksum\_type + - __u8 + - s_checksum_type - Metadata checksum algorithm type. The only valid value is 1 (crc32c). * - 0x176 - - \_\_le16 - - s\_reserved\_pad + - __le16 + - s_reserved_pad - * - 0x178 - - \_\_le64 - - s\_kbytes\_written + - __le64 + - s_kbytes_written - Number of KiB written to this filesystem over its lifetime. * - 0x180 - - \_\_le32 - - s\_snapshot\_inum + - __le32 + - s_snapshot_inum - inode number of active snapshot. (Not used in e2fsprogs/Linux.) * - 0x184 - - \_\_le32 - - s\_snapshot\_id + - __le32 + - s_snapshot_id - Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.) * - 0x188 - - \_\_le64 - - s\_snapshot\_r\_blocks\_count + - __le64 + - s_snapshot_r_blocks_count - Number of blocks reserved for active snapshot's future use. (Not used in e2fsprogs/Linux.) * - 0x190 - - \_\_le32 - - s\_snapshot\_list + - __le32 + - s_snapshot_list - inode number of the head of the on-disk snapshot list. (Not used in e2fsprogs/Linux.) * - 0x194 - - \_\_le32 - - s\_error\_count + - __le32 + - s_error_count - Number of errors seen. * - 0x198 - - \_\_le32 - - s\_first\_error\_time + - __le32 + - s_first_error_time - First time an error happened, in seconds since the epoch. * - 0x19C - - \_\_le32 - - s\_first\_error\_ino + - __le32 + - s_first_error_ino - inode involved in first error. * - 0x1A0 - - \_\_le64 - - s\_first\_error\_block + - __le64 + - s_first_error_block - Number of block involved of first error. * - 0x1A8 - - \_\_u8 - - s\_first\_error\_func[32] + - __u8 + - s_first_error_func[32] - Name of function where the error happened. * - 0x1C8 - - \_\_le32 - - s\_first\_error\_line + - __le32 + - s_first_error_line - Line number where error happened. * - 0x1CC - - \_\_le32 - - s\_last\_error\_time + - __le32 + - s_last_error_time - Time of most recent error, in seconds since the epoch. * - 0x1D0 - - \_\_le32 - - s\_last\_error\_ino + - __le32 + - s_last_error_ino - inode involved in most recent error. * - 0x1D4 - - \_\_le32 - - s\_last\_error\_line + - __le32 + - s_last_error_line - Line number where most recent error happened. * - 0x1D8 - - \_\_le64 - - s\_last\_error\_block + - __le64 + - s_last_error_block - Number of block involved in most recent error. * - 0x1E0 - - \_\_u8 - - s\_last\_error\_func[32] + - __u8 + - s_last_error_func[32] - Name of function where the most recent error happened. * - 0x200 - - \_\_u8 - - s\_mount\_opts[64] + - __u8 + - s_mount_opts[64] - ASCIIZ string of mount options. * - 0x240 - - \_\_le32 - - s\_usr\_quota\_inum + - __le32 + - s_usr_quota_inum - Inode number of user `quota `__ file. * - 0x244 - - \_\_le32 - - s\_grp\_quota\_inum + - __le32 + - s_grp_quota_inum - Inode number of group `quota `__ file. * - 0x248 - - \_\_le32 - - s\_overhead\_blocks + - __le32 + - s_overhead_blocks - Overhead blocks/clusters in fs. (Huh? This field is always zero, which means that the kernel calculates it dynamically.) * - 0x24C - - \_\_le32 - - s\_backup\_bgs[2] - - Block groups containing superblock backups (if sparse\_super2) + - __le32 + - s_backup_bgs[2] + - Block groups containing superblock backups (if sparse_super2) * - 0x254 - - \_\_u8 - - s\_encrypt\_algos[4] + - __u8 + - s_encrypt_algos[4] - Encryption algorithms in use. There can be up to four algorithms in use at any time; valid algorithm codes are given in the super_encrypt_ table below. * - 0x258 - - \_\_u8 - - s\_encrypt\_pw\_salt[16] + - __u8 + - s_encrypt_pw_salt[16] - Salt for the string2key algorithm for encryption. * - 0x268 - - \_\_le32 - - s\_lpf\_ino + - __le32 + - s_lpf_ino - Inode number of lost+found * - 0x26C - - \_\_le32 - - s\_prj\_quota\_inum + - __le32 + - s_prj_quota_inum - Inode that tracks project quotas. * - 0x270 - - \_\_le32 - - s\_checksum\_seed - - Checksum seed used for metadata\_csum calculations. This value is - crc32c(~0, $orig\_fs\_uuid). + - __le32 + - s_checksum_seed + - Checksum seed used for metadata_csum calculations. This value is + crc32c(~0, $orig_fs_uuid). * - 0x274 - - \_\_u8 - - s\_wtime_hi + - __u8 + - s_wtime_hi - Upper 8 bits of the s_wtime field. * - 0x275 - - \_\_u8 - - s\_mtime_hi + - __u8 + - s_mtime_hi - Upper 8 bits of the s_mtime field. * - 0x276 - - \_\_u8 - - s\_mkfs_time_hi + - __u8 + - s_mkfs_time_hi - Upper 8 bits of the s_mkfs_time field. * - 0x277 - - \_\_u8 - - s\_lastcheck_hi + - __u8 + - s_lastcheck_hi - Upper 8 bits of the s_lastcheck_hi field. * - 0x278 - - \_\_u8 - - s\_first_error_time_hi + - __u8 + - s_first_error_time_hi - Upper 8 bits of the s_first_error_time_hi field. * - 0x279 - - \_\_u8 - - s\_last_error_time_hi + - __u8 + - s_last_error_time_hi - Upper 8 bits of the s_last_error_time_hi field. * - 0x27A - - \_\_u8 - - s\_pad[2] + - __u8 + - s_pad[2] - Zero padding. * - 0x27C - - \_\_le16 - - s\_encoding + - __le16 + - s_encoding - Filename charset encoding. * - 0x27E - - \_\_le16 - - s\_encoding_flags + - __le16 + - s_encoding_flags - Filename charset encoding flags. * - 0x280 - - \_\_le32 - - s\_orphan\_file\_inum + - __le32 + - s_orphan_file_inum - Orphan file inode number. * - 0x284 - - \_\_le32 - - s\_reserved[94] + - __le32 + - s_reserved[94] - Padding to the end of the block. * - 0x3FC - - \_\_le32 - - s\_checksum + - __le32 + - s_checksum - Superblock checksum. .. _super_state: @@ -574,44 +574,44 @@ following: * - Value - Description * - 0x1 - - Directory preallocation (COMPAT\_DIR\_PREALLOC). + - Directory preallocation (COMPAT_DIR_PREALLOC). * - 0x2 - “imagic inodes”. Not clear from the code what this does - (COMPAT\_IMAGIC\_INODES). + (COMPAT_IMAGIC_INODES). * - 0x4 - - Has a journal (COMPAT\_HAS\_JOURNAL). + - Has a journal (COMPAT_HAS_JOURNAL). * - 0x8 - - Supports extended attributes (COMPAT\_EXT\_ATTR). + - Supports extended attributes (COMPAT_EXT_ATTR). * - 0x10 - Has reserved GDT blocks for filesystem expansion - (COMPAT\_RESIZE\_INODE). Requires RO\_COMPAT\_SPARSE\_SUPER. + (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER. * - 0x20 - - Has directory indices (COMPAT\_DIR\_INDEX). + - Has directory indices (COMPAT_DIR_INDEX). * - 0x40 - “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized - block groups? (COMPAT\_LAZY\_BG) + block groups? (COMPAT_LAZY_BG) * - 0x80 - - “Exclude inode”. Not used. (COMPAT\_EXCLUDE\_INODE). + - “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE). * - 0x100 - “Exclude bitmap”. Seems to be used to indicate the presence of snapshot-related exclude bitmaps? Not defined in kernel or used in - e2fsprogs (COMPAT\_EXCLUDE\_BITMAP). + e2fsprogs (COMPAT_EXCLUDE_BITMAP). * - 0x200 - - Sparse Super Block, v2. If this flag is set, the SB field s\_backup\_bgs + - Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs points to the two block groups that contain backup superblocks - (COMPAT\_SPARSE\_SUPER2). + (COMPAT_SPARSE_SUPER2). * - 0x400 - Fast commits supported. Although fast commits blocks are backward incompatible, fast commit blocks are not always present in the journal. If fast commit blocks are present in the journal, JBD2 incompat feature - (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT) gets - set (COMPAT\_FAST\_COMMIT). + (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets + set (COMPAT_FAST_COMMIT). * - 0x1000 - Orphan file allocated. This is the special file for more efficient tracking of unlinked but still open inodes. When there may be any entries in the file, we additionally set proper rocompat feature - (RO\_COMPAT\_ORPHAN\_PRESENT). + (RO_COMPAT_ORPHAN_PRESENT). .. _super_incompat: @@ -625,45 +625,45 @@ following: * - Value - Description * - 0x1 - - Compression (INCOMPAT\_COMPRESSION). + - Compression (INCOMPAT_COMPRESSION). * - 0x2 - - Directory entries record the file type. See ext4\_dir\_entry\_2 below - (INCOMPAT\_FILETYPE). + - Directory entries record the file type. See ext4_dir_entry_2 below + (INCOMPAT_FILETYPE). * - 0x4 - - Filesystem needs recovery (INCOMPAT\_RECOVER). + - Filesystem needs recovery (INCOMPAT_RECOVER). * - 0x8 - - Filesystem has a separate journal device (INCOMPAT\_JOURNAL\_DEV). + - Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV). * - 0x10 - Meta block groups. See the earlier discussion of this feature - (INCOMPAT\_META\_BG). + (INCOMPAT_META_BG). * - 0x40 - - Files in this filesystem use extents (INCOMPAT\_EXTENTS). + - Files in this filesystem use extents (INCOMPAT_EXTENTS). * - 0x80 - - Enable a filesystem size of 2^64 blocks (INCOMPAT\_64BIT). + - Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT). * - 0x100 - - Multiple mount protection (INCOMPAT\_MMP). + - Multiple mount protection (INCOMPAT_MMP). * - 0x200 - Flexible block groups. See the earlier discussion of this feature - (INCOMPAT\_FLEX\_BG). + (INCOMPAT_FLEX_BG). * - 0x400 - Inodes can be used to store large extended attribute values - (INCOMPAT\_EA\_INODE). + (INCOMPAT_EA_INODE). * - 0x1000 - - Data in directory entry (INCOMPAT\_DIRDATA). (Not implemented?) + - Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?) * - 0x2000 - Metadata checksum seed is stored in the superblock. This feature enables - the administrator to change the UUID of a metadata\_csum filesystem + the administrator to change the UUID of a metadata_csum filesystem while the filesystem is mounted; without it, the checksum definition - requires all metadata blocks to be rewritten (INCOMPAT\_CSUM\_SEED). + requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED). * - 0x4000 - - Large directory >2GB or 3-level htree (INCOMPAT\_LARGEDIR). Prior to + - Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to this feature, directories could not be larger than 4GiB and could not have an htree more than 2 levels deep. If this feature is enabled, directories can be larger than 4GiB and have a maximum htree depth of 3. * - 0x8000 - - Data in inode (INCOMPAT\_INLINE\_DATA). + - Data in inode (INCOMPAT_INLINE_DATA). * - 0x10000 - - Encrypted inodes are present on the filesystem. (INCOMPAT\_ENCRYPT). + - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT). .. _super_rocompat: @@ -678,54 +678,54 @@ the following: - Description * - 0x1 - Sparse superblocks. See the earlier discussion of this feature - (RO\_COMPAT\_SPARSE\_SUPER). + (RO_COMPAT_SPARSE_SUPER). * - 0x2 - This filesystem has been used to store a file greater than 2GiB - (RO\_COMPAT\_LARGE\_FILE). + (RO_COMPAT_LARGE_FILE). * - 0x4 - - Not used in kernel or e2fsprogs (RO\_COMPAT\_BTREE\_DIR). + - Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR). * - 0x8 - This filesystem has files whose sizes are represented in units of logical blocks, not 512-byte sectors. This implies a very large file - indeed! (RO\_COMPAT\_HUGE\_FILE) + indeed! (RO_COMPAT_HUGE_FILE) * - 0x10 - Group descriptors have checksums. In addition to detecting corruption, this is useful for lazy formatting with uninitialized groups - (RO\_COMPAT\_GDT\_CSUM). + (RO_COMPAT_GDT_CSUM). * - 0x20 - Indicates that the old ext3 32,000 subdirectory limit no longer applies - (RO\_COMPAT\_DIR\_NLINK). A directory's i\_links\_count will be set to 1 + (RO_COMPAT_DIR_NLINK). A directory's i_links_count will be set to 1 if it is incremented past 64,999. * - 0x40 - Indicates that large inodes exist on this filesystem - (RO\_COMPAT\_EXTRA\_ISIZE). + (RO_COMPAT_EXTRA_ISIZE). * - 0x80 - - This filesystem has a snapshot (RO\_COMPAT\_HAS\_SNAPSHOT). + - This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT). * - 0x100 - - `Quota `__ (RO\_COMPAT\_QUOTA). + - `Quota `__ (RO_COMPAT_QUOTA). * - 0x200 - This filesystem supports “bigalloc”, which means that file extents are tracked in units of clusters (of blocks) instead of blocks - (RO\_COMPAT\_BIGALLOC). + (RO_COMPAT_BIGALLOC). * - 0x400 - This filesystem supports metadata checksumming. - (RO\_COMPAT\_METADATA\_CSUM; implies RO\_COMPAT\_GDT\_CSUM, though - GDT\_CSUM must not be set) + (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though + GDT_CSUM must not be set) * - 0x800 - Filesystem supports replicas. This feature is neither in the kernel nor - e2fsprogs. (RO\_COMPAT\_REPLICA) + e2fsprogs. (RO_COMPAT_REPLICA) * - 0x1000 - Read-only filesystem image; the kernel will not mount this image read-write and most tools will refuse to write to the image. - (RO\_COMPAT\_READONLY) + (RO_COMPAT_READONLY) * - 0x2000 - - Filesystem tracks project quotas. (RO\_COMPAT\_PROJECT) + - Filesystem tracks project quotas. (RO_COMPAT_PROJECT) * - 0x8000 - - Verity inodes may be present on the filesystem. (RO\_COMPAT\_VERITY) + - Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY) * - 0x10000 - Indicates orphan file may have valid orphan entries and thus we need to clean them up when mounting the filesystem - (RO\_COMPAT\_ORPHAN\_PRESENT). + (RO_COMPAT_ORPHAN_PRESENT). .. _super_def_hash: @@ -761,36 +761,36 @@ The ``s_default_mount_opts`` field is any combination of the following: * - Value - Description * - 0x0001 - - Print debugging info upon (re)mount. (EXT4\_DEFM\_DEBUG) + - Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG) * - 0x0002 - New files take the gid of the containing directory (instead of the fsgid - of the current process). (EXT4\_DEFM\_BSDGROUPS) + of the current process). (EXT4_DEFM_BSDGROUPS) * - 0x0004 - - Support userspace-provided extended attributes. (EXT4\_DEFM\_XATTR\_USER) + - Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_USER) * - 0x0008 - - Support POSIX access control lists (ACLs). (EXT4\_DEFM\_ACL) + - Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL) * - 0x0010 - - Do not support 32-bit UIDs. (EXT4\_DEFM\_UID16) + - Do not support 32-bit UIDs. (EXT4_DEFM_UID16) * - 0x0020 - All data and metadata are commited to the journal. - (EXT4\_DEFM\_JMODE\_DATA) + (EXT4_DEFM_JMODE_DATA) * - 0x0040 - All data are flushed to the disk before metadata are committed to the - journal. (EXT4\_DEFM\_JMODE\_ORDERED) + journal. (EXT4_DEFM_JMODE_ORDERED) * - 0x0060 - Data ordering is not preserved; data may be written after the metadata - has been written. (EXT4\_DEFM\_JMODE\_WBACK) + has been written. (EXT4_DEFM_JMODE_WBACK) * - 0x0100 - - Disable write flushes. (EXT4\_DEFM\_NOBARRIER) + - Disable write flushes. (EXT4_DEFM_NOBARRIER) * - 0x0200 - Track which blocks in a filesystem are metadata and therefore should not be used as data blocks. This option will be enabled by default on 3.18, - hopefully. (EXT4\_DEFM\_BLOCK\_VALIDITY) + hopefully. (EXT4_DEFM_BLOCK_VALIDITY) * - 0x0400 - Enable DISCARD support, where the storage device is told about blocks - becoming unused. (EXT4\_DEFM\_DISCARD) + becoming unused. (EXT4_DEFM_DISCARD) * - 0x0800 - - Disable delayed allocation. (EXT4\_DEFM\_NODELALLOC) + - Disable delayed allocation. (EXT4_DEFM_NODELALLOC) .. _super_flags: @@ -820,12 +820,12 @@ The ``s_encrypt_algos`` list can contain any of the following: * - Value - Description * - 0 - - Invalid algorithm (ENCRYPTION\_MODE\_INVALID). + - Invalid algorithm (ENCRYPTION_MODE_INVALID). * - 1 - - 256-bit AES in XTS mode (ENCRYPTION\_MODE\_AES\_256\_XTS). + - 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS). * - 2 - - 256-bit AES in GCM mode (ENCRYPTION\_MODE\_AES\_256\_GCM). + - 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM). * - 3 - - 256-bit AES in CBC mode (ENCRYPTION\_MODE\_AES\_256\_CBC). + - 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC). Total size of the superblock is 1024 bytes. From 32fc810b364f3dd30930c594e461ffa1761fef39 Mon Sep 17 00:00:00 2001 From: Dylan Yudaken Date: Thu, 16 Jun 2022 06:50:11 -0700 Subject: [PATCH 213/298] io_uring: do not use prio task_work_add in uring_cmd io_req_task_prio_work_add has a strict assumption that it will only be used with io_req_task_complete. There is a codepath that assumes this is the case and will not even call the completion function if it is hit. For uring_cmd with an arbitrary completion function change the call to the correct non-priority version. Fixes: ee692a21e9bf8 ("fs,io_uring: add infrastructure for uring-cmd") Signed-off-by: Dylan Yudaken Reviewed-by: Pavel Begunkov Link: https://lore.kernel.org/r/20220616135011.441980-1-dylany@fb.com Signed-off-by: Jens Axboe --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index b6e75f69c6b1..95a1a78d799a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5017,7 +5017,7 @@ void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, req->uring_cmd.task_work_cb = task_work_cb; req->io_task_work.func = io_uring_cmd_work; - io_req_task_prio_work_add(req); + io_req_task_work_add(req); } EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); From 15baa7dcadf1c4f0b4f752dc054191855ff2d78e Mon Sep 17 00:00:00 2001 From: Zhang Yi Date: Fri, 20 May 2022 10:32:16 +0800 Subject: [PATCH 214/298] ext4: fix warning when submitting superblock in ext4_commit_super() We have already check the io_error and uptodate flag before submitting the superblock buffer, and re-set the uptodate flag if it has been failed to write out. But it was lockless and could be raced by another ext4_commit_super(), and finally trigger '!uptodate' WARNING when marking buffer dirty. Fix it by submit buffer directly. Reported-by: Hulk Robot Signed-off-by: Zhang Yi Reviewed-by: Jan Kara Reviewed-by: Ritesh Harjani Link: https://lore.kernel.org/r/20220520023216.3065073-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/super.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 450c918d68fc..b2ecae8adbfc 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5898,7 +5898,6 @@ static void ext4_update_super(struct super_block *sb) static int ext4_commit_super(struct super_block *sb) { struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; - int error = 0; if (!sbh) return -EINVAL; @@ -5907,6 +5906,13 @@ static int ext4_commit_super(struct super_block *sb) ext4_update_super(sb); + lock_buffer(sbh); + /* Buffer got discarded which means block device got invalidated */ + if (!buffer_mapped(sbh)) { + unlock_buffer(sbh); + return -EIO; + } + if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) { /* * Oh, dear. A previous attempt to write the @@ -5921,17 +5927,21 @@ static int ext4_commit_super(struct super_block *sb) clear_buffer_write_io_error(sbh); set_buffer_uptodate(sbh); } - BUFFER_TRACE(sbh, "marking dirty"); - mark_buffer_dirty(sbh); - error = __sync_dirty_buffer(sbh, - REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0)); + get_bh(sbh); + /* Clear potential dirty bit if it was journalled update */ + clear_buffer_dirty(sbh); + sbh->b_end_io = end_buffer_write_sync; + submit_bh(REQ_OP_WRITE, + REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0), sbh); + wait_on_buffer(sbh); if (buffer_write_io_error(sbh)) { ext4_msg(sb, KERN_ERR, "I/O error while writing " "superblock"); clear_buffer_write_io_error(sbh); set_buffer_uptodate(sbh); + return -EIO; } - return error; + return 0; } /* From 8d5459c11f548131ce48b2fbf45cccc5c382558f Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 20 May 2022 13:14:02 +0200 Subject: [PATCH 215/298] ext4: improve write performance with disabled delalloc When delayed allocation is disabled (either through mount option or because we are running low on free space), ext4_write_begin() allocates blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent merging is disabled and since ext4_write_begin() is called for each page separately, we end up with a *lot* of 1 block extents in the extent tree and following writeback is writing 1 block at a time which results in very poor write throughput (4 MB/s instead of 200 MB/s). These days when ext4_get_block_unwritten() is used only by ext4_write_begin(), ext4_page_mkwrite() and inline data conversion, we can safely allow extent merging to happen from these paths since following writeback will happen on different boundaries anyway. So use EXT4_GET_BLOCKS_CREATE_UNRIT_EXT instead which restores the performance. Signed-off-by: Jan Kara Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz Signed-off-by: Theodore Ts'o --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3dce7d058985..84c0eb55071d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -829,7 +829,7 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock, ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", inode->i_ino, create); return _ext4_get_block(inode, iblock, bh_result, - EXT4_GET_BLOCKS_IO_CREATE_EXT); + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); } /* Maximum number of blocks we map for direct IO at once. */ From 3f77a1d0570e62cfce8d472319df00008bbeab38 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Wed, 15 Jun 2022 20:15:04 +0100 Subject: [PATCH 216/298] arm64/cpufeature: Unexport set_cpu_feature() We currently export set_cpu_feature() to modules but there are no in tree users that can be built as modules and it is hard to see cases where it would make sense for there to be any such users. Remove the export to avoid anyone else having to worry about why it is there and ensure that any users that do get added get a bit more visiblity. Signed-off-by: Mark Brown Acked-by: Suzuki K Poulose Reviewed-by: Mark Rutland Link: https://lore.kernel.org/r/20220615191504.626604-1-broonie@kernel.org Signed-off-by: Catalin Marinas --- arch/arm64/kernel/cpufeature.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 42ea2bd856c6..d76fd95376f0 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -3109,7 +3109,6 @@ void cpu_set_feature(unsigned int num) WARN_ON(num >= MAX_CPU_FEATURES); elf_hwcap |= BIT(num); } -EXPORT_SYMBOL_GPL(cpu_set_feature); bool cpu_have_feature(unsigned int num) { From 593d1ebe00a45af5cb7bda1235c0790987c2a2b2 Mon Sep 17 00:00:00 2001 From: Joanne Koong Date: Wed, 15 Jun 2022 12:32:13 -0700 Subject: [PATCH 217/298] Revert "net: Add a second bind table hashed by port and address" This reverts: commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski --- include/net/inet_connection_sock.h | 3 - include/net/inet_hashtables.h | 68 +---- include/net/sock.h | 14 - net/dccp/proto.c | 33 +-- net/ipv4/inet_connection_sock.c | 247 +++++------------- net/ipv4/inet_hashtables.c | 193 +------------- net/ipv4/tcp.c | 14 +- tools/testing/selftests/net/.gitignore | 1 - tools/testing/selftests/net/Makefile | 2 - tools/testing/selftests/net/bind_bhash_test.c | 119 --------- 10 files changed, 83 insertions(+), 611 deletions(-) delete mode 100644 tools/testing/selftests/net/bind_bhash_test.c diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 077cd730ce2f..85cd695e7fd1 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -25,7 +25,6 @@ #undef INET_CSK_CLEAR_TIMERS struct inet_bind_bucket; -struct inet_bind2_bucket; struct tcp_congestion_ops; /* @@ -58,7 +57,6 @@ struct inet_connection_sock_af_ops { * * @icsk_accept_queue: FIFO of established children * @icsk_bind_hash: Bind node - * @icsk_bind2_hash: Bind node in the bhash2 table * @icsk_timeout: Timeout * @icsk_retransmit_timer: Resend (no ack) * @icsk_rto: Retransmit timeout @@ -85,7 +83,6 @@ struct inet_connection_sock { struct inet_sock icsk_inet; struct request_sock_queue icsk_accept_queue; struct inet_bind_bucket *icsk_bind_hash; - struct inet_bind2_bucket *icsk_bind2_hash; unsigned long icsk_timeout; struct timer_list icsk_retransmit_timer; struct timer_list icsk_delack_timer; diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index a0887b70967b..ebfa3df6f8dc 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -90,32 +90,11 @@ struct inet_bind_bucket { struct hlist_head owners; }; -struct inet_bind2_bucket { - possible_net_t ib_net; - int l3mdev; - unsigned short port; - union { -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr v6_rcv_saddr; -#endif - __be32 rcv_saddr; - }; - /* Node in the inet2_bind_hashbucket chain */ - struct hlist_node node; - /* List of sockets hashed to this bucket */ - struct hlist_head owners; -}; - static inline struct net *ib_net(struct inet_bind_bucket *ib) { return read_pnet(&ib->ib_net); } -static inline struct net *ib2_net(struct inet_bind2_bucket *ib) -{ - return read_pnet(&ib->ib_net); -} - #define inet_bind_bucket_for_each(tb, head) \ hlist_for_each_entry(tb, head, node) @@ -124,15 +103,6 @@ struct inet_bind_hashbucket { struct hlist_head chain; }; -/* This is synchronized using the inet_bind_hashbucket's spinlock. - * Instead of having separate spinlocks, the inet_bind2_hashbucket can share - * the inet_bind_hashbucket's given that in every case where the bhash2 table - * is useful, a lookup in the bhash table also occurs. - */ -struct inet_bind2_hashbucket { - struct hlist_head chain; -}; - /* Sockets can be hashed in established or listening table. * We must use different 'nulls' end-of-chain value for all hash buckets : * A socket might transition from ESTABLISH to LISTEN state without @@ -164,12 +134,6 @@ struct inet_hashinfo { */ struct kmem_cache *bind_bucket_cachep; struct inet_bind_hashbucket *bhash; - /* The 2nd binding table hashed by port and address. - * This is used primarily for expediting the resolution of bind - * conflicts. - */ - struct kmem_cache *bind2_bucket_cachep; - struct inet_bind2_hashbucket *bhash2; unsigned int bhash_size; /* The 2nd listener table hashed by local port and address */ @@ -229,36 +193,6 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net, void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb); -static inline bool check_bind_bucket_match(struct inet_bind_bucket *tb, - struct net *net, - const unsigned short port, - int l3mdev) -{ - return net_eq(ib_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev; -} - -struct inet_bind2_bucket * -inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net, - struct inet_bind2_hashbucket *head, - const unsigned short port, int l3mdev, - const struct sock *sk); - -void inet_bind2_bucket_destroy(struct kmem_cache *cachep, - struct inet_bind2_bucket *tb); - -struct inet_bind2_bucket * -inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, - const unsigned short port, int l3mdev, - struct sock *sk, - struct inet_bind2_hashbucket **head); - -bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, - struct net *net, - const unsigned short port, - int l3mdev, - const struct sock *sk); - static inline u32 inet_bhashfn(const struct net *net, const __u16 lport, const u32 bhash_size) { @@ -266,7 +200,7 @@ static inline u32 inet_bhashfn(const struct net *net, const __u16 lport, } void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, const unsigned short snum); + const unsigned short snum); /* Caller must disable local BH processing. */ int __inet_inherit_port(const struct sock *sk, struct sock *child); diff --git a/include/net/sock.h b/include/net/sock.h index c585ef6565d9..72ca97ccb460 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -348,7 +348,6 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference - * @sk_bind2_node: bind node in the bhash2 table */ struct sock { /* @@ -538,7 +537,6 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; - struct hlist_node sk_bind2_node; }; enum sk_pacing { @@ -819,16 +817,6 @@ static inline void sk_add_bind_node(struct sock *sk, hlist_add_head(&sk->sk_bind_node, list); } -static inline void __sk_del_bind2_node(struct sock *sk) -{ - __hlist_del(&sk->sk_bind2_node); -} - -static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list) -{ - hlist_add_head(&sk->sk_bind2_node, list); -} - #define sk_for_each(__sk, list) \ hlist_for_each_entry(__sk, list, sk_node) #define sk_for_each_rcu(__sk, list) \ @@ -846,8 +834,6 @@ static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list) hlist_for_each_entry_safe(__sk, tmp, list, sk_node) #define sk_for_each_bound(__sk, list) \ hlist_for_each_entry(__sk, list, sk_bind_node) -#define sk_for_each_bound_bhash2(__sk, list) \ - hlist_for_each_entry(__sk, list, sk_bind2_node) /** * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 2e78458900f2..eb8e128e43e8 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1120,12 +1120,6 @@ static int __init dccp_init(void) SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); if (!dccp_hashinfo.bind_bucket_cachep) goto out_free_hashinfo2; - dccp_hashinfo.bind2_bucket_cachep = - kmem_cache_create("dccp_bind2_bucket", - sizeof(struct inet_bind2_bucket), 0, - SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); - if (!dccp_hashinfo.bind2_bucket_cachep) - goto out_free_bind_bucket_cachep; /* * Size and allocate the main established and bind bucket @@ -1156,7 +1150,7 @@ static int __init dccp_init(void) if (!dccp_hashinfo.ehash) { DCCP_CRIT("Failed to allocate DCCP established hash table"); - goto out_free_bind2_bucket_cachep; + goto out_free_bind_bucket_cachep; } for (i = 0; i <= dccp_hashinfo.ehash_mask; i++) @@ -1182,23 +1176,14 @@ static int __init dccp_init(void) goto out_free_dccp_locks; } - dccp_hashinfo.bhash2 = (struct inet_bind2_hashbucket *) - __get_free_pages(GFP_ATOMIC | __GFP_NOWARN, bhash_order); - - if (!dccp_hashinfo.bhash2) { - DCCP_CRIT("Failed to allocate DCCP bind2 hash table"); - goto out_free_dccp_bhash; - } - for (i = 0; i < dccp_hashinfo.bhash_size; i++) { spin_lock_init(&dccp_hashinfo.bhash[i].lock); INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain); - INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain); } rc = dccp_mib_init(); if (rc) - goto out_free_dccp_bhash2; + goto out_free_dccp_bhash; rc = dccp_ackvec_init(); if (rc) @@ -1222,38 +1207,30 @@ out_ackvec_exit: dccp_ackvec_exit(); out_free_dccp_mib: dccp_mib_exit(); -out_free_dccp_bhash2: - free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order); out_free_dccp_bhash: free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order); out_free_dccp_locks: inet_ehash_locks_free(&dccp_hashinfo); out_free_dccp_ehash: free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order); -out_free_bind2_bucket_cachep: - kmem_cache_destroy(dccp_hashinfo.bind2_bucket_cachep); out_free_bind_bucket_cachep: kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep); out_free_hashinfo2: inet_hashinfo2_free_mod(&dccp_hashinfo); out_fail: dccp_hashinfo.bhash = NULL; - dccp_hashinfo.bhash2 = NULL; dccp_hashinfo.ehash = NULL; dccp_hashinfo.bind_bucket_cachep = NULL; - dccp_hashinfo.bind2_bucket_cachep = NULL; return rc; } static void __exit dccp_fini(void) { - int bhash_order = get_order(dccp_hashinfo.bhash_size * - sizeof(struct inet_bind_hashbucket)); - ccid_cleanup_builtins(); dccp_mib_exit(); - free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order); - free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order); + free_pages((unsigned long)dccp_hashinfo.bhash, + get_order(dccp_hashinfo.bhash_size * + sizeof(struct inet_bind_hashbucket))); free_pages((unsigned long)dccp_hashinfo.ehash, get_order((dccp_hashinfo.ehash_mask + 1) * sizeof(struct inet_ehash_bucket))); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index c0b7e6c21360..53f5f956d948 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -117,32 +117,6 @@ bool inet_rcv_saddr_any(const struct sock *sk) return !sk->sk_rcv_saddr; } -static bool use_bhash2_on_bind(const struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - int addr_type; - - if (sk->sk_family == AF_INET6) { - addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr); - return addr_type != IPV6_ADDR_ANY && - addr_type != IPV6_ADDR_MAPPED; - } -#endif - return sk->sk_rcv_saddr != htonl(INADDR_ANY); -} - -static u32 get_bhash2_nulladdr_hash(const struct sock *sk, struct net *net, - int port) -{ -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr nulladdr = {}; - - if (sk->sk_family == AF_INET6) - return ipv6_portaddr_hash(net, &nulladdr, port); -#endif - return ipv4_portaddr_hash(net, 0, port); -} - void inet_get_local_port_range(struct net *net, int *low, int *high) { unsigned int seq; @@ -156,71 +130,16 @@ void inet_get_local_port_range(struct net *net, int *low, int *high) } EXPORT_SYMBOL(inet_get_local_port_range); -static bool bind_conflict_exist(const struct sock *sk, struct sock *sk2, - kuid_t sk_uid, bool relax, - bool reuseport_cb_ok, bool reuseport_ok) -{ - int bound_dev_if2; - - if (sk == sk2) - return false; - - bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if); - - if (!sk->sk_bound_dev_if || !bound_dev_if2 || - sk->sk_bound_dev_if == bound_dev_if2) { - if (sk->sk_reuse && sk2->sk_reuse && - sk2->sk_state != TCP_LISTEN) { - if (!relax || (!reuseport_ok && sk->sk_reuseport && - sk2->sk_reuseport && reuseport_cb_ok && - (sk2->sk_state == TCP_TIME_WAIT || - uid_eq(sk_uid, sock_i_uid(sk2))))) - return true; - } else if (!reuseport_ok || !sk->sk_reuseport || - !sk2->sk_reuseport || !reuseport_cb_ok || - (sk2->sk_state != TCP_TIME_WAIT && - !uid_eq(sk_uid, sock_i_uid(sk2)))) { - return true; - } - } - return false; -} - -static bool check_bhash2_conflict(const struct sock *sk, - struct inet_bind2_bucket *tb2, kuid_t sk_uid, - bool relax, bool reuseport_cb_ok, - bool reuseport_ok) -{ - struct sock *sk2; - - sk_for_each_bound_bhash2(sk2, &tb2->owners) { - if (sk->sk_family == AF_INET && ipv6_only_sock(sk2)) - continue; - - if (bind_conflict_exist(sk, sk2, sk_uid, relax, - reuseport_cb_ok, reuseport_ok)) - return true; - } - return false; -} - -/* This should be called only when the corresponding inet_bind_bucket spinlock - * is held - */ -static int inet_csk_bind_conflict(const struct sock *sk, int port, - struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, /* may be null */ +static int inet_csk_bind_conflict(const struct sock *sk, + const struct inet_bind_bucket *tb, bool relax, bool reuseport_ok) { - struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - kuid_t uid = sock_i_uid((struct sock *)sk); - struct sock_reuseport *reuseport_cb; - struct inet_bind2_hashbucket *head2; - bool reuseport_cb_ok; struct sock *sk2; - struct net *net; - int l3mdev; - u32 hash; + bool reuseport_cb_ok; + bool reuse = sk->sk_reuse; + bool reuseport = !!sk->sk_reuseport; + struct sock_reuseport *reuseport_cb; + kuid_t uid = sock_i_uid((struct sock *)sk); rcu_read_lock(); reuseport_cb = rcu_dereference(sk->sk_reuseport_cb); @@ -231,42 +150,40 @@ static int inet_csk_bind_conflict(const struct sock *sk, int port, /* * Unlike other sk lookup places we do not check * for sk_net here, since _all_ the socks listed - * in tb->owners and tb2->owners list belong - * to the same net + * in tb->owners list belong to the same net - the + * one this bucket belongs to. */ - if (!use_bhash2_on_bind(sk)) { - sk_for_each_bound(sk2, &tb->owners) - if (bind_conflict_exist(sk, sk2, uid, relax, - reuseport_cb_ok, reuseport_ok) && - inet_rcv_saddr_equal(sk, sk2, true)) - return true; + sk_for_each_bound(sk2, &tb->owners) { + int bound_dev_if2; - return false; + if (sk == sk2) + continue; + bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if); + if ((!sk->sk_bound_dev_if || + !bound_dev_if2 || + sk->sk_bound_dev_if == bound_dev_if2)) { + if (reuse && sk2->sk_reuse && + sk2->sk_state != TCP_LISTEN) { + if ((!relax || + (!reuseport_ok && + reuseport && sk2->sk_reuseport && + reuseport_cb_ok && + (sk2->sk_state == TCP_TIME_WAIT || + uid_eq(uid, sock_i_uid(sk2))))) && + inet_rcv_saddr_equal(sk, sk2, true)) + break; + } else if (!reuseport_ok || + !reuseport || !sk2->sk_reuseport || + !reuseport_cb_ok || + (sk2->sk_state != TCP_TIME_WAIT && + !uid_eq(uid, sock_i_uid(sk2)))) { + if (inet_rcv_saddr_equal(sk, sk2, true)) + break; + } + } } - - if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, - reuseport_ok)) - return true; - - net = sock_net(sk); - - /* check there's no conflict with an existing IPV6_ADDR_ANY (if ipv6) or - * INADDR_ANY (if ipv4) socket. - */ - hash = get_bhash2_nulladdr_hash(sk, net, port); - head2 = &hinfo->bhash2[hash & (hinfo->bhash_size - 1)]; - - l3mdev = inet_sk_bound_l3mdev(sk); - inet_bind_bucket_for_each(tb2, &head2->chain) - if (check_bind2_bucket_match_nulladdr(tb2, net, port, l3mdev, sk)) - break; - - if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, - reuseport_ok)) - return true; - - return false; + return sk2 != NULL; } /* @@ -274,20 +191,16 @@ static int inet_csk_bind_conflict(const struct sock *sk, int port, * inet_bind_hashbucket lock held. */ static struct inet_bind_hashbucket * -inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, - struct inet_bind2_bucket **tb2_ret, - struct inet_bind2_hashbucket **head2_ret, int *port_ret) +inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret) { struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - struct inet_bind2_hashbucket *head2; + int port = 0; struct inet_bind_hashbucket *head; struct net *net = sock_net(sk); + bool relax = false; int i, low, high, attempt_half; - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; u32 remaining, offset; - bool relax = false; - int port = 0; int l3mdev; l3mdev = inet_sk_bound_l3mdev(sk); @@ -326,12 +239,10 @@ other_parity_scan: head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)]; spin_lock_bh(&head->lock); - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, - &head2); inet_bind_bucket_for_each(tb, &head->chain) - if (check_bind_bucket_match(tb, net, port, l3mdev)) { - if (!inet_csk_bind_conflict(sk, port, tb, tb2, - relax, false)) + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) { + if (!inet_csk_bind_conflict(sk, tb, relax, false)) goto success; goto next_port; } @@ -361,8 +272,6 @@ next_port: success: *port_ret = port; *tb_ret = tb; - *tb2_ret = tb2; - *head2_ret = head2; return head; } @@ -458,81 +367,54 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum) { bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN; struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - bool bhash_created = false, bhash2_created = false; - struct inet_bind2_bucket *tb2 = NULL; - struct inet_bind2_hashbucket *head2; - struct inet_bind_bucket *tb = NULL; + int ret = 1, port = snum; struct inet_bind_hashbucket *head; struct net *net = sock_net(sk); - int ret = 1, port = snum; - bool found_port = false; + struct inet_bind_bucket *tb = NULL; int l3mdev; l3mdev = inet_sk_bound_l3mdev(sk); if (!port) { - head = inet_csk_find_open_port(sk, &tb, &tb2, &head2, &port); + head = inet_csk_find_open_port(sk, &tb, &port); if (!head) return ret; - if (tb && tb2) - goto success; - found_port = true; - } else { - head = &hinfo->bhash[inet_bhashfn(net, port, - hinfo->bhash_size)]; - spin_lock_bh(&head->lock); - inet_bind_bucket_for_each(tb, &head->chain) - if (check_bind_bucket_match(tb, net, port, l3mdev)) - break; - - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, - &head2); - } - - if (!tb) { - tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, net, - head, port, l3mdev); if (!tb) - goto fail_unlock; - bhash_created = true; + goto tb_not_found; + goto success; } - - if (!tb2) { - tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, - net, head2, port, l3mdev, sk); - if (!tb2) - goto fail_unlock; - bhash2_created = true; - } - - /* If we had to find an open port, we already checked for conflicts */ - if (!found_port && !hlist_empty(&tb->owners)) { + head = &hinfo->bhash[inet_bhashfn(net, port, + hinfo->bhash_size)]; + spin_lock_bh(&head->lock); + inet_bind_bucket_for_each(tb, &head->chain) + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) + goto tb_found; +tb_not_found: + tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, + net, head, port, l3mdev); + if (!tb) + goto fail_unlock; +tb_found: + if (!hlist_empty(&tb->owners)) { if (sk->sk_reuse == SK_FORCE_REUSE) goto success; if ((tb->fastreuse > 0 && reuse) || sk_reuseport_match(tb, sk)) goto success; - if (inet_csk_bind_conflict(sk, port, tb, tb2, true, true)) + if (inet_csk_bind_conflict(sk, tb, true, true)) goto fail_unlock; } success: inet_csk_update_fastreuse(tb, sk); if (!inet_csk(sk)->icsk_bind_hash) - inet_bind_hash(sk, tb, tb2, port); + inet_bind_hash(sk, tb, port); WARN_ON(inet_csk(sk)->icsk_bind_hash != tb); - WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2); ret = 0; fail_unlock: - if (ret) { - if (bhash_created) - inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb); - if (bhash2_created) - inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, - tb2); - } spin_unlock_bh(&head->lock); return ret; } @@ -1079,7 +961,6 @@ struct sock *inet_csk_clone_lock(const struct sock *sk, inet_sk_set_state(newsk, TCP_SYN_RECV); newicsk->icsk_bind_hash = NULL; - newicsk->icsk_bind2_hash = NULL; inet_sk(newsk)->inet_dport = inet_rsk(req)->ir_rmt_port; inet_sk(newsk)->inet_num = inet_rsk(req)->ir_num; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 545f91b6cb5e..b9d995b5ce24 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -81,41 +81,6 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep, return tb; } -struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep, - struct net *net, - struct inet_bind2_hashbucket *head, - const unsigned short port, - int l3mdev, - const struct sock *sk) -{ - struct inet_bind2_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC); - - if (tb) { - write_pnet(&tb->ib_net, net); - tb->l3mdev = l3mdev; - tb->port = port; -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - tb->v6_rcv_saddr = sk->sk_v6_rcv_saddr; - else -#endif - tb->rcv_saddr = sk->sk_rcv_saddr; - INIT_HLIST_HEAD(&tb->owners); - hlist_add_head(&tb->node, &head->chain); - } - return tb; -} - -static bool bind2_bucket_addr_match(struct inet_bind2_bucket *tb2, struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - return ipv6_addr_equal(&tb2->v6_rcv_saddr, - &sk->sk_v6_rcv_saddr); -#endif - return tb2->rcv_saddr == sk->sk_rcv_saddr; -} - /* * Caller must hold hashbucket lock for this tb with local BH disabled */ @@ -127,25 +92,12 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket } } -/* Caller must hold the lock for the corresponding hashbucket in the bhash table - * with local BH disabled - */ -void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb) -{ - if (hlist_empty(&tb->owners)) { - __hlist_del(&tb->node); - kmem_cache_free(cachep, tb); - } -} - void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, const unsigned short snum) + const unsigned short snum) { inet_sk(sk)->inet_num = snum; sk_add_bind_node(sk, &tb->owners); inet_csk(sk)->icsk_bind_hash = tb; - sk_add_bind2_node(sk, &tb2->owners); - inet_csk(sk)->icsk_bind2_hash = tb2; } /* @@ -157,7 +109,6 @@ static void __inet_put_port(struct sock *sk) const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num, hashinfo->bhash_size); struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash]; - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; spin_lock(&head->lock); @@ -166,13 +117,6 @@ static void __inet_put_port(struct sock *sk) inet_csk(sk)->icsk_bind_hash = NULL; inet_sk(sk)->inet_num = 0; inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb); - - if (inet_csk(sk)->icsk_bind2_hash) { - tb2 = inet_csk(sk)->icsk_bind2_hash; - __sk_del_bind2_node(sk); - inet_csk(sk)->icsk_bind2_hash = NULL; - inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2); - } spin_unlock(&head->lock); } @@ -189,19 +133,14 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child) struct inet_hashinfo *table = sk->sk_prot->h.hashinfo; unsigned short port = inet_sk(child)->inet_num; const int bhash = inet_bhashfn(sock_net(sk), port, - table->bhash_size); + table->bhash_size); struct inet_bind_hashbucket *head = &table->bhash[bhash]; - struct inet_bind2_hashbucket *head_bhash2; - bool created_inet_bind_bucket = false; - struct net *net = sock_net(sk); - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; int l3mdev; spin_lock(&head->lock); tb = inet_csk(sk)->icsk_bind_hash; - tb2 = inet_csk(sk)->icsk_bind2_hash; - if (unlikely(!tb || !tb2)) { + if (unlikely(!tb)) { spin_unlock(&head->lock); return -ENOENT; } @@ -214,45 +153,25 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child) * as that of the child socket. We have to look up or * create a new bind bucket for the child here. */ inet_bind_bucket_for_each(tb, &head->chain) { - if (check_bind_bucket_match(tb, net, port, l3mdev)) + if (net_eq(ib_net(tb), sock_net(sk)) && + tb->l3mdev == l3mdev && tb->port == port) break; } if (!tb) { tb = inet_bind_bucket_create(table->bind_bucket_cachep, - net, head, port, l3mdev); + sock_net(sk), head, port, + l3mdev); if (!tb) { spin_unlock(&head->lock); return -ENOMEM; } - created_inet_bind_bucket = true; } inet_csk_update_fastreuse(tb, child); - - goto bhash2_find; - } else if (!bind2_bucket_addr_match(tb2, child)) { - l3mdev = inet_sk_bound_l3mdev(sk); - -bhash2_find: - tb2 = inet_bind2_bucket_find(table, net, port, l3mdev, child, - &head_bhash2); - if (!tb2) { - tb2 = inet_bind2_bucket_create(table->bind2_bucket_cachep, - net, head_bhash2, port, - l3mdev, child); - if (!tb2) - goto error; - } } - inet_bind_hash(child, tb, tb2, port); + inet_bind_hash(child, tb, port); spin_unlock(&head->lock); return 0; - -error: - if (created_inet_bind_bucket) - inet_bind_bucket_destroy(table->bind_bucket_cachep, tb); - spin_unlock(&head->lock); - return -ENOMEM; } EXPORT_SYMBOL_GPL(__inet_inherit_port); @@ -756,76 +675,6 @@ void inet_unhash(struct sock *sk) } EXPORT_SYMBOL_GPL(inet_unhash); -static bool check_bind2_bucket_match(struct inet_bind2_bucket *tb, - struct net *net, unsigned short port, - int l3mdev, struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && - ipv6_addr_equal(&tb->v6_rcv_saddr, &sk->sk_v6_rcv_saddr); - else -#endif - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && tb->rcv_saddr == sk->sk_rcv_saddr; -} - -bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, - struct net *net, const unsigned short port, - int l3mdev, const struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr nulladdr = {}; - - if (sk->sk_family == AF_INET6) - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && - ipv6_addr_equal(&tb->v6_rcv_saddr, &nulladdr); - else -#endif - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && tb->rcv_saddr == 0; -} - -static struct inet_bind2_hashbucket * -inet_bhashfn_portaddr(struct inet_hashinfo *hinfo, const struct sock *sk, - const struct net *net, unsigned short port) -{ - u32 hash; - -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - hash = ipv6_portaddr_hash(net, &sk->sk_v6_rcv_saddr, port); - else -#endif - hash = ipv4_portaddr_hash(net, sk->sk_rcv_saddr, port); - return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)]; -} - -/* This should only be called when the spinlock for the socket's corresponding - * bind_hashbucket is held - */ -struct inet_bind2_bucket * -inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, - const unsigned short port, int l3mdev, struct sock *sk, - struct inet_bind2_hashbucket **head) -{ - struct inet_bind2_bucket *bhash2 = NULL; - struct inet_bind2_hashbucket *h; - - h = inet_bhashfn_portaddr(hinfo, sk, net, port); - inet_bind_bucket_for_each(bhash2, &h->chain) { - if (check_bind2_bucket_match(bhash2, net, port, l3mdev, sk)) - break; - } - - if (head) - *head = h; - - return bhash2; -} - /* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm * Note that we use 32bit integers (vs RFC 'short integers') * because 2^16 is not a multiple of num_ephemeral and this @@ -846,13 +695,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, { struct inet_hashinfo *hinfo = death_row->hashinfo; struct inet_timewait_sock *tw = NULL; - struct inet_bind2_hashbucket *head2; struct inet_bind_hashbucket *head; int port = inet_sk(sk)->inet_num; struct net *net = sock_net(sk); - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; - bool tb_created = false; u32 remaining, offset; int ret, i, low, high; int l3mdev; @@ -909,7 +755,8 @@ other_parity_scan: * the established check is already unique enough. */ inet_bind_bucket_for_each(tb, &head->chain) { - if (check_bind_bucket_match(tb, net, port, l3mdev)) { + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) { if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) goto next_port; @@ -927,7 +774,6 @@ other_parity_scan: spin_unlock_bh(&head->lock); return -ENOMEM; } - tb_created = true; tb->fastreuse = -1; tb->fastreuseport = -1; goto ok; @@ -943,17 +789,6 @@ next_port: return -EADDRNOTAVAIL; ok: - /* Find the corresponding tb2 bucket since we need to - * add the socket to the bhash2 table as well - */ - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2); - if (!tb2) { - tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net, - head2, port, l3mdev, sk); - if (!tb2) - goto error; - } - /* Here we want to add a little bit of randomness to the next source * port that will be chosen. We use a max() with a random here so that * on low contention the randomness is maximal and on high contention @@ -963,7 +798,7 @@ ok: WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2); /* Head lock still held and bh's disabled */ - inet_bind_hash(sk, tb, tb2, port); + inet_bind_hash(sk, tb, port); if (sk_unhashed(sk)) { inet_sk(sk)->inet_sport = htons(port); inet_ehash_nolisten(sk, (struct sock *)tw, NULL); @@ -975,12 +810,6 @@ ok: inet_twsk_deschedule_put(tw); local_bh_enable(); return 0; - -error: - if (tb_created) - inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb); - spin_unlock_bh(&head->lock); - return -ENOMEM; } /* diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 9984d23a7f3e..028513d3e2a2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4604,12 +4604,6 @@ void __init tcp_init(void) SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL); - tcp_hashinfo.bind2_bucket_cachep = - kmem_cache_create("tcp_bind2_bucket", - sizeof(struct inet_bind2_bucket), 0, - SLAB_HWCACHE_ALIGN | SLAB_PANIC | - SLAB_ACCOUNT, - NULL); /* Size and allocate the main established and bind bucket * hash tables. @@ -4632,9 +4626,8 @@ void __init tcp_init(void) if (inet_ehash_locks_alloc(&tcp_hashinfo)) panic("TCP: failed to alloc ehash_locks"); tcp_hashinfo.bhash = - alloc_large_system_hash("TCP bind bhash tables", - sizeof(struct inet_bind_hashbucket) + - sizeof(struct inet_bind2_hashbucket), + alloc_large_system_hash("TCP bind", + sizeof(struct inet_bind_hashbucket), tcp_hashinfo.ehash_mask + 1, 17, /* one slot per 128 KB of memory */ 0, @@ -4643,12 +4636,9 @@ void __init tcp_init(void) 0, 64 * 1024); tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size; - tcp_hashinfo.bhash2 = - (struct inet_bind2_hashbucket *)(tcp_hashinfo.bhash + tcp_hashinfo.bhash_size); for (i = 0; i < tcp_hashinfo.bhash_size; i++) { spin_lock_init(&tcp_hashinfo.bhash[i].lock); INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain); - INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain); } diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index b984f8c8d523..a29f79618934 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -37,4 +37,3 @@ gro ioam6_parser toeplitz cmsg_sender -bind_bhash_test diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 464df13831f2..7ea54af55490 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -59,7 +59,6 @@ TEST_GEN_FILES += toeplitz TEST_GEN_FILES += cmsg_sender TEST_GEN_FILES += stress_reuseport_listen TEST_PROGS += test_vxlan_vnifiltering.sh -TEST_GEN_FILES += bind_bhash_test TEST_FILES := settings @@ -70,5 +69,4 @@ include bpf/Makefile $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -$(OUTPUT)/bind_bhash_test: LDLIBS += -lpthread $(OUTPUT)/tcp_inq: LDLIBS += -lpthread diff --git a/tools/testing/selftests/net/bind_bhash_test.c b/tools/testing/selftests/net/bind_bhash_test.c deleted file mode 100644 index 252e73754e76..000000000000 --- a/tools/testing/selftests/net/bind_bhash_test.c +++ /dev/null @@ -1,119 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * This times how long it takes to bind to a port when the port already - * has multiple sockets in its bhash table. - * - * In the setup(), we populate the port's bhash table with - * MAX_THREADS * MAX_CONNECTIONS number of entries. - */ - -#include -#include -#include -#include - -#define MAX_THREADS 600 -#define MAX_CONNECTIONS 40 - -static const char *bind_addr = "::1"; -static const char *port; - -static int fd_array[MAX_THREADS][MAX_CONNECTIONS]; - -static int bind_socket(int opt, const char *addr) -{ - struct addrinfo *res, hint = {}; - int sock_fd, reuse = 1, err; - - sock_fd = socket(AF_INET6, SOCK_STREAM, 0); - if (sock_fd < 0) { - perror("socket fd err"); - return -1; - } - - hint.ai_family = AF_INET6; - hint.ai_socktype = SOCK_STREAM; - - err = getaddrinfo(addr, port, &hint, &res); - if (err) { - perror("getaddrinfo failed"); - return -1; - } - - if (opt) { - err = setsockopt(sock_fd, SOL_SOCKET, opt, &reuse, sizeof(reuse)); - if (err) { - perror("setsockopt failed"); - return -1; - } - } - - err = bind(sock_fd, res->ai_addr, res->ai_addrlen); - if (err) { - perror("failed to bind to port"); - return -1; - } - - return sock_fd; -} - -static void *setup(void *arg) -{ - int sock_fd, i; - int *array = (int *)arg; - - for (i = 0; i < MAX_CONNECTIONS; i++) { - sock_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, bind_addr); - if (sock_fd < 0) - return NULL; - array[i] = sock_fd; - } - - return NULL; -} - -int main(int argc, const char *argv[]) -{ - int listener_fd, sock_fd, i, j; - pthread_t tid[MAX_THREADS]; - clock_t begin, end; - - if (argc != 2) { - printf("Usage: listener \n"); - return -1; - } - - port = argv[1]; - - listener_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, bind_addr); - if (listen(listener_fd, 100) < 0) { - perror("listen failed"); - return -1; - } - - /* Set up threads to populate the bhash table entry for the port */ - for (i = 0; i < MAX_THREADS; i++) - pthread_create(&tid[i], NULL, setup, fd_array[i]); - - for (i = 0; i < MAX_THREADS; i++) - pthread_join(tid[i], NULL); - - begin = clock(); - - /* Bind to the same port on a different address */ - sock_fd = bind_socket(0, "2001:0db8:0:f101::1"); - - end = clock(); - - printf("time spent = %f\n", (double)(end - begin) / CLOCKS_PER_SEC); - - /* clean up */ - close(sock_fd); - close(listener_fd); - for (i = 0; i < MAX_THREADS; i++) { - for (j = 0; i < MAX_THREADS; i++) - close(fd_array[i][j]); - } - - return 0; -} From 2e7bf4a6af482f73f01245f08b4a953412c77070 Mon Sep 17 00:00:00 2001 From: Yang Yingliang Date: Thu, 16 Jun 2022 14:29:17 +0800 Subject: [PATCH 218/298] net: axienet: add missing error return code in axienet_probe() It should return error code in error path in axienet_probe(). Fixes: 00be43a74ca2 ("net: axienet: make the 64b addresable DMA depends on 64b archectures") Reported-by: Hulk Robot Signed-off-by: Yang Yingliang Link: https://lore.kernel.org/r/20220616062917.3601-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index fa7bcd2c1892..1760930ec0c4 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -2039,6 +2039,7 @@ static int axienet_probe(struct platform_device *pdev) } if (!IS_ENABLED(CONFIG_64BIT) && lp->features & XAE_FEATURE_DMA_64BIT) { dev_err(&pdev->dev, "64-bit addressable DMA is not compatible with 32-bit archecture\n"); + ret = -EINVAL; goto cleanup_clk; } From 14dc7a18abbe4176f5626c13c333670da8e06aa1 Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Wed, 15 Jun 2022 14:00:04 -0700 Subject: [PATCH 219/298] block: Fix handling of offline queues in blk_mq_alloc_request_hctx() This patch prevents that test nvme/004 triggers the following: UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9 index 512 is out of range for type 'long unsigned int [512]' Call Trace: show_stack+0x52/0x58 dump_stack_lvl+0x49/0x5e dump_stack+0x10/0x12 ubsan_epilogue+0x9/0x3b __ubsan_handle_out_of_bounds.cold+0x44/0x49 blk_mq_alloc_request_hctx+0x304/0x310 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core] nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics] nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop] nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop] nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics] nvmf_dev_write+0xae/0x111 [nvme_fabrics] vfs_write+0x144/0x560 ksys_write+0xb7/0x140 __x64_sys_write+0x42/0x50 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: Christoph Hellwig Cc: Ming Lei Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU") Signed-off-by: Bart Van Assche Reviewed-by: Christoph Hellwig Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org Signed-off-by: Jens Axboe --- block/blk-mq.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/blk-mq.c b/block/blk-mq.c index e9bf950983c7..26a7f802d7ee 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -579,6 +579,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, if (!blk_mq_hw_queue_mapped(data.hctx)) goto out_queue_exit; cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask); + if (cpu >= nr_cpu_ids) + goto out_queue_exit; data.ctx = __blk_mq_get_ctx(q, cpu); if (!q->elevator) From 5fd7a84a09e640016fe106dd3e992f5210e23dc7 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Thu, 16 Jun 2022 09:43:59 +0800 Subject: [PATCH 220/298] blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none elevator can be tore down by sysfs switch interface or disk release, so hold ->sysfs_lock before referring to q->elevator, then potential use-after-free can be avoided. Reviewed-by: Christoph Hellwig Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com Signed-off-by: Jens Axboe --- block/blk-mq.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 26a7f802d7ee..c13d03b2e17c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4440,12 +4440,14 @@ static bool blk_mq_elv_switch_none(struct list_head *head, if (!qe) return false; + /* q->elevator needs protection from ->sysfs_lock */ + mutex_lock(&q->sysfs_lock); + INIT_LIST_HEAD(&qe->node); qe->q = q; qe->type = q->elevator->type; list_add(&qe->node, head); - mutex_lock(&q->sysfs_lock); /* * After elevator_switch_mq, the previous elevator_queue will be * released by elevator_release. The reference of the io scheduler From 4d337cebcb1c27d9b48c48b9a98e939d4552d584 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Thu, 16 Jun 2022 09:44:00 +0800 Subject: [PATCH 221/298] blk-mq: avoid to touch q->elevator without any protection q->elevator is referred in blk_mq_has_sqsched() without any protection, no .q_usage_counter is held, no queue srcu and rcu read lock is held, so potential use-after-free may be triggered. Fix the issue by adding one queue flag for checking if the elevator uses single queue style dispatch. Meantime the elevator feature flag of ELEVATOR_F_MQ_AWARE isn't needed any more. Cc: Jan Kara Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 3 +++ block/blk-mq-sched.c | 1 + block/blk-mq.c | 18 ++---------------- block/kyber-iosched.c | 3 ++- block/mq-deadline.c | 3 +++ include/linux/blkdev.h | 4 ++-- 6 files changed, 13 insertions(+), 19 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 0d46cb728bbf..caa55a5624bc 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -7188,6 +7188,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) bfq_init_root_group(bfqd->root_group, bfqd); bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group); + /* We dispatch from request queue wide instead of hw queue */ + blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q); + wbt_disable_default(q); return 0; diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 9e56a69422b6..eb3c65a21362 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -564,6 +564,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e) int ret; if (!e) { + blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); q->elevator = NULL; q->nr_requests = q->tag_set->queue_depth; return 0; diff --git a/block/blk-mq.c b/block/blk-mq.c index c13d03b2e17c..4cdb08a70912 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2142,20 +2142,6 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) } EXPORT_SYMBOL(blk_mq_run_hw_queue); -/* - * Is the request queue handled by an IO scheduler that does not respect - * hardware queues when dispatching? - */ -static bool blk_mq_has_sqsched(struct request_queue *q) -{ - struct elevator_queue *e = q->elevator; - - if (e && e->type->ops.dispatch_request && - !(e->type->elevator_features & ELEVATOR_F_MQ_AWARE)) - return true; - return false; -} - /* * Return prefered queue to dispatch from (if any) for non-mq aware IO * scheduler. @@ -2188,7 +2174,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) unsigned long i; sq_hctx = NULL; - if (blk_mq_has_sqsched(q)) + if (blk_queue_sq_sched(q)) sq_hctx = blk_mq_get_sq_hctx(q); queue_for_each_hw_ctx(q, hctx, i) { if (blk_mq_hctx_stopped(hctx)) @@ -2216,7 +2202,7 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs) unsigned long i; sq_hctx = NULL; - if (blk_mq_has_sqsched(q)) + if (blk_queue_sq_sched(q)) sq_hctx = blk_mq_get_sq_hctx(q); queue_for_each_hw_ctx(q, hctx, i) { if (blk_mq_hctx_stopped(hctx)) diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c index 70ff2a599ef6..8f7c745b4a57 100644 --- a/block/kyber-iosched.c +++ b/block/kyber-iosched.c @@ -421,6 +421,8 @@ static int kyber_init_sched(struct request_queue *q, struct elevator_type *e) blk_stat_enable_accounting(q); + blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); + eq->elevator_data = kqd; q->elevator = eq; @@ -1033,7 +1035,6 @@ static struct elevator_type kyber_sched = { #endif .elevator_attrs = kyber_sched_attrs, .elevator_name = "kyber", - .elevator_features = ELEVATOR_F_MQ_AWARE, .elevator_owner = THIS_MODULE, }; diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 6ed602b2f80a..1a9e835e816c 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -642,6 +642,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e) spin_lock_init(&dd->lock); spin_lock_init(&dd->zone_lock); + /* We dispatch from request queue wide instead of hw queue */ + blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q); + q->elevator = eq; return 0; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 608d577734c2..bb6e3c31b3b7 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -575,6 +575,7 @@ struct request_queue { #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */ #define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */ #define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */ +#define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */ #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_SAME_COMP) | \ @@ -616,6 +617,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q); #define blk_queue_pm_only(q) atomic_read(&(q)->pm_only) #define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) #define blk_queue_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags) +#define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags) extern void blk_set_pm_only(struct request_queue *q); extern void blk_clear_pm_only(struct request_queue *q); @@ -1006,8 +1008,6 @@ void disk_set_independent_access_ranges(struct gendisk *disk, */ /* Supports zoned block devices sequential write constraint */ #define ELEVATOR_F_ZBD_SEQ_WRITE (1U << 0) -/* Supports scheduling on multiple hardware queues */ -#define ELEVATOR_F_MQ_AWARE (1U << 1) extern void blk_queue_required_elevator_features(struct request_queue *q, unsigned int features); From 6cfeadbff3f8905f2854735ebb88e581402c16c4 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Thu, 16 Jun 2022 09:44:01 +0800 Subject: [PATCH 222/298] blk-mq: don't clear flush_rq from tags->rqs[] commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") is added to clear the to-be-free flush request from tags->rqs[] for avoiding use-after-free on the flush rq. Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time by ~8s because running scsi probe which may create and remove lots of unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and each request queue has lots of hw queues. Improve the situation by not running blk_mq_clear_flush_rq_mapping if disk isn't added when there can't be any flush request issued. Reviewed-by: Christoph Hellwig Reported-by: Yu Kuai Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com Signed-off-by: Jens Axboe --- block/blk-mq.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 4cdb08a70912..33145ba52c96 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -3431,8 +3431,9 @@ static void blk_mq_exit_hctx(struct request_queue *q, if (blk_mq_hw_queue_mapped(hctx)) blk_mq_tag_idle(hctx); - blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx], - set->queue_depth, flush_rq); + if (blk_queue_init_done(q)) + blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx], + set->queue_depth, flush_rq); if (set->ops->exit_request) set->ops->exit_request(set, flush_rq, hctx_idx); From 21f356f990262329bc387910355833378524fe9f Mon Sep 17 00:00:00 2001 From: Heiko Stuebner Date: Thu, 26 May 2022 22:56:45 +0200 Subject: [PATCH 223/298] riscv: fix dependency for t-head errata alternatives only work correctly on non-xip-kernels and while the selected alternative-symbol has the correct dependency the symbol selecting it also needs that dependency. So add the missing dependency to the T-Head errata Kconfig symbol. Reported-by: kernel test robot Reviewed-by: Guo Ren Signed-off-by: Heiko Stuebner Link: https://lore.kernel.org/r/20220526205646.258337-5-heiko@sntech.de Fixes: a35707c3d850 ("riscv: add memory-type errata for T-Head") Signed-off-by: Palmer Dabbelt --- arch/riscv/Kconfig.erratas | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/riscv/Kconfig.erratas b/arch/riscv/Kconfig.erratas index ebfcd5cc6eaf..457ac72c9b36 100644 --- a/arch/riscv/Kconfig.erratas +++ b/arch/riscv/Kconfig.erratas @@ -35,6 +35,7 @@ config ERRATA_SIFIVE_CIP_1200 config ERRATA_THEAD bool "T-HEAD errata" + depends on !XIP_KERNEL select RISCV_ALTERNATIVE help All T-HEAD errata Kconfig depend on this Kconfig. Disabling From 237c0ee4742b6462cb41cdb3fda1ca55011e4aaf Mon Sep 17 00:00:00 2001 From: Heiko Stuebner Date: Thu, 26 May 2022 22:56:42 +0200 Subject: [PATCH 224/298] riscv: drop cpufeature_apply_feature tracking variable The variable was tracking which feature patches got applied but that information was never actually used - and thus resulted in a warning as well. Drop the variable. Reported-by: kernel test robot Signed-off-by: Heiko Stuebner Reviewed-by: Guo Ren Link: https://lore.kernel.org/r/20220526205646.258337-2-heiko@sntech.de Fixes: ff689fd21cb1 ("riscv: add RISC-V Svpbmt extension support") Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/cpufeature.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c index a6f62a6d1edd..12b05ce164bb 100644 --- a/arch/riscv/kernel/cpufeature.c +++ b/arch/riscv/kernel/cpufeature.c @@ -293,7 +293,6 @@ void __init_or_module riscv_cpufeature_patch_func(struct alt_entry *begin, unsigned int stage) { u32 cpu_req_feature = cpufeature_probe(stage); - u32 cpu_apply_feature = 0; struct alt_entry *alt; u32 tmp; @@ -307,10 +306,8 @@ void __init_or_module riscv_cpufeature_patch_func(struct alt_entry *begin, } tmp = (1U << alt->errata_id); - if (cpu_req_feature & tmp) { + if (cpu_req_feature & tmp) patch_text_nosync(alt->old_ptr, alt->alt_ptr, alt->alt_len); - cpu_apply_feature |= tmp; - } } } #endif From 924cbb8cbe3460ea192e6243017ceb0ceb255b1b Mon Sep 17 00:00:00 2001 From: Heiko Stuebner Date: Thu, 26 May 2022 22:56:43 +0200 Subject: [PATCH 225/298] riscv: Improve description for RISCV_ISA_SVPBMT Kconfig symbol This improves the symbol's description to make it easier for people to understand what it is about. Suggested-by: Christoph Hellwig Suggested-by: Philipp Tomsich Signed-off-by: Heiko Stuebner Reviewed-by: Guo Ren Link: https://lore.kernel.org/r/20220526205646.258337-3-heiko@sntech.de Signed-off-by: Palmer Dabbelt --- arch/riscv/Kconfig | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index c22f58155948..32ffef9f6e5b 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -364,8 +364,13 @@ config RISCV_ISA_SVPBMT select RISCV_ALTERNATIVE default y help - Adds support to dynamically detect the presence of the SVPBMT extension - (Supervisor-mode: page-based memory types) and enable its usage. + Adds support to dynamically detect the presence of the SVPBMT + ISA-extension (Supervisor-mode: page-based memory types) and + enable its usage. + + The memory type for a page contains a combination of attributes + that indicate the cacheability, idempotency, and ordering + properties for access to that page. The SVPBMT extension is only available on 64Bit cpus. From b96f3cab59654ee2c30e6adf0b1c13cf8c0850fa Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Mon, 13 Jun 2022 09:32:34 -0700 Subject: [PATCH 226/298] block/bfq: Enable I/O statistics BFQ uses io_start_time_ns. That member variable is only set if I/O statistics are enabled. Hence this patch that enables I/O statistics at the time BFQ is associated with a request queue. Compile-tested only. Reported-by: Cixi Geng Cc: Cixi Geng Cc: Yu Kuai Cc: Paolo Valente Reviewed-by: Jan Kara Signed-off-by: Bart Van Assche Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index caa55a5624bc..e6d7e6b01a05 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -7046,6 +7046,7 @@ static void bfq_exit_queue(struct elevator_queue *e) spin_unlock_irq(&bfqd->lock); #endif + blk_stat_disable_accounting(bfqd->queue); wbt_enable_default(bfqd->queue); kfree(bfqd); @@ -7192,6 +7193,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q); wbt_disable_default(q); + blk_stat_enable_accounting(q); + return 0; out_free: From 7c05eae8db9296e28b5dd34deec1ca5ef96d0f08 Mon Sep 17 00:00:00 2001 From: Steve French Date: Wed, 15 Jun 2022 22:40:23 -0500 Subject: [PATCH 227/298] smb3: add trace point for SMB2_set_eof In order to debug problems with file size being reported incorrectly temporarily (in this case xfstest generic/584 intermittent failure) we need to add trace point for the non-compounded code path where we set the file size (SMB2_set_eof). The new trace point is: "smb3_set_eof" Here is sample output from the tracepoint: TASK-PID CPU# ||||| TIMESTAMP FUNCTION | | | ||||| | | xfs_io-75403 [002] ..... 95219.189835: smb3_set_eof: xid=221 sid=0xeef1cbd2 tid=0x27079ee6 fid=0x52edb58c offset=0x100000 aio-dio-append--75418 [010] ..... 95219.242402: smb3_set_eof: xid=226 sid=0xeef1cbd2 tid=0x27079ee6 fid=0xae89852d offset=0x0 Reviewed-by: Paulo Alcantara (SUSE) Signed-off-by: Steve French --- fs/cifs/smb2pdu.c | 2 ++ fs/cifs/trace.h | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+) diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index eaf975f1ad89..b515140bad8d 100644 --- a/fs/cifs/smb2pdu.c +++ b/fs/cifs/smb2pdu.c @@ -5154,6 +5154,8 @@ SMB2_set_eof(const unsigned int xid, struct cifs_tcon *tcon, u64 persistent_fid, data = &info; size = sizeof(struct smb2_file_eof_info); + trace_smb3_set_eof(xid, persistent_fid, tcon->tid, tcon->ses->Suid, le64_to_cpu(*eof)); + return send_set_info(xid, tcon, persistent_fid, volatile_fid, pid, FILE_END_OF_FILE_INFORMATION, SMB2_O_INFO_FILE, 0, 1, &data, &size); diff --git a/fs/cifs/trace.h b/fs/cifs/trace.h index 2be5e0c8564d..6b88dc2e364f 100644 --- a/fs/cifs/trace.h +++ b/fs/cifs/trace.h @@ -121,6 +121,44 @@ DEFINE_SMB3_RW_DONE_EVENT(query_dir_done); DEFINE_SMB3_RW_DONE_EVENT(zero_done); DEFINE_SMB3_RW_DONE_EVENT(falloc_done); +/* For logging successful set EOF (truncate) */ +DECLARE_EVENT_CLASS(smb3_eof_class, + TP_PROTO(unsigned int xid, + __u64 fid, + __u32 tid, + __u64 sesid, + __u64 offset), + TP_ARGS(xid, fid, tid, sesid, offset), + TP_STRUCT__entry( + __field(unsigned int, xid) + __field(__u64, fid) + __field(__u32, tid) + __field(__u64, sesid) + __field(__u64, offset) + ), + TP_fast_assign( + __entry->xid = xid; + __entry->fid = fid; + __entry->tid = tid; + __entry->sesid = sesid; + __entry->offset = offset; + ), + TP_printk("xid=%u sid=0x%llx tid=0x%x fid=0x%llx offset=0x%llx", + __entry->xid, __entry->sesid, __entry->tid, __entry->fid, + __entry->offset) +) + +#define DEFINE_SMB3_EOF_EVENT(name) \ +DEFINE_EVENT(smb3_eof_class, smb3_##name, \ + TP_PROTO(unsigned int xid, \ + __u64 fid, \ + __u32 tid, \ + __u64 sesid, \ + __u64 offset), \ + TP_ARGS(xid, fid, tid, sesid, offset)) + +DEFINE_SMB3_EOF_EVENT(set_eof); + /* * For handle based calls other than read and write, and get/set info */ From 5d7362d0d56da3b85b19b5e5ce657026c2eef479 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 16 Jun 2022 13:21:27 -0400 Subject: [PATCH 228/298] dm: fix use-after-free in dm_put_live_table_bio dm_put_live_table_bio is called from the end of dm_submit_bio. However, at this point, the bio may be already finished and the caller may have freed the bio. Consequently, dm_put_live_table_bio accesses the stale "bio" pointer. Fix this bug by loading the bi_opf value and passing it to dm_get_live_table_bio and dm_put_live_table_bio instead of the bio. This bug was found by running the lvm2 testsuite with kasan. Fixes: 563a225c9fd2 ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer --- drivers/md/dm.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index d5e6d33700e5..6ea14ab94aa6 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -715,18 +715,18 @@ static void dm_put_live_table_fast(struct mapped_device *md) __releases(RCU) } static inline struct dm_table *dm_get_live_table_bio(struct mapped_device *md, - int *srcu_idx, struct bio *bio) + int *srcu_idx, unsigned bio_opf) { - if (bio->bi_opf & REQ_NOWAIT) + if (bio_opf & REQ_NOWAIT) return dm_get_live_table_fast(md); else return dm_get_live_table(md, srcu_idx); } static inline void dm_put_live_table_bio(struct mapped_device *md, int srcu_idx, - struct bio *bio) + unsigned bio_opf) { - if (bio->bi_opf & REQ_NOWAIT) + if (bio_opf & REQ_NOWAIT) dm_put_live_table_fast(md); else dm_put_live_table(md, srcu_idx); @@ -1715,8 +1715,9 @@ static void dm_submit_bio(struct bio *bio) struct mapped_device *md = bio->bi_bdev->bd_disk->private_data; int srcu_idx; struct dm_table *map; + unsigned bio_opf = bio->bi_opf; - map = dm_get_live_table_bio(md, &srcu_idx, bio); + map = dm_get_live_table_bio(md, &srcu_idx, bio_opf); /* If suspended, or map not yet available, queue this IO for later */ if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) || @@ -1732,7 +1733,7 @@ static void dm_submit_bio(struct bio *bio) dm_split_and_process_bio(md, map, bio); out: - dm_put_live_table_bio(md, srcu_idx, bio); + dm_put_live_table_bio(md, srcu_idx, bio_opf); } static bool dm_poll_dm_io(struct dm_io *io, struct io_comp_batch *iob, From 1ee88de395c3ad6791c4baeba40e83b6ec97657a Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 16 Jun 2022 14:14:39 -0400 Subject: [PATCH 229/298] dm: fix narrow race for REQ_NOWAIT bios being issued despite no support Starting with the commit 63a225c9fd20, device mapper has an optimization that it will take cheaper table lock (dm_get_live_table_fast instead of dm_get_live_table) if the bio has REQ_NOWAIT. The bios with REQ_NOWAIT must not block in the target request routine, if they did, we would be blocking while holding rcu_read_lock, which is prohibited. The targets that are suitable for REQ_NOWAIT optimization (and that don't block in the map routine) have the flag DM_TARGET_NOWAIT set. Device mapper will test if all the targets and all the devices in a table support nowait (see the function dm_table_supports_nowait) and it will set or clear the QUEUE_FLAG_NOWAIT flag on its request queue according to this check. There's a test in submit_bio_noacct: "if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q)) goto not_supported" - this will make sure that REQ_NOWAIT bios can't enter a request queue that doesn't support them. This mechanism works to prevent REQ_NOWAIT bios from reaching dm targets that don't support the REQ_NOWAIT flag (and that may block in the map routine) - except that there is a small race condition: submit_bio_noacct checks if the queue has the QUEUE_FLAG_NOWAIT without holding any locks. Immediatelly after this check, the device mapper table may be reloaded with a table that doesn't support REQ_NOWAIT (for example, if we start moving the logical volume or if we activate a snapshot). However the REQ_NOWAIT bio that already passed the check in submit_bio_noacct would be sent to device mapper, where it could be redirected to a dm target that doesn't support REQ_NOWAIT - the result is sleeping while we hold rcu_read_lock. In order to fix this race, we double-check if the target supports REQ_NOWAIT while we hold the table lock (so that the table can't change under us). Fixes: 563a225c9fd2 ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer --- drivers/md/dm.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 6ea14ab94aa6..b6b25d319ef7 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1613,7 +1613,12 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci) ti = dm_table_find_target(ci->map, ci->sector); if (unlikely(!ti)) return BLK_STS_IOERR; - else if (unlikely(ci->is_abnormal_io)) + + if (unlikely((ci->bio->bi_opf & REQ_NOWAIT) != 0) && + unlikely(!dm_target_supports_nowait(ti->type))) + return BLK_STS_NOTSUPP; + + if (unlikely(ci->is_abnormal_io)) return __process_abnormal_io(ci, ti); /* From 85e123c27d5cbc22cfdc01de1e2ca1d9003a02d0 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 16 Jun 2022 13:28:57 -0400 Subject: [PATCH 230/298] dm mirror log: round up region bitmap size to BITS_PER_LONG The code in dm-log rounds up bitset_size to 32 bits. It then uses find_next_zero_bit_le on the allocated region. find_next_zero_bit_le accesses the bitmap using unsigned long pointers. So, on 64-bit architectures, it may access 4 bytes beyond the allocated size. Fix this bug by rounding up bitset_size to BITS_PER_LONG. This bug was found by running the lvm2 testsuite with kasan. Fixes: 29121bd0b00e ("[PATCH] dm mirror log: bitset_size fix") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer --- drivers/md/dm-log.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c index 06f328928a7f..2dda05aada23 100644 --- a/drivers/md/dm-log.c +++ b/drivers/md/dm-log.c @@ -415,8 +415,7 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti, /* * Work out how many "unsigned long"s we need to hold the bitset. */ - bitset_size = dm_round_up(region_count, - sizeof(*lc->clean_bits) << BYTE_SHIFT); + bitset_size = dm_round_up(region_count, BITS_PER_LONG); bitset_size >>= BYTE_SHIFT; lc->bitset_uint32_count = bitset_size / sizeof(*lc->clean_bits); From 6436c770f120a9ffeb4e791650467f30f1d062d1 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Fri, 17 Jun 2022 06:24:26 -0600 Subject: [PATCH 231/298] io_uring: recycle provided buffer if we punt to io-wq io_arm_poll_handler() will recycle the buffer appropriately if we end up arming poll (or if we're ready to retry), but not for the io-wq case if we have attempted poll first. Explicitly recycle the buffer to avoid both hanging on to it too long, but also to avoid multiple reads grabbing the same one. This can happen for ring mapped buffers, since it hasn't necessarily been committed. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Link: https://github.com/axboe/liburing/issues/605 Signed-off-by: Jens Axboe --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index 95a1a78d799a..d3ee4fc532fa 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8690,6 +8690,7 @@ static void io_queue_async(struct io_kiocb *req, int ret) * Queued up for async execution, worker will release * submit reference when the iocb is actually submitted. */ + io_kbuf_recycle(req, 0); io_queue_iowq(req, NULL); break; case IO_APOLL_OK: From b672332ef9161f8cada005aaa9b333a19e496f07 Mon Sep 17 00:00:00 2001 From: Youling Tang Date: Mon, 13 Jun 2022 18:54:12 +0800 Subject: [PATCH 232/298] LoongArch: vmlinux.lds.S: Add missing ELF_DETAILS Commit c604abc3f6e ("vmlinux.lds.h: Split ELF_DETAILS from STABS_DEBUG") splits ELF_DETAILS from STABS_DEBUG, resulting in missing ELF_DETAILS information in LoongArch architecture, so add it. Fixes: c604abc3f6e ("vmlinux.lds.h: Split ELF_DETAILS from STABS_DEBUG") Signed-off-by: Youling Tang Signed-off-by: Huacai Chen --- arch/loongarch/kernel/vmlinux.lds.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/loongarch/kernel/vmlinux.lds.S b/arch/loongarch/kernel/vmlinux.lds.S index 9d508158fe1a..78311a6101a3 100644 --- a/arch/loongarch/kernel/vmlinux.lds.S +++ b/arch/loongarch/kernel/vmlinux.lds.S @@ -101,6 +101,7 @@ SECTIONS STABS_DEBUG DWARF_DEBUG + ELF_DETAILS .gptab.sdata : { *(.gptab.data) From a667e4d3d0b021e13faad19f59cc49b706ae3d16 Mon Sep 17 00:00:00 2001 From: Yanteng Si Date: Fri, 17 Jun 2022 20:47:54 +0800 Subject: [PATCH 233/298] docs/LoongArch: Fix notes rendering by using reST directives Notes are better expressed with reST admonitions. Fixes: 0ea8ce61cb2c ("Documentation: LoongArch: Add basic documentations") Reviewed-by: WANG Xuerui Signed-off-by: Yanteng Si Signed-off-by: Huacai Chen --- Documentation/loongarch/introduction.rst | 15 +++++++++------ Documentation/loongarch/irq-chip-model.rst | 22 +++++++++++++--------- 2 files changed, 22 insertions(+), 15 deletions(-) diff --git a/Documentation/loongarch/introduction.rst b/Documentation/loongarch/introduction.rst index 2bf40ad370df..216b3f390e80 100644 --- a/Documentation/loongarch/introduction.rst +++ b/Documentation/loongarch/introduction.rst @@ -45,10 +45,12 @@ Name Alias Usage Preserved ``$r23``-``$r31`` ``$s0``-``$s8`` Static registers Yes ================= =============== =================== ============ -Note: The register ``$r21`` is reserved in the ELF psABI, but used by the Linux -kernel for storing the percpu base address. It normally has no ABI name, but is -called ``$u0`` in the kernel. You may also see ``$v0`` or ``$v1`` in some old code, -however they are deprecated aliases of ``$a0`` and ``$a1`` respectively. +.. Note:: + The register ``$r21`` is reserved in the ELF psABI, but used by the Linux + kernel for storing the percpu base address. It normally has no ABI name, + but is called ``$u0`` in the kernel. You may also see ``$v0`` or ``$v1`` + in some old code,however they are deprecated aliases of ``$a0`` and ``$a1`` + respectively. FPRs ---- @@ -69,8 +71,9 @@ Name Alias Usage Preserved ``$f24``-``$f31`` ``$fs0``-``$fs7`` Static registers Yes ================= ================== =================== ============ -Note: You may see ``$fv0`` or ``$fv1`` in some old code, however they are deprecated -aliases of ``$fa0`` and ``$fa1`` respectively. +.. Note:: + You may see ``$fv0`` or ``$fv1`` in some old code, however they are + deprecated aliases of ``$fa0`` and ``$fa1`` respectively. VRs ---- diff --git a/Documentation/loongarch/irq-chip-model.rst b/Documentation/loongarch/irq-chip-model.rst index 8d88f7ab2e5e..7988f4192363 100644 --- a/Documentation/loongarch/irq-chip-model.rst +++ b/Documentation/loongarch/irq-chip-model.rst @@ -145,12 +145,16 @@ Documentation of Loongson's LS7A chipset: https://github.com/loongson/LoongArch-Documentation/releases/latest/download/Loongson-7A1000-usermanual-2.00-EN.pdf (in English) -Note: CPUINTC is CSR.ECFG/CSR.ESTAT and its interrupt controller described -in Section 7.4 of "LoongArch Reference Manual, Vol 1"; LIOINTC is "Legacy I/O -Interrupts" described in Section 11.1 of "Loongson 3A5000 Processor Reference -Manual"; EIOINTC is "Extended I/O Interrupts" described in Section 11.2 of -"Loongson 3A5000 Processor Reference Manual"; HTVECINTC is "HyperTransport -Interrupts" described in Section 14.3 of "Loongson 3A5000 Processor Reference -Manual"; PCH-PIC/PCH-MSI is "Interrupt Controller" described in Section 5 of -"Loongson 7A1000 Bridge User Manual"; PCH-LPC is "LPC Interrupts" described in -Section 24.3 of "Loongson 7A1000 Bridge User Manual". +.. Note:: + - CPUINTC is CSR.ECFG/CSR.ESTAT and its interrupt controller described + in Section 7.4 of "LoongArch Reference Manual, Vol 1"; + - LIOINTC is "Legacy I/OInterrupts" described in Section 11.1 of + "Loongson 3A5000 Processor Reference Manual"; + - EIOINTC is "Extended I/O Interrupts" described in Section 11.2 of + "Loongson 3A5000 Processor Reference Manual"; + - HTVECINTC is "HyperTransport Interrupts" described in Section 14.3 of + "Loongson 3A5000 Processor Reference Manual"; + - PCH-PIC/PCH-MSI is "Interrupt Controller" described in Section 5 of + "Loongson 7A1000 Bridge User Manual"; + - PCH-LPC is "LPC Interrupts" described in Section 24.3 of + "Loongson 7A1000 Bridge User Manual". From 03dfb4a3abc4cc497850e6968b59005485592369 Mon Sep 17 00:00:00 2001 From: Yanteng Si Date: Fri, 17 Jun 2022 20:47:55 +0800 Subject: [PATCH 234/298] docs/zh_CN/LoongArch: Fix notes rendering by using reST directives Notes are better expressed with reST admonitions. Fixes: f23b22599f8e ("Documentation/zh_CN: Add basic LoongArch documentations") Reviewed-by: WANG Xuerui Signed-off-by: Yanteng Si Signed-off-by: Huacai Chen --- .../translations/zh_CN/loongarch/introduction.rst | 14 ++++++++------ .../zh_CN/loongarch/irq-chip-model.rst | 14 ++++++++------ 2 files changed, 16 insertions(+), 12 deletions(-) diff --git a/Documentation/translations/zh_CN/loongarch/introduction.rst b/Documentation/translations/zh_CN/loongarch/introduction.rst index e31a1a928c48..11686ee0caeb 100644 --- a/Documentation/translations/zh_CN/loongarch/introduction.rst +++ b/Documentation/translations/zh_CN/loongarch/introduction.rst @@ -46,10 +46,11 @@ LA64中每个寄存器为64位宽。 ``$r0`` 的内容总是固定为0,而其 ``$r23``-``$r31`` ``$s0``-``$s8`` 静态寄存器 是 ================= =============== =================== ========== -注意:``$r21``寄存器在ELF psABI中保留未使用,但是在Linux内核用于保存每CPU -变量基地址。该寄存器没有ABI命名,不过在内核中称为``$u0``。在一些遗留代码 -中有时可能见到``$v0``和``$v1``,它们是``$a0``和``$a1``的别名,属于已经废弃 -的用法。 +.. note:: + 注意: ``$r21`` 寄存器在ELF psABI中保留未使用,但是在Linux内核用于保 + 存每CPU变量基地址。该寄存器没有ABI命名,不过在内核中称为 ``$u0`` 。在 + 一些遗留代码中有时可能见到 ``$v0`` 和 ``$v1`` ,它们是 ``$a0`` 和 + ``$a1`` 的别名,属于已经废弃的用法。 浮点寄存器 ---------- @@ -68,8 +69,9 @@ LA64中每个寄存器为64位宽。 ``$r0`` 的内容总是固定为0,而其 ``$f24``-``$f31`` ``$fs0``-``$fs7`` 静态寄存器 是 ================= ================== =================== ========== -注意:在一些遗留代码中有时可能见到 ``$v0`` 和 ``$v1`` ,它们是 ``$a0`` -和 ``$a1`` 的别名,属于已经废弃的用法。 +.. note:: + 注意:在一些遗留代码中有时可能见到 ``$v0`` 和 ``$v1`` ,它们是 + ``$a0`` 和 ``$a1`` 的别名,属于已经废弃的用法。 向量寄存器 diff --git a/Documentation/translations/zh_CN/loongarch/irq-chip-model.rst b/Documentation/translations/zh_CN/loongarch/irq-chip-model.rst index 2a4c3ad38be4..fb5d23b49ed5 100644 --- a/Documentation/translations/zh_CN/loongarch/irq-chip-model.rst +++ b/Documentation/translations/zh_CN/loongarch/irq-chip-model.rst @@ -147,9 +147,11 @@ PCH-LPC:: https://github.com/loongson/LoongArch-Documentation/releases/latest/download/Loongson-7A1000-usermanual-2.00-EN.pdf (英文版) -注:CPUINTC即《龙芯架构参考手册卷一》第7.4节所描述的CSR.ECFG/CSR.ESTAT寄存器及其中断 -控制逻辑;LIOINTC即《龙芯3A5000处理器使用手册》第11.1节所描述的“传统I/O中断”;EIOINTC -即《龙芯3A5000处理器使用手册》第11.2节所描述的“扩展I/O中断”;HTVECINTC即《龙芯3A5000 -处理器使用手册》第14.3节所描述的“HyperTransport中断”;PCH-PIC/PCH-MSI即《龙芯7A1000桥 -片用户手册》第5章所描述的“中断控制器”;PCH-LPC即《龙芯7A1000桥片用户手册》第24.3节所 -描述的“LPC中断”。 +.. note:: + - CPUINTC:即《龙芯架构参考手册卷一》第7.4节所描述的CSR.ECFG/CSR.ESTAT寄存器及其 + 中断控制逻辑; + - LIOINTC:即《龙芯3A5000处理器使用手册》第11.1节所描述的“传统I/O中断”; + - EIOINTC:即《龙芯3A5000处理器使用手册》第11.2节所描述的“扩展I/O中断”; + - HTVECINTC:即《龙芯3A5000处理器使用手册》第14.3节所描述的“HyperTransport中断”; + - PCH-PIC/PCH-MSI:即《龙芯7A1000桥片用户手册》第5章所描述的“中断控制器”; + - PCH-LPC:即《龙芯7A1000桥片用户手册》第24.3节所描述的“LPC中断”。 From c50f11c6196f45c92ca48b16a5071615d4ae0572 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 10 Jun 2022 16:12:27 +0100 Subject: [PATCH 235/298] arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer Invalidating the buffer memory in arch_sync_dma_for_device() for FROM_DEVICE transfers When using the streaming DMA API to map a buffer prior to inbound non-coherent DMA (i.e. DMA_FROM_DEVICE), we invalidate any dirty CPU cachelines so that they will not be written back during the transfer and corrupt the buffer contents written by the DMA. This, however, poses two potential problems: (1) If the DMA transfer does not write to every byte in the buffer, then the unwritten bytes will contain stale data once the transfer has completed. (2) If the buffer has a virtual alias in userspace, then stale data may be visible via this alias during the period between performing the cache invalidation and the DMA writes landing in memory. Address both of these issues by cleaning (aka writing-back) the dirty lines in arch_sync_dma_for_device(DMA_FROM_DEVICE) instead of discarding them using invalidation. Cc: Ard Biesheuvel Cc: Christoph Hellwig Cc: Robin Murphy Cc: Russell King Cc: Link: https://lore.kernel.org/r/20220606152150.GA31568@willie-the-truck Signed-off-by: Will Deacon Reviewed-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20220610151228.4562-2-will@kernel.org Signed-off-by: Catalin Marinas --- arch/arm64/mm/cache.S | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S index 0ea6cc25dc66..21c907987080 100644 --- a/arch/arm64/mm/cache.S +++ b/arch/arm64/mm/cache.S @@ -218,8 +218,6 @@ SYM_FUNC_ALIAS(__dma_flush_area, __pi___dma_flush_area) */ SYM_FUNC_START(__pi___dma_map_area) add x1, x0, x1 - cmp w2, #DMA_FROM_DEVICE - b.eq __pi_dcache_inval_poc b __pi_dcache_clean_poc SYM_FUNC_END(__pi___dma_map_area) SYM_FUNC_ALIAS(__dma_map_area, __pi___dma_map_area) From a2b36ffbf5b6ec301e61249c8b09e610bc80772f Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sun, 12 Jun 2022 16:43:25 +0200 Subject: [PATCH 236/298] x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions" This reverts commit 4c5e242d3e93. Prior to 4c5e242d3e93 ("x86/PCI: Clip only host bridge windows for E820 regions"), E820 regions did not affect PCI host bridge windows. We only looked at E820 regions and avoided them when allocating new MMIO space. If firmware PCI bridge window and BAR assignments used E820 regions, we left them alone. After 4c5e242d3e93, we removed E820 regions from the PCI host bridge windows before looking at BARs, so firmware assignments in E820 regions looked like errors, and we moved things around to fit in the space left (if any) after removing the E820 regions. This unnecessary BAR reassignment broke several machines. Guilherme reported that Steam Deck fails to boot after 4c5e242d3e93. We clipped the window that contained most 32-bit BARs: BIOS-e820: [mem 0x00000000a0000000-0x00000000a00fffff] reserved acpi PNP0A08:00: clipped [mem 0x80000000-0xf7ffffff window] to [mem 0xa0100000-0xf7ffffff window] for e820 entry [mem 0xa0000000-0xa00fffff] which forced us to reassign all those BARs, for example, this NVMe BAR: pci 0000:00:01.2: PCI bridge to [bus 01] pci 0000:00:01.2: bridge window [mem 0x80600000-0x806fffff] pci 0000:01:00.0: BAR 0: [mem 0x80600000-0x80603fff 64bit] pci 0000:00:01.2: can't claim window [mem 0x80600000-0x806fffff]: no compatible bridge window pci 0000:01:00.0: can't claim BAR 0 [mem 0x80600000-0x80603fff 64bit]: no compatible bridge window pci 0000:00:01.2: bridge window: assigned [mem 0xa0100000-0xa01fffff] pci 0000:01:00.0: BAR 0: assigned [mem 0xa0100000-0xa0103fff 64bit] All the reassignments were successful, so the devices should have been functional at the new addresses, but some were not. Andy reported a similar failure on an Intel MID platform. Benjamin reported a similar failure on a VMWare Fusion VM. Note: this is not a clean revert; this revert keeps the later change to make the clipping dependent on a new pci_use_e820 bool, moving the checking of this bool to arch_remove_reservations(). [bhelgaas: commit log, add more reporters and testers] BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216109 Reported-by: Guilherme G. Piccoli Reported-by: Andy Shevchenko Reported-by: Benjamin Coddington Reported-by: Jongman Heo Fixes: 4c5e242d3e93 ("x86/PCI: Clip only host bridge windows for E820 regions") Link: https://lore.kernel.org/r/20220612144325.85366-1-hdegoede@redhat.com Tested-by: Guilherme G. Piccoli Tested-by: Andy Shevchenko Tested-by: Benjamin Coddington Signed-off-by: Hans de Goede Signed-off-by: Bjorn Helgaas --- arch/x86/include/asm/e820/api.h | 5 ----- arch/x86/include/asm/pci_x86.h | 8 ++++++++ arch/x86/kernel/resource.c | 14 +++++++++----- arch/x86/pci/acpi.c | 8 +------- 4 files changed, 18 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h index 5a39ed59b6db..e8f58ddd06d9 100644 --- a/arch/x86/include/asm/e820/api.h +++ b/arch/x86/include/asm/e820/api.h @@ -4,9 +4,6 @@ #include -struct device; -struct resource; - extern struct e820_table *e820_table; extern struct e820_table *e820_table_kexec; extern struct e820_table *e820_table_firmware; @@ -46,8 +43,6 @@ extern void e820__register_nosave_regions(unsigned long limit_pfn); extern int e820__get_entry_type(u64 start, u64 end); -extern void remove_e820_regions(struct device *dev, struct resource *avail); - /* * Returns true iff the specified range [start,end) is completely contained inside * the ISA region. diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h index f52a886d35cf..70533fdcbf02 100644 --- a/arch/x86/include/asm/pci_x86.h +++ b/arch/x86/include/asm/pci_x86.h @@ -69,6 +69,8 @@ void pcibios_scan_specific_bus(int busn); /* pci-irq.c */ +struct pci_dev; + struct irq_info { u8 bus, devfn; /* Bus, device and function */ struct { @@ -246,3 +248,9 @@ static inline void mmio_config_writel(void __iomem *pos, u32 val) # define x86_default_pci_init_irq NULL # define x86_default_pci_fixup_irqs NULL #endif + +#if defined(CONFIG_PCI) && defined(CONFIG_ACPI) +extern bool pci_use_e820; +#else +#define pci_use_e820 false +#endif diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c index db2b350a37b7..bba1abd05bfe 100644 --- a/arch/x86/kernel/resource.c +++ b/arch/x86/kernel/resource.c @@ -1,7 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 -#include #include +#include #include +#include static void resource_clip(struct resource *res, resource_size_t start, resource_size_t end) @@ -24,14 +25,14 @@ static void resource_clip(struct resource *res, resource_size_t start, res->start = end + 1; } -void remove_e820_regions(struct device *dev, struct resource *avail) +static void remove_e820_regions(struct resource *avail) { int i; struct e820_entry *entry; u64 e820_start, e820_end; struct resource orig = *avail; - if (!(avail->flags & IORESOURCE_MEM)) + if (!pci_use_e820) return; for (i = 0; i < e820_table->nr_entries; i++) { @@ -41,7 +42,7 @@ void remove_e820_regions(struct device *dev, struct resource *avail) resource_clip(avail, e820_start, e820_end); if (orig.start != avail->start || orig.end != avail->end) { - dev_info(dev, "clipped %pR to %pR for e820 entry [mem %#010Lx-%#010Lx]\n", + pr_info("clipped %pR to %pR for e820 entry [mem %#010Lx-%#010Lx]\n", &orig, avail, e820_start, e820_end); orig = *avail; } @@ -55,6 +56,9 @@ void arch_remove_reservations(struct resource *avail) * the low 1MB unconditionally, as this area is needed for some ISA * cards requiring a memory range, e.g. the i82365 PCMCIA controller. */ - if (avail->flags & IORESOURCE_MEM) + if (avail->flags & IORESOURCE_MEM) { resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END); + + remove_e820_regions(avail); + } } diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c index a4f43054bc79..2f82480fd430 100644 --- a/arch/x86/pci/acpi.c +++ b/arch/x86/pci/acpi.c @@ -8,7 +8,6 @@ #include #include #include -#include struct pci_root_info { struct acpi_pci_root_info common; @@ -20,7 +19,7 @@ struct pci_root_info { #endif }; -static bool pci_use_e820 = true; +bool pci_use_e820 = true; static bool pci_use_crs = true; static bool pci_ignore_seg; @@ -387,11 +386,6 @@ static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci) status = acpi_pci_probe_root_resources(ci); - if (pci_use_e820) { - resource_list_for_each_entry(entry, &ci->resources) - remove_e820_regions(&device->dev, entry->res); - } - if (pci_use_crs) { resource_list_for_each_entry_safe(entry, tmp, &ci->resources) if (resource_is_pcicfg_ioport(entry->res)) From 1e7769653b06b56b7ea7d56911d2d5b2957750cd Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Tue, 14 Jun 2022 15:01:35 +0300 Subject: [PATCH 237/298] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page load_unaligned_zeropad() can lead to unwanted loads across page boundaries. The unwanted loads are typically harmless. But, they might be made to totally unrelated or even unmapped memory. load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now #VE) to recover from these unwanted loads. In TDX guests, the second page can be shared page and a VMM may configure it to trigger #VE. The kernel assumes that #VE on a shared page is an MMIO access and tries to decode instruction to handle it. In case of load_unaligned_zeropad() it may result in confusion as it is not MMIO access. Fix it by detecting split page MMIO accesses and failing them. load_unaligned_zeropad() will recover using exception fixups. The issue was discovered by analysis and reproduced artificially. It was not triggered during testing. [ dhansen: fix up changelogs and comments for grammar and clarity, plus incorporate Kirill's off-by-one fix] Signed-off-by: Kirill A. Shutemov Signed-off-by: Dave Hansen Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com --- arch/x86/coco/tdx/tdx.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index c8d44f463283..928dcf7a20d9 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -333,8 +333,8 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val) static int handle_mmio(struct pt_regs *regs, struct ve_info *ve) { + unsigned long *reg, val, vaddr; char buffer[MAX_INSN_SIZE]; - unsigned long *reg, val; struct insn insn = {}; enum mmio_type mmio; int size, extend_size; @@ -360,6 +360,19 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve) return -EINVAL; } + /* + * Reject EPT violation #VEs that split pages. + * + * MMIO accesses are supposed to be naturally aligned and therefore + * never cross page boundaries. Seeing split page accesses indicates + * a bug or a load_unaligned_zeropad() that stepped into an MMIO page. + * + * load_unaligned_zeropad() will recover using exception fixups. + */ + vaddr = (unsigned long)insn_get_addr_ref(&insn, regs); + if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE) + return -EFAULT; + /* Handle writes first */ switch (mmio) { case MMIO_WRITE: From 5d24968f5b7e00bae564b1646c3b9e0e3750aabe Mon Sep 17 00:00:00 2001 From: Shyam Prasad N Date: Tue, 14 Jun 2022 11:47:24 +0000 Subject: [PATCH 238/298] cifs: when a channel is not found for server, log its connection id cifs_ses_get_chan_index gets the index for a given server pointer. When a match is not found, we warn about a possible bug. However, printing details about the non-matching server could be more useful to debug here. Signed-off-by: Shyam Prasad N Signed-off-by: Steve French --- fs/cifs/sess.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/cifs/sess.c b/fs/cifs/sess.c index 0bece97547d4..d417de354d9d 100644 --- a/fs/cifs/sess.c +++ b/fs/cifs/sess.c @@ -81,6 +81,9 @@ cifs_ses_get_chan_index(struct cifs_ses *ses, } /* If we didn't find the channel, it is likely a bug */ + if (server) + cifs_dbg(VFS, "unable to get chan index for server: 0x%llx", + server->conn_id); WARN_ON(1); return 0; } From 9b6641dd95a0c441b277dd72ba22fed8d61f76ad Mon Sep 17 00:00:00 2001 From: Ye Bin Date: Wed, 25 May 2022 09:29:04 +0800 Subject: [PATCH 239/298] ext4: fix super block checksum incorrect after mount MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We got issue as follows: [home]# mount /dev/sda test EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended [home]# dmesg EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended EXT4-fs (sda): Errors on filesystem, clearing orphan list. EXT4-fs (sda): recovery complete EXT4-fs (sda): mounted filesystem with ordered data mode. Quota mode: none. [home]# debugfs /dev/sda debugfs 1.46.5 (30-Dec-2021) Checksum errors in superblock! Retrying... Reason is ext4_orphan_cleanup will reset ‘s_last_orphan’ but not update super block checksum. To solve above issue, defer update super block checksum after ext4_orphan_cleanup. Signed-off-by: Ye Bin Cc: stable@kernel.org Reviewed-by: Jan Kara Reviewed-by: Ritesh Harjani Link: https://lore.kernel.org/r/20220525012904.1604737-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/super.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b2ecae8adbfc..13d562d11235 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5302,14 +5302,6 @@ no_journal: err = percpu_counter_init(&sbi->s_freeinodes_counter, freei, GFP_KERNEL); } - /* - * Update the checksum after updating free space/inode - * counters. Otherwise the superblock can have an incorrect - * checksum in the buffer cache until it is written out and - * e2fsprogs programs trying to open a file system immediately - * after it is mounted can fail. - */ - ext4_superblock_csum_set(sb); if (!err) err = percpu_counter_init(&sbi->s_dirs_counter, ext4_count_dirs(sb), GFP_KERNEL); @@ -5367,6 +5359,14 @@ no_journal: EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS; ext4_orphan_cleanup(sb, es); EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS; + /* + * Update the checksum after updating free space/inode counters and + * ext4_orphan_cleanup. Otherwise the superblock can have an incorrect + * checksum in the buffer cache until it is written out and + * e2fsprogs programs trying to open a file system immediately + * after it is mounted can fail. + */ + ext4_superblock_csum_set(sb); if (needs_recovery) { ext4_msg(sb, KERN_INFO, "recovery complete"); err = ext4_mark_recovery_complete(sb, es); From 4efd9f0d120c55b08852ee5605dbb02a77089a5d Mon Sep 17 00:00:00 2001 From: Shuqi Zhang Date: Wed, 25 May 2022 11:01:20 +0800 Subject: [PATCH 240/298] ext4: use kmemdup() to replace kmalloc + memcpy Replace kmalloc + memcpy with kmemdup() Signed-off-by: Shuqi Zhang Reviewed-by: Ritesh Harjani Link: https://lore.kernel.org/r/20220525030120.803330-1-zhangshuqi3@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/xattr.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 042325349098..564e28a1aa94 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -1895,11 +1895,10 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode, unlock_buffer(bs->bh); ea_bdebug(bs->bh, "cloning"); - s->base = kmalloc(bs->bh->b_size, GFP_NOFS); + s->base = kmemdup(BHDR(bs->bh), bs->bh->b_size, GFP_NOFS); error = -ENOMEM; if (s->base == NULL) goto cleanup; - memcpy(s->base, BHDR(bs->bh), bs->bh->b_size); s->first = ENTRY(header(s->base)+1); header(s->base)->h_refcount = cpu_to_le32(1); s->here = ENTRY(s->base + offset); From 85456054e10b0247920b00422d27365e689d9f4a Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Wed, 25 May 2022 21:04:12 -0700 Subject: [PATCH 241/298] ext4: fix up test_dummy_encryption handling for new mount API Since ext4 was converted to the new mount API, the test_dummy_encryption mount option isn't being handled entirely correctly, because the needed fscrypt_set_test_dummy_encryption() helper function combines parsing/checking/applying into one function. That doesn't work well with the new mount API, which split these into separate steps. This was sort of okay anyway, due to the parsing logic that was copied from fscrypt_set_test_dummy_encryption() into ext4_parse_param(), combined with an additional check in ext4_check_test_dummy_encryption(). However, these overlooked the case of changing the value of test_dummy_encryption on remount, which isn't allowed but ext4 wasn't detecting until ext4_apply_options() when it's too late to fail. Another bug is that if test_dummy_encryption was specified multiple times with an argument, memory was leaked. Fix this up properly by using the new helper functions that allow splitting up the parse/check/apply steps for test_dummy_encryption. Fixes: cebe85d570cf ("ext4: switch to the new mount api") Signed-off-by: Eric Biggers Link: https://lore.kernel.org/r/20220526040412.173025-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o --- fs/ext4/super.c | 134 +++++++++++++++++++++++++----------------------- 1 file changed, 71 insertions(+), 63 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 13d562d11235..845f2f8aee5f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -87,7 +87,7 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb, static int ext4_validate_options(struct fs_context *fc); static int ext4_check_opt_consistency(struct fs_context *fc, struct super_block *sb); -static int ext4_apply_options(struct fs_context *fc, struct super_block *sb); +static void ext4_apply_options(struct fs_context *fc, struct super_block *sb); static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param); static int ext4_get_tree(struct fs_context *fc); static int ext4_reconfigure(struct fs_context *fc); @@ -1870,31 +1870,12 @@ ext4_sb_read_encoding(const struct ext4_super_block *es) } #endif -static int ext4_set_test_dummy_encryption(struct super_block *sb, char *arg) -{ -#ifdef CONFIG_FS_ENCRYPTION - struct ext4_sb_info *sbi = EXT4_SB(sb); - int err; - - err = fscrypt_set_test_dummy_encryption(sb, arg, - &sbi->s_dummy_enc_policy); - if (err) { - ext4_msg(sb, KERN_WARNING, - "Error while setting test dummy encryption [%d]", err); - return err; - } - ext4_msg(sb, KERN_WARNING, "Test dummy encryption mode enabled"); -#endif - return 0; -} - #define EXT4_SPEC_JQUOTA (1 << 0) #define EXT4_SPEC_JQFMT (1 << 1) #define EXT4_SPEC_DATAJ (1 << 2) #define EXT4_SPEC_SB_BLOCK (1 << 3) #define EXT4_SPEC_JOURNAL_DEV (1 << 4) #define EXT4_SPEC_JOURNAL_IOPRIO (1 << 5) -#define EXT4_SPEC_DUMMY_ENCRYPTION (1 << 6) #define EXT4_SPEC_s_want_extra_isize (1 << 7) #define EXT4_SPEC_s_max_batch_time (1 << 8) #define EXT4_SPEC_s_min_batch_time (1 << 9) @@ -1911,7 +1892,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb, char *arg) struct ext4_fs_context { char *s_qf_names[EXT4_MAXQUOTAS]; - char *test_dummy_enc_arg; + struct fscrypt_dummy_policy dummy_enc_policy; int s_jquota_fmt; /* Format of quota to use */ #ifdef CONFIG_EXT4_DEBUG int s_fc_debug_max_replay; @@ -1953,7 +1934,7 @@ static void ext4_fc_free(struct fs_context *fc) for (i = 0; i < EXT4_MAXQUOTAS; i++) kfree(ctx->s_qf_names[i]); - kfree(ctx->test_dummy_enc_arg); + fscrypt_free_dummy_policy(&ctx->dummy_enc_policy); kfree(ctx); } @@ -2029,6 +2010,29 @@ static int unnote_qf_name(struct fs_context *fc, int qtype) } #endif +static int ext4_parse_test_dummy_encryption(const struct fs_parameter *param, + struct ext4_fs_context *ctx) +{ + int err; + + if (!IS_ENABLED(CONFIG_FS_ENCRYPTION)) { + ext4_msg(NULL, KERN_WARNING, + "test_dummy_encryption option not supported"); + return -EINVAL; + } + err = fscrypt_parse_test_dummy_encryption(param, + &ctx->dummy_enc_policy); + if (err == -EINVAL) { + ext4_msg(NULL, KERN_WARNING, + "Value of option \"%s\" is unrecognized", param->key); + } else if (err == -EEXIST) { + ext4_msg(NULL, KERN_WARNING, + "Conflicting test_dummy_encryption options"); + return -EINVAL; + } + return err; +} + #define EXT4_SET_CTX(name) \ static inline void ctx_set_##name(struct ext4_fs_context *ctx, \ unsigned long flag) \ @@ -2291,29 +2295,7 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param) ctx->spec |= EXT4_SPEC_JOURNAL_IOPRIO; return 0; case Opt_test_dummy_encryption: -#ifdef CONFIG_FS_ENCRYPTION - if (param->type == fs_value_is_flag) { - ctx->spec |= EXT4_SPEC_DUMMY_ENCRYPTION; - ctx->test_dummy_enc_arg = NULL; - return 0; - } - if (*param->string && - !(!strcmp(param->string, "v1") || - !strcmp(param->string, "v2"))) { - ext4_msg(NULL, KERN_WARNING, - "Value of option \"%s\" is unrecognized", - param->key); - return -EINVAL; - } - ctx->spec |= EXT4_SPEC_DUMMY_ENCRYPTION; - ctx->test_dummy_enc_arg = kmemdup_nul(param->string, param->size, - GFP_KERNEL); - return 0; -#else - ext4_msg(NULL, KERN_WARNING, - "test_dummy_encryption option not supported"); - return -EINVAL; -#endif + return ext4_parse_test_dummy_encryption(param, ctx); case Opt_dax: case Opt_dax_type: #ifdef CONFIG_FS_DAX @@ -2504,7 +2486,8 @@ parse_failed: if (s_ctx->spec & EXT4_SPEC_JOURNAL_IOPRIO) m_ctx->journal_ioprio = s_ctx->journal_ioprio; - ret = ext4_apply_options(fc, sb); + ext4_apply_options(fc, sb); + ret = 0; out_free: if (fc) { @@ -2673,11 +2656,11 @@ err_jquota_specified: static int ext4_check_test_dummy_encryption(const struct fs_context *fc, struct super_block *sb) { -#ifdef CONFIG_FS_ENCRYPTION const struct ext4_fs_context *ctx = fc->fs_private; const struct ext4_sb_info *sbi = EXT4_SB(sb); + int err; - if (!(ctx->spec & EXT4_SPEC_DUMMY_ENCRYPTION)) + if (!fscrypt_is_dummy_policy_set(&ctx->dummy_enc_policy)) return 0; if (!ext4_has_feature_encrypt(sb)) { @@ -2691,14 +2674,46 @@ static int ext4_check_test_dummy_encryption(const struct fs_context *fc, * needed to allow it to be set or changed during remount. We do allow * it to be specified during remount, but only if there is no change. */ - if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE && - !sbi->s_dummy_enc_policy.policy) { + if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) { + if (fscrypt_dummy_policies_equal(&sbi->s_dummy_enc_policy, + &ctx->dummy_enc_policy)) + return 0; ext4_msg(NULL, KERN_WARNING, - "Can't set test_dummy_encryption on remount"); + "Can't set or change test_dummy_encryption on remount"); return -EINVAL; } -#endif /* CONFIG_FS_ENCRYPTION */ - return 0; + /* Also make sure s_mount_opts didn't contain a conflicting value. */ + if (fscrypt_is_dummy_policy_set(&sbi->s_dummy_enc_policy)) { + if (fscrypt_dummy_policies_equal(&sbi->s_dummy_enc_policy, + &ctx->dummy_enc_policy)) + return 0; + ext4_msg(NULL, KERN_WARNING, + "Conflicting test_dummy_encryption options"); + return -EINVAL; + } + /* + * fscrypt_add_test_dummy_key() technically changes the super_block, so + * technically it should be delayed until ext4_apply_options() like the + * other changes. But since we never get here for remounts (see above), + * and this is the last chance to report errors, we do it here. + */ + err = fscrypt_add_test_dummy_key(sb, &ctx->dummy_enc_policy); + if (err) + ext4_msg(NULL, KERN_WARNING, + "Error adding test dummy encryption key [%d]", err); + return err; +} + +static void ext4_apply_test_dummy_encryption(struct ext4_fs_context *ctx, + struct super_block *sb) +{ + if (!fscrypt_is_dummy_policy_set(&ctx->dummy_enc_policy) || + /* if already set, it was already verified to be the same */ + fscrypt_is_dummy_policy_set(&EXT4_SB(sb)->s_dummy_enc_policy)) + return; + EXT4_SB(sb)->s_dummy_enc_policy = ctx->dummy_enc_policy; + memset(&ctx->dummy_enc_policy, 0, sizeof(ctx->dummy_enc_policy)); + ext4_msg(sb, KERN_WARNING, "Test dummy encryption mode enabled"); } static int ext4_check_opt_consistency(struct fs_context *fc, @@ -2785,11 +2800,10 @@ fail_dax_change_remount: return ext4_check_quota_consistency(fc, sb); } -static int ext4_apply_options(struct fs_context *fc, struct super_block *sb) +static void ext4_apply_options(struct fs_context *fc, struct super_block *sb) { struct ext4_fs_context *ctx = fc->fs_private; struct ext4_sb_info *sbi = fc->s_fs_info; - int ret = 0; sbi->s_mount_opt &= ~ctx->mask_s_mount_opt; sbi->s_mount_opt |= ctx->vals_s_mount_opt; @@ -2825,11 +2839,7 @@ static int ext4_apply_options(struct fs_context *fc, struct super_block *sb) #endif ext4_apply_quota_options(fc, sb); - - if (ctx->spec & EXT4_SPEC_DUMMY_ENCRYPTION) - ret = ext4_set_test_dummy_encryption(sb, ctx->test_dummy_enc_arg); - - return ret; + ext4_apply_test_dummy_encryption(ctx, sb); } @@ -4552,9 +4562,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb) if (err < 0) goto failed_mount; - err = ext4_apply_options(fc, sb); - if (err < 0) - goto failed_mount; + ext4_apply_options(fc, sb); #if IS_ENABLED(CONFIG_UNICODE) if (ext4_has_feature_casefold(sb) && !sb->s_encoding) { From a08f789d2ab5242c07e716baf9a835725046be89 Mon Sep 17 00:00:00 2001 From: Baokun Li Date: Sat, 28 May 2022 19:00:15 +0800 Subject: [PATCH 242/298] ext4: fix bug_on ext4_mb_use_inode_pa Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/mballoc.c:3211! [...] RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f [...] Call Trace: ext4_mb_new_blocks+0x9df/0x5d30 ext4_ext_map_blocks+0x1803/0x4d80 ext4_map_blocks+0x3a4/0x1a10 ext4_writepages+0x126d/0x2c30 do_writepages+0x7f/0x1b0 __filemap_fdatawrite_range+0x285/0x3b0 file_write_and_wait_range+0xb1/0x140 ext4_sync_file+0x1aa/0xca0 vfs_fsync_range+0xfb/0x260 do_fsync+0x48/0xa0 [...] ================================================================== Above issue may happen as follows: ------------------------------------- do_fsync vfs_fsync_range ext4_sync_file file_write_and_wait_range __filemap_fdatawrite_range do_writepages ext4_writepages mpage_map_and_submit_extent mpage_map_one_extent ext4_map_blocks ext4_mb_new_blocks ext4_mb_normalize_request >>> start + size <= ac->ac_o_ex.fe_logical ext4_mb_regular_allocator ext4_mb_simple_scan_group ext4_mb_use_best_found ext4_mb_new_preallocation ext4_mb_new_inode_pa ext4_mb_use_inode_pa >>> set ac->ac_b_ex.fe_len <= 0 ext4_mb_mark_diskspace_used >>> BUG_ON(ac->ac_b_ex.fe_len <= 0); we can easily reproduce this problem with the following commands: `fallocate -l100M disk` `mkfs.ext4 -b 1024 -g 256 disk` `mount disk /mnt` `fsstress -d /mnt -l 0 -n 1000 -p 1` The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur when the size is truncated. So start should be the start position of the group where ac_o_ex.fe_logical is located after alignment. In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP is very large, the value calculated by start_off is more accurate. Cc: stable@kernel.org Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") Reported-by: Hulk Robot Signed-off-by: Baokun Li Reviewed-by: Ritesh Harjani Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/mballoc.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 9f12f29bc346..4d3740fdff90 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4104,6 +4104,15 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, size = size >> bsbits; start = start_off >> bsbits; + /* + * For tiny groups (smaller than 8MB) the chosen allocation + * alignment may be larger than group size. Make sure the + * alignment does not move allocation to a different group which + * makes mballoc fail assertions later. + */ + start = max(start, rounddown(ac->ac_o_ex.fe_logical, + (ext4_lblk_t)EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); + /* don't cover already allocated blocks in selected range */ if (ar->pleft && start <= ar->lleft) { size -= ar->lleft + 1 - start; From cf4ff938b47fc5c00b0ccce53a3b50eca9b32281 Mon Sep 17 00:00:00 2001 From: Baokun Li Date: Sat, 28 May 2022 19:00:16 +0800 Subject: [PATCH 243/298] ext4: correct the judgment of BUG in ext4_mb_normalize_request ext4_mb_normalize_request() can move logical start of allocated blocks to reduce fragmentation and better utilize preallocation. However logical block requested as a start of allocation (ac->ac_o_ex.fe_logical) should always be covered by allocated blocks so we should check that by modifying and to or in the assertion. Signed-off-by: Baokun Li Reviewed-by: Ritesh Harjani Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/mballoc.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 4d3740fdff90..9e06334771a3 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4185,7 +4185,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, } rcu_read_unlock(); - if (start + size <= ac->ac_o_ex.fe_logical && + /* + * In this function "start" and "size" are normalized for better + * alignment and length such that we could preallocate more blocks. + * This normalization is done such that original request of + * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and + * "size" boundaries. + * (Note fe_len can be relaxed since FS block allocation API does not + * provide gurantee on number of contiguous blocks allocation since that + * depends upon free space left, etc). + * In case of inode pa, later we use the allocated blocks + * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated + * range of goal/best blocks [start, size] to put it at the + * ac_o_ex.fe_logical extent of this inode. + * (See ext4_mb_use_inode_pa() for more details) + */ + if (start + size <= ac->ac_o_ex.fe_logical || start > ac->ac_o_ex.fe_logical) { ext4_msg(ac->ac_sb, KERN_ERR, "start %lu, size %lu, fe_logical %lu", From bc75a6eb856cb1507fa907bf6c1eda91b3fef52f Mon Sep 17 00:00:00 2001 From: Ding Xiang Date: Mon, 30 May 2022 18:00:47 +0800 Subject: [PATCH 244/298] ext4: make variable "count" signed Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to be a signed integer so we can correctly check for an error code returned by dx_make_map(). Fixes: 46c116b920eb ("ext4: verify dir block before splitting it") Cc: stable@kernel.org Signed-off-by: Ding Xiang Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com Signed-off-by: Theodore Ts'o --- fs/ext4/namei.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 47d0ca4c795b..db4ba99d1ceb 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1929,7 +1929,8 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, struct dx_hash_info *hinfo) { unsigned blocksize = dir->i_sb->s_blocksize; - unsigned count, continued; + unsigned continued; + int count; struct buffer_head *bh2; ext4_lblk_t newblock; u32 hash2; From b55c3cd102a6f48b90e61c44f7f3dda8c290c694 Mon Sep 17 00:00:00 2001 From: Zhang Yi Date: Wed, 1 Jun 2022 17:27:17 +0800 Subject: [PATCH 245/298] ext4: add reserved GDT blocks check We capture a NULL pointer issue when resizing a corrupt ext4 image which is freshly clear resize_inode feature (not run e2fsck). It could be simply reproduced by following steps. The problem is because of the resize_inode feature was cleared, and it will convert the filesystem to meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was not reduced to zero, so could we mistakenly call reserve_backup_gdb() and passing an uninitialized resize_inode to it when adding new group descriptors. mkfs.ext4 /dev/sda 3G tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck mount /dev/sda /mnt resize2fs /dev/sda 8G ======== BUG: kernel NULL pointer dereference, address: 0000000000000028 CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748 ... RIP: 0010:ext4_flex_group_add+0xe08/0x2570 ... Call Trace: ext4_resize_fs+0xbec/0x1660 __ext4_ioctl+0x1749/0x24e0 ext4_ioctl+0x12/0x20 __x64_sys_ioctl+0xa6/0x110 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2dd739617b ======== The fix is simple, add a check in ext4_resize_begin() to make sure that the es->s_reserved_gdt_blocks is zero when the resize_inode feature is disabled. Cc: stable@kernel.org Signed-off-by: Zhang Yi Reviewed-by: Ritesh Harjani Reviewed-by: Jan Kara Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o --- fs/ext4/resize.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 90a941d20dff..8b70a4701293 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -53,6 +53,16 @@ int ext4_resize_begin(struct super_block *sb) if (!capable(CAP_SYS_RESOURCE)) return -EPERM; + /* + * If the reserved GDT blocks is non-zero, the resize_inode feature + * should always be set. + */ + if (EXT4_SB(sb)->s_es->s_reserved_gdt_blocks && + !ext4_has_feature_resize_inode(sb)) { + ext4_error(sb, "resize_inode disabled but reserved GDT blocks non-zero"); + return -EFSCORRUPTED; + } + /* * If we are not using the primary superblock/GDT copy don't resize, * because the user tools have no way of handling this. Probably a From 1f3ddff3755915a2b38de92d53508594de432d3d Mon Sep 17 00:00:00 2001 From: Xiang wangx Date: Sun, 5 Jun 2022 17:15:03 +0800 Subject: [PATCH 246/298] ext4: fix a doubled word "need" in a comment Signed-off-by: Xiang wangx Link: https://lore.kernel.org/r/20220605091503.12513-1-wangxiang@cdjrlc.com Signed-off-by: Theodore Ts'o --- fs/ext4/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 7a5353a8cfd7..42f590518b4c 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -438,7 +438,7 @@ int ext4_ext_migrate(struct inode *inode) /* * Worst case we can touch the allocation bitmaps and a block - * group descriptor block. We do need need to worry about + * group descriptor block. We do need to worry about * credits for modifying the quota inode. */ handle = ext4_journal_start(inode, EXT4_HT_MIGRATE, From a111daf0c53ae91e71fd2bfe7497862d14132e3e Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 19 Jun 2022 15:06:47 -0500 Subject: [PATCH 247/298] Linux 5.19-rc3 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 1a6678d817bd..513c1fbf7888 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ VERSION = 5 PATCHLEVEL = 19 SUBLEVEL = 0 -EXTRAVERSION = -rc2 +EXTRAVERSION = -rc3 NAME = Superb Owl # *DOCUMENTATION* From 1267d983117178b507b40c516cdcc5cceec553f9 Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 30 Jun 2022 05:02:38 -0500 Subject: [PATCH 248/298] dt-bindings: interrupt-controller: sifive,plic: Document Renesas RZ/Five SoC Renesas RZ/Five (R9A07G043) SoC is equipped with NCEPLIC100 RISC-V platform level interrupt controller from Andes Technology. NCEPLIC100 ignores subsequent EDGE interrupts until the previous EDGE interrupt is completed, due to this issue we have to follow different interrupt flow for EDGE and LEVEL interrupts. This patch documents Renesas RZ/Five (R9A07G043) SoC. Signed-off-by: Lad Prabhakar Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220630100241.35233-2-samuel@sholland.org --- .../sifive,plic-1.0.0.yaml | 64 +++++++++++++++++-- 1 file changed, 59 insertions(+), 5 deletions(-) diff --git a/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml b/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml index 27092c6a86c4..cd2b8bcaec3b 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml @@ -26,9 +26,14 @@ description: with priority below this threshold will not cause the PLIC to raise its interrupt line leading to the context. - While the PLIC supports both edge-triggered and level-triggered interrupts, - interrupt handlers are oblivious to this distinction and therefore it is not - specified in the PLIC device-tree binding. + The PLIC supports both edge-triggered and level-triggered interrupts. For + edge-triggered interrupts, the RISC-V PLIC spec allows two responses to edges + seen while an interrupt handler is active; the PLIC may either queue them or + ignore them. In the first case, handlers are oblivious to the trigger type, so + it is not included in the interrupt specifier. In the second case, software + needs to know the trigger type, so it can reorder the interrupt flow to avoid + missing interrupts. This special handling is needed by at least the Renesas + RZ/Five SoC (AX45MP AndesCore with a NCEPLIC100). While the RISC-V ISA doesn't specify a memory layout for the PLIC, the "sifive,plic-1.0.0" device is a concrete implementation of the PLIC that @@ -47,6 +52,10 @@ maintainers: properties: compatible: oneOf: + - items: + - enum: + - renesas,r9a07g043-plic + - const: andestech,nceplic100 - items: - enum: - sifive,fu540-c000-plic @@ -64,8 +73,7 @@ properties: '#address-cells': const: 0 - '#interrupt-cells': - const: 1 + '#interrupt-cells': true interrupt-controller: true @@ -82,6 +90,12 @@ properties: description: Specifies how many external interrupts are supported by this controller. + clocks: true + + power-domains: true + + resets: true + required: - compatible - '#address-cells' @@ -91,6 +105,46 @@ required: - interrupts-extended - riscv,ndev +allOf: + - if: + properties: + compatible: + contains: + enum: + - andestech,nceplic100 + + then: + properties: + '#interrupt-cells': + const: 2 + + else: + properties: + '#interrupt-cells': + const: 1 + + - if: + properties: + compatible: + contains: + const: renesas,r9a07g043-plic + + then: + properties: + clocks: + maxItems: 1 + + power-domains: + maxItems: 1 + + resets: + maxItems: 1 + + required: + - clocks + - power-domains + - resets + additionalProperties: false examples: From dd46337ca69662b6912bc230d393c4261d126b8f Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 30 Jun 2022 05:02:39 -0500 Subject: [PATCH 249/298] irqchip/sifive-plic: Add support for Renesas RZ/Five SoC The Renesas RZ/Five SoC has a RISC-V AX45MP AndesCore with NCEPLIC100. The NCEPLIC100 supports both edge-triggered and level-triggered interrupts. In case of edge-triggered interrupts NCEPLIC100 ignores the next interrupt edge until the previous completion message has been received and NCEPLIC100 doesn't support pending interrupt counter, hence losing the interrupts if not acknowledged in time. So the workaround for edge-triggered interrupts to be handled correctly and without losing is that it needs to be acknowledged first and then handler must be run so that we don't miss on the next edge-triggered interrupt. This patch adds a new compatible string for NCEPLIC100 (from Andes Technology) interrupt controller found on Renesas RZ/Five SoC and adds quirk bits to priv structure and implements PLIC_QUIRK_EDGE_INTERRUPT quirk to change the interrupt flow. Suggested-by: Marc Zyngier Signed-off-by: Lad Prabhakar Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220630100241.35233-3-samuel@sholland.org --- drivers/irqchip/irq-sifive-plic.c | 78 +++++++++++++++++++++++++++++-- 1 file changed, 74 insertions(+), 4 deletions(-) diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c index bb87e4c3b88e..90e44367bee9 100644 --- a/drivers/irqchip/irq-sifive-plic.c +++ b/drivers/irqchip/irq-sifive-plic.c @@ -60,10 +60,13 @@ #define PLIC_DISABLE_THRESHOLD 0x7 #define PLIC_ENABLE_THRESHOLD 0 +#define PLIC_QUIRK_EDGE_INTERRUPT 0 + struct plic_priv { struct cpumask lmask; struct irq_domain *irqdomain; void __iomem *regs; + unsigned long plic_quirks; }; struct plic_handler { @@ -81,6 +84,8 @@ static int plic_parent_irq __ro_after_init; static bool plic_cpuhp_setup_done __ro_after_init; static DEFINE_PER_CPU(struct plic_handler, plic_handlers); +static int plic_irq_set_type(struct irq_data *d, unsigned int type); + static void __plic_toggle(void __iomem *enable_base, int hwirq, int enable) { u32 __iomem *reg = enable_base + (hwirq / 32) * sizeof(u32); @@ -176,6 +181,17 @@ static void plic_irq_eoi(struct irq_data *d) } } +static struct irq_chip plic_edge_chip = { + .name = "SiFive PLIC", + .irq_ack = plic_irq_eoi, + .irq_mask = plic_irq_mask, + .irq_unmask = plic_irq_unmask, +#ifdef CONFIG_SMP + .irq_set_affinity = plic_set_affinity, +#endif + .irq_set_type = plic_irq_set_type, +}; + static struct irq_chip plic_chip = { .name = "SiFive PLIC", .irq_mask = plic_irq_mask, @@ -184,8 +200,32 @@ static struct irq_chip plic_chip = { #ifdef CONFIG_SMP .irq_set_affinity = plic_set_affinity, #endif + .irq_set_type = plic_irq_set_type, }; +static int plic_irq_set_type(struct irq_data *d, unsigned int type) +{ + struct plic_priv *priv = irq_data_get_irq_chip_data(d); + + if (!test_bit(PLIC_QUIRK_EDGE_INTERRUPT, &priv->plic_quirks)) + return IRQ_SET_MASK_OK_NOCOPY; + + switch (type) { + case IRQ_TYPE_EDGE_RISING: + irq_set_chip_handler_name_locked(d, &plic_edge_chip, + handle_edge_irq, NULL); + break; + case IRQ_TYPE_LEVEL_HIGH: + irq_set_chip_handler_name_locked(d, &plic_chip, + handle_fasteoi_irq, NULL); + break; + default: + return -EINVAL; + } + + return IRQ_SET_MASK_OK; +} + static int plic_irqdomain_map(struct irq_domain *d, unsigned int irq, irq_hw_number_t hwirq) { @@ -198,6 +238,19 @@ static int plic_irqdomain_map(struct irq_domain *d, unsigned int irq, return 0; } +static int plic_irq_domain_translate(struct irq_domain *d, + struct irq_fwspec *fwspec, + unsigned long *hwirq, + unsigned int *type) +{ + struct plic_priv *priv = d->host_data; + + if (test_bit(PLIC_QUIRK_EDGE_INTERRUPT, &priv->plic_quirks)) + return irq_domain_translate_twocell(d, fwspec, hwirq, type); + + return irq_domain_translate_onecell(d, fwspec, hwirq, type); +} + static int plic_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, unsigned int nr_irqs, void *arg) { @@ -206,7 +259,7 @@ static int plic_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, unsigned int type; struct irq_fwspec *fwspec = arg; - ret = irq_domain_translate_onecell(domain, fwspec, &hwirq, &type); + ret = plic_irq_domain_translate(domain, fwspec, &hwirq, &type); if (ret) return ret; @@ -220,7 +273,7 @@ static int plic_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, } static const struct irq_domain_ops plic_irqdomain_ops = { - .translate = irq_domain_translate_onecell, + .translate = plic_irq_domain_translate, .alloc = plic_irq_domain_alloc, .free = irq_domain_free_irqs_top, }; @@ -281,8 +334,9 @@ static int plic_starting_cpu(unsigned int cpu) return 0; } -static int __init plic_init(struct device_node *node, - struct device_node *parent) +static int __init __plic_init(struct device_node *node, + struct device_node *parent, + unsigned long plic_quirks) { int error = 0, nr_contexts, nr_handlers = 0, i; u32 nr_irqs; @@ -293,6 +347,8 @@ static int __init plic_init(struct device_node *node, if (!priv) return -ENOMEM; + priv->plic_quirks = plic_quirks; + priv->regs = of_iomap(node, 0); if (WARN_ON(!priv->regs)) { error = -EIO; @@ -410,6 +466,20 @@ out_free_priv: return error; } +static int __init plic_init(struct device_node *node, + struct device_node *parent) +{ + return __plic_init(node, parent, 0); +} + IRQCHIP_DECLARE(sifive_plic, "sifive,plic-1.0.0", plic_init); IRQCHIP_DECLARE(riscv_plic0, "riscv,plic0", plic_init); /* for legacy systems */ IRQCHIP_DECLARE(thead_c900_plic, "thead,c900-plic", plic_init); /* for firmware driver */ + +static int __init plic_edge_init(struct device_node *node, + struct device_node *parent) +{ + return __plic_init(node, parent, BIT(PLIC_QUIRK_EDGE_INTERRUPT)); +} + +IRQCHIP_DECLARE(andestech_nceplic100, "andestech,nceplic100", plic_edge_init); From d60df7fd225af37e31859a9badb0cca73f7aa12d Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Thu, 30 Jun 2022 05:02:40 -0500 Subject: [PATCH 250/298] dt-bindings: interrupt-controller: Require trigger type for T-HEAD PLIC MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The RISC-V PLIC specification unfortunately allows PLIC implementations to ignore edges seen while an edge-triggered interrupt is being handled: Depending on the design of the device and the interrupt handler, in between sending an interrupt request and receiving notice of its handler’s completion, the gateway might either ignore additional matching edges or increment a counter of pending interrupts. Like the NCEPLIC100, the T-HEAD C900 PLIC also has this behavior. Thus it also needs to inform software about each interrupt's trigger type, so the driver can use the right interrupt flow. Reviewed-by: Lad Prabhakar Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220630100241.35233-4-samuel@sholland.org --- .../bindings/interrupt-controller/sifive,plic-1.0.0.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml b/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml index cd2b8bcaec3b..92e0f8c3eff2 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml @@ -33,7 +33,7 @@ description: it is not included in the interrupt specifier. In the second case, software needs to know the trigger type, so it can reorder the interrupt flow to avoid missing interrupts. This special handling is needed by at least the Renesas - RZ/Five SoC (AX45MP AndesCore with a NCEPLIC100). + RZ/Five SoC (AX45MP AndesCore with a NCEPLIC100) and the T-HEAD C900 PLIC. While the RISC-V ISA doesn't specify a memory layout for the PLIC, the "sifive,plic-1.0.0" device is a concrete implementation of the PLIC that @@ -112,6 +112,7 @@ allOf: contains: enum: - andestech,nceplic100 + - thead,c900-plic then: properties: From 5873ba559101fa37ad9764e79856f71bf54021aa Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Thu, 30 Jun 2022 05:02:41 -0500 Subject: [PATCH 251/298] irqchip/sifive-plic: Fix T-HEAD PLIC edge trigger handling The T-HEAD PLIC ignores additional edges seen while an edge-triggered interrupt is being handled. Because of this behavior, the driver needs to complete edge-triggered interrupts in the .irq_ack callback before handling them, instead of in the .irq_eoi callback afterward. Otherwise, it could miss some interrupts. Reviewed-by: Lad Prabhakar Signed-off-by: Samuel Holland Reviewed-by: Guo Ren Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220630100241.35233-5-samuel@sholland.org --- drivers/irqchip/irq-sifive-plic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c index 90e44367bee9..b3a36dca7f1b 100644 --- a/drivers/irqchip/irq-sifive-plic.c +++ b/drivers/irqchip/irq-sifive-plic.c @@ -474,7 +474,6 @@ static int __init plic_init(struct device_node *node, IRQCHIP_DECLARE(sifive_plic, "sifive,plic-1.0.0", plic_init); IRQCHIP_DECLARE(riscv_plic0, "riscv,plic0", plic_init); /* for legacy systems */ -IRQCHIP_DECLARE(thead_c900_plic, "thead,c900-plic", plic_init); /* for firmware driver */ static int __init plic_edge_init(struct device_node *node, struct device_node *parent) @@ -483,3 +482,4 @@ static int __init plic_edge_init(struct device_node *node, } IRQCHIP_DECLARE(andestech_nceplic100, "andestech,nceplic100", plic_edge_init); +IRQCHIP_DECLARE(thead_c900_plic, "thead,c900-plic", plic_edge_init); From 95001b756467ecc9f5973eb5e74e97699d9bbdf1 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Thu, 12 May 2022 18:05:44 +0200 Subject: [PATCH 252/298] genirq: Don't return error on missing optional irq_request_resources() Function irq_chip::irq_request_resources() is reported as optional in the declaration of struct irq_chip. If the parent irq_chip does not implement it, we should ignore it and return. Don't return error if the functions is missing. Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220512160544.13561-1-antonio.borneo@foss.st.com --- kernel/irq/chip.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 886789dcee43..c19040530789 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -1516,7 +1516,8 @@ int irq_chip_request_resources_parent(struct irq_data *data) if (data->chip->irq_request_resources) return data->chip->irq_request_resources(data); - return -ENOSYS; + /* no error on missing optional irq_chip::irq_request_resources */ + return 0; } EXPORT_SYMBOL_GPL(irq_chip_request_resources_parent); From 3e17683ff4a870ed99e989425bc976a944978711 Mon Sep 17 00:00:00 2001 From: Ludovic Barre Date: Mon, 6 Jun 2022 18:27:52 +0200 Subject: [PATCH 253/298] irqchip/stm32-exti: Fix irq_set_affinity return value When there is no parent, there is no specific action to do in stm32-exti irqchip. In such case, it's incorrect returning an error. Let irq_set_affinity to return IRQ_SET_MASK_OK_DONE when there is no parent. Signed-off-by: Ludovic Barre Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-2-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index 9d18f47040eb..10c9c742c216 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -614,7 +614,7 @@ static int stm32_exti_h_set_affinity(struct irq_data *d, if (d->parent_data->chip) return irq_chip_set_affinity_parent(d, dest, force); - return -EINVAL; + return IRQ_SET_MASK_OK_DONE; } static int __maybe_unused stm32_exti_h_suspend(void) From f8b3eb4245113c8a9156d5db8e80c6134127bcc1 Mon Sep 17 00:00:00 2001 From: Loic Pallardy Date: Mon, 6 Jun 2022 18:27:53 +0200 Subject: [PATCH 254/298] irqchip/stm32-exti: Fix irq_mask/irq_unmask for direct events The driver has to mask/unmask the corresponding flag in the Interrupt Mask Register (IMR). This is already done for configurable event, while direct events only forward the mask/unmask request to the parent. Use the existing stm32_exti_h_mask()/stm32_exti_h_unmask() for direct events too. Signed-off-by: Loic Pallardy Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-3-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index 10c9c742c216..1145f064faa8 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -691,8 +691,8 @@ static struct irq_chip stm32_exti_h_chip_direct = { .name = "stm32-exti-h-direct", .irq_eoi = irq_chip_eoi_parent, .irq_ack = irq_chip_ack_parent, - .irq_mask = irq_chip_mask_parent, - .irq_unmask = irq_chip_unmask_parent, + .irq_mask = stm32_exti_h_mask, + .irq_unmask = stm32_exti_h_unmask, .irq_retrigger = irq_chip_retrigger_hierarchy, .irq_set_type = irq_chip_set_type_parent, .irq_set_wake = stm32_exti_h_set_wake, From c16ae609214e835692c33b1a090b5a15bf1b9e7e Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Jun 2022 18:27:54 +0200 Subject: [PATCH 255/298] irqchip/stm32-exti: Prevent illegal read due to unbounded DT value The value hwirq is received from DT. If it exceeds the maximum valid value it causes the code to address unexisting irq chips reading outside the array boundary. Check the value of hwirq before using it. Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-4-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index 1145f064faa8..e2722e499ae5 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -713,6 +713,9 @@ static int stm32_exti_h_domain_alloc(struct irq_domain *dm, int bank; hwirq = fwspec->param[0]; + if (hwirq >= host_data->drv_data->bank_nr * IRQS_PER_BANK) + return -EINVAL; + bank = hwirq / IRQS_PER_BANK; chip_data = &host_data->chips_data[bank]; From b38040f0167d25092e813c8d1a70cf2708c1720b Mon Sep 17 00:00:00 2001 From: Alexandre Torgue Date: Mon, 6 Jun 2022 18:27:55 +0200 Subject: [PATCH 256/298] irqchip/stm32-exti: Tag emr register as undefined for stm32mp15 The reference manual RM0436 of stm32mp15 till version v4.0 was erroneously reporting the Event Mask Registers (EMR) for the Cortex-A CPUs. These registers have been removed from v5.0 of the manual and the corresponding offsets have been marked as 'Reserved'. Prevent accessing these reserved addresses by tagging the EMR offsets as UNDEF_REG and modifying the code to handle this case. Signed-off-by: Alexandre Torgue Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-5-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index e2722e499ae5..e8fa91bda4ba 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -132,7 +132,7 @@ static const struct stm32_exti_drv_data stm32h7xx_drv_data = { static const struct stm32_exti_bank stm32mp1_exti_b1 = { .imr_ofst = 0x80, - .emr_ofst = 0x84, + .emr_ofst = UNDEF_REG, .rtsr_ofst = 0x00, .ftsr_ofst = 0x04, .swier_ofst = 0x08, @@ -142,7 +142,7 @@ static const struct stm32_exti_bank stm32mp1_exti_b1 = { static const struct stm32_exti_bank stm32mp1_exti_b2 = { .imr_ofst = 0x90, - .emr_ofst = 0x94, + .emr_ofst = UNDEF_REG, .rtsr_ofst = 0x20, .ftsr_ofst = 0x24, .swier_ofst = 0x28, @@ -152,7 +152,7 @@ static const struct stm32_exti_bank stm32mp1_exti_b2 = { static const struct stm32_exti_bank stm32mp1_exti_b3 = { .imr_ofst = 0xA0, - .emr_ofst = 0xA4, + .emr_ofst = UNDEF_REG, .rtsr_ofst = 0x40, .ftsr_ofst = 0x44, .swier_ofst = 0x48, @@ -795,7 +795,8 @@ stm32_exti_chip_data *stm32_exti_chip_init(struct stm32_exti_host_data *h_data, * clear registers to avoid residue */ writel_relaxed(0, base + stm32_bank->imr_ofst); - writel_relaxed(0, base + stm32_bank->emr_ofst); + if (stm32_bank->emr_ofst != UNDEF_REG) + writel_relaxed(0, base + stm32_bank->emr_ofst); pr_info("%pOF: bank%d\n", node, bank_idx); From ce4ef8f9f2abcf104a5417225cbfe3560e779093 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Jun 2022 18:27:56 +0200 Subject: [PATCH 257/298] irqchip/stm32-exti: Read event trigger type from event_trg register The flag reporting whether an event is 'direct' or 'configurable' is available in the read-only registers EVENT_TRG. Drop this redundant information from the struct stm32_desc_irq and use the proper bit from EVENT_TRG register. On armv7a this patch reduces by 3% the size of the driver, from text data bss dec hex filename 7233 424 4 7661 1ded irq-stm32-exti.o to 6977 424 4 7405 1ced irq-stm32-exti.o Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-6-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 180 ++++++++++++++++--------------- 1 file changed, 96 insertions(+), 84 deletions(-) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index e8fa91bda4ba..045589627e67 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -34,6 +34,7 @@ struct stm32_exti_bank { u32 swier_ofst; u32 rpr_ofst; u32 fpr_ofst; + u32 trg_ofst; }; #define UNDEF_REG ~0 @@ -41,7 +42,6 @@ struct stm32_exti_bank { struct stm32_desc_irq { u32 exti; u32 irq_parent; - struct irq_chip *chip; }; struct stm32_exti_drv_data { @@ -78,6 +78,7 @@ static const struct stm32_exti_bank stm32f4xx_exti_b1 = { .swier_ofst = 0x10, .rpr_ofst = 0x14, .fpr_ofst = UNDEF_REG, + .trg_ofst = UNDEF_REG, }; static const struct stm32_exti_bank *stm32f4xx_exti_banks[] = { @@ -97,6 +98,7 @@ static const struct stm32_exti_bank stm32h7xx_exti_b1 = { .swier_ofst = 0x08, .rpr_ofst = 0x88, .fpr_ofst = UNDEF_REG, + .trg_ofst = UNDEF_REG, }; static const struct stm32_exti_bank stm32h7xx_exti_b2 = { @@ -107,6 +109,7 @@ static const struct stm32_exti_bank stm32h7xx_exti_b2 = { .swier_ofst = 0x28, .rpr_ofst = 0x98, .fpr_ofst = UNDEF_REG, + .trg_ofst = UNDEF_REG, }; static const struct stm32_exti_bank stm32h7xx_exti_b3 = { @@ -117,6 +120,7 @@ static const struct stm32_exti_bank stm32h7xx_exti_b3 = { .swier_ofst = 0x48, .rpr_ofst = 0xA8, .fpr_ofst = UNDEF_REG, + .trg_ofst = UNDEF_REG, }; static const struct stm32_exti_bank *stm32h7xx_exti_banks[] = { @@ -138,6 +142,7 @@ static const struct stm32_exti_bank stm32mp1_exti_b1 = { .swier_ofst = 0x08, .rpr_ofst = 0x0C, .fpr_ofst = 0x10, + .trg_ofst = 0x3EC, }; static const struct stm32_exti_bank stm32mp1_exti_b2 = { @@ -148,6 +153,7 @@ static const struct stm32_exti_bank stm32mp1_exti_b2 = { .swier_ofst = 0x28, .rpr_ofst = 0x2C, .fpr_ofst = 0x30, + .trg_ofst = 0x3E8, }; static const struct stm32_exti_bank stm32mp1_exti_b3 = { @@ -158,6 +164,7 @@ static const struct stm32_exti_bank stm32mp1_exti_b3 = { .swier_ofst = 0x48, .rpr_ofst = 0x4C, .fpr_ofst = 0x50, + .trg_ofst = 0x3E4, }; static const struct stm32_exti_bank *stm32mp1_exti_banks[] = { @@ -170,90 +177,90 @@ static struct irq_chip stm32_exti_h_chip; static struct irq_chip stm32_exti_h_chip_direct; static const struct stm32_desc_irq stm32mp1_desc_irq[] = { - { .exti = 0, .irq_parent = 6, .chip = &stm32_exti_h_chip }, - { .exti = 1, .irq_parent = 7, .chip = &stm32_exti_h_chip }, - { .exti = 2, .irq_parent = 8, .chip = &stm32_exti_h_chip }, - { .exti = 3, .irq_parent = 9, .chip = &stm32_exti_h_chip }, - { .exti = 4, .irq_parent = 10, .chip = &stm32_exti_h_chip }, - { .exti = 5, .irq_parent = 23, .chip = &stm32_exti_h_chip }, - { .exti = 6, .irq_parent = 64, .chip = &stm32_exti_h_chip }, - { .exti = 7, .irq_parent = 65, .chip = &stm32_exti_h_chip }, - { .exti = 8, .irq_parent = 66, .chip = &stm32_exti_h_chip }, - { .exti = 9, .irq_parent = 67, .chip = &stm32_exti_h_chip }, - { .exti = 10, .irq_parent = 40, .chip = &stm32_exti_h_chip }, - { .exti = 11, .irq_parent = 42, .chip = &stm32_exti_h_chip }, - { .exti = 12, .irq_parent = 76, .chip = &stm32_exti_h_chip }, - { .exti = 13, .irq_parent = 77, .chip = &stm32_exti_h_chip }, - { .exti = 14, .irq_parent = 121, .chip = &stm32_exti_h_chip }, - { .exti = 15, .irq_parent = 127, .chip = &stm32_exti_h_chip }, - { .exti = 16, .irq_parent = 1, .chip = &stm32_exti_h_chip }, - { .exti = 19, .irq_parent = 3, .chip = &stm32_exti_h_chip_direct }, - { .exti = 21, .irq_parent = 31, .chip = &stm32_exti_h_chip_direct }, - { .exti = 22, .irq_parent = 33, .chip = &stm32_exti_h_chip_direct }, - { .exti = 23, .irq_parent = 72, .chip = &stm32_exti_h_chip_direct }, - { .exti = 24, .irq_parent = 95, .chip = &stm32_exti_h_chip_direct }, - { .exti = 25, .irq_parent = 107, .chip = &stm32_exti_h_chip_direct }, - { .exti = 26, .irq_parent = 37, .chip = &stm32_exti_h_chip_direct }, - { .exti = 27, .irq_parent = 38, .chip = &stm32_exti_h_chip_direct }, - { .exti = 28, .irq_parent = 39, .chip = &stm32_exti_h_chip_direct }, - { .exti = 29, .irq_parent = 71, .chip = &stm32_exti_h_chip_direct }, - { .exti = 30, .irq_parent = 52, .chip = &stm32_exti_h_chip_direct }, - { .exti = 31, .irq_parent = 53, .chip = &stm32_exti_h_chip_direct }, - { .exti = 32, .irq_parent = 82, .chip = &stm32_exti_h_chip_direct }, - { .exti = 33, .irq_parent = 83, .chip = &stm32_exti_h_chip_direct }, - { .exti = 47, .irq_parent = 93, .chip = &stm32_exti_h_chip_direct }, - { .exti = 48, .irq_parent = 138, .chip = &stm32_exti_h_chip_direct }, - { .exti = 50, .irq_parent = 139, .chip = &stm32_exti_h_chip_direct }, - { .exti = 52, .irq_parent = 140, .chip = &stm32_exti_h_chip_direct }, - { .exti = 53, .irq_parent = 141, .chip = &stm32_exti_h_chip_direct }, - { .exti = 54, .irq_parent = 135, .chip = &stm32_exti_h_chip_direct }, - { .exti = 61, .irq_parent = 100, .chip = &stm32_exti_h_chip_direct }, - { .exti = 65, .irq_parent = 144, .chip = &stm32_exti_h_chip }, - { .exti = 68, .irq_parent = 143, .chip = &stm32_exti_h_chip }, - { .exti = 70, .irq_parent = 62, .chip = &stm32_exti_h_chip_direct }, - { .exti = 73, .irq_parent = 129, .chip = &stm32_exti_h_chip }, + { .exti = 0, .irq_parent = 6 }, + { .exti = 1, .irq_parent = 7 }, + { .exti = 2, .irq_parent = 8 }, + { .exti = 3, .irq_parent = 9 }, + { .exti = 4, .irq_parent = 10 }, + { .exti = 5, .irq_parent = 23 }, + { .exti = 6, .irq_parent = 64 }, + { .exti = 7, .irq_parent = 65 }, + { .exti = 8, .irq_parent = 66 }, + { .exti = 9, .irq_parent = 67 }, + { .exti = 10, .irq_parent = 40 }, + { .exti = 11, .irq_parent = 42 }, + { .exti = 12, .irq_parent = 76 }, + { .exti = 13, .irq_parent = 77 }, + { .exti = 14, .irq_parent = 121 }, + { .exti = 15, .irq_parent = 127 }, + { .exti = 16, .irq_parent = 1 }, + { .exti = 19, .irq_parent = 3 }, + { .exti = 21, .irq_parent = 31 }, + { .exti = 22, .irq_parent = 33 }, + { .exti = 23, .irq_parent = 72 }, + { .exti = 24, .irq_parent = 95 }, + { .exti = 25, .irq_parent = 107 }, + { .exti = 26, .irq_parent = 37 }, + { .exti = 27, .irq_parent = 38 }, + { .exti = 28, .irq_parent = 39 }, + { .exti = 29, .irq_parent = 71 }, + { .exti = 30, .irq_parent = 52 }, + { .exti = 31, .irq_parent = 53 }, + { .exti = 32, .irq_parent = 82 }, + { .exti = 33, .irq_parent = 83 }, + { .exti = 47, .irq_parent = 93 }, + { .exti = 48, .irq_parent = 138 }, + { .exti = 50, .irq_parent = 139 }, + { .exti = 52, .irq_parent = 140 }, + { .exti = 53, .irq_parent = 141 }, + { .exti = 54, .irq_parent = 135 }, + { .exti = 61, .irq_parent = 100 }, + { .exti = 65, .irq_parent = 144 }, + { .exti = 68, .irq_parent = 143 }, + { .exti = 70, .irq_parent = 62 }, + { .exti = 73, .irq_parent = 129 }, }; static const struct stm32_desc_irq stm32mp13_desc_irq[] = { - { .exti = 0, .irq_parent = 6, .chip = &stm32_exti_h_chip }, - { .exti = 1, .irq_parent = 7, .chip = &stm32_exti_h_chip }, - { .exti = 2, .irq_parent = 8, .chip = &stm32_exti_h_chip }, - { .exti = 3, .irq_parent = 9, .chip = &stm32_exti_h_chip }, - { .exti = 4, .irq_parent = 10, .chip = &stm32_exti_h_chip }, - { .exti = 5, .irq_parent = 24, .chip = &stm32_exti_h_chip }, - { .exti = 6, .irq_parent = 65, .chip = &stm32_exti_h_chip }, - { .exti = 7, .irq_parent = 66, .chip = &stm32_exti_h_chip }, - { .exti = 8, .irq_parent = 67, .chip = &stm32_exti_h_chip }, - { .exti = 9, .irq_parent = 68, .chip = &stm32_exti_h_chip }, - { .exti = 10, .irq_parent = 41, .chip = &stm32_exti_h_chip }, - { .exti = 11, .irq_parent = 43, .chip = &stm32_exti_h_chip }, - { .exti = 12, .irq_parent = 77, .chip = &stm32_exti_h_chip }, - { .exti = 13, .irq_parent = 78, .chip = &stm32_exti_h_chip }, - { .exti = 14, .irq_parent = 106, .chip = &stm32_exti_h_chip }, - { .exti = 15, .irq_parent = 109, .chip = &stm32_exti_h_chip }, - { .exti = 16, .irq_parent = 1, .chip = &stm32_exti_h_chip }, - { .exti = 19, .irq_parent = 3, .chip = &stm32_exti_h_chip_direct }, - { .exti = 21, .irq_parent = 32, .chip = &stm32_exti_h_chip_direct }, - { .exti = 22, .irq_parent = 34, .chip = &stm32_exti_h_chip_direct }, - { .exti = 23, .irq_parent = 73, .chip = &stm32_exti_h_chip_direct }, - { .exti = 24, .irq_parent = 93, .chip = &stm32_exti_h_chip_direct }, - { .exti = 25, .irq_parent = 114, .chip = &stm32_exti_h_chip_direct }, - { .exti = 26, .irq_parent = 38, .chip = &stm32_exti_h_chip_direct }, - { .exti = 27, .irq_parent = 39, .chip = &stm32_exti_h_chip_direct }, - { .exti = 28, .irq_parent = 40, .chip = &stm32_exti_h_chip_direct }, - { .exti = 29, .irq_parent = 72, .chip = &stm32_exti_h_chip_direct }, - { .exti = 30, .irq_parent = 53, .chip = &stm32_exti_h_chip_direct }, - { .exti = 31, .irq_parent = 54, .chip = &stm32_exti_h_chip_direct }, - { .exti = 32, .irq_parent = 83, .chip = &stm32_exti_h_chip_direct }, - { .exti = 33, .irq_parent = 84, .chip = &stm32_exti_h_chip_direct }, - { .exti = 44, .irq_parent = 96, .chip = &stm32_exti_h_chip_direct }, - { .exti = 47, .irq_parent = 92, .chip = &stm32_exti_h_chip_direct }, - { .exti = 48, .irq_parent = 116, .chip = &stm32_exti_h_chip_direct }, - { .exti = 50, .irq_parent = 117, .chip = &stm32_exti_h_chip_direct }, - { .exti = 52, .irq_parent = 118, .chip = &stm32_exti_h_chip_direct }, - { .exti = 53, .irq_parent = 119, .chip = &stm32_exti_h_chip_direct }, - { .exti = 68, .irq_parent = 63, .chip = &stm32_exti_h_chip_direct }, - { .exti = 70, .irq_parent = 98, .chip = &stm32_exti_h_chip_direct }, + { .exti = 0, .irq_parent = 6 }, + { .exti = 1, .irq_parent = 7 }, + { .exti = 2, .irq_parent = 8 }, + { .exti = 3, .irq_parent = 9 }, + { .exti = 4, .irq_parent = 10 }, + { .exti = 5, .irq_parent = 24 }, + { .exti = 6, .irq_parent = 65 }, + { .exti = 7, .irq_parent = 66 }, + { .exti = 8, .irq_parent = 67 }, + { .exti = 9, .irq_parent = 68 }, + { .exti = 10, .irq_parent = 41 }, + { .exti = 11, .irq_parent = 43 }, + { .exti = 12, .irq_parent = 77 }, + { .exti = 13, .irq_parent = 78 }, + { .exti = 14, .irq_parent = 106 }, + { .exti = 15, .irq_parent = 109 }, + { .exti = 16, .irq_parent = 1 }, + { .exti = 19, .irq_parent = 3 }, + { .exti = 21, .irq_parent = 32 }, + { .exti = 22, .irq_parent = 34 }, + { .exti = 23, .irq_parent = 73 }, + { .exti = 24, .irq_parent = 93 }, + { .exti = 25, .irq_parent = 114 }, + { .exti = 26, .irq_parent = 38 }, + { .exti = 27, .irq_parent = 39 }, + { .exti = 28, .irq_parent = 40 }, + { .exti = 29, .irq_parent = 72 }, + { .exti = 30, .irq_parent = 53 }, + { .exti = 31, .irq_parent = 54 }, + { .exti = 32, .irq_parent = 83 }, + { .exti = 33, .irq_parent = 84 }, + { .exti = 44, .irq_parent = 96 }, + { .exti = 47, .irq_parent = 92 }, + { .exti = 48, .irq_parent = 116 }, + { .exti = 50, .irq_parent = 117 }, + { .exti = 52, .irq_parent = 118 }, + { .exti = 53, .irq_parent = 119 }, + { .exti = 68, .irq_parent = 63 }, + { .exti = 70, .irq_parent = 98 }, }; static const struct stm32_exti_drv_data stm32mp1_drv_data = { @@ -711,6 +718,8 @@ static int stm32_exti_h_domain_alloc(struct irq_domain *dm, struct irq_fwspec p_fwspec; irq_hw_number_t hwirq; int bank; + u32 event_trg; + struct irq_chip *chip; hwirq = fwspec->param[0]; if (hwirq >= host_data->drv_data->bank_nr * IRQS_PER_BANK) @@ -724,8 +733,11 @@ static int stm32_exti_h_domain_alloc(struct irq_domain *dm, if (!desc) return -EINVAL; - irq_domain_set_hwirq_and_chip(dm, virq, hwirq, desc->chip, - chip_data); + event_trg = readl_relaxed(host_data->base + chip_data->reg_bank->trg_ofst); + chip = (event_trg & BIT(hwirq % IRQS_PER_BANK)) ? + &stm32_exti_h_chip : &stm32_exti_h_chip_direct; + + irq_domain_set_hwirq_and_chip(dm, virq, hwirq, chip, chip_data); if (desc->irq_parent) { p_fwspec.fwnode = dm->parent->fwnode; p_fwspec.param_count = 3; From c297493336b7bc0c12ced484a9e61d04ec2d9403 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Jun 2022 18:27:57 +0200 Subject: [PATCH 258/298] irqchip/stm32-exti: Simplify irq description table Having removed the event trigger type from struct stm32_desc_irq makes worthless keep using a struct. Replace the struct by a single dimension array and use 8 bit type to reduce the overal memory footprint. On armv7a this patch reduces by 7% the size of the driver, from text data bss dec hex filename 6977 424 4 7405 1ced irq-stm32-exti.o to 6449 424 4 6877 1add irq-stm32-exti.o Signed-off-by: Antonio Borneo Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220606162757.415354-7-antonio.borneo@foss.st.com --- drivers/irqchip/irq-stm32-exti.c | 220 ++++++++++++++----------------- 1 file changed, 101 insertions(+), 119 deletions(-) diff --git a/drivers/irqchip/irq-stm32-exti.c b/drivers/irqchip/irq-stm32-exti.c index 045589627e67..a73763d475f0 100644 --- a/drivers/irqchip/irq-stm32-exti.c +++ b/drivers/irqchip/irq-stm32-exti.c @@ -39,16 +39,10 @@ struct stm32_exti_bank { #define UNDEF_REG ~0 -struct stm32_desc_irq { - u32 exti; - u32 irq_parent; -}; - struct stm32_exti_drv_data { const struct stm32_exti_bank **exti_banks; - const struct stm32_desc_irq *desc_irqs; + const u8 *desc_irqs; u32 bank_nr; - u32 irq_nr; }; struct stm32_exti_chip_data { @@ -176,126 +170,114 @@ static const struct stm32_exti_bank *stm32mp1_exti_banks[] = { static struct irq_chip stm32_exti_h_chip; static struct irq_chip stm32_exti_h_chip_direct; -static const struct stm32_desc_irq stm32mp1_desc_irq[] = { - { .exti = 0, .irq_parent = 6 }, - { .exti = 1, .irq_parent = 7 }, - { .exti = 2, .irq_parent = 8 }, - { .exti = 3, .irq_parent = 9 }, - { .exti = 4, .irq_parent = 10 }, - { .exti = 5, .irq_parent = 23 }, - { .exti = 6, .irq_parent = 64 }, - { .exti = 7, .irq_parent = 65 }, - { .exti = 8, .irq_parent = 66 }, - { .exti = 9, .irq_parent = 67 }, - { .exti = 10, .irq_parent = 40 }, - { .exti = 11, .irq_parent = 42 }, - { .exti = 12, .irq_parent = 76 }, - { .exti = 13, .irq_parent = 77 }, - { .exti = 14, .irq_parent = 121 }, - { .exti = 15, .irq_parent = 127 }, - { .exti = 16, .irq_parent = 1 }, - { .exti = 19, .irq_parent = 3 }, - { .exti = 21, .irq_parent = 31 }, - { .exti = 22, .irq_parent = 33 }, - { .exti = 23, .irq_parent = 72 }, - { .exti = 24, .irq_parent = 95 }, - { .exti = 25, .irq_parent = 107 }, - { .exti = 26, .irq_parent = 37 }, - { .exti = 27, .irq_parent = 38 }, - { .exti = 28, .irq_parent = 39 }, - { .exti = 29, .irq_parent = 71 }, - { .exti = 30, .irq_parent = 52 }, - { .exti = 31, .irq_parent = 53 }, - { .exti = 32, .irq_parent = 82 }, - { .exti = 33, .irq_parent = 83 }, - { .exti = 47, .irq_parent = 93 }, - { .exti = 48, .irq_parent = 138 }, - { .exti = 50, .irq_parent = 139 }, - { .exti = 52, .irq_parent = 140 }, - { .exti = 53, .irq_parent = 141 }, - { .exti = 54, .irq_parent = 135 }, - { .exti = 61, .irq_parent = 100 }, - { .exti = 65, .irq_parent = 144 }, - { .exti = 68, .irq_parent = 143 }, - { .exti = 70, .irq_parent = 62 }, - { .exti = 73, .irq_parent = 129 }, +#define EXTI_INVALID_IRQ U8_MAX +#define STM32MP1_DESC_IRQ_SIZE (ARRAY_SIZE(stm32mp1_exti_banks) * IRQS_PER_BANK) + +static const u8 stm32mp1_desc_irq[] = { + /* default value */ + [0 ... (STM32MP1_DESC_IRQ_SIZE - 1)] = EXTI_INVALID_IRQ, + + [0] = 6, + [1] = 7, + [2] = 8, + [3] = 9, + [4] = 10, + [5] = 23, + [6] = 64, + [7] = 65, + [8] = 66, + [9] = 67, + [10] = 40, + [11] = 42, + [12] = 76, + [13] = 77, + [14] = 121, + [15] = 127, + [16] = 1, + [19] = 3, + [21] = 31, + [22] = 33, + [23] = 72, + [24] = 95, + [25] = 107, + [26] = 37, + [27] = 38, + [28] = 39, + [29] = 71, + [30] = 52, + [31] = 53, + [32] = 82, + [33] = 83, + [47] = 93, + [48] = 138, + [50] = 139, + [52] = 140, + [53] = 141, + [54] = 135, + [61] = 100, + [65] = 144, + [68] = 143, + [70] = 62, + [73] = 129, }; -static const struct stm32_desc_irq stm32mp13_desc_irq[] = { - { .exti = 0, .irq_parent = 6 }, - { .exti = 1, .irq_parent = 7 }, - { .exti = 2, .irq_parent = 8 }, - { .exti = 3, .irq_parent = 9 }, - { .exti = 4, .irq_parent = 10 }, - { .exti = 5, .irq_parent = 24 }, - { .exti = 6, .irq_parent = 65 }, - { .exti = 7, .irq_parent = 66 }, - { .exti = 8, .irq_parent = 67 }, - { .exti = 9, .irq_parent = 68 }, - { .exti = 10, .irq_parent = 41 }, - { .exti = 11, .irq_parent = 43 }, - { .exti = 12, .irq_parent = 77 }, - { .exti = 13, .irq_parent = 78 }, - { .exti = 14, .irq_parent = 106 }, - { .exti = 15, .irq_parent = 109 }, - { .exti = 16, .irq_parent = 1 }, - { .exti = 19, .irq_parent = 3 }, - { .exti = 21, .irq_parent = 32 }, - { .exti = 22, .irq_parent = 34 }, - { .exti = 23, .irq_parent = 73 }, - { .exti = 24, .irq_parent = 93 }, - { .exti = 25, .irq_parent = 114 }, - { .exti = 26, .irq_parent = 38 }, - { .exti = 27, .irq_parent = 39 }, - { .exti = 28, .irq_parent = 40 }, - { .exti = 29, .irq_parent = 72 }, - { .exti = 30, .irq_parent = 53 }, - { .exti = 31, .irq_parent = 54 }, - { .exti = 32, .irq_parent = 83 }, - { .exti = 33, .irq_parent = 84 }, - { .exti = 44, .irq_parent = 96 }, - { .exti = 47, .irq_parent = 92 }, - { .exti = 48, .irq_parent = 116 }, - { .exti = 50, .irq_parent = 117 }, - { .exti = 52, .irq_parent = 118 }, - { .exti = 53, .irq_parent = 119 }, - { .exti = 68, .irq_parent = 63 }, - { .exti = 70, .irq_parent = 98 }, +static const u8 stm32mp13_desc_irq[] = { + /* default value */ + [0 ... (STM32MP1_DESC_IRQ_SIZE - 1)] = EXTI_INVALID_IRQ, + + [0] = 6, + [1] = 7, + [2] = 8, + [3] = 9, + [4] = 10, + [5] = 24, + [6] = 65, + [7] = 66, + [8] = 67, + [9] = 68, + [10] = 41, + [11] = 43, + [12] = 77, + [13] = 78, + [14] = 106, + [15] = 109, + [16] = 1, + [19] = 3, + [21] = 32, + [22] = 34, + [23] = 73, + [24] = 93, + [25] = 114, + [26] = 38, + [27] = 39, + [28] = 40, + [29] = 72, + [30] = 53, + [31] = 54, + [32] = 83, + [33] = 84, + [44] = 96, + [47] = 92, + [48] = 116, + [50] = 117, + [52] = 118, + [53] = 119, + [68] = 63, + [70] = 98, }; static const struct stm32_exti_drv_data stm32mp1_drv_data = { .exti_banks = stm32mp1_exti_banks, .bank_nr = ARRAY_SIZE(stm32mp1_exti_banks), .desc_irqs = stm32mp1_desc_irq, - .irq_nr = ARRAY_SIZE(stm32mp1_desc_irq), }; static const struct stm32_exti_drv_data stm32mp13_drv_data = { .exti_banks = stm32mp1_exti_banks, .bank_nr = ARRAY_SIZE(stm32mp1_exti_banks), .desc_irqs = stm32mp13_desc_irq, - .irq_nr = ARRAY_SIZE(stm32mp13_desc_irq), }; -static const struct -stm32_desc_irq *stm32_exti_get_desc(const struct stm32_exti_drv_data *drv_data, - irq_hw_number_t hwirq) -{ - const struct stm32_desc_irq *desc = NULL; - int i; - - if (!drv_data->desc_irqs) - return NULL; - - for (i = 0; i < drv_data->irq_nr; i++) { - desc = &drv_data->desc_irqs[i]; - if (desc->exti == hwirq) - break; - } - - return desc; -} - static unsigned long stm32_exti_pending(struct irq_chip_generic *gc) { struct stm32_exti_chip_data *chip_data = gc->private; @@ -713,7 +695,7 @@ static int stm32_exti_h_domain_alloc(struct irq_domain *dm, { struct stm32_exti_host_data *host_data = dm->host_data; struct stm32_exti_chip_data *chip_data; - const struct stm32_desc_irq *desc; + u8 desc_irq; struct irq_fwspec *fwspec = data; struct irq_fwspec p_fwspec; irq_hw_number_t hwirq; @@ -728,21 +710,21 @@ static int stm32_exti_h_domain_alloc(struct irq_domain *dm, bank = hwirq / IRQS_PER_BANK; chip_data = &host_data->chips_data[bank]; - - desc = stm32_exti_get_desc(host_data->drv_data, hwirq); - if (!desc) - return -EINVAL; - event_trg = readl_relaxed(host_data->base + chip_data->reg_bank->trg_ofst); chip = (event_trg & BIT(hwirq % IRQS_PER_BANK)) ? &stm32_exti_h_chip : &stm32_exti_h_chip_direct; irq_domain_set_hwirq_and_chip(dm, virq, hwirq, chip, chip_data); - if (desc->irq_parent) { + + if (!host_data->drv_data || !host_data->drv_data->desc_irqs) + return -EINVAL; + + desc_irq = host_data->drv_data->desc_irqs[hwirq]; + if (desc_irq != EXTI_INVALID_IRQ) { p_fwspec.fwnode = dm->parent->fwnode; p_fwspec.param_count = 3; p_fwspec.param[0] = GIC_SPI; - p_fwspec.param[1] = desc->irq_parent; + p_fwspec.param[1] = desc_irq; p_fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH; return irq_domain_alloc_irqs_parent(dm, virq, 1, &p_fwspec); From 8190cc572981f2f13b6ffc26c7cfa7899e5d3ccc Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:49 -0500 Subject: [PATCH 259/298] irqchip/mips-gic: Only register IPI domain when SMP is enabled The MIPS GIC irqchip driver may be selected in a uniprocessor configuration, but it unconditionally registers an IPI domain. Limit the part of the driver dealing with IPIs to only be compiled when GENERIC_IRQ_IPI is enabled, which corresponds to an SMP configuration. Reported-by: kernel test robot Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-2-samuel@sholland.org --- drivers/irqchip/Kconfig | 3 +- drivers/irqchip/irq-mips-gic.c | 80 +++++++++++++++++++++++----------- 2 files changed, 56 insertions(+), 27 deletions(-) diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 1f23a6be7d88..d26a4ff7c99f 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -322,7 +322,8 @@ config KEYSTONE_IRQ config MIPS_GIC bool - select GENERIC_IRQ_IPI + select GENERIC_IRQ_IPI if SMP + select IRQ_DOMAIN_HIERARCHY select MIPS_CM config INGENIC_IRQ diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c index ff89b36267dd..8a9efb6ae587 100644 --- a/drivers/irqchip/irq-mips-gic.c +++ b/drivers/irqchip/irq-mips-gic.c @@ -52,13 +52,15 @@ static DEFINE_PER_CPU_READ_MOSTLY(unsigned long[GIC_MAX_LONGS], pcpu_masks); static DEFINE_SPINLOCK(gic_lock); static struct irq_domain *gic_irq_domain; -static struct irq_domain *gic_ipi_domain; static int gic_shared_intrs; static unsigned int gic_cpu_pin; static unsigned int timer_cpu_pin; static struct irq_chip gic_level_irq_controller, gic_edge_irq_controller; + +#ifdef CONFIG_GENERIC_IRQ_IPI static DECLARE_BITMAP(ipi_resrv, GIC_MAX_INTRS); static DECLARE_BITMAP(ipi_available, GIC_MAX_INTRS); +#endif /* CONFIG_GENERIC_IRQ_IPI */ static struct gic_all_vpes_chip_data { u32 map; @@ -472,9 +474,11 @@ static int gic_irq_domain_map(struct irq_domain *d, unsigned int virq, u32 map; if (hwirq >= GIC_SHARED_HWIRQ_BASE) { +#ifdef CONFIG_GENERIC_IRQ_IPI /* verify that shared irqs don't conflict with an IPI irq */ if (test_bit(GIC_HWIRQ_TO_SHARED(hwirq), ipi_resrv)) return -EBUSY; +#endif /* CONFIG_GENERIC_IRQ_IPI */ err = irq_domain_set_hwirq_and_chip(d, virq, hwirq, &gic_level_irq_controller, @@ -567,6 +571,8 @@ static const struct irq_domain_ops gic_irq_domain_ops = { .map = gic_irq_domain_map, }; +#ifdef CONFIG_GENERIC_IRQ_IPI + static int gic_ipi_domain_xlate(struct irq_domain *d, struct device_node *ctrlr, const u32 *intspec, unsigned int intsize, irq_hw_number_t *out_hwirq, @@ -670,6 +676,48 @@ static const struct irq_domain_ops gic_ipi_domain_ops = { .match = gic_ipi_domain_match, }; +static int gic_register_ipi_domain(struct device_node *node) +{ + struct irq_domain *gic_ipi_domain; + unsigned int v[2], num_ipis; + + gic_ipi_domain = irq_domain_add_hierarchy(gic_irq_domain, + IRQ_DOMAIN_FLAG_IPI_PER_CPU, + GIC_NUM_LOCAL_INTRS + gic_shared_intrs, + node, &gic_ipi_domain_ops, NULL); + if (!gic_ipi_domain) { + pr_err("Failed to add IPI domain"); + return -ENXIO; + } + + irq_domain_update_bus_token(gic_ipi_domain, DOMAIN_BUS_IPI); + + if (node && + !of_property_read_u32_array(node, "mti,reserved-ipi-vectors", v, 2)) { + bitmap_set(ipi_resrv, v[0], v[1]); + } else { + /* + * Reserve 2 interrupts per possible CPU/VP for use as IPIs, + * meeting the requirements of arch/mips SMP. + */ + num_ipis = 2 * num_possible_cpus(); + bitmap_set(ipi_resrv, gic_shared_intrs - num_ipis, num_ipis); + } + + bitmap_copy(ipi_available, ipi_resrv, GIC_MAX_INTRS); + + return 0; +} + +#else /* !CONFIG_GENERIC_IRQ_IPI */ + +static inline int gic_register_ipi_domain(struct device_node *node) +{ + return 0; +} + +#endif /* !CONFIG_GENERIC_IRQ_IPI */ + static int gic_cpu_startup(unsigned int cpu) { /* Enable or disable EIC */ @@ -688,11 +736,12 @@ static int gic_cpu_startup(unsigned int cpu) static int __init gic_of_init(struct device_node *node, struct device_node *parent) { - unsigned int cpu_vec, i, gicconfig, v[2], num_ipis; + unsigned int cpu_vec, i, gicconfig; unsigned long reserved; phys_addr_t gic_base; struct resource res; size_t gic_len; + int ret; /* Find the first available CPU vector. */ i = 0; @@ -780,30 +829,9 @@ static int __init gic_of_init(struct device_node *node, return -ENXIO; } - gic_ipi_domain = irq_domain_add_hierarchy(gic_irq_domain, - IRQ_DOMAIN_FLAG_IPI_PER_CPU, - GIC_NUM_LOCAL_INTRS + gic_shared_intrs, - node, &gic_ipi_domain_ops, NULL); - if (!gic_ipi_domain) { - pr_err("Failed to add IPI domain"); - return -ENXIO; - } - - irq_domain_update_bus_token(gic_ipi_domain, DOMAIN_BUS_IPI); - - if (node && - !of_property_read_u32_array(node, "mti,reserved-ipi-vectors", v, 2)) { - bitmap_set(ipi_resrv, v[0], v[1]); - } else { - /* - * Reserve 2 interrupts per possible CPU/VP for use as IPIs, - * meeting the requirements of arch/mips SMP. - */ - num_ipis = 2 * num_possible_cpus(); - bitmap_set(ipi_resrv, gic_shared_intrs - num_ipis, num_ipis); - } - - bitmap_copy(ipi_available, ipi_resrv, GIC_MAX_INTRS); + ret = gic_register_ipi_domain(node); + if (ret) + return ret; board_bind_eic_interrupt = &gic_bind_eic_interrupt; From 0f5209fee90b4544c58b4278d944425292789967 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:50 -0500 Subject: [PATCH 260/298] genirq: GENERIC_IRQ_IPI depends on SMP The generic IPI code depends on the IRQ affinity mask being allocated and initialized. This will not be the case if SMP is disabled. Fix up the remaining driver that selected GENERIC_IRQ_IPI in a non-SMP config. Reported-by: kernel test robot Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-3-samuel@sholland.org --- drivers/irqchip/Kconfig | 2 +- kernel/irq/Kconfig | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index d26a4ff7c99f..5dd98a81efc8 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -177,7 +177,7 @@ config MADERA_IRQ config IRQ_MIPS_CPU bool select GENERIC_IRQ_CHIP - select GENERIC_IRQ_IPI if SYS_SUPPORTS_MULTITHREADING + select GENERIC_IRQ_IPI if SMP && SYS_SUPPORTS_MULTITHREADING select IRQ_DOMAIN select GENERIC_IRQ_EFFECTIVE_AFF_MASK diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index 10929eda9825..fc760d064a65 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -82,6 +82,7 @@ config IRQ_FASTEOI_HIERARCHY_HANDLERS # Generic IRQ IPI support config GENERIC_IRQ_IPI bool + depends on SMP select IRQ_DOMAIN_HIERARCHY # Generic MSI interrupt support From 0e6c027c035507abc67b356264a12c49f58c946e Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:51 -0500 Subject: [PATCH 261/298] genirq: GENERIC_IRQ_EFFECTIVE_AFF_MASK depends on SMP An IRQ's effective affinity can only be different from its configured affinity if there are multiple CPUs. Make it clear that this option is only meaningful when SMP is enabled. Most of the relevant code in irqdesc.c is already hidden behind CONFIG_SMP anyway. Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-4-samuel@sholland.org --- arch/arm/mach-hisi/Kconfig | 2 +- drivers/irqchip/Kconfig | 14 +++++++------- kernel/irq/Kconfig | 1 + 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/arm/mach-hisi/Kconfig b/arch/arm/mach-hisi/Kconfig index 75cccbd3f05f..7b3440687176 100644 --- a/arch/arm/mach-hisi/Kconfig +++ b/arch/arm/mach-hisi/Kconfig @@ -40,7 +40,7 @@ config ARCH_HIP04 select HAVE_ARM_ARCH_TIMER select MCPM if SMP select MCPM_QUAD_CLUSTER if SMP - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP help Support for Hisilicon HiP04 SoC family diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 5dd98a81efc8..462adac905a6 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -8,7 +8,7 @@ config IRQCHIP config ARM_GIC bool select IRQ_DOMAIN_HIERARCHY - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config ARM_GIC_PM bool @@ -34,7 +34,7 @@ config ARM_GIC_V3 bool select IRQ_DOMAIN_HIERARCHY select PARTITION_PERCPU - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config ARM_GIC_V3_ITS bool @@ -76,7 +76,7 @@ config ARMADA_370_XP_IRQ bool select GENERIC_IRQ_CHIP select PCI_MSI if PCI - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config ALPINE_MSI bool @@ -112,7 +112,7 @@ config BCM6345_L1_IRQ bool select GENERIC_IRQ_CHIP select IRQ_DOMAIN - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config BCM7038_L1_IRQ tristate "Broadcom STB 7038-style L1/L2 interrupt controller driver" @@ -120,7 +120,7 @@ config BCM7038_L1_IRQ default ARCH_BRCMSTB || BMIPS_GENERIC select GENERIC_IRQ_CHIP select IRQ_DOMAIN - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config BCM7120_L2_IRQ tristate "Broadcom STB 7120-style L2 interrupt controller driver" @@ -179,7 +179,7 @@ config IRQ_MIPS_CPU select GENERIC_IRQ_CHIP select GENERIC_IRQ_IPI if SMP && SYS_SUPPORTS_MULTITHREADING select IRQ_DOMAIN - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config CLPS711X_IRQCHIP bool @@ -294,7 +294,7 @@ config VERSATILE_FPGA_IRQ_NR config XTENSA_MX bool select IRQ_DOMAIN - select GENERIC_IRQ_EFFECTIVE_AFF_MASK + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP config XILINX_INTC bool "Xilinx Interrupt Controller IP" diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index fc760d064a65..db3d174c53d4 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -24,6 +24,7 @@ config GENERIC_IRQ_SHOW_LEVEL # Supports effective affinity mask config GENERIC_IRQ_EFFECTIVE_AFF_MASK + depends on SMP bool # Support for delayed migration from interrupt context From 610306306aaa56ab324d03a55138ea611be9e282 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:52 -0500 Subject: [PATCH 262/298] genirq: Drop redundant irq_init_effective_affinity It does exactly the same thing as irq_data_update_effective_affinity. Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-5-samuel@sholland.org --- kernel/irq/manage.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 8c396319d5ac..40fe7806cc8c 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -205,16 +205,8 @@ static void irq_validate_effective_affinity(struct irq_data *data) pr_warn_once("irq_chip %s did not update eff. affinity mask of irq %u\n", chip->name, data->irq); } - -static inline void irq_init_effective_affinity(struct irq_data *data, - const struct cpumask *mask) -{ - cpumask_copy(irq_data_get_effective_affinity_mask(data), mask); -} #else static inline void irq_validate_effective_affinity(struct irq_data *data) { } -static inline void irq_init_effective_affinity(struct irq_data *data, - const struct cpumask *mask) { } #endif int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, @@ -347,7 +339,7 @@ static bool irq_set_affinity_deactivated(struct irq_data *data, return false; cpumask_copy(desc->irq_common_data.affinity, mask); - irq_init_effective_affinity(data, mask); + irq_data_update_effective_affinity(data, mask); irqd_set(data, IRQD_AFFINITY_SET); return true; } From 961343d7822624d0e329ab4167c7e1d02bb53112 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:53 -0500 Subject: [PATCH 263/298] genirq: Refactor accessors to use irq_data_get_affinity_mask A couple of functions directly reference the affinity mask. Route them through irq_data_get_affinity_mask so they will pick up any refactoring done there. Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-6-samuel@sholland.org --- include/linux/irq.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index 505308253d23..69ee4e2f36ce 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -879,16 +879,16 @@ static inline int irq_data_get_node(struct irq_data *d) return irq_common_data_get_node(d->common); } +static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) +{ + return d->common->affinity; +} + static inline struct cpumask *irq_get_affinity_mask(int irq) { struct irq_data *d = irq_get_irq_data(irq); - return d ? d->common->affinity : NULL; -} - -static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) -{ - return d->common->affinity; + return d ? irq_data_get_affinity_mask(d) : NULL; } #ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK @@ -910,7 +910,7 @@ static inline void irq_data_update_effective_affinity(struct irq_data *d, static inline struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) { - return d->common->affinity; + return irq_data_get_affinity_mask(d); } #endif From 073352e951f60946452da358d64841066c3142ff Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:54 -0500 Subject: [PATCH 264/298] genirq: Add and use an irq_data_update_affinity helper Some architectures and irqchip drivers modify the cpumask returned by irq_data_get_affinity_mask, usually by copying in to it. This is problematic for uniprocessor configurations, where the affinity mask should be constant, as it is known at compile time. Add and use a setter for the affinity mask, following the pattern of irq_data_update_effective_affinity. This allows the getter function to return a const cpumask pointer. Signed-off-by: Samuel Holland Reviewed-by: Oleksandr Tyshchenko # Xen bits Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-7-samuel@sholland.org --- arch/alpha/kernel/irq.c | 2 +- arch/ia64/kernel/iosapic.c | 2 +- arch/ia64/kernel/irq.c | 4 ++-- arch/ia64/kernel/msi_ia64.c | 4 ++-- arch/parisc/kernel/irq.c | 2 +- drivers/irqchip/irq-bcm6345-l1.c | 4 ++-- drivers/parisc/iosapic.c | 2 +- drivers/sh/intc/chip.c | 2 +- drivers/xen/events/events_base.c | 7 ++++--- include/linux/irq.h | 6 ++++++ 10 files changed, 21 insertions(+), 14 deletions(-) diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c index f6d2946edbd2..15f2effd6baf 100644 --- a/arch/alpha/kernel/irq.c +++ b/arch/alpha/kernel/irq.c @@ -60,7 +60,7 @@ int irq_select_affinity(unsigned int irq) cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0); last_cpu = cpu; - cpumask_copy(irq_data_get_affinity_mask(data), cpumask_of(cpu)); + irq_data_update_affinity(data, cpumask_of(cpu)); chip->irq_set_affinity(data, cpumask_of(cpu), false); return 0; } diff --git a/arch/ia64/kernel/iosapic.c b/arch/ia64/kernel/iosapic.c index 35adcf89035a..99300850abc1 100644 --- a/arch/ia64/kernel/iosapic.c +++ b/arch/ia64/kernel/iosapic.c @@ -834,7 +834,7 @@ iosapic_unregister_intr (unsigned int gsi) if (iosapic_intr_info[irq].count == 0) { #ifdef CONFIG_SMP /* Clear affinity */ - cpumask_setall(irq_get_affinity_mask(irq)); + irq_data_update_affinity(irq_get_irq_data(irq), cpu_all_mask); #endif /* Clear the interrupt information */ iosapic_intr_info[irq].dest = 0; diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c index ecef17c7c35b..275b9ea58c64 100644 --- a/arch/ia64/kernel/irq.c +++ b/arch/ia64/kernel/irq.c @@ -57,8 +57,8 @@ static char irq_redir [NR_IRQS]; // = { [0 ... NR_IRQS-1] = 1 }; void set_irq_affinity_info (unsigned int irq, int hwid, int redir) { if (irq < NR_IRQS) { - cpumask_copy(irq_get_affinity_mask(irq), - cpumask_of(cpu_logical_id(hwid))); + irq_data_update_affinity(irq_get_irq_data(irq), + cpumask_of(cpu_logical_id(hwid))); irq_redir[irq] = (char) (redir & 0xff); } } diff --git a/arch/ia64/kernel/msi_ia64.c b/arch/ia64/kernel/msi_ia64.c index df5c28f252e3..025e5133c860 100644 --- a/arch/ia64/kernel/msi_ia64.c +++ b/arch/ia64/kernel/msi_ia64.c @@ -37,7 +37,7 @@ static int ia64_set_msi_irq_affinity(struct irq_data *idata, msg.data = data; pci_write_msi_msg(irq, &msg); - cpumask_copy(irq_data_get_affinity_mask(idata), cpumask_of(cpu)); + irq_data_update_affinity(idata, cpumask_of(cpu)); return 0; } @@ -132,7 +132,7 @@ static int dmar_msi_set_affinity(struct irq_data *data, msg.address_lo |= MSI_ADDR_DEST_ID_CPU(cpu_physical_id(cpu)); dmar_msi_write(irq, &msg); - cpumask_copy(irq_data_get_affinity_mask(data), mask); + irq_data_update_affinity(data, mask); return 0; } diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c index 0fe2d79fb123..5ebb1771b4ab 100644 --- a/arch/parisc/kernel/irq.c +++ b/arch/parisc/kernel/irq.c @@ -315,7 +315,7 @@ unsigned long txn_affinity_addr(unsigned int irq, int cpu) { #ifdef CONFIG_SMP struct irq_data *d = irq_get_irq_data(irq); - cpumask_copy(irq_data_get_affinity_mask(d), cpumask_of(cpu)); + irq_data_update_affinity(d, cpumask_of(cpu)); #endif return per_cpu(cpu_data, cpu).txn_addr; diff --git a/drivers/irqchip/irq-bcm6345-l1.c b/drivers/irqchip/irq-bcm6345-l1.c index 142a7431745f..6899e37810a8 100644 --- a/drivers/irqchip/irq-bcm6345-l1.c +++ b/drivers/irqchip/irq-bcm6345-l1.c @@ -216,11 +216,11 @@ static int bcm6345_l1_set_affinity(struct irq_data *d, enabled = intc->cpus[old_cpu]->enable_cache[word] & mask; if (enabled) __bcm6345_l1_mask(d); - cpumask_copy(irq_data_get_affinity_mask(d), dest); + irq_data_update_affinity(d, dest); if (enabled) __bcm6345_l1_unmask(d); } else { - cpumask_copy(irq_data_get_affinity_mask(d), dest); + irq_data_update_affinity(d, dest); } raw_spin_unlock_irqrestore(&intc->lock, flags); diff --git a/drivers/parisc/iosapic.c b/drivers/parisc/iosapic.c index 8a3b0c3a1e92..3a8c98615634 100644 --- a/drivers/parisc/iosapic.c +++ b/drivers/parisc/iosapic.c @@ -677,7 +677,7 @@ static int iosapic_set_affinity_irq(struct irq_data *d, if (dest_cpu < 0) return -1; - cpumask_copy(irq_data_get_affinity_mask(d), cpumask_of(dest_cpu)); + irq_data_update_affinity(d, cpumask_of(dest_cpu)); vi->txn_addr = txn_affinity_addr(d->irq, dest_cpu); spin_lock_irqsave(&iosapic_lock, flags); diff --git a/drivers/sh/intc/chip.c b/drivers/sh/intc/chip.c index 358df7510186..828d81e02b37 100644 --- a/drivers/sh/intc/chip.c +++ b/drivers/sh/intc/chip.c @@ -72,7 +72,7 @@ static int intc_set_affinity(struct irq_data *data, if (!cpumask_intersects(cpumask, cpu_online_mask)) return -1; - cpumask_copy(irq_data_get_affinity_mask(data), cpumask); + irq_data_update_affinity(data, cpumask); return IRQ_SET_MASK_OK_NOCOPY; } diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index 46d9295d9a6e..5e8321f43cbd 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -528,9 +528,10 @@ static void bind_evtchn_to_cpu(evtchn_port_t evtchn, unsigned int cpu, BUG_ON(irq == -1); if (IS_ENABLED(CONFIG_SMP) && force_affinity) { - cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(cpu)); - cpumask_copy(irq_get_effective_affinity_mask(irq), - cpumask_of(cpu)); + struct irq_data *data = irq_get_irq_data(irq); + + irq_data_update_affinity(data, cpumask_of(cpu)); + irq_data_update_effective_affinity(data, cpumask_of(cpu)); } xen_evtchn_port_bind_to_cpu(evtchn, cpu, info->cpu); diff --git a/include/linux/irq.h b/include/linux/irq.h index 69ee4e2f36ce..adcfebceb777 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -884,6 +884,12 @@ static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) return d->common->affinity; } +static inline void irq_data_update_affinity(struct irq_data *d, + const struct cpumask *m) +{ + cpumask_copy(d->common->affinity, m); +} + static inline struct cpumask *irq_get_affinity_mask(int irq) { struct irq_data *d = irq_get_irq_data(irq); From 4d0b8298818b623f5fa51d5c49e1a142d3618ac9 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:55 -0500 Subject: [PATCH 265/298] genirq: Return a const cpumask from irq_data_get_affinity_mask Now that the irq_data_update_affinity helper exists, enforce its use by returning a a const cpumask from irq_data_get_affinity_mask. Since the previous commit already updated places that needed to call irq_data_update_affinity, this commit updates the remaining code that either did not modify the cpumask or immediately passed the modified mask to irq_set_affinity. Signed-off-by: Samuel Holland Reviewed-by: Michael Kelley Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-8-samuel@sholland.org --- arch/mips/cavium-octeon/octeon-irq.c | 4 ++-- arch/sh/kernel/irq.c | 7 ++++--- arch/x86/hyperv/irqdomain.c | 2 +- arch/xtensa/kernel/irq.c | 7 ++++--- drivers/iommu/hyperv-iommu.c | 2 +- drivers/pci/controller/pci-hyperv.c | 10 +++++----- include/linux/irq.h | 12 +++++++----- kernel/irq/chip.c | 8 +++++--- kernel/irq/debugfs.c | 2 +- kernel/irq/ipi.c | 16 +++++++++------- 10 files changed, 39 insertions(+), 31 deletions(-) diff --git a/arch/mips/cavium-octeon/octeon-irq.c b/arch/mips/cavium-octeon/octeon-irq.c index 6cdcbf4de763..9cb9ed44bcaf 100644 --- a/arch/mips/cavium-octeon/octeon-irq.c +++ b/arch/mips/cavium-octeon/octeon-irq.c @@ -263,7 +263,7 @@ static int next_cpu_for_irq(struct irq_data *data) #ifdef CONFIG_SMP int cpu; - struct cpumask *mask = irq_data_get_affinity_mask(data); + const struct cpumask *mask = irq_data_get_affinity_mask(data); int weight = cpumask_weight(mask); struct octeon_ciu_chip_data *cd = irq_data_get_irq_chip_data(data); @@ -758,7 +758,7 @@ static void octeon_irq_cpu_offline_ciu(struct irq_data *data) { int cpu = smp_processor_id(); cpumask_t new_affinity; - struct cpumask *mask = irq_data_get_affinity_mask(data); + const struct cpumask *mask = irq_data_get_affinity_mask(data); if (!cpumask_test_cpu(cpu, mask)) return; diff --git a/arch/sh/kernel/irq.c b/arch/sh/kernel/irq.c index ef0f0827cf57..56269c2c3414 100644 --- a/arch/sh/kernel/irq.c +++ b/arch/sh/kernel/irq.c @@ -230,16 +230,17 @@ void migrate_irqs(void) struct irq_data *data = irq_get_irq_data(irq); if (irq_data_get_node(data) == cpu) { - struct cpumask *mask = irq_data_get_affinity_mask(data); + const struct cpumask *mask = irq_data_get_affinity_mask(data); unsigned int newcpu = cpumask_any_and(mask, cpu_online_mask); if (newcpu >= nr_cpu_ids) { pr_info_ratelimited("IRQ%u no longer affine to CPU%u\n", irq, cpu); - cpumask_setall(mask); + irq_set_affinity(irq, cpu_all_mask); + } else { + irq_set_affinity(irq, mask); } - irq_set_affinity(irq, mask); } } } diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c index 7e0f6bedc248..42c70d28ef27 100644 --- a/arch/x86/hyperv/irqdomain.c +++ b/arch/x86/hyperv/irqdomain.c @@ -192,7 +192,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) struct pci_dev *dev; struct hv_interrupt_entry out_entry, *stored_entry; struct irq_cfg *cfg = irqd_cfg(data); - cpumask_t *affinity; + const cpumask_t *affinity; int cpu; u64 status; diff --git a/arch/xtensa/kernel/irq.c b/arch/xtensa/kernel/irq.c index 529fe9245821..42f106004400 100644 --- a/arch/xtensa/kernel/irq.c +++ b/arch/xtensa/kernel/irq.c @@ -169,7 +169,7 @@ void migrate_irqs(void) for_each_active_irq(i) { struct irq_data *data = irq_get_irq_data(i); - struct cpumask *mask; + const struct cpumask *mask; unsigned int newcpu; if (irqd_is_per_cpu(data)) @@ -185,9 +185,10 @@ void migrate_irqs(void) pr_info_ratelimited("IRQ%u no longer affine to CPU%u\n", i, cpu); - cpumask_setall(mask); + irq_set_affinity(i, cpu_all_mask); + } else { + irq_set_affinity(i, mask); } - irq_set_affinity(i, mask); } } #endif /* CONFIG_HOTPLUG_CPU */ diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c index e285a220c913..51bd66a45a11 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv-iommu.c @@ -194,7 +194,7 @@ hyperv_root_ir_compose_msi_msg(struct irq_data *irq_data, struct msi_msg *msg) u32 vector; struct irq_cfg *cfg; int ioapic_id; - struct cpumask *affinity; + const struct cpumask *affinity; int cpu; struct hv_interrupt_entry entry; struct hyperv_root_ir_data *data = irq_data->chip_data; diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index db814f7b93ba..aebada45569b 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -642,7 +642,7 @@ static void hv_arch_irq_unmask(struct irq_data *data) struct hv_retarget_device_interrupt *params; struct tran_int_desc *int_desc; struct hv_pcibus_device *hbus; - struct cpumask *dest; + const struct cpumask *dest; cpumask_var_t tmp; struct pci_bus *pbus; struct pci_dev *pdev; @@ -1613,7 +1613,7 @@ out: } static u32 hv_compose_msi_req_v1( - struct pci_create_interrupt *int_pkt, struct cpumask *affinity, + struct pci_create_interrupt *int_pkt, const struct cpumask *affinity, u32 slot, u8 vector, u8 vector_count) { int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; @@ -1641,7 +1641,7 @@ static int hv_compose_msi_req_get_cpu(struct cpumask *affinity) } static u32 hv_compose_msi_req_v2( - struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity, + struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity, u32 slot, u8 vector, u8 vector_count) { int cpu; @@ -1660,7 +1660,7 @@ static u32 hv_compose_msi_req_v2( } static u32 hv_compose_msi_req_v3( - struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity, + struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity, u32 slot, u32 vector, u8 vector_count) { int cpu; @@ -1697,7 +1697,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) struct hv_pci_dev *hpdev; struct pci_bus *pbus; struct pci_dev *pdev; - struct cpumask *dest; + const struct cpumask *dest; struct compose_comp_ctxt comp; struct tran_int_desc *int_desc; struct msi_desc *msi_desc; diff --git a/include/linux/irq.h b/include/linux/irq.h index adcfebceb777..02073f7a156e 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -879,7 +879,8 @@ static inline int irq_data_get_node(struct irq_data *d) return irq_common_data_get_node(d->common); } -static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) +static inline +const struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) { return d->common->affinity; } @@ -890,7 +891,7 @@ static inline void irq_data_update_affinity(struct irq_data *d, cpumask_copy(d->common->affinity, m); } -static inline struct cpumask *irq_get_affinity_mask(int irq) +static inline const struct cpumask *irq_get_affinity_mask(int irq) { struct irq_data *d = irq_get_irq_data(irq); @@ -899,7 +900,7 @@ static inline struct cpumask *irq_get_affinity_mask(int irq) #ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK static inline -struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) +const struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) { return d->common->effective_affinity; } @@ -914,13 +915,14 @@ static inline void irq_data_update_effective_affinity(struct irq_data *d, { } static inline -struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) +const struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d) { return irq_data_get_affinity_mask(d); } #endif -static inline struct cpumask *irq_get_effective_affinity_mask(unsigned int irq) +static inline +const struct cpumask *irq_get_effective_affinity_mask(unsigned int irq) { struct irq_data *d = irq_get_irq_data(irq); diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 886789dcee43..9c7ad2266317 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -188,7 +188,8 @@ enum { #ifdef CONFIG_SMP static int -__irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force) +__irq_startup_managed(struct irq_desc *desc, const struct cpumask *aff, + bool force) { struct irq_data *d = irq_desc_get_irq_data(desc); @@ -224,7 +225,8 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force) } #else static __always_inline int -__irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force) +__irq_startup_managed(struct irq_desc *desc, const struct cpumask *aff, + bool force) { return IRQ_STARTUP_NORMAL; } @@ -252,7 +254,7 @@ static int __irq_startup(struct irq_desc *desc) int irq_startup(struct irq_desc *desc, bool resend, bool force) { struct irq_data *d = irq_desc_get_irq_data(desc); - struct cpumask *aff = irq_data_get_affinity_mask(d); + const struct cpumask *aff = irq_data_get_affinity_mask(d); int ret = 0; desc->depth = 0; diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index bc8e40cf2b65..bbcaac64038e 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -30,7 +30,7 @@ static void irq_debug_show_bits(struct seq_file *m, int ind, unsigned int state, static void irq_debug_show_masks(struct seq_file *m, struct irq_desc *desc) { struct irq_data *data = irq_desc_get_irq_data(desc); - struct cpumask *msk; + const struct cpumask *msk; msk = irq_data_get_affinity_mask(data); seq_printf(m, "affinity: %*pbl\n", cpumask_pr_args(msk)); diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c index 08ce7da3b57c..bbd945bacef0 100644 --- a/kernel/irq/ipi.c +++ b/kernel/irq/ipi.c @@ -115,11 +115,11 @@ free_descs: int irq_destroy_ipi(unsigned int irq, const struct cpumask *dest) { struct irq_data *data = irq_get_irq_data(irq); - struct cpumask *ipimask = data ? irq_data_get_affinity_mask(data) : NULL; + const struct cpumask *ipimask; struct irq_domain *domain; unsigned int nr_irqs; - if (!irq || !data || !ipimask) + if (!irq || !data) return -EINVAL; domain = data->domain; @@ -131,7 +131,8 @@ int irq_destroy_ipi(unsigned int irq, const struct cpumask *dest) return -EINVAL; } - if (WARN_ON(!cpumask_subset(dest, ipimask))) + ipimask = irq_data_get_affinity_mask(data); + if (!ipimask || WARN_ON(!cpumask_subset(dest, ipimask))) /* * Must be destroying a subset of CPUs to which this IPI * was set up to target @@ -162,12 +163,13 @@ int irq_destroy_ipi(unsigned int irq, const struct cpumask *dest) irq_hw_number_t ipi_get_hwirq(unsigned int irq, unsigned int cpu) { struct irq_data *data = irq_get_irq_data(irq); - struct cpumask *ipimask = data ? irq_data_get_affinity_mask(data) : NULL; + const struct cpumask *ipimask; - if (!data || !ipimask || cpu >= nr_cpu_ids) + if (!data || cpu >= nr_cpu_ids) return INVALID_HWIRQ; - if (!cpumask_test_cpu(cpu, ipimask)) + ipimask = irq_data_get_affinity_mask(data); + if (!ipimask || !cpumask_test_cpu(cpu, ipimask)) return INVALID_HWIRQ; /* @@ -186,7 +188,7 @@ EXPORT_SYMBOL_GPL(ipi_get_hwirq); static int ipi_send_verify(struct irq_chip *chip, struct irq_data *data, const struct cpumask *dest, unsigned int cpu) { - struct cpumask *ipimask = irq_data_get_affinity_mask(data); + const struct cpumask *ipimask = irq_data_get_affinity_mask(data); if (!chip || !ipimask) return -EINVAL; From aa0813581b8d37bdd91cd40b67ef79ffa45104b2 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:00:56 -0500 Subject: [PATCH 266/298] genirq: Provide an IRQ affinity mask in non-SMP configs IRQ affinity masks are not allocated in uniprocessor configurations. This requires special case non-SMP code in drivers for irqchips which have per-CPU enable or mask registers. Since IRQ affinity is always the same in a uniprocessor configuration, we can provide a correct affinity mask without allocating one per IRQ. By returning a real cpumask from irq_data_get_affinity_mask even when SMP is disabled, irqchip drivers which iterate over that mask will automatically do the right thing. Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701200056.46555-9-samuel@sholland.org --- include/linux/irq.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/irq.h b/include/linux/irq.h index 02073f7a156e..996e22744edd 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -151,7 +151,9 @@ struct irq_common_data { #endif void *handler_data; struct msi_desc *msi_desc; +#ifdef CONFIG_SMP cpumask_var_t affinity; +#endif #ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK cpumask_var_t effective_affinity; #endif @@ -882,13 +884,19 @@ static inline int irq_data_get_node(struct irq_data *d) static inline const struct cpumask *irq_data_get_affinity_mask(struct irq_data *d) { +#ifdef CONFIG_SMP return d->common->affinity; +#else + return cpumask_of(0); +#endif } static inline void irq_data_update_affinity(struct irq_data *d, const struct cpumask *m) { +#ifdef CONFIG_SMP cpumask_copy(d->common->affinity, m); +#endif } static inline const struct cpumask *irq_get_affinity_mask(int irq) From 9167fd5d5549bcea6d4735a270908da2a3475f3a Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Thu, 7 Jul 2022 19:49:31 -0500 Subject: [PATCH 267/298] PCI: hv: Take a const cpumask in hv_compose_msi_req_get_cpu() The cpumask that is passed to this function ultimately comes from irq_data_get_effective_affinity_mask(), which was recently changed to return a const cpumask pointer. The first level of functions handling the affinity mask were updated, but not this helper function. Fixes: 4d0b8298818b ("genirq: Return a const cpumask from irq_data_get_affinity_mask") Reported-by: kernel test robot Signed-off-by: Samuel Holland Reviewed-by: Michael Kelley Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220708004931.1672-1-samuel@sholland.org --- drivers/pci/controller/pci-hyperv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index aebada45569b..e7c6f6629e7c 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1635,7 +1635,7 @@ static u32 hv_compose_msi_req_v1( * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten * by subsequent retarget in hv_irq_unmask(). */ -static int hv_compose_msi_req_get_cpu(struct cpumask *affinity) +static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity) { return cpumask_first_and(affinity, cpu_online_mask); } From 91a29af413def677495e447fb9a06957ebc8bed5 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Thu, 7 Jul 2022 19:23:09 +0100 Subject: [PATCH 268/298] gpio: Remove dynamic allocation from populate_parent_alloc_arg() The gpiolib is unique in the way it uses intermediate fwspecs when feeding an interrupt specifier to the parent domain, as it relies on the populate_parent_alloc_arg() callback to perform a dynamic allocation. This is pretty inefficient (we free the structure almost immediately), and the only reason this isn't a stack allocation is that our ThunderX friend uses MSIs rather than a FW-constructed structure. Let's solve it by providing a new type composed of the union of a struct irq_fwspec and a msi_info_t, which satisfies both requirements. This allows us to use a stack allocation, and we can move the handful of users to this new scheme. Also perform some additional cleanup, such as getting rid of the stub versions of the irq_domain_translate_*cell helpers, which are never used when CONFIG_IRQ_DOMAIN_HIERARCHY isn't selected. Tested on a Tegra186. Reviewed-by: Linus Walleij Signed-off-by: Marc Zyngier Cc: Daniel Palmer Cc: Romain Perier Cc: Bartosz Golaszewski Cc: Thierry Reding Cc: Jonathan Hunter Cc: Robert Richter Cc: Nobuhiro Iwamatsu Cc: Andy Gross Cc: Bjorn Andersson Acked-by: Bartosz Golaszewski Link: https://lore.kernel.org/r/20220707182314.66610-2-prabhakar.mahadev-lad.rj@bp.renesas.com --- drivers/gpio/gpio-msc313.c | 15 ++++----- drivers/gpio/gpio-tegra.c | 15 ++++----- drivers/gpio/gpio-tegra186.c | 15 ++++----- drivers/gpio/gpio-thunderx.c | 15 ++++----- drivers/gpio/gpio-visconti.c | 15 ++++----- drivers/gpio/gpiolib.c | 42 ++++++++++-------------- drivers/pinctrl/qcom/pinctrl-spmi-gpio.c | 15 ++++----- include/linux/gpio/driver.h | 42 +++++++++++------------- 8 files changed, 73 insertions(+), 101 deletions(-) diff --git a/drivers/gpio/gpio-msc313.c b/drivers/gpio/gpio-msc313.c index b2c90bdd39d0..52d7b8d99170 100644 --- a/drivers/gpio/gpio-msc313.c +++ b/drivers/gpio/gpio-msc313.c @@ -550,15 +550,12 @@ static struct irq_chip msc313_gpio_irqchip = { * so we need to provide the fwspec. Essentially gpiochip_populate_parent_fwspec_twocell * that puts GIC_SPI into the first cell. */ -static void *msc313_gpio_populate_parent_fwspec(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type) +static int msc313_gpio_populate_parent_fwspec(struct gpio_chip *gc, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = gc->irq.parent_domain->fwnode; fwspec->param_count = 3; @@ -566,7 +563,7 @@ static void *msc313_gpio_populate_parent_fwspec(struct gpio_chip *gc, fwspec->param[1] = parent_hwirq; fwspec->param[2] = parent_type; - return fwspec; + return 0; } static int msc313e_gpio_child_to_parent_hwirq(struct gpio_chip *chip, diff --git a/drivers/gpio/gpio-tegra.c b/drivers/gpio/gpio-tegra.c index ff2d2a1f9c73..e4fb4cb38a0f 100644 --- a/drivers/gpio/gpio-tegra.c +++ b/drivers/gpio/gpio-tegra.c @@ -443,15 +443,12 @@ static int tegra_gpio_child_to_parent_hwirq(struct gpio_chip *chip, return 0; } -static void *tegra_gpio_populate_parent_fwspec(struct gpio_chip *chip, - unsigned int parent_hwirq, - unsigned int parent_type) +static int tegra_gpio_populate_parent_fwspec(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = chip->irq.parent_domain->fwnode; fwspec->param_count = 3; @@ -459,7 +456,7 @@ static void *tegra_gpio_populate_parent_fwspec(struct gpio_chip *chip, fwspec->param[1] = parent_hwirq; fwspec->param[2] = parent_type; - return fwspec; + return 0; } #ifdef CONFIG_PM_SLEEP diff --git a/drivers/gpio/gpio-tegra186.c b/drivers/gpio/gpio-tegra186.c index de28a68daea0..54d9fa7da9c1 100644 --- a/drivers/gpio/gpio-tegra186.c +++ b/drivers/gpio/gpio-tegra186.c @@ -621,16 +621,13 @@ static int tegra186_gpio_irq_domain_translate(struct irq_domain *domain, return 0; } -static void *tegra186_gpio_populate_parent_fwspec(struct gpio_chip *chip, - unsigned int parent_hwirq, - unsigned int parent_type) +static int tegra186_gpio_populate_parent_fwspec(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { struct tegra_gpio *gpio = gpiochip_get_data(chip); - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = chip->irq.parent_domain->fwnode; fwspec->param_count = 3; @@ -638,7 +635,7 @@ static void *tegra186_gpio_populate_parent_fwspec(struct gpio_chip *chip, fwspec->param[1] = parent_hwirq; fwspec->param[2] = parent_type; - return fwspec; + return 0; } static int tegra186_gpio_child_to_parent_hwirq(struct gpio_chip *chip, diff --git a/drivers/gpio/gpio-thunderx.c b/drivers/gpio/gpio-thunderx.c index 9f66deab46ea..e1dedbca0c85 100644 --- a/drivers/gpio/gpio-thunderx.c +++ b/drivers/gpio/gpio-thunderx.c @@ -408,18 +408,15 @@ static int thunderx_gpio_child_to_parent_hwirq(struct gpio_chip *gc, return 0; } -static void *thunderx_gpio_populate_parent_alloc_info(struct gpio_chip *chip, - unsigned int parent_hwirq, - unsigned int parent_type) +static int thunderx_gpio_populate_parent_alloc_info(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - msi_alloc_info_t *info; - - info = kmalloc(sizeof(*info), GFP_KERNEL); - if (!info) - return NULL; + msi_alloc_info_t *info = &gfwspec->msiinfo; info->hwirq = parent_hwirq; - return info; + return 0; } static int thunderx_gpio_probe(struct pci_dev *pdev, diff --git a/drivers/gpio/gpio-visconti.c b/drivers/gpio/gpio-visconti.c index e6534ea1eaa7..5e108ba9956a 100644 --- a/drivers/gpio/gpio-visconti.c +++ b/drivers/gpio/gpio-visconti.c @@ -103,15 +103,12 @@ static int visconti_gpio_child_to_parent_hwirq(struct gpio_chip *gc, return -EINVAL; } -static void *visconti_gpio_populate_parent_fwspec(struct gpio_chip *chip, - unsigned int parent_hwirq, - unsigned int parent_type) +static int visconti_gpio_populate_parent_fwspec(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = chip->irq.parent_domain->fwnode; fwspec->param_count = 3; @@ -119,7 +116,7 @@ static void *visconti_gpio_populate_parent_fwspec(struct gpio_chip *chip, fwspec->param[1] = parent_hwirq; fwspec->param[2] = parent_type; - return fwspec; + return 0; } static int visconti_gpio_probe(struct platform_device *pdev) diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index 9535f48e18d1..bfde94243752 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -1107,7 +1107,7 @@ static int gpiochip_hierarchy_irq_domain_alloc(struct irq_domain *d, irq_hw_number_t hwirq; unsigned int type = IRQ_TYPE_NONE; struct irq_fwspec *fwspec = data; - void *parent_arg; + union gpio_irq_fwspec gpio_parent_fwspec = {}; unsigned int parent_hwirq; unsigned int parent_type; struct gpio_irq_chip *girq = &gc->irq; @@ -1147,14 +1147,15 @@ static int gpiochip_hierarchy_irq_domain_alloc(struct irq_domain *d, irq_set_probe(irq); /* This parent only handles asserted level IRQs */ - parent_arg = girq->populate_parent_alloc_arg(gc, parent_hwirq, parent_type); - if (!parent_arg) - return -ENOMEM; + ret = girq->populate_parent_alloc_arg(gc, &gpio_parent_fwspec, + parent_hwirq, parent_type); + if (ret) + return ret; chip_dbg(gc, "alloc_irqs_parent for %d parent hwirq %d\n", irq, parent_hwirq); irq_set_lockdep_class(irq, gc->irq.lock_key, gc->irq.request_key); - ret = irq_domain_alloc_irqs_parent(d, irq, 1, parent_arg); + ret = irq_domain_alloc_irqs_parent(d, irq, 1, &gpio_parent_fwspec); /* * If the parent irqdomain is msi, the interrupts have already * been allocated, so the EEXIST is good. @@ -1166,7 +1167,6 @@ static int gpiochip_hierarchy_irq_domain_alloc(struct irq_domain *d, "failed to allocate parent hwirq %d for hwirq %lu\n", parent_hwirq, hwirq); - kfree(parent_arg); return ret; } @@ -1230,34 +1230,28 @@ static bool gpiochip_hierarchy_is_hierarchical(struct gpio_chip *gc) return !!gc->irq.parent_domain; } -void *gpiochip_populate_parent_fwspec_twocell(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type) +int gpiochip_populate_parent_fwspec_twocell(struct gpio_chip *gc, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = gc->irq.parent_domain->fwnode; fwspec->param_count = 2; fwspec->param[0] = parent_hwirq; fwspec->param[1] = parent_type; - return fwspec; + return 0; } EXPORT_SYMBOL_GPL(gpiochip_populate_parent_fwspec_twocell); -void *gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type) +int gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { - struct irq_fwspec *fwspec; - - fwspec = kmalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = gc->irq.parent_domain->fwnode; fwspec->param_count = 4; @@ -1266,7 +1260,7 @@ void *gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, fwspec->param[2] = 0; fwspec->param[3] = parent_type; - return fwspec; + return 0; } EXPORT_SYMBOL_GPL(gpiochip_populate_parent_fwspec_fourcell); diff --git a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c index fd5fff9adff0..3be2a08ae3a6 100644 --- a/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c +++ b/drivers/pinctrl/qcom/pinctrl-spmi-gpio.c @@ -966,16 +966,13 @@ static int pmic_gpio_child_to_parent_hwirq(struct gpio_chip *chip, return 0; } -static void *pmic_gpio_populate_parent_fwspec(struct gpio_chip *chip, - unsigned int parent_hwirq, - unsigned int parent_type) +static int pmic_gpio_populate_parent_fwspec(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) { struct pmic_gpio_state *state = gpiochip_get_data(chip); - struct irq_fwspec *fwspec; - - fwspec = kzalloc(sizeof(*fwspec), GFP_KERNEL); - if (!fwspec) - return NULL; + struct irq_fwspec *fwspec = &gfwspec->fwspec; fwspec->fwnode = chip->irq.parent_domain->fwnode; @@ -985,7 +982,7 @@ static void *pmic_gpio_populate_parent_fwspec(struct gpio_chip *chip, /* param[2] must be left as 0 */ fwspec->param[3] = parent_type; - return fwspec; + return 0; } static int pmic_gpio_probe(struct platform_device *pdev) diff --git a/include/linux/gpio/driver.h b/include/linux/gpio/driver.h index b1e0f1f8ee2e..ad5b92b74ccf 100644 --- a/include/linux/gpio/driver.h +++ b/include/linux/gpio/driver.h @@ -12,6 +12,8 @@ #include #include +#include + struct gpio_desc; struct of_phandle_args; struct device_node; @@ -23,6 +25,13 @@ enum gpio_lookup_flags; struct gpio_chip; +union gpio_irq_fwspec { + struct irq_fwspec fwspec; +#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN + msi_alloc_info_t msiinfo; +#endif +}; + #define GPIO_LINE_DIRECTION_IN 1 #define GPIO_LINE_DIRECTION_OUT 0 @@ -103,9 +112,10 @@ struct gpio_irq_chip { * variant named &gpiochip_populate_parent_fwspec_fourcell is also * available. */ - void *(*populate_parent_alloc_arg)(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type); + int (*populate_parent_alloc_arg)(struct gpio_chip *gc, + union gpio_irq_fwspec *fwspec, + unsigned int parent_hwirq, + unsigned int parent_type); /** * @child_offset_to_irq: @@ -646,28 +656,14 @@ struct bgpio_pdata { #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY -void *gpiochip_populate_parent_fwspec_twocell(struct gpio_chip *gc, +int gpiochip_populate_parent_fwspec_twocell(struct gpio_chip *gc, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type); +int gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, + union gpio_irq_fwspec *gfwspec, unsigned int parent_hwirq, unsigned int parent_type); -void *gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type); - -#else - -static inline void *gpiochip_populate_parent_fwspec_twocell(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type) -{ - return NULL; -} - -static inline void *gpiochip_populate_parent_fwspec_fourcell(struct gpio_chip *gc, - unsigned int parent_hwirq, - unsigned int parent_type) -{ - return NULL; -} #endif /* CONFIG_IRQ_DOMAIN_HIERARCHY */ From 96fed779d3d4cb3c221bb70e94de59b8dec0abfc Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 7 Jul 2022 19:23:10 +0100 Subject: [PATCH 269/298] dt-bindings: interrupt-controller: Add Renesas RZ/G2L Interrupt Controller Add DT bindings for the Renesas RZ/G2L Interrupt Controller. Signed-off-by: Lad Prabhakar Reviewed-by: Rob Herring Reviewed-by: Geert Uytterhoeven Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220707182314.66610-3-prabhakar.mahadev-lad.rj@bp.renesas.com --- .../renesas,rzg2l-irqc.yaml | 133 ++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml diff --git a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml new file mode 100644 index 000000000000..ffbb4ab4d9a7 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml @@ -0,0 +1,133 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/interrupt-controller/renesas,rzg2l-irqc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Renesas RZ/G2L (and alike SoC's) Interrupt Controller (IA55) + +maintainers: + - Lad Prabhakar + - Geert Uytterhoeven + +description: | + IA55 performs various interrupt controls including synchronization for the external + interrupts of NMI, IRQ, and GPIOINT and the interrupts of the built-in peripheral + interrupts output by each IP. And it notifies the interrupt to the GIC + - IRQ sense select for 8 external interrupts, mapped to 8 GIC SPI interrupts + - GPIO pins used as external interrupt input pins, mapped to 32 GIC SPI interrupts + - NMI edge select (NMI is not treated as NMI exception and supports fall edge and + stand-up edge detection interrupts) + +allOf: + - $ref: /schemas/interrupt-controller.yaml# + +properties: + compatible: + items: + - enum: + - renesas,r9a07g044-irqc # RZ/G2L + - const: renesas,rzg2l-irqc + + '#interrupt-cells': + description: The first cell should contain external interrupt number (IRQ0-7) and the + second cell is used to specify the flag. + const: 2 + + '#address-cells': + const: 0 + + interrupt-controller: true + + reg: + maxItems: 1 + + interrupts: + maxItems: 41 + + clocks: + maxItems: 2 + + clock-names: + items: + - const: clk + - const: pclk + + power-domains: + maxItems: 1 + + resets: + maxItems: 1 + +required: + - compatible + - '#interrupt-cells' + - '#address-cells' + - interrupt-controller + - reg + - interrupts + - clocks + - clock-names + - power-domains + - resets + +unevaluatedProperties: false + +examples: + - | + #include + #include + + irqc: interrupt-controller@110a0000 { + compatible = "renesas,r9a07g044-irqc", "renesas,rzg2l-irqc"; + reg = <0x110a0000 0x10000>; + #interrupt-cells = <2>; + #address-cells = <0>; + interrupt-controller; + interrupts = , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + , + ; + clocks = <&cpg CPG_MOD R9A07G044_IA55_CLK>, + <&cpg CPG_MOD R9A07G044_IA55_PCLK>; + clock-names = "clk", "pclk"; + power-domains = <&cpg>; + resets = <&cpg R9A07G044_IA55_RESETN>; + }; From 3fed09559cd8be79568d368ce02bf7f2d56259b6 Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 7 Jul 2022 19:23:11 +0100 Subject: [PATCH 270/298] irqchip: Add RZ/G2L IA55 Interrupt Controller driver Add a driver for the Renesas RZ/G2L Interrupt Controller. This supports external pins being used as interrupts. It supports one line for NMI, 8 external pins and 32 GPIO pins (out of 123) to be used as IRQ lines. Signed-off-by: Lad Prabhakar Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220707182314.66610-4-prabhakar.mahadev-lad.rj@bp.renesas.com --- drivers/irqchip/Kconfig | 8 + drivers/irqchip/Makefile | 1 + drivers/irqchip/irq-renesas-rzg2l.c | 393 ++++++++++++++++++++++++++++ 3 files changed, 402 insertions(+) create mode 100644 drivers/irqchip/irq-renesas-rzg2l.c diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 1f23a6be7d88..b1ab85171a3a 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -242,6 +242,14 @@ config RENESAS_RZA1_IRQC Enable support for the Renesas RZ/A1 Interrupt Controller, to use up to 8 external interrupts with configurable sense select. +config RENESAS_RZG2L_IRQC + bool "Renesas RZ/G2L (and alike SoC) IRQC support" if COMPILE_TEST + select GENERIC_IRQ_CHIP + select IRQ_DOMAIN_HIERARCHY + help + Enable support for the Renesas RZ/G2L (and alike SoC) Interrupt Controller + for external devices. + config SL28CPLD_INTC bool "Kontron sl28cpld IRQ controller" depends on MFD_SL28CPLD=y || COMPILE_TEST diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index 5b67450a9538..04cb15b647c5 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -51,6 +51,7 @@ obj-$(CONFIG_RDA_INTC) += irq-rda-intc.o obj-$(CONFIG_RENESAS_INTC_IRQPIN) += irq-renesas-intc-irqpin.o obj-$(CONFIG_RENESAS_IRQC) += irq-renesas-irqc.o obj-$(CONFIG_RENESAS_RZA1_IRQC) += irq-renesas-rza1.o +obj-$(CONFIG_RENESAS_RZG2L_IRQC) += irq-renesas-rzg2l.o obj-$(CONFIG_VERSATILE_FPGA_IRQ) += irq-versatile-fpga.o obj-$(CONFIG_ARCH_NSPIRE) += irq-zevio.o obj-$(CONFIG_ARCH_VT8500) += irq-vt8500.o diff --git a/drivers/irqchip/irq-renesas-rzg2l.c b/drivers/irqchip/irq-renesas-rzg2l.c new file mode 100644 index 000000000000..25fd8ee66565 --- /dev/null +++ b/drivers/irqchip/irq-renesas-rzg2l.c @@ -0,0 +1,393 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Renesas RZ/G2L IRQC Driver + * + * Copyright (C) 2022 Renesas Electronics Corporation. + * + * Author: Lad Prabhakar + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define IRQC_IRQ_START 1 +#define IRQC_IRQ_COUNT 8 +#define IRQC_TINT_START (IRQC_IRQ_START + IRQC_IRQ_COUNT) +#define IRQC_TINT_COUNT 32 +#define IRQC_NUM_IRQ (IRQC_TINT_START + IRQC_TINT_COUNT) + +#define ISCR 0x10 +#define IITSR 0x14 +#define TSCR 0x20 +#define TITSR0 0x24 +#define TITSR1 0x28 +#define TITSR0_MAX_INT 16 +#define TITSEL_WIDTH 0x2 +#define TSSR(n) (0x30 + ((n) * 4)) +#define TIEN BIT(7) +#define TSSEL_SHIFT(n) (8 * (n)) +#define TSSEL_MASK GENMASK(7, 0) +#define IRQ_MASK 0x3 + +#define TSSR_OFFSET(n) ((n) % 4) +#define TSSR_INDEX(n) ((n) / 4) + +#define TITSR_TITSEL_EDGE_RISING 0 +#define TITSR_TITSEL_EDGE_FALLING 1 +#define TITSR_TITSEL_LEVEL_HIGH 2 +#define TITSR_TITSEL_LEVEL_LOW 3 + +#define IITSR_IITSEL(n, sense) ((sense) << ((n) * 2)) +#define IITSR_IITSEL_LEVEL_LOW 0 +#define IITSR_IITSEL_EDGE_FALLING 1 +#define IITSR_IITSEL_EDGE_RISING 2 +#define IITSR_IITSEL_EDGE_BOTH 3 +#define IITSR_IITSEL_MASK(n) IITSR_IITSEL((n), 3) + +#define TINT_EXTRACT_HWIRQ(x) FIELD_GET(GENMASK(15, 0), (x)) +#define TINT_EXTRACT_GPIOINT(x) FIELD_GET(GENMASK(31, 16), (x)) + +struct rzg2l_irqc_priv { + void __iomem *base; + struct irq_fwspec fwspec[IRQC_NUM_IRQ]; + raw_spinlock_t lock; +}; + +static struct rzg2l_irqc_priv *irq_data_to_priv(struct irq_data *data) +{ + return data->domain->host_data; +} + +static void rzg2l_irq_eoi(struct irq_data *d) +{ + unsigned int hw_irq = irqd_to_hwirq(d) - IRQC_IRQ_START; + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + u32 bit = BIT(hw_irq); + u32 reg; + + reg = readl_relaxed(priv->base + ISCR); + if (reg & bit) + writel_relaxed(reg & ~bit, priv->base + ISCR); +} + +static void rzg2l_tint_eoi(struct irq_data *d) +{ + unsigned int hw_irq = irqd_to_hwirq(d) - IRQC_TINT_START; + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + u32 bit = BIT(hw_irq); + u32 reg; + + reg = readl_relaxed(priv->base + TSCR); + if (reg & bit) + writel_relaxed(reg & ~bit, priv->base + TSCR); +} + +static void rzg2l_irqc_eoi(struct irq_data *d) +{ + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + unsigned int hw_irq = irqd_to_hwirq(d); + + raw_spin_lock(&priv->lock); + if (hw_irq >= IRQC_IRQ_START && hw_irq <= IRQC_IRQ_COUNT) + rzg2l_irq_eoi(d); + else if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ) + rzg2l_tint_eoi(d); + raw_spin_unlock(&priv->lock); + irq_chip_eoi_parent(d); +} + +static void rzg2l_irqc_irq_disable(struct irq_data *d) +{ + unsigned int hw_irq = irqd_to_hwirq(d); + + if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ) { + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + u32 offset = hw_irq - IRQC_TINT_START; + u32 tssr_offset = TSSR_OFFSET(offset); + u8 tssr_index = TSSR_INDEX(offset); + u32 reg; + + raw_spin_lock(&priv->lock); + reg = readl_relaxed(priv->base + TSSR(tssr_index)); + reg &= ~(TSSEL_MASK << tssr_offset); + writel_relaxed(reg, priv->base + TSSR(tssr_index)); + raw_spin_unlock(&priv->lock); + } + irq_chip_disable_parent(d); +} + +static void rzg2l_irqc_irq_enable(struct irq_data *d) +{ + unsigned int hw_irq = irqd_to_hwirq(d); + + if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ) { + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + unsigned long tint = (uintptr_t)d->chip_data; + u32 offset = hw_irq - IRQC_TINT_START; + u32 tssr_offset = TSSR_OFFSET(offset); + u8 tssr_index = TSSR_INDEX(offset); + u32 reg; + + raw_spin_lock(&priv->lock); + reg = readl_relaxed(priv->base + TSSR(tssr_index)); + reg |= (TIEN | tint) << TSSEL_SHIFT(tssr_offset); + writel_relaxed(reg, priv->base + TSSR(tssr_index)); + raw_spin_unlock(&priv->lock); + } + irq_chip_enable_parent(d); +} + +static int rzg2l_irq_set_type(struct irq_data *d, unsigned int type) +{ + unsigned int hw_irq = irqd_to_hwirq(d) - IRQC_IRQ_START; + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + u16 sense, tmp; + + switch (type & IRQ_TYPE_SENSE_MASK) { + case IRQ_TYPE_LEVEL_LOW: + sense = IITSR_IITSEL_LEVEL_LOW; + break; + + case IRQ_TYPE_EDGE_FALLING: + sense = IITSR_IITSEL_EDGE_FALLING; + break; + + case IRQ_TYPE_EDGE_RISING: + sense = IITSR_IITSEL_EDGE_RISING; + break; + + case IRQ_TYPE_EDGE_BOTH: + sense = IITSR_IITSEL_EDGE_BOTH; + break; + + default: + return -EINVAL; + } + + raw_spin_lock(&priv->lock); + tmp = readl_relaxed(priv->base + IITSR); + tmp &= ~IITSR_IITSEL_MASK(hw_irq); + tmp |= IITSR_IITSEL(hw_irq, sense); + writel_relaxed(tmp, priv->base + IITSR); + raw_spin_unlock(&priv->lock); + + return 0; +} + +static int rzg2l_tint_set_edge(struct irq_data *d, unsigned int type) +{ + struct rzg2l_irqc_priv *priv = irq_data_to_priv(d); + unsigned int hwirq = irqd_to_hwirq(d); + u32 titseln = hwirq - IRQC_TINT_START; + u32 offset; + u8 sense; + u32 reg; + + switch (type & IRQ_TYPE_SENSE_MASK) { + case IRQ_TYPE_EDGE_RISING: + sense = TITSR_TITSEL_EDGE_RISING; + break; + + case IRQ_TYPE_EDGE_FALLING: + sense = TITSR_TITSEL_EDGE_FALLING; + break; + + default: + return -EINVAL; + } + + offset = TITSR0; + if (titseln >= TITSR0_MAX_INT) { + titseln -= TITSR0_MAX_INT; + offset = TITSR1; + } + + raw_spin_lock(&priv->lock); + reg = readl_relaxed(priv->base + offset); + reg &= ~(IRQ_MASK << (titseln * TITSEL_WIDTH)); + reg |= sense << (titseln * TITSEL_WIDTH); + writel_relaxed(reg, priv->base + offset); + raw_spin_unlock(&priv->lock); + + return 0; +} + +static int rzg2l_irqc_set_type(struct irq_data *d, unsigned int type) +{ + unsigned int hw_irq = irqd_to_hwirq(d); + int ret = -EINVAL; + + if (hw_irq >= IRQC_IRQ_START && hw_irq <= IRQC_IRQ_COUNT) + ret = rzg2l_irq_set_type(d, type); + else if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ) + ret = rzg2l_tint_set_edge(d, type); + if (ret) + return ret; + + return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH); +} + +static const struct irq_chip irqc_chip = { + .name = "rzg2l-irqc", + .irq_eoi = rzg2l_irqc_eoi, + .irq_mask = irq_chip_mask_parent, + .irq_unmask = irq_chip_unmask_parent, + .irq_disable = rzg2l_irqc_irq_disable, + .irq_enable = rzg2l_irqc_irq_enable, + .irq_get_irqchip_state = irq_chip_get_parent_state, + .irq_set_irqchip_state = irq_chip_set_parent_state, + .irq_retrigger = irq_chip_retrigger_hierarchy, + .irq_set_type = rzg2l_irqc_set_type, + .flags = IRQCHIP_MASK_ON_SUSPEND | + IRQCHIP_SET_TYPE_MASKED | + IRQCHIP_SKIP_SET_WAKE, +}; + +static int rzg2l_irqc_alloc(struct irq_domain *domain, unsigned int virq, + unsigned int nr_irqs, void *arg) +{ + struct rzg2l_irqc_priv *priv = domain->host_data; + unsigned long tint = 0; + irq_hw_number_t hwirq; + unsigned int type; + int ret; + + ret = irq_domain_translate_twocell(domain, arg, &hwirq, &type); + if (ret) + return ret; + + /* + * For TINT interrupts ie where pinctrl driver is child of irqc domain + * the hwirq and TINT are encoded in fwspec->param[0]. + * hwirq for TINT range from 9-40, hwirq is embedded 0-15 bits and TINT + * from 16-31 bits. TINT from the pinctrl driver needs to be programmed + * in IRQC registers to enable a given gpio pin as interrupt. + */ + if (hwirq > IRQC_IRQ_COUNT) { + tint = TINT_EXTRACT_GPIOINT(hwirq); + hwirq = TINT_EXTRACT_HWIRQ(hwirq); + + if (hwirq < IRQC_TINT_START) + return -EINVAL; + } + + if (hwirq > (IRQC_NUM_IRQ - 1)) + return -EINVAL; + + ret = irq_domain_set_hwirq_and_chip(domain, virq, hwirq, &irqc_chip, + (void *)(uintptr_t)tint); + if (ret) + return ret; + + return irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, &priv->fwspec[hwirq]); +} + +static const struct irq_domain_ops rzg2l_irqc_domain_ops = { + .alloc = rzg2l_irqc_alloc, + .free = irq_domain_free_irqs_common, + .translate = irq_domain_translate_twocell, +}; + +static int rzg2l_irqc_parse_interrupts(struct rzg2l_irqc_priv *priv, + struct device_node *np) +{ + struct of_phandle_args map; + unsigned int i; + int ret; + + for (i = 0; i < IRQC_NUM_IRQ; i++) { + ret = of_irq_parse_one(np, i, &map); + if (ret) + return ret; + of_phandle_args_to_fwspec(np, map.args, map.args_count, + &priv->fwspec[i]); + } + + return 0; +} + +static int rzg2l_irqc_init(struct device_node *node, struct device_node *parent) +{ + struct irq_domain *irq_domain, *parent_domain; + struct platform_device *pdev; + struct reset_control *resetn; + struct rzg2l_irqc_priv *priv; + int ret; + + pdev = of_find_device_by_node(node); + if (!pdev) + return -ENODEV; + + parent_domain = irq_find_host(parent); + if (!parent_domain) { + dev_err(&pdev->dev, "cannot find parent domain\n"); + return -ENODEV; + } + + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->base = devm_of_iomap(&pdev->dev, pdev->dev.of_node, 0, NULL); + if (IS_ERR(priv->base)) + return PTR_ERR(priv->base); + + ret = rzg2l_irqc_parse_interrupts(priv, node); + if (ret) { + dev_err(&pdev->dev, "cannot parse interrupts: %d\n", ret); + return ret; + } + + resetn = devm_reset_control_get_exclusive(&pdev->dev, NULL); + if (IS_ERR(resetn)) + return PTR_ERR(resetn); + + ret = reset_control_deassert(resetn); + if (ret) { + dev_err(&pdev->dev, "failed to deassert resetn pin, %d\n", ret); + return ret; + } + + pm_runtime_enable(&pdev->dev); + ret = pm_runtime_resume_and_get(&pdev->dev); + if (ret < 0) { + dev_err(&pdev->dev, "pm_runtime_resume_and_get failed: %d\n", ret); + goto pm_disable; + } + + raw_spin_lock_init(&priv->lock); + + irq_domain = irq_domain_add_hierarchy(parent_domain, 0, IRQC_NUM_IRQ, + node, &rzg2l_irqc_domain_ops, + priv); + if (!irq_domain) { + dev_err(&pdev->dev, "failed to add irq domain\n"); + ret = -ENOMEM; + goto pm_put; + } + + return 0; + +pm_put: + pm_runtime_put(&pdev->dev); +pm_disable: + pm_runtime_disable(&pdev->dev); + reset_control_assert(resetn); + return ret; +} + +IRQCHIP_PLATFORM_DRIVER_BEGIN(rzg2l_irqc) +IRQCHIP_MATCH("renesas,rzg2l-irqc", rzg2l_irqc_init) +IRQCHIP_PLATFORM_DRIVER_END(rzg2l_irqc) +MODULE_AUTHOR("Lad Prabhakar "); +MODULE_DESCRIPTION("Renesas RZ/G2L IRQC Driver"); +MODULE_LICENSE("GPL"); From 08f12b4534c274c67245019021961206e7a3eefa Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 7 Jul 2022 19:23:12 +0100 Subject: [PATCH 271/298] gpio: gpiolib: Allow free() callback to be overridden Allow free() callback to be overridden from irq_domain_ops for hierarchical chips. This allows drivers to free up resources which are allocated during child_to_parent_hwirq()/populate_parent_alloc_arg() callbacks. On Renesas RZ/G2L platform a bitmap is maintained for TINT slots, a slot is allocated in child_to_parent_hwirq() callback which is freed up in free callback hence this override. Signed-off-by: Lad Prabhakar Reviewed-by: Geert Uytterhoeven Acked-by: Bartosz Golaszewski Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220707182314.66610-5-prabhakar.mahadev-lad.rj@bp.renesas.com --- drivers/gpio/gpiolib.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index bfde94243752..68d9f95d7799 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -1181,15 +1181,18 @@ static void gpiochip_hierarchy_setup_domain_ops(struct irq_domain_ops *ops) ops->activate = gpiochip_irq_domain_activate; ops->deactivate = gpiochip_irq_domain_deactivate; ops->alloc = gpiochip_hierarchy_irq_domain_alloc; - ops->free = irq_domain_free_irqs_common; /* - * We only allow overriding the translate() function for + * We only allow overriding the translate() and free() functions for * hierarchical chips, and this should only be done if the user - * really need something other than 1:1 translation. + * really need something other than 1:1 translation for translate() + * callback and free if user wants to free up any resources which + * were allocated during callbacks, for example populate_parent_alloc_arg. */ if (!ops->translate) ops->translate = gpiochip_hierarchy_irq_domain_translate; + if (!ops->free) + ops->free = irq_domain_free_irqs_common; } static int gpiochip_hierarchy_add_domain(struct gpio_chip *gc) From 35c37efd12733d8ddbdc11ab9c8dbcee472a487f Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 7 Jul 2022 19:23:13 +0100 Subject: [PATCH 272/298] dt-bindings: pinctrl: renesas,rzg2l-pinctrl: Document the properties to handle GPIO IRQ Document the required properties to handle GPIO IRQ. Signed-off-by: Lad Prabhakar Reviewed-by: Geert Uytterhoeven Reviewed-by: Rob Herring Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220707182314.66610-6-prabhakar.mahadev-lad.rj@bp.renesas.com --- .../bindings/pinctrl/renesas,rzg2l-pinctrl.yaml | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-pinctrl.yaml index 52df1b146174..997b74639112 100644 --- a/Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-pinctrl.yaml +++ b/Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-pinctrl.yaml @@ -47,6 +47,17 @@ properties: gpio-ranges: maxItems: 1 + interrupt-controller: true + + '#interrupt-cells': + const: 2 + description: + The first cell contains the global GPIO port index, constructed using the + RZG2L_GPIO() helper macro in and the + second cell is used to specify the flag. + E.g. "interrupts = ;" if P43_0 is + being used as an interrupt. + clocks: maxItems: 1 @@ -110,6 +121,8 @@ required: - gpio-controller - '#gpio-cells' - gpio-ranges + - interrupt-controller + - '#interrupt-cells' - clocks - power-domains - resets @@ -126,6 +139,8 @@ examples: gpio-controller; #gpio-cells = <2>; gpio-ranges = <&pinctrl 0 0 392>; + interrupt-controller; + #interrupt-cells = <2>; clocks = <&cpg CPG_MOD R9A07G044_GPIO_HCLK>; resets = <&cpg R9A07G044_GPIO_RSTN>, <&cpg R9A07G044_GPIO_PORT_RESETN>, From db2e5f21a48edf1d1110d348add54bf22050643b Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Thu, 7 Jul 2022 19:23:14 +0100 Subject: [PATCH 273/298] pinctrl: renesas: pinctrl-rzg2l: Add IRQ domain to handle GPIO interrupt Add IRQ domain to RZ/G2L pinctrl driver to handle GPIO interrupt. GPIO0-GPIO122 pins can be used as IRQ lines but only 32 pins can be used as IRQ lines at a given time. Selection of pins as IRQ lines is handled by IA55 (which is the IRQC block) which sits in between the GPIO and GIC. Signed-off-by: Lad Prabhakar Reviewed-by: Linus Walleij Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220707182314.66610-7-prabhakar.mahadev-lad.rj@bp.renesas.com --- drivers/pinctrl/renesas/pinctrl-rzg2l.c | 233 ++++++++++++++++++++++++ 1 file changed, 233 insertions(+) diff --git a/drivers/pinctrl/renesas/pinctrl-rzg2l.c b/drivers/pinctrl/renesas/pinctrl-rzg2l.c index a48cac55152c..c47eed9d948f 100644 --- a/drivers/pinctrl/renesas/pinctrl-rzg2l.c +++ b/drivers/pinctrl/renesas/pinctrl-rzg2l.c @@ -9,8 +9,10 @@ #include #include #include +#include #include #include +#include #include #include #include @@ -89,6 +91,7 @@ #define PIN(n) (0x0800 + 0x10 + (n)) #define IOLH(n) (0x1000 + (n) * 8) #define IEN(n) (0x1800 + (n) * 8) +#define ISEL(n) (0x2c80 + (n) * 8) #define PWPR (0x3014) #define SD_CH(n) (0x3000 + (n) * 4) #define QSPI (0x3008) @@ -112,6 +115,10 @@ #define RZG2L_PIN_ID_TO_PORT_OFFSET(id) (RZG2L_PIN_ID_TO_PORT(id) + 0x10) #define RZG2L_PIN_ID_TO_PIN(id) ((id) % RZG2L_PINS_PER_PORT) +#define RZG2L_TINT_MAX_INTERRUPT 32 +#define RZG2L_TINT_IRQ_START_INDEX 9 +#define RZG2L_PACK_HWIRQ(t, i) (((t) << 16) | (i)) + struct rzg2l_dedicated_configs { const char *name; u32 config; @@ -137,6 +144,9 @@ struct rzg2l_pinctrl { struct gpio_chip gpio_chip; struct pinctrl_gpio_range gpio_range; + DECLARE_BITMAP(tint_slot, RZG2L_TINT_MAX_INTERRUPT); + spinlock_t bitmap_lock; + unsigned int hwirq[RZG2L_TINT_MAX_INTERRUPT]; spinlock_t lock; }; @@ -883,8 +893,14 @@ static int rzg2l_gpio_get(struct gpio_chip *chip, unsigned int offset) static void rzg2l_gpio_free(struct gpio_chip *chip, unsigned int offset) { + unsigned int virq; + pinctrl_gpio_free(chip->base + offset); + virq = irq_find_mapping(chip->irq.domain, offset); + if (virq) + irq_dispose_mapping(virq); + /* * Set the GPIO as an input to ensure that the next GPIO request won't * drive the GPIO pin as an output. @@ -1104,14 +1120,221 @@ static struct { } }; +static int rzg2l_gpio_get_gpioint(unsigned int virq) +{ + unsigned int gpioint; + unsigned int i; + u32 port, bit; + + port = virq / 8; + bit = virq % 8; + + if (port >= ARRAY_SIZE(rzg2l_gpio_configs) || + bit >= RZG2L_GPIO_PORT_GET_PINCNT(rzg2l_gpio_configs[port])) + return -EINVAL; + + gpioint = bit; + for (i = 0; i < port; i++) + gpioint += RZG2L_GPIO_PORT_GET_PINCNT(rzg2l_gpio_configs[i]); + + return gpioint; +} + +static void rzg2l_gpio_irq_disable(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_get_irq_chip_data(d); + struct rzg2l_pinctrl *pctrl = container_of(gc, struct rzg2l_pinctrl, gpio_chip); + unsigned int hwirq = irqd_to_hwirq(d); + unsigned long flags; + void __iomem *addr; + u32 port; + u8 bit; + + port = RZG2L_PIN_ID_TO_PORT(hwirq); + bit = RZG2L_PIN_ID_TO_PIN(hwirq); + + addr = pctrl->base + ISEL(port); + if (bit >= 4) { + bit -= 4; + addr += 4; + } + + spin_lock_irqsave(&pctrl->lock, flags); + writel(readl(addr) & ~BIT(bit * 8), addr); + spin_unlock_irqrestore(&pctrl->lock, flags); + + gpiochip_disable_irq(gc, hwirq); + irq_chip_disable_parent(d); +} + +static void rzg2l_gpio_irq_enable(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_get_irq_chip_data(d); + struct rzg2l_pinctrl *pctrl = container_of(gc, struct rzg2l_pinctrl, gpio_chip); + unsigned int hwirq = irqd_to_hwirq(d); + unsigned long flags; + void __iomem *addr; + u32 port; + u8 bit; + + gpiochip_enable_irq(gc, hwirq); + + port = RZG2L_PIN_ID_TO_PORT(hwirq); + bit = RZG2L_PIN_ID_TO_PIN(hwirq); + + addr = pctrl->base + ISEL(port); + if (bit >= 4) { + bit -= 4; + addr += 4; + } + + spin_lock_irqsave(&pctrl->lock, flags); + writel(readl(addr) | BIT(bit * 8), addr); + spin_unlock_irqrestore(&pctrl->lock, flags); + + irq_chip_enable_parent(d); +} + +static int rzg2l_gpio_irq_set_type(struct irq_data *d, unsigned int type) +{ + return irq_chip_set_type_parent(d, type); +} + +static void rzg2l_gpio_irqc_eoi(struct irq_data *d) +{ + irq_chip_eoi_parent(d); +} + +static void rzg2l_gpio_irq_print_chip(struct irq_data *data, struct seq_file *p) +{ + struct gpio_chip *gc = irq_data_get_irq_chip_data(data); + + seq_printf(p, dev_name(gc->parent)); +} + +static const struct irq_chip rzg2l_gpio_irqchip = { + .name = "rzg2l-gpio", + .irq_disable = rzg2l_gpio_irq_disable, + .irq_enable = rzg2l_gpio_irq_enable, + .irq_mask = irq_chip_mask_parent, + .irq_unmask = irq_chip_unmask_parent, + .irq_set_type = rzg2l_gpio_irq_set_type, + .irq_eoi = rzg2l_gpio_irqc_eoi, + .irq_print_chip = rzg2l_gpio_irq_print_chip, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, +}; + +static int rzg2l_gpio_child_to_parent_hwirq(struct gpio_chip *gc, + unsigned int child, + unsigned int child_type, + unsigned int *parent, + unsigned int *parent_type) +{ + struct rzg2l_pinctrl *pctrl = gpiochip_get_data(gc); + unsigned long flags; + int gpioint, irq; + + gpioint = rzg2l_gpio_get_gpioint(child); + if (gpioint < 0) + return gpioint; + + spin_lock_irqsave(&pctrl->bitmap_lock, flags); + irq = bitmap_find_free_region(pctrl->tint_slot, RZG2L_TINT_MAX_INTERRUPT, get_order(1)); + spin_unlock_irqrestore(&pctrl->bitmap_lock, flags); + if (irq < 0) + return -ENOSPC; + pctrl->hwirq[irq] = child; + irq += RZG2L_TINT_IRQ_START_INDEX; + + /* All these interrupts are level high in the CPU */ + *parent_type = IRQ_TYPE_LEVEL_HIGH; + *parent = RZG2L_PACK_HWIRQ(gpioint, irq); + return 0; +} + +static int rzg2l_gpio_populate_parent_fwspec(struct gpio_chip *chip, + union gpio_irq_fwspec *gfwspec, + unsigned int parent_hwirq, + unsigned int parent_type) +{ + struct irq_fwspec *fwspec = &gfwspec->fwspec; + + fwspec->fwnode = chip->irq.parent_domain->fwnode; + fwspec->param_count = 2; + fwspec->param[0] = parent_hwirq; + fwspec->param[1] = parent_type; + + return 0; +} + +static void rzg2l_gpio_irq_domain_free(struct irq_domain *domain, unsigned int virq, + unsigned int nr_irqs) +{ + struct irq_data *d; + + d = irq_domain_get_irq_data(domain, virq); + if (d) { + struct gpio_chip *gc = irq_data_get_irq_chip_data(d); + struct rzg2l_pinctrl *pctrl = container_of(gc, struct rzg2l_pinctrl, gpio_chip); + irq_hw_number_t hwirq = irqd_to_hwirq(d); + unsigned long flags; + unsigned int i; + + for (i = 0; i < RZG2L_TINT_MAX_INTERRUPT; i++) { + if (pctrl->hwirq[i] == hwirq) { + spin_lock_irqsave(&pctrl->bitmap_lock, flags); + bitmap_release_region(pctrl->tint_slot, i, get_order(1)); + spin_unlock_irqrestore(&pctrl->bitmap_lock, flags); + pctrl->hwirq[i] = 0; + break; + } + } + } + irq_domain_free_irqs_common(domain, virq, nr_irqs); +} + +static void rzg2l_init_irq_valid_mask(struct gpio_chip *gc, + unsigned long *valid_mask, + unsigned int ngpios) +{ + struct rzg2l_pinctrl *pctrl = gpiochip_get_data(gc); + struct gpio_chip *chip = &pctrl->gpio_chip; + unsigned int offset; + + /* Forbid unused lines to be mapped as IRQs */ + for (offset = 0; offset < chip->ngpio; offset++) { + u32 port, bit; + + port = offset / 8; + bit = offset % 8; + + if (port >= ARRAY_SIZE(rzg2l_gpio_configs) || + bit >= RZG2L_GPIO_PORT_GET_PINCNT(rzg2l_gpio_configs[port])) + clear_bit(offset, valid_mask); + } +} + static int rzg2l_gpio_register(struct rzg2l_pinctrl *pctrl) { struct device_node *np = pctrl->dev->of_node; struct gpio_chip *chip = &pctrl->gpio_chip; const char *name = dev_name(pctrl->dev); + struct irq_domain *parent_domain; struct of_phandle_args of_args; + struct device_node *parent_np; + struct gpio_irq_chip *girq; int ret; + parent_np = of_irq_find_parent(np); + if (!parent_np) + return -ENXIO; + + parent_domain = irq_find_host(parent_np); + of_node_put(parent_np); + if (!parent_domain) + return -EPROBE_DEFER; + ret = of_parse_phandle_with_fixed_args(np, "gpio-ranges", 3, 0, &of_args); if (ret) { dev_err(pctrl->dev, "Unable to parse gpio-ranges\n"); @@ -1138,6 +1361,15 @@ static int rzg2l_gpio_register(struct rzg2l_pinctrl *pctrl) chip->base = -1; chip->ngpio = of_args.args[2]; + girq = &chip->irq; + gpio_irq_chip_set_chip(girq, &rzg2l_gpio_irqchip); + girq->fwnode = of_node_to_fwnode(np); + girq->parent_domain = parent_domain; + girq->child_to_parent_hwirq = rzg2l_gpio_child_to_parent_hwirq; + girq->populate_parent_alloc_arg = rzg2l_gpio_populate_parent_fwspec; + girq->child_irq_domain_ops.free = rzg2l_gpio_irq_domain_free; + girq->init_valid_mask = rzg2l_init_irq_valid_mask; + pctrl->gpio_range.id = 0; pctrl->gpio_range.pin_base = 0; pctrl->gpio_range.base = 0; @@ -1253,6 +1485,7 @@ static int rzg2l_pinctrl_probe(struct platform_device *pdev) } spin_lock_init(&pctrl->lock); + spin_lock_init(&pctrl->bitmap_lock); platform_set_drvdata(pdev, pctrl); From de078949218242d57f791b63fac87cdb09cb0424 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:24:39 -0500 Subject: [PATCH 274/298] irqchip/sifive-plic: Make better use of the effective affinity mask The PLIC driver already updates the effective affinity mask in its .irq_set_affinity callback. Take advantage of that information to only touch bits (and take spinlocks) for the specific relevant hart contexts. First, make sure the effective affinity mask is set before IRQ startup. Then, since this mask already takes priv->lmask into account, checking that mask later is no longer needed (and handler->present is equivalent to the bit being set in priv->lmask). Finally, when (un)masking or changing affinity, only clear/set the enable bits in the specific old/new context(s). The cpumask operations in plic_irq_unmask() are not needed because they duplicate the code in plic_set_affinity(). Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701202440.59059-2-samuel@sholland.org --- drivers/irqchip/Kconfig | 1 + drivers/irqchip/irq-sifive-plic.c | 27 +++++++++------------------ 2 files changed, 10 insertions(+), 18 deletions(-) diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 462adac905a6..ea7b7485327c 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -531,6 +531,7 @@ config SIFIVE_PLIC bool "SiFive Platform-Level Interrupt Controller" depends on RISCV select IRQ_DOMAIN_HIERARCHY + select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP help This enables support for the PLIC chip found in SiFive (and potentially other) RISC-V systems. The PLIC controls devices diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c index b3a36dca7f1b..46595e607b0e 100644 --- a/drivers/irqchip/irq-sifive-plic.c +++ b/drivers/irqchip/irq-sifive-plic.c @@ -114,31 +114,18 @@ static inline void plic_irq_toggle(const struct cpumask *mask, for_each_cpu(cpu, mask) { struct plic_handler *handler = per_cpu_ptr(&plic_handlers, cpu); - if (handler->present && - cpumask_test_cpu(cpu, &handler->priv->lmask)) - plic_toggle(handler, d->hwirq, enable); + plic_toggle(handler, d->hwirq, enable); } } static void plic_irq_unmask(struct irq_data *d) { - struct cpumask amask; - unsigned int cpu; - struct plic_priv *priv = irq_data_get_irq_chip_data(d); - - cpumask_and(&amask, &priv->lmask, cpu_online_mask); - cpu = cpumask_any_and(irq_data_get_affinity_mask(d), - &amask); - if (WARN_ON_ONCE(cpu >= nr_cpu_ids)) - return; - plic_irq_toggle(cpumask_of(cpu), d, 1); + plic_irq_toggle(irq_data_get_effective_affinity_mask(d), d, 1); } static void plic_irq_mask(struct irq_data *d) { - struct plic_priv *priv = irq_data_get_irq_chip_data(d); - - plic_irq_toggle(&priv->lmask, d, 0); + plic_irq_toggle(irq_data_get_effective_affinity_mask(d), d, 0); } #ifdef CONFIG_SMP @@ -159,11 +146,13 @@ static int plic_set_affinity(struct irq_data *d, if (cpu >= nr_cpu_ids) return -EINVAL; - plic_irq_toggle(&priv->lmask, d, 0); - plic_irq_toggle(cpumask_of(cpu), d, !irqd_irq_masked(d)); + plic_irq_mask(d); irq_data_update_effective_affinity(d, cpumask_of(cpu)); + if (!irqd_irq_masked(d)) + plic_irq_unmask(d); + return IRQ_SET_MASK_OK_DONE; } #endif @@ -190,6 +179,7 @@ static struct irq_chip plic_edge_chip = { .irq_set_affinity = plic_set_affinity, #endif .irq_set_type = plic_irq_set_type, + .flags = IRQCHIP_AFFINITY_PRE_STARTUP, }; static struct irq_chip plic_chip = { @@ -201,6 +191,7 @@ static struct irq_chip plic_chip = { .irq_set_affinity = plic_set_affinity, #endif .irq_set_type = plic_irq_set_type, + .flags = IRQCHIP_AFFINITY_PRE_STARTUP, }; static int plic_irq_set_type(struct irq_data *d, unsigned int type) From a1706a1c5062e0908528170f853601ed53f428c8 Mon Sep 17 00:00:00 2001 From: Samuel Holland Date: Fri, 1 Jul 2022 15:24:40 -0500 Subject: [PATCH 275/298] irqchip/sifive-plic: Separate the enable and mask operations The PLIC has two per-IRQ checks before sending an IRQ to a hart context. First, it checks that the IRQ's priority is nonzero. Then, it checks that the enable bit is set for that combination of IRQ and context. Currently, the PLIC driver sets both the priority value and the enable bit in its (un)mask operations. However, modifying the enable bit is problematic for two reasons: 1) The enable bits are packed, so changes are not atomic and require taking a spinlock. 2) The following requirement from the PLIC spec, which explains the racy (un)mask operations in plic_irq_eoi(): If the completion ID does not match an interrupt source that is currently enabled for the target, the completion is silently ignored. Both of these problems are solved by using the priority value to mask IRQs. Each IRQ has a separate priority register, so writing the priority value is atomic. And since the enable bit remains set while an IRQ is masked, the EOI operation works normally. The enable bits are still used to control the IRQ's affinity. Signed-off-by: Samuel Holland Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220701202440.59059-3-samuel@sholland.org --- drivers/irqchip/irq-sifive-plic.c | 55 +++++++++++++++++++------------ 1 file changed, 34 insertions(+), 21 deletions(-) diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c index 46595e607b0e..ba4938188570 100644 --- a/drivers/irqchip/irq-sifive-plic.c +++ b/drivers/irqchip/irq-sifive-plic.c @@ -108,9 +108,7 @@ static inline void plic_irq_toggle(const struct cpumask *mask, struct irq_data *d, int enable) { int cpu; - struct plic_priv *priv = irq_data_get_irq_chip_data(d); - writel(enable, priv->regs + PRIORITY_BASE + d->hwirq * PRIORITY_PER_ID); for_each_cpu(cpu, mask) { struct plic_handler *handler = per_cpu_ptr(&plic_handlers, cpu); @@ -118,16 +116,37 @@ static inline void plic_irq_toggle(const struct cpumask *mask, } } -static void plic_irq_unmask(struct irq_data *d) +static void plic_irq_enable(struct irq_data *d) { plic_irq_toggle(irq_data_get_effective_affinity_mask(d), d, 1); } -static void plic_irq_mask(struct irq_data *d) +static void plic_irq_disable(struct irq_data *d) { plic_irq_toggle(irq_data_get_effective_affinity_mask(d), d, 0); } +static void plic_irq_unmask(struct irq_data *d) +{ + struct plic_priv *priv = irq_data_get_irq_chip_data(d); + + writel(1, priv->regs + PRIORITY_BASE + d->hwirq * PRIORITY_PER_ID); +} + +static void plic_irq_mask(struct irq_data *d) +{ + struct plic_priv *priv = irq_data_get_irq_chip_data(d); + + writel(0, priv->regs + PRIORITY_BASE + d->hwirq * PRIORITY_PER_ID); +} + +static void plic_irq_eoi(struct irq_data *d) +{ + struct plic_handler *handler = this_cpu_ptr(&plic_handlers); + + writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM); +} + #ifdef CONFIG_SMP static int plic_set_affinity(struct irq_data *d, const struct cpumask *mask_val, bool force) @@ -146,32 +165,21 @@ static int plic_set_affinity(struct irq_data *d, if (cpu >= nr_cpu_ids) return -EINVAL; - plic_irq_mask(d); + plic_irq_disable(d); irq_data_update_effective_affinity(d, cpumask_of(cpu)); - if (!irqd_irq_masked(d)) - plic_irq_unmask(d); + if (!irqd_irq_disabled(d)) + plic_irq_enable(d); return IRQ_SET_MASK_OK_DONE; } #endif -static void plic_irq_eoi(struct irq_data *d) -{ - struct plic_handler *handler = this_cpu_ptr(&plic_handlers); - - if (irqd_irq_masked(d)) { - plic_irq_unmask(d); - writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM); - plic_irq_mask(d); - } else { - writel(d->hwirq, handler->hart_base + CONTEXT_CLAIM); - } -} - static struct irq_chip plic_edge_chip = { .name = "SiFive PLIC", + .irq_enable = plic_irq_enable, + .irq_disable = plic_irq_disable, .irq_ack = plic_irq_eoi, .irq_mask = plic_irq_mask, .irq_unmask = plic_irq_unmask, @@ -184,6 +192,8 @@ static struct irq_chip plic_edge_chip = { static struct irq_chip plic_chip = { .name = "SiFive PLIC", + .irq_enable = plic_irq_enable, + .irq_disable = plic_irq_disable, .irq_mask = plic_irq_mask, .irq_unmask = plic_irq_unmask, .irq_eoi = plic_irq_eoi, @@ -429,8 +439,11 @@ static int __init __plic_init(struct device_node *node, i * CONTEXT_ENABLE_SIZE; handler->priv = priv; done: - for (hwirq = 1; hwirq <= nr_irqs; hwirq++) + for (hwirq = 1; hwirq <= nr_irqs; hwirq++) { plic_toggle(handler, hwirq, 0); + writel(1, priv->regs + PRIORITY_BASE + + hwirq * PRIORITY_PER_ID); + } nr_handlers++; } From 7dc487d27f7fef32a79eacb4159636b0ea425a5b Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Mon, 11 Jul 2022 08:55:21 +0100 Subject: [PATCH 276/298] gpio: thunderx: Don't directly include asm-generic/msi.h On architectures that require it, asm-generic/msi.h is already dragged in by the higher level include files, and is commonly refered to as 'asm/msi.h'. It is also architecture specific, and breaks compilation in a pretty bad way now that linux/gpio/driver.h includes asm/msi.h (which drags a conflicting but nonetheless correct version of msi_alloc_info_t on x86). Reported-by: Stephen Rothwell Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220711154252.4b88a601@canb.auug.org.au Link: https://lore.kernel.org/r/20220711081257.132901-1-maz@kernel.org Fixes: 91a29af413de ("gpio: Remove dynamic allocation from populate_parent_alloc_arg()") Cc: Bartosz Golaszewski Cc: Linus Walleij --- drivers/gpio/gpio-thunderx.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpio/gpio-thunderx.c b/drivers/gpio/gpio-thunderx.c index e1dedbca0c85..cc62c6e64103 100644 --- a/drivers/gpio/gpio-thunderx.c +++ b/drivers/gpio/gpio-thunderx.c @@ -15,8 +15,6 @@ #include #include #include -#include - #define GPIO_RX_DAT 0x0 #define GPIO_TX_SET 0x8 From ef6e5d61eb7a0a30f776a829274573094185d03d Mon Sep 17 00:00:00 2001 From: Michael Walle Date: Wed, 6 Jul 2022 17:15:52 +0200 Subject: [PATCH 277/298] genirq: Allow irq_set_chip_handler_name_locked() to take a const irq_chip Similar to commit 393e1280f765 ("genirq: Allow irq_chip registration functions to take a const irq_chip"), allow the irq_set_chip_handler_name_locked() function to take a const irq_chip argument. Signed-off-by: Michael Walle Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220706151553.1580790-1-michael@walle.cc --- include/linux/irqdesc.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h index a77584593f7d..1cd4e36890fb 100644 --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -209,14 +209,15 @@ static inline void irq_set_handler_locked(struct irq_data *data, * Must be called with irq_desc locked and valid parameters. */ static inline void -irq_set_chip_handler_name_locked(struct irq_data *data, struct irq_chip *chip, +irq_set_chip_handler_name_locked(struct irq_data *data, + const struct irq_chip *chip, irq_flow_handler_t handler, const char *name) { struct irq_desc *desc = irq_data_to_desc(data); desc->handle_irq = handler; desc->name = name; - data->chip = chip; + data->chip = (struct irq_chip *)chip; } bool irq_check_status_bit(unsigned int irq, unsigned int bitmask); From 51ff93923e21ed2862e83f208706e3ca31d6f409 Mon Sep 17 00:00:00 2001 From: Michael Walle Date: Wed, 6 Jul 2022 17:15:53 +0200 Subject: [PATCH 278/298] pinctrl: ocelot: Make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: [ 2.593426] gpio gpiochip0: (ocelot-gpio): not an immutable chip, please consider fixing it! Make it const, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Michael Walle Acked-by: Linus Walleij Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220706151553.1580790-2-michael@walle.cc --- drivers/pinctrl/pinctrl-ocelot.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/pinctrl/pinctrl-ocelot.c b/drivers/pinctrl/pinctrl-ocelot.c index 5f4a8c5c6650..425c1a9883fd 100644 --- a/drivers/pinctrl/pinctrl-ocelot.c +++ b/drivers/pinctrl/pinctrl-ocelot.c @@ -1761,6 +1761,7 @@ static void ocelot_irq_mask(struct irq_data *data) regmap_update_bits(info->map, REG(OCELOT_GPIO_INTR_ENA, info, gpio), BIT(gpio % 32), 0); + gpiochip_disable_irq(chip, gpio); } static void ocelot_irq_unmask(struct irq_data *data) @@ -1769,6 +1770,7 @@ static void ocelot_irq_unmask(struct irq_data *data) struct ocelot_pinctrl *info = gpiochip_get_data(chip); unsigned int gpio = irqd_to_hwirq(data); + gpiochip_enable_irq(chip, gpio); regmap_update_bits(info->map, REG(OCELOT_GPIO_INTR_ENA, info, gpio), BIT(gpio % 32), BIT(gpio % 32)); } @@ -1790,8 +1792,10 @@ static struct irq_chip ocelot_eoi_irqchip = { .irq_mask = ocelot_irq_mask, .irq_eoi = ocelot_irq_ack, .irq_unmask = ocelot_irq_unmask, - .flags = IRQCHIP_EOI_THREADED | IRQCHIP_EOI_IF_HANDLED, + .flags = IRQCHIP_EOI_THREADED | IRQCHIP_EOI_IF_HANDLED | + IRQCHIP_IMMUTABLE, .irq_set_type = ocelot_irq_set_type, + GPIOCHIP_IRQ_RESOURCE_HELPERS }; static struct irq_chip ocelot_irqchip = { @@ -1800,6 +1804,8 @@ static struct irq_chip ocelot_irqchip = { .irq_ack = ocelot_irq_ack, .irq_unmask = ocelot_irq_unmask, .irq_set_type = ocelot_irq_set_type, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS }; static int ocelot_irq_set_type(struct irq_data *data, unsigned int type) @@ -1863,7 +1869,7 @@ static int ocelot_gpiochip_register(struct platform_device *pdev, irq = platform_get_irq_optional(pdev, 0); if (irq > 0) { girq = &gc->irq; - girq->chip = &ocelot_irqchip; + gpio_irq_chip_set_chip(girq, &ocelot_irqchip); girq->parent_handler = ocelot_irq_handler; girq->num_parents = 1; girq->parents = devm_kcalloc(&pdev->dev, 1, From 8cfc90ecd33e73f4a30207b316bc6886e3c3a166 Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Mon, 18 Jul 2022 20:37:45 +0100 Subject: [PATCH 279/298] dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/V2L SoC Document RZ/V2L (R9A07G054) IRQC bindings. The RZ/V2L IRQC block is identical to one found on the RZ/G2L SoC. No driver changes are required as generic compatible string "renesas,rzg2l-irqc" will be used as a fallback. While at it, update the comment "# RZ/G2L" to "# RZ/G2{L,LC}" for "renesas,r9a07g044-irqc" compatible string as both RZ/G2L and RZ/G2LC SoC's use the common SoC DTSI and have the same IRQC block. Signed-off-by: Lad Prabhakar Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220718193745.7472-1-prabhakar.mahadev-lad.rj@bp.renesas.com --- .../bindings/interrupt-controller/renesas,rzg2l-irqc.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml index ffbb4ab4d9a7..33b90e975e33 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml @@ -26,7 +26,8 @@ properties: compatible: items: - enum: - - renesas,r9a07g044-irqc # RZ/G2L + - renesas,r9a07g044-irqc # RZ/G2{L,LC} + - renesas,r9a07g054-irqc # RZ/V2L - const: renesas,rzg2l-irqc '#interrupt-cells': From 295171705c9ac98f4626609033f6bab0c2e37ed0 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Fri, 15 Jul 2022 13:12:58 +0800 Subject: [PATCH 280/298] irqchip/gic-v3: Fix comment typo The double `the' is duplicated in line 1786, remove one. Signed-off-by: Jason Wang Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220715051258.28889-1-wangborong@cdjrlc.com --- drivers/irqchip/irq-gic-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 5c1cf907ee68..d28b45f3240a 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1783,7 +1783,7 @@ static void gic_enable_nmi_support(void) * the security state of the GIC (controlled by the GICD_CTRL.DS bit) * and if Group 0 interrupts can be delivered to Linux in the non-secure * world as FIQs (controlled by the SCR_EL3.FIQ bit). These affect the - * the ICC_PMR_EL1 register and the priority that software assigns to + * ICC_PMR_EL1 register and the priority that software assigns to * interrupts: * * GICD_CTRL.DS | SCR_EL3.FIQ | ICC_PMR_EL1 | Group 1 priority From 6f194c99f466147148cc08452718b46664112548 Mon Sep 17 00:00:00 2001 From: Xu Qiang Date: Tue, 19 Jul 2022 06:36:40 +0000 Subject: [PATCH 281/298] irqdomain: Report irq number for NOMAP domains MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When using a NOMAP domain, __irq_resolve_mapping() doesn't store the Linux IRQ number at the address optionally provided by the caller. While this isn't a huge deal (the returned value is guaranteed to the hwirq that was passed as a parameter), let's honour the letter of the API by writing the expected value. Fixes: d22558dd0a6c (“irqdomain: Introduce irq_resolve_mapping()”) Signed-off-by: Xu Qiang [maz: commit message] Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220719063641.56541-2-xuqiang36@huawei.com --- kernel/irq/irqdomain.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index d5ce96510549..481abb885d61 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -910,6 +910,8 @@ struct irq_desc *__irq_resolve_mapping(struct irq_domain *domain, data = irq_domain_get_irq_data(domain, hwirq); if (data && data->hwirq == hwirq) desc = irq_data_to_desc(data); + if (irq && desc) + *irq = hwirq; } return desc; From ef50cd57a73a8bbfad403e5e2edb3309611f58ad Mon Sep 17 00:00:00 2001 From: Xu Qiang Date: Tue, 19 Jul 2022 06:36:41 +0000 Subject: [PATCH 282/298] irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains NOMAP irq domains use the revmap_size field to indicate the maximum hwirq number the domain accepts. This is a bit confusing as revmap_size is usually used to indicate the size of the revmap array, which a NOMAP domain doesn't have. Instead, use the hwirq_max field which has the correct semantics, and keep revmap_size to 0 for a NOMAP domain. Signed-off-by: Xu Qiang [maz: commit message] Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220719063641.56541-3-xuqiang36@huawei.com --- kernel/irq/irqdomain.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 481abb885d61..8fe1da9614ee 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -147,7 +147,8 @@ struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int s static atomic_t unknown_domains; if (WARN_ON((size && direct_max) || - (!IS_ENABLED(CONFIG_IRQ_DOMAIN_NOMAP) && direct_max))) + (!IS_ENABLED(CONFIG_IRQ_DOMAIN_NOMAP) && direct_max) || + (direct_max && (direct_max != hwirq_max)))) return NULL; domain = kzalloc_node(struct_size(domain, revmap, size), @@ -219,7 +220,6 @@ struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int s domain->hwirq_max = hwirq_max; if (direct_max) { - size = direct_max; domain->flags |= IRQ_DOMAIN_FLAG_NO_MAP; } @@ -650,9 +650,9 @@ unsigned int irq_create_direct_mapping(struct irq_domain *domain) pr_debug("create_direct virq allocation failed\n"); return 0; } - if (virq >= domain->revmap_size) { - pr_err("ERROR: no free irqs available below %i maximum\n", - domain->revmap_size); + if (virq >= domain->hwirq_max) { + pr_err("ERROR: no free irqs available below %lu maximum\n", + domain->hwirq_max); irq_free_desc(virq); return 0; } @@ -906,7 +906,7 @@ struct irq_desc *__irq_resolve_mapping(struct irq_domain *domain, return desc; if (irq_domain_is_nomap(domain)) { - if (hwirq < domain->revmap_size) { + if (hwirq < domain->hwirq_max) { data = irq_domain_get_irq_data(domain, hwirq); if (data && data->hwirq == hwirq) desc = irq_data_to_desc(data); From af6a1cfa6859dab4a843ea07f1c2f04938f1715b Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 20 Jul 2022 18:51:20 +0800 Subject: [PATCH 283/298] LoongArch: Provisionally add ACPICA data structures The LoongArch architecture is using ACPI, but the spec containing the required updates still is in an unreleased state. Instead of preventing the inclusion of the IRQ support into the kernel, add the missing bits to the arch-specific parts of the ACPICA support. Once the ACPICA bits are updated to the version that supports LoongArch, these bits can eventually be removed. Signed-off-by: Marc Zyngier Signed-off-by: Jianmin Lv Link: https://lore.kernel.org/r/1658314292-35346-2-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/acpi.h | 142 ++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+) diff --git a/arch/loongarch/include/asm/acpi.h b/arch/loongarch/include/asm/acpi.h index 62044cd5b7bc..c5108213876c 100644 --- a/arch/loongarch/include/asm/acpi.h +++ b/arch/loongarch/include/asm/acpi.h @@ -31,6 +31,148 @@ static inline bool acpi_has_cpu_in_madt(void) extern struct list_head acpi_wakeup_device_list; +/* + * Temporary definitions until the core ACPICA code gets updated (see + * 1656837932-18257-1-git-send-email-lvjianmin@loongson.cn and its + * follow-ups for the "rationale"). + * + * Once the "legal reasons" are cleared and that the code is merged, + * this can be dropped entierely. + */ +#if (ACPI_CA_VERSION == 0x20220331 && !defined(LOONGARCH_ACPICA_EXT)) + +#define LOONGARCH_ACPICA_EXT 1 + +#define ACPI_MADT_TYPE_CORE_PIC 17 +#define ACPI_MADT_TYPE_LIO_PIC 18 +#define ACPI_MADT_TYPE_HT_PIC 19 +#define ACPI_MADT_TYPE_EIO_PIC 20 +#define ACPI_MADT_TYPE_MSI_PIC 21 +#define ACPI_MADT_TYPE_BIO_PIC 22 +#define ACPI_MADT_TYPE_LPC_PIC 23 + +/* Values for Version field above */ + +enum acpi_madt_core_pic_version { + ACPI_MADT_CORE_PIC_VERSION_NONE = 0, + ACPI_MADT_CORE_PIC_VERSION_V1 = 1, + ACPI_MADT_CORE_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_lio_pic_version { + ACPI_MADT_LIO_PIC_VERSION_NONE = 0, + ACPI_MADT_LIO_PIC_VERSION_V1 = 1, + ACPI_MADT_LIO_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_eio_pic_version { + ACPI_MADT_EIO_PIC_VERSION_NONE = 0, + ACPI_MADT_EIO_PIC_VERSION_V1 = 1, + ACPI_MADT_EIO_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_ht_pic_version { + ACPI_MADT_HT_PIC_VERSION_NONE = 0, + ACPI_MADT_HT_PIC_VERSION_V1 = 1, + ACPI_MADT_HT_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_bio_pic_version { + ACPI_MADT_BIO_PIC_VERSION_NONE = 0, + ACPI_MADT_BIO_PIC_VERSION_V1 = 1, + ACPI_MADT_BIO_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_msi_pic_version { + ACPI_MADT_MSI_PIC_VERSION_NONE = 0, + ACPI_MADT_MSI_PIC_VERSION_V1 = 1, + ACPI_MADT_MSI_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +enum acpi_madt_lpc_pic_version { + ACPI_MADT_LPC_PIC_VERSION_NONE = 0, + ACPI_MADT_LPC_PIC_VERSION_V1 = 1, + ACPI_MADT_LPC_PIC_VERSION_RESERVED = 2 /* 2 and greater are reserved */ +}; + +#pragma pack(1) + +/* Core Interrupt Controller */ + +struct acpi_madt_core_pic { + struct acpi_subtable_header header; + u8 version; + u32 processor_id; + u32 core_id; + u32 flags; +}; + +/* Legacy I/O Interrupt Controller */ + +struct acpi_madt_lio_pic { + struct acpi_subtable_header header; + u8 version; + u64 address; + u16 size; + u8 cascade[2]; + u32 cascade_map[2]; +}; + +/* Extend I/O Interrupt Controller */ + +struct acpi_madt_eio_pic { + struct acpi_subtable_header header; + u8 version; + u8 cascade; + u8 node; + u64 node_map; +}; + +/* HT Interrupt Controller */ + +struct acpi_madt_ht_pic { + struct acpi_subtable_header header; + u8 version; + u64 address; + u16 size; + u8 cascade[8]; +}; + +/* Bridge I/O Interrupt Controller */ + +struct acpi_madt_bio_pic { + struct acpi_subtable_header header; + u8 version; + u64 address; + u16 size; + u16 id; + u16 gsi_base; +}; + +/* MSI Interrupt Controller */ + +struct acpi_madt_msi_pic { + struct acpi_subtable_header header; + u8 version; + u64 msg_address; + u32 start; + u32 count; +}; + +/* LPC Interrupt Controller */ + +struct acpi_madt_lpc_pic { + struct acpi_subtable_header header; + u8 version; + u64 address; + u16 size; + u8 cascade; +}; + +#pragma pack() + +#endif + #endif /* !CONFIG_ACPI */ #define ACPI_TABLE_UPGRADE_MAX_PHYS ARCH_LOW_ADDRESS_LIMIT From 7327b16f5f56741960e11ae4d7ef0ffdff5fd252 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 20 Jul 2022 18:51:21 +0800 Subject: [PATCH 284/298] APCI: irq: Add support for multiple GSI domains In an unfortunate departure from the ACPI spec, the LoongArch architecture split its GSI space across multiple interrupt controllers. In order to be able to reuse the core code and prevent architectures from reinventing an already square wheel, offer the arch code the ability to register a dispatcher function that will return the domain fwnode for a given GSI. The ARM GIC drivers are updated to support this (with a single domain, as intended). Signed-off-by: Marc Zyngier Cc: Hanjun Guo Cc: Lorenzo Pieralisi Signed-off-by: Jianmin Lv Tested-by: Hanjun Guo Reviewed-by: Hanjun Guo Link: https://lore.kernel.org/r/1658314292-35346-3-git-send-email-lvjianmin@loongson.cn --- drivers/acpi/irq.c | 40 ++++++++++++++++++++++-------------- drivers/irqchip/irq-gic-v3.c | 18 ++++++++++------ drivers/irqchip/irq-gic.c | 18 ++++++++++------ include/linux/acpi.h | 2 +- 4 files changed, 50 insertions(+), 28 deletions(-) diff --git a/drivers/acpi/irq.c b/drivers/acpi/irq.c index c68e694fca26..f0de76879497 100644 --- a/drivers/acpi/irq.c +++ b/drivers/acpi/irq.c @@ -12,7 +12,7 @@ enum acpi_irq_model_id acpi_irq_model; -static struct fwnode_handle *acpi_gsi_domain_id; +static struct fwnode_handle *(*acpi_get_gsi_domain_id)(u32 gsi); /** * acpi_gsi_to_irq() - Retrieve the linux irq number for a given GSI @@ -26,9 +26,10 @@ static struct fwnode_handle *acpi_gsi_domain_id; */ int acpi_gsi_to_irq(u32 gsi, unsigned int *irq) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d; + d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(gsi), + DOMAIN_BUS_ANY); *irq = irq_find_mapping(d, gsi); /* * *irq == 0 means no mapping, that should @@ -53,12 +54,12 @@ int acpi_register_gsi(struct device *dev, u32 gsi, int trigger, { struct irq_fwspec fwspec; - if (WARN_ON(!acpi_gsi_domain_id)) { + fwspec.fwnode = acpi_get_gsi_domain_id(gsi); + if (WARN_ON(!fwspec.fwnode)) { pr_warn("GSI: No registered irqchip, giving up\n"); return -EINVAL; } - fwspec.fwnode = acpi_gsi_domain_id; fwspec.param[0] = gsi; fwspec.param[1] = acpi_dev_get_irq_type(trigger, polarity); fwspec.param_count = 2; @@ -73,13 +74,14 @@ EXPORT_SYMBOL_GPL(acpi_register_gsi); */ void acpi_unregister_gsi(u32 gsi) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d; int irq; if (WARN_ON(acpi_irq_model == ACPI_IRQ_MODEL_GIC && gsi < 16)) return; + d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(gsi), + DOMAIN_BUS_ANY); irq = irq_find_mapping(d, gsi); irq_dispose_mapping(irq); } @@ -97,7 +99,8 @@ EXPORT_SYMBOL_GPL(acpi_unregister_gsi); * The referenced device fwhandle or NULL on failure */ static struct fwnode_handle * -acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source) +acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source, + u32 gsi) { struct fwnode_handle *result; struct acpi_device *device; @@ -105,7 +108,7 @@ acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source) acpi_status status; if (!source->string_length) - return acpi_gsi_domain_id; + return acpi_get_gsi_domain_id(gsi); status = acpi_get_handle(NULL, source->string_ptr, &handle); if (WARN_ON(ACPI_FAILURE(status))) @@ -194,7 +197,7 @@ static acpi_status acpi_irq_parse_one_cb(struct acpi_resource *ares, ctx->index -= irq->interrupt_count; return AE_OK; } - fwnode = acpi_gsi_domain_id; + fwnode = acpi_get_gsi_domain_id(irq->interrupts[ctx->index]); acpi_irq_parse_one_match(fwnode, irq->interrupts[ctx->index], irq->triggering, irq->polarity, irq->shareable, ctx); @@ -207,7 +210,8 @@ static acpi_status acpi_irq_parse_one_cb(struct acpi_resource *ares, ctx->index -= eirq->interrupt_count; return AE_OK; } - fwnode = acpi_get_irq_source_fwhandle(&eirq->resource_source); + fwnode = acpi_get_irq_source_fwhandle(&eirq->resource_source, + eirq->interrupts[ctx->index]); acpi_irq_parse_one_match(fwnode, eirq->interrupts[ctx->index], eirq->triggering, eirq->polarity, eirq->shareable, ctx); @@ -291,10 +295,10 @@ EXPORT_SYMBOL_GPL(acpi_irq_get); * GSI interrupts */ void __init acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *fwnode) + struct fwnode_handle *(*fn)(u32)) { acpi_irq_model = model; - acpi_gsi_domain_id = fwnode; + acpi_get_gsi_domain_id = fn; } /** @@ -312,8 +316,14 @@ struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, const struct irq_domain_ops *ops, void *host_data) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d; + + /* This only works for the GIC model... */ + if (acpi_irq_model != ACPI_IRQ_MODEL_GIC) + return NULL; + + d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(0), + DOMAIN_BUS_ANY); if (!d) return NULL; diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 5c1cf907ee68..c66470378074 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -2360,11 +2360,17 @@ static void __init gic_acpi_setup_kvm_info(void) vgic_set_kvm_info(&gic_v3_kvm_info); } +static struct fwnode_handle *gsi_domain_handle; + +static struct fwnode_handle *gic_v3_get_gsi_domain_id(u32 gsi) +{ + return gsi_domain_handle; +} + static int __init gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_distributor *dist; - struct fwnode_handle *domain_handle; size_t size; int i, err; @@ -2396,18 +2402,18 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) if (err) goto out_redist_unmap; - domain_handle = irq_domain_alloc_fwnode(&dist->base_address); - if (!domain_handle) { + gsi_domain_handle = irq_domain_alloc_fwnode(&dist->base_address); + if (!gsi_domain_handle) { err = -ENOMEM; goto out_redist_unmap; } err = gic_init_bases(acpi_data.dist_base, acpi_data.redist_regs, - acpi_data.nr_redist_regions, 0, domain_handle); + acpi_data.nr_redist_regions, 0, gsi_domain_handle); if (err) goto out_fwhandle_free; - acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, domain_handle); + acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, gic_v3_get_gsi_domain_id); if (static_branch_likely(&supports_deactivate_key)) gic_acpi_setup_kvm_info(); @@ -2415,7 +2421,7 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) return 0; out_fwhandle_free: - irq_domain_free_fwnode(domain_handle); + irq_domain_free_fwnode(gsi_domain_handle); out_redist_unmap: for (i = 0; i < acpi_data.nr_redist_regions; i++) if (acpi_data.redist_regs[i].redist_base) diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 820404cb56bc..4c7bae0ec8f9 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -1682,11 +1682,17 @@ static void __init gic_acpi_setup_kvm_info(void) vgic_set_kvm_info(&gic_v2_kvm_info); } +static struct fwnode_handle *gsi_domain_handle; + +static struct fwnode_handle *gic_v2_get_gsi_domain_id(u32 gsi) +{ + return gsi_domain_handle; +} + static int __init gic_v2_acpi_init(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_distributor *dist; - struct fwnode_handle *domain_handle; struct gic_chip_data *gic = &gic_data[0]; int count, ret; @@ -1724,22 +1730,22 @@ static int __init gic_v2_acpi_init(union acpi_subtable_headers *header, /* * Initialize GIC instance zero (no multi-GIC support). */ - domain_handle = irq_domain_alloc_fwnode(&dist->base_address); - if (!domain_handle) { + gsi_domain_handle = irq_domain_alloc_fwnode(&dist->base_address); + if (!gsi_domain_handle) { pr_err("Unable to allocate domain handle\n"); gic_teardown(gic); return -ENOMEM; } - ret = __gic_init_bases(gic, domain_handle); + ret = __gic_init_bases(gic, gsi_domain_handle); if (ret) { pr_err("Failed to initialise GIC\n"); - irq_domain_free_fwnode(domain_handle); + irq_domain_free_fwnode(gsi_domain_handle); gic_teardown(gic); return ret; } - acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, domain_handle); + acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, gic_v2_get_gsi_domain_id); if (IS_ENABLED(CONFIG_ARM_GIC_V2M)) gicv2m_init(NULL, gic_data[0].domain); diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 4f82a5bc6d98..957e23f727ea 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -356,7 +356,7 @@ int acpi_gsi_to_irq (u32 gsi, unsigned int *irq); int acpi_isa_irq_to_gsi (unsigned isa_irq, u32 *gsi); void acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *fwnode); + struct fwnode_handle *(*)(u32)); struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, unsigned int size, From 744b9a0c3c8334d705dea6af3645ea30d597c360 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 20 Jul 2022 18:51:22 +0800 Subject: [PATCH 285/298] ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback It appears that the generic version of acpi_gsi_to_irq() doesn't fallback to establishing a mapping if there is no pre-existing one while the x86 version does. While arm64 seems unaffected by it, LoongArch is relying on the x86 behaviour. In an effort to prevent new architectures from reinventing the proverbial wheel, provide an optional callback that the arch code can set to restore the x86 behaviour. Hopefully we can eventually get rid of this in the future once the expected behaviour has been clarified. Reported-by: Jianmin Lv Signed-off-by: Marc Zyngier Signed-off-by: Jianmin Lv Tested-by: Hanjun Guo Reviewed-by: Hanjun Guo Link: https://lore.kernel.org/r/1658314292-35346-4-git-send-email-lvjianmin@loongson.cn --- drivers/acpi/irq.c | 18 ++++++++++++++++-- include/linux/acpi.h | 1 + 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/irq.c b/drivers/acpi/irq.c index f0de76879497..dabe45eba055 100644 --- a/drivers/acpi/irq.c +++ b/drivers/acpi/irq.c @@ -13,6 +13,7 @@ enum acpi_irq_model_id acpi_irq_model; static struct fwnode_handle *(*acpi_get_gsi_domain_id)(u32 gsi); +static u32 (*acpi_gsi_to_irq_fallback)(u32 gsi); /** * acpi_gsi_to_irq() - Retrieve the linux irq number for a given GSI @@ -32,9 +33,12 @@ int acpi_gsi_to_irq(u32 gsi, unsigned int *irq) DOMAIN_BUS_ANY); *irq = irq_find_mapping(d, gsi); /* - * *irq == 0 means no mapping, that should - * be reported as a failure + * *irq == 0 means no mapping, that should be reported as a + * failure, unless there is an arch-specific fallback handler. */ + if (!*irq && acpi_gsi_to_irq_fallback) + *irq = acpi_gsi_to_irq_fallback(gsi); + return (*irq > 0) ? 0 : -EINVAL; } EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); @@ -301,6 +305,16 @@ void __init acpi_set_irq_model(enum acpi_irq_model_id model, acpi_get_gsi_domain_id = fn; } +/** + * acpi_set_gsi_to_irq_fallback - Register a GSI transfer + * callback to fallback to arch specified implementation. + * @fn: arch-specific fallback handler + */ +void __init acpi_set_gsi_to_irq_fallback(u32 (*fn)(u32)) +{ + acpi_gsi_to_irq_fallback = fn; +} + /** * acpi_irq_create_hierarchy - Create a hierarchical IRQ domain with the default * GSI domain as its parent. diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 957e23f727ea..e2b60d53ca60 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -357,6 +357,7 @@ int acpi_isa_irq_to_gsi (unsigned isa_irq, u32 *gsi); void acpi_set_irq_model(enum acpi_irq_model_id model, struct fwnode_handle *(*)(u32)); +void acpi_set_gsi_to_irq_fallback(u32 (*)(u32)); struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, unsigned int size, From d319a299f4066685a787cfb89ad36fd78bb830ed Mon Sep 17 00:00:00 2001 From: Jianmin Lv Date: Wed, 20 Jul 2022 18:51:23 +0800 Subject: [PATCH 286/298] genirq/generic_chip: Export irq_unmap_generic_chip Some irq controllers have to re-implement a private version for irq_generic_chip_ops, because they have a different xlate to translate hwirq. Export irq_unmap_generic_chip to allow reusing in drivers. Signed-off-by: Jianmin Lv Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-5-git-send-email-lvjianmin@loongson.cn --- include/linux/irq.h | 1 + kernel/irq/generic-chip.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index 505308253d23..83a4574308fa 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1121,6 +1121,7 @@ int irq_gc_set_wake(struct irq_data *d, unsigned int on); /* Setup functions for irq_chip_generic */ int irq_map_generic_chip(struct irq_domain *d, unsigned int virq, irq_hw_number_t hw_irq); +void irq_unmap_generic_chip(struct irq_domain *d, unsigned int virq); struct irq_chip_generic * irq_alloc_generic_chip(const char *name, int nr_ct, unsigned int irq_base, void __iomem *reg_base, irq_flow_handler_t handler); diff --git a/kernel/irq/generic-chip.c b/kernel/irq/generic-chip.c index f0862eb6b506..c653cd31548d 100644 --- a/kernel/irq/generic-chip.c +++ b/kernel/irq/generic-chip.c @@ -431,7 +431,7 @@ int irq_map_generic_chip(struct irq_domain *d, unsigned int virq, return 0; } -static void irq_unmap_generic_chip(struct irq_domain *d, unsigned int virq) +void irq_unmap_generic_chip(struct irq_domain *d, unsigned int virq) { struct irq_data *data = irq_domain_get_irq_data(d, virq); struct irq_domain_chip_generic *dgc = d->gc; From cd057667585411fbecc0c140727177d7d707c63a Mon Sep 17 00:00:00 2001 From: Jianmin Lv Date: Wed, 20 Jul 2022 18:51:24 +0800 Subject: [PATCH 287/298] LoongArch: Use ACPI_GENERIC_GSI for gsi handling For LoongArch, generic gsi code(driver/acpi/irq.c) can be reused after following patchs: APCI: irq: Add support for multiple GSI domains ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback So, config ACPI_GENERIC_GSI for LoongArch with removing the gsi code in arch directory. Signed-off-by: Jianmin Lv Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-6-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/Kconfig | 1 + arch/loongarch/kernel/acpi.c | 65 ------------------------------------ 2 files changed, 1 insertion(+), 65 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 1920d52653b4..fb1e73adbd1d 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -2,6 +2,7 @@ config LOONGARCH bool default y + select ACPI_GENERIC_GSI if ACPI select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI select ARCH_BINFMT_ELF_STATE select ARCH_ENABLE_MEMORY_HOTPLUG diff --git a/arch/loongarch/kernel/acpi.c b/arch/loongarch/kernel/acpi.c index bb729ee8a237..03aa14581d0a 100644 --- a/arch/loongarch/kernel/acpi.c +++ b/arch/loongarch/kernel/acpi.c @@ -25,7 +25,6 @@ EXPORT_SYMBOL(acpi_pci_disabled); int acpi_strict = 1; /* We have no workarounds on LoongArch */ int num_processors; int disabled_cpus; -enum acpi_irq_model_id acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM; u64 acpi_saved_sp; @@ -33,70 +32,6 @@ u64 acpi_saved_sp; #define PREFIX "ACPI: " -int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) -{ - if (irqp != NULL) - *irqp = acpi_register_gsi(NULL, gsi, -1, -1); - return (*irqp >= 0) ? 0 : -EINVAL; -} -EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); - -int acpi_isa_irq_to_gsi(unsigned int isa_irq, u32 *gsi) -{ - if (gsi) - *gsi = isa_irq; - return 0; -} - -/* - * success: return IRQ number (>=0) - * failure: return < 0 - */ -int acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity) -{ - struct irq_fwspec fwspec; - - switch (gsi) { - case GSI_MIN_CPU_IRQ ... GSI_MAX_CPU_IRQ: - fwspec.fwnode = liointc_domain->fwnode; - fwspec.param[0] = gsi - GSI_MIN_CPU_IRQ; - fwspec.param_count = 1; - - return irq_create_fwspec_mapping(&fwspec); - - case GSI_MIN_LPC_IRQ ... GSI_MAX_LPC_IRQ: - if (!pch_lpc_domain) - return -EINVAL; - - fwspec.fwnode = pch_lpc_domain->fwnode; - fwspec.param[0] = gsi - GSI_MIN_LPC_IRQ; - fwspec.param[1] = acpi_dev_get_irq_type(trigger, polarity); - fwspec.param_count = 2; - - return irq_create_fwspec_mapping(&fwspec); - - case GSI_MIN_PCH_IRQ ... GSI_MAX_PCH_IRQ: - if (!pch_pic_domain[0]) - return -EINVAL; - - fwspec.fwnode = pch_pic_domain[0]->fwnode; - fwspec.param[0] = gsi - GSI_MIN_PCH_IRQ; - fwspec.param[1] = IRQ_TYPE_LEVEL_HIGH; - fwspec.param_count = 2; - - return irq_create_fwspec_mapping(&fwspec); - } - - return -EINVAL; -} -EXPORT_SYMBOL_GPL(acpi_register_gsi); - -void acpi_unregister_gsi(u32 gsi) -{ - -} -EXPORT_SYMBOL_GPL(acpi_unregister_gsi); - void __init __iomem * __acpi_map_table(unsigned long phys, unsigned long size) { From 2dfded47da329a0dd619144a6bb43aefc13a77ba Mon Sep 17 00:00:00 2001 From: Jianmin Lv Date: Wed, 20 Jul 2022 18:51:25 +0800 Subject: [PATCH 288/298] LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain For systems with two chipsets, there are two related pch-pic and pch-msi irqdomains, each of which has the same node id as its parent irqdomain. So we use a structure to mantain the relation of node and it's parent irqdomain as pch irqdomin, the 'pci_segment' field is only used to match the pci segment of a pci device when setting msi irqdomain for the device. struct acpi_vector_group { int node; int pci_segment; struct irq_domain *parent; }; The field 'pci_segment' and 'node' are initialized from MCFG, and the parent irqdomain driver will set field 'parent' by matching same 'node'. Signed-off-by: Jianmin Lv Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-7-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 8 +++++++ arch/loongarch/kernel/irq.c | 38 ++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index ace3ea6da72e..a2540d7c533d 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -48,6 +48,14 @@ void arch_trigger_cpumask_backtrace(const struct cpumask *mask, bool exclude_sel #define MAX_IO_PICS 2 #define NR_IRQS (64 + (256 * MAX_IO_PICS)) +struct acpi_vector_group { + int node; + int pci_segment; + struct irq_domain *parent; +}; +extern struct acpi_vector_group pch_group[MAX_IO_PICS]; +extern struct acpi_vector_group msi_group[MAX_IO_PICS]; + #define CORES_PER_EIO_NODE 4 #define LOONGSON_CPU_UART0_VEC 10 /* CPU UART0 */ diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index b34b8d792aa4..37dd2dca8221 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -31,6 +31,8 @@ struct irq_domain *pch_lpc_domain; struct irq_domain *pch_msi_domain[MAX_IO_PICS]; struct irq_domain *pch_pic_domain[MAX_IO_PICS]; +struct acpi_vector_group pch_group[MAX_IO_PICS]; +struct acpi_vector_group msi_group[MAX_IO_PICS]; /* * 'what should we do if we get a hw irq event on an illegal vector'. * each architecture has to answer this themselves. @@ -56,6 +58,41 @@ int arch_show_interrupts(struct seq_file *p, int prec) return 0; } +static int __init early_pci_mcfg_parse(struct acpi_table_header *header) +{ + struct acpi_table_mcfg *mcfg; + struct acpi_mcfg_allocation *mptr; + int i, n; + + if (header->length < sizeof(struct acpi_table_mcfg)) + return -EINVAL; + + n = (header->length - sizeof(struct acpi_table_mcfg)) / + sizeof(struct acpi_mcfg_allocation); + mcfg = (struct acpi_table_mcfg *)header; + mptr = (struct acpi_mcfg_allocation *) &mcfg[1]; + + for (i = 0; i < n; i++, mptr++) { + msi_group[i].pci_segment = mptr->pci_segment; + pch_group[i].node = msi_group[i].node = (mptr->address >> 44) & 0xf; + } + + return 0; +} + +static void __init init_vec_parent_group(void) +{ + int i; + + for (i = 0; i < MAX_IO_PICS; i++) { + msi_group[i].pci_segment = -1; + msi_group[i].node = -1; + pch_group[i].node = -1; + } + + acpi_table_parse(ACPI_SIG_MCFG, early_pci_mcfg_parse); +} + void __init init_IRQ(void) { int i; @@ -69,6 +106,7 @@ void __init init_IRQ(void) clear_csr_ecfg(ECFG0_IM); clear_csr_estat(ESTATF_IP); + init_vec_parent_group(); irqchip_init(); #ifdef CONFIG_SMP ipi_irq = EXCCODE_IPI - EXCCODE_INT_START; From ee73f14ee9eb7e1a04051b303b56130c4dd6e048 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:26 +0800 Subject: [PATCH 289/298] irqchip: Add Loongson PCH LPC controller support PCH-LPC stands for "LPC Interrupts" that described in Section 24.3 of "Loongson 7A1000 Bridge User Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-8-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 4 +- arch/loongarch/kernel/irq.c | 1 - drivers/irqchip/Kconfig | 8 + drivers/irqchip/Makefile | 1 + drivers/irqchip/irq-loongson-pch-lpc.c | 205 +++++++++++++++++++++++++ 5 files changed, 216 insertions(+), 3 deletions(-) create mode 100644 drivers/irqchip/irq-loongson-pch-lpc.c diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index a2540d7c533d..76a7b364f58b 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -112,7 +112,7 @@ struct irq_domain *eiointc_acpi_init(struct irq_domain *parent, struct irq_domain *htvec_acpi_init(struct irq_domain *parent, struct acpi_madt_ht_pic *acpi_htvec); -struct irq_domain *pch_lpc_acpi_init(struct irq_domain *parent, +int pch_lpc_acpi_init(struct irq_domain *parent, struct acpi_madt_lpc_pic *acpi_pchlpc); struct irq_domain *pch_msi_acpi_init(struct irq_domain *parent, struct acpi_madt_msi_pic *acpi_pchmsi); @@ -129,7 +129,7 @@ extern struct acpi_madt_bio_pic *acpi_pchpic[MAX_IO_PICS]; extern struct irq_domain *cpu_domain; extern struct irq_domain *liointc_domain; -extern struct irq_domain *pch_lpc_domain; +extern struct fwnode_handle *pch_lpc_handle; extern struct irq_domain *pch_msi_domain[MAX_IO_PICS]; extern struct irq_domain *pch_pic_domain[MAX_IO_PICS]; diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index 37dd2dca8221..181504ba7e90 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -27,7 +27,6 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); struct irq_domain *cpu_domain; struct irq_domain *liointc_domain; -struct irq_domain *pch_lpc_domain; struct irq_domain *pch_msi_domain[MAX_IO_PICS]; struct irq_domain *pch_pic_domain[MAX_IO_PICS]; diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 1f23a6be7d88..c1d527ffe0d2 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -591,6 +591,14 @@ config LOONGSON_PCH_MSI help Support for the Loongson PCH MSI Controller. +config LOONGSON_PCH_LPC + bool "Loongson PCH LPC Controller" + depends on MACH_LOONGSON64 + default (MACH_LOONGSON64 && LOONGARCH) + select IRQ_DOMAIN_HIERARCHY + help + Support for the Loongson PCH LPC Controller. + config MST_IRQ bool "MStar Interrupt Controller" depends on ARCH_MEDIATEK || ARCH_MSTARV7 || COMPILE_TEST diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index 5b67450a9538..242b8b3568f8 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -108,6 +108,7 @@ obj-$(CONFIG_LOONGSON_HTPIC) += irq-loongson-htpic.o obj-$(CONFIG_LOONGSON_HTVEC) += irq-loongson-htvec.o obj-$(CONFIG_LOONGSON_PCH_PIC) += irq-loongson-pch-pic.o obj-$(CONFIG_LOONGSON_PCH_MSI) += irq-loongson-pch-msi.o +obj-$(CONFIG_LOONGSON_PCH_LPC) += irq-loongson-pch-lpc.o obj-$(CONFIG_MST_IRQ) += irq-mst-intc.o obj-$(CONFIG_SL28CPLD_INTC) += irq-sl28cpld.o obj-$(CONFIG_MACH_REALTEK_RTL) += irq-realtek-rtl.o diff --git a/drivers/irqchip/irq-loongson-pch-lpc.c b/drivers/irqchip/irq-loongson-pch-lpc.c new file mode 100644 index 000000000000..bf2324910a75 --- /dev/null +++ b/drivers/irqchip/irq-loongson-pch-lpc.c @@ -0,0 +1,205 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Loongson LPC Interrupt Controller support + * + * Copyright (C) 2020-2022 Loongson Technology Corporation Limited + */ + +#define pr_fmt(fmt) "lpc: " fmt + +#include +#include +#include +#include +#include +#include + +/* Registers */ +#define LPC_INT_CTL 0x00 +#define LPC_INT_ENA 0x04 +#define LPC_INT_STS 0x08 +#define LPC_INT_CLR 0x0c +#define LPC_INT_POL 0x10 +#define LPC_COUNT 16 + +/* LPC_INT_CTL */ +#define LPC_INT_CTL_EN BIT(31) + +struct pch_lpc { + void __iomem *base; + struct irq_domain *lpc_domain; + raw_spinlock_t lpc_lock; + u32 saved_reg_ctl; + u32 saved_reg_ena; + u32 saved_reg_pol; +}; + +struct fwnode_handle *pch_lpc_handle; + +static void lpc_irq_ack(struct irq_data *d) +{ + unsigned long flags; + struct pch_lpc *priv = d->domain->host_data; + + raw_spin_lock_irqsave(&priv->lpc_lock, flags); + writel(0x1 << d->hwirq, priv->base + LPC_INT_CLR); + raw_spin_unlock_irqrestore(&priv->lpc_lock, flags); +} + +static void lpc_irq_mask(struct irq_data *d) +{ + unsigned long flags; + struct pch_lpc *priv = d->domain->host_data; + + raw_spin_lock_irqsave(&priv->lpc_lock, flags); + writel(readl(priv->base + LPC_INT_ENA) & (~(0x1 << (d->hwirq))), + priv->base + LPC_INT_ENA); + raw_spin_unlock_irqrestore(&priv->lpc_lock, flags); +} + +static void lpc_irq_unmask(struct irq_data *d) +{ + unsigned long flags; + struct pch_lpc *priv = d->domain->host_data; + + raw_spin_lock_irqsave(&priv->lpc_lock, flags); + writel(readl(priv->base + LPC_INT_ENA) | (0x1 << (d->hwirq)), + priv->base + LPC_INT_ENA); + raw_spin_unlock_irqrestore(&priv->lpc_lock, flags); +} + +static int lpc_irq_set_type(struct irq_data *d, unsigned int type) +{ + u32 val; + u32 mask = 0x1 << (d->hwirq); + struct pch_lpc *priv = d->domain->host_data; + + if (!(type & IRQ_TYPE_LEVEL_MASK)) + return 0; + + val = readl(priv->base + LPC_INT_POL); + + if (type == IRQ_TYPE_LEVEL_HIGH) + val |= mask; + else + val &= ~mask; + + writel(val, priv->base + LPC_INT_POL); + + return 0; +} + +static const struct irq_chip pch_lpc_irq_chip = { + .name = "PCH LPC", + .irq_mask = lpc_irq_mask, + .irq_unmask = lpc_irq_unmask, + .irq_ack = lpc_irq_ack, + .irq_set_type = lpc_irq_set_type, + .flags = IRQCHIP_SKIP_SET_WAKE, +}; + +static void lpc_irq_dispatch(struct irq_desc *desc) +{ + u32 pending, bit; + struct irq_chip *chip = irq_desc_get_chip(desc); + struct pch_lpc *priv = irq_desc_get_handler_data(desc); + + chained_irq_enter(chip, desc); + + pending = readl(priv->base + LPC_INT_ENA); + pending &= readl(priv->base + LPC_INT_STS); + if (!pending) + spurious_interrupt(); + + while (pending) { + bit = __ffs(pending); + + generic_handle_domain_irq(priv->lpc_domain, bit); + pending &= ~BIT(bit); + } + chained_irq_exit(chip, desc); +} + +static int pch_lpc_map(struct irq_domain *d, unsigned int irq, + irq_hw_number_t hw) +{ + irq_set_chip_and_handler(irq, &pch_lpc_irq_chip, handle_level_irq); + return 0; +} + +static const struct irq_domain_ops pch_lpc_domain_ops = { + .map = pch_lpc_map, + .translate = irq_domain_translate_twocell, +}; + +static void pch_lpc_reset(struct pch_lpc *priv) +{ + /* Enable the LPC interrupt, bit31: en bit30: edge */ + writel(LPC_INT_CTL_EN, priv->base + LPC_INT_CTL); + writel(0, priv->base + LPC_INT_ENA); + /* Clear all 18-bit interrpt bit */ + writel(GENMASK(17, 0), priv->base + LPC_INT_CLR); +} + +static int pch_lpc_disabled(struct pch_lpc *priv) +{ + return (readl(priv->base + LPC_INT_ENA) == 0xffffffff) && + (readl(priv->base + LPC_INT_STS) == 0xffffffff); +} + +int __init pch_lpc_acpi_init(struct irq_domain *parent, + struct acpi_madt_lpc_pic *acpi_pchlpc) +{ + int parent_irq; + struct pch_lpc *priv; + struct irq_fwspec fwspec; + struct fwnode_handle *irq_handle; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + raw_spin_lock_init(&priv->lpc_lock); + + priv->base = ioremap(acpi_pchlpc->address, acpi_pchlpc->size); + if (!priv->base) + goto free_priv; + + if (pch_lpc_disabled(priv)) { + pr_err("Failed to get LPC status\n"); + goto iounmap_base; + } + + irq_handle = irq_domain_alloc_named_fwnode("lpcintc"); + if (!irq_handle) { + pr_err("Unable to allocate domain handle\n"); + goto iounmap_base; + } + + priv->lpc_domain = irq_domain_create_linear(irq_handle, LPC_COUNT, + &pch_lpc_domain_ops, priv); + if (!priv->lpc_domain) { + pr_err("Failed to create IRQ domain\n"); + goto free_irq_handle; + } + pch_lpc_reset(priv); + + fwspec.fwnode = parent->fwnode; + fwspec.param[0] = acpi_pchlpc->cascade + GSI_MIN_PCH_IRQ; + fwspec.param[1] = IRQ_TYPE_LEVEL_HIGH; + fwspec.param_count = 2; + parent_irq = irq_create_fwspec_mapping(&fwspec); + irq_set_chained_handler_and_data(parent_irq, lpc_irq_dispatch, priv); + + pch_lpc_handle = irq_handle; + return 0; + +free_irq_handle: + irq_domain_free_fwnode(irq_handle); +iounmap_base: + iounmap(priv->base); +free_priv: + kfree(priv); + + return -ENOMEM; +} From bcdd75c596c89d7925a3438fde2578ca23a62b06 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:27 +0800 Subject: [PATCH 290/298] irqchip/loongson-pch-pic: Add ACPI init support PCH-PIC/PCH-MSI stands for "Interrupt Controller" that described in Section 5 of "Loongson 7A1000 Bridge User Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-9-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 5 +- arch/loongarch/kernel/irq.c | 1 - arch/mips/include/asm/mach-loongson64/irq.h | 2 +- drivers/irqchip/Kconfig | 2 +- drivers/irqchip/irq-loongson-pch-pic.c | 177 ++++++++++++++++---- 5 files changed, 151 insertions(+), 36 deletions(-) diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index 76a7b364f58b..9549806858bd 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -116,8 +116,9 @@ int pch_lpc_acpi_init(struct irq_domain *parent, struct acpi_madt_lpc_pic *acpi_pchlpc); struct irq_domain *pch_msi_acpi_init(struct irq_domain *parent, struct acpi_madt_msi_pic *acpi_pchmsi); -struct irq_domain *pch_pic_acpi_init(struct irq_domain *parent, +int pch_pic_acpi_init(struct irq_domain *parent, struct acpi_madt_bio_pic *acpi_pchpic); +int find_pch_pic(u32 gsi); extern struct acpi_madt_lio_pic *acpi_liointc; extern struct acpi_madt_eio_pic *acpi_eiointc[MAX_IO_PICS]; @@ -131,7 +132,7 @@ extern struct irq_domain *cpu_domain; extern struct irq_domain *liointc_domain; extern struct fwnode_handle *pch_lpc_handle; extern struct irq_domain *pch_msi_domain[MAX_IO_PICS]; -extern struct irq_domain *pch_pic_domain[MAX_IO_PICS]; +extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS]; extern irqreturn_t loongson3_ipi_interrupt(int irq, void *dev); diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index 181504ba7e90..575b8de08289 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -28,7 +28,6 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); struct irq_domain *cpu_domain; struct irq_domain *liointc_domain; struct irq_domain *pch_msi_domain[MAX_IO_PICS]; -struct irq_domain *pch_pic_domain[MAX_IO_PICS]; struct acpi_vector_group pch_group[MAX_IO_PICS]; struct acpi_vector_group msi_group[MAX_IO_PICS]; diff --git a/arch/mips/include/asm/mach-loongson64/irq.h b/arch/mips/include/asm/mach-loongson64/irq.h index 98ea977cf0b8..55e0dee12cb0 100644 --- a/arch/mips/include/asm/mach-loongson64/irq.h +++ b/arch/mips/include/asm/mach-loongson64/irq.h @@ -7,7 +7,7 @@ #define NR_MIPS_CPU_IRQS 8 #define NR_MAX_CHAINED_IRQS 40 /* Chained IRQs means those not directly used by devices */ #define NR_IRQS (NR_IRQS_LEGACY + NR_MIPS_CPU_IRQS + NR_MAX_CHAINED_IRQS + 256) - +#define MAX_IO_PICS 1 #define MIPS_CPU_IRQ_BASE NR_IRQS_LEGACY #include diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index c1d527ffe0d2..f62bdeca38c5 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -574,7 +574,7 @@ config LOONGSON_HTVEC config LOONGSON_PCH_PIC bool "Loongson PCH PIC Controller" - depends on MACH_LOONGSON64 || COMPILE_TEST + depends on MACH_LOONGSON64 default MACH_LOONGSON64 select IRQ_DOMAIN_HIERARCHY select IRQ_FASTEOI_HIERARCHY_HANDLERS diff --git a/drivers/irqchip/irq-loongson-pch-pic.c b/drivers/irqchip/irq-loongson-pch-pic.c index a4eb8a2181c7..b6f1392964b1 100644 --- a/drivers/irqchip/irq-loongson-pch-pic.c +++ b/drivers/irqchip/irq-loongson-pch-pic.c @@ -33,13 +33,40 @@ #define PIC_REG_IDX(irq_id) ((irq_id) / PIC_COUNT_PER_REG) #define PIC_REG_BIT(irq_id) ((irq_id) % PIC_COUNT_PER_REG) +static int nr_pics; + struct pch_pic { void __iomem *base; struct irq_domain *pic_domain; u32 ht_vec_base; raw_spinlock_t pic_lock; + u32 vec_count; + u32 gsi_base; }; +static struct pch_pic *pch_pic_priv[MAX_IO_PICS]; + +struct fwnode_handle *pch_pic_handle[MAX_IO_PICS]; + +int find_pch_pic(u32 gsi) +{ + int i; + + /* Find the PCH_PIC that manages this GSI. */ + for (i = 0; i < MAX_IO_PICS; i++) { + struct pch_pic *priv = pch_pic_priv[i]; + + if (!priv) + return -1; + + if (gsi >= priv->gsi_base && gsi < (priv->gsi_base + priv->vec_count)) + return i; + } + + pr_err("ERROR: Unable to locate PCH_PIC for GSI %d\n", gsi); + return -1; +} + static void pch_pic_bitset(struct pch_pic *priv, int offset, int bit) { u32 reg; @@ -139,6 +166,28 @@ static struct irq_chip pch_pic_irq_chip = { .irq_set_type = pch_pic_set_type, }; +static int pch_pic_domain_translate(struct irq_domain *d, + struct irq_fwspec *fwspec, + unsigned long *hwirq, + unsigned int *type) +{ + struct pch_pic *priv = d->host_data; + struct device_node *of_node = to_of_node(fwspec->fwnode); + + if (fwspec->param_count < 1) + return -EINVAL; + + if (of_node) { + *hwirq = fwspec->param[0] + priv->ht_vec_base; + *type = fwspec->param[1] & IRQ_TYPE_SENSE_MASK; + } else { + *hwirq = fwspec->param[0] - priv->gsi_base; + *type = IRQ_TYPE_NONE; + } + + return 0; +} + static int pch_pic_alloc(struct irq_domain *domain, unsigned int virq, unsigned int nr_irqs, void *arg) { @@ -149,13 +198,13 @@ static int pch_pic_alloc(struct irq_domain *domain, unsigned int virq, struct irq_fwspec parent_fwspec; struct pch_pic *priv = domain->host_data; - err = irq_domain_translate_twocell(domain, fwspec, &hwirq, &type); + err = pch_pic_domain_translate(domain, fwspec, &hwirq, &type); if (err) return err; parent_fwspec.fwnode = domain->parent->fwnode; parent_fwspec.param_count = 1; - parent_fwspec.param[0] = hwirq + priv->ht_vec_base; + parent_fwspec.param[0] = hwirq; err = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec); if (err) @@ -170,7 +219,7 @@ static int pch_pic_alloc(struct irq_domain *domain, unsigned int virq, } static const struct irq_domain_ops pch_pic_domain_ops = { - .translate = irq_domain_translate_twocell, + .translate = pch_pic_domain_translate, .alloc = pch_pic_alloc, .free = irq_domain_free_irqs_parent, }; @@ -180,7 +229,7 @@ static void pch_pic_reset(struct pch_pic *priv) int i; for (i = 0; i < PIC_COUNT; i++) { - /* Write vectored ID */ + /* Write vector ID */ writeb(priv->ht_vec_base + i, priv->base + PCH_INT_HTVEC(i)); /* Hardcode route to HT0 Lo */ writeb(1, priv->base + PCH_INT_ROUTE(i)); @@ -198,50 +247,37 @@ static void pch_pic_reset(struct pch_pic *priv) } } -static int pch_pic_of_init(struct device_node *node, - struct device_node *parent) +static int pch_pic_init(phys_addr_t addr, unsigned long size, int vec_base, + struct irq_domain *parent_domain, struct fwnode_handle *domain_handle, + u32 gsi_base) { struct pch_pic *priv; - struct irq_domain *parent_domain; - int err; priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (!priv) return -ENOMEM; raw_spin_lock_init(&priv->pic_lock); - priv->base = of_iomap(node, 0); - if (!priv->base) { - err = -ENOMEM; + priv->base = ioremap(addr, size); + if (!priv->base) goto free_priv; - } - parent_domain = irq_find_host(parent); - if (!parent_domain) { - pr_err("Failed to find the parent domain\n"); - err = -ENXIO; - goto iounmap_base; - } - - if (of_property_read_u32(node, "loongson,pic-base-vec", - &priv->ht_vec_base)) { - pr_err("Failed to determine pic-base-vec\n"); - err = -EINVAL; - goto iounmap_base; - } + priv->ht_vec_base = vec_base; + priv->vec_count = ((readq(priv->base) >> 48) & 0xff) + 1; + priv->gsi_base = gsi_base; priv->pic_domain = irq_domain_create_hierarchy(parent_domain, 0, - PIC_COUNT, - of_node_to_fwnode(node), - &pch_pic_domain_ops, - priv); + priv->vec_count, domain_handle, + &pch_pic_domain_ops, priv); + if (!priv->pic_domain) { pr_err("Failed to create IRQ domain\n"); - err = -ENOMEM; goto iounmap_base; } pch_pic_reset(priv); + pch_pic_handle[nr_pics] = domain_handle; + pch_pic_priv[nr_pics++] = priv; return 0; @@ -250,7 +286,86 @@ iounmap_base: free_priv: kfree(priv); - return err; + return -EINVAL; +} + +#ifdef CONFIG_OF + +static int pch_pic_of_init(struct device_node *node, + struct device_node *parent) +{ + int err, vec_base; + struct resource res; + struct irq_domain *parent_domain; + + if (of_address_to_resource(node, 0, &res)) + return -EINVAL; + + parent_domain = irq_find_host(parent); + if (!parent_domain) { + pr_err("Failed to find the parent domain\n"); + return -ENXIO; + } + + if (of_property_read_u32(node, "loongson,pic-base-vec", &vec_base)) { + pr_err("Failed to determine pic-base-vec\n"); + return -EINVAL; + } + + err = pch_pic_init(res.start, resource_size(&res), vec_base, + parent_domain, of_node_to_fwnode(node), 0); + if (err < 0) + return err; + + return 0; } IRQCHIP_DECLARE(pch_pic, "loongson,pch-pic-1.0", pch_pic_of_init); + +#endif + +#ifdef CONFIG_ACPI +static int __init +pch_lpc_parse_madt(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_lpc_pic *pchlpc_entry = (struct acpi_madt_lpc_pic *)header; + + return pch_lpc_acpi_init(pch_pic_priv[0]->pic_domain, pchlpc_entry); +} + +static int __init acpi_cascade_irqdomain_init(void) +{ + acpi_table_parse_madt(ACPI_MADT_TYPE_LPC_PIC, + pch_lpc_parse_madt, 0); + return 0; +} + +int __init pch_pic_acpi_init(struct irq_domain *parent, + struct acpi_madt_bio_pic *acpi_pchpic) +{ + int ret, vec_base; + struct fwnode_handle *domain_handle; + + vec_base = acpi_pchpic->gsi_base - GSI_MIN_PCH_IRQ; + + domain_handle = irq_domain_alloc_fwnode((phys_addr_t *)acpi_pchpic); + if (!domain_handle) { + pr_err("Unable to allocate domain handle\n"); + return -ENOMEM; + } + + ret = pch_pic_init(acpi_pchpic->address, acpi_pchpic->size, + vec_base, parent, domain_handle, acpi_pchpic->gsi_base); + + if (ret < 0) { + irq_domain_free_fwnode(domain_handle); + return ret; + } + + if (acpi_pchpic->id == 0) + acpi_cascade_irqdomain_init(); + + return ret; +} +#endif From 023087324000ae704cf3cfd0abf1fc30c6e0e8d5 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:28 +0800 Subject: [PATCH 291/298] irqchip/loongson-pch-msi: Add ACPI init support PCH-PIC/PCH-MSI stands for "Interrupt Controller" that described in Section 5 of "Loongson 7A1000 Bridge User Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-10-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 12 ++- arch/loongarch/kernel/irq.c | 1 - drivers/irqchip/Kconfig | 2 +- drivers/irqchip/irq-loongson-pch-msi.c | 127 +++++++++++++++++-------- 4 files changed, 96 insertions(+), 46 deletions(-) diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index 9549806858bd..e9039f2322a4 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -114,11 +114,20 @@ struct irq_domain *htvec_acpi_init(struct irq_domain *parent, struct acpi_madt_ht_pic *acpi_htvec); int pch_lpc_acpi_init(struct irq_domain *parent, struct acpi_madt_lpc_pic *acpi_pchlpc); -struct irq_domain *pch_msi_acpi_init(struct irq_domain *parent, +#if IS_ENABLED(CONFIG_LOONGSON_PCH_MSI) +int pch_msi_acpi_init(struct irq_domain *parent, struct acpi_madt_msi_pic *acpi_pchmsi); +#else +static inline int pch_msi_acpi_init(struct irq_domain *parent, + struct acpi_madt_msi_pic *acpi_pchmsi) +{ + return 0; +} +#endif int pch_pic_acpi_init(struct irq_domain *parent, struct acpi_madt_bio_pic *acpi_pchpic); int find_pch_pic(u32 gsi); +struct fwnode_handle *get_pch_msi_handle(int pci_segment); extern struct acpi_madt_lio_pic *acpi_liointc; extern struct acpi_madt_eio_pic *acpi_eiointc[MAX_IO_PICS]; @@ -131,7 +140,6 @@ extern struct acpi_madt_bio_pic *acpi_pchpic[MAX_IO_PICS]; extern struct irq_domain *cpu_domain; extern struct irq_domain *liointc_domain; extern struct fwnode_handle *pch_lpc_handle; -extern struct irq_domain *pch_msi_domain[MAX_IO_PICS]; extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS]; extern irqreturn_t loongson3_ipi_interrupt(int irq, void *dev); diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index 575b8de08289..066f892943fe 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -27,7 +27,6 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); struct irq_domain *cpu_domain; struct irq_domain *liointc_domain; -struct irq_domain *pch_msi_domain[MAX_IO_PICS]; struct acpi_vector_group pch_group[MAX_IO_PICS]; struct acpi_vector_group msi_group[MAX_IO_PICS]; diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index f62bdeca38c5..8844e6b53b3c 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -583,7 +583,7 @@ config LOONGSON_PCH_PIC config LOONGSON_PCH_MSI bool "Loongson PCH MSI Controller" - depends on MACH_LOONGSON64 || COMPILE_TEST + depends on MACH_LOONGSON64 depends on PCI default MACH_LOONGSON64 select IRQ_DOMAIN_HIERARCHY diff --git a/drivers/irqchip/irq-loongson-pch-msi.c b/drivers/irqchip/irq-loongson-pch-msi.c index e3801c4a77ed..d0e8551bebfa 100644 --- a/drivers/irqchip/irq-loongson-pch-msi.c +++ b/drivers/irqchip/irq-loongson-pch-msi.c @@ -15,6 +15,8 @@ #include #include +static int nr_pics; + struct pch_msi_data { struct mutex msi_map_lock; phys_addr_t doorbell; @@ -23,6 +25,8 @@ struct pch_msi_data { unsigned long *msi_map; }; +static struct fwnode_handle *pch_msi_handle[MAX_IO_PICS]; + static void pch_msi_mask_msi_irq(struct irq_data *d) { pci_msi_mask_irq(d); @@ -154,12 +158,12 @@ static const struct irq_domain_ops pch_msi_middle_domain_ops = { }; static int pch_msi_init_domains(struct pch_msi_data *priv, - struct device_node *node, - struct irq_domain *parent) + struct irq_domain *parent, + struct fwnode_handle *domain_handle) { struct irq_domain *middle_domain, *msi_domain; - middle_domain = irq_domain_create_linear(of_node_to_fwnode(node), + middle_domain = irq_domain_create_linear(domain_handle, priv->num_irqs, &pch_msi_middle_domain_ops, priv); @@ -171,7 +175,7 @@ static int pch_msi_init_domains(struct pch_msi_data *priv, middle_domain->parent = parent; irq_domain_update_bus_token(middle_domain, DOMAIN_BUS_NEXUS); - msi_domain = pci_msi_create_irq_domain(of_node_to_fwnode(node), + msi_domain = pci_msi_create_irq_domain(domain_handle, &pch_msi_domain_info, middle_domain); if (!msi_domain) { @@ -183,19 +187,11 @@ static int pch_msi_init_domains(struct pch_msi_data *priv, return 0; } -static int pch_msi_init(struct device_node *node, - struct device_node *parent) +static int pch_msi_init(phys_addr_t msg_address, int irq_base, int irq_count, + struct irq_domain *parent_domain, struct fwnode_handle *domain_handle) { - struct pch_msi_data *priv; - struct irq_domain *parent_domain; - struct resource res; int ret; - - parent_domain = irq_find_host(parent); - if (!parent_domain) { - pr_err("Failed to find the parent domain\n"); - return -ENXIO; - } + struct pch_msi_data *priv; priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (!priv) @@ -203,48 +199,95 @@ static int pch_msi_init(struct device_node *node, mutex_init(&priv->msi_map_lock); - ret = of_address_to_resource(node, 0, &res); - if (ret) { - pr_err("Failed to allocate resource\n"); - goto err_priv; - } - - priv->doorbell = res.start; - - if (of_property_read_u32(node, "loongson,msi-base-vec", - &priv->irq_first)) { - pr_err("Unable to parse MSI vec base\n"); - ret = -EINVAL; - goto err_priv; - } - - if (of_property_read_u32(node, "loongson,msi-num-vecs", - &priv->num_irqs)) { - pr_err("Unable to parse MSI vec number\n"); - ret = -EINVAL; - goto err_priv; - } + priv->doorbell = msg_address; + priv->irq_first = irq_base; + priv->num_irqs = irq_count; priv->msi_map = bitmap_zalloc(priv->num_irqs, GFP_KERNEL); - if (!priv->msi_map) { - ret = -ENOMEM; + if (!priv->msi_map) goto err_priv; - } pr_debug("Registering %d MSIs, starting at %d\n", priv->num_irqs, priv->irq_first); - ret = pch_msi_init_domains(priv, node, parent_domain); + ret = pch_msi_init_domains(priv, parent_domain, domain_handle); if (ret) goto err_map; + pch_msi_handle[nr_pics++] = domain_handle; return 0; err_map: bitmap_free(priv->msi_map); err_priv: kfree(priv); - return ret; + + return -EINVAL; } -IRQCHIP_DECLARE(pch_msi, "loongson,pch-msi-1.0", pch_msi_init); +#ifdef CONFIG_OF +static int pch_msi_of_init(struct device_node *node, struct device_node *parent) +{ + int err; + int irq_base, irq_count; + struct resource res; + struct irq_domain *parent_domain; + + parent_domain = irq_find_host(parent); + if (!parent_domain) { + pr_err("Failed to find the parent domain\n"); + return -ENXIO; + } + + if (of_address_to_resource(node, 0, &res)) { + pr_err("Failed to allocate resource\n"); + return -EINVAL; + } + + if (of_property_read_u32(node, "loongson,msi-base-vec", &irq_base)) { + pr_err("Unable to parse MSI vec base\n"); + return -EINVAL; + } + + if (of_property_read_u32(node, "loongson,msi-num-vecs", &irq_count)) { + pr_err("Unable to parse MSI vec number\n"); + return -EINVAL; + } + + err = pch_msi_init(res.start, irq_base, irq_count, parent_domain, of_node_to_fwnode(node)); + if (err < 0) + return err; + + return 0; +} + +IRQCHIP_DECLARE(pch_msi, "loongson,pch-msi-1.0", pch_msi_of_init); +#endif + +#ifdef CONFIG_ACPI +struct fwnode_handle *get_pch_msi_handle(int pci_segment) +{ + int i; + + for (i = 0; i < MAX_IO_PICS; i++) { + if (msi_group[i].pci_segment == pci_segment) + return pch_msi_handle[i]; + } + return NULL; +} + +int __init pch_msi_acpi_init(struct irq_domain *parent, + struct acpi_madt_msi_pic *acpi_pchmsi) +{ + int ret; + struct fwnode_handle *domain_handle; + + domain_handle = irq_domain_alloc_fwnode((phys_addr_t *)acpi_pchmsi); + ret = pch_msi_init(acpi_pchmsi->msg_address, acpi_pchmsi->start, + acpi_pchmsi->count, parent, domain_handle); + if (ret < 0) + irq_domain_free_fwnode(domain_handle); + + return ret; +} +#endif From 0858ed035a85c3ae79553200d2d818797cf849f5 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:29 +0800 Subject: [PATCH 292/298] irqchip/loongson-liointc: Add ACPI init support LIOINTC stands for "Legacy I/O Interrupts" that described in Section 11.1 of "Loongson 3A5000 Processor Reference Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-11-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 4 +- arch/loongarch/kernel/irq.c | 1 - arch/mips/include/asm/mach-loongson64/irq.h | 1 + drivers/irqchip/irq-loongson-liointc.c | 207 ++++++++++++-------- 4 files changed, 133 insertions(+), 80 deletions(-) diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index e9039f2322a4..c847300c4cd1 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -105,7 +105,7 @@ struct acpi_madt_lpc_pic; struct irq_domain *loongarch_cpu_irq_init(void); -struct irq_domain *liointc_acpi_init(struct irq_domain *parent, +int liointc_acpi_init(struct irq_domain *parent, struct acpi_madt_lio_pic *acpi_liointc); struct irq_domain *eiointc_acpi_init(struct irq_domain *parent, struct acpi_madt_eio_pic *acpi_eiointc); @@ -138,7 +138,7 @@ extern struct acpi_madt_msi_pic *acpi_pchmsi[MAX_IO_PICS]; extern struct acpi_madt_bio_pic *acpi_pchpic[MAX_IO_PICS]; extern struct irq_domain *cpu_domain; -extern struct irq_domain *liointc_domain; +extern struct fwnode_handle *liointc_handle; extern struct fwnode_handle *pch_lpc_handle; extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS]; diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index 066f892943fe..da131f51225a 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -26,7 +26,6 @@ DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); EXPORT_PER_CPU_SYMBOL(irq_stat); struct irq_domain *cpu_domain; -struct irq_domain *liointc_domain; struct acpi_vector_group pch_group[MAX_IO_PICS]; struct acpi_vector_group msi_group[MAX_IO_PICS]; diff --git a/arch/mips/include/asm/mach-loongson64/irq.h b/arch/mips/include/asm/mach-loongson64/irq.h index 55e0dee12cb0..67c15f320f93 100644 --- a/arch/mips/include/asm/mach-loongson64/irq.h +++ b/arch/mips/include/asm/mach-loongson64/irq.h @@ -9,6 +9,7 @@ #define NR_IRQS (NR_IRQS_LEGACY + NR_MIPS_CPU_IRQS + NR_MAX_CHAINED_IRQS + 256) #define MAX_IO_PICS 1 #define MIPS_CPU_IRQ_BASE NR_IRQS_LEGACY +#define GSI_MIN_CPU_IRQ 0 #include diff --git a/drivers/irqchip/irq-loongson-liointc.c b/drivers/irqchip/irq-loongson-liointc.c index 8d05d8bcf56f..c4f3c886ad61 100644 --- a/drivers/irqchip/irq-loongson-liointc.c +++ b/drivers/irqchip/irq-loongson-liointc.c @@ -23,7 +23,7 @@ #endif #define LIOINTC_CHIP_IRQ 32 -#define LIOINTC_NUM_PARENT 4 +#define LIOINTC_NUM_PARENT 4 #define LIOINTC_NUM_CORES 4 #define LIOINTC_INTC_CHIP_START 0x20 @@ -58,6 +58,8 @@ struct liointc_priv { bool has_lpc_irq_errata; }; +struct fwnode_handle *liointc_handle; + static void liointc_chained_handle_irq(struct irq_desc *desc) { struct liointc_handler_data *handler = irq_desc_get_handler_data(desc); @@ -153,97 +155,79 @@ static void liointc_resume(struct irq_chip_generic *gc) irq_gc_unlock_irqrestore(gc, flags); } -static const char * const parent_names[] = {"int0", "int1", "int2", "int3"}; -static const char * const core_reg_names[] = {"isr0", "isr1", "isr2", "isr3"}; +static int parent_irq[LIOINTC_NUM_PARENT]; +static u32 parent_int_map[LIOINTC_NUM_PARENT]; +static const char *const parent_names[] = {"int0", "int1", "int2", "int3"}; +static const char *const core_reg_names[] = {"isr0", "isr1", "isr2", "isr3"}; -static void __iomem *liointc_get_reg_byname(struct device_node *node, - const char *name) +static int liointc_domain_xlate(struct irq_domain *d, struct device_node *ctrlr, + const u32 *intspec, unsigned int intsize, + unsigned long *out_hwirq, unsigned int *out_type) { - int index = of_property_match_string(node, "reg-names", name); - - if (index < 0) - return NULL; - - return of_iomap(node, index); + if (WARN_ON(intsize < 1)) + return -EINVAL; + *out_hwirq = intspec[0] - GSI_MIN_CPU_IRQ; + *out_type = IRQ_TYPE_NONE; + return 0; } -static int __init liointc_of_init(struct device_node *node, - struct device_node *parent) +static const struct irq_domain_ops acpi_irq_gc_ops = { + .map = irq_map_generic_chip, + .unmap = irq_unmap_generic_chip, + .xlate = liointc_domain_xlate, +}; + +static int liointc_init(phys_addr_t addr, unsigned long size, int revision, + struct fwnode_handle *domain_handle, struct device_node *node) { + int i, err; + void __iomem *base; + struct irq_chip_type *ct; struct irq_chip_generic *gc; struct irq_domain *domain; - struct irq_chip_type *ct; struct liointc_priv *priv; - void __iomem *base; - u32 of_parent_int_map[LIOINTC_NUM_PARENT]; - int parent_irq[LIOINTC_NUM_PARENT]; - bool have_parent = FALSE; - int sz, i, err = 0; priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (!priv) return -ENOMEM; - if (of_device_is_compatible(node, "loongson,liointc-2.0")) { - base = liointc_get_reg_byname(node, "main"); - if (!base) { - err = -ENODEV; - goto out_free_priv; - } + base = ioremap(addr, size); + if (!base) + goto out_free_priv; - for (i = 0; i < LIOINTC_NUM_CORES; i++) - priv->core_isr[i] = liointc_get_reg_byname(node, core_reg_names[i]); - if (!priv->core_isr[0]) { - err = -ENODEV; - goto out_iounmap_base; - } - } else { - base = of_iomap(node, 0); - if (!base) { - err = -ENODEV; - goto out_free_priv; - } - - for (i = 0; i < LIOINTC_NUM_CORES; i++) - priv->core_isr[i] = base + LIOINTC_REG_INTC_STATUS; - } - - for (i = 0; i < LIOINTC_NUM_PARENT; i++) { - parent_irq[i] = of_irq_get_byname(node, parent_names[i]); - if (parent_irq[i] > 0) - have_parent = TRUE; - } - if (!have_parent) { - err = -ENODEV; - goto out_iounmap_isr; - } - - sz = of_property_read_variable_u32_array(node, - "loongson,parent_int_map", - &of_parent_int_map[0], - LIOINTC_NUM_PARENT, - LIOINTC_NUM_PARENT); - if (sz < 4) { - pr_err("loongson-liointc: No parent_int_map\n"); - err = -ENODEV; - goto out_iounmap_isr; - } + for (i = 0; i < LIOINTC_NUM_CORES; i++) + priv->core_isr[i] = base + LIOINTC_REG_INTC_STATUS; for (i = 0; i < LIOINTC_NUM_PARENT; i++) - priv->handler[i].parent_int_map = of_parent_int_map[i]; + priv->handler[i].parent_int_map = parent_int_map[i]; + + if (revision > 1) { + for (i = 0; i < LIOINTC_NUM_CORES; i++) { + int index = of_property_match_string(node, + "reg-names", core_reg_names[i]); + + if (index < 0) + return -EINVAL; + + priv->core_isr[i] = of_iomap(node, index); + } + } /* Setup IRQ domain */ - domain = irq_domain_add_linear(node, 32, + if (!acpi_disabled) + domain = irq_domain_create_linear(domain_handle, LIOINTC_CHIP_IRQ, + &acpi_irq_gc_ops, priv); + else + domain = irq_domain_create_linear(domain_handle, LIOINTC_CHIP_IRQ, &irq_generic_chip_ops, priv); if (!domain) { pr_err("loongson-liointc: cannot add IRQ domain\n"); - err = -EINVAL; - goto out_iounmap_isr; + goto out_iounmap; } - err = irq_alloc_domain_generic_chips(domain, 32, 1, - node->full_name, handle_level_irq, - IRQ_NOPROBE, 0, 0); + err = irq_alloc_domain_generic_chips(domain, LIOINTC_CHIP_IRQ, 1, + (node ? node->full_name : "LIOINTC"), + handle_level_irq, 0, IRQ_NOPROBE, 0); if (err) { pr_err("loongson-liointc: unable to register IRQ domain\n"); goto out_free_domain; @@ -299,24 +283,93 @@ static int __init liointc_of_init(struct device_node *node, liointc_chained_handle_irq, &priv->handler[i]); } + liointc_handle = domain_handle; return 0; out_free_domain: irq_domain_remove(domain); -out_iounmap_isr: - for (i = 0; i < LIOINTC_NUM_CORES; i++) { - if (!priv->core_isr[i]) - continue; - iounmap(priv->core_isr[i]); - } -out_iounmap_base: +out_iounmap: iounmap(base); out_free_priv: kfree(priv); - return err; + return -EINVAL; +} + +#ifdef CONFIG_OF + +static int __init liointc_of_init(struct device_node *node, + struct device_node *parent) +{ + bool have_parent = FALSE; + int sz, i, index, revision, err = 0; + struct resource res; + + if (!of_device_is_compatible(node, "loongson,liointc-2.0")) { + index = 0; + revision = 1; + } else { + index = of_property_match_string(node, "reg-names", "main"); + revision = 2; + } + + if (of_address_to_resource(node, index, &res)) + return -EINVAL; + + for (i = 0; i < LIOINTC_NUM_PARENT; i++) { + parent_irq[i] = of_irq_get_byname(node, parent_names[i]); + if (parent_irq[i] > 0) + have_parent = TRUE; + } + if (!have_parent) + return -ENODEV; + + sz = of_property_read_variable_u32_array(node, + "loongson,parent_int_map", + &parent_int_map[0], + LIOINTC_NUM_PARENT, + LIOINTC_NUM_PARENT); + if (sz < 4) { + pr_err("loongson-liointc: No parent_int_map\n"); + return -ENODEV; + } + + err = liointc_init(res.start, resource_size(&res), + revision, of_node_to_fwnode(node), node); + if (err < 0) + return err; + + return 0; } IRQCHIP_DECLARE(loongson_liointc_1_0, "loongson,liointc-1.0", liointc_of_init); IRQCHIP_DECLARE(loongson_liointc_1_0a, "loongson,liointc-1.0a", liointc_of_init); IRQCHIP_DECLARE(loongson_liointc_2_0, "loongson,liointc-2.0", liointc_of_init); + +#endif + +#ifdef CONFIG_ACPI +int __init liointc_acpi_init(struct irq_domain *parent, struct acpi_madt_lio_pic *acpi_liointc) +{ + int ret; + struct fwnode_handle *domain_handle; + + parent_int_map[0] = acpi_liointc->cascade_map[0]; + parent_int_map[1] = acpi_liointc->cascade_map[1]; + + parent_irq[0] = irq_create_mapping(parent, acpi_liointc->cascade[0]); + parent_irq[1] = irq_create_mapping(parent, acpi_liointc->cascade[1]); + + domain_handle = irq_domain_alloc_fwnode((phys_addr_t *)acpi_liointc); + if (!domain_handle) { + pr_err("Unable to allocate domain handle\n"); + return -ENOMEM; + } + ret = liointc_init(acpi_liointc->address, acpi_liointc->size, + 1, domain_handle, NULL); + if (ret) + irq_domain_free_fwnode(domain_handle); + + return ret; +} +#endif From dd281e1a1a937ee2f13bd0db5be78e5f5b811ca7 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:30 +0800 Subject: [PATCH 293/298] irqchip: Add Loongson Extended I/O interrupt controller support EIOINTC stands for "Extended I/O Interrupts" that described in Section 11.2 of "Loongson 3A5000 Processor Reference Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Loongson-3A5000 has 4 cores per NUMA node, and each NUMA node has an EIOINTC; while Loongson-3C5000 has 16 cores per NUMA node, and each NUMA node has 4 EIOINTCs. In other words, 16 cores of one NUMA node in Loongson-3C5000 are organized in 4 groups, each group connects to an EIOINTC. We call the "group" here as an EIOINTC node, so each EIOINTC node always includes 4 cores (both in Loongson-3A5000 and Loongson- 3C5000). Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-12-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 11 +- drivers/irqchip/Kconfig | 10 + drivers/irqchip/Makefile | 1 + drivers/irqchip/irq-loongson-eiointc.c | 395 +++++++++++++++++++++++++ include/linux/cpuhotplug.h | 1 + 5 files changed, 408 insertions(+), 10 deletions(-) create mode 100644 drivers/irqchip/irq-loongson-eiointc.c diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index c847300c4cd1..67ebcc5495b8 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -87,15 +87,6 @@ extern struct acpi_vector_group msi_group[MAX_IO_PICS]; extern int find_pch_pic(u32 gsi); extern int eiointc_get_node(int id); -static inline void eiointc_enable(void) -{ - uint64_t misc; - - misc = iocsr_read64(LOONGARCH_IOCSR_MISC_FUNC); - misc |= IOCSR_MISC_FUNC_EXT_IOI_EN; - iocsr_write64(misc, LOONGARCH_IOCSR_MISC_FUNC); -} - struct acpi_madt_lio_pic; struct acpi_madt_eio_pic; struct acpi_madt_ht_pic; @@ -107,7 +98,7 @@ struct irq_domain *loongarch_cpu_irq_init(void); int liointc_acpi_init(struct irq_domain *parent, struct acpi_madt_lio_pic *acpi_liointc); -struct irq_domain *eiointc_acpi_init(struct irq_domain *parent, +int eiointc_acpi_init(struct irq_domain *parent, struct acpi_madt_eio_pic *acpi_eiointc); struct irq_domain *htvec_acpi_init(struct irq_domain *parent, diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 8844e6b53b3c..8f077d353e67 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -555,6 +555,16 @@ config LOONGSON_LIOINTC help Support for the Loongson Local I/O Interrupt Controller. +config LOONGSON_EIOINTC + bool "Loongson Extend I/O Interrupt Controller" + depends on LOONGARCH + depends on MACH_LOONGSON64 + default MACH_LOONGSON64 + select IRQ_DOMAIN_HIERARCHY + select GENERIC_IRQ_CHIP + help + Support for the Loongson3 Extend I/O Interrupt Vector Controller. + config LOONGSON_HTPIC bool "Loongson3 HyperTransport PIC Controller" depends on MACH_LOONGSON64 && MIPS diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index 242b8b3568f8..0cfd4f046751 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -104,6 +104,7 @@ obj-$(CONFIG_TI_SCI_INTR_IRQCHIP) += irq-ti-sci-intr.o obj-$(CONFIG_TI_SCI_INTA_IRQCHIP) += irq-ti-sci-inta.o obj-$(CONFIG_TI_PRUSS_INTC) += irq-pruss-intc.o obj-$(CONFIG_LOONGSON_LIOINTC) += irq-loongson-liointc.o +obj-$(CONFIG_LOONGSON_EIOINTC) += irq-loongson-eiointc.o obj-$(CONFIG_LOONGSON_HTPIC) += irq-loongson-htpic.o obj-$(CONFIG_LOONGSON_HTVEC) += irq-loongson-htvec.o obj-$(CONFIG_LOONGSON_PCH_PIC) += irq-loongson-pch-pic.o diff --git a/drivers/irqchip/irq-loongson-eiointc.c b/drivers/irqchip/irq-loongson-eiointc.c new file mode 100644 index 000000000000..80d8ca6f2d46 --- /dev/null +++ b/drivers/irqchip/irq-loongson-eiointc.c @@ -0,0 +1,395 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Loongson Extend I/O Interrupt Controller support + * + * Copyright (C) 2020-2022 Loongson Technology Corporation Limited + */ + +#define pr_fmt(fmt) "eiointc: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define EIOINTC_REG_NODEMAP 0x14a0 +#define EIOINTC_REG_IPMAP 0x14c0 +#define EIOINTC_REG_ENABLE 0x1600 +#define EIOINTC_REG_BOUNCE 0x1680 +#define EIOINTC_REG_ISR 0x1800 +#define EIOINTC_REG_ROUTE 0x1c00 + +#define VEC_REG_COUNT 4 +#define VEC_COUNT_PER_REG 64 +#define VEC_COUNT (VEC_REG_COUNT * VEC_COUNT_PER_REG) +#define VEC_REG_IDX(irq_id) ((irq_id) / VEC_COUNT_PER_REG) +#define VEC_REG_BIT(irq_id) ((irq_id) % VEC_COUNT_PER_REG) +#define EIOINTC_ALL_ENABLE 0xffffffff + +#define MAX_EIO_NODES (NR_CPUS / CORES_PER_EIO_NODE) + +static int nr_pics; + +struct eiointc_priv { + u32 node; + nodemask_t node_map; + cpumask_t cpuspan_map; + struct fwnode_handle *domain_handle; + struct irq_domain *eiointc_domain; +}; + +static struct eiointc_priv *eiointc_priv[MAX_IO_PICS]; + +static void eiointc_enable(void) +{ + uint64_t misc; + + misc = iocsr_read64(LOONGARCH_IOCSR_MISC_FUNC); + misc |= IOCSR_MISC_FUNC_EXT_IOI_EN; + iocsr_write64(misc, LOONGARCH_IOCSR_MISC_FUNC); +} + +static int cpu_to_eio_node(int cpu) +{ + return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; +} + +static void eiointc_set_irq_route(int pos, unsigned int cpu, unsigned int mnode, nodemask_t *node_map) +{ + int i, node, cpu_node, route_node; + unsigned char coremap; + uint32_t pos_off, data, data_byte, data_mask; + + pos_off = pos & ~3; + data_byte = pos & 3; + data_mask = ~BIT_MASK(data_byte) & 0xf; + + /* Calculate node and coremap of target irq */ + cpu_node = cpu_logical_map(cpu) / CORES_PER_EIO_NODE; + coremap = BIT(cpu_logical_map(cpu) % CORES_PER_EIO_NODE); + + for_each_online_cpu(i) { + node = cpu_to_eio_node(i); + if (!node_isset(node, *node_map)) + continue; + + /* EIO node 0 is in charge of inter-node interrupt dispatch */ + route_node = (node == mnode) ? cpu_node : node; + data = ((coremap | (route_node << 4)) << (data_byte * 8)); + csr_any_send(EIOINTC_REG_ROUTE + pos_off, data, data_mask, node * CORES_PER_EIO_NODE); + } +} + +static DEFINE_RAW_SPINLOCK(affinity_lock); + +static int eiointc_set_irq_affinity(struct irq_data *d, const struct cpumask *affinity, bool force) +{ + unsigned int cpu; + unsigned long flags; + uint32_t vector, regaddr; + struct cpumask intersect_affinity; + struct eiointc_priv *priv = d->domain->host_data; + + raw_spin_lock_irqsave(&affinity_lock, flags); + + cpumask_and(&intersect_affinity, affinity, cpu_online_mask); + cpumask_and(&intersect_affinity, &intersect_affinity, &priv->cpuspan_map); + + if (cpumask_empty(&intersect_affinity)) { + raw_spin_unlock_irqrestore(&affinity_lock, flags); + return -EINVAL; + } + cpu = cpumask_first(&intersect_affinity); + + vector = d->hwirq; + regaddr = EIOINTC_REG_ENABLE + ((vector >> 5) << 2); + + /* Mask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE & (~BIT(vector & 0x1F)), 0x0, 0); + /* Set route for target vector */ + eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); + /* Unmask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE, 0x0, 0); + + irq_data_update_effective_affinity(d, cpumask_of(cpu)); + + raw_spin_unlock_irqrestore(&affinity_lock, flags); + + return IRQ_SET_MASK_OK; +} + +static int eiointc_index(int node) +{ + int i; + + for (i = 0; i < nr_pics; i++) { + if (node_isset(node, eiointc_priv[i]->node_map)) + return i; + } + + return -1; +} + +static int eiointc_router_init(unsigned int cpu) +{ + int i, bit; + uint32_t data; + uint32_t node = cpu_to_eio_node(cpu); + uint32_t index = eiointc_index(node); + + if (index < 0) { + pr_err("Error: invalid nodemap!\n"); + return -1; + } + + if ((cpu_logical_map(cpu) % CORES_PER_EIO_NODE) == 0) { + eiointc_enable(); + + for (i = 0; i < VEC_COUNT / 32; i++) { + data = (((1 << (i * 2 + 1)) << 16) | (1 << (i * 2))); + iocsr_write32(data, EIOINTC_REG_NODEMAP + i * 4); + } + + for (i = 0; i < VEC_COUNT / 32 / 4; i++) { + bit = BIT(1 + index); /* Route to IP[1 + index] */ + data = bit | (bit << 8) | (bit << 16) | (bit << 24); + iocsr_write32(data, EIOINTC_REG_IPMAP + i * 4); + } + + for (i = 0; i < VEC_COUNT / 4; i++) { + /* Route to Node-0 Core-0 */ + if (index == 0) + bit = BIT(cpu_logical_map(0)); + else + bit = (eiointc_priv[index]->node << 4) | 1; + + data = bit | (bit << 8) | (bit << 16) | (bit << 24); + iocsr_write32(data, EIOINTC_REG_ROUTE + i * 4); + } + + for (i = 0; i < VEC_COUNT / 32; i++) { + data = 0xffffffff; + iocsr_write32(data, EIOINTC_REG_ENABLE + i * 4); + iocsr_write32(data, EIOINTC_REG_BOUNCE + i * 4); + } + } + + return 0; +} + +static void eiointc_irq_dispatch(struct irq_desc *desc) +{ + int i; + u64 pending; + bool handled = false; + struct irq_chip *chip = irq_desc_get_chip(desc); + struct eiointc_priv *priv = irq_desc_get_handler_data(desc); + + chained_irq_enter(chip, desc); + + for (i = 0; i < VEC_REG_COUNT; i++) { + pending = iocsr_read64(EIOINTC_REG_ISR + (i << 3)); + iocsr_write64(pending, EIOINTC_REG_ISR + (i << 3)); + while (pending) { + int bit = __ffs(pending); + int irq = bit + VEC_COUNT_PER_REG * i; + + generic_handle_domain_irq(priv->eiointc_domain, irq); + pending &= ~BIT(bit); + handled = true; + } + } + + if (!handled) + spurious_interrupt(); + + chained_irq_exit(chip, desc); +} + +static void eiointc_ack_irq(struct irq_data *d) +{ +} + +static void eiointc_mask_irq(struct irq_data *d) +{ +} + +static void eiointc_unmask_irq(struct irq_data *d) +{ +} + +static struct irq_chip eiointc_irq_chip = { + .name = "EIOINTC", + .irq_ack = eiointc_ack_irq, + .irq_mask = eiointc_mask_irq, + .irq_unmask = eiointc_unmask_irq, + .irq_set_affinity = eiointc_set_irq_affinity, +}; + +static int eiointc_domain_alloc(struct irq_domain *domain, unsigned int virq, + unsigned int nr_irqs, void *arg) +{ + int ret; + unsigned int i, type; + unsigned long hwirq = 0; + struct eiointc *priv = domain->host_data; + + ret = irq_domain_translate_onecell(domain, arg, &hwirq, &type); + if (ret) + return ret; + + for (i = 0; i < nr_irqs; i++) { + irq_domain_set_info(domain, virq + i, hwirq + i, &eiointc_irq_chip, + priv, handle_edge_irq, NULL, NULL); + } + + return 0; +} + +static void eiointc_domain_free(struct irq_domain *domain, unsigned int virq, + unsigned int nr_irqs) +{ + int i; + + for (i = 0; i < nr_irqs; i++) { + struct irq_data *d = irq_domain_get_irq_data(domain, virq + i); + + irq_set_handler(virq + i, NULL); + irq_domain_reset_irq_data(d); + } +} + +static const struct irq_domain_ops eiointc_domain_ops = { + .translate = irq_domain_translate_onecell, + .alloc = eiointc_domain_alloc, + .free = eiointc_domain_free, +}; + +static void acpi_set_vec_parent(int node, struct irq_domain *parent, struct acpi_vector_group *vec_group) +{ + int i; + + if (cpu_has_flatmode) + node = cpu_to_node(node * CORES_PER_EIO_NODE); + + for (i = 0; i < MAX_IO_PICS; i++) { + if (node == vec_group[i].node) { + vec_group[i].parent = parent; + return; + } + } +} + +struct irq_domain *acpi_get_vec_parent(int node, struct acpi_vector_group *vec_group) +{ + int i; + + for (i = 0; i < MAX_IO_PICS; i++) { + if (node == vec_group[i].node) + return vec_group[i].parent; + } + return NULL; +} + +static int __init +pch_pic_parse_madt(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_bio_pic *pchpic_entry = (struct acpi_madt_bio_pic *)header; + unsigned int node = (pchpic_entry->address >> 44) & 0xf; + struct irq_domain *parent = acpi_get_vec_parent(node, pch_group); + + if (parent) + return pch_pic_acpi_init(parent, pchpic_entry); + + return -EINVAL; +} + +static int __init +pch_msi_parse_madt(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_msi_pic *pchmsi_entry = (struct acpi_madt_msi_pic *)header; + struct irq_domain *parent = acpi_get_vec_parent(eiointc_priv[nr_pics - 1]->node, msi_group); + + if (parent) + return pch_msi_acpi_init(parent, pchmsi_entry); + + return -EINVAL; +} + +static int __init acpi_cascade_irqdomain_init(void) +{ + acpi_table_parse_madt(ACPI_MADT_TYPE_BIO_PIC, + pch_pic_parse_madt, 0); + acpi_table_parse_madt(ACPI_MADT_TYPE_MSI_PIC, + pch_msi_parse_madt, 1); + return 0; +} + +int __init eiointc_acpi_init(struct irq_domain *parent, + struct acpi_madt_eio_pic *acpi_eiointc) +{ + int i, parent_irq; + unsigned long node_map; + struct eiointc_priv *priv; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->domain_handle = irq_domain_alloc_fwnode((phys_addr_t *)acpi_eiointc); + if (!priv->domain_handle) { + pr_err("Unable to allocate domain handle\n"); + goto out_free_priv; + } + + priv->node = acpi_eiointc->node; + node_map = acpi_eiointc->node_map ? : -1ULL; + + for_each_possible_cpu(i) { + if (node_map & (1ULL << cpu_to_eio_node(i))) { + node_set(cpu_to_eio_node(i), priv->node_map); + cpumask_or(&priv->cpuspan_map, &priv->cpuspan_map, cpumask_of(i)); + } + } + + /* Setup IRQ domain */ + priv->eiointc_domain = irq_domain_create_linear(priv->domain_handle, VEC_COUNT, + &eiointc_domain_ops, priv); + if (!priv->eiointc_domain) { + pr_err("loongson-eiointc: cannot add IRQ domain\n"); + goto out_free_handle; + } + + eiointc_priv[nr_pics++] = priv; + + eiointc_router_init(0); + + parent_irq = irq_create_mapping(parent, acpi_eiointc->cascade); + irq_set_chained_handler_and_data(parent_irq, eiointc_irq_dispatch, priv); + + cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_LOONGARCH_STARTING, + "irqchip/loongarch/intc:starting", + eiointc_router_init, NULL); + + acpi_set_vec_parent(acpi_eiointc->node, priv->eiointc_domain, pch_group); + acpi_set_vec_parent(acpi_eiointc->node, priv->eiointc_domain, msi_group); + acpi_cascade_irqdomain_init(); + + return 0; + +out_free_handle: + irq_domain_free_fwnode(priv->domain_handle); + priv->domain_handle = NULL; +out_free_priv: + kfree(priv); + + return -ENOMEM; +} diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 19f0dbfdd7fe..de662f3a6cee 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -151,6 +151,7 @@ enum cpuhp_state { CPUHP_AP_IRQ_BCM2836_STARTING, CPUHP_AP_IRQ_MIPS_GIC_STARTING, CPUHP_AP_IRQ_RISCV_STARTING, + CPUHP_AP_IRQ_LOONGARCH_STARTING, CPUHP_AP_IRQ_SIFIVE_PLIC_STARTING, CPUHP_AP_ARM_MVEBU_COHERENCY, CPUHP_AP_MICROCODE_LOADER, From b2d3e3354e2a0d0e912308618ea33d0337f405c3 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Wed, 20 Jul 2022 18:51:31 +0800 Subject: [PATCH 294/298] irqchip: Add LoongArch CPU interrupt controller support LoongArch CPUINTC stands for CSR.ECFG/CSR.ESTAT and related interrupt controller that described in Section 7.4 of "LoongArch Reference Manual, Vol 1". For more information please refer Documentation/loongarch/irq- chip-model.rst. LoongArch CPUINTC has 13 interrupt sources: SWI0~1, HWI0~7, IPI, TI (Timer) and PCOV (PMC). IRQ mappings of HWI0~7 are configurable (can be created from DT/ACPI), but IPI, TI (Timer) and PCOV (PMC) are hardcoded bits, so we expose the fwnode_handle to map them, and get mapped irq by irq_create_mapping when using them. Co-developed-by: Jianmin Lv Signed-off-by: Jianmin Lv Signed-off-by: Huacai Chen Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-13-git-send-email-lvjianmin@loongson.cn --- arch/loongarch/include/asm/irq.h | 7 +- arch/loongarch/kernel/irq.c | 16 +++- arch/loongarch/kernel/time.c | 14 +++- drivers/irqchip/Kconfig | 10 +++ drivers/irqchip/Makefile | 1 + drivers/irqchip/irq-loongarch-cpu.c | 111 ++++++++++++++++++++++++++++ 6 files changed, 149 insertions(+), 10 deletions(-) create mode 100644 drivers/irqchip/irq-loongarch-cpu.c diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h index 67ebcc5495b8..149b2123e7f4 100644 --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@ -35,9 +35,6 @@ static inline bool on_irq_stack(int cpu, unsigned long sp) return (low <= sp && sp <= high); } -int get_ipi_irq(void); -int get_pmc_irq(void); -int get_timer_irq(void); void spurious_interrupt(void); #define NR_IRQS_LEGACY 16 @@ -94,8 +91,6 @@ struct acpi_madt_bio_pic; struct acpi_madt_msi_pic; struct acpi_madt_lpc_pic; -struct irq_domain *loongarch_cpu_irq_init(void); - int liointc_acpi_init(struct irq_domain *parent, struct acpi_madt_lio_pic *acpi_liointc); int eiointc_acpi_init(struct irq_domain *parent, @@ -128,7 +123,7 @@ extern struct acpi_madt_lpc_pic *acpi_pchlpc; extern struct acpi_madt_msi_pic *acpi_pchmsi[MAX_IO_PICS]; extern struct acpi_madt_bio_pic *acpi_pchpic[MAX_IO_PICS]; -extern struct irq_domain *cpu_domain; +extern struct fwnode_handle *cpuintc_handle; extern struct fwnode_handle *liointc_handle; extern struct fwnode_handle *pch_lpc_handle; extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS]; diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index da131f51225a..1ba19c76563e 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -25,8 +25,6 @@ DEFINE_PER_CPU(unsigned long, irq_stack); DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); EXPORT_PER_CPU_SYMBOL(irq_stat); -struct irq_domain *cpu_domain; - struct acpi_vector_group pch_group[MAX_IO_PICS]; struct acpi_vector_group msi_group[MAX_IO_PICS]; /* @@ -89,6 +87,16 @@ static void __init init_vec_parent_group(void) acpi_table_parse(ACPI_SIG_MCFG, early_pci_mcfg_parse); } +static int __init get_ipi_irq(void) +{ + struct irq_domain *d = irq_find_matching_fwnode(cpuintc_handle, DOMAIN_BUS_ANY); + + if (d) + return irq_create_mapping(d, EXCCODE_IPI - EXCCODE_INT_START); + + return -EINVAL; +} + void __init init_IRQ(void) { int i; @@ -105,7 +113,9 @@ void __init init_IRQ(void) init_vec_parent_group(); irqchip_init(); #ifdef CONFIG_SMP - ipi_irq = EXCCODE_IPI - EXCCODE_INT_START; + ipi_irq = get_ipi_irq(); + if (ipi_irq < 0) + panic("IPI IRQ mapping failed\n"); irq_set_percpu_devid(ipi_irq); r = request_percpu_irq(ipi_irq, loongson3_ipi_interrupt, "IPI", &ipi_dummy_dev); if (r < 0) diff --git a/arch/loongarch/kernel/time.c b/arch/loongarch/kernel/time.c index fe6823875895..79dc5eddf504 100644 --- a/arch/loongarch/kernel/time.c +++ b/arch/loongarch/kernel/time.c @@ -123,6 +123,16 @@ void sync_counter(void) csr_write64(-init_timeval, LOONGARCH_CSR_CNTC); } +static int get_timer_irq(void) +{ + struct irq_domain *d = irq_find_matching_fwnode(cpuintc_handle, DOMAIN_BUS_ANY); + + if (d) + return irq_create_mapping(d, EXCCODE_TIMER - EXCCODE_INT_START); + + return -EINVAL; +} + int constant_clockevent_init(void) { unsigned int irq; @@ -132,7 +142,9 @@ int constant_clockevent_init(void) struct clock_event_device *cd; static int timer_irq_installed = 0; - irq = EXCCODE_TIMER - EXCCODE_INT_START; + irq = get_timer_irq(); + if (irq < 0) + pr_err("Failed to map irq %d (timer)\n", irq); cd = &per_cpu(constant_clockevent_device, cpu); diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig index 8f077d353e67..f53164ccdc9f 100644 --- a/drivers/irqchip/Kconfig +++ b/drivers/irqchip/Kconfig @@ -546,6 +546,16 @@ config EXYNOS_IRQ_COMBINER Say yes here to add support for the IRQ combiner devices embedded in Samsung Exynos chips. +config IRQ_LOONGARCH_CPU + bool + select GENERIC_IRQ_CHIP + select IRQ_DOMAIN + select GENERIC_IRQ_EFFECTIVE_AFF_MASK + help + Support for the LoongArch CPU Interrupt Controller. For details of + irq chip hierarchy on LoongArch platforms please read the document + Documentation/loongarch/irq-chip-model.rst. + config LOONGSON_LIOINTC bool "Loongson Local I/O Interrupt Controller" depends on MACH_LOONGSON64 diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index 0cfd4f046751..e559007bec73 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -103,6 +103,7 @@ obj-$(CONFIG_LS1X_IRQ) += irq-ls1x.o obj-$(CONFIG_TI_SCI_INTR_IRQCHIP) += irq-ti-sci-intr.o obj-$(CONFIG_TI_SCI_INTA_IRQCHIP) += irq-ti-sci-inta.o obj-$(CONFIG_TI_PRUSS_INTC) += irq-pruss-intc.o +obj-$(CONFIG_IRQ_LOONGARCH_CPU) += irq-loongarch-cpu.o obj-$(CONFIG_LOONGSON_LIOINTC) += irq-loongson-liointc.o obj-$(CONFIG_LOONGSON_EIOINTC) += irq-loongson-eiointc.o obj-$(CONFIG_LOONGSON_HTPIC) += irq-loongson-htpic.o diff --git a/drivers/irqchip/irq-loongarch-cpu.c b/drivers/irqchip/irq-loongarch-cpu.c new file mode 100644 index 000000000000..28ddc60c8608 --- /dev/null +++ b/drivers/irqchip/irq-loongarch-cpu.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2020-2022 Loongson Technology Corporation Limited + */ + +#include +#include +#include +#include +#include +#include + +#include +#include + +static struct irq_domain *irq_domain; +struct fwnode_handle *cpuintc_handle; + +static void mask_loongarch_irq(struct irq_data *d) +{ + clear_csr_ecfg(ECFGF(d->hwirq)); +} + +static void unmask_loongarch_irq(struct irq_data *d) +{ + set_csr_ecfg(ECFGF(d->hwirq)); +} + +static struct irq_chip cpu_irq_controller = { + .name = "CPUINTC", + .irq_mask = mask_loongarch_irq, + .irq_unmask = unmask_loongarch_irq, +}; + +static void handle_cpu_irq(struct pt_regs *regs) +{ + int hwirq; + unsigned int estat = read_csr_estat() & CSR_ESTAT_IS; + + while ((hwirq = ffs(estat))) { + estat &= ~BIT(hwirq - 1); + generic_handle_domain_irq(irq_domain, hwirq - 1); + } +} + +static int loongarch_cpu_intc_map(struct irq_domain *d, unsigned int irq, + irq_hw_number_t hwirq) +{ + irq_set_noprobe(irq); + irq_set_chip_and_handler(irq, &cpu_irq_controller, handle_percpu_irq); + + return 0; +} + +static const struct irq_domain_ops loongarch_cpu_intc_irq_domain_ops = { + .map = loongarch_cpu_intc_map, + .xlate = irq_domain_xlate_onecell, +}; + +static int __init +liointc_parse_madt(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_lio_pic *liointc_entry = (struct acpi_madt_lio_pic *)header; + + return liointc_acpi_init(irq_domain, liointc_entry); +} + +static int __init +eiointc_parse_madt(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_eio_pic *eiointc_entry = (struct acpi_madt_eio_pic *)header; + + return eiointc_acpi_init(irq_domain, eiointc_entry); +} + +static int __init acpi_cascade_irqdomain_init(void) +{ + acpi_table_parse_madt(ACPI_MADT_TYPE_LIO_PIC, + liointc_parse_madt, 0); + acpi_table_parse_madt(ACPI_MADT_TYPE_EIO_PIC, + eiointc_parse_madt, 0); + return 0; +} + +static int __init cpuintc_acpi_init(union acpi_subtable_headers *header, + const unsigned long end) +{ + if (irq_domain) + return 0; + + /* Mask interrupts. */ + clear_csr_ecfg(ECFG0_IM); + clear_csr_estat(ESTATF_IP); + + cpuintc_handle = irq_domain_alloc_fwnode(NULL); + irq_domain = irq_domain_create_linear(cpuintc_handle, EXCCODE_INT_NUM, + &loongarch_cpu_intc_irq_domain_ops, NULL); + + if (!irq_domain) + panic("Failed to add irqdomain for LoongArch CPU"); + + set_handle_irq(&handle_cpu_irq); + acpi_cascade_irqdomain_init(); + + return 0; +} + +IRQCHIP_ACPI_DECLARE(cpuintc_v1, ACPI_MADT_TYPE_CORE_PIC, + NULL, ACPI_MADT_CORE_PIC_VERSION_V1, cpuintc_acpi_init); From e8bba72b396cef7c919c73710f3c5884521adb4e Mon Sep 17 00:00:00 2001 From: Jianmin Lv Date: Wed, 20 Jul 2022 18:51:32 +0800 Subject: [PATCH 295/298] irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch For LoongArch, ACPI_IRQ_MODEL_LPIC is introduced, and then the callback acpi_get_gsi_domain_id and acpi_gsi_to_irq_fallback are implemented. The acpi_get_gsi_domain_id callback returns related fwnode handle of irqdomain for different GSI range. The acpi_gsi_to_irq_fallback will create new mapping for gsi when the mapping of it is not found. Signed-off-by: Jianmin Lv Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/1658314292-35346-14-git-send-email-lvjianmin@loongson.cn --- drivers/acpi/bus.c | 3 +++ drivers/irqchip/irq-loongarch-cpu.c | 37 +++++++++++++++++++++++++++++ include/linux/acpi.h | 1 + 3 files changed, 41 insertions(+) diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index 86fa61a21826..63fbf0022d39 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -1145,6 +1145,9 @@ static int __init acpi_bus_init_irq(void) case ACPI_IRQ_MODEL_PLATFORM: message = "platform specific model"; break; + case ACPI_IRQ_MODEL_LPIC: + message = "LPIC"; + break; default: pr_info("Unknown interrupt routing model\n"); return -ENODEV; diff --git a/drivers/irqchip/irq-loongarch-cpu.c b/drivers/irqchip/irq-loongarch-cpu.c index 28ddc60c8608..327f3ab62c03 100644 --- a/drivers/irqchip/irq-loongarch-cpu.c +++ b/drivers/irqchip/irq-loongarch-cpu.c @@ -16,6 +16,41 @@ static struct irq_domain *irq_domain; struct fwnode_handle *cpuintc_handle; +static u32 lpic_gsi_to_irq(u32 gsi) +{ + /* Only pch irqdomain transferring is required for LoongArch. */ + if (gsi >= GSI_MIN_PCH_IRQ && gsi <= GSI_MAX_PCH_IRQ) + return acpi_register_gsi(NULL, gsi, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH); + + return 0; +} + +static struct fwnode_handle *lpic_get_gsi_domain_id(u32 gsi) +{ + int id; + struct fwnode_handle *domain_handle = NULL; + + switch (gsi) { + case GSI_MIN_CPU_IRQ ... GSI_MAX_CPU_IRQ: + if (liointc_handle) + domain_handle = liointc_handle; + break; + + case GSI_MIN_LPC_IRQ ... GSI_MAX_LPC_IRQ: + if (pch_lpc_handle) + domain_handle = pch_lpc_handle; + break; + + case GSI_MIN_PCH_IRQ ... GSI_MAX_PCH_IRQ: + id = find_pch_pic(gsi); + if (id >= 0 && pch_pic_handle[id]) + domain_handle = pch_pic_handle[id]; + break; + } + + return domain_handle; +} + static void mask_loongarch_irq(struct irq_data *d) { clear_csr_ecfg(ECFGF(d->hwirq)); @@ -102,6 +137,8 @@ static int __init cpuintc_acpi_init(union acpi_subtable_headers *header, panic("Failed to add irqdomain for LoongArch CPU"); set_handle_irq(&handle_cpu_irq); + acpi_set_irq_model(ACPI_IRQ_MODEL_LPIC, lpic_get_gsi_domain_id); + acpi_set_gsi_to_irq_fallback(lpic_gsi_to_irq); acpi_cascade_irqdomain_init(); return 0; diff --git a/include/linux/acpi.h b/include/linux/acpi.h index e2b60d53ca60..76520f379313 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -105,6 +105,7 @@ enum acpi_irq_model_id { ACPI_IRQ_MODEL_IOSAPIC, ACPI_IRQ_MODEL_PLATFORM, ACPI_IRQ_MODEL_GIC, + ACPI_IRQ_MODEL_LPIC, ACPI_IRQ_MODEL_COUNT }; From c904cda04482d5ab545e5a82cee6084078ef9543 Mon Sep 17 00:00:00 2001 From: Paran Lee Date: Sun, 10 Jul 2022 20:26:14 +0900 Subject: [PATCH 296/298] genirq: Use for_each_action_of_desc in actions_show() Refactor action_show() to use for_each_action_of_desc instead of a similar open-coded loop. Signed-off-by: Paran Lee [maz: reword commit message] Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220710112614.19410-1-p4ranlee@gmail.com --- kernel/irq/irqdesc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index d323b180b0f3..5db0230aa6b5 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -251,7 +251,7 @@ static ssize_t actions_show(struct kobject *kobj, char *p = ""; raw_spin_lock_irq(&desc->lock); - for (action = desc->action; action != NULL; action = action->next) { + for_each_action_of_desc(desc, action) { ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%s%s", p, action->name); p = ","; From 71349cc85e5930dce78ed87084dee098eba24b59 Mon Sep 17 00:00:00 2001 From: William Dean Date: Sat, 23 Jul 2022 18:01:28 +0800 Subject: [PATCH 297/298] irqchip/mips-gic: Check the return value of ioremap() in gic_of_init() The function ioremap() in gic_of_init() can fail, so its return value should be checked. Reported-by: Hacash Robot Signed-off-by: William Dean Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220723100128.2964304-1-williamsukatube@163.com --- drivers/irqchip/irq-mips-gic.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c index ff89b36267dd..a1f6d955794a 100644 --- a/drivers/irqchip/irq-mips-gic.c +++ b/drivers/irqchip/irq-mips-gic.c @@ -734,6 +734,10 @@ static int __init gic_of_init(struct device_node *node, } mips_gic_base = ioremap(gic_base, gic_len); + if (!mips_gic_base) { + pr_err("Failed to ioremap gic_base\n"); + return -ENOMEM; + } gicconfig = read_gic_config(); gic_shared_intrs = FIELD_GET(GIC_CONFIG_NUMINTERRUPTS, gicconfig); From 9d9b010f12cc4e254bbba32d79c965b564a8957f Mon Sep 17 00:00:00 2001 From: Ben Dooks Date: Sun, 24 Jul 2022 23:21:52 +0100 Subject: [PATCH 298/298] irqchip/mmp: Declare init functions in common header file The functions icu_init_irq and mmp2_init_icu are exported from this code, so declare them in the header file to avoid the following sparse warnings: drivers/irqchip/irq-mmp.c:248:13: warning: symbol 'icu_init_irq' was not declared. Should it be static? drivers/irqchip/irq-mmp.c:271:13: warning: symbol 'mmp2_init_icu' was not declared. Should it be static? Signed-off-by: Ben Dooks [maz: fixup commit message] Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220724222152.551850-1-ben-linux@fluff.org --- arch/arm/mach-mmp/mmp2.h | 2 +- arch/arm/mach-mmp/pxa168.h | 2 +- arch/arm/mach-mmp/pxa910.h | 2 +- include/linux/irqchip/mmp.h | 3 +++ 4 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/arm/mach-mmp/mmp2.h b/arch/arm/mach-mmp/mmp2.h index 3ebc1bb13f71..7f80b90248fb 100644 --- a/arch/arm/mach-mmp/mmp2.h +++ b/arch/arm/mach-mmp/mmp2.h @@ -5,13 +5,13 @@ #include extern void mmp2_timer_init(void); -extern void __init mmp2_init_icu(void); extern void __init mmp2_init_irq(void); extern void mmp2_clear_pmic_int(void); #include #include #include +#include #include "devices.h" diff --git a/arch/arm/mach-mmp/pxa168.h b/arch/arm/mach-mmp/pxa168.h index 34f907cd165a..c1547e098f09 100644 --- a/arch/arm/mach-mmp/pxa168.h +++ b/arch/arm/mach-mmp/pxa168.h @@ -5,7 +5,6 @@ #include extern void pxa168_timer_init(void); -extern void __init icu_init_irq(void); extern void __init pxa168_init_irq(void); extern void pxa168_restart(enum reboot_mode, const char *); extern void pxa168_clear_keypad_wakeup(void); @@ -18,6 +17,7 @@ extern void pxa168_clear_keypad_wakeup(void); #include #include #include +#include #include "devices.h" diff --git a/arch/arm/mach-mmp/pxa910.h b/arch/arm/mach-mmp/pxa910.h index 6ace5a8aa15b..7d229214065a 100644 --- a/arch/arm/mach-mmp/pxa910.h +++ b/arch/arm/mach-mmp/pxa910.h @@ -3,13 +3,13 @@ #define __ASM_MACH_PXA910_H extern void pxa910_timer_init(void); -extern void __init icu_init_irq(void); extern void __init pxa910_init_irq(void); #include #include #include #include