46131 Commits

Author SHA1 Message Date
Linus Torvalds
27fd80851d Misc x86 fixes:
- Follow up fixes for the BHI mitigations code.
 
  - Fix !SPECULATION_MITIGATIONS bug not turning off
    mitigations as expected.
 
  - Work around an APIC emulation bug when the kernel is built with
    Clang and run as a SEV guest.
 
  - Follow up x86 topology fixes.
 
 Note that there's minor cleanups included in the BHI fixes,
 which we'd normally delay to the next merge window, but the
 BHI mitigations code is new and will be backported widely,
 so we thought it would be better to have a unified codebase
 at this stage. (Let me know if that assumption is wrong and
 I'll rebase it.)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbmmkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iArg//R2VYSVgAfUf4cN3rTm8RuX8U/8V4A1C/
 FklDWziEWTw8DhQCidiIt7ETuMjvOf7HFzVTfLP+oOiBxHGMCEjxtf74mHRaeZ+f
 avPvwOfp3mrIdn/h2+t7pSYxZDf310EDvarliIXq0z2hmNjmQjKXo3dyWNiWJe9i
 LFzST8dRyR0Tg4MjLuY2g9ZEauRHpY6aAVk9UrRi7uaiLyoHnLBASn1DCL1pmMhT
 cqKfHhilkeUMYwTXbDTu1iIQwBHqcpCmUp2h6VPpxPkTxJXwzo0E4lVuFVu7tzUM
 yrMZrNxhtC5Cg9NVF56AqZocKjutlJfWXnmuvpc+dM7z1dF/EwFbx2ZM+3PDuw8Z
 uODs6bVzYlhzWxTMb9Obsp3RvHe1B7ZCFCZ8uo3G9lMFXqu047UwfuwqnZw4YpX6
 CDEKz3zLbV4s64HPlvbju3CpX+m9CXhg5cR6HfCJiwYCytb1bxZU0ottnPnYxVCb
 9Th7wi2f1sCZvtQ/T8aZpF7FhYe7abt/CDvlDoDxiRg1f+Z2nduznfDJH81nn1KS
 duZEicG0O+iYi9OCPVBTVutRxSWA6D8D4F0VN7cDRr4QHyXpOFJ1KsZyHBbZo9I6
 ubp9RiFNaocgdWiWVgEpvvmlYpVPDnR38oVHNPpNAyIYHu3mlCoS76znZTNX8WfJ
 GlvHqOp+EM0=
 =2ig7
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Follow up fixes for the BHI mitigations code

 - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
   expected

 - Work around an APIC emulation bug when the kernel is built with Clang
   and run as a SEV guest

 - Follow up x86 topology fixes

* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Move TOPOEXT enablement into the topology parser
  x86/cpu/amd: Make the NODEID_MSR union actually work
  x86/cpu/amd: Make the CPUID 0x80000008 parser correct
  x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
  x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
  x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
  x86/bugs: Fix BHI handling of RRSBA
  x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
  x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
  x86/bugs: Fix BHI documentation
  x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
  x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
  x86/bugs: Fix return type of spectre_bhi_state()
  x86/apic: Force native_apic_mem_read() to use the MOV instruction
2024-04-14 10:48:51 -07:00
Linus Torvalds
a1505c47e7 Fix the x86 PMU multi-counter code returning invalid
data in certain circumstances.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbkE8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gZURAAjtG4+vnthjGFIC9HSKPFa95aOhAd0tCE
 NfW8HDTuA1zhK0MoKU2cQkiE/1LVcrSS60UomYUhwB7Hrdx+HCz81W9CMrwyzkY1
 0MhiUiq2cSYo0FsWflzipOoFffyY2THrlimqILQmhzo5JlbYA8WNTxAWs6Th651e
 an5GZJvPq98LhhWEbULiiT+2GGYi4CtbVtZvo+FyCJvYcgxF70EeJ/jvxBUax4yW
 23r+uxruo3kuiJoBAs3+OBWt/C+Ij/YF9tKqOW1XdXpPxYAbGKtYw2Ck0EIkSVHS
 kilU0ig4TMRCUmUXXSqWnTxheZBF7eXwu9cMKdnqydLjazuO5M8uh+yiZEuA/UFA
 I5YGBmuYe/Q+CaZmobFtRyvRZZE3kF1xTxyyw2vJMDjbW1FOppXPrdFrnbav5/h0
 pDxBz7H5f6L/Egyi173cRY8dzcKTjP0adtczb1M5Q6BxnuCO4I0P7RUIg41tabxg
 YuojtmIGAjhMXVXmC9VUabVDzB2o5vYXbZ4xTM6uG5XG7/2IX0nOJidwhjotw+1D
 6LK4U1kcaY8kqolYiUt5zyEc96CPvFwM7w8592Ohxs5arROYEgz0QnjGnmRu8OsP
 yV5U8lz+aoX2Cwh6n06XJB5GJqWX5BzkJ+D9q/mRqPRL8/BDscD2WcHkKdRE89T6
 YkecmTrHYD0=
 =L86s
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix the x86 PMU multi-counter code returning invalid data in certain
  circumstances"

* tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix out of range data
2024-04-14 10:26:27 -07:00
Thomas Gleixner
7211274fe0 x86/cpu/amd: Move TOPOEXT enablement into the topology parser
The topology rework missed that early_init_amd() tries to re-enable the
Topology Extensions when the BIOS disabled them.

The new parser is invoked before early_init_amd() so the re-enable attempt
happens too late.

Move it into the AMD specific topology parser code where it belongs.

Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
2024-04-12 12:05:54 +02:00
Thomas Gleixner
c064b536a8 x86/cpu/amd: Make the NODEID_MSR union actually work
A system with NODEID_MSR was reported to crash during early boot without
any output.

The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.

As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.

The magic world of C, unions and bitfields:

    union {
    	  u64   bita : 3,
	        bitb : 3;
	  u64   all;
    } x;

    x.all = foo();

    a = x.bita;
    b = x.bitb;

results in the effective executable code of:

   a = b = x.bita;

because bita and bitb are treated as union members and therefore both end
up at bit offset 0.

Wrapping the bitfield into an anonymous struct:

    union {
    	  struct {
    	     u64  bita : 3,
	          bitb : 3;
          };
	  u64	  all;
    } x;

works like expected.

Rework the NODEID_MSR union in exactly that way to cure the problem.

Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de
Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
2024-04-12 12:05:54 +02:00
Thomas Gleixner
1b3108f689 x86/cpu/amd: Make the CPUID 0x80000008 parser correct
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.

That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.

Preset the SMT domain level with the assumption of one thread per core,
which is correct ifrt here are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.

Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
2024-04-12 12:05:54 +02:00
Josh Poimboeuf
4f511739c5 x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
For consistency with the other CONFIG_MITIGATION_* options, replace the
CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
CONFIG_MITIGATION_SPECTRE_BHI option.

[ mingo: Fix ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org
2024-04-12 12:05:54 +02:00
Josh Poimboeuf
36d4fe147c x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
Unlike most other mitigations' "auto" options, spectre_bhi=auto only
mitigates newer systems, which is confusing and not particularly useful.

Remove it.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
2024-04-12 12:05:54 +02:00
Linus Torvalds
52e5070f60 hyperv-fixes for v6.9-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmYYYPkTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXnxhB/4/8c7lFT53VbujFmVA5sNvpP5Ji5Xg
 ERhVID7tzDyaVRPr+tpPJIW0Oj/t34SH9seoTYCBCM/UABWe/Gxceg2JaoOzIx+l
 LHi73T4BBaqExiXbCCFj8N7gLO5P4Xz6ZZRgwHws1KmMXsiWYmYsbv36eSv9x6qK
 +z/n6p9/ubKFNj2/vsvfiGmY0XHayD3NM4Y4toMbYE/tuRT8uZ7D5sqWdRf+UhW/
 goRDA5qppeSfuaQu2LNVoz1e6wRmeJFv8OHgaPvQqAjTRLzPwwss28HICmKc8gh3
 HDDUUJCHSs1XItSGDFip6rIFso5X/ZHO0d6pV75hOKCisd7lV0qH6NIZ
 =k62H
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Some cosmetic changes (Erni Sri Satya Vennela, Li Zhijian)

 - Introduce hv_numa_node_to_pxm_info() (Nuno Das Neves)

 - Fix KVP daemon to handle IPv4 and IPv6 combination for keyfile format
   (Shradha Gupta)

 - Avoid freeing decrypted memory in a confidential VM (Rick Edgecombe
   and Michael Kelley)

* tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: Don't free ring buffers that couldn't be re-encrypted
  uio_hv_generic: Don't free decrypted memory
  hv_netvsc: Don't free decrypted memory
  Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl
  Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails
  hv/hv_kvp_daemon: Handle IPv4 and Ipv6 combination for keyfile format
  hv: vmbus: Convert sprintf() family to sysfs_emit() family
  mshyperv: Introduce hv_numa_node_to_pxm_info()
  x86/hyperv: Cosmetic changes for hv_apic.c
2024-04-11 16:23:56 -07:00
Josh Poimboeuf
5f882f3b0a x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
While syscall hardening helps prevent some BHI attacks, there's still
other low-hanging fruit remaining.  Don't classify it as a mitigation
and make it clear that the system may still be vulnerable if it doesn't
have a HW or SW mitigation enabled.

Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Josh Poimboeuf
1cea8a280d x86/bugs: Fix BHI handling of RRSBA
The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been
disabled by the Spectre v2 mitigation (or can otherwise be disabled by
the BHI mitigation itself if needed).  In that case retpolines are fine.

Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Ingo Molnar
d0485730d2 x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
So we are using the 'ia32_cap' value in a number of places,
which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register.

But there's very little 'IA32' about it - this isn't 32-bit only
code, nor does it originate from there, it's just a historic
quirk that many Intel MSR names are prefixed with IA32_.

This is already clear from the helper method around the MSR:
x86_read_arch_cap_msr(), which doesn't have the IA32 prefix.

So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with
its role and with the naming of the helper function.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Josh Poimboeuf
cb2db5bb04 x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and
over.  It's even read in the BHI sysfs function which is a big no-no.
Just read it once and cache it.

Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Thomas Gleixner
a9025cd1c6 x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
topo_set_cpuids() updates cpu_present_map and cpu_possible map. It is
invoked during enumeration and "physical hotplug" operations. In the
latter case this results in a kernel crash because cpu_possible_map is
marked read only after init completes.

There is no reason to update cpu_possible_map in that function. During
enumeration cpu_possible_map is not relevant and gets fully initialized
after enumeration completed. On "physical hotplug" the bit is already set
because the kernel allows only CPUs to be plugged which have been
enumerated and associated to a CPU number during early boot.

Remove the bogus update of cpu_possible_map.

Fixes: 0e53e7b656cf ("x86/cpu/topology: Sanitize the APIC admission logic")
Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx
2024-04-10 15:31:38 +02:00
Daniel Sneddon
04f4230e2f x86/bugs: Fix return type of spectre_bhi_state()
The definition of spectre_bhi_state() incorrectly returns a const char
* const. This causes the a compiler warning when building with W=1:

 warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
 2812 | static const char * const spectre_bhi_state(void)

Remove the const qualifier from the pointer.

Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com
2024-04-10 07:05:04 +02:00
Ingo Molnar
a40d2525ea Merge branch 'linus' into x86/urgent, to pick up dependent commits
Prepare to fix aspects of the new BHI code.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-10 07:04:04 +02:00
Namhyung Kim
dec8ced871 perf/x86: Fix out of range data
On x86 each struct cpu_hw_events maintains a table for counter assignment but
it missed to update one for the deleted event in x86_pmu_del().  This
can make perf_clear_dirty_counters() reset used counter if it's called
before event scheduling or enabling.  Then it would return out of range
data which doesn't make sense.

The following code can reproduce the problem.

  $ cat repro.c
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>

  struct perf_event_attr attr = {
  	.type = PERF_TYPE_HARDWARE,
  	.config = PERF_COUNT_HW_CPU_CYCLES,
  	.disabled = 1,
  };

  void *worker(void *arg)
  {
  	int cpu = (long)arg;
  	int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
  	int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
  	void *p;

  	do {
  		ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
  		p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0);
  		ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);

  		ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0);
  		munmap(p, 4096);
  		ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0);
  	} while (1);

  	return NULL;
  }

  int main(void)
  {
  	int i;
  	int n = sysconf(_SC_NPROCESSORS_ONLN);
  	pthread_t *th = calloc(n, sizeof(*th));

  	for (i = 0; i < n; i++)
  		pthread_create(&th[i], NULL, worker, (void *)(long)i);
  	for (i = 0; i < n; i++)
  		pthread_join(th[i], NULL);

  	free(th);
  	return 0;
  }

And you can see the out of range data using perf stat like this.
Probably it'd be easier to see on a large machine.

  $ gcc -o repro repro.c -pthread
  $ ./repro &
  $ sudo perf stat -A -I 1000 2>&1 | awk '{ if (length($3) > 15) print }'
       1.001028462 CPU6   196,719,295,683,763      cycles                           # 194290.996 GHz                       (71.54%)
       1.001028462 CPU3   396,077,485,787,730      branch-misses                    # 15804359784.80% of all branches      (71.07%)
       1.001028462 CPU17  197,608,350,727,877      branch-misses                    # 14594186554.56% of all branches      (71.22%)
       2.020064073 CPU4   198,372,472,612,140      cycles                           # 194681.113 GHz                       (70.95%)
       2.020064073 CPU6   199,419,277,896,696      cycles                           # 195720.007 GHz                       (70.57%)
       2.020064073 CPU20  198,147,174,025,639      cycles                           # 194474.654 GHz                       (71.03%)
       2.020064073 CPU20  198,421,240,580,145      stalled-cycles-frontend          #  100.14% frontend cycles idle        (70.93%)
       3.037443155 CPU4   197,382,689,923,416      cycles                           # 194043.065 GHz                       (71.30%)
       3.037443155 CPU20  196,324,797,879,414      cycles                           # 193003.773 GHz                       (71.69%)
       3.037443155 CPU5   197,679,956,608,205      stalled-cycles-backend           # 1315606428.66% backend cycles idle   (71.19%)
       3.037443155 CPU5   198,571,860,474,851      instructions                     # 13215422.58  insn per cycle

It should move the contents in the cpuc->assign as well.

Fixes: 5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org
2024-04-10 06:12:01 +02:00
Linus Torvalds
2bb69f5fc7 x86 mitigations for the native BHI hardware vulnerabilty:
Branch History Injection (BHI) attacks may allow a malicious application to
 influence indirect branch prediction in kernel by poisoning the branch
 history. eIBRS isolates indirect branch targets in ring0.  The BHB can
 still influence the choice of indirect branch predictor entry, and although
 branch predictor entries are isolated between modes when eIBRS is enabled,
 the BHB itself is not isolated between modes.
 
 Add mitigations against it either with the help of microcode or with
 software sequences for the affected CPUs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYUKPMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofT8EACJJix+GzGUcJjOvfWFZcxwziY152hO
 5XSzHOZZL6oz5Yk/Rye/S9RVTN7aDjn1CEvI0cD/ULxaTP869sS9dDdUcHhEJ//5
 6hjqWsWiKc1QmLjBy3Pcb97GZHQXM5a9D1f6jXnJD+0FMLbQHpzSEBit0H4tv/TC
 75myGgYihvUbhN9/bL10M5fz+UADU42nChvPWDMr9ukljjCqa46tPTmKUIAW5TWj
 /xsyf+Nk+4kZpdaidKGhpof6KCV2rNeevvzUGN8Pv5y13iAmvlyplqTcQ6dlubnZ
 CuDX5Ji9spNF9WmhKpLgy5N+Ocb64oVHov98N2zw1sT1N8XOYcSM0fBj7SQIFURs
 L7T4jBZS+1c3ZGJPPFWIaGjV8w1ZMhelglwJxjY7ZgRD6fK3mwRx/ks54J8H4HjE
 FbirXaZLeKlscDIOKtnxxKoIGwpdGwLKQYi/wEw7F9NhCLSj9wMia+j3uYIUEEHr
 6xEiYEtyjcV3ocxagH7eiHyrasOKG64vjx2h1XodusBA2Wrvgm/jXlchUu+wb6B4
 LiiZJt+DmOdQ1h5j3r2rt3hw7+nWa7kyq34qfN6NSUCHiedp6q7BClueSaKiOCGk
 RoNibNiS+CqaxwGxj/RGuvajEJeEMCsLuCxzT3aeaDBsqscW6Ka/HkGA76Tpb5nJ
 E3JyjYE7AlG4rw==
 =W0W3
 -----END PGP SIGNATURE-----

Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mitigations from Thomas Gleixner:
 "Mitigations for the native BHI hardware vulnerabilty:

  Branch History Injection (BHI) attacks may allow a malicious
  application to influence indirect branch prediction in kernel by
  poisoning the branch history. eIBRS isolates indirect branch targets
  in ring0. The BHB can still influence the choice of indirect branch
  predictor entry, and although branch predictor entries are isolated
  between modes when eIBRS is enabled, the BHB itself is not isolated
  between modes.

  Add mitigations against it either with the help of microcode or with
  software sequences for the affected CPUs"

[ This also ends up enabling the full mitigation by default despite the
  system call hardening, because apparently there are other indirect
  calls that are still sufficiently reachable, and the 'auto' case just
  isn't hardened enough.

  We'll have some more inevitable tweaking in the future    - Linus ]

* tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Add BHI_NO
  x86/bhi: Mitigate KVM by default
  x86/bhi: Add BHI mitigation knob
  x86/bhi: Enumerate Branch History Injection (BHI) bug
  x86/bhi: Define SPEC_CTRL_BHI_DIS_S
  x86/bhi: Add support for clearing branch history at syscall entry
  x86/syscall: Don't force use of indirect calls for system calls
  x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08 20:07:51 -07:00
Daniel Sneddon
ed2e8d49b5 KVM: x86: Add BHI_NO
Intel processors that aren't vulnerable to BHI will set
MSR_IA32_ARCH_CAPABILITIES[BHI_NO] = 1;. Guests may use this BHI_NO bit to
determine if they need to implement BHI mitigations or not.  Allow this bit
to be passed to the guests.

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Pawan Gupta
95a6ccbdc7 x86/bhi: Mitigate KVM by default
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.

Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.

Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Pawan Gupta
ec9404e40e x86/bhi: Add BHI mitigation knob
Branch history clearing software sequences and hardware control
BHI_DIS_S were defined to mitigate Branch History Injection (BHI).

Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation:

 auto - Deploy the hardware mitigation BHI_DIS_S, if available.
 on   - Deploy the hardware mitigation BHI_DIS_S, if available,
        otherwise deploy the software sequence at syscall entry and
	VMexit.
 off  - Turn off BHI mitigation.

The default is auto mode which does not deploy the software sequence
mitigation.  This is because of the hardening done in the syscall
dispatch path, which is the likely target of BHI.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Pawan Gupta
be482ff950 x86/bhi: Enumerate Branch History Injection (BHI) bug
Mitigation for BHI is selected based on the bug enumeration. Add bits
needed to enumerate BHI bug.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Daniel Sneddon
0f4a837615 x86/bhi: Define SPEC_CTRL_BHI_DIS_S
Newer processors supports a hardware control BHI_DIS_S to mitigate
Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel
from userspace BHI attacks without having to manually overwrite the
branch history.

Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL.
Mitigation is enabled later.

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Pawan Gupta
7390db8aea x86/bhi: Add support for clearing branch history at syscall entry
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0.  The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.

Alder Lake and new processors supports a hardware control BHI_DIS_S to
mitigate BHI.  For older processors Intel has released a software sequence
to clear the branch history on parts that don't support BHI_DIS_S. Add
support to execute the software sequence at syscall entry and VMexit to
overwrite the branch history.

For now, branch history is not cleared at interrupt entry, as malicious
applications are not believed to have sufficient control over the
registers, since previous register state is cleared at interrupt
entry. Researchers continue to poke at this area and it may become
necessary to clear at interrupt entry as well in the future.

This mitigation is only defined here. It is enabled later.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Linus Torvalds
1e3ad78334 x86/syscall: Don't force use of indirect calls for system calls
Make <asm/syscall.h> build a switch statement instead, and the compiler can
either decide to generate an indirect jump, or - more likely these days due
to mitigations - just a series of conditional branches.

Yes, the conditional branches also have branch prediction, but the branch
prediction is much more controlled, in that it just causes speculatively
running the wrong system call (harmless), rather than speculatively running
possibly wrong random less controlled code gadgets.

This doesn't mitigate other indirect calls, but the system call indirection
is the first and most easily triggered case.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Josh Poimboeuf
0cd01ac5dc x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
Change the format of the 'spectre_v2' vulnerabilities sysfs file
slightly by converting the commas to semicolons, so that mitigations for
future variants can be grouped together and separated by commas.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2024-04-08 19:27:05 +02:00
Adam Dunlap
5ce344beac x86/apic: Force native_apic_mem_read() to use the MOV instruction
When done from a virtual machine, instructions that touch APIC memory
must be emulated. By convention, MMIO accesses are typically performed
via io.h helpers such as readl() or writeq() to simplify instruction
emulation/decoding (ex: in KVM hosts and SEV guests) [0].

Currently, native_apic_mem_read() does not follow this convention,
allowing the compiler to emit instructions other than the MOV
instruction generated by readl(). In particular, when the kernel is
compiled with clang and run as a SEV-ES or SEV-SNP guest, the compiler
would emit a TESTL instruction which is not supported by the SEV-ES
emulator, causing a boot failure in that environment. It is likely the
same problem would happen in a TDX guest as that uses the same
instruction emulator as SEV-ES.

To make sure all emulators can emulate APIC memory reads via MOV, use
the readl() function in native_apic_mem_read(). It is expected that any
emulator would support MOV in any addressing mode as it is the most
generic and is what is usually emitted currently.

The TESTL instruction is emitted when native_apic_mem_read() is inlined
into apic_mem_wait_icr_idle(). The emulator comes from
insn_decode_mmio() in arch/x86/lib/insn-eval.c. It's not worth it to
extend insn_decode_mmio() to support more instructions since, in theory,
the compiler could choose to output nearly any instruction for such
reads which would bloat the emulator beyond reason.

  [0] https://lore.kernel.org/all/20220405232939.73860-12-kirill.shutemov@linux.intel.com/

  [ bp: Massage commit message, fix typos. ]

Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Kevin Loughlin <kevinloughlin@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20240318230927.2191933-1-acdunlap@google.com
2024-04-08 15:37:57 +02:00
Linus Torvalds
9fe30842a9 Miscellaneous x86 fixes:
- Fix MCE timer reinit locking
  - Fix/improve CoCo guest random entropy pool init
  - Fix SEV-SNP late disable bugs
  - Fix false positive objtool build warning
  - Fix header dependency bug
  - Fix resctrl CPU offlining bug
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSVU8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gUxA//cbhkLCyHnkM7XWPRhEMD5ukhys7rppeD
 4UdRrMSwQPwZSjUVWPtY/0wL6MHerVq6ghZC483EAuwKXF2y2O3ug3MKAJP0jyaM
 Ap3sZ+RCZljNUg2Rs35LAIUMdRRrGBzWgi8rnxF7JYT9F11GftjNnzyTTKORJXA0
 NQ7633ZBlIeDk21OOkdQq3GkwaYEbUwH69AdOtxKeyT1jOOoWbF6agt/nIw/mdgv
 UkHEVhw4ySvG5Gcj+h77XiWmY2nz/+iDJ763UerMACMlL2niKG9h3Q1ASFeSFDN+
 TQejy/uHRjuHCCG83ebtKd921z6128e2p5g3wjAn7gXBx8ERb8+3CRxOalPrkXC2
 OhcR74lxDqent2OupXqjp1ndLXwKCpnNYUiR/PIhFNjUAE+JQJ4wDTZ6VxL841sb
 t/U6V35/8SIQ52wMv48P2SlaYzUhZHDl4AZ0AVqrRNKAWLH0sOQneEMo+dYRnfwu
 L3uRCLkzs/r0g7dyJjzYdAbilFnIDdCBPzyf//gPBG31XJMZUmYdtQ8vqj8ZDfYd
 qVq98nowRaL16IHwHvoRiP7tBOvX+mV7ariTpASfIPz5mD9dbOsBYJ+N4KSeV5t4
 9oOxwzoaEg9Z9erGfB52aG7Bwzi4XS4wfjrIxqqUXO1bpYFYsmvQe6+Ptgp3QMot
 OGzpalHZous=
 =CIt3
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix MCE timer reinit locking

 - Fix/improve CoCo guest random entropy pool init

 - Fix SEV-SNP late disable bugs

 - Fix false positive objtool build warning

 - Fix header dependency bug

 - Fix resctrl CPU offlining bug

* tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
  x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
  x86/CPU/AMD: Track SNP host status with cc_platform_*()
  x86/cc: Add cc_platform_set/_clear() helpers
  x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
  x86/coco: Require seeding RNG with RDRAND on CoCo systems
  x86/numa/32: Include missing <asm/pgtable_areas.h>
  x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline
2024-04-07 09:33:21 -07:00
Linus Torvalds
e2948effa9 Fix a combined PEBS events bug on x86 Intel CPUs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSRmYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gGLg/9GQS5B/kub1ydjoFx2sCuHe5Zl9vLXpLg
 515yOSKRWnpzjDJaMlAJg9MHiNlcR2Df5rcnVLaKpQSomajFQBKpsJkX4skqAUm3
 wuuCwP0Wc+XJFNDzHGb5yqWaCo06XI80BnZyGHDHDD6jJ5fypzKVyJERnGBpLwjO
 h2YTomtLP5j5Gq7em2d+A9pVRpKwcBOCB8K2sBnJRlPNVR190MWTOm1wEQ+vYQeR
 x8nIx7WaaA6SInqWVGkhloasTmeWEH89Q2wjCltZpYFnRiEa1yS/VHVT6ZQKrpOy
 +mBqr92tXdxT2Y+8LNAMUg5PgRVMbZoY+Glin0Q0N4Cg92BZIl8NX8wtb3oacYgd
 XhiRyRWaw8JDCC2mEmhlEa01M2Y7PXtcBjvOVQwoZLS/711Zyf+fHjyX4FUG8Vcb
 T0PgaQoterlVnN4H2uWq8Za8ubjI0TW0nRBw2oQKlSv/5ldJ2IKJsQdsbl1q4wQr
 TtYJY2bq5Hrn+qlFZi6jFB2KvBOUV3molXlZAPJ0Nr/Y9mkMBRVcq6ufrAunpgUB
 l62Ls61HHZ9+hVNIIpM8/p/rTYjeVilA7vjHGiCJFcsclvPNBBkobtbpri/ioE0t
 4+pH60LsMBIwAhOAKlJ6Jzf4LaWEikJfDPpj8yMKixGOxDT542rUN7A3NwdP2H6b
 2fV3Nyr7sEw=
 =v7jY
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf fix from Ingo Molnar:
 "Fix a combined PEBS events bug on x86 Intel CPUs"

* tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
2024-04-07 09:14:46 -07:00
Borislav Petkov (AMD)
b377c66ae3 x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:

  vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0

Fixes: 4535e1a4174c ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
2024-04-06 13:01:50 +02:00
Ingo Molnar
5f2ca44ed2 Merge branch 'linus' into x86/urgent, to pick up dependent commit
We want to fix:

  0e110732473e ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")

So merge in Linus's latest into x86/urgent to have it available.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-06 13:00:32 +02:00
Linus Torvalds
af709adfaa 8 hotfixes, 3 are cc:stable
There are a couple of fixups for this cycle's vmalloc changes and one for
 the stackdepot changes.  And a fix for a very old x86 PAT issue which can
 cause a warning splat.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZhBEXAAKCRDdBJ7gKXxA
 ju9fAQCxPdqApKQ49IAJpUtMcRJmI594dmB+/CfrFgiS+GaQcwEA1+2SidI9fQWT
 R/fcKrRr4+zlgQw0T0aSDR1HBLUPxw0=
 =rkxT
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "8 hotfixes, 3 are cc:stable

  There are a couple of fixups for this cycle's vmalloc changes and one
  for the stackdepot changes. And a fix for a very old x86 PAT issue
  which can cause a warning splat"

* tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  stackdepot: rename pool_index to pool_index_plus_1
  x86/mm/pat: fix VM_PAT handling in COW mappings
  MAINTAINERS: change vmware.com addresses to broadcom.com
  selftests/mm: include strings.h for ffsl
  mm: vmalloc: fix lockdep warning
  mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
  init: open output files from cpio unpacking with O_LARGEFILE
  mm/secretmem: fix GUP-fast succeeding on secretmem folios
2024-04-05 13:30:01 -07:00
David Hildenbrand
04c35ab3bd x86/mm/pat: fix VM_PAT handling in COW mappings
PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios.  Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode.  We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.

For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
 #include <liburing.h>

 int main(void)
 {
         struct io_uring_params p = {};
         int ring_fd;
         size_t size;
         char *map;

         ring_fd = io_uring_setup(1, &p);
         if (ring_fd < 0) {
                 perror("io_uring_setup");
                 return 1;
         }
         size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

         /* Map the submission queue ring MAP_PRIVATE */
         map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                    ring_fd, IORING_OFF_SQ_RING);
         if (map == MAP_FAILED) {
                 perror("mmap");
                 return 1;
         }

         /* We have at least one page. Let's COW it. */
         *map = 0;
         pause();
         return 0;
 }
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
 # ./iouring &
 # memhog 16G
 # killall iouring
[  301.552930] ------------[ cut here ]------------
[  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[  301.565725] PKRU: 55555554
[  301.565944] Call Trace:
[  301.566148]  <TASK>
[  301.566325]  ? untrack_pfn+0xf4/0x100
[  301.566618]  ? __warn+0x81/0x130
[  301.566876]  ? untrack_pfn+0xf4/0x100
[  301.567163]  ? report_bug+0x171/0x1a0
[  301.567466]  ? handle_bug+0x3c/0x80
[  301.567743]  ? exc_invalid_op+0x17/0x70
[  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
[  301.568363]  ? untrack_pfn+0xf4/0x100
[  301.568660]  ? untrack_pfn+0x65/0x100
[  301.568947]  unmap_single_vma+0xa6/0xe0
[  301.569247]  unmap_vmas+0xb5/0x190
[  301.569532]  exit_mmap+0xec/0x340
[  301.569801]  __mmput+0x3e/0x130
[  301.570051]  do_exit+0x305/0xaf0
...

Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05 11:21:31 -07:00
Sean Christopherson
8cb4a9a82b x86/cpufeatures: Add CPUID_LNX_5 to track recently added Linux-defined word
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5.  KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.

Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.

No functional change intended.

Fixes: 7f274e609f3d ("x86/cpufeatures: Add new word for scattered features")
Cc: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-04 17:42:19 -07:00
Linus Torvalds
c88b9b4cde Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
 accompanying one of the fixes is also becoming a common occurrence.
 
 Current release - regressions:
 
  - ipv6: fix infinite recursion in fib6_dump_done()
 
  - net/rds: fix possible null-deref in newly added error path
 
 Current release - new code bugs:
 
  - net: do not consume a full cacheline for system_page_pool
 
  - bpf: fix bpf_arena-related file descriptor leaks in the verifier
 
  - drv: ice: fix freeing uninitialized pointers, fixing misuse of
    the newfangled __free() auto-cleanup
 
 Previous releases - regressions:
 
  - x86/bpf: fixes the BPF JIT with retbleed=stuff
 
  - xen-netfront: add missing skb_mark_for_recycle, fix page pool
    accounting leaks, revealed by recently added explicit warning
 
  - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
    non-wildcard addresses
 
  - Bluetooth:
    - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
      with better workarounds to un-break some buggy Qualcomm devices
    - set conn encrypted before conn establishes, fix re-connecting
      to some headsets which use slightly unusual sequence of msgs
 
  - mptcp:
    - prevent BPF accessing lowat from a subflow socket
    - don't account accept() of non-MPC client as fallback to TCP
 
  - drv: mana: fix Rx DMA datasize and skb_over_panic
 
  - drv: i40e: fix VF MAC filter removal
 
 Previous releases - always broken:
 
  - gro: various fixes related to UDP tunnels - netns crossing problems,
    incorrect checksum conversions, and incorrect packet transformations
    which may lead to panics
 
  - bpf: support deferring bpf_link dealloc to after RCU grace period
 
  - nf_tables:
    - release batch on table validation from abort path
    - release mutex after nft_gc_seq_end from abort path
    - flush pending destroy work before exit_net release
 
  - drv: r8169: skip DASH fw status checks when DASH is disabled
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
 IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
 pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
 dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
 6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
 ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
 zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
 V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
 urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
 a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
 pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
 v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
 =6bQI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Borislav Petkov (AMD)
3ddf944b32 x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
Modifying a MCA bank's MCA_CTL bits which control which error types to
be reported is done over

  /sys/devices/system/machinecheck/
  ├── machinecheck0
  │   ├── bank0
  │   ├── bank1
  │   ├── bank10
  │   ├── bank11
  ...

sysfs nodes by writing the new bit mask of events to enable.

When the write is accepted, the kernel deletes all current timers and
reinits all banks.

Doing that in parallel can lead to initializing a timer which is already
armed and in the timer wheel, i.e., in use already:

  ODEBUG: init active (active state 0) object: ffff888063a28000 object
  type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642
  WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514
  debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514

Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code
does.

Reported by: Yue Sun <samsun1006219@gmail.com>
Reported by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
2024-04-04 17:25:15 +02:00
Borislav Petkov (AMD)
0ecaefb303 x86/CPU/AMD: Track SNP host status with cc_platform_*()
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.

Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.

Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.

Move kdump_sev_callback() to its rightful place, while at it.

Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-04 10:40:30 +02:00
Borislav Petkov (AMD)
bc6f707fc0 x86/cc: Add cc_platform_set/_clear() helpers
Add functionality to set and/or clear different attributes of the
machine as a confidential computing platform. Add the first one too:
whether the machine is running as a host for SEV-SNP guests.

Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-5-bp@alien8.de
2024-04-04 10:40:27 +02:00
Borislav Petkov (AMD)
54f5f47b60 x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
The functionality to load SEV-SNP guests by the host will soon rely on
cc_platform* helpers because the cpu_feature* API with the early
patching is insufficient when SNP support needs to be disabled late.

Therefore, pull that functionality in.

Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-4-bp@alien8.de
2024-04-04 10:40:23 +02:00
Jason A. Donenfeld
99485c4c02 x86/coco: Require seeding RNG with RDRAND on CoCo systems
There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.

If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue on gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.

So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.

Add this deliberately to be "just a CoCo x86 driver feature" and not
part of the RNG itself. Many device drivers and platforms have some
desire to contribute something to the RNG, and add_device_randomness()
is specifically meant for this purpose.

Any driver can call it with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse.

Rather than trying to build something into the core of the RNG, consider
the particular CoCo issue just a CoCo issue, and therefore separate it
all out into driver (well, arch/platform) code.

  [ bp: Massage commit message. ]

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
2024-04-04 10:40:19 +02:00
Arnd Bergmann
9852b1dc6a x86/numa/32: Include missing <asm/pgtable_areas.h>
The __vmalloc_start_set declaration is in a header that is not included
in numa_32.c in current linux-next:

  arch/x86/mm/numa_32.c: In function 'initmem_init':
  arch/x86/mm/numa_32.c:57:9: error: '__vmalloc_start_set' undeclared (first use in this function)
     57 |         __vmalloc_start_set = true;
        |         ^~~~~~~~~~~~~~~~~~~
  arch/x86/mm/numa_32.c:57:9: note: each undeclared identifier is reported only once for each function it appears in

Add an explicit #include.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240403202344.3463169-1-arnd@kernel.org
2024-04-04 09:39:38 +02:00
Linus Torvalds
0f099dc9d1 ARM:
- Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.
 
 - Restore out-of-range handler for stage-2 translation faults.
 
 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.
 
 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.
 
 - Correct a format specifier warning in the arch_timer selftest.
 
 - Make the KVM banner message correctly handle all of the possible
   configurations.
 
 RISC-V:
 
 - Remove redundant semicolon in num_isa_ext_regs().
 
 - Fix APLIC setipnum_le/be write emulation.
 
 - Fix APLIC in_clrip[x] read emulation.
 
 x86:
 
 - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
   vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting
   disabled.
 
 - Documentation fixes for SEV.
 
 - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP.
 
 - Fix a 14-year-old goof in a declaration shared by host and guest; the enabled
   field used by Linux when running as a guest pushes the size of "struct
   kvm_vcpu_pv_apf_data" from 64 to 68 bytes.  This is really unconsequential
   because KVM never consumes anything beyond the first 64 bytes, but the
   resulting struct does not match the documentation.
 
 Selftests:
 
 - Fix spelling mistake in arch_timer selftest.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYMOJYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP2zAf/Z7/cK0+yFSvm7/tsbWtjnWofad/p
 82puu0V+8lZSjGVs3AydiDCV+FahvLS0QIwgrffVr4XA10Km5ZZMjZyJ3uH4xki/
 VFFsDnZPdKuj55T0wwN7JFn0YVOMdtgcP0b+F8aMbkL0uoJXjutOMKNhssuW12kw
 9cmPjaBWm/bfrfoTUUB9mCh0Ub3HKpguYwTLQuf6Fyn2FK7oORpt87Zi+oIKUn6H
 pFXFtZYduLg6M2LXvZqsXZLXnvABPjANNWEhiiwrvuF/wmXXTwTpvRXlYXhCvpAN
 q0AhxPhPm3NnsmRhEB6SmoMjXyZIByezcEiqAspBrUvEqs/2u6VyzFMrXw==
 =PlsI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Ensure perf events programmed to count during guest execution are
     actually enabled before entering the guest in the nVHE
     configuration

   - Restore out-of-range handler for stage-2 translation faults

   - Several fixes to stage-2 TLB invalidations to avoid stale
     translations, possibly including partial walk caches

   - Fix early handling of architectural VHE-only systems to ensure E2H
     is appropriately set

   - Correct a format specifier warning in the arch_timer selftest

   - Make the KVM banner message correctly handle all of the possible
     configurations

  RISC-V:

   - Remove redundant semicolon in num_isa_ext_regs()

   - Fix APLIC setipnum_le/be write emulation

   - Fix APLIC in_clrip[x] read emulation

  x86:

   - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID
     entries (old vs. new) and ultimately neglects to clear PV_UNHALT
     from vCPUs with HLT-exiting disabled

   - Documentation fixes for SEV

   - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP

   - Fix a 14-year-old goof in a declaration shared by host and guest;
     the enabled field used by Linux when running as a guest pushes the
     size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is
     really unconsequential because KVM never consumes anything beyond
     the first 64 bytes, but the resulting struct does not match the
     documentation

  Selftests:

   - Fix spelling mistake in arch_timer selftest"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: arm64: Rationalise KVM banner output
  arm64: Fix early handling of FEAT_E2H0 not being implemented
  KVM: arm64: Ensure target address is granule-aligned for range TLBI
  KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
  KVM: arm64: Don't pass a TLBI level hint when zapping table entries
  KVM: arm64: Don't defer TLB invalidation when zapping table entries
  KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test
  KVM: arm64: Fix out-of-IPA space translation fault handling
  KVM: arm64: Fix host-programmed guest events in nVHE
  RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
  RISC-V: KVM: Fix APLIC setipnum_le/be write emulation
  RISC-V: KVM: Remove second semicolon
  KVM: selftests: Fix spelling mistake "trigged" -> "triggered"
  Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP
  Documentation: kvm/sev: separate description of firmware
  KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
  KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled
  KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
  KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
  KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
  ...
2024-04-03 10:26:37 -07:00
Borislav Petkov (AMD)
0e11073247 x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO
The srso_alias_untrain_ret() dummy thunk in the !CONFIG_MITIGATION_SRSO
case is there only for the altenative in CALL_UNTRAIN_RET to have
a symbol to resolve.

However, testing with kernels which don't have CONFIG_MITIGATION_SRSO
enabled, leads to the warning in patch_return() to fire:

  missing return thunk: srso_alias_untrain_ret+0x0/0x10-0x0: eb 0e 66 66 2e
  WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:826 apply_returns (arch/x86/kernel/alternative.c:826

Put in a plain "ret" there so that gcc doesn't put a return thunk in
in its place which special and gets checked.

In addition:

  ERROR: modpost: "srso_alias_untrain_ret" [arch/x86/kvm/kvm-amd.ko] undefined!
  make[2]: *** [scripts/Makefile.modpost:145: Module.symvers] Chyba 1
  make[1]: *** [/usr/src/linux-6.8.3/Makefile:1873: modpost] Chyba 2
  make: *** [Makefile:240: __sub-make] Chyba 2

since !SRSO builds would use the dummy return thunk as reported by
petr.pisar@atlas.cz, https://bugzilla.kernel.org/show_bug.cgi?id=218679.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404020901.da75a60f-oliver.sang@intel.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/202404020901.da75a60f-oliver.sang@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-03 10:12:38 -07:00
Kan Liang
312be9fc22 perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
The MSR_PEBS_DATA_CFG MSR register is used to configure which data groups
should be generated into a PEBS record, and it's shared among all counters.

If there are different configurations among counters, perf combines all the
configurations.

The first perf command as below requires a complete PEBS record
(including memory info, GPRs, XMMs, and LBRs). The second perf command
only requires a basic group. However, after the second perf command is
running, the MSR_PEBS_DATA_CFG register is cleared. Only a basic group is
generated in a PEBS record, which is wrong. The required information
for the first perf command is missed.

 $ perf record --intr-regs=AX,SP,XMM0 -a -C 8 -b -W -d -c 100000003 -o /dev/null -e cpu/event=0xd0,umask=0x81/upp &
 $ sleep 5
 $ perf record  --per-thread  -c 1  -e cycles:pp --no-timestamp --no-tid taskset -c 8 ./noploop 1000

The first PEBS event is a system-wide PEBS event. The second PEBS event
is a per-thread event. When the thread is scheduled out, the
intel_pmu_pebs_del() function is invoked to update the PEBS state.
Since the system-wide event is still available, the cpuc->n_pebs is 1.
The cpuc->pebs_data_cfg is cleared. The data configuration for the
system-wide PEBS event is lost.

The (cpuc->n_pebs == 1) check was introduced in commit:

  b6a32f023fcc ("perf/x86: Fix PEBS threshold initialization")

At that time, it indeed didn't hurt whether the state was updated
during the removal, because only the threshold is updated.

The calculation of the threshold takes the last PEBS event into
account.

However, since commit:

  b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")

we delay the threshold update, and clear the PEBS data config, which triggers
the bug.

The PEBS data config update scope should not be shrunk during removal.

[ mingo: Improved the changelog & comments. ]

Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240401133320.703971-1-kan.liang@linux.intel.com
2024-04-03 10:19:20 +02:00
Reinette Chatre
c3eeb1ffc6 x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline
Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    ...
    RIP: 0010:__find_nth_andnot_bit+0x66/0x110
    ...
    Call Trace:
     <TASK>
     ? __die()
     ? page_fault_oops()
     ? exc_page_fault()
     ? asm_exc_page_fault()
     cpumask_any_housekeeping()
     mbm_setup_overflow_handler()
     resctrl_offline_cpu()
     resctrl_arch_offline_cpu()
     cpuhp_invoke_callback()
     cpuhp_thread_fun()
     smpboot_thread_fn()
     kthread()
     ret_from_fork()
     ret_from_fork_asm()
     </TASK>

The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.

Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().

Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.

[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]

Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
2024-04-03 09:30:01 +02:00
Paolo Bonzini
52b761b48f KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
    are actually enabled before entering the guest in the nVHE
    configuration.
 
  - Restore out-of-range handler for stage-2 translation faults.
 
  - Several fixes to stage-2 TLB invalidations to avoid stale
    translations, possibly including partial walk caches.
 
  - Fix early handling of architectural VHE-only systems to ensure E2H is
    appropriately set.
 
  - Correct a format specifier warning in the arch_timer selftest.
 
  - Make the KVM banner message correctly handle all of the possible
    configurations.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZgtpWBccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFoilAQCQk6kLIeuih5QOe50fK4XkNsyg
 PGcxw0a0BP8cfjtJsgEArwLlfHQOTE4tRWtXyEHvapJfe/bE1hjLmzUJx7BwLQ4=
 =6hNq
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.

 - Restore out-of-range handler for stage-2 translation faults.

 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.

 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.

 - Correct a format specifier warning in the arch_timer selftest.

 - Make the KVM banner message correctly handle all of the possible
   configurations.
2024-04-02 12:26:15 -04:00
Joan Bruguera Micó
6a53745300 x86/bpf: Fix IP for relocating call depth accounting
The commit:

  59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")

made PER_CPU_VAR() to use rip-relative addressing, hence
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of it. A follow up commit:

  17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template")

changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting, emitted by the BPF JIT, so the calculated address was wrong,
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.

Pass the destination IP when the BPF JIT emits call depth accounting.

Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-01 20:37:56 -07:00
Uros Bizjak
9d98aa0883 x86/bpf: Fix IP after emitting call depth accounting
Adjust the IP passed to `emit_patch` so it calculates the correct offset
for the CALL instruction if `x86_call_depth_emit_accounting` emits code.
Otherwise we will skip some instructions and most likely crash.

Fixes: b2e9dfe54be4 ("x86/bpf: Emit call depth accounting if required")
Link: https://lore.kernel.org/lkml/20230105214922.250473-1-joanbrugueram@gmail.com/
Co-developed-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-2-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-01 20:37:56 -07:00
Linus Torvalds
448f828feb - Define the correct set of default hw events on AMD Zen4
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
 
 - Fix detection of the LBR freeze feature on AMD
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJR3oACgkQEsHwGGHe
 VUo/eRAAnGUq0rSi4ZUqsTtbu/jNepKbaeS7jR3/p3V6iSwfmjEoJ2xE6uIdN5vD
 fnL6UkeDRMc8LaKHIdLD4ZbN8NRa3hOyzf5K7wwVp5bwle0NeyrcG5wVK8LgT/X/
 rPSk7YxoR5frkYcA6zZwezJOv3HGYt8RMr5bKMD3YiJ35/XCdPsKnbHJTHb+F23Y
 tYFBeyzRzOebQu0fFKP8ML9LbqvELESqJ5Smwu/jQ25aBW7sFsUNAxseGU2tYahX
 c6pm8ytIlpZFwmi1HzXmMICF7lWugFO/KkP/ndCM1IpmujVGy56hrpLEy5gT3gzh
 NE/nZDoqJAO2zhg2FuKybh3akdT+IgXUTjxYMYGUOkJIChzie3o4p9OqichgTIv5
 +ngAq5qzjAHfC7cZ5nA96XWkw1fFU6BqlA3KPs1mzQU9uTDz7tSkyxIitp3C8L0B
 JlilTr6yHUprJzFwCDk4hb+hfP5A9qYnrNeacMlldZmbH1jLYHEzB9FudK82MeM+
 tIKFnM2jyRaRs/s8n+/UdrOVFNGk/+scX8GQllEBF451a8J5x1CYeHB7dGW+4pf/
 cx5TupHg8dDRgNMsbaeEvwERoPu4h/VRozfBi6r1WjjskVm24lIdFFKTSm3BDbLk
 EH3cflv/h8KE19cr0XLb7aYYw/9jb4cpnb0WBMw1gQOSvUMXzxU=
 =gmta
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf fixes from Borislav Petkov:

 - Define the correct set of default hw events on AMD Zen4

 - Use the correct stalled cycles PMCs on AMD Zen2 and newer

 - Fix detection of the LBR freeze feature on AMD

* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-31 10:43:11 -07:00
Linus Torvalds
1aac9cb7e6 - Make sure single object builds in arch/x86/virt/ ala
make ... arch/x86/virt/vmx/tdx/seamcall.o
   work again
 
 - Do not do ROM range scans and memory validation when the kernel is
   running as a SEV-SNP guest as those can get problematic and, before
   that, are not really needed in such a guest
 
 - Exclude the build-time generated vdso-image-x32.o object from objtool
   validation and in particular the return sites in there due to
   a warning which fires when an unpatched return thunk is being used
 
 - Improve the NMI CPUs stall message to show additional information
   about the state of each CPU wrt the NMI handler
 
 - Enable gcc named address spaces support only on !KCSAN configs due to
   compiler options incompatibility
 
 - Revert a change which was trying to use GB pages for mapping regions
   only when the regions would be large enough but that change lead to
   kexec failing
 
 - A documentation fixlet
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJN64ACgkQEsHwGGHe
 VUo9KA/48ELqeKhCAFQdkZn8nnYf57g7bSMibMCnI5/ndNjyuShdZO92SGtBV7fV
 4L77vJRJXx2ytz4MztOPVClXaBEs4uDEa0NyZFSuEdvRIvJuglibsleQOFTbKfkD
 HdL0geI5exyfCC3BmslCS857sd6aBanqmYPddVk1+5mwhTGcAcwy9aiNTCzxxE22
 KuaO12A0+UNOdXNLAuUythmYL2V0Xn2Z9sXpDRchwsRnj12C1S2flIhPsWxg9AU+
 3ws9PnfTFTcfcViEug5p1nyN0gWCgYhMxaN8i3IS4smEc09Yq0pxVsitwYqJB1D6
 +neC27FTF4DCBr4G/8yu/x5BVHS92s9VK4u9Wo4nf9M4XBlMVU+If0bn9jtvE6jI
 GbdoHzoF7e2SEgzVvIjOHpdiBxEzW6S4lZGEtj6oaYR+cI0Fc7vZXoHy8kXohtTz
 QT2M1Hl9tMh2HjkwvoQLtNMYVHMWnuDy1X9k25Qk4nnCQJy9DbGiPnpxaGuPtaJ7
 s8kha4r+U/zpEiEa0S5AoIozVD/syXe1lOtaN2dUpOWjqBJAMGr1QkK3NnOxEcO3
 CeKOCcSfjvvAKPyTnk1Vrtec5KwHmUgD/VN2fn7EdHZVwxlbaoCybpP3g4CT4Jpl
 TQakg0du9TyG9BXituN4XaZgGskDYFRlXv+U+Lity4wvEjWgmg==
 =JrjO
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure single object builds in arch/x86/virt/ ala
      make ... arch/x86/virt/vmx/tdx/seamcall.o
   work again

 - Do not do ROM range scans and memory validation when the kernel is
   running as a SEV-SNP guest as those can get problematic and, before
   that, are not really needed in such a guest

 - Exclude the build-time generated vdso-image-x32.o object from objtool
   validation and in particular the return sites in there due to a
   warning which fires when an unpatched return thunk is being used

 - Improve the NMI CPUs stall message to show additional information
   about the state of each CPU wrt the NMI handler

 - Enable gcc named address spaces support only on !KCSAN configs due to
   compiler options incompatibility

 - Revert a change which was trying to use GB pages for mapping regions
   only when the regions would be large enough but that change lead to
   kexec failing

 - A documentation fixlet

* tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Use obj-y to descend into arch/x86/virt/
  x86/sev: Skip ROM range scans and validation for SEV-SNP guests
  x86/vdso: Fix rethunk patching for vdso-image-x32.o too
  x86/nmi: Upgrade NMI backtrace stall checks & messages
  x86/percpu: Disable named address spaces for KCSAN
  Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
  Documentation/x86: Fix title underline length
2024-03-31 10:16:34 -07:00
Masahiro Yamada
3f1a9bc5d8 x86/build: Use obj-y to descend into arch/x86/virt/
Commit c33621b4c5ad ("x86/virt/tdx: Wire up basic SEAMCALL functions")
introduced a new instance of core-y instead of the standardized obj-y
syntax.

X86 Makefiles descend into subdirectories of arch/x86/virt inconsistently;
into arch/x86/virt/ via core-y defined in arch/x86/Makefile, but into
arch/x86/virt/svm/ via obj-y defined in arch/x86/Kbuild.

This is problematic when you build a single object in parallel because
multiple threads attempt to build the same file.

  $ make -j$(nproc) arch/x86/virt/vmx/tdx/seamcall.o
    [ snip ]
    AS      arch/x86/virt/vmx/tdx/seamcall.o
    AS      arch/x86/virt/vmx/tdx/seamcall.o
  fixdep: error opening file: arch/x86/virt/vmx/tdx/.seamcall.o.d: No such file or directory
  make[4]: *** [scripts/Makefile.build:362: arch/x86/virt/vmx/tdx/seamcall.o] Error 2

Use the obj-y syntax, as it works correctly.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240330060554.18524-1-masahiroy@kernel.org
2024-03-30 10:41:49 +01:00