Commit Graph

17025 Commits

Author SHA1 Message Date
Paolo Bonzini
f3cf800778 Merge tag 'kvm-s390-master-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: selftests: Fixes

- provide memory model for  IBM z196 and zEC12
- do not require 64GB of memory
2021-07-14 12:14:27 -04:00
Paolo Bonzini
b8917b4ae4 KVM/arm64 updates for v5.14.
- Add MTE support in guests, complete with tag save/restore interface
 - Reduce the impact of CMOs by moving them in the page-table code
 - Allow device block mappings at stage-2
 - Reduce the footprint of the vmemmap in protected mode
 - Support the vGIC on dumb systems such as the Apple M1
 - Add selftest infrastructure to support multiple configuration
   and apply that to PMU/non-PMU setups
 - Add selftests for the debug architecture
 - The usual crop of PMU fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmDV2bEPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDEr8P/ivwROx5NwGcHGmU5RfUCT3aFqhtVHHwD/lu
 jPcgoO61kz9TelOu6QRaVuK+mVHxcq3iP4R8nPq/QCkUlEXTmK2xkyhXhGXSYpH4
 6jM8+BbC3eG7iAxx6H0UM4JTl4Riwat6ZZtXpWEWs9TKqOHOQYFpMkxSttwVZ1CZ
 SjbtFvXLEdzKn6PzUWnKdBNMV/mHsdAtohZit9oJOc4ttc8072XxETQ4TFQ+MSvA
 j9zY9QPmWzgcZnotqRRu9sbTGO2vxtXuUtY3sjdD8+C9OgSe9qvpnNjymcmfwaMu
 1fBkfh65oaO4ItJBdGOUOoEcFqwN5imPiI7CB/O+ZYkO9sBCuTUPSQwPkyiwXb9r
 bUkTaQw2nZiNWsqR1x07fQ2sGYbMp5mnmgmqiV4MUWkLmFp9LZATCWYTTn24cBNS
 6SjVP6/8S0r3EhLnYjH0Pn1we5PooU1EF6RlCAd3ewYoo+9fPnwjNYwIWH5i5wB7
 +tnei44NACAw9cfbos+BYQQ/dY15OSFzLzIMomlabB7OpXOdDg3H6tJnPbFwWwXb
 9nF8XdHqxeDVVVrDCAx1BSodSXm9xqgnQM2RDGTUnpVcAfqAr3MXX6VsyKQDzj8T
 QXF9qOVCBAABv6BXAvSQ6mvMJZDUVbUPEPhf7kXzF46JsRd6A7wWoU/OnMGHQ/w7
 wjvH8HVy
 =fWBV
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for v5.14.

- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
  and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
2021-06-25 11:24:24 -04:00
Paolo Bonzini
c3ab0e28a4 Merge branch 'topic/ppc-kvm' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEAD
- Support for the H_RPT_INVALIDATE hypercall

- Conversion of Book3S entry/exit to C

- Bug fixes
2021-06-23 07:30:41 -04:00
Linus Torvalds
8363e795eb A first set of urgent fixes to the FPU/XSTATE handling mess^W code.
(There's a lot more in the pipe):
 
 - Prevent corruption of the XSTATE buffer in signal handling by
   validating what is being copied from userspace first.
 
 - Invalidate other task's preserved FPU registers on XRSTOR failure
   (#PF) because latter can still modify some of them.
 
 - Restore the proper PKRU value in case userspace modified it
 
 - Reset FPU state when signal restoring fails
 
 Other:
 
 - Map EFI boot services data memory as encrypted in a SEV guest so that
   the guest can access it and actually boot properly
 
 - Two SGX correctness fixes: proper resources freeing and a NUMA fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO5vQACgkQEsHwGGHe
 VUrUjw//fRU8BPZ3/SWNQO188QhHdFpm3jqtjRJsZD1FfnnLdxIg2SCP4RjFxv+Y
 eFyN0nYLekG8a3CMV081H9Rhr5tt3bflk0oTcGAar7m2qQiCiqaAH0wptIlQonSu
 nQCSs+PeaaK4nRCtW+TUJnwG0ZU/y7fEXa3pxJ6hSMnxZjz3lj70zKhpA1nQtqRZ
 OOStvBNtaWcDdTTE4r8XuFIxuMUUEuwHlQQmkAVHQYUf6vxGYfnDYEg83Wddvq1E
 1leSRNFlLcCAbPUV/fax3KGvaekeJ1U411uWqXlain6m105+mk+irmrLxtur/lJ5
 cWTVb5CbIHFZnJvC5jzNPv/03GbIIQaVm4jPI2qB1AZbjcVlAPKj1Ne+U1fzvmDT
 wNUob/rnIXiGptvtUMNYGURxBTj65Nnom3iAJV+AdMOThDwYMvsJJjFkMnC5wO2n
 ZAexumWPnUzWoxSMTraT7a6b/kilFUrcPljxSrFd9yVeU8E6a1OSW35oWoQ3itrc
 xx/ne8RodLmCPC9DjecFcQR+qUuXsF+XCCj07QpfKNTAObr17e9nsKJneR6MX79C
 Lpc7Ka/CiTGYcebWX7tqtjwGPfa6iqekswxYRRp7j54bQ4sHmKyordZy0Q8+c079
 gmMlPdNbqQg3YwHyXW2yeJETDS1HBp61RRojAP15BsL73wyYQNE=
 =AuXr
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "A first set of urgent fixes to the FPU/XSTATE handling mess^W code.
  (There's a lot more in the pipe):

   - Prevent corruption of the XSTATE buffer in signal handling by
     validating what is being copied from userspace first.

   - Invalidate other task's preserved FPU registers on XRSTOR failure
     (#PF) because latter can still modify some of them.

   - Restore the proper PKRU value in case userspace modified it

   - Reset FPU state when signal restoring fails

  Other:

   - Map EFI boot services data memory as encrypted in a SEV guest so
     that the guest can access it and actually boot properly

   - Two SGX correctness fixes: proper resources freeing and a NUMA fix"

* tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Avoid truncating memblocks for SGX memory
  x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed
  x86/fpu: Reset state for all signal restore failures
  x86/pkru: Write hardware init value to PKRU when xstate is init
  x86/process: Check PF_KTHREAD and not current->mm for kernel threads
  x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
  x86/fpu: Prevent state corruption in __fpu__restore_sig()
  x86/ioremap: Map EFI-reserved memory as encrypted for SEV
2021-06-20 09:09:58 -07:00
Vineeth Pillai
a6c776a952 hyperv: Detect Nested virtualization support for SVM
Previously, to detect nested virtualization enlightenment support,
we were using HV_X64_ENLIGHTENED_VMCS_RECOMMENDED feature bit of
HYPERV_CPUID_ENLIGHTMENT_INFO.EAX CPUID as docuemented in TLFS:
 "Bit 14: Recommend a nested hypervisor using the enlightened VMCS
  interface. Also indicates that additional nested enlightenments
  may be available (see leaf 0x4000000A)".

Enlightened VMCS, however, is an Intel only feature so the above
detection method doesn't work for AMD. So, use the
HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS.EAX CPUID information ("The
maximum input value for hypervisor CPUID information.") and this
works for both AMD and Intel.

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Message-Id: <43b25ff21cd2d9a51582033c9bdd895afefac056.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:36 -04:00
Kai Huang
4692bc775d x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed
xa_destroy() needs to be called to destroy a virtual EPC's page array
before calling kfree() to free the virtual EPC. Currently it is not
called so add the missing xa_destroy().

Fixes: 540745ddbc ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Yang Zhong <yang.zhong@intel.com>
Link: https://lkml.kernel.org/r/20210615101639.291929-1-kai.huang@intel.com
2021-06-15 18:03:45 +02:00
Linus Torvalds
191aaf6cc4 Misc fixes:
- Fix the NMI watchdog on ancient Intel CPUs
 
  - Remove a misguided, NMI-unsafe KASAN callback
    from the NMI-safe irq_work path used by perf.
 
  - Fix uncore events on Ice Lake servers.
 
  - Someone booted maxcpus=1 on an SNB-EP, and the
    uncore driver emitted warnings and was probably
    buggy. Fix it.
 
  - KCSAN found a genuine data race in the core perf
    code. Somewhat ironically the bug was introduced
    through a recent race fix. :-/ In our defense, the
    new race window was much more narrow. Fix it.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDErJkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gjNxAAhWPl+zsVr+bMZGQVnjPf7swXSaqsphtU
 LrP0hrs4nH0JiB7lZJVjPhCMQKXb+gvP0CTmxkOXmNORDKDK3slIS/zp9uyH1F+d
 nXhmWi7c1bHU0vortnv87LGJpeeI1E7rQ/uBxK6b2v6kOBmCnjvQEiPvJEIGTtpE
 YimVBERdPDTBQiW5EQbbyL3VScwm5QUN2STnLPjUtVc9HES/zCdhXNlsASfhn/Tn
 8rlSAqVEOUcsTpTXYadHckNi1zn4zrpuhWKpSHXrvXCo3qU8QpISjYNwAJ/0IGBj
 CMdg2r+MneF6gop76R5aRcA0JDvDgtv56LKFVhi9gEkE5em9YAni17HU0IeTvJmT
 mL9j64h8oUErC/TpAU1vXCJjIxH7jLq8YQoNwHUvF0pSvcNGsaFeWu1ADQuTEIi9
 fyKHRpFwPMBhwc+AMaRepgQ9FlvE4567fQmwlrUDUKlCU0x0dfvFCM2z/o61YFlH
 oFgB0h0SNxdoj5EXny50LtokP1Kp/oBNVhhNsUpH8wVxWLi61BHJOslcc7nzdP6t
 JBqVE6bLQlxmlKt2AwiOkxe9xVv34o3AMxUYtUBYgCTZSlRjL//7pcqgG5r+CZH/
 eXEU3wWcGtRPEItGXtiGT9Vm2ZYSaUMFF7k7OrTPCHgkW51oEW4FUoaV7M+9fl43
 638x9Wnse4Q=
 =9LoT
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Misc fixes:

   - Fix the NMI watchdog on ancient Intel CPUs

   - Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe
     irq_work path used by perf.

   - Fix uncore events on Ice Lake servers.

   - Someone booted maxcpus=1 on an SNB-EP, and the uncore driver
     emitted warnings and was probably buggy. Fix it.

   - KCSAN found a genuine data race in the core perf code. Somewhat
     ironically the bug was introduced through a recent race fix. :-/
     In our defense, the new race window was much more narrow. Fix it"

* tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
  irq_work: Make irq_work_queue() NMI-safe again
  perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
  perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1
  perf: Fix data race between pin_count increment/decrement
2021-06-12 11:34:49 -07:00
CodyYao-oc
a8383dfb21 x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
The following commit:

   3a4ac121c2 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")

Got the old-style NMI watchdog logic wrong and broke it for basically every
Intel CPU where it was active. Which is only truly old CPUs, so few people noticed.

On CPUs with perf events support we turn off the old-style NMI watchdog, so it
was pretty pointless to add the logic for X86_VENDOR_ZHAOXIN to begin with ... :-/

Anyway, the fix is to restore the old logic and add a 'break'.

[ mingo: Wrote a new changelog. ]

Fixes: 3a4ac121c2 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210607025335.9643-1-CodyYao-oc@zhaoxin.com
2021-06-10 10:04:40 +02:00
Thomas Gleixner
efa1655049 x86/fpu: Reset state for all signal restore failures
If access_ok() or fpregs_soft_set() fails in __fpu__restore_sig() then the
function just returns but does not clear the FPU state as it does for all
other fatal failures.

Clear the FPU state for these failures as well.

Fixes: 72a671ced6 ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtryyhhz.ffs@nanos.tec.linutronix.de
2021-06-10 08:04:24 +02:00
Andy Lutomirski
d8778e393a x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
Both Intel and AMD consider it to be architecturally valid for XRSTOR to
fail with #PF but nonetheless change the register state.  The actual
conditions under which this might occur are unclear [1], but it seems
plausible that this might be triggered if one sibling thread unmaps a page
and invalidates the shared TLB while another sibling thread is executing
XRSTOR on the page in question.

__fpu__restore_sig() can execute XRSTOR while the hardware registers
are preserved on behalf of a different victim task (using the
fpu_fpregs_owner_ctx mechanism), and, in theory, XRSTOR could fail but
modify the registers.

If this happens, then there is a window in which __fpu__restore_sig()
could schedule out and the victim task could schedule back in without
reloading its own FPU registers. This would result in part of the FPU
state that __fpu__restore_sig() was attempting to load leaking into the
victim task's user-visible state.

Invalidate preserved FPU registers on XRSTOR failure to prevent this
situation from corrupting any state.

[1] Frequent readers of the errata lists might imagine "complex
    microarchitectural conditions".

Fixes: 1d731e731c ("x86/fpu: Add a fastpath to __fpu__restore_sig()")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.758116583@linutronix.de
2021-06-09 09:49:38 +02:00
Thomas Gleixner
484cea4f36 x86/fpu: Prevent state corruption in __fpu__restore_sig()
The non-compacted slowpath uses __copy_from_user() and copies the entire
user buffer into the kernel buffer, verbatim.  This means that the kernel
buffer may now contain entirely invalid state on which XRSTOR will #GP.
validate_user_xstate_header() can detect some of that corruption, but that
leaves the onus on callers to clear the buffer.

Prior to XSAVES support, it was possible just to reinitialize the buffer,
completely, but with supervisor states that is not longer possible as the
buffer clearing code split got it backwards. Fixing that is possible but
not corrupting the state in the first place is more robust.

Avoid corruption of the kernel XSAVE buffer by using copy_user_to_xstate()
which validates the XSAVE header contents before copying the actual states
to the kernel. copy_user_to_xstate() was previously only called for
compacted-format kernel buffers, but it works for both compacted and
non-compacted forms.

Using it for the non-compacted form is slower because of multiple
__copy_from_user() operations, but that cost is less important than robust
code in an already slow path.

[ Changelog polished by Dave Hansen ]

Fixes: b860eb8dce ("x86/fpu/xstate: Define new functions for clearing fpregs and xstates")
Reported-by: syzbot+2067e764dbcd10721e2e@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.611833074@linutronix.de
2021-06-09 09:28:21 +02:00
Mike Rapoport
f1d4d47c58 x86/setup: Always reserve the first 1M of RAM
There are BIOSes that are known to corrupt the memory under 1M, or more
precisely under 640K because the memory above 640K is anyway reserved
for the EGA/VGA frame buffer and BIOS.

To prevent usage of the memory that will be potentially clobbered by the
kernel, the beginning of the memory is always reserved. The exact size
of the reserved area is determined by CONFIG_X86_RESERVE_LOW build time
and the "reservelow=" command line option. The reserved range may be
from 4K to 640K with the default of 64K. There are also configurations
that reserve the entire 1M range, like machines with SandyBridge graphic
devices or systems that enable crash kernel.

In addition to the potentially clobbered memory, EBDA of unknown size may
be as low as 128K and the memory above that EBDA start is also reserved
early.

It would have been possible to reserve the entire range under 1M unless for
the real mode trampoline that must reside in that area.

To accommodate placement of the real mode trampoline and keep the memory
safe from being clobbered by BIOS, reserve the first 64K of RAM before
memory allocations are possible and then, after the real mode trampoline
is allocated, reserve the entire range from 0 to 1M.

Update trim_snb_memory() and reserve_real_mode() to avoid redundant
reservations of the same memory range.

Also make sure the memory under 1M is not getting freed by
efi_free_boot_services().

 [ bp: Massage commit message and comments. ]

Fixes: a799c2bd29 ("x86/setup: Consolidate early memory reservations")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213177
Link: https://lkml.kernel.org/r/20210601075354.5149-2-rppt@kernel.org
2021-06-03 19:57:55 +02:00
Borislav Petkov
2b31e8ed96 x86/alternative: Optimize single-byte NOPs at an arbitrary position
Up until now the assumption was that an alternative patching site would
have some instructions at the beginning and trailing single-byte NOPs
(0x90) padding. Therefore, the patching machinery would go and optimize
those single-byte NOPs into longer ones.

However, this assumption is broken on 32-bit when code like
hv_do_hypercall() in hyperv_init() would use the ratpoline speculation
killer CALL_NOSPEC. The 32-bit version of that macro would align certain
insns to 16 bytes, leading to the compiler issuing a one or more
single-byte NOPs, depending on the holes it needs to fill for alignment.

That would lead to the warning in optimize_nops() to fire:

  ------------[ cut here ]------------
  Not a NOP at 0xc27fb598
   WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13

due to that function verifying whether all of the following bytes really
are single-byte NOPs.

Therefore, carve out the NOP padding into a separate function and call
it for each NOP range beginning with a single-byte NOP.

Fixes: 23c1ad538f ("x86/alternatives: Optimize optimize_nops()")
Reported-by: Richard Narron <richard@aaazen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301
Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de
2021-06-03 16:33:09 +02:00
Thomas Gleixner
9bfecd0583 x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
While digesting the XSAVE-related horrors which got introduced with
the supervisor/user split, the recent addition of ENQCMD-related
functionality got on the radar and turned out to be similarly broken.

update_pasid(), which is only required when X86_FEATURE_ENQCMD is
available, is invoked from two places:

 1) From switch_to() for the incoming task

 2) Via a SMP function call from the IOMMU/SMV code

#1 is half-ways correct as it hacks around the brokenness of get_xsave_addr()
   by enforcing the state to be 'present', but all the conditionals in that
   code are completely pointless for that.

   Also the invocation is just useless overhead because at that point
   it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
   and all of this can be handled at return to user space.

#2 is broken beyond repair. The comment in the code claims that it is safe
   to invoke this in an IPI, but that's just wishful thinking.

   FPU state of a running task is protected by fregs_lock() which is
   nothing else than a local_bh_disable(). As BH-disabled regions run
   usually with interrupts enabled the IPI can hit a code section which
   modifies FPU state and there is absolutely no guarantee that any of the
   assumptions which are made for the IPI case is true.

   Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
   invoked with a NULL pointer argument, so it can hit a completely
   unrelated task and unconditionally force an update for nothing.
   Worse, it can hit a kernel thread which operates on a user space
   address space and set a random PASID for it.

The offending commit does not cleanly revert, but it's sufficient to
force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
code to make this dysfunctional all over the place. Anything more
complex would require more surgery and none of the related functions
outside of the x86 core code are blatantly wrong, so removing those
would be overkill.

As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
required to make this actually work, this cannot result in a regression
except for related out of tree train-wrecks, but they are broken already
today.

Fixes: 20f0afd1fb ("x86/mmu: Allocate/free a PASID")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
2021-06-03 16:33:09 +02:00
Borislav Petkov
9a90ed065a x86/thermal: Fix LVT thermal setup for SMI delivery mode
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):

  [    0.033858] read lvtthmr: 0x330, val: 0x200

which means:

 - bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
 - bits [10:8] have 010b which means SMI delivery mode.

Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:

  setup_local_APIC:

  	...

	/*
	 * If this comes from kexec/kcrash the APIC might be enabled in
	 * SPIV. Soft disable it before doing further initialization.
	 */
	value = apic_read(APIC_SPIV);
	value &= ~APIC_SPIV_APIC_ENABLED;
	apic_write(APIC_SPIV, value);

which means (from the SDM):

"10.4.7.2 Local APIC State After It Has Been Software Disabled

...

* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."

And this happens too:

  [    0.124111] APIC: Switch to symmetric I/O mode setup
  [    0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
  [    0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0

This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't doesn't fire at all,
leading to CPU 0 stuck in SMM forever...

Now, before

  4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")

due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.

But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.

Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.

Fixes: 4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
2021-05-31 22:32:26 +02:00
Thomas Gleixner
7d65f9e806 x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
PIC interrupts do not support affinity setting and they can end up on
any online CPU. Therefore, it's required to mark the associated vectors
as system-wide reserved. Otherwise, the corresponding irq descriptors
are copied to the secondary CPUs but the vectors are not marked as
assigned or reserved. This works correctly for the IO/APIC case.

When the IO/APIC is disabled via config, kernel command line or lack of
enumeration then all legacy interrupts are routed through the PIC, but
nothing marks them as system-wide reserved vectors.

As a consequence, a subsequent allocation on a secondary CPU can result in
allocating one of these vectors, which triggers the BUG() in
apic_update_vector() because the interrupt descriptor slot is not empty.

Imran tried to work around that by marking those interrupts as allocated
when a CPU comes online. But that's wrong in case that the IO/APIC is
available and one of the legacy interrupts, e.g. IRQ0, has been switched to
PIC mode because then marking them as allocated will fail as they are
already marked as system vectors.

Stay consistent and update the legacy vectors after attempting IO/APIC
initialization and mark them as system vectors in case that no IO/APIC is
available.

Fixes: 69cde0004a ("x86/vector: Use matrix allocator for vector assignment")
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
2021-05-29 11:41:14 +02:00
Linus Torvalds
7de7ac8d60 - Fix how SEV handles MMIO accesses by forwarding potential page faults instead
of killing the machine and by using the accessors with the exact functionality
 needed when accessing memory.
 
 - Fix a confusion with Clang LTO compiler switches passed to the it
 
 - Handle the case gracefully when VMGEXIT has been executed in userspace
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCqKdwACgkQEsHwGGHe
 VUrnfBAAitJ9ytn5PzrLhg9cKt+BRVg8QQExWUYqOrSDXHus5+X/21YKey7BBhIj
 rMJSHi7qytO5rrfj5nw3dIH30hnat8nn5GWcNMG0hi1ptep+GP0xMG1nGw7INJDW
 85FpQI9jpO+vz0AcoZYAtSOWbwonVqbhjdHGzDhIi2e0Qt+1uKbjsT+iPxANBpyB
 fyEU3biPyWfKY4JSr1n0EHBywR329IW5I+yZInb2SBEU42V4vDBGFCXgdS8eFGo5
 KPz/bikERC/gZuDIRXDP6riKIpy1yCO1JZb0EgukwDddbzNz/ox7dX9JL+dEeRzl
 0zr28cJSoZgYQjdi3LU412CMVa8eYw7Ca0/mbhADdZK6Wd7xUNEiUR7FFoBA2Jxp
 +oYzYe4KvlsaFQyPrt8mfJDA36r+FZcqr3WJF+LYmPbRi+cbNDbKSoeDqShAh+Fq
 uUVNloWiOltsRuCS5/du8qzhmJLdIH1uFqtYK37PGLzAHz+KJ9SAdLWaYaLx4GFd
 rrFuCnk5DmoDf3I5lQvIzIEmYysEQOloGgDR6dDaPFRymOgor7BsCdR+dtxVQ6P6
 SMSUzyJLq4tC4dzT5PxWfZDlO+wIxu5QAOhu95oWIdZbsaoABZYCuLf7T7XQr9PA
 DLil4v4i7/FGpDBh+2s3V5hTXHKATuI7SGXnMNfx1eLurChg07k=
 =51BK
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix how SEV handles MMIO accesses by forwarding potential page faults
   instead of killing the machine and by using the accessors with the
   exact functionality needed when accessing memory.

 - Fix a confusion with Clang LTO compiler switches passed to the it

 - Handle the case gracefully when VMGEXIT has been executed in
   userspace

* tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use __put_user()/__get_user() for data accesses
  x86/sev-es: Forward page-faults which happen during emulation
  x86/sev-es: Don't return NULL from sev_es_get_ghcb()
  x86/build: Fix location of '-plugin-opt=' flags
  x86/sev-es: Invalidate the GHCB after completing VMGEXIT
  x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
2021-05-23 06:12:25 -10:00
Linus Torvalds
a0e31f3a38 Merge branch 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
 "During the merge window an issue with si_perf and the siginfo ABI came
  up. The alpha and sparc siginfo structure layout had changed with the
  addition of SIGTRAP TRAP_PERF and the new field si_perf.

  The reason only alpha and sparc were affected is that they are the
  only architectures that use si_trapno.

  Looking deeper it was discovered that si_trapno is used for only a few
  select signals on alpha and sparc, and that none of the other
  _sigfault fields past si_addr are used at all. Which means technically
  no regression on alpha and sparc.

  While the alignment concerns might be dismissed the abuse of si_errno
  by SIGTRAP TRAP_PERF does have the potential to cause regressions in
  existing userspace.

  While we still have time before userspace starts using and depending
  on the new definition siginfo for SIGTRAP TRAP_PERF this set of
  changes cleans up siginfo_t.

   - The si_trapno field is demoted from magic alpha and sparc status
     and made an ordinary union member of the _sigfault member of
     siginfo_t. Without moving it of course.

   - si_perf is replaced with si_perf_data and si_perf_type ending the
     abuse of si_errno.

   - Unnecessary additions to signalfd_siginfo are removed"

* 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
  signal: Deliver all of the siginfo perf data in _perf
  signal: Factor force_sig_perf out of perf_sigtrap
  signal: Implement SIL_FAULT_TRAPNO
  siginfo: Move si_trapno inside the union inside _si_fault
2021-05-21 06:12:52 -10:00
Joerg Roedel
4954f5b8ef x86/sev-es: Use __put_user()/__get_user() for data accesses
The put_user() and get_user() functions do checks on the address which is
passed to them. They check whether the address is actually a user-space
address and whether its fine to access it. They also call might_fault()
to indicate that they could fault and possibly sleep.

All of these checks are neither wanted nor needed in the #VC exception
handler, which can be invoked from almost any context and also for MMIO
instructions from kernel space on kernel memory. All the #VC handler
wants to know is whether a fault happened when the access was tried.

This is provided by __put_user()/__get_user(), which just do the access
no matter what. Also add comments explaining why __get_user() and
__put_user() are the best choice here and why it is safe to use them
in this context. Also explain why copy_to/from_user can't be used.

In addition, also revert commit

  7024f60d65 ("x86/sev-es: Handle string port IO to kernel memory properly")

because using __get_user()/__put_user() fixes the same problem while
the above commit introduced several problems:

  1) It uses access_ok() which is only allowed in task context.

  2) It uses memcpy() which has no fault handling at all and is
     thus unsafe to use here.

  [ bp: Fix up commit ID of the reverted commit above. ]

Fixes: f980f9c31a ("x86/sev-es: Compile early handler code into kernel image")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-4-joro@8bytes.org
2021-05-19 18:45:37 +02:00
Joerg Roedel
c25bbdb564 x86/sev-es: Forward page-faults which happen during emulation
When emulating guest instructions for MMIO or IOIO accesses, the #VC
handler might get a page-fault and will not be able to complete. Forward
the page-fault in this case to the correct handler instead of killing
the machine.

Fixes: 0786138c78 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-3-joro@8bytes.org
2021-05-19 17:13:04 +02:00
Joerg Roedel
b250f2f779 x86/sev-es: Don't return NULL from sev_es_get_ghcb()
sev_es_get_ghcb() is called from several places but only one of them
checks the return value. The reaction to returning NULL is always the
same: calling panic() and kill the machine.

Instead of adding checks to all call sites, move the panic() into the
function itself so that it will no longer return NULL.

Fixes: 0786138c78 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-2-joro@8bytes.org
2021-05-19 17:05:13 +02:00
Eric W. Biederman
0683b53197 signal: Deliver all of the siginfo perf data in _perf
Don't abuse si_errno and deliver all of the perf data in _perf member
of siginfo_t.

Note: The data field in the perf data structures in a u64 to allow a
pointer to be encoded without needed to implement a 32bit and 64bit
version of the same structure.  There already exists a 32bit and 64bit
versions siginfo_t, and the 32bit version can not include a 64bit
member as it only has 32bit alignment.  So unsigned long is used in
siginfo_t instead of a u64 as unsigned long can encode a pointer on
all architectures linux supports.

v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com
v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:54 -05:00
Eric W. Biederman
add0b32ef9 siginfo: Move si_trapno inside the union inside _si_fault
It turns out that linux uses si_trapno very sparingly, and as such it
can be considered extra information for a very narrow selection of
signals, rather than information that is present with every fault
reported in siginfo.

As such move si_trapno inside the union inside of _si_fault.  This
results in no change in placement, and makes it eaiser
to extend _si_fault in the future as this reduces the number of
special cases.  In particular with si_trapno included in the union it
is no longer a concern that the union must be pointer aligned on most
architectures because the union follows immediately after si_addr
which is a pointer.

This change results in a difference in siginfo field placement on
sparc and alpha for the fields si_addr_lsb, si_lower, si_upper,
si_pkey, and si_perf.  These architectures do not implement the
signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and
si_perf.  Further these architecture have not yet implemented the
userspace that would use si_perf.

The point of this change is in fact to correct these placement issues
before sparc or alpha grow userspace that cares.  This change was
discussed[1] and the agreement is that this change is currently safe.

[1]: https://lkml.kernel.org/r/CAK8P3a0+uKYwL1NhY6Hvtieghba2hKYGD6hcKx5n8=4Gtt+pHA@mail.gmail.com
Acked-by: Marco Elver <elver@google.com>
v1: https://lkml.kernel.org/r/m1tunns7yf.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-5-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:17:03 -05:00
Tom Lendacky
a50c5bebc9 x86/sev-es: Invalidate the GHCB after completing VMGEXIT
Since the VMGEXIT instruction can be issued from userspace, invalidate
the GHCB after performing VMGEXIT processing in the kernel.

Invalidation is only required after userspace is available, so call
vc_ghcb_invalidate() from sev_es_put_ghcb(). Update vc_ghcb_invalidate()
to additionally clear the GHCB exit code so that it is always presented
as 0 when VMGEXIT has been issued by anything else besides the kernel.

Fixes: 0786138c78 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/5a8130462e4f0057ee1184509cd056eedd78742b.1621273353.git.thomas.lendacky@amd.com
2021-05-18 07:06:29 +02:00
Tom Lendacky
fea63d54f7 x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
Move the location of sev_es_put_ghcb() in preparation for an update to it
in a follow-on patch. This will better highlight the changes being made
to the function.

No functional change.

Fixes: 0786138c78 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/8c07662ec17d3d82e5c53841a1d9e766d3bdbab6.1621273353.git.thomas.lendacky@amd.com
2021-05-18 06:49:37 +02:00
Paolo Bonzini
a4345a7cec KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
 - Fix regression with debug register decoding resulting in overlapping access
 - Commit exception state on exit to usrspace
 - Fix the MMU notifier return values
 - Add missing 'static' qualifiers in the new host stage-2 code
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
 1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
 51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
 9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
 RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
 lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
 oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
 kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
 ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
 0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
 lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
 sP2LC4iv
 =57tg
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.13, take #1

- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
2021-05-17 09:55:12 +02:00
Linus Torvalds
ccb013c29d - Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
 VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
 LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
 p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
 ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
 Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
 DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
 UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
 GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
 s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
 ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
 FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
 =ENfF
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "The three SEV commits are not really urgent material. But we figured
  since getting them in now will avoid a huge amount of conflicts
  between future SEV changes touching tip, the kvm and probably other
  trees, sending them to you now would be best.

  The idea is that the tip, kvm etc branches for 5.14 will all base
  ontop of -rc2 and thus everything will be peachy. What is more, those
  changes are purely mechanical and defines movement so they should be
  fine to go now (famous last words).

  Summary:

   - Enable -Wundef for the compressed kernel build stage

   - Reorganize SEV code to streamline and simplify future development"

* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/compressed: Enable -Wundef
  x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
  x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
  x86/sev-es: Rename sev-es.{ch} to sev.{ch}
2021-05-16 09:31:06 -07:00
Huang Rui
3743d55b28 x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
Some AMD Ryzen generations has different calculation method on maximum
performance. 255 is not for all ASICs, some specific generations should use 166
as the maximum performance. Otherwise, it will report incorrect frequency value
like below:

  ~ → lscpu | grep MHz
  CPU MHz:                         3400.000
  CPU max MHz:                     7228.3198
  CPU min MHz:                     2200.0000

[ mingo: Tidied up whitespace use. ]
[ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ]

Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Fixed-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425073451.2557394-1-ray.huang@amd.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211791
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-13 12:10:24 +02:00
Linus Torvalds
0aa099a312 * Lots of bug fixes.
* Fix virtualization of RDPID
 
 * Virtualization of DR6_BUS_LOCK, which on bare metal is new in
   the 5.13 merge window
 
 * More nested virtualization migration fixes (nSVM and eVMCS)
 
 * Fix for KVM guest hibernation
 
 * Fix for warning in SEV-ES SRCU usage
 
 * Block KVM from loading on AMD machines with 5-level page tables,
   due to the APM not mentioning how host CR4.LA57 exactly impacts
   the guest.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
 fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
 7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
 DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
 t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
 fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
 =e+gD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Lots of bug fixes.

 - Fix virtualization of RDPID

 - Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
   release

 - More nested virtualization migration fixes (nSVM and eVMCS)

 - Fix for KVM guest hibernation

 - Fix for warning in SEV-ES SRCU usage

 - Block KVM from loading on AMD machines with 5-level page tables, due
   to the APM not mentioning how host CR4.LA57 exactly impacts the
   guest.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
  KVM: SVM: Move GHCB unmapping to fix RCU warning
  KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
  kvm: Cap halt polling at kvm->max_halt_poll_ns
  tools/kvm_stat: Fix documentation typo
  KVM: x86: Prevent deadlock against tk_core.seq
  KVM: x86: Cancel pvclock_gtod_work on module removal
  KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
  KVM: X86: Expose bus lock debug exception to guest
  KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
  KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
  KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
  KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
  KVM: x86: Move uret MSR slot management to common x86
  KVM: x86: Export the number of uret MSRs to vendor modules
  KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
  KVM: VMX: Use common x86's uret MSR list as the one true list
  KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
  KVM: VMX: Configure list of user return MSRs at module init
  KVM: x86: Add support for RDPID without RDTSCP
  KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
  ...
2021-05-10 12:30:45 -07:00
Brijesh Singh
059e5c321a x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
2021-05-10 07:51:38 +02:00
Brijesh Singh
b81fc74d53 x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
The guest and the hypervisor contain separate macros to get and set
the GHCB MSR protocol and NAE event fields. Consolidate the GHCB
protocol definitions and helper macros in one place.

Leave the supported protocol version define in separate files to keep
the guest and hypervisor flexibility to support different GHCB version
in the same release.

There is no functional change intended.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-3-brijesh.singh@amd.com
2021-05-10 07:46:39 +02:00
Brijesh Singh
e759959fe3 x86/sev-es: Rename sev-es.{ch} to sev.{ch}
SEV-SNP builds upon the SEV-ES functionality while adding new hardware
protection. Version 2 of the GHCB specification adds new NAE events that
are SEV-SNP specific. Rename the sev-es.{ch} to sev.{ch} so that all
SEV* functionality can be consolidated in one place.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-2-brijesh.singh@amd.com
2021-05-10 07:40:27 +02:00
Linus Torvalds
dd3e4012dd - Fix guest vtime accounting so that ticks happening while the guest is running
can also be accounted to it. Along with a consolidation to the guest-specific context
 tracking helpers.
 
 - Provide for the host NMI handler running after a VMX VMEXIT to be able to run
 on the kernel stack correctly.
 
 - Initialize MSR_TSC_AUX when RDPID is supported and not RDTSCP (virt relevant -
   real hw supports both)
 
 - A code generation improvement to TASK_SIZE_MAX through the use of alternatives
 
 - The usual misc. and related cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCXqpcACgkQEsHwGGHe
 VUp6Kg/+O0y2PvL6dhfYnUvTmQD7be0DOfWeFSLfBBA0c6yaHL1INbFHWDDptNuJ
 ZV50V+vyqXWV9q0AWF94fYHBs2kB0S79/En0Pwt1a3kb/xlfVTh8VAMPr36utnTY
 VWvOwHgixfPbY+8g1AoqIm/IeFuYWubXQ9CyBrLx/zkJjszfot1eooGRYKDPc2qi
 dNEqBO4IKzw24OdO+oIzW1/owLfnBF+GnXrwCb8fFC2U7luyFAJmp9c1bYnyNuCm
 BdQySOTfm8nnE2RpN4wfc8Akvu/ETKHOPSQOqHIb5glzv6lVfRKXu3CgpYbzoCNl
 Iohb6z8xmgAG29g2VpBjNvCWyyO79y4Ckf94ibWl+qt01EdeYefcP0euK+MGi85A
 cN/MrMt7QjHHEO7ok5J9rBSeKobOtng6A4MHenSOLvjifOYoupRFijaLVxRluATW
 3NsC2IhL10u1c69Zsq6JJFJKoAytInKSigEN9VFZp+4NdE/FzDxfebC/6rSKznGi
 XoaEjOOX0JQ5TXM1gDoyzowAvt2vgndvldpwJTnPY5NP3X9fdiHhoOF9cU2yvl+x
 ZjgD1VxRWLGZKBojNfAa+0oDMZ/cTwPoeZ5Rr5p7SMr/Xw2fsUQ68KVjhOR7ZbaU
 8zEV//JtetwGSN86NhQ/V32hqiF2fni62yBZjYGZ8XM/AnDqaMQ=
 =O3BS
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "A bunch of things accumulated for x86 in the last two weeks:

   - Fix guest vtime accounting so that ticks happening while the guest
     is running can also be accounted to it. Along with a consolidation
     to the guest-specific context tracking helpers.

   - Provide for the host NMI handler running after a VMX VMEXIT to be
     able to run on the kernel stack correctly.

   - Initialize MSR_TSC_AUX when RDPID is supported and not RDTSCP (virt
     relevant - real hw supports both)

   - A code generation improvement to TASK_SIZE_MAX through the use of
     alternatives

   - The usual misc and related cleanups and improvements"

* tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Consolidate guest enter/exit logic to common helpers
  context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
  context_tracking: Consolidate guest enter/exit wrappers
  sched/vtime: Move guest enter/exit vtime accounting to vtime.h
  sched/vtime: Move vtime accounting external declarations above inlines
  KVM: x86: Defer vtime accounting 'til after IRQ handling
  context_tracking: Move guest exit vtime accounting to separate helpers
  context_tracking: Move guest exit context tracking to separate helpers
  KVM/VMX: Invoke NMI non-IST entry instead of IST entry
  x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
  x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
  x86/resctrl: Fix init const confusion
  x86: Delete UD0, UD1 traces
  x86/smpboot: Remove duplicate includes
  x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant
2021-05-09 12:52:25 -07:00
Linus Torvalds
28b4afeb59 io_uring-5.13-2021-05-07
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCVVmMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv7yEAC/WV1alcH9XdEqLrc2aDwlaScMmSlrMQhY
 ihtDCR9BsX11E3QcUB7D+VYjBo68uKR+ksa1/GN2Xp+vvqmdjQvZindgto/5b6u1
 ko0Dradl2zulCAc7QIdjb2tbmL+Q+JOX5wxv14/+2XabEcce3OegWIvIgX+56NFW
 ZHg80SQzXUhEtQcAUVCoPeBN+H+xzadgz38VlOI08gOG7/M6tS965GH3tZqTjh2K
 P7dLjUn0WcxZ3euAYAsQzNN2O2ObJfpCsQtsG2eSf8DGpanPe4gQjAud1BstDtN0
 CJ0+b6DHgzQYOAgPFjm7l0jjs+VnIYIMnoBBxm5EkIoktsj0hHdqTnEugoz4wTnS
 T8WgojaU6jYNx+Jj6vciCLk0lb5c3O3nxmw3w84/rtTwtaEChCAbWdAkl4cleNaw
 3/Z2bksCVrQWDVskmu4FP7+kGYpjpV+ZiA2+6OGwILTCN+W7vi079NByQAzdLaRb
 K/4lEGM7VYEXtq/I7C6VzjtY7gq46TJmpFW+OdQnPIguavp+7vlUl2pLV3oTeGBc
 E6c+xltgIN+sbbDc/57EJEvhHQod4A6HYOGwBMyjHrhr/sdQ4xvUaJPNmG9HfqRK
 SM3TOlwpHRWFTgbO+6qoJQSMvACQyE/SDqiPi08q75zFVTNCcYM7uYV3fJMsQ9sj
 vA+5HAaRKQ==
 =YwTw
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Mostly fixes for merge window merged code. In detail:

   - Error case memory leak fixes (Colin, Zqiang)

   - Add the tools/io_uring/ to the list of maintained files (Lukas)

   - Set of fixes for the modified buffer registration API (Pavel)

   - Sanitize io thread setup on x86 (Stefan)

   - Ensure we truncate transfer count for registered buffers (Thadeu)"

* tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
  x86/process: setup io_threads more like normal user space threads
  MAINTAINERS: add io_uring tool to IO_URING
  io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers
  io_uring: Fix memory leak in io_sqe_buffers_register()
  io_uring: Fix premature return from loop and memory leak
  io_uring: fix unchecked error in switch_start()
  io_uring: allow empty slots for reg buffers
  io_uring: add more build check for uapi
  io_uring: dont overlap internal and user req flags
  io_uring: fix drain with rsrc CQEs
2021-05-07 11:29:23 -07:00
Vitaly Kuznetsov
384fc672f5 x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()
Simplify the code by making PV features shutdown happen in one place.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:11 -04:00
Vitaly Kuznetsov
3d6b84132d x86/kvm: Disable all PV features on crash
Crash shutdown handler only disables kvmclock and steal time, other PV
features remain active so we risk corrupting memory or getting some
side-effects in kdump kernel. Move crash handler to kvm.c and unify
with CPU offline.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:10 -04:00
Vitaly Kuznetsov
c02027b574 x86/kvm: Disable kvmclock on all CPUs on shutdown
Currenly, we disable kvmclock from machine_shutdown() hook and this
only happens for boot CPU. We need to disable it for all CPUs to
guard against memory corruption e.g. on restore from hibernate.

Note, writing '0' to kvmclock MSR doesn't clear memory location, it
just prevents hypervisor from updating the location so for the short
while after write and while CPU is still alive, the clock remains usable
and correct so we don't need to switch to some other clocksource.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:10 -04:00
Vitaly Kuznetsov
8b79feffec x86/kvm: Teardown PV features on boot CPU as well
Various PV features (Async PF, PV EOI, steal time) work through memory
shared with hypervisor and when we restore from hibernation we must
properly teardown all these features to make sure hypervisor doesn't
write to stale locations after we jump to the previously hibernated kernel
(which can try to place anything there). For secondary CPUs the job is
already done by kvm_cpu_down_prepare(), register syscore ops to do
the same for boot CPU.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:05:36 -04:00
Stefan Metzmacher
50b7b6f29d x86/process: setup io_threads more like normal user space threads
As io_threads are fully set up USER threads it's clearer to separate the
code path from the KTHREAD logic.

The only remaining difference to user space threads is that io_threads
never return to user space again. Instead they loop within the given
worker function.

The fact that they never return to user space means they don't have an
user space thread stack. In order to indicate that to tools like gdb we
reset the stack and instruction pointers to 0.

This allows gdb attach to user space processes using io-uring, which like
means that they have io_threads, without printing worrying message like
this:

  warning: Selected architecture i386:x86-64 is not compatible with reported target architecture i386

  warning: Architecture rejected target-supplied description

The output will be something like this:

  (gdb) info threads
    Id   Target Id                  Frame
  * 1    LWP 4863 "io_uring-cp-for" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
    2    LWP 4864 "iou-mgr-4863"    0x0000000000000000 in ?? ()
    3    LWP 4865 "iou-wrk-4863"    0x0000000000000000 in ?? ()
  (gdb) thread 3
  [Switching to thread 3 (LWP 4865)]
  #0  0x0000000000000000 in ?? ()
  (gdb) bt
  #0  0x0000000000000000 in ?? ()
  Backtrace stopped: Cannot access memory at address 0x0

Fixes: 4727dc20e0 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
Link: https://lore.kernel.org/io-uring/044d0bad-6888-a211-e1d3-159a4aeed52d@polymtl.ca/T/#m1bbf5727e3d4e839603f6ec7ed79c7eebfba6267
Signed-off-by: Stefan Metzmacher <metze@samba.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Andy Lutomirski <luto@kernel.org>
cc: linux-kernel@vger.kernel.org
cc: io-uring@vger.kernel.org
cc: x86@kernel.org
Link: https://lore.kernel.org/r/20210505110310.237537-1-metze@samba.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-05 17:47:41 -06:00
Lai Jiangshan
a217a6593c KVM/VMX: Invoke NMI non-IST entry instead of IST entry
In VMX, the host NMI handler needs to be invoked after NMI VM-Exit.
Before commit 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect
call instead of INTn"), this was done by INTn ("int $2"). But INTn
microcode is relatively expensive, so the commit reworked NMI VM-Exit
handling to invoke the kernel handler by function call.

But this missed a detail. The NMI entry point for direct invocation is
fetched from the IDT table and called on the kernel stack.  But on 64-bit
the NMI entry installed in the IDT expects to be invoked on the IST stack.
It relies on the "NMI executing" variable on the IST stack to work
correctly, which is at a fixed position in the IST stack.  When the entry
point is unexpectedly called on the kernel stack, the RSP-addressed "NMI
executing" variable is obviously also on the kernel stack and is
"uninitialized" and can cause the NMI entry code to run in the wrong way.

Provide a non-ist entry point for VMX which shares the C-function with
the regular NMI entry and invoke the new asm entry point instead.

On 32-bit this just maps to the regular NMI entry point as 32-bit has no
ISTs and is not affected.

[ tglx: Made it independent for backporting, massaged changelog ]

Fixes: 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lai Jiangshan <laijs@linux.alibaba.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de
2021-05-05 22:54:10 +02:00
Sean Christopherson
fc48a6d1fa x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the
latter has only a single user and is slightly misleading since the only
in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210504225632.1532621-3-seanjc@google.com
2021-05-05 21:50:14 +02:00
Sean Christopherson
b6b4fbd90b x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
Initialize MSR_TSC_AUX with CPU node information if RDTSCP or RDPID is
supported.  This fixes a bug where vdso_read_cpunode() will read garbage
via RDPID if RDPID is supported but RDTSCP is not.  While no known CPU
supports RDPID but not RDTSCP, both Intel's SDM and AMD's APM allow for
RDPID to exist without RDTSCP, e.g. it's technically a legal CPU model
for a virtual machine.

Note, technically MSR_TSC_AUX could be initialized if and only if RDPID
is supported since RDTSCP is currently not used to retrieve the CPU node.
But, the cost of the superfluous WRMSR is negigible, whereas leaving
MSR_TSC_AUX uninitialized is just asking for future breakage if someone
decides to utilize RDTSCP.

Fixes: a582c540ac ("x86/vdso: Use RDPID in preference to LSL when available")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210504225632.1532621-2-seanjc@google.com
2021-05-05 21:50:14 +02:00
Andi Kleen
4029b9706d x86/resctrl: Fix init const confusion
const variable must be initconst, not initdata.

Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210425211229.3157674-1-ak@linux.intel.com
2021-05-05 21:50:14 +02:00
Wan Jiabing
3cf4524ce4 x86/smpboot: Remove duplicate includes
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210427063835.9039-1-wanjiabing@vivo.com
2021-05-05 21:50:13 +02:00
Vitaly Kuznetsov
0a269a008f x86/kvm: Fix pr_info() for async PF setup/teardown
'pr_fmt' already has 'kvm-guest: ' so 'KVM' prefix is redundant.
"Unregister pv shared memory" is very ambiguous, it's hard to
say which particular PV feature it relates to.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-05 04:54:51 -04:00
Linus Torvalds
152d32aa84 ARM:
- Stage-2 isolation for the host kernel when running in protected mode
 
 - Guest SVE support when running in nVHE mode
 
 - Force W^X hypervisor mappings in nVHE mode
 
 - ITS save/restore for guests using direct injection with GICv4.1
 
 - nVHE panics now produce readable backtraces
 
 - Guest support for PTP using the ptp_kvm driver
 
 - Performance improvements in the S2 fault handler
 
 x86:
 
 - Optimizations and cleanup of nested SVM code
 
 - AMD: Support for virtual SPEC_CTRL
 
 - Optimizations of the new MMU code: fast invalidation,
   zap under read lock, enable/disably dirty page logging under
   read lock
 
 - /dev/kvm API for AMD SEV live migration (guest API coming soon)
 
 - support SEV virtual machines sharing the same encryption context
 
 - support SGX in virtual machines
 
 - add a few more statistics
 
 - improved directed yield heuristics
 
 - Lots and lots of cleanups
 
 Generic:
 
 - Rework of MMU notifier interface, simplifying and optimizing
 the architecture-specific code
 
 - Some selftests improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
 y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
 c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
 Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
 +2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
 M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
 =AXUi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This is a large update by KVM standards, including AMD PSP (Platform
  Security Processor, aka "AMD Secure Technology") and ARM CoreSight
  (debug and trace) changes.

  ARM:

   - CoreSight: Add support for ETE and TRBE

   - Stage-2 isolation for the host kernel when running in protected
     mode

   - Guest SVE support when running in nVHE mode

   - Force W^X hypervisor mappings in nVHE mode

   - ITS save/restore for guests using direct injection with GICv4.1

   - nVHE panics now produce readable backtraces

   - Guest support for PTP using the ptp_kvm driver

   - Performance improvements in the S2 fault handler

  x86:

   - AMD PSP driver changes

   - Optimizations and cleanup of nested SVM code

   - AMD: Support for virtual SPEC_CTRL

   - Optimizations of the new MMU code: fast invalidation, zap under
     read lock, enable/disably dirty page logging under read lock

   - /dev/kvm API for AMD SEV live migration (guest API coming soon)

   - support SEV virtual machines sharing the same encryption context

   - support SGX in virtual machines

   - add a few more statistics

   - improved directed yield heuristics

   - Lots and lots of cleanups

  Generic:

   - Rework of MMU notifier interface, simplifying and optimizing the
     architecture-specific code

   - a handful of "Get rid of oprofile leftovers" patches

   - Some selftests improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
  KVM: selftests: Speed up set_memory_region_test
  selftests: kvm: Fix the check of return value
  KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
  KVM: SVM: Skip SEV cache flush if no ASIDs have been used
  KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
  KVM: SVM: Drop redundant svm_sev_enabled() helper
  KVM: SVM: Move SEV VMCB tracking allocation to sev.c
  KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
  KVM: SVM: Unconditionally invoke sev_hardware_teardown()
  KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
  KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
  KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
  KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
  KVM: SVM: Move SEV module params/variables to sev.c
  KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
  KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
  KVM: SVM: Zero out the VMCB array used to track SEV ASID association
  x86/sev: Drop redundant and potentially misleading 'sev_enabled'
  KVM: x86: Move reverse CPUID helpers to separate header file
  KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
  ...
2021-05-01 10:14:08 -07:00
Brian Geffon
14d071134c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called.  This meant that
vm_ops->mremap() would never be called on such mappings.  Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Linus Torvalds
635de956a7 The x86 MM changes in this cycle were:
- Implement concurrent TLB flushes, which overlaps the local TLB flush with the
    remote TLB flush. In testing this improved sysbench performance measurably by
    a couple of percentage points, especially if TLB-heavy security mitigations
    are active.
 
  - Further micro-optimizations to improve the performance of TLB flushes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCKbNcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hjYBAAsyNUa/gOu0g6/Cx8R86w9HtHHmm5vso/
 6nJjWj2fd2qJ9JShlddxvXEMeXtPTYabVWQkiiriFMuofk6JeKnlHm1Jzl6keABX
 OQFwjIFeNASPRcdXvuuYPOVWAJJdr2oL9QUr6OOK1ccQJTz/Cd0zA+VQ5YqcsCon
 yaWbkxELwKXpgql+qt66eAZ6Q2Y1TKXyrTW7ZgxQi0yeeWqMaEOub0/oyS7Ax1Rg
 qEJMwm1prb76NPzeqR/G3e4KTrDZfQ/B/KnSsz36GTJpl4eye6XqWDUgm1nAGNIc
 5dbc4Vx7JtZsUOuC0AmzWb3hsDyzVcN/lQvijdZ2RsYR3gvuYGaBhKqExqV0XH6P
 oqaWOKWCz+LqWbsgJmxCpqkt1LZl5+VUOcfJ97WkIS7DyIPtSHTzQXbBMZqKLeat
 mn5UcKYB2Gi7wsUPv6VC2ChKbDqN0VT8G86XbYylGo4BE46KoZKPUNY/QWKLUPd6
 0UKcVeNM2HFyf1C73p/tO/z7hzu3qLuMMnsphP6/c2pKLpdgawEXgbnVKNId1B/c
 NrzyhTvVaMt+Um28bBRhHONIlzPJwWcnZbdY7NqMnu+LBKQ68cL/h4FOIV/RDLNb
 GJLgfAr8fIw/zIpqYuFHiiMNo9wWqVtZko1MvXhGceXUL69QuzTra2XR/6aDxkPf
 6gQVesetTvo=
 =3Cyp
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tlb updates from Ingo Molnar:
 "The x86 MM changes in this cycle were:

   - Implement concurrent TLB flushes, which overlaps the local TLB
     flush with the remote TLB flush.

     In testing this improved sysbench performance measurably by a
     couple of percentage points, especially if TLB-heavy security
     mitigations are active.

   - Further micro-optimizations to improve the performance of TLB
     flushes"

* tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Micro-optimize smp_call_function_many_cond()
  smp: Inline on_each_cpu_cond() and on_each_cpu()
  x86/mm/tlb: Remove unnecessary uses of the inline keyword
  cpumask: Mark functions as pure
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  smp: Run functions concurrently in smp_call_function_many_cond()
2021-04-29 11:41:43 -07:00
Linus Torvalds
0080665fbd Devicetree updates for v5.13:
- Refactoring powerpc and arm64 kexec DT handling to common code. This
   enables IMA on arm64.
 
 - Add kbuild support for applying DT overlays at build time. The first
   user are the DT unittests.
 
 - Fix kerneldoc formatting and W=1 warnings in drivers/of/
 
 - Fix handling 64-bit flag on PCI resources
 
 - Bump dtschema version required to v2021.2.1
 
 - Enable undocumented compatible checks for dtbs_check. This allows
   tracking of missing binding schemas.
 
 - DT docs improvements. Regroup the DT docs and add the example schema
   and DT kernel ABI docs to the doc build.
 
 - Convert Broadcom Bluetooth and video-mux bindings to schema
 
 - Add QCom sm8250 Venus video codec binding schema
 
 - Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
   Technologies Inc.
 
 - Cleanup of DT schema type references on common properties and
   standard unit properties
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmCIYdgQHHJvYmhAa2Vy
 bmVsLm9yZwAKCRD6+121jbxhw/PKEACkOCWDnLSY9U7w1uGDHr6UgXIWOY9j8bYy
 2pTvDrVa6KZphT6yGU/hxrOk8Mqh5AMd2vUhO2OCoyyl/priTv+Ktqo+bikvJZLa
 MQm3JnrLpPy/GetdmVD8wq1l+FoeOSTnRIJqRxInsd8UFVpZImtP22ELox6KgGiv
 keVHIrjsHU/HpafK3w8wHCLikCZk+1Gl6pL/QgFDv2FaaCTKW16Dt64dPqYm49Xk
 j7YMMQWl+3NJ9ywZV0+PMbl9udi3EjGm5Ap5VfKzpj53Nh07QObg/QtH/1sj0HPo
 apyW7jAyQFyLytbjxzFL/tljtOeW/5rZos1GWThZ326e+Y0mTKUTDZShvNplfjIf
 e26FvVi7gndWlRSr30Ia5gdNFAx72IkpJUAuypBXgb+qNPchBJjAXLn9tcIcg/k+
 2R6BIB7SkVLpgTnJ1Bq1+PRqkKM+ggACdJNJIUApj44xoiG01vtGDGRaFuIio+Ch
 HT4aBbic4kLvagm8VzuiIF/sL7af5pntzArcyOfQTaZ92DyGI2C0j90rK3yPRIYM
 u9qX/24t1SXiUji74QpoQFzt/+Egy5hYXMJOJJSywUjKf7DBhehqklTjiJRQHKm6
 0DJ/n8q4lNru8F0Y4keKSuYTfHBstF7fS3UTH/rUmBAbfEwkvZe6B29KQbs+7aph
 GTw+jeoR5Q==
 =rF27
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:

 - Refactor powerpc and arm64 kexec DT handling to common code. This
   enables IMA on arm64.

 - Add kbuild support for applying DT overlays at build time. The first
   user are the DT unittests.

 - Fix kerneldoc formatting and W=1 warnings in drivers/of/

 - Fix handling 64-bit flag on PCI resources

 - Bump dtschema version required to v2021.2.1

 - Enable undocumented compatible checks for dtbs_check. This allows
   tracking of missing binding schemas.

 - DT docs improvements. Regroup the DT docs and add the example schema
   and DT kernel ABI docs to the doc build.

 - Convert Broadcom Bluetooth and video-mux bindings to schema

 - Add QCom sm8250 Venus video codec binding schema

 - Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
   Technologies Inc.

 - Cleanup of DT schema type references on common properties and
   standard unit properties

* tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (64 commits)
  powerpc: If kexec_build_elf_info() fails return immediately from elf64_load()
  powerpc: Free fdt on error in elf64_load()
  of: overlay: Fix kerneldoc warning in of_overlay_remove()
  of: linux/of.h: fix kernel-doc warnings
  of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses
  dt-bindings: bcm4329-fmac: add optional brcm,ccode-map
  docs: dt: update writing-schema.rst references
  dt-bindings: media: venus: Add sm8250 dt schema
  of: base: Fix spelling issue with function param 'prop'
  docs: dt: Add DT API documentation
  of: Add missing 'Return' section in kerneldoc comments
  of: Fix kerneldoc output formatting
  docs: dt: Group DT docs into relevant sub-sections
  docs: dt: Make 'Devicetree' wording more consistent
  docs: dt: writing-schema: Include the example schema in the doc build
  docs: dt: writing-schema: Remove spurious indentation
  dt-bindings: Fix reference in submitting-patches.rst to the DT ABI doc
  dt-bindings: ddr: Add optional manufacturer and revision ID to LPDDR3
  dt-bindings: media: video-interfaces: Drop the example
  devicetree: bindings: clock: Minor typo fix in the file armada3700-tbg-clock.txt
  ...
2021-04-28 15:50:24 -07:00
Linus Torvalds
42dec9a936 Perf events changes in this cycle were:
- Improve Intel uncore PMU support:
 
      - Parse uncore 'discovery tables' - a new hardware capability enumeration method
        introduced on the latest Intel platforms. This table is in a well-defined PCI
        namespace location and is read via MMIO. It is organized in an rbtree.
 
        These uncore tables will allow the discovery of standard counter blocks, but
        fancier counters still need to be enumerated explicitly.
 
      - Add Alder Lake support
 
      - Improve IIO stacks to PMON mapping support on Skylake servers
 
  - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
    and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
    cores.
 
    The CPU-side feature set is entirely symmetrical - but on the PMU side there's
    core type dependent PMU functionality.
 
  - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
    fixing the AUX allocation watermark logic.
 
  - Improve ring buffer allocation on NUMA systems
 
  - Put 'struct perf_event' into their separate kmem_cache pool
 
  - Add support for synchronous signals for select perf events. The immediate motivation
    is to support low-overhead sampling-based race detection for user-space code. The
    feature consists of the following main changes:
 
     - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
       inheritance of events to CLONE_THREAD.
 
     - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
 
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64
       ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
 
    The siginfo support is adequate for breakpoints right now - but the new field can be used
    to introduce support for other types of metadata passed over siginfo as well.
 
  - Misc fixes, cleanups and smaller updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
 QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
 mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
 NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
 5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
 UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
 USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
 I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
 BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
 7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
 nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
 goUMOL/D0lg=
 =NhQ7
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - Improve Intel uncore PMU support:

     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.

       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.

     - Add Alder Lake support

     - Improve IIO stacks to PMON mapping support on Skylake servers

 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.

   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.

 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.

 - Improve ring buffer allocation on NUMA systems

 - Put 'struct perf_event' into their separate kmem_cache pool

 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:

     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.

     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.

     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with an u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.

   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.

 - Misc fixes, cleanups and smaller updates.

* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ...
2021-04-28 13:03:44 -07:00