43577 Commits

Author SHA1 Message Date
Linus Torvalds
1f2d9ffc7a Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic
    with large number of CPUs.
 
  - Fix & rework various cpuidle routines, simplify interaction with
    the generic scheduler code. Add __cpuidle methods as noinstr to
    objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
 
  - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
    to query previously issued registrations.
 
  - Limit scheduler slice duration to the sysctl_sched_latency period,
    to improve scheduling granularity with a large number of SCHED_IDLE
    tasks.
 
  - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
    but also enable them to prevent a cascade of followup problems and
    repeat warnings.
 
  - Fix the rescheduling logic in prio_changed_dl().
 
  - Micro-optimize cpufreq and sched-util methods.
 
  - Micro-optimize ttwu_runnable()
 
  - Micro-optimize the idle-scanning in update_numa_stats(),
    select_idle_capacity() and steal_cookie_task().
 
  - Update the RSEQ code & self-tests
 
  - Constify various scheduler methods
 
  - Remove unused methods
 
  - Refine __init tags
 
  - Documentation updates
 
  - ... Misc other cleanups, fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
 iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
 7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
 UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
 d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
 eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
 AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
 UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
 VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
 H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
 T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
 skkQXoONNaM=
 =l1nN
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix & rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs & quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code & self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
2023-02-20 17:41:08 -08:00
Linus Torvalds
a2f0e7eee1 The latest perf updates in this cycle are:
- Optimize perf_sample_data layout
  - Prepare sample data handling for BPF integration
  - Update the x86 PMU driver for Intel Meteor Lake
  - Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
    discovery breakage
  - Fix the x86 Zhaoxin PMU driver
  - Cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzaHgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jYQg/+KRfobCevMQlZVnz09T3SsJ4ahJ587BL6
 g2C6kobyUNfeChpFVroBkTR+yCb6Mq4xGr2nda9+2E978BYu9eanpx/u/bXNQ6NU
 6YhLwgRrlFXonYn07kFfUJeELZ0W+zpPvymEN1KhTQWcrgXDfXRt2VfMwNsVxGRF
 ZRyCWK+UOzSMU22FtW3I/xVLBB0vio9Y6wRC5QOpDVW5YtGwQGust7GJ53JPK43J
 m2soJvWORauT+v0aqc7ggOtKd6pahVoXrDrbktxtq9N0ZGI+PubVCGevex++cXm/
 B3QSf6VcMMuU6pfzxiEwRa8Whrc3XFeSDEfvMjC5v3becGNkdNBnGOJzYprwgRZJ
 irb6/dSrv5P2lj6WphsO1Wzcm7EoWh8M7DVOMh/13Y/oODRdOrv48112Don9UURC
 EPyvzAzizqdwdDopUmfiqUwuAXqb8uPZqCgmlz/NJkVz1/ijlfrmLgeDuf0vI7Aq
 HznzzRwjFHzyCH7D+rtonFh3JDaqgaouY76tpC5yTtzKbZPlFT8kzeCvqkTMnGgH
 czZnSNc/kBup0HDkNSlthK+TyrMXWKeVa8KQSY1E0NJHO4IBBCMzZywSoAaeofQK
 hqfQyofX9XHmuHhCA4yIfv1XkZGlBTxpPAyDdHjgs9iJTsodSYMs8ESY08eW8DXn
 Ld/35O6SylM=
 =ztUT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:

 - Optimize perf_sample_data layout

 - Prepare sample data handling for BPF integration

 - Update the x86 PMU driver for Intel Meteor Lake

 - Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
   discovery breakage

 - Fix the x86 Zhaoxin PMU driver

 - Cleanups

* tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  perf/x86/intel/uncore: Add Meteor Lake support
  x86/perf/zhaoxin: Add stepping check for ZXC
  perf/x86/intel/ds: Fix the conversion from TSC to perf time
  perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
  perf/x86/uncore: Add a quirk for UPI on SPR
  perf/x86/uncore: Ignore broken units in discovery table
  perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
  perf/x86/uncore: Factor out uncore_device_to_die()
  perf/core: Call perf_prepare_sample() before running BPF
  perf/core: Introduce perf_prepare_header()
  perf/core: Do not pass header for sample ID init
  perf/core: Set data->sample_flags in perf_prepare_sample()
  perf/core: Add perf_sample_save_brstack() helper
  perf/core: Add perf_sample_save_raw_data() helper
  perf/core: Add perf_sample_save_callchain() helper
  perf/core: Save the dynamic parts of sample data size
  x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulation
  perf/core: Change the layout of perf_sample_data
  perf/x86/msr: Add Meteor Lake support
  perf/x86/cstate: Add Meteor Lake support
  ...
2023-02-20 17:29:55 -08:00
Linus Torvalds
6e649d0856 Updates for this cycle were:
- rwsem micro-optimizations
  - spinlock micro-optimizations
  - cleanups, simplifications
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzZkURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gBvA/6A5RMmSdOIVojmiwUYvZInA/Kpm2wuW0q
 bQWW9maLb96JMpj3FB5Xs5U993WiF0Gt9aGHoND9V2wOYbnv01ElCKKgsw7zLnXb
 c++txpmD+HoUGp94H8T2nA3szPLR7OpPpLmfjTWHKeWQRTStJobTTqi5jVTUZT37
 92MZ2tVzapQJq5VESk0C+0FBFDobh0gTX8hwkEj83ubXK4rC071/gJD4JHZt4nWN
 Up9YGNoNvw+ns7upo2C1XJ4H4ucFoCXT2smH4Oh0gk8Cfs6oP1k5H8J5aJQ+23fT
 EOWzkk0vJdpukNXI1+4G4KMwCO6zv+xVxXpEBizEeTgKWwbJpgBeGrisheAUyMHT
 bwfztsn+NQET11NsccmtRzspscUT42Nc+FUW0KeR2LiBKZhuD6l1Tac3w2HolycA
 2YjMQx3ATOEnMFgv4jGlldlasIAnYj0qitw6wCGqkJSvrC3au/LfcBHn45SxkBWc
 KZV1Oj26aH1hDxYSLyZRmFEvf/46D9CHmv8ReuFbmM6FIYwL+go+Odw/MHMFAbZR
 aP9YR4e94p6WmaNwMqzozP+wN67E4TME2vG6+T1n/szKDogoBlcn/wl053pkcHa/
 CsjELY82/CRRrDgWSnSKUZEFvnnBujyEiSz7pTZCzdBTMc/EcxK5CHOSyN23x+LI
 TvvxFn7KM/o=
 =yTly
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - rwsem micro-optimizations

 - spinlock micro-optimizations

 - cleanups, simplifications

* tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vduse: Remove include of rwlock.h
  locking/lockdep: Remove lockdep_init_map_crosslock.
  x86/ACPI/boot: Use try_cmpxchg() in __acpi_{acquire,release}_global_lock()
  x86/PAT: Use try_cmpxchg() in set_page_memtype()
  locking/rwsem: Disable preemption in all down_write*() and up_write() code paths
  locking/rwsem: Disable preemption in all down_read*() and up_read() code paths
  locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  locking/qspinlock: Micro-optimize pending state waiting for unlock
2023-02-20 17:18:23 -08:00
Linus Torvalds
db77b8502a asm-generic: cleanups for 6.3
Only three minor changes: a cross-platform series from Mike Rapoport to
 consolidate asm/agp.h between architectures, and a correctness change
 for __generic_cmpxchg_local() from Matt Evans.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmPvuk8ACgkQmmx57+YA
 GNmbHw/9GbG2s+rUXOKZCx/ChA11aawJ2K7FUB8zNAb2TSInxgV8/RFdQyTgcmi7
 lh7TFOSiWWw0TvYGPz8gyP70vqGM6SEfepB9Kx5Wnb8VrXAEDX80Y1PrFojp5emF
 CYUXIzvT5XrbCLJFOpsxjEK4BB3DBfujosZZHBxx2UzUdme4lwL2vjzDmfbMpfRy
 N/TiqW96I3E9TPvqac567jmq4ghrhnFAuD3fqAndCpv0ANtZT/iNaROAgTiEOCUL
 azUoe6e6W+oIkV9tnzwaAtIBs5pkdt0DmPymxCvschpmLuh952YfJxuu6dwwl6Ue
 DLVntWRYmRBgfxi8e4DbRURa5rFnj7xE+ZgsszvJJyditCHWuw4DWU4eI3SzDSV1
 YVTjhDGoIBQYpMPeNMEaDfmMC6h7b+fP1zDwBA1mQlpS/YQJGntQ5jU6p+46ceFG
 ZfoniYOfEjwJlJA6G5yTGcro4Z1U7ghg7rvp/iTvAVM+5T3hEoLbDcI1jZSTXQB7
 JTi6LzdQVsqdQQReNAtpcB3V9l5OT8ZeFeMqd5b4i7pEs5SteUjTa23Mj+O7fmM1
 LLoeLb3X5N9DiMRaOEjJAfsS/aEsC+whf6qIl81s22XnZEQ4h3BtBrNQY6eP8mAD
 rtojRnWCJI2vYVyQTIWXN91f2cqRww4J22GZHn9a1DdMHUDCSoM=
 =3646
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic cleanups from Arnd Bergmann:
 "Only three minor changes: a cross-platform series from Mike Rapoport
  to consolidate asm/agp.h between architectures, and a correctness
  change for __generic_cmpxchg_local() from Matt Evans"

* tag 'asm-generic-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  char/agp: introduce asm-generic/agp.h
  char/agp: consolidate {alloc,free}_gatt_pages()
  locking/atomic: cmpxchg: Make __generic_cmpxchg_local compare against zero-extended 'old' value
2023-02-20 15:55:47 -08:00
Yang Jihong
f1c97a1b4e x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range
When arch_prepare_optimized_kprobe calculating jump destination address,
it copies original instructions from jmp-optimized kprobe (see
__recover_optprobed_insn), and calculated based on length of original
instruction.

arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when
checking whether jmp-optimized kprobe exists.
As a result, setup_detour_execution may jump to a range that has been
overwritten by jump destination address, resulting in an inval opcode error.

For example, assume that register two kprobes whose addresses are
<func+9> and <func+11> in "func" function.
The original code of "func" function is as follows:

   0xffffffff816cb5e9 <+9>:     push   %r12
   0xffffffff816cb5eb <+11>:    xor    %r12d,%r12d
   0xffffffff816cb5ee <+14>:    test   %rdi,%rdi
   0xffffffff816cb5f1 <+17>:    setne  %r12b
   0xffffffff816cb5f5 <+21>:    push   %rbp

1.Register the kprobe for <func+11>, assume that is kp1, corresponding optimized_kprobe is op1.
  After the optimization, "func" code changes to:

   0xffffffff816cc079 <+9>:     push   %r12
   0xffffffff816cc07b <+11>:    jmp    0xffffffffa0210000
   0xffffffff816cc080 <+16>:    incl   0xf(%rcx)
   0xffffffff816cc083 <+19>:    xchg   %eax,%ebp
   0xffffffff816cc084 <+20>:    (bad)
   0xffffffff816cc085 <+21>:    push   %rbp

Now op1->flags == KPROBE_FLAG_OPTIMATED;

2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2.

register_kprobe(kp2)
  register_aggr_kprobe
    alloc_aggr_kprobe
      __prepare_optimized_kprobe
        arch_prepare_optimized_kprobe
          __recover_optprobed_insn    // copy original bytes from kp1->optinsn.copied_insn,
                                      // jump address = <func+14>

3. disable kp1:

disable_kprobe(kp1)
  __disable_kprobe
    ...
    if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
      ret = disarm_kprobe(orig_p, true)       // add op1 in unoptimizing_list, not unoptimized
      orig_p->flags |= KPROBE_FLAG_DISABLED;  // op1->flags ==  KPROBE_FLAG_OPTIMATED | KPROBE_FLAG_DISABLED
    ...

4. unregister kp2
__unregister_kprobe_top
  ...
  if (!kprobe_disabled(ap) && !kprobes_all_disarmed) {
    optimize_kprobe(op)
      ...
      if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return
        return;
      p->kp.flags |= KPROBE_FLAG_OPTIMIZED;   //  now op2 has KPROBE_FLAG_OPTIMIZED
  }

"func" code now is:

   0xffffffff816cc079 <+9>:     int3
   0xffffffff816cc07a <+10>:    push   %rsp
   0xffffffff816cc07b <+11>:    jmp    0xffffffffa0210000
   0xffffffff816cc080 <+16>:    incl   0xf(%rcx)
   0xffffffff816cc083 <+19>:    xchg   %eax,%ebp
   0xffffffff816cc084 <+20>:    (bad)
   0xffffffff816cc085 <+21>:    push   %rbp

5. if call "func", int3 handler call setup_detour_execution:

  if (p->flags & KPROBE_FLAG_OPTIMIZED) {
    ...
    regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
    ...
  }

The code for the destination address is

   0xffffffffa021072c:  push   %r12
   0xffffffffa021072e:  xor    %r12d,%r12d
   0xffffffffa0210731:  jmp    0xffffffff816cb5ee <func+14>

However, <func+14> is not a valid start instruction address. As a result, an error occurs.

Link: https://lore.kernel.org/all/20230216034247.32348-3-yangjihong1@huawei.com/

Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-02-21 08:49:16 +09:00
Yang Jihong
868a6fc0ca x86/kprobes: Fix __recover_optprobed_insn check optimizing logic
Since the following commit:

  commit f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")

modified the update timing of the KPROBE_FLAG_OPTIMIZED, a optimized_kprobe
may be in the optimizing or unoptimizing state when op.kp->flags
has KPROBE_FLAG_OPTIMIZED and op->list is not empty.

The __recover_optprobed_insn check logic is incorrect, a kprobe in the
unoptimizing state may be incorrectly determined as unoptimizing.
As a result, incorrect instructions are copied.

The optprobe_queued_unopt function needs to be exported for invoking in
arch directory.

Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/

Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
Cc: stable@vger.kernel.org
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-02-21 08:49:16 +09:00
Paolo Bonzini
4090871d77 KVM/arm64 updates for 6.3
- Provide a virtual cache topology to the guest to avoid
    inconsistencies with migration on heterogenous systems. Non secure
    software has no practical need to traverse the caches by set/way in
    the first place.
 
  - Add support for taking stage-2 access faults in parallel. This was an
    accidental omission in the original parallel faults implementation,
    but should provide a marginal improvement to machines w/o FEAT_HAFDBS
    (such as hardware from the fruit company).
 
  - A preamble to adding support for nested virtualization to KVM,
    including vEL2 register state, rudimentary nested exception handling
    and masking unsupported features for nested guests.
 
  - Fixes to the PSCI relay that avoid an unexpected host SVE trap when
    resuming a CPU when running pKVM.
 
  - VGIC maintenance interrupt support for the AIC
 
  - Improvements to the arch timer emulation, primarily aimed at reducing
    the trap overhead of running nested.
 
  - Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
    interest of CI systems.
 
  - Avoid VM-wide stop-the-world operations when a vCPU accesses its own
    redistributor.
 
  - Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
    in the host.
 
  - Aesthetic and comment/kerneldoc fixes
 
  - Drop the vestiges of the old Columbia mailing list and add myself as
    co-maintainer
 
 This also drags in a couple of branches to avoid conflicts:
 
  - The shared 'kvm-hw-enable-refactor' branch that reworks
    initialization, as it conflicted with the virtual cache topology
    changes.
 
  - arm64's 'for-next/sme2' branch, as the PSCI relay changes, as both
    touched the EL2 initialization code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmPw29cPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpD9doQAIJyMW0odT6JBe15uGCxTuTnJbb8mniajJdX
 CuSxPl85WyKLtZbIJLRTQgyt6Nzbu0N38zM0y/qBZT5BvAnWYI8etvnJhYZjooAy
 jrf0Me/GM5hnORXN+1dByCmlV+DSuBkax86tgIC7HhU71a2SWpjlmWQi/mYvQmIK
 PBAqpFF+w2cWHi0ZvCq96c5EXBdN4FLEA5cdZhekCbgw1oX8+x+HxdpBuGW5lTEr
 9oWOzOzJQC1uFnjP3unFuIaG94QIo+NA4aGLMzfb7wm2wdQUnKebtdj/RxsDZOKe
 43Q1+MDFWMsxxFu4FULH8fPMwidIm5rfz3pw3JJloqaZp8vk/vjDLID7AYucMIX8
 1G/mjqz6E9lYvv57WBmBhT/+apSDAmeHlAT97piH73Nemga91esDKuHSdtA8uB5j
 mmzcUYajuB2GH9rsaXJhVKt/HW7l9fbGliCkI99ckq/oOTO9VsKLsnwS/rMRIsPn
 y2Y8Lyoe4eqokd1DNn5/bo+3qDnfmzm6iDmZOo+JYuJv9KS95zuw17Wu7la9UAPV
 e13+btoijHDvu8RnTecuXljWfAAKVtEjpEIoS5aP2R2iDvhr0d8POlMPaJ40YuRq
 D2fKr18b6ngt+aI0TY63/ksEIFexx67HuwQsUZ2lRjyjq5/x+u3YIqUPbKrU4Rnl
 uxXjSvyr
 =r4s/
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.3

 - Provide a virtual cache topology to the guest to avoid
   inconsistencies with migration on heterogenous systems. Non secure
   software has no practical need to traverse the caches by set/way in
   the first place.

 - Add support for taking stage-2 access faults in parallel. This was an
   accidental omission in the original parallel faults implementation,
   but should provide a marginal improvement to machines w/o FEAT_HAFDBS
   (such as hardware from the fruit company).

 - A preamble to adding support for nested virtualization to KVM,
   including vEL2 register state, rudimentary nested exception handling
   and masking unsupported features for nested guests.

 - Fixes to the PSCI relay that avoid an unexpected host SVE trap when
   resuming a CPU when running pKVM.

 - VGIC maintenance interrupt support for the AIC

 - Improvements to the arch timer emulation, primarily aimed at reducing
   the trap overhead of running nested.

 - Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
   interest of CI systems.

 - Avoid VM-wide stop-the-world operations when a vCPU accesses its own
   redistributor.

 - Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
   in the host.

 - Aesthetic and comment/kerneldoc fixes

 - Drop the vestiges of the old Columbia mailing list and add [Oliver]
   as co-maintainer

This also drags in arm64's 'for-next/sme2' branch, because both it and
the PSCI relay changes touch the EL2 initialization code.
2023-02-20 06:12:42 -05:00
Linus Torvalds
925cf0457d A single fix for x86:
Revert the recent change to the MTRR code which aimed to support
   SEV-SNP guests on Hyper-V. It cuased a regression on XEN Dom0
   kernels.
 
   The underlying issue of MTTR (mis)handling in the x86 code needs some
   deeper investigation and is definitely not 6.2 material.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmPxYccTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRESEACXnTGYtTw5JTXBcztuwXFsGuEvKcp4
 yj1TF2l/OVjWh4MurjJRch0CSzkVZYOn4ymqR4H9ivjuwQkKxzvsO03+AwuJr8Rf
 aDqbDzwf0i6E1NmbSU30GkQYGPFNY5fyEn46ptuzu8Humc4FfHt6IrJ5xgNt2nnr
 xzRB0kJrB+EITe07Oh+3mj2YeTHNF7Uk0wQFhhXeLK8Xr2W3w/mmTAEddU42fgTE
 GCboRec6a8eTOA+4l3gAINJcERxOBnwKEIWo2NWDWdMrrSfcq1Ig1vvHH10rIXxe
 cPIHZY19kIBEvALfKlkzZk13cl6XDhHUnyK5mc9KBi/jZWBK8DF/8reyelW2ymyJ
 GneU8HESPlVAv8UvyOc/1jFcolgASfScO3aG/S9twv0RGFUbRHVvk0yhyPJ61rPM
 ibva+7PQG70AQc+kmsRBtZnR7zHNtoOYz3ufkKfl+InknkAMByR0DmT7Iw4HouAZ
 oZ1WDAgrvg/ld+bIVjyjDnEmPyg50d3COjsGlzMTXeogJTERbIAUwTiC3iRmJksB
 evTqKEYYiBbeex9LI8FOHvmLVew1SEvCScXdXX2+SjXdtJoAKO9uoUScju9MWeq5
 /bdfudRekNG4z1fQ8Qhs/nBkVsIguT2xoMREqy6D3LRbEXgnCKiy5V1P1XEjlwwJ
 ecBpXlRM+lIaBQ==
 =nNAW
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "A single fix for x86.

  Revert the recent change to the MTRR code which aimed to support
  SEV-SNP guests on Hyper-V. It caused a regression on XEN Dom0 kernels.

  The underlying issue of MTTR (mis)handling in the x86 code needs some
  deeper investigation and is definitely not 6.2 material"

* tag 'x86-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mtrr: Revert 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
2023-02-18 17:57:16 -08:00
Linus Torvalds
5e725d112e x86:
* zero all padding for KVM_GET_DEBUGREGS
 
 * fix rST warning
 
 * disable vPMU support on hybrid CPUs
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmPw5PsUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNqaAf6A0zjrdY1KjSyvrcHr0NV6lUfU+Ye
 lc5xrtAJDuO7kERgnqFGPeg3a72tA4a9tlFrTdqqQIrAxnvVn4JNP5gtD7UxfpOn
 PELO7JUbG3/CV2oErTugH02n3lKN/pLSISAClFkO7uAL5sJEM2pXH+ws1CZ7F7kN
 FbPdnmvzi7tnTpv3oJ+gVl2l0HZYTnH4DydFGo68O3lP+oFgRXznkF5rpMxAe6oK
 93fvSWGabVCft278sSVq5XpYfKQSJb5j8KjB8L4qqAlRh0ZJA5haDZWQyaaJvNY0
 oefFj9XYPpA08l8VqZ2ti5vE4b6e+o2/oTg3Nwf/5DrQxJiYOGTHKoA/NA==
 =H68L
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm/x86 fixes from Paolo Bonzini:

 - zero all padding for KVM_GET_DEBUGREGS

 - fix rST warning

 - disable vPMU support on hybrid CPUs

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: initialize all of the kvm_debugregs structure before sending it to userspace
  perf/x86: Refuse to export capabilities for hybrid PMUs
  KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs)
  Documentation/hw-vuln: Fix rST warning
2023-02-18 11:07:32 -08:00
Jan Beulich
20e7da1bbb x86/Xen: drop leftover VM-assist uses
Both the 4Gb-segments and the PAE-extended-CR3 one are applicable to
32-bit guests only. The PAE-extended-CR3 use, furthermore, was redundant
with the PAE_MODE ELF note anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

Link: https://lore.kernel.org/r/215515af-cfb9-3237-03ba-3312e3fa0d34@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-18 09:59:01 +01:00
David S. Miller
675f176b4d Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Some of the devlink bits were tricky, but I think I got it right.

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17 11:06:39 +00:00
Greg Kroah-Hartman
2c10b61421 kvm: initialize all of the kvm_debugregs structure before sending it to userspace
When calling the KVM_GET_DEBUGREGS ioctl, on some configurations, there
might be some unitialized portions of the kvm_debugregs structure that
could be copied to userspace.  Prevent this as is done in the other kvm
ioctls, by setting the whole structure to 0 before copying anything into
it.

Bonus is that this reduces the lines of code as the explicit flag
setting and reserved space zeroing out can be removed.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable <stable@kernel.org>
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org>
Tested-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-16 12:31:40 -05:00
David Matlack
7f604e92fb KVM: x86/mmu: Make tdp_mmu_allowed static
Make tdp_mmu_allowed static since it is only ever used within
arch/x86/kvm/mmu/mmu.c.

Link: https://lore.kernel.org/kvm/202302072055.odjDVd5V-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20230213212844.3062733-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-16 12:29:50 -05:00
Nuno Das Neves
b14033a3e6 x86/hyperv: Fix hv_get/set_register for nested bringup
hv_get_nested_reg only translates SINT0, resulting in the wrong sint
being registered by nested vmbus.

Fix the issue with new utility function hv_is_sint_reg.

While at it, improve clarity of hv_set_non_nested_register and hv_is_synic_reg.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Jinank Jain <jinankjain@linux.microsoft.com>
Link: https://lore.kernel.org/r/1675980172-6851-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-02-16 14:32:37 +00:00
Paolo Bonzini
33436335e9 KVM/riscv changes for 6.3
- Fix wrong usage of PGDIR_SIZE to check page sizes
 - Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect()
 - Redirect illegal instruction traps to guest
 - SBI PMU support for guest
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmPifFIACgkQrUjsVaLH
 LAcEyxAAinMBaBhiPmwWZQvcCzh/UFmJo8BQCwAPuwoc/a4ZGAR7ylzd0oJilP8M
 wSgX6Ad8XF+CEW2VpxW9nwyi41N25ep1Lrf8vOaWy9L9QNUo0t15WrCIbXT2p399
 HrK9fz7HHKKIMsJy+rYb9EepdmMf55xtr1Y/EjyvhoDQbrEMlKsAODYz/SUoriQG
 Tn3cCYBzLdvzDzu0xXM9v+nsetWXdajK/v4je+mE3NQceXhePAO4oVWP4IpnoROd
 ZQm3evvVdf0WtKG9curxwMB7jjBqDBFrcLYl0qHGa7pi2o5PzVM7esgaV47KwetH
 IgA/Mrf1IfzpgM7VYDDax5wUHlKj63KisqU0J8rU3PUloQXaWqv7+ho51t9GzZ/i
 9x4uyO/evVntgyTw6HCbqmQJDgEtJiG1ydrR/ydBMYHLnh7LPY2UpKgcqmirtbkK
 1/DYDp84vikQ5VW1hc8IACdoBShh9Moh4xsEStzkTrIeHcZCjtORXUh8UIPZ0Mu2
 7Mnkktu9I55SLwA3rwH/EYT1ISrOV1G+q3wfqgeLpn8YUWwCIiqWQ5Ur0/WSMJse
 uJ3HedZDzj9T4n4khX+mKEYh6joAafQZag+4TID2lRSwd0S/mpeC22hYrViMdDmq
 yhE+JNin/sz4AVaHNzGwfqk2NC2RFl9aRn2X0xTwyBubif9pKMQ=
 =spUL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.3-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.3

- Fix wrong usage of PGDIR_SIZE to check page sizes
- Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect()
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
2023-02-15 12:33:28 -05:00
Paolo Bonzini
27b025ebb0 KVM VMX changes for 6.3:
- Handle NMI VM-Exits before leaving the noinstr region
 
  - A few trivial cleanups in the VM-Enter flows
 
  - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
    EPTP switching (or any other VM function) for L1
 
  - Fix a crash when using eVMCS's enlighted MSR bitmaps
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsL4USHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5fvQP/jOhCos/8VrNPJ5vtVkELSgbefbLFPbR
 ThCefVekbnbqkJ4PdxLFOw/nzK+NREnoet02lvLXIeBt243qMeDWbNuGncZ+rqqX
 ohJ6ALIS7ag2egG2ZllM4KTV1jWnm7bG6C8QsJwrFuG7wHe69lPYNnv34OD4l8QO
 PR7reXBl7iyqQehF0rCqkYvtpv1QI+VlVd3ODCg/0vaZX4AOPZI6X1hX2+myUfTy
 LI1oQ88SQ0ptoH5o7xBtNQw7Zknm/SP82Djzpg/1Wpr3cZv4g/1vblNcxpU/D+3I
 Cp8xM42c82GIX1Zk6uYfsoCyQZ7Bd38llbgcLM5Fi0F2IuKJKoN1iuH77RJ1Juqf
 v+SkRmFAr5Uf3UYVp4ck4FxTRH58ooDPeIv2jdaguyE9NgRfbVsghn6QDC+gkZMx
 z4xdwCBXgsAAkhJwOLqDy98ZrQZoy1kxlmG2jLIe5RvOzI4WQDUJkJAd+u0dBD/W
 U7PRqbqfJFtvvKN784BTa4CO8m+KwaNz4Xrwcg4YtBnKBgDU/cxFUZnnSEHdi9Pr
 tVf462dfReK3slK3rkPw5HlacFdQsgfX6+VX1eeFKY+gE88TXoZks/X7ma24MXEw
 hOu6VZMat1NjkfcgUcMffBVZlejK7C6oXTeVmqVKsx6hUs+zzqxpQ05AT4TMbAUI
 gPTHJyy9ByHO
 =WXGG
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.3:

 - Handle NMI VM-Exits before leaving the noinstr region

 - A few trivial cleanups in the VM-Enter flows

 - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
   EPTP switching (or any other VM function) for L1

 - Fix a crash when using eVMCS's enlighted MSR bitmaps
2023-02-15 12:23:19 -05:00
Paolo Bonzini
4bc6dcaa15 KVM SVM changes for 6.3:
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
 
  - Move the SVM-specific "host flags" into vcpu_svm (extracted from the
    vNMI enabling series)
 
  - A handful for fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsH7kSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5/dQQAJSVCYA7F7LcfJf+c1ULG0XCd8rHnXdR
 EGTygTnWrzwTCRaunOBPE4AJRxKdkwTKy+yfnnQVRdYYfRe1SZKpUQ2XNEDGn0+v
 zVfOmSFFcCWXmJeY8y5n1GlDH4ENO3G7nD1ncDQ0I9PazmOsmxChoVZ9afFJ5bpo
 73hjcYVfUDYxGkeRLWSSSFtWIGguE8BkpRH3wZ8MGZi+ueoFUJPPBKHeDtxCV2/T
 KcJLne8tQVTiWCdMO3EFwxgIvsjQDoT0gZYLYNHJ6KqD9Szc3jA9v2ryTm5IYlpb
 akYUqePaD0SGrrfDBrwz3bLu3fehDu7eduXESRlzzb8S4xP7/qXeo9KeVN+b4MBb
 nmBBFncvMWbC8Po5wB5OVAfAa7ACmGiXeBV8pfgGI6FTq1fpc4VNm2PevKkDvlqN
 O2eZ1KuNkwBnbIPj3JVPPnsJcUjYXFjZyzfpMV1T+ExmL/IYceatX4S7zfgL5nUg
 3qFi5mX2Cufk2EBvBu+Dkpt/H4lze+ysZRciMC+v7Q4LWAYZ8HW1a44pnBVUJMPM
 bWiJ1/O8RIWM1tWIrlO38+ZZalbu3spIVMBXKzqEGXvpUwJ4UgZM1tFiWvISTVFe
 2X6N3d7aT/DQ1PzZU6BsyVZWAFaodHBauMcr9FUkWqqGu3HOhqC4rSJZ9eRR7V5O
 WSp1gTVY1JXy
 =AVpx
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.3:

 - Fix a mostly benign overflow bug in SEV's send|receive_update_data()

 - Move the SVM-specific "host flags" into vcpu_svm (extracted from the
   vNMI enabling series)

 - A handful for fixes and cleanups
2023-02-15 12:23:06 -05:00
Rafael J. Wysocki
7e71a13353 Merge branches 'pm-cpuidle', 'pm-core' and 'pm-sleep'
Merge cpuidle updates, PM core updates and changes related to system
sleep handling for 6.3-rc1:

 - Make the TEO cpuidle governor check CPU utilization in order to refine
   idle state selection (Kajetan Puchalski).

 - Make Kconfig select the haltpoll cpuidle governor when the haltpoll
   cpuidle driver is selected and replace a default_idle() call in that
   driver with arch_cpu_idle() which allows MWAIT to be used (Li
   RongQing).

 - Add Emerald Rapids Xeon support to the intel_idle driver (Artem
   Bityutskiy).

 - Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
   avoid randconfig build failures (Arnd Bergmann).

 - Make kobj_type structures used in the cpuidle sysfs interface
   constant (Thomas Weißschuh).

 - Make the cpuidle driver registration code update microsecond values
   of idle state parameters in accordance with their nanosecond values
   if they are provided (Rafael Wysocki).

 - Make the PSCI cpuidle driver prevent topology CPUs from being
   suspended on PREEMPT_RT (Krzysztof Kozlowski).

 - Document that pm_runtime_force_suspend() cannot be used with
   DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald).

 - Add EXPORT macros for exporting PM functions from drivers (Richard
   Fitzgerald).

 - Drop "select SRCU" from system sleep Kconfig (Paul E. McKenney).

 - Remove /** from non-kernel-doc comments in hibernation code (Randy
   Dunlap).

* pm-cpuidle:
  cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
  cpuidle: driver: Update microsecond values of state parameters as needed
  cpuidle: sysfs: make kobj_type structures constant
  cpuidle: add ARCH_SUSPEND_POSSIBLE dependencies
  intel_idle: add Emerald Rapids Xeon support
  cpuidle-haltpoll: Replace default_idle() with arch_cpu_idle()
  cpuidle-haltpoll: select haltpoll governor
  cpuidle: teo: Introduce util-awareness
  cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()

* pm-core:
  PM: Add EXPORT macros for exporting PM functions
  PM: runtime: Document that force_suspend() is incompatible with SMART_SUSPEND

* pm-sleep:
  PM: sleep: Remove "select SRCU"
  PM: hibernate: swap: don't use /** for non-kernel-doc comments
2023-02-15 15:59:48 +01:00
Sean Christopherson
4b4191b8ae perf/x86: Refuse to export capabilities for hybrid PMUs
Now that KVM disables vPMU support on hybrid CPUs, WARN and return zeros
if perf_get_x86_pmu_capability() is invoked on a hybrid CPU.  The helper
doesn't provide an accurate accounting of the PMU capabilities for hybrid
CPUs and needs to be enhanced if KVM, or anything else outside of perf,
wants to act on the PMU capabilities.

Cc: stable@vger.kernel.org
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/all/20220818181530.2355034-1-kan.liang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230208204230.1360502-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-15 08:25:44 -05:00
Sean Christopherson
4d7404e5ee KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs)
Disable KVM support for virtualizing PMUs on hosts with hybrid PMUs until
KVM gains a sane way to enumeration the hybrid vPMU to userspace and/or
gains a mechanism to let userspace opt-in to the dangers of exposing a
hybrid vPMU to KVM guests.  Virtualizing a hybrid PMU, or at least part of
a hybrid PMU, is possible, but it requires careful, deliberate
configuration from userspace.

E.g. to expose full functionality, vCPUs need to be pinned to pCPUs to
prevent migrating a vCPU between a big core and a little core, userspace
must enumerate a reasonable topology to the guest, and guest CPUID must be
curated per vCPU to enumerate accurate vPMU capabilities.

The last point is especially problematic, as KVM doesn't control which
pCPU it runs on when enumerating KVM's vPMU capabilities to userspace,
i.e. userspace can't rely on KVM_GET_SUPPORTED_CPUID in it's current form.

Alternatively, userspace could enable vPMU support by enumerating the
set of features that are common and coherent across all cores, e.g. by
filtering PMU events and restricting guest capabilities.  But again, that
requires userspace to take action far beyond reflecting KVM's supported
feature set into the guest.

For now, simply disable vPMU support on hybrid CPUs to avoid inducing
seemingly random #GPs in guests, and punt support for hybrid CPUs to a
future enabling effort.

Reported-by: Jianfeng Gao <jianfeng.gao@intel.com>
Cc: stable@vger.kernel.org
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/all/20220818181530.2355034-1-kan.liang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230208204230.1360502-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-15 08:25:43 -05:00
Paolo Bonzini
157ed9cb04 KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
    userspace to heavily restrict what events the guest can use without
    needing to create an absurd number of events
 
  - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
    support is disabled
 
  - Add PEBS support for Intel SPR
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsFZ4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5eKEP/0qeZsOQot53wkf+wNiGh1X6qDacBPFP
 A8GAPC70fEisxAt776DeKEBwikHpARPglCt1Il9dFvkG+0jgYpvPu8UGF1LpouKX
 cD/7itr2k8GZlXZBg2Rgu3TRyFBJEGHT6tAu7PBhZyL6yWQDUxao8FPFrRGfmJ7O
 Z6eFMo1cERNHICQm+W/2TBd1xguiF+m4CXKlA70R4wzM37aPF9o5HvmIwAvPzyhU
 w4WzcIQbjVPs1VpBTzwPqRmyZ8omSlDYo7VqmsDiRtJbucqgbhFI2wR+nyImFCa9
 D2pI5TV3CFTt0fvd8SZpH19nR3S6cMLCXONOsijmvR2BmS3PhJMP4dMm5m4R06nF
 RBtnTj9fkbeL1ghFEkMxHBZVTG3bBlO4ySOxIqNHCvPjqQ37mJ+xP4C8kcIC9p5F
 +xL3AvZ7zenPv3A29SY9YH+QvZLBwyDJzAsveLeYkLFoJxoDT4glOY/Wpi1rkZ17
 /zHDZWoF49l1Eu3Bql0hFetkCreUNFGpa4moUmEC0evYOvV2WCb+39TDXZ8CPCGD
 +cDiRnD8MFQpBw47F03EnFheFHxiJoL0Clv5vvM3C+xOq2J9WVG9mqQWCk+4ta2B
 Um4D++0a9lwvJhOImaR7uyiV3K7oVm+rU8+46x+nTNGaIP2bnE+vronY+b6KGeUx
 7+xzTKlYygGe
 =ev5v
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.3:

 - Add support for created masked events for the PMU filter to allow
   userspace to heavily restrict what events the guest can use without
   needing to create an absurd number of events

 - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
   support is disabled

 - Add PEBS support for Intel SPR
2023-02-15 08:23:24 -05:00
Paolo Bonzini
1c5ec0d433 KVM x86 MMU changes for 6.3:
- Fix and cleanup the range-based TLB flushing code, used when KVM is
    running on Hyper-V
 
  - A few one-off cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsEC4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5lxgQAJrV3ri3UlK4kplKrrjjalHCD45H1z2Q
 fuSalDMMQUbsZI+/rMjdzwCitc9+cQXPg7aCUzWfZgMFATxWUKexoY8SrR4i5rxK
 jJLKo1AQbIdxYKLq0qHGI+6Uf8LQ4LT3UGiwDi54Ad51+5B0t525mMHBZpJ53Ung
 XVNOtA7nMUNQIV8x5iFqL5Cj/uhMz5dYGfAKlm/05G8HT7U5gWs9YsN63pUOIDGx
 HsytBV/ZWd1KYmAqTdVjW4MM+J4bi5lbqhkz5/9HRF/GypqjF+XzzzWONJgLrzr0
 pEiNW37RDyW+4PrS6UEs6jadGIMGHRLHb+9oZol7Jri34c9oeEUErhLx/BW+WstX
 czP4OyzqkCHSbymPvy9xV3leUesC7r+Tz6yHp4sI9Ppm+u/ZGpkuMoZR/bwz+LvC
 Rn3SrweH0FCgblK/foxeR+2/Vm3SNxBTKKw4nTUw0W5BtXI5suu1Po3rLMBrGKDW
 9h1ST0uVmsnMpMC5XheE2dRyXc92TqCw53DNRydA6kPkzPJsd3Va3GU9fk4qI1cG
 iVUDATViTUBfSq9jbb1PmHjGNI/Vc8XAu5TPiwqC6Q1wFXm3zTlCTrzKl8q7Hu0G
 A9Q3vhxJMCY58Rxz0yLt+WTzTctuteyTr5arRT8asfZmI4df8U7RvZb3deQ9C6TK
 jdqKUtGJK9ZX
 =7T3y
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.3:

 - Fix and cleanup the range-based TLB flushing code, used when KVM is
   running on Hyper-V

 - A few one-off cleanups
2023-02-15 08:22:44 -05:00
Arnd Bergmann
f9bb7f6a7e x86/build: Make 64-bit defconfig the default
Running 'make ARCH=x86 defconfig' on anything other than an x86_64
machine currently results in a 32-bit build, which is rarely what
anyone wants these days.

Change the default so that the 64-bit config gets used unless
the user asks for i386_defconfig, uses ARCH=i386 or runs on
a system that "uname -m" identifies as i386/i486/i586/i686.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230215091706.1623070-1-arnd@kernel.org
2023-02-15 14:20:17 +01:00
Greg Kroah-Hartman
ade1229cae dma-mapping: no need to pass a bus_type into get_arch_dma_ops()
The get_arch_dma_ops() arch-specific function never does anything with
the struct bus_type that is passed into it, so remove it entirely as it
is not needed.

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: iommu@lists.linux.dev
Cc: linux-arch@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230214140121.131859-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15 12:35:20 +01:00
Srivatsa S. Bhat (VMware)
fcb3a81d22 x86/hotplug: Remove incorrect comment about mwait_play_dead()
The comment that says mwait_play_dead() returns only on failure is a bit
misleading because mwait_play_dead() could actually return for valid
reasons (such as mwait not being supported by the platform) that do not
indicate a failure of the CPU offline operation. So, remove the comment.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230128003751.141317-1-srivatsa@csail.mit.edu
2023-02-14 23:44:34 +01:00
Linus Torvalds
82eac0c830 Certain AMD processors are vulnerable to a cross-thread return address
predictions bug. When running in SMT mode and one of the sibling threads
 transitions out of C0 state, the other thread gets access to twice as many
 entries in the RSB, but unfortunately the predictions of the now-halted
 logical processor are not purged.  Therefore, the executing processor
 could speculatively execute from locations that the now-halted processor
 had trained the RSB on.
 
 The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
 when context switching to the idle thread. However, KVM allows a VMM to
 prevent exiting guest mode when transitioning out of C0 using the
 KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this
 behavior. To mitigate the cross-thread return address predictions bug,
 a VMM must not be allowed to override the default behavior to intercept
 C0 transitions.
 
 These patches introduce a KVM module parameter that, if set, will prevent
 the user from disabling the HLT, MWAIT and CSTATE exits.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmPrvAAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMQAgf8CirL+yrngzQ+39UTFQgXj3IS1UyR
 o2mF39w3bVlQhLNf5MBHF965FUeOV/8A16x73hJGjhiiuisphtQWS/6xKR6uYbq7
 0Qi821skqN6XpRsWTWqFHMsdY+n0skr8QeXG4k/GJu7Ghb3tqs4eTGgnf2WBfI8/
 K1UgTmjd9+ikM5gKZoVLpcqZnti0gx3lM+cvZGdfrIUaXB+i+hNd2NfRTiGsTOiK
 fX7vZtLvOeje2TPoKLhzekTbh8kTU07HRWID9aVXT8bLy6Zd6tg2CHlv11noKpwv
 DFVV+RsJ1SiAQYSwT+4IvWfIG4oq4onBQ972g2a27pP2cxF+38GXzt4NQw==
 =xscg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Certain AMD processors are vulnerable to a cross-thread return address
  predictions bug. When running in SMT mode and one of the sibling
  threads transitions out of C0 state, the other thread gets access to
  twice as many entries in the RSB, but unfortunately the predictions of
  the now-halted logical processor are not purged. Therefore, the
  executing processor could speculatively execute from locations that
  the now-halted processor had trained the RSB on.

  The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
  when context switching to the idle thread. However, KVM allows a VMM
  to prevent exiting guest mode when transitioning out of C0 using the
  KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change
  this behavior. To mitigate the cross-thread return address predictions
  bug, a VMM must not be allowed to override the default behavior to
  intercept C0 transitions.

  These patches introduce a KVM module parameter that, if set, will
  prevent the user from disabling the HLT, MWAIT and CSTATE exits"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  Documentation/hw-vuln: Add documentation for Cross-Thread Return Predictions
  KVM: x86: Mitigate the cross-thread return address predictions bug
  x86/speculation: Identify processors vulnerable to SMT RSB predictions
2023-02-14 09:17:01 -08:00
Juergen Gross
f9f57da2c2 x86/mtrr: Revert 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Commit

  90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")

broke the use case of running Xen dom0 kernels on machines with an
external disk enclosure attached via USB, see Link tag.

What this commit was originally fixing - SEV-SNP guests on Hyper-V - is
a more specialized situation which has other issues at the moment anyway
so reverting this now and addressing the issue properly later is the
prudent thing to do.

So revert it in time for the 6.2 proper release.

  [ bp: Rewrite commit message. ]

Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de
2023-02-14 10:16:34 +01:00
Taehee Yoo
8b84475318 crypto: x86/aria-avx - Do not use avx2 instructions
vpbroadcastb and vpbroadcastd are not AVX instructions.
But the aria-avx assembly code contains these instructions.
So, kernel panic will occur if the aria-avx works on AVX2 unsupported
CPU.

vbroadcastss, and vpshufb are used to avoid using vpbroadcastb in it.
Unfortunately, this change reduces performance by about 5%.
Also, vpbroadcastd is simply replaced by vmovdqa in it.

Fixes: ba3579e6e45c ("crypto: aria-avx - add AES-NI/AVX/x86_64/GFNI assembler implementation of aria cipher")
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-02-14 13:39:33 +08:00
Oliver Upton
022d3f0800 Merge branch kvm-arm64/misc into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous updates
  :
  :  - Convert CPACR_EL1_TTA to the new, generated system register
  :    definitions.
  :
  :  - Serialize toggling CPACR_EL1.SMEN to avoid unexpected exceptions when
  :    accessing SVCR in the host.
  :
  :  - Avoid quiescing the guest if a vCPU accesses its own redistributor's
  :    SGIs/PPIs, eliminating the need to IPI. Largely an optimization for
  :    nested virtualization, as the L1 accesses the affected registers
  :    rather often.
  :
  :  - Conversion to kstrtobool()
  :
  :  - Common definition of INVALID_GPA across architectures
  :
  :  - Enable CONFIG_USERFAULTFD for CI runs of KVM selftests
  KVM: arm64: Fix non-kerneldoc comments
  KVM: selftests: Enable USERFAULTFD
  KVM: selftests: Remove redundant setbuf()
  arm64/sysreg: clean up some inconsistent indenting
  KVM: MMU: Make the definition of 'INVALID_GPA' common
  KVM: arm64: vgic-v3: Use kstrtobool() instead of strtobool()
  KVM: arm64: vgic-v3: Limit IPI-ing when accessing GICR_{C,S}ACTIVER0
  KVM: arm64: Synchronize SMEN on vcpu schedule out
  KVM: arm64: Kill CPACR_EL1_TTA definition

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-02-13 23:33:25 +00:00
Oliver Upton
92425e058a Merge branch kvm/kvm-hw-enable-refactor into kvmarm/next
Merge the kvm_init() + hardware enable rework to avoid conflicts
with kvmarm.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-02-13 22:28:34 +00:00
Mike Rapoport
0e4f2c4567 char/agp: consolidate {alloc,free}_gatt_pages()
There is a copy of alloc_gatt_pages() and free_gatt_pages in several
architectures in arch/$ARCH/include/asm/agp.h. All the copies do exactly
the same: alias alloc_gatt_pages() to __get_free_pages(GFP_KERNEL) and
alias free_gatt_pages() to free_pages().

Define alloc_gatt_pages() and free_gatt_pages() in drivers/char/agp/agp.h
and drop per-architecture definitions.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 22:13:12 +01:00
Johan Hovold
a14e7fdd43 x86/uv: Use irq_domain_create_hierarchy()
Use the irq_domain_create_hierarchy() helper to create the hierarchical
domain, which both serves as documentation and avoids poking at
irqdomain internals.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-14-johan+linaro@kernel.org
2023-02-13 19:31:24 +00:00
Johan Hovold
bc1bc1b309 x86/ioapic: Use irq_domain_create_hierarchy()
Use the irq_domain_create_hierarchy() helper to create the hierarchical
domain, which both serves as documentation and avoids poking at
irqdomain internals.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-13-johan+linaro@kernel.org
2023-02-13 19:31:24 +00:00
Thomas Gleixner
ab407a1919 Clocksource watchdog commits for v6.3
This pull request contains the following:
 
 o	Improvements to clocksource-watchdog console messages.
 
 o	Loosening of the clocksource-watchdog skew criteria to match
 	those of NTP (500 parts per million, relaxed from 400 parts
 	per million).  If it is good enough for NTP, it is good enough
 	for the clocksource watchdog.
 
 o	Suspend clocksource-watchdog checking temporarily when high
 	memory latencies are detected.	This avoids the false-positive
 	clock-skew events that have been seen on production systems
 	running memory-intensive workloads.
 
 o	On systems where the TSC is deemed trustworthy, use it as the
 	watchdog timesource, but only when specifically requested using
 	the tsc=watchdog kernel boot parameter.  This permits clock-skew
 	events to be detected, but avoids forcing workloads to use the
 	slow HPET and ACPI PM timers.  These last two timers are slow
 	enough to cause systems to be needlessly marked bad on the one
 	hand, and real skew does sometimes happen on production systems
 	running production workloads on the other.  And sometimes it is
 	the fault of the TSC, or at least of the firmware that told the
 	kernel to program the TSC with the wrong frequency.
 
 o	Add a tsc=revalidate kernel boot parameter to allow the kernel
 	to diagnose cases where the TSC hardware works fine, but was told
 	by firmware to tick at the wrong frequency.  Such cases are rare,
 	but they really have happened on production systems.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPhnhkTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jClDD/9gTo62MakVQz2wzBRBcWunzX4BAfy2
 2ORqZYqq8cJ4ccFVWtSq7gZ+0bxiT+J4jaVyJpmUPzaiCSfNUT+GLjWyLGzF9Xq+
 xLWpFJOhFhKYjYN2m1ottuQ81V7aTlorC8AJt/o+oCJFGUCb/heg/UrmoZ6DweHw
 H7uXS9yenKdKgYoMENW+8IVsy16sT4D5Fe8XAD/2J6vBBUbgBzKWhi8XSgSHB/Xw
 GCP4UfXVGl5QRG9Xu4ZgrFV1t4azxtmdBghFm7/Kep/j6ttSY78yoS43AbI57bhD
 fWB5mfAQvO+Zo5/9rLjcDzeZCp/PSdARD41aycPMiei08K278tIN9T/fmfSoG6rV
 lVRdFxTHrQcqc9d+g+mGASQBezCF8pxonm9HYLBpNjyfYHnKV70SPXywO4oqAJ1I
 7dCm+uv3Y8KaJdVnPUWOHJjvQLx9NWK5/pXBYjsYnLR+69EVmGDgPZ+/ulQxkWBj
 DtrQgs+sHQ8gngNpAilxuu/lrUXzrC8N4mtxXKBFQoCPYQMFBkr9S+aAEHIgZT9H
 1dWwR1QxeR5uxt7U+3DmTyJ1XKfYjDyyScesILlLMLbdKgZtTS5wGaK4QdJ3QW2z
 z4zqPDccWDDZKZy9W4QBnFBx6Rn49C8xThy7f6Loc+2cKAT10hrEmRJsn79AOCDc
 6hV0S2U9a6ypQg==
 =OWY2
 -----END PGP SIGNATURE-----

Merge tag 'clocksource.2023.02.06b' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into timers/core

Pull clocksource watchdog changes from Paul McKenney:

     o	Improvements to clocksource-watchdog console messages.

     o	Loosening of the clocksource-watchdog skew criteria to match
     	those of NTP (500 parts per million, relaxed from 400 parts
     	per million).  If it is good enough for NTP, it is good enough
     	for the clocksource watchdog.

     o	Suspend clocksource-watchdog checking temporarily when high
     	memory latencies are detected.	This avoids the false-positive
     	clock-skew events that have been seen on production systems
     	running memory-intensive workloads.

     o	On systems where the TSC is deemed trustworthy, use it as the
     	watchdog timesource, but only when specifically requested using
     	the tsc=watchdog kernel boot parameter.  This permits clock-skew
     	events to be detected, but avoids forcing workloads to use the
     	slow HPET and ACPI PM timers.  These last two timers are slow
     	enough to cause systems to be needlessly marked bad on the one
     	hand, and real skew does sometimes happen on production systems
     	running production workloads on the other.  And sometimes it is
     	the fault of the TSC, or at least of the firmware that told the
     	kernel to program the TSC with the wrong frequency.

     o	Add a tsc=revalidate kernel boot parameter to allow the kernel
     	to diagnose cases where the TSC hardware works fine, but was told
     	by firmware to tick at the wrong frequency.  Such cases are rare,
     	but they really have happened on production systems.

Link: https://lore.kernel.org/r/20230210193640.GA3325193@paulmck-ThinkPad-P17-Gen-1
2023-02-13 19:28:48 +01:00
Peter Foley
83e913f52a um: Support LTO
Only a handful of changes are necessary to get it to work.

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-02-13 10:14:31 +01:00
Krister Johansen
caea091e48 x86/xen/time: prefer tsc as clocksource when it is invariant
Kvm elects to use tsc instead of kvm-clock when it can detect that the
TSC is invariant.

(As of commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
invariant TSC is exposed")).

Notable cloud vendors[1] and performance engineers[2] recommend that Xen
users preferentially select tsc over xen-clocksource due the performance
penalty incurred by the latter.  These articles are persuasive and
tailored to specific use cases.  In order to understand the tradeoffs
around this choice more fully, this author had to reference the
documented[3] complexities around the Xen configuration, as well as the
kernel's clocksource selection algorithm.  Many users may not attempt
this to correctly configure the right clock source in their guest.

The approach taken in the kvm-clock module spares users this confusion,
where possible.

Both the Intel SDM[4] and the Xen tsc documentation explain that marking
a tsc as invariant means that it should be considered stable by the OS
and is elibile to be used as a wall clock source.

In order to obtain better out-of-the-box performance, and reduce the
need for user tuning, follow kvm's approach and decrease the xen clock
rating so that tsc is preferable, if it is invariant, stable, and the
tsc will never be emulated.

[1] https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
[2] https://www.brendangregg.com/blog/2021-09-26/the-speed-of-time.html
[3] https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html
[4] Intel 64 and IA-32 Architectures Sofware Developer's Manual Volume
    3b: System Programming Guide, Part 2, Section 17.17.1, Invariant TSC

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Code-reviewed-by: David Reaver <me@davidreaver.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20221216162118.GB2633@templeofstupid.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13 06:53:20 +01:00
Juergen Gross
f697cb00af x86/xen: mark xen_pv_play_dead() as __noreturn
Mark xen_pv_play_dead() and related to that xen_cpu_bringup_again()
as "__noreturn".

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221125063248.30256-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13 06:53:19 +01:00
Juergen Gross
336f560a89 x86/xen: don't let xen_pv_play_dead() return
A function called via the paravirt play_dead() hook should not return
to the caller.

xen_pv_play_dead() has a problem in this regard, as it currently will
return in case an offlined cpu is brought to life again. This can be
changed only by doing basically a longjmp() to cpu_bringup_and_idle(),
as the hypercall for bringing down the cpu will just return when the
cpu is coming up again. Just re-initializing the cpu isn't possible,
as the Xen hypervisor will deny that operation.

So introduce xen_cpu_bringup_again() resetting the stack and calling
cpu_bringup_and_idle(), which can be called after HYPERVISOR_vcpu_op()
in xen_pv_play_dead().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221125063248.30256-2-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13 06:53:19 +01:00
Kan Liang
c828441f21 perf/x86/intel/uncore: Add Meteor Lake support
The uncore subsystem for Meteor Lake is similar to the previous Alder
Lake. The main difference is that MTL provides PMU support for different
tiles, while ADL only provides PMU support for the whole package. On
ADL, there are CBOX, ARB, and clockbox uncore PMON units. On MTL, they
are split into CBOX/HAC_CBOX, ARB/HAC_ARB, and cncu/sncu which provides
a fixed counter for clockticks. Also, new MSR addresses are introduced
on MTL.

The IMC uncore PMON is the same as Alder Lake. Add new PCIIDs of IMC for
Meteor Lake.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230210190238.1726237-1-kan.liang@linux.intel.com
2023-02-11 12:39:03 +01:00
Josh Poimboeuf
37064583f6 x86/entry: Fix unwinding from kprobe on PUSH/POP instruction
If a kprobe (INT3) is set on a stack-modifying single-byte instruction,
like a single-byte PUSH/POP or a LEAVE, ORC fails to unwind past it:

  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x90
   handler_pre+0x33/0x40 [kprobe_example]
   aggr_pre_handler+0x49/0x90
   kprobe_int3_handler+0xe3/0x180
   do_int3+0x3a/0x80
   exc_int3+0x7d/0xc0
   asm_exc_int3+0x35/0x40
  RIP: 0010:kernel_clone+0xe/0x3a0
  Code: cc e8 16 b2 bf 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 cc <53> 48 89 fb 48 83 ec 68 4c 8b 27 65 48 8b 04 25 28 00 00 00 48 89
  RSP: 0018:ffffc9000074fda0 EFLAGS: 00000206
  RAX: 0000000000808100 RBX: ffff888109de9d80 RCX: 0000000000000000
  RDX: 0000000000000011 RSI: ffff888109de9d80 RDI: ffffc9000074fdc8
  RBP: ffff8881019543c0 R08: ffffffff81127e30 R09: 00000000e71742a5
  R10: ffff888104764a18 R11: 0000000071742a5e R12: ffff888100078800
  R13: ffff888100126000 R14: 0000000000000000 R15: ffff888100126005
   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
   ? kernel_clone+0xe/0x3a0
   ? user_mode_thread+0x5b/0x80
   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
   ? call_usermodehelper_exec_work+0x77/0xb0
   ? process_one_work+0x299/0x5f0
   ? worker_thread+0x4f/0x3a0
   ? __pfx_worker_thread+0x10/0x10
   ? kthread+0xf2/0x120
   ? __pfx_kthread+0x10/0x10
   ? ret_from_fork+0x29/0x50
   </TASK>

The problem is that #BP saves the pointer to the instruction immediately
*after* the INT3, rather than to the INT3 itself.  The instruction
replaced by the INT3 hasn't actually run, but ORC assumes otherwise and
expects the wrong stack layout.

Fix it by annotating the #BP exception as a non-signal stack frame,
which tells the ORC unwinder to decrement the instruction pointer before
looking up the corresponding ORC entry.

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/baafcd3cc1abb14cb757fe081fa696012a5265ee.1676068346.git.jpoimboe@kernel.org
2023-02-11 12:37:51 +01:00
Josh Poimboeuf
ffb1b4a410 x86/unwind/orc: Add 'signal' field to ORC metadata
Add a 'signal' field which allows unwind hints to specify whether the
instruction pointer should be taken literally (like for most interrupts
and exceptions) rather than decremented (like for call stack return
addresses) when used to find the next ORC entry.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/d2c5ec4d83a45b513d8fd72fab59f1a8cfa46871.1676068346.git.jpoimboe@kernel.org
2023-02-11 12:37:51 +01:00
silviazhao
fd636b6a9b x86/perf/zhaoxin: Add stepping check for ZXC
Some of Nano series processors will lead GP when accessing
PMC fixed counter. Meanwhile, their hardware support for PMC
has not announced externally. So exclude Nano CPUs from ZXC
by checking stepping information. This is an unambiguous way
to differentiate between ZXC and Nano CPUs.

Following are Nano and ZXC FMS information:
Nano FMS: Family=6, Model=F, Stepping=[0-A][C-D]
ZXC FMS:  Family=6, Model=F, Stepping=E-F OR
          Family=6, Model=0x19, Stepping=0-3

Fixes: 3a4ac121c2ca ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")

Reported-by: Arjan <8vvbbqzo567a@nospam.xutrox.com>
Reported-by: Kevin Brace <kevinbrace@gmx.com>
Signed-off-by: silviazhao <silviazhao-oc@zhaoxin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212389
2023-02-11 11:18:12 +01:00
Kan Liang
89e97eb8ce perf/x86/intel/ds: Fix the conversion from TSC to perf time
The time order is incorrect when the TSC in a PEBS record is used.

 $perf record -e cycles:upp dd if=/dev/zero of=/dev/null
  count=10000
 $ perf script --show-task-events
       perf-exec     0     0.000000: PERF_RECORD_COMM: perf-exec:915/915
              dd   915   106.479872: PERF_RECORD_COMM exec: dd:915/915
              dd   915   106.483270: PERF_RECORD_EXIT(915:915):(914:914)
              dd   915   106.512429:          1 cycles:upp:
 ffffffff96c011b7 [unknown] ([unknown])
 ... ...

The perf time is from sched_clock_cpu(). The current PEBS code
unconditionally convert the TSC to native_sched_clock(). There is a
shift between the two clocks. If the TSC is stable, the shift is
consistent, __sched_clock_offset. If the TSC is unstable, the shift has
to be calculated at runtime.

This patch doesn't support the conversion when the TSC is unstable. The
TSC unstable case is a corner case and very unlikely to happen. If it
happens, the TSC in a PEBS record will be dropped and fall back to
perf_event_clock().

Fixes: 47a3aeb39e8d ("perf/x86/intel/pebs: Fix PEBS timestamps overwritten")
Reported-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/CAM9d7cgWDVAq8-11RbJ2uGfwkKD6fA-OMwOKDrNUrU_=8MgEjg@mail.gmail.com/
2023-02-11 11:18:12 +01:00
Borislav Petkov (AMD)
6b8d5dde5b x86/tsc: Do feature check as the very first thing
Do the feature check as the very first thing in the function. Everything
else comes after that and is meaningless work if the TSC CPUID bit is
not even set. Switch to cpu_feature_enabled() too, while at it.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y5990CUCuWd5jfBH@zn.tnic
2023-02-11 10:44:07 +01:00
Borislav Petkov (AMD)
8fe6d84947 x86/tsc: Make recalibrate_cpu_khz() export GPL only
A quick search doesn't reveal any use outside of the kernel - which
would be questionable to begin with anyway - so make the export GPL
only.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y599miBzWRAuOwhg@zn.tnic
2023-02-11 10:44:07 +01:00
Borislav Petkov (AMD)
851026a2bf x86/cacheinfo: Remove unused trace variable
15cd8812ab2c ("x86: Remove the CPU cache size printk's") removed the
last use of the trace local var. Remove it too and the useless trace
cache case.

No functional changes.

Reported-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230210234541.9694-1-bp@alien8.de
Link: http://lore.kernel.org/r/20220705073349.1512-1-jiapeng.chong@linux.alibaba.com
2023-02-11 10:43:31 +01:00
Ammar Faizi
5541992e51 x86: um: vdso: Add '%rcx' and '%r11' to the syscall clobber list
The 'syscall' instruction clobbers '%rcx' and '%r11', but they are not
listed in the inline Assembly that performs the syscall instruction.

No real bug is found. It wasn't buggy by luck because '%rcx' and '%r11'
are caller-saved registers, and not used in the functions, and the
functions are never inlined.

Add them to the clobber list for code correctness.

Fixes: f1c2bb8b9964ed31de988910f8b1cfb586d30091 ("um: implement a x86_64 vDSO")
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-02-10 21:33:37 +01:00
David Gow
8849818679 rust: arch/um: Disable FP/SIMD instruction to match x86
The kernel disables all SSE and similar FP/SIMD instructions on
x86-based architectures (partly because we shouldn't be using floats in
the kernel, and partly to avoid the need for stack alignment, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383 )

UML does not do the same thing, which isn't in itself a problem, but
does add to the list of differences between UML and "normal" x86 builds.

In addition, there was a crash bug with LLVM < 15 / rustc < 1.65 when
building with SSE, so disabling it fixes rust builds with earlier
compiler versions, see:
https://github.com/Rust-for-Linux/linux/pull/881

Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Sergio González Collado <sergio.collado@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-02-10 21:28:25 +01:00
Ard Biesheuvel
45d5165426 efi: Add mixed-mode thunk recipe for GetMemoryAttributes
EFI mixed mode on x86 requires a recipe for each protocol method or
firmware service that takes u64 arguments by value, or returns pointer
or 'native int' (UINTN) values by reference (e.g,, through a void ** or
unsigned long * parameter), due to the fact that these types cannot be
translated 1:1 between the i386 and MS x64 calling conventions.

So add the missing recipe for GetMemoryAttributes, which is not actually
being used yet on x86, but the code exists and can be built for x86 so
let's make sure it works as it should.

Cc: Evgeniy Baskov <baskov@ispras.ru>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-02-10 15:20:55 +01:00
Tom Lendacky
6f0f2d5ef8 KVM: x86: Mitigate the cross-thread return address predictions bug
By default, KVM/SVM will intercept attempts by the guest to transition
out of C0. However, the KVM_CAP_X86_DISABLE_EXITS capability can be used
by a VMM to change this behavior. To mitigate the cross-thread return
address predictions bug (X86_BUG_SMT_RSB), a VMM must not be allowed to
override the default behavior to intercept C0 transitions.

Use a module parameter to control the mitigation on processors that are
vulnerable to X86_BUG_SMT_RSB. If the processor is vulnerable to the
X86_BUG_SMT_RSB bug and the module parameter is set to mitigate the bug,
KVM will not allow the disabling of the HLT, MWAIT and CSTATE exits.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <4019348b5e07148eb4d593380a5f6713b93c9a16.1675956146.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-10 07:27:37 -05:00