12269 Commits

Author SHA1 Message Date
Linus Torvalds
941d77c773 - Compute the purposeful misalignment of zen_untrain_ret automatically
and assert __x86_return_thunk's alignment so that future changes to
   the symbol macros do not accidentally break them.
 
 - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
   pointless
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ1wgACgkQEsHwGGHe
 VUrXlRAAhIonFM1suIHo6w085jY5YA1XnsziJr/bT3e16FdHrF1i3RBEX4ml0m3O
 ADwa9dMsC9UJIa+/TKRNFfQvfRcLE/rsUKlS1Rluf/IRIxuSt/Oa4bFQHGXFRwnV
 eSlnWTNiaWrRs/vJEYAnMOe98oRyElHWa9kZ7K5FC+Ksfn/WO1U1RQ2NWg2A2wkN
 8MHJiS41w2piOrLU/nfUoI7+esHgHNlib222LoptDGHuaY8V2kBugFooxAEnTwS3
 PCzWUqCTgahs393vbx6JimoIqgJDa7bVdUMB0kOUHxtpbBiNdYYVy6e7UKnV1yjB
 qP3v9jQW4+xIyRmlFiErJXEZx7DjAIP5nulGRrUMzRfWEGF8mdRZ+ugGqFMHCeC8
 vXI+Ixp2vvsfhG3N/algsJUdkjlpt3hBpElRZCfR08M253KAbAmUNMOr4sx4RPi5
 ymC+pLIHd1K0G9jiZaFnOMaY71gAzWizwxwjFKLQMo44q+lpNJvsVO00cr+9RBYj
 LQL2APkONVEzHPMYR/LrXCslYaW//DrfLdRQjNbzUTonxFadkTO2Eu8J90B/5SFZ
 CqC1NYKMQPVFeg4XuGWCgZEH+jokCGhl8vvmXClAMcOEOZt0/s4H89EKFkmziyon
 L1ZrA/U72gWV8EwD7GLtuFJmnV4Ayl/hlek2j0qNKaj6UUgTFg8=
 =LcUq
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Compute the purposeful misalignment of zen_untrain_ret automatically
   and assert __x86_return_thunk's alignment so that future changes to
   the symbol macros do not accidentally break them.

 - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
   pointless

* tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retbleed: Add __x86_return_thunk alignment checks
  x86/cpu: Remove X86_FEATURE_NAMES
  x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
2023-06-26 15:42:34 -07:00
Linus Torvalds
2c96136a3f - Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using it
   and thus preventing a whole set of attacks against such guests like
   memory replay and the like.
 
   There are a couple of strategies of how memory should be accepted
   - the current implementation does an on-demand way of accepting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ0f4ACgkQEsHwGGHe
 VUpasw//RKoNW9HSU1csY+XnG9uuaT6QKgji+gIEZWWIGPO9iibvbBj6P5WxJE8T
 fe7yb6CGa6d6thoU0v+mQGVVvCd7OjCFwPD5wAo4mXToD7Ig+4mI6jMkaKifqa2f
 N1Uuy8u/zQnGyWrP5Y//WH5bJYfsmds4UGwXI2nLvKlhE7MG90/ePjt7iqnnwZsy
 waLp6a0Q1VeOvnfRszFLHZw/SoER5RSJ4qeVqttkFNmPPEKMK1Kirrl2poR56OQJ
 nMr6LqVtD7erlSJ36VRXOKzLI443A4iIEIg/wBjIOU6L5ZEWJGNqtCDnIqFJ6+TM
 XatsejfRYkkMZH0qXtX9+M0u+HJHbZPCH5rEcA21P3Nbd7od/ANq91qCGoMjtUZ4
 7pZohMG8M6IDvkLiOb8fQVkR5k/9Jbk8UvdN/8jdPx1ERxYMFO3BDvJpV2gzrW4B
 KYtFTPR7j2nY3eKfDpe3flanqYzKUBsKoTlLnlH7UHaiMZ2idwG8AQjlrhC/erCq
 /Lq1LXt4Mq46FyHABc+PSHytu0WWj1nBUftRt+lviY/Uv7TlkBldOTT7wm7itsfF
 HUCTfLWl0CJXKPq8rbbZhAG/exN6Ay6MO3E3OcNq8A72E5y4cXenuG3ic/0tUuOu
 FfjpiMk35qE2Qb4hnj1YtF3XINtd1MpKcuwzGSzEdv9s3J7hrS0=
 =FS95
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing update from Borislav Petkov:

 - Add support for unaccepted memory as specified in the UEFI spec v2.9.

   The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using
   it and thus preventing a whole set of attacks against such guests
   like memory replay and the like.

   There are a couple of strategies of how memory should be accepted -
   the current implementation does an on-demand way of accepting.

* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  virt: sevguest: Add CONFIG_CRYPTO dependency
  x86/efi: Safely enable unaccepted memory in UEFI
  x86/sev: Add SNP-specific unaccepted memory support
  x86/sev: Use large PSC requests if applicable
  x86/sev: Allow for use of the early boot GHCB for PSC requests
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  x86/sev: Fix calculation of end address based on number of pages
  x86/tdx: Add unaccepted memory support
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  efi: Add unaccepted memory support
  x86/boot/compressed: Handle unaccepted memory
  efi/libstub: Implement support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  mm: Add support for unaccepted memory
2023-06-26 15:32:39 -07:00
Linus Torvalds
8c69e7afe9 - Up until now the Fast Short Rep Mov optimizations implied the presence
of the ERMS CPUID flag. AMD decoupled them with a BIOS setting so decouple
   that dependency in the kernel code too
 
 - Teach the alternatives machinery to handle relocations
 
 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in
 
 - Other fixes, cleanups and optimizations to the patching code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZi2AACgkQEsHwGGHe
 VUqhGw/9EC/m5HTFBlCy9PS5Qy6pPLzmHR5Tuy4meqlnB1gN+5wzfxdYEwHm46hH
 SR6WqR12yVaCMIzh66y8nTJyMbIykaBbfFJb3WesdDrBIYUZ9f+7O+Xd0JS6Jykd
 2HBHOyaVS1/W75+y6w9JhTExBH5xieCpJVIYyAvifbn/pB8XmuTTwJ1Z3EJ8DzkK
 AN16i46bUiKNBdTYZUMhtKL4vHVfqLYMskgWe6IG7DmRLOwikR0uRVhuVqP/bmUj
 U128cUacGJT2AYbZarTAKmOa42nDj3TpJqRp1qit3y6Cun4vxKH+1A91UPd7IHTa
 M5H1bNSgfXMm8rU+JgfvXKqrCTckGn2OqlCkJfPV3RBeP9IcQBBF0vE3dnM/X2We
 dwbXeDfJvc+1s4/M41MOhyahTUbW+4iRK5UCZEt1mprTbtzHTlN7RROo7QLpFsWx
 T0Jqvsd1raAutPTgTjU7ToQwDpSQNnn4Y/KoEdpvOCXR8wU7Wo5/+Qa4tEkIY3W6
 mUFpJcgFC9QEKLuaNAofPIhMuZ/vzRVtpK7wbLn4KR5JZA8AxznenMFVg8YPWRFI
 4oga0kMFJ7t6z/CXHtrxFaLQ9e7WAUSRU6gPiz8As1F/K9N0JWMUfjuTJcgjUsF8
 bwdCNinwG8y3rrPUCrqbO5N766ZkLYd6NksKlmIyUvtCcS0ksbg=
 =mH38
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 instruction alternatives updates from Borislav Petkov:

 - Up until now the Fast Short Rep Mov optimizations implied the
   presence of the ERMS CPUID flag. AMD decoupled them with a BIOS
   setting so decouple that dependency in the kernel code too

 - Teach the alternatives machinery to handle relocations

 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in

 - Other fixes, cleanups and optimizations to the patching code

* tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternative: PAUSE is not a NOP
  x86/alternatives: Add cond_resched() to text_poke_bp_batch()
  x86/nospec: Shorten RESET_CALL_DEPTH
  x86/alternatives: Add longer 64-bit NOPs
  x86/alternatives: Fix section mismatch warnings
  x86/alternative: Optimize returns patching
  x86/alternative: Complicate optimize_nops() some more
  x86/alternative: Rewrite optimize_nops() some
  x86/lib/memmove: Decouple ERMS from FSRM
  x86/alternative: Support relocations in alternatives
  x86/alternative: Make debug-alternative selective
2023-06-26 15:14:55 -07:00
Linus Torvalds
88afbb21d4 A set of fixes for kexec(), reboot and shutdown issues
- Ensure that the WBINVD in stop_this_cpu() has been completed before the
    control CPU proceedes.
 
    stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs
    in a HLT loop.
 
    The control CPU sends an IPI to the APs and waits for their CPU online bits
    to be cleared. Once they all are marked "offline" it proceeds.
 
    But stop_this_cpu() clears the CPU online bit before issuing WBINVD,
    which means there is no guarantee that the AP has reached the HLT loop.
 
    This was reported to cause intermittent reboot/shutdown failures due to
    some dubious interaction with the firmware.
 
    This is not only a problem of WBINVD. The code to actually "stop" the
    CPU which runs between clearing the online bit and reaching the HLT loop
    can cause large enough delays on its own (think virtualization). That's
    especially dangerous for kexec() as kexec() expects that all APs are in
    a safe state and not executing code while the boot CPU jumps to the new
    kernel. There are more issues vs. kexec() which are addressed separately.
 
    Cure this by implementing an explicit synchronization point right before
    the AP reaches HLT. This guarantees that the AP has completed the full
    stop proceedure.
 
  - Fix the condition for WBINVD in stop_this_cpu().
 
    The WBINVD in stop_this_cpu() is required for ensuring that when
    switching to or from memory encryption no dirty data is left in the
    cache lines which might cause a write back in the wrong more later.
 
    This checks CPUID directly because the feature bit might have been
    cleared due to a command line option.
 
    But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel
    CPUs return the content of the highest supported leaf when a non-existing
    leaf is read, while AMD CPUs return all zeros for unsupported leafs.
 
    So the result of the test on Intel CPUs is lottery and on AMD its just
    correct by chance.
 
    While harmless it's incorrect and causes the conditional wbinvd() to be
    issued where not required, which caused the above issue to be unearthed.
 
  - Make kexec() robust against AP code execution
 
    Ashok observed triple faults when doing kexec() on a system which had
    been booted with "nosmt".
 
    It turned out that the SMT siblings which had been brought up partially
    are parked in mwait_play_dead() to enable power savings.
 
    mwait_play_dead() is monitoring the thread flags of the AP's idle task,
    which has been chosen as it's unlikely to be written to.
 
    But kexec() can overwrite the previous kernel text and data including
    page tables etc. When it overwrites the cache lines monitored by an AP
    that AP resumes execution after the MWAIT on eventually overwritten
    text, stack and page tables, which obviously might end up in a triple
    fault easily.
 
    Make this more robust in several steps:
 
     1) Use an explicit per CPU cache line for monitoring.
 
     2) Write a command to these cache lines to kick APs out of MWAIT before
        proceeding with kexec(), shutdown or reboot.
 
        The APs confirm the wakeup by writing status back and then enter a
        HLT loop.
 
     3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs
        in INIT state.
 
        HLT is not a guarantee that an AP won't wake up and resume
        execution. HLT is woken up by NMI and SMI. SMI puts the CPU back
        into HLT (+/- firmware bugs), but NMI is delivered to the CPU which
        executes the NMI handler. Same issue as the MWAIT scenario described
        above.
 
        Sending an INIT/INIT sequence to the APs puts them into wait for
        STARTUP state, which is safe against NMI.
 
     There is still an issue remaining which can't be fixed: #MCE
 
     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.
 
     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to
     shut down the machine.
 
     So there is a choice between fire (HLT) and frying pan (INIT). Frying
     pan has been chosen as it's at least preventing the NMI issue.
 
     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZfpQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeZpD/9gSJN2qtGqoOgE8bWAenEeqppmBGFE
 EAhuhsvN1qG9JosUFo4KzxsGD/aWt2P6XglBDrGti8mFNol67jutmwWklntL3/ZR
 m8D6D+Pl7/CaDgACDTDbrnVC3lOGyMhD301yJrnBigS/SEoHeHI9UtadbHukuLQj
 TlKt5KtAnap15bE6QL846cDIptB9SjYLLPULo3i4azXEis/l6eAkffwAR6dmKlBh
 2RbhLK1xPPG9nqWYjqZXnex09acKwD9xY9xHj4+GampV4UqHJRWfW0YtFs5ENi01
 r3FVCdKEcvMkUw0zh0IAviBRs2vCI/R3YSfEc7P0264yn5WzMhAT+OGCovNjByiW
 sB4Iqa+Yf6aoBWwux6W4d22xu7uYhmFk/jiLyRZJPW/gvGZCZATT/x/T2hRoaYA8
 3S0Rs7n/gbfvynQETgniifuM0bXRW0lEJAmn840GwyVQwlpDEPBJSwW4El49kbkc
 +dHxnmpMCfnBxfVLS1YDd4WOmkWBeECNcW330FShlQQ8mM3UG31+Q8Jc55Ze9SW0
 w1h+IgIOHlA0DpQUUM8DJTSuxFx2piQsZxjOtzd70+BiKZpCsHqVLIp4qfnf+/GO
 gyP0cCQLbafpABbV9uVy8A/qgUGi0Qii0GJfCTy0OdmU+JX3C2C/gsM3uN0g3qAj
 vUhkuCXEGL5k1w==
 =KgZ0
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Thomas Gleixner:
 "A set of fixes for kexec(), reboot and shutdown issues:

   - Ensure that the WBINVD in stop_this_cpu() has been completed before
     the control CPU proceedes.

     stop_this_cpu() is used for kexec(), reboot and shutdown to park
     the APs in a HLT loop.

     The control CPU sends an IPI to the APs and waits for their CPU
     online bits to be cleared. Once they all are marked "offline" it
     proceeds.

     But stop_this_cpu() clears the CPU online bit before issuing
     WBINVD, which means there is no guarantee that the AP has reached
     the HLT loop.

     This was reported to cause intermittent reboot/shutdown failures
     due to some dubious interaction with the firmware.

     This is not only a problem of WBINVD. The code to actually "stop"
     the CPU which runs between clearing the online bit and reaching the
     HLT loop can cause large enough delays on its own (think
     virtualization). That's especially dangerous for kexec() as kexec()
     expects that all APs are in a safe state and not executing code
     while the boot CPU jumps to the new kernel. There are more issues
     vs kexec() which are addressed separately.

     Cure this by implementing an explicit synchronization point right
     before the AP reaches HLT. This guarantees that the AP has
     completed the full stop proceedure.

   - Fix the condition for WBINVD in stop_this_cpu().

     The WBINVD in stop_this_cpu() is required for ensuring that when
     switching to or from memory encryption no dirty data is left in the
     cache lines which might cause a write back in the wrong more later.

     This checks CPUID directly because the feature bit might have been
     cleared due to a command line option.

     But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
     Intel CPUs return the content of the highest supported leaf when a
     non-existing leaf is read, while AMD CPUs return all zeros for
     unsupported leafs.

     So the result of the test on Intel CPUs is lottery and on AMD its
     just correct by chance.

     While harmless it's incorrect and causes the conditional wbinvd()
     to be issued where not required, which caused the above issue to be
     unearthed.

   - Make kexec() robust against AP code execution

     Ashok observed triple faults when doing kexec() on a system which
     had been booted with "nosmt".

     It turned out that the SMT siblings which had been brought up
     partially are parked in mwait_play_dead() to enable power savings.

     mwait_play_dead() is monitoring the thread flags of the AP's idle
     task, which has been chosen as it's unlikely to be written to.

     But kexec() can overwrite the previous kernel text and data
     including page tables etc. When it overwrites the cache lines
     monitored by an AP that AP resumes execution after the MWAIT on
     eventually overwritten text, stack and page tables, which obviously
     might end up in a triple fault easily.

     Make this more robust in several steps:

      1) Use an explicit per CPU cache line for monitoring.

      2) Write a command to these cache lines to kick APs out of MWAIT
         before proceeding with kexec(), shutdown or reboot.

         The APs confirm the wakeup by writing status back and then
         enter a HLT loop.

      3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
         APs in INIT state.

         HLT is not a guarantee that an AP won't wake up and resume
         execution. HLT is woken up by NMI and SMI. SMI puts the CPU
         back into HLT (+/- firmware bugs), but NMI is delivered to the
         CPU which executes the NMI handler. Same issue as the MWAIT
         scenario described above.

         Sending an INIT/INIT sequence to the APs puts them into wait
         for STARTUP state, which is safe against NMI.

     There is still an issue remaining which can't be fixed: #MCE

     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.

     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
     #MCE to shut down the machine.

     So there is a choice between fire (HLT) and frying pan (INIT).
     Frying pan has been chosen as it's at least preventing the NMI
     issue.

     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed"

* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Put CPUs into INIT on shutdown if possible
  x86/smp: Split sending INIT IPI out into a helper function
  x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  x86/smp: Use dedicated cache-line for mwait_play_dead()
  x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  x86/smp: Dont access non-existing CPUID leaf
  x86/smp: Make stop_other_cpus() more robust
2023-06-26 14:45:53 -07:00
Linus Torvalds
9244724fbf A large update for SMP management:
- Parallel CPU bringup
 
     The reason why people are interested in parallel bringup is to shorten
     the (kexec) reboot time of cloud servers to reduce the downtime of the
     VM tenants.
 
     The current fully serialized bringup does the following per AP:
 
       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state
 
     There are two significant delays:
 
       #3 The time for an AP to report alive state in start_secondary() on
          x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.
 
       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending on
          the microcode patch size to apply.
 
     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to come
     up and apply microcode. That's more than 80% of the actual onlining
     procedure.
 
     This can be reduced significantly by splitting the bringup mechanism
     into two parts:
 
       1) Run the prepare callbacks and kick the AP alive for each AP which
       	 needs to be brought up.
 
 	 The APs wake up, do their firmware initialization and run the low
       	 level kernel startup code including microcode loading in parallel
       	 up to the first synchronization point. (#1 and #2 above)
 
       2) Run the rest of the bringup code strictly serialized per CPU
       	 (#3 - #5 above) as it's done today.
 
 	 Parallelizing that stage of the CPU bringup might be possible in
 	 theory, but it's questionable whether required surgery would be
 	 justified for a pretty small gain.
 
     If the system is large enough the first AP is already waiting at the
     first synchronization point when the boot CPU finished the wake-up of
     the last AP. That reduces the AP bringup time on that SKL from ~800ms
     to ~80ms, i.e. by a factor ~10x.
 
     The actual gain varies wildly depending on the system, CPU, microcode
     patch size and other factors. There are some opportunities to reduce
     the overhead further, but that needs some deep surgery in the x86 CPU
     bringup code.
 
     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.
 
   - Enhancements for SMP function call tracing so it is possible to locate
     the scheduling and the actual execution points. That allows to measure
     IPI delivery time precisely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
 FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
 zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
 wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
 K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
 a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
 dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
 TRiSzvssbYYmaw==
 =Y8if
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:
 "A large update for SMP management:

   - Parallel CPU bringup

     The reason why people are interested in parallel bringup is to
     shorten the (kexec) reboot time of cloud servers to reduce the
     downtime of the VM tenants.

     The current fully serialized bringup does the following per AP:

       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state

     There are two significant delays:

       #3 The time for an AP to report alive state in start_secondary()
          on x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.

       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending
          on the microcode patch size to apply.

     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to
     come up and apply microcode. That's more than 80% of the actual
     onlining procedure.

     This can be reduced significantly by splitting the bringup
     mechanism into two parts:

       1) Run the prepare callbacks and kick the AP alive for each AP
          which needs to be brought up.

          The APs wake up, do their firmware initialization and run the
          low level kernel startup code including microcode loading in
          parallel up to the first synchronization point. (#1 and #2
          above)

       2) Run the rest of the bringup code strictly serialized per CPU
          (#3 - #5 above) as it's done today.

          Parallelizing that stage of the CPU bringup might be possible
          in theory, but it's questionable whether required surgery
          would be justified for a pretty small gain.

     If the system is large enough the first AP is already waiting at
     the first synchronization point when the boot CPU finished the
     wake-up of the last AP. That reduces the AP bringup time on that
     SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

     The actual gain varies wildly depending on the system, CPU,
     microcode patch size and other factors. There are some
     opportunities to reduce the overhead further, but that needs some
     deep surgery in the x86 CPU bringup code.

     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.

   - Enhancements for SMP function call tracing so it is possible to
     locate the scheduling and the actual execution points. That allows
     to measure IPI delivery time precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
2023-06-26 13:59:56 -07:00
Linus Torvalds
7cffdbe360 Updates for the x86 boot process:
- Initialize FPU late.
 
    Right now FPU is initialized very early during boot. There is no real
    requirement to do so. The only requirement is to have it done before
    alternatives are patched.
 
    That's done in check_bugs() which does way more than what the function
    name suggests.
 
    So first rename check_bugs() to arch_cpu_finalize_init() which makes it
    clear what this is about.
 
    Move the invocation of arch_cpu_finalize_init() earlier in
    start_kernel() as it has to be done before fork_init() which needs to
    know the FPU register buffer size.
 
    With those prerequisites the FPU initialization can be moved into
    arch_cpu_finalize_init(), which removes it from the early and fragile
    part of the x86 bringup.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZdNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaNBEACWtVd1uhqQldIFgSvZYujsrWXlmkU+
 pok6gDzKQNwZADiXW/tn5fP8SBLWT0pgLM9d+oZ5mEaLaOW7HcZLEHcVrn74e3TT
 53xN8e1zCzyjCJ/x22vrKH4sn/bU+bQyzSNVu9Disqn9Fl+ts37FqAHDv/ExbneD
 DaYXXCLgQsyGbPLD8B7yGOpJTGBUTJxNQS1ZFElBaRsAaw0mYZOEoPvuTFK4o7Uz
 GUB2vGefmeNfX+EgLYKG9QoS0F3SMS9X2IYswy1H76ZnV/eXmTsA1S3u3X9yX7kC
 XBnPtCC+iX+7o3xFkTpa0oQUdzEyGOItExZZgce6jEQu4Fl7NoIJxhlMg9/Y+vcF
 ntipEKSWFLAi1GkZzeKRwSSsoWqRaFxOKLy8qhn9kud09k+UtMBkNrF1CSp9laAz
 QParu3B1oHPEzx/jS0bSOCMN+AQZH8rX7LxRp4kpBOeBSZNCnfaBUzfIvmccPls+
 EJTO/0JUpRm5LsPSDiJhypPRoOOIP26IloR6OoZTcI3p76NrnYblRvisvuFAgDU6
 bk7Belf+GDx0kBZugqQgok7nDaHIBR7vEmca1NV8507UrffVyxLAiI4CiWPcFdOq
 ovhO8K+gP4xvzZx4cXZBwYwusjvl/oxKy8yQiGgoftDiWU4sdUCSrwX3x27+hUYL
 2P1OLDOXSGwESQ==
 =yxMj
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Thomas Gleixner:
 "Initialize FPU late.

  Right now FPU is initialized very early during boot. There is no real
  requirement to do so. The only requirement is to have it done before
  alternatives are patched.

  That's done in check_bugs() which does way more than what the function
  name suggests.

  So first rename check_bugs() to arch_cpu_finalize_init() which makes
  it clear what this is about.

  Move the invocation of arch_cpu_finalize_init() earlier in
  start_kernel() as it has to be done before fork_init() which needs to
  know the FPU register buffer size.

  With those prerequisites the FPU initialization can be moved into
  arch_cpu_finalize_init(), which removes it from the early and fragile
  part of the x86 bringup"

* tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
  x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
  x86/fpu: Mark init functions __init
  x86/fpu: Remove cpuinfo argument from init functions
  x86/init: Initialize signal frame size late
  init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
  init: Invoke arch_cpu_finalize_init() earlier
  init: Remove check_bugs() leftovers
  um/cpu: Switch to arch_cpu_finalize_init()
  sparc/cpu: Switch to arch_cpu_finalize_init()
  sh/cpu: Switch to arch_cpu_finalize_init()
  mips/cpu: Switch to arch_cpu_finalize_init()
  m68k/cpu: Switch to arch_cpu_finalize_init()
  loongarch/cpu: Switch to arch_cpu_finalize_init()
  ia64/cpu: Switch to arch_cpu_finalize_init()
  ARM: cpu: Switch to arch_cpu_finalize_init()
  x86/cpu: Switch to arch_cpu_finalize_init()
  init: Provide arch_cpu_finalize_init()
2023-06-26 13:39:10 -07:00
Linus Torvalds
64bf6ae93e v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
 ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
 =zdfx
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Miscellaneous features, cleanups, and fixes for vfs and individual fs

  Features:

   - Use mode 0600 for file created by cachefilesd so it can be run by
     unprivileged users. This aligns them with directories which are
     already created with mode 0700 by cachefilesd

   - Reorder a few members in struct file to prevent some false sharing
     scenarios

   - Indicate that an eventfd is used a semaphore in the eventfd's
     fdinfo procfs file

   - Add a missing uapi header for eventfd exposing relevant uapi
     defines

   - Let the VFS protect transitions of a superblock from read-only to
     read-write in addition to the protection it already provides for
     transitions from read-write to read-only. Protecting read-only to
     read-write transitions allows filesystems such as ext4 to perform
     internal writes, keeping writers away until the transition is
     completed

  Cleanups:

   - Arnd removed the architecture specific arch_report_meminfo()
     prototypes and added a generic one into procfs.h. Note, we got a
     report about a warning in amdpgpu codepaths that suggested this was
     bisectable to this change but we concluded it was a false positive

   - Remove unused parameters from split_fs_names()

   - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
     reflect the order of the cleanup operation that has to unmap before
     the actual put

   - Unexport buffer_check_dirty_writeback() as it is not used outside
     of block device aops

   - Stop allocating aio rings from highmem

   - Protecting read-{only,write} transitions in the VFS used open-coded
     barriers in various places. Replace them with proper little helpers
     and document both the helpers and all barrier interactions involved
     when transitioning between read-{only,write} states

   - Use flexible array members in old readdir codepaths

  Fixes:

   - Use the correct type __poll_t for epoll and eventfd

   - Replace all deprecated strlcpy() invocations, whose return value
     isn't checked with an equivalent strscpy() call

   - Fix some kernel-doc warnings in fs/open.c

   - Reduce the stack usage in jffs2's xattr codepaths finally getting
     rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
     bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
     royally annoying compilation warning

   - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
     fmode_t is required to avoid fmode_t to integer degradation
     warnings

   - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
     explanation in that commit how O_RDWR is actually a bug which we
     found out with the help of Linus and git archeology

   - Fix "no previous prototype" warnings in the pipe codepaths

   - Add overflow calculations for remap_verify_area() as a signed
     addition overflow could be triggered in xfstests

   - Fix a null pointer dereference in sysv

   - Use an unsigned variable for length calculations in jfs avoiding
     compilation warnings with gcc 13

   - Fix a dangling pipe pointer in the watch queue codepath

   - The legacy mount option parser provided as a fallback by the VFS
     for filesystems not yet converted to the new mount api did prefix
     the generated mount option string with a leading ',' causing issues
     for some filesystems

   - Fix a repeated word in a comment in fs.h

   - autofs: Update the ctime when mtime is updated as mandated by
     POSIX"

* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  readdir: Replace one-element arrays with flexible-array members
  fs: Provide helpers for manipulating sb->s_readonly_remount
  fs: Protect reconfiguration of sb read-write from racing writes
  eventfd: add a uapi header for eventfd userspace APIs
  autofs: set ctime as well when mtime changes on a dir
  eventfd: show the EFD_SEMAPHORE flag in fdinfo
  fs/aio: Stop allocating aio rings from HIGHMEM
  fs: Fix comment typo
  fs: unexport buffer_check_dirty_writeback
  fs: avoid empty option when generating legacy mount string
  watch_queue: prevent dangling pipe pointer
  fs.h: Optimize file struct to prevent false sharing
  highmem: Rename put_and_unmap_page() to unmap_and_put_page()
  cachefiles: Allow the cache to be non-root
  init: remove unused names parameter in split_fs_names()
  jfs: Use unsigned variable for length calculations
  fs/sysv: Null check to prevent null-ptr-deref bug
  fs: use UB-safe check for signed addition overflow in remap_verify_area
  procfs: consolidate arch_report_meminfo declaration
  fs: pipe: reveal missing function protoypes
  ...
2023-06-26 09:50:21 -07:00
Thomas Gleixner
45e34c8af5 x86/smp: Put CPUs into INIT on shutdown if possible
Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
d7893093a7 x86/smp: Cure kexec() vs. mwait_play_dead() breakage
TLDR: It's a mess.

When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offlined CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
1f5e7eb786 x86/smp: Make stop_other_cpus() more robust
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.

That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.

That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:

CPU0					CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);			stop_this_cpu()
   while (num_online_cpus() > 1);         set_online(false);
   proceed... -> hang
				          wbinvd()

WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.

This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.

Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().

The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
2023-06-20 14:51:46 +02:00
Thomas Gleixner
0a9567ac5e x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
Moving mem_encrypt_init() broke the AMD_MEM_ENCRYPT=n because the
declaration of that function was under #ifdef CONFIG_AMD_MEM_ENCRYPT and
the obvious placement for the inline stub was the #else path.

This is a leftover of commit 20f07a044a76 ("x86/sev: Move common memory
encryption code to mem_encrypt.c") which made mem_encrypt_init() depend on
X86_MEM_ENCRYPT without moving the prototype. That did not fail back then
because there was no stub inline as the core init code had a weak function.

Move both the declaration and the stub out of the CONFIG_AMD_MEM_ENCRYPT
section and guard it with CONFIG_X86_MEM_ENCRYPT.

Fixes: 439e17576eb4 ("init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/oe-kbuild-all/202306170247.eQtCJPE8-lkp@intel.com/
2023-06-16 22:23:27 +02:00
Omar Sandoval
b9f174c811 x86/unwind/orc: Add ELF section with ORC version identifier
Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC
metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in
two") changed the ORC format. Although ORC is internal to the kernel,
it's the only way for external tools to get reliable kernel stack traces
on x86-64. In particular, the drgn debugger [1] uses ORC for stack
unwinding, and these format changes broke it [2]. As the drgn
maintainer, I don't care how often or how much the kernel changes the
ORC format as long as I have a way to detect the change.

It suffices to store a version identifier in the vmlinux and kernel
module ELF files (to use when parsing ORC sections from ELF), and in
kernel memory (to use when parsing ORC from a core dump+symbol table).
Rather than hard-coding a version number that needs to be manually
bumped, Peterz suggested hashing the definitions from orc_types.h. If
there is a format change that isn't caught by this, the hashing script
can be updated.

This patch adds an .orc_header allocated ELF section containing the
20-byte hash to vmlinux and kernel modules, along with the corresponding
__start_orc_header and __stop_orc_header symbols in vmlinux.

1: https://github.com/osandov/drgn
2: https://github.com/osandov/drgn/issues/303

Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata")
Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
2023-06-16 17:17:42 +02:00
Thomas Gleixner
1f34bb2a24 x86/fpu: Remove cpuinfo argument from init functions
Nothing in the call chain requires it

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.783704297@linutronix.de
2023-06-16 10:16:01 +02:00
Thomas Gleixner
54d9a91a3d x86/init: Initialize signal frame size late
No point in doing this during really early boot. Move it to an early
initcall so that it is set up before possible user mode helpers are started
during device initialization.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.727330699@linutronix.de
2023-06-16 10:16:00 +02:00
Thomas Gleixner
439e17576e init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and
remove the weak fallback from the core code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.670360645@linutronix.de
2023-06-16 10:16:00 +02:00
Thomas Gleixner
7c7077a726 x86/cpu: Switch to arch_cpu_finalize_init()
check_bugs() is a dumping ground for finalizing the CPU bringup. Only parts of
it has to do with actual CPU bugs.

Split it apart into arch_cpu_finalize_init() and cpu_select_mitigations().

Fixup the bogus 32bit comments while at it.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613224545.019583869@linutronix.de
2023-06-16 10:15:59 +02:00
Tom Lendacky
6c32117963 x86/sev: Add SNP-specific unaccepted memory support
Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.

Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/a52fa69f460fd1876d70074b20ad68210dfc31dd.1686063086.git.thomas.lendacky@amd.com
2023-06-06 18:31:37 +02:00
Tom Lendacky
15d9088779 x86/sev: Use large PSC requests if applicable
In advance of providing support for unaccepted memory, request 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.

In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction of the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fallback to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/050d17b460dfc237b51d72082e5df4498d3513cb.1686063086.git.thomas.lendacky@amd.com
2023-06-06 18:29:35 +02:00
Tom Lendacky
69dcb1e3bb x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.

If the reduction in PSC entries results in any kind of performance issue
(that is not seen at the moment), use of a larger static PSC struct, with
fallback to the smaller stack version, can be investigated.

For more background info on this decision, see the subthread in the Link:
tag below.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com
2023-06-06 18:28:25 +02:00
Tom Lendacky
5dee19b6b2 x86/sev: Fix calculation of end address based on number of pages
When calculating an end address based on an unsigned int number of pages,
any value greater than or equal to 0x100000 that is shift PAGE_SHIFT bits
results in a 0 value, resulting in an invalid end address. Change the
number of pages variable in various routines from an unsigned int to an
unsigned long to calculate the end address correctly.

Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com
2023-06-06 18:27:20 +02:00
Kirill A. Shutemov
75d090fd16 x86/tdx: Add unaccepted memory support
Hookup TDX-specific code to accept memory.

Accepting the memory is done with ACCEPT_PAGE module call on every page
in the range. MAP_GPA hypercall is not required as the unaccepted memory
is considered private already.

Extract the part of tdx_enc_status_changed() that does memory acceptance
in a new helper. Move the helper tdx-shared.c. It is going to be used by
both main kernel and decompressor.

  [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com
2023-06-06 18:25:57 +02:00
Kirill A. Shutemov
ff40b5769a x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
Memory acceptance requires a hypercall and one or multiple module calls.

Make helpers for the calls available in boot stub. It has to accept
memory where kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com
2023-06-06 17:31:23 +02:00
Kirill A. Shutemov
745e3ed85f efi/libstub: Implement support for unaccepted memory
UEFI Specification version 2.9 introduces the concept of memory
acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, requiring memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific for the Virtual
Machine platform.

Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 (or whatever the arch uses) has to be modified on every
page acceptance. It leads to table fragmentation and there's a limited
number of entries in the e820 table.

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents a
naturally aligned power-2-sized region of address space -- unit.

For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB or
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- It needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to unit_size gets accepted
upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via EFI configuration table. allocate_e820() allocates the
bitmap if unaccepted memory is present, according to the size of
unaccepted region.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230606142637.5171-4-kirill.shutemov@linux.intel.com
2023-06-06 16:58:23 +02:00
Mike Christie
f9010dbdce fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process.  2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
<michael.christie@oracle.com> which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Co-developed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-01 17:15:33 -04:00
Thomas Gleixner
ff3cfcb0d4 x86/smpboot: Fix the parallel bringup decision
The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.

The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.

As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.

Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx
2023-05-31 16:49:34 +02:00
Peter Zijlstra
3496d1c64a x86/nospec: Shorten RESET_CALL_DEPTH
RESET_CALL_DEPTH is a pretty fat monster and blows up UNTRAIN_RET to
20 bytes:

  19:       48 c7 c0 80 00 00 00    mov    $0x80,%rax
  20:       48 c1 e0 38             shl    $0x38,%rax
  24:       65 48 89 04 25 00 00 00 00      mov    %rax,%gs:0x0     29: R_X86_64_32S        pcpu_hot+0x10

Shrink it by 4 bytes:

  0:   31 c0				xor %eax,%eax
  2:   48 0f ba e8 3f			bts $0x3f,%rax
  7:   65 48 89 04 25 00 00 00 00	mov %rax,%gs:0x0

Shrink RESET_CALL_DEPTH_FROM_CALL by 5 bytes by only setting %al, the
other bits are shifted out (the same could be done for RESET_CALL_DEPTH,
but the XOR+BTS sequence has less dependencies due to the zeroing).

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515093020.729622326@infradead.org
2023-05-31 13:40:57 +02:00
Peter Zijlstra
df25edbac3 x86/alternatives: Add longer 64-bit NOPs
By adding support for longer NOPs there are a few more alternatives
that can turn into a single instruction.

Add up to NOP11, the same limit where GNU as .nops also stops
generating longer nops. This is because a number of uarchs have severe
decode penalties for more than 3 prefixes.

  [ bp: Sync up with the version in tools/ while at it. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515093020.661756940@infradead.org
2023-05-31 10:21:21 +02:00
Andrew Cooper
6a4be69845 x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
"x86/smpboot: Support parallel startup of secondary CPUs" adds the first use
of X2APIC_ENABLE in assembly, but older binutils don't tolerate the UL suffix.

Switch to using BIT() instead.

Fixes: 7e75178a0950 ("x86/smpboot: Support parallel startup of secondary CPUs")
Reported-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://lore.kernel.org/r/20230522105738.2378364-1-andrew.cooper3@citrix.com
2023-05-22 14:06:33 +02:00
Jacob Xu
3367eeab97 KVM: VMX: Fix header file dependency of asm/vmx.h
Include a definition of WARN_ON_ONCE() before using it.

Fixes: bb1fcc70d98f ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jacob Xu <jacobhxu@google.com>
[reworded commit message; changed <asm/bug.h> to <linux/bug.h>]
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225012959.1554168-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-19 13:56:25 -04:00
Arnd Bergmann
ef104443bf procfs: consolidate arch_report_meminfo declaration
The arch_report_meminfo() function is provided by four architectures,
with a __weak fallback in procfs itself. On architectures that don't
have a custom version, the __weak version causes a warning because
of the missing prototype.

Remove the architecture specific prototypes and instead add one
in linux/proc_fs.h.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Message-Id: <20230516195834.551901-1-arnd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17 09:24:49 +02:00
Lukas Bulwahn
7583e8fbdc x86/cpu: Remove X86_FEATURE_NAMES
While discussing to change the visibility of X86_FEATURE_NAMES (see Link)
in order to remove CONFIG_EMBEDDED, Boris suggested to simply make the
X86_FEATURE_NAMES functionality unconditional.

As the need for really tiny kernel images has gone away and kernel images
with !X86_FEATURE_NAMES are hardly tested, remove this config and the whole
ifdeffery in the source code.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20230509084007.24373-1-lukas.bulwahn@gmail.com/
Link: https://lore.kernel.org/r/20230510065713.10996-3-lukas.bulwahn@gmail.com
2023-05-15 20:03:08 +02:00
David Woodhouse
7e75178a09 x86/smpboot: Support parallel startup of secondary CPUs
In parallel startup mode the APs are kicked alive by the control CPU
quickly after each other and run through the early startup code in
parallel. The real-mode startup code is already serialized with a
bit-spinlock to protect the real-mode stack.

In parallel startup mode the smpboot_control variable obviously cannot
contain the Linux CPU number so the APs have to determine their Linux CPU
number on their own. This is required to find the CPUs per CPU offset in
order to find the idle task stack and other per CPU data.

To achieve this, export the cpuid_to_apicid[] array so that each AP can
find its own CPU number by searching therein based on its APIC ID.

Introduce a flag in the top bits of smpboot_control which indicates that
the AP should find its CPU number by reading the APIC ID from the APIC.

This is required because CPUID based APIC ID retrieval can only provide the
initial APIC ID, which might have been overruled by the firmware. Some AMD
APUs come up with APIC ID = initial APIC ID + 0x10, so the APIC ID to CPU
number lookup would fail miserably if based on CPUID. Also virtualization
can make its own APIC ID assignements. The only requirement is that the
APIC IDs are consistent with the APCI/MADT table.

For the boot CPU or in case parallel bringup is disabled the control bits
are empty and the CPU number is directly available in bit 0-23 of
smpboot_control.

[ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ]
[ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ]
[ seanc: Fix stray override of initial_gs in common_cpu_up() ]
[ Oleksandr Natalenko: reported suspend/resume issue fixed in
  x86_acpi_suspend_lowlevel ]
[ tglx: Make it read the APIC ID from the APIC instead of using CPUID,
  	split the bitlock part out ]

Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.411554373@linutronix.de
2023-05-15 13:45:04 +02:00
Thomas Gleixner
f6f1ae9128 x86/smpboot: Implement a bit spinlock to protect the realmode stack
Parallel AP bringup requires that the APs can run fully parallel through
the early startup code including the real mode trampoline.

To prepare for this implement a bit-spinlock to serialize access to the
real mode stack so that parallel upcoming APs are not going to corrupt each
others stack while going through the real mode startup code.

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.355425551@linutronix.de
2023-05-15 13:45:03 +02:00
Thomas Gleixner
bea629d57d x86/apic: Save the APIC virtual base address
For parallel CPU brinugp it's required to read the APIC ID in the low level
startup code. The virtual APIC base address is a constant because its a
fix-mapped address. Exposing that constant which is composed via macros to
assembly code is non-trivial due to header inclusion hell.

Aside of that it's constant only because of the vsyscall ABI
requirement. Once vsyscall is out of the picture the fixmap can be placed
at runtime.

Avoid header hell, stay flexible and store the address in a variable which
can be exposed to the low level startup code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.299231005@linutronix.de
2023-05-15 13:45:03 +02:00
Thomas Gleixner
f54d4434c2 x86/apic: Provide cpu_primary_thread mask
Make the primary thread tracking CPU mask based in preparation for simpler
handling of parallel bootup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.186599880@linutronix.de
2023-05-15 13:45:02 +02:00
Thomas Gleixner
8b5a0f957c x86/smpboot: Enable split CPU startup
The x86 CPU bringup state currently does AP wake-up, wait for AP to
respond and then release it for full bringup.

It is safe to be split into a wake-up and and a separate wait+release
state.

Provide the required functions and enable the split CPU bringup, which
prepares for parallel bringup, where the bringup of the non-boot CPUs takes
two iterations: One to prepare and wake all APs and the second to wait and
release them. Depending on timing this can eliminate the wait time
completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.133453992@linutronix.de
2023-05-15 13:45:01 +02:00
Thomas Gleixner
2711b8e2b7 x86/smpboot: Switch to hotplug core state synchronization
The new AP state tracking and synchronization mechanism in the CPU hotplug
core code allows to remove quite some x86 specific code:

  1) The AP alive synchronization based on cpumasks

  2) The decision whether an AP can be brought up again

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.529657366@linutronix.de
2023-05-15 13:44:56 +02:00
Thomas Gleixner
9d349d47f0 x86/smpboot: Make TSC synchronization function call based
Spin-waiting on the control CPU until the AP reaches the TSC
synchronization is just a waste especially in the case that there is no
synchronization required.

As the synchronization has to run with interrupts disabled the control CPU
part can just be done from a SMP function call. The upcoming AP issues that
call async only in the case that synchronization is required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de
2023-05-15 13:44:53 +02:00
Thomas Gleixner
d4f28f07c2 x86/smpboot: Move synchronization masks to SMP boot code
The usage is in smpboot.c and not in the CPU initialization code.

The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer
waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be
made static too.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de
2023-05-15 13:44:52 +02:00
Thomas Gleixner
e94cd1503b x86/smpboot: Get rid of cpu_init_secondary()
The synchronization of the AP with the control CPU is a SMP boot problem
and has nothing to do with cpu_init().

Open code cpu_init_secondary() in start_secondary() and move
wait_for_master_cpu() into the SMP boot code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de
2023-05-15 13:44:51 +02:00
Thomas Gleixner
5475abbde7 x86/smpboot: Remove the CPU0 hotplug kludge
This was introduced with commit e1c467e69040 ("x86, hotplug: Wake up CPU0
via NMI instead of INIT, SIPI, SIPI") to eventually support physical
hotplug of CPU0:

 "We'll change this code in the future to wake up hard offlined CPU0 if
  real platform and request are available."

11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.768845190@linutronix.de
2023-05-15 13:44:49 +02:00
Thomas Gleixner
e59e74dc48 x86/topology: Remove CPU0 hotplug option
This was introduced together with commit e1c467e69040 ("x86, hotplug: Wake
up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support
physical hotplug of CPU0:

 "We'll change this code in the future to wake up hard offlined CPU0 if
  real platform and request are available."

11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.715707999@linutronix.de
2023-05-15 13:44:49 +02:00
Thomas Gleixner
666e1156b2 x86/smpboot: Rename start_cpu0() to soft_restart_cpu()
This is used in the SEV play_dead() implementation to re-online CPUs. But
that has nothing to do with CPU0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.662319599@linutronix.de
2023-05-15 13:44:48 +02:00
Thomas Gleixner
5107e3ebb8 x86/smpboot: Cleanup topology_phys_to_logical_pkg()/die()
Make topology_phys_to_logical_pkg_die() static as it's only used in
smpboot.c and fixup the kernel-doc warnings for both functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de
2023-05-15 13:44:47 +02:00
Kan Liang
b752ea0c28 perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
Several similar kernel warnings can be triggered,

  [56605.607840] CPU0 PEBS record size 0, expected 32, config 0 cpuc->record_size=208

when the below commands are running in parallel for a while on SPR.

  while true;
  do
	perf record --no-buildid -a --intr-regs=AX  \
		    -e cpu/event=0xd0,umask=0x81/pp \
		    -c 10003 -o /dev/null ./triad;
  done &

  while true;
  do
	perf record -o /tmp/out -W -d \
		    -e '{ld_blocks.store_forward:period=1000000, \
                         MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}' \
		    -c 1037 ./triad;
  done

The triad program is just the generation of loads/stores.

The warnings are triggered when an unexpected PEBS record (with a
different config and size) is found.

A system-wide PEBS event with the large PEBS config may be enabled
during a context switch. Some PEBS records for the system-wide PEBS
may be generated while the old task is sched out but the new one
hasn't been sched in yet. When the new task is sched in, the
cpuc->pebs_record_size may be updated for the per-task PEBS events. So
the existing system-wide PEBS records have a different size from the
later PEBS records.

The PEBS buffer should be flushed right before the hardware is
reprogrammed. The new size and threshold should be updated after the
old buffer has been flushed.

Reported-by: Stephane Eranian <eranian@google.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230421184529.3320912-1-kan.liang@linux.intel.com
2023-05-08 10:58:27 +02:00
Linus Torvalds
b115d85a95 Locking changes in v6.4:
- Introduce local{,64}_try_cmpxchg() - a slightly more optimal
    primitive, which will be used in perf events ring-buffer code.
 
  - Simplify/modify rwsems on PREEMPT_RT, to address writer starvation.
 
  - Misc cleanups/fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRUvUoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hlIhAArP33rTKi+HAndQ3UHW3XtmHRxEEQTfiE
 wvIoN89h58QW4DGMeAV4ltafbIPQAkI233Aogwz903L0qbDV0Ro4OU3XJembRuWl
 LeOADKwYyypXdOa8XICuY9aIP7e1/h0DF3ySs7inLcwK9JCyAIxnsVHYej+hsRXA
 kZoXN98T3TR1C0V9UQy4SU3HI1lC3tsG3R9Ti9TnYUg3ygVXhRE9lOQ4kv9lFPVz
 BNuj2Blj7KNiVaY9kehrhO54THI7NmsCVZO44Rcl48I0KAcFulAmFcNlE7GnR8Nj
 thj38pU6XAFVHXG8MYjgE+Al+PnK48NtJxexCtHyGvGG4D2aLzRMnkolxAUCcVuK
 G+UBsQm3ybjYgHgt1zuN6ehcpT+5tULkDH8JA7vrgZYaVgxHzsUaHgYfCCWKnmUY
 mPR6aImEmYZwZVNLskhe0HT4mq244bp+VnWlnJ6LZK7t/itenvDhqnj7KTi4Bfej
 lTHplOTitV/8uCEW8V4pX+YTEenVsIQmTc/G3iIabXP/6HzLffA3q4vyW6vKIErE
 pqrpuFA0Z4GB+pU0mJXt7+I7zscDVthwI055jDyQBjA7IcdVGm2MjQ6xcNRW5FYN
 UynvaEMocue4ZO4WdFsd1ZBUd9VfoNzGQspBw46DhCL1MEQBYv36SKQNjej/9aRr
 ilVwqnOWI2s=
 =mM0A
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Introduce local{,64}_try_cmpxchg() - a slightly more optimal
   primitive, which will be used in perf events ring-buffer code

 - Simplify/modify rwsems on PREEMPT_RT, to address writer starvation

 - Misc cleanups/fixes

* tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/atomic: Correct (cmp)xchg() instrumentation
  locking/x86: Define arch_try_cmpxchg_local()
  locking/arch: Wire up local_try_cmpxchg()
  locking/generic: Wire up local{,64}_try_cmpxchg()
  locking/atomic: Add generic try_cmpxchg{,64}_local() support
  locking/rwbase: Mitigate indefinite writer starvation
  locking/arch: Rename all internal __xchg() names to __arch_xchg()
2023-05-05 12:56:55 -07:00
Linus Torvalds
798dec3304 x86-64: mm: clarify the 'positive addresses' user address rules
Dave Hansen found the "(long) addr >= 0" code in the x86-64 access_ok
checks somewhat confusing, and suggested using a helper to clarify what
the code is doing.

So this does exactly that: clarifying what the sign bit check is all
about, by adding a helper macro that makes it clear what it is testing.

This also adds some explicit comments talking about how even with LAM
enabled, any addresses with the sign bit will still GP-fault in the
non-canonical region just above the sign bit.

This is all what allows us to do the user address checks with just the
sign bit, and furthermore be a bit cavalier about accesses that might be
done with an additional offset even past that point.

(And yes, this talks about 'positive' even though zero is also a valid
user address and so technically we should call them 'non-negative'.  But
I don't think using 'non-negative' ends up being more understandable).

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-03 10:37:22 -07:00
Linus Torvalds
1dbc0a9515 x86: mm: remove 'sign' games from LAM untagged_addr*() macros
The intent of the sign games was to not modify kernel addresses when
untagging them.  However, that had two issues:

 (a) it didn't actually work as intended, since the mask was calculated
     as 'addr >> 63' on an _unsigned_ address. So instead of getting a
     mask of all ones for kernel addresses, you just got '1'.

 (b) untagging a kernel address isn't actually a valid operation anyway.

Now, (a) had originally been true for both 'untagged_addr()' and the
remote version of it, but had accidentally been fixed for the regular
version of untagged_addr() by commit e0bddc19ba95 ("x86/mm: Reduce
untagged_addr() overhead for systems without LAM").  That one rewrote
the shift to be part of the alternative asm code, and in the process
changed the unsigned shift into a signed 'sar' instruction.

And while it is true that we don't want to turn what looks like a kernel
address into a user address by masking off the high bit, that doesn't
need these sign masking games - all it needs is that the mm context
'untag_mask' value has the high bit set.

Which it always does.

So simplify the code by just removing the superfluous (and in the case
of untagged_addr_remote(), still buggy) sign bit games in the address
masking.

Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-03 10:37:22 -07:00
Linus Torvalds
b9bd9f605c x86: uaccess: move 32-bit and 64-bit parts into proper <asm/uaccess_N.h> header
The x86 <asm/uaccess.h> file has grown features that are specific to
x86-64 like LAM support and the related access_ok() changes.  They
really should be in the <asm/uaccess_64.h> file and not pollute the
generic x86 header.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-03 10:37:22 -07:00
Linus Torvalds
6ccdc91d6a x86: mm: remove architecture-specific 'access_ok()' define
There's already a generic definition of 'access_ok()' in the
asm-generic/access_ok.h header file, and the only difference bwteen that
and the x86-specific one is the added check for WARN_ON_IN_IRQ().

And it turns out that the reason for that check is long gone: it used to
use a "user_addr_max()" inline function that depended on the current
thread, and caused problems in non-thread contexts.

For details, see commits 7c4788950ba5 ("x86/uaccess, sched/preempt:
Verify access_ok() context") and in particular commit ae31fe51a3cc
("perf/x86: Restore TASK_SIZE check on frame pointer") about how and why
this came to be.

But that "current task" issue was removed in the big set_fs() removal by
Christoph Hellwig in commit 47058bb54b57 ("x86: remove address space
overrides using set_fs()").

So the reason for the test and the architecture-specific access_ok()
define no longer exists, and is actually harmful these days.  For
example, it led various 'copy_from_user_nmi()' games (eg using
__range_not_ok() instead, and then later converted to __access_ok() when
that became ok).

And that in turn meant that LAM was broken for the frame following
before this series, because __access_ok() used to not do the address
untagging.

Accessing user state still needs care in many contexts, but access_ok()
is not the place for this test.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org>
2023-05-03 10:37:22 -07:00