682 Commits

Author SHA1 Message Date
Linus Torvalds
e3da8db055 A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with an
 INIT IPI. But the same code path is also used by the crash utility. If the
 CPU which panics is not the boot CPU then it sends an INIT IPI to the boot
 CPU which resets the machine. Prevent this by validating that the CPU which
 runs the stop mechanism is the boot CPU. If not, leave the other CPUs in
 HLT.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSqoEcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYNSEACwo5zgibek27qeMvJGfNztm0qRa4mw
 wN0qV31yaNcEfhqL8bMU8n3wvEA+pZBqhaU5fyalY+yxc29jI/j9eda5zR+Fi9e5
 kVyFT2M0rVSDLFraoQeD+T/tSSK2MJtswF12ytY5mHzHMCb6Uy9fNCpUiQlB+i81
 AcnlKQk9ifZXFdMJPj5E+E6l776T8NZPoYEdFgJloxaYOGTdFJDWDlryx4LD7Urz
 Fx/ec8Ug/FYSPl2XzXHugvHjNefxKoomcZ3v3CSZonBcav7Gz6F06HAR5vVRWSHx
 4Dlh6zdy+60YKBmkvpb+RJIBMo8aXclwT+tntaoJvGHZ+PNASO6JVz9PvmoNgfWK
 Oy2n1K687qIOY6d+yxUZgbZpwXX5bG6kc0xbicUNigGagrYTfd83G5RAfwxNkqsY
 23Qw4Ue8uxve4M8iM/FfxKIShuDBiLCIDDIrWDjEkvIAnr1pd+NPUv7kqOTI7Kz/
 srNgcOwalypzuS93lgaN1yjRv1mmaPXhdhjy0DwGbC54bKgNzfq+7z75Ibn0dSFF
 JUPFVjztB+ymnM6PJ1dR77SvPi+xOi60nw7L+Qu9US4yKkW0NeGiIWVsggNorbU6
 UPFSE5gxwFD0w1EZ9W+IDeOZUNhjJUINZsn8txm+tb+oEqTIGRPHPOo0C1dBmLW9
 AmDIeHljj0iWIw==
 =DOCF
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "A single fix for the mechanism to park CPUs with an INIT IPI.

  On shutdown or kexec, the kernel tries to park the non-boot CPUs with
  an INIT IPI. But the same code path is also used by the crash utility.
  If the CPU which panics is not the boot CPU then it sends an INIT IPI
  to the boot CPU which resets the machine.

  Prevent this by validating that the CPU which runs the stop mechanism
  is the boot CPU. If not, leave the other CPUs in HLT"

* tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Don't send INIT to boot CPU
2023-07-09 10:08:38 -07:00
Thomas Gleixner
b1472a60a5 x86/smp: Don't send INIT to boot CPU
Parking CPUs in INIT works well, except for the crash case when the CPU
which invokes smp_park_other_cpus_in_init() is not the boot CPU. Sending
INIT to the boot CPU resets the whole machine.

Prevent this by validating that this runs on the boot CPU. If not fall back
and let CPUs hang in HLT.

Fixes: 45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible")
Reported-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/87ttui91jo.ffs@tglx
2023-07-07 15:42:31 +02:00
Linus Torvalds
ed3b7923a8 Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements:
 
     - Avoid unnecessary migrations within SMT domains on hybrid systems.
 
       Problem:
 
         On hybrid CPU systems, (processors with a mixture of higher-frequency
 	SMT cores and lower-frequency non-SMT cores), under the old code
 	lower-priority CPUs pulled tasks from the higher-priority cores if
 	more than one SMT sibling was busy - resulting in many unnecessary
 	task migrations.
 
       Solution:
 
         The new code improves the load balancer to recognize SMT cores with more
         than one busy sibling and allows lower-priority CPUs to pull tasks, which
         avoids superfluous migrations and lets lower-priority cores inspect all SMT
         siblings for the busiest queue.
 
     - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU
       contention in frequency, EAS max util & load-balance busiest CPU selection.
 
       This improves CPU utilization for certain workloads, while leaves other key
       workloads unchanged.
 
 - Scheduler infrastructure improvements:
 
     - Rewrite the scheduler topology setup code by consolidating it
       into the build_sched_topology() helper function and building
       it dynamically on the fly.
 
     - Resolve the local_clock() vs. noinstr complications by rewriting
       the code: provide separate sched_clock_noinstr() and
       local_clock_noinstr() functions to be used in instrumentation code,
       and make sure it is all instrumentation-safe.
 
 - Fixes:
 
     - Fix a kthread_park() race with wait_woken()
 
     - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
        - Fix UP PREEMPT bug by unifying the SMP and UP implementations.
        - Fix task_struct::saved_state handling.
 
     - Fix various rq clock update bugs, unearthed by turning on the rq clock
       debugging code.
 
     - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by
       creating enough cgroups, by removing the warnign and restricting
       window size triggers to PSI file write-permission or CAP_SYS_RESOURCE.
 
     - Propagate SMT flags in the topology when removing degenerate domain
 
     - Fix grub_reclaim() calculation bug in the deadline scheduler code
 
     - Avoid resetting the min update period when it is unnecessary, in
       psi_trigger_destroy().
 
     - Don't balance a task to its current running CPU in load_balance(),
       which was possible on certain NUMA topologies with overlapping
       groups.
 
     - Fix the sched-debug printing of rq->nr_uninterruptible
 
 - Cleanups:
 
     - Address various -Wmissing-prototype warnings, as a preparation
       to (maybe) enable this warning in the future.
 
     - Remove unused code
 
     - Mark more functions __init
 
     - Fix shadow-variable warnings
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS
 qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w
 egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk
 o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo
 9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR
 u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf
 vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl
 AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T
 wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr
 4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF
 VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N
 PiDbtX760ic=
 =EWQA
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler SMP load-balancer improvements:

   - Avoid unnecessary migrations within SMT domains on hybrid systems.

     Problem:

        On hybrid CPU systems, (processors with a mixture of
        higher-frequency SMT cores and lower-frequency non-SMT cores),
        under the old code lower-priority CPUs pulled tasks from the
        higher-priority cores if more than one SMT sibling was busy -
        resulting in many unnecessary task migrations.

     Solution:

        The new code improves the load balancer to recognize SMT cores
        with more than one busy sibling and allows lower-priority CPUs
        to pull tasks, which avoids superfluous migrations and lets
        lower-priority cores inspect all SMT siblings for the busiest
        queue.

   - Implement the 'runnable boosting' feature in the EAS balancer:
     consider CPU contention in frequency, EAS max util & load-balance
     busiest CPU selection.

     This improves CPU utilization for certain workloads, while leaves
     other key workloads unchanged.

  Scheduler infrastructure improvements:

   - Rewrite the scheduler topology setup code by consolidating it into
     the build_sched_topology() helper function and building it
     dynamically on the fly.

   - Resolve the local_clock() vs. noinstr complications by rewriting
     the code: provide separate sched_clock_noinstr() and
     local_clock_noinstr() functions to be used in instrumentation code,
     and make sure it is all instrumentation-safe.

  Fixes:

   - Fix a kthread_park() race with wait_woken()

   - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
       - Fix UP PREEMPT bug by unifying the SMP and UP implementations
       - Fix task_struct::saved_state handling

   - Fix various rq clock update bugs, unearthed by turning on the rq
     clock debugging code.

   - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
     by creating enough cgroups, by removing the warnign and restricting
     window size triggers to PSI file write-permission or
     CAP_SYS_RESOURCE.

   - Propagate SMT flags in the topology when removing degenerate domain

   - Fix grub_reclaim() calculation bug in the deadline scheduler code

   - Avoid resetting the min update period when it is unnecessary, in
     psi_trigger_destroy().

   - Don't balance a task to its current running CPU in load_balance(),
     which was possible on certain NUMA topologies with overlapping
     groups.

   - Fix the sched-debug printing of rq->nr_uninterruptible

  Cleanups:

   - Address various -Wmissing-prototype warnings, as a preparation to
     (maybe) enable this warning in the future.

   - Remove unused code

   - Mark more functions __init

   - Fix shadow-variable warnings"

* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
  sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
  sched/core: Fixed missing rq clock update before calling set_rq_offline()
  sched/deadline: Update GRUB description in the documentation
  sched/deadline: Fix bandwidth reclaim equation in GRUB
  sched/wait: Fix a kthread_park race with wait_woken()
  sched/topology: Mark set_sched_topology() __init
  sched/fair: Rename variable cpu_util eff_util
  arm64/arch_timer: Fix MMIO byteswap
  sched/fair, cpufreq: Introduce 'runnable boosting'
  sched/fair: Refactor CPU utilization functions
  cpuidle: Use local_clock_noinstr()
  sched/clock: Provide local_clock_noinstr()
  x86/tsc: Provide sched_clock_noinstr()
  clocksource: hyper-v: Provide noinstr sched_clock()
  clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
  x86/vdso: Fix gettimeofday masking
  math64: Always inline u128 version of mul_u64_u64_shr()
  s390/time: Provide sched_clock_noinstr()
  loongarch: Provide noinstr sched_clock_read()
  ...
2023-06-27 14:03:21 -07:00
Linus Torvalds
88afbb21d4 A set of fixes for kexec(), reboot and shutdown issues
- Ensure that the WBINVD in stop_this_cpu() has been completed before the
    control CPU proceedes.
 
    stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs
    in a HLT loop.
 
    The control CPU sends an IPI to the APs and waits for their CPU online bits
    to be cleared. Once they all are marked "offline" it proceeds.
 
    But stop_this_cpu() clears the CPU online bit before issuing WBINVD,
    which means there is no guarantee that the AP has reached the HLT loop.
 
    This was reported to cause intermittent reboot/shutdown failures due to
    some dubious interaction with the firmware.
 
    This is not only a problem of WBINVD. The code to actually "stop" the
    CPU which runs between clearing the online bit and reaching the HLT loop
    can cause large enough delays on its own (think virtualization). That's
    especially dangerous for kexec() as kexec() expects that all APs are in
    a safe state and not executing code while the boot CPU jumps to the new
    kernel. There are more issues vs. kexec() which are addressed separately.
 
    Cure this by implementing an explicit synchronization point right before
    the AP reaches HLT. This guarantees that the AP has completed the full
    stop proceedure.
 
  - Fix the condition for WBINVD in stop_this_cpu().
 
    The WBINVD in stop_this_cpu() is required for ensuring that when
    switching to or from memory encryption no dirty data is left in the
    cache lines which might cause a write back in the wrong more later.
 
    This checks CPUID directly because the feature bit might have been
    cleared due to a command line option.
 
    But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel
    CPUs return the content of the highest supported leaf when a non-existing
    leaf is read, while AMD CPUs return all zeros for unsupported leafs.
 
    So the result of the test on Intel CPUs is lottery and on AMD its just
    correct by chance.
 
    While harmless it's incorrect and causes the conditional wbinvd() to be
    issued where not required, which caused the above issue to be unearthed.
 
  - Make kexec() robust against AP code execution
 
    Ashok observed triple faults when doing kexec() on a system which had
    been booted with "nosmt".
 
    It turned out that the SMT siblings which had been brought up partially
    are parked in mwait_play_dead() to enable power savings.
 
    mwait_play_dead() is monitoring the thread flags of the AP's idle task,
    which has been chosen as it's unlikely to be written to.
 
    But kexec() can overwrite the previous kernel text and data including
    page tables etc. When it overwrites the cache lines monitored by an AP
    that AP resumes execution after the MWAIT on eventually overwritten
    text, stack and page tables, which obviously might end up in a triple
    fault easily.
 
    Make this more robust in several steps:
 
     1) Use an explicit per CPU cache line for monitoring.
 
     2) Write a command to these cache lines to kick APs out of MWAIT before
        proceeding with kexec(), shutdown or reboot.
 
        The APs confirm the wakeup by writing status back and then enter a
        HLT loop.
 
     3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs
        in INIT state.
 
        HLT is not a guarantee that an AP won't wake up and resume
        execution. HLT is woken up by NMI and SMI. SMI puts the CPU back
        into HLT (+/- firmware bugs), but NMI is delivered to the CPU which
        executes the NMI handler. Same issue as the MWAIT scenario described
        above.
 
        Sending an INIT/INIT sequence to the APs puts them into wait for
        STARTUP state, which is safe against NMI.
 
     There is still an issue remaining which can't be fixed: #MCE
 
     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.
 
     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to
     shut down the machine.
 
     So there is a choice between fire (HLT) and frying pan (INIT). Frying
     pan has been chosen as it's at least preventing the NMI issue.
 
     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZfpQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeZpD/9gSJN2qtGqoOgE8bWAenEeqppmBGFE
 EAhuhsvN1qG9JosUFo4KzxsGD/aWt2P6XglBDrGti8mFNol67jutmwWklntL3/ZR
 m8D6D+Pl7/CaDgACDTDbrnVC3lOGyMhD301yJrnBigS/SEoHeHI9UtadbHukuLQj
 TlKt5KtAnap15bE6QL846cDIptB9SjYLLPULo3i4azXEis/l6eAkffwAR6dmKlBh
 2RbhLK1xPPG9nqWYjqZXnex09acKwD9xY9xHj4+GampV4UqHJRWfW0YtFs5ENi01
 r3FVCdKEcvMkUw0zh0IAviBRs2vCI/R3YSfEc7P0264yn5WzMhAT+OGCovNjByiW
 sB4Iqa+Yf6aoBWwux6W4d22xu7uYhmFk/jiLyRZJPW/gvGZCZATT/x/T2hRoaYA8
 3S0Rs7n/gbfvynQETgniifuM0bXRW0lEJAmn840GwyVQwlpDEPBJSwW4El49kbkc
 +dHxnmpMCfnBxfVLS1YDd4WOmkWBeECNcW330FShlQQ8mM3UG31+Q8Jc55Ze9SW0
 w1h+IgIOHlA0DpQUUM8DJTSuxFx2piQsZxjOtzd70+BiKZpCsHqVLIp4qfnf+/GO
 gyP0cCQLbafpABbV9uVy8A/qgUGi0Qii0GJfCTy0OdmU+JX3C2C/gsM3uN0g3qAj
 vUhkuCXEGL5k1w==
 =KgZ0
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Thomas Gleixner:
 "A set of fixes for kexec(), reboot and shutdown issues:

   - Ensure that the WBINVD in stop_this_cpu() has been completed before
     the control CPU proceedes.

     stop_this_cpu() is used for kexec(), reboot and shutdown to park
     the APs in a HLT loop.

     The control CPU sends an IPI to the APs and waits for their CPU
     online bits to be cleared. Once they all are marked "offline" it
     proceeds.

     But stop_this_cpu() clears the CPU online bit before issuing
     WBINVD, which means there is no guarantee that the AP has reached
     the HLT loop.

     This was reported to cause intermittent reboot/shutdown failures
     due to some dubious interaction with the firmware.

     This is not only a problem of WBINVD. The code to actually "stop"
     the CPU which runs between clearing the online bit and reaching the
     HLT loop can cause large enough delays on its own (think
     virtualization). That's especially dangerous for kexec() as kexec()
     expects that all APs are in a safe state and not executing code
     while the boot CPU jumps to the new kernel. There are more issues
     vs kexec() which are addressed separately.

     Cure this by implementing an explicit synchronization point right
     before the AP reaches HLT. This guarantees that the AP has
     completed the full stop proceedure.

   - Fix the condition for WBINVD in stop_this_cpu().

     The WBINVD in stop_this_cpu() is required for ensuring that when
     switching to or from memory encryption no dirty data is left in the
     cache lines which might cause a write back in the wrong more later.

     This checks CPUID directly because the feature bit might have been
     cleared due to a command line option.

     But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
     Intel CPUs return the content of the highest supported leaf when a
     non-existing leaf is read, while AMD CPUs return all zeros for
     unsupported leafs.

     So the result of the test on Intel CPUs is lottery and on AMD its
     just correct by chance.

     While harmless it's incorrect and causes the conditional wbinvd()
     to be issued where not required, which caused the above issue to be
     unearthed.

   - Make kexec() robust against AP code execution

     Ashok observed triple faults when doing kexec() on a system which
     had been booted with "nosmt".

     It turned out that the SMT siblings which had been brought up
     partially are parked in mwait_play_dead() to enable power savings.

     mwait_play_dead() is monitoring the thread flags of the AP's idle
     task, which has been chosen as it's unlikely to be written to.

     But kexec() can overwrite the previous kernel text and data
     including page tables etc. When it overwrites the cache lines
     monitored by an AP that AP resumes execution after the MWAIT on
     eventually overwritten text, stack and page tables, which obviously
     might end up in a triple fault easily.

     Make this more robust in several steps:

      1) Use an explicit per CPU cache line for monitoring.

      2) Write a command to these cache lines to kick APs out of MWAIT
         before proceeding with kexec(), shutdown or reboot.

         The APs confirm the wakeup by writing status back and then
         enter a HLT loop.

      3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
         APs in INIT state.

         HLT is not a guarantee that an AP won't wake up and resume
         execution. HLT is woken up by NMI and SMI. SMI puts the CPU
         back into HLT (+/- firmware bugs), but NMI is delivered to the
         CPU which executes the NMI handler. Same issue as the MWAIT
         scenario described above.

         Sending an INIT/INIT sequence to the APs puts them into wait
         for STARTUP state, which is safe against NMI.

     There is still an issue remaining which can't be fixed: #MCE

     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.

     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
     #MCE to shut down the machine.

     So there is a choice between fire (HLT) and frying pan (INIT).
     Frying pan has been chosen as it's at least preventing the NMI
     issue.

     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed"

* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Put CPUs into INIT on shutdown if possible
  x86/smp: Split sending INIT IPI out into a helper function
  x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  x86/smp: Use dedicated cache-line for mwait_play_dead()
  x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  x86/smp: Dont access non-existing CPUID leaf
  x86/smp: Make stop_other_cpus() more robust
2023-06-26 14:45:53 -07:00
Linus Torvalds
9244724fbf A large update for SMP management:
- Parallel CPU bringup
 
     The reason why people are interested in parallel bringup is to shorten
     the (kexec) reboot time of cloud servers to reduce the downtime of the
     VM tenants.
 
     The current fully serialized bringup does the following per AP:
 
       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state
 
     There are two significant delays:
 
       #3 The time for an AP to report alive state in start_secondary() on
          x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.
 
       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending on
          the microcode patch size to apply.
 
     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to come
     up and apply microcode. That's more than 80% of the actual onlining
     procedure.
 
     This can be reduced significantly by splitting the bringup mechanism
     into two parts:
 
       1) Run the prepare callbacks and kick the AP alive for each AP which
       	 needs to be brought up.
 
 	 The APs wake up, do their firmware initialization and run the low
       	 level kernel startup code including microcode loading in parallel
       	 up to the first synchronization point. (#1 and #2 above)
 
       2) Run the rest of the bringup code strictly serialized per CPU
       	 (#3 - #5 above) as it's done today.
 
 	 Parallelizing that stage of the CPU bringup might be possible in
 	 theory, but it's questionable whether required surgery would be
 	 justified for a pretty small gain.
 
     If the system is large enough the first AP is already waiting at the
     first synchronization point when the boot CPU finished the wake-up of
     the last AP. That reduces the AP bringup time on that SKL from ~800ms
     to ~80ms, i.e. by a factor ~10x.
 
     The actual gain varies wildly depending on the system, CPU, microcode
     patch size and other factors. There are some opportunities to reduce
     the overhead further, but that needs some deep surgery in the x86 CPU
     bringup code.
 
     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.
 
   - Enhancements for SMP function call tracing so it is possible to locate
     the scheduling and the actual execution points. That allows to measure
     IPI delivery time precisely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
 FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
 zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
 wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
 K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
 a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
 dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
 TRiSzvssbYYmaw==
 =Y8if
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:
 "A large update for SMP management:

   - Parallel CPU bringup

     The reason why people are interested in parallel bringup is to
     shorten the (kexec) reboot time of cloud servers to reduce the
     downtime of the VM tenants.

     The current fully serialized bringup does the following per AP:

       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state

     There are two significant delays:

       #3 The time for an AP to report alive state in start_secondary()
          on x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.

       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending
          on the microcode patch size to apply.

     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to
     come up and apply microcode. That's more than 80% of the actual
     onlining procedure.

     This can be reduced significantly by splitting the bringup
     mechanism into two parts:

       1) Run the prepare callbacks and kick the AP alive for each AP
          which needs to be brought up.

          The APs wake up, do their firmware initialization and run the
          low level kernel startup code including microcode loading in
          parallel up to the first synchronization point. (#1 and #2
          above)

       2) Run the rest of the bringup code strictly serialized per CPU
          (#3 - #5 above) as it's done today.

          Parallelizing that stage of the CPU bringup might be possible
          in theory, but it's questionable whether required surgery
          would be justified for a pretty small gain.

     If the system is large enough the first AP is already waiting at
     the first synchronization point when the boot CPU finished the
     wake-up of the last AP. That reduces the AP bringup time on that
     SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

     The actual gain varies wildly depending on the system, CPU,
     microcode patch size and other factors. There are some
     opportunities to reduce the overhead further, but that needs some
     deep surgery in the x86 CPU bringup code.

     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.

   - Enhancements for SMP function call tracing so it is possible to
     locate the scheduling and the actual execution points. That allows
     to measure IPI delivery time precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
2023-06-26 13:59:56 -07:00
Thomas Gleixner
45e34c8af5 x86/smp: Put CPUs into INIT on shutdown if possible
Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
6087dd5e86 x86/smp: Split sending INIT IPI out into a helper function
Putting CPUs into INIT is a safer place during kexec() to park CPUs.

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20230615193330.551157083@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
d7893093a7 x86/smp: Cure kexec() vs. mwait_play_dead() breakage
TLDR: It's a mess.

When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offlined CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
f9c9987bf5 x86/smp: Use dedicated cache-line for mwait_play_dead()
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all what is needed is a cache line which is not written
by other CPUs.

But there is a use case where a "dead" CPU needs to be brought out of
MWAIT: kexec().

This is required as kexec() can overwrite text, pagetables, stacks and the
monitored cacheline of the original kernel. The latter causes MWAIT to
resume execution which obviously causes havoc on the kexec kernel which
results usually in triple faults.

Use a dedicated per CPU storage to prepare for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de
2023-06-20 14:51:47 +02:00
Peter Zijlstra
8f2d6c41e5 x86/sched: Rewrite topology setup
Instead of having a number of fixed topologies to pick from; build one
on the fly. This is both simpler now and simpler to extend in the
future.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230601153522.GB559993%40hirez.programming.kicks-ass.net
2023-06-05 21:11:03 +02:00
Thomas Gleixner
ff3cfcb0d4 x86/smpboot: Fix the parallel bringup decision
The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.

The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.

As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.

Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx
2023-05-31 16:49:34 +02:00
Thomas Gleixner
0c7ffa32db x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
Implement the validation function which tells the core code whether
parallel bringup is possible.

The only condition for now is that the kernel does not run in an encrypted
guest as these will trap the RDMSR via #VC, which cannot be handled at that
point in early startup.

There was an earlier variant for AMD-SEV which used the GHBC protocol for
retrieving the APIC ID via CPUID, but there is no guarantee that the
initial APIC ID in CPUID is the same as the real APIC ID. There is no
enforcement from the secure firmware and the hypervisor can assign APIC IDs
as it sees fit as long as the ACPI/MADT table is consistent with that
assignment.

Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
AMD-SEV guests for parallel startup needs some more thought.

Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
the scope of this change.

Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.

[ mikelley: Reported the announce_cpu() fallout

Originally-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.467571745@linutronix.de
2023-05-15 13:45:05 +02:00
David Woodhouse
7e75178a09 x86/smpboot: Support parallel startup of secondary CPUs
In parallel startup mode the APs are kicked alive by the control CPU
quickly after each other and run through the early startup code in
parallel. The real-mode startup code is already serialized with a
bit-spinlock to protect the real-mode stack.

In parallel startup mode the smpboot_control variable obviously cannot
contain the Linux CPU number so the APs have to determine their Linux CPU
number on their own. This is required to find the CPUs per CPU offset in
order to find the idle task stack and other per CPU data.

To achieve this, export the cpuid_to_apicid[] array so that each AP can
find its own CPU number by searching therein based on its APIC ID.

Introduce a flag in the top bits of smpboot_control which indicates that
the AP should find its CPU number by reading the APIC ID from the APIC.

This is required because CPUID based APIC ID retrieval can only provide the
initial APIC ID, which might have been overruled by the firmware. Some AMD
APUs come up with APIC ID = initial APIC ID + 0x10, so the APIC ID to CPU
number lookup would fail miserably if based on CPUID. Also virtualization
can make its own APIC ID assignements. The only requirement is that the
APIC IDs are consistent with the APCI/MADT table.

For the boot CPU or in case parallel bringup is disabled the control bits
are empty and the CPU number is directly available in bit 0-23 of
smpboot_control.

[ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ]
[ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ]
[ seanc: Fix stray override of initial_gs in common_cpu_up() ]
[ Oleksandr Natalenko: reported suspend/resume issue fixed in
  x86_acpi_suspend_lowlevel ]
[ tglx: Make it read the APIC ID from the APIC instead of using CPUID,
  	split the bitlock part out ]

Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.411554373@linutronix.de
2023-05-15 13:45:04 +02:00
Thomas Gleixner
f54d4434c2 x86/apic: Provide cpu_primary_thread mask
Make the primary thread tracking CPU mask based in preparation for simpler
handling of parallel bootup.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.186599880@linutronix.de
2023-05-15 13:45:02 +02:00
Thomas Gleixner
8b5a0f957c x86/smpboot: Enable split CPU startup
The x86 CPU bringup state currently does AP wake-up, wait for AP to
respond and then release it for full bringup.

It is safe to be split into a wake-up and and a separate wait+release
state.

Provide the required functions and enable the split CPU bringup, which
prepares for parallel bringup, where the bringup of the non-boot CPUs takes
two iterations: One to prepare and wake all APs and the second to wait and
release them. Depending on timing this can eliminate the wait time
completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.133453992@linutronix.de
2023-05-15 13:45:01 +02:00
Thomas Gleixner
2711b8e2b7 x86/smpboot: Switch to hotplug core state synchronization
The new AP state tracking and synchronization mechanism in the CPU hotplug
core code allows to remove quite some x86 specific code:

  1) The AP alive synchronization based on cpumasks

  2) The decision whether an AP can be brought up again

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.529657366@linutronix.de
2023-05-15 13:44:56 +02:00
Thomas Gleixner
e464640cf7 x86/smpboot: Remove wait for cpu_online()
Now that the core code drops sparse_irq_lock after the idle thread
synchronized, it's pointless to wait for the AP to mark itself online.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.316417181@linutronix.de
2023-05-15 13:44:54 +02:00
Thomas Gleixner
c8b7fb09d1 x86/smpboot: Remove cpu_callin_mask
Now that TSC synchronization is SMP function call based there is no reason
to wait for the AP to be set in smp_callin_mask. The control CPU waits for
the AP to set itself in the online mask anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.206394064@linutronix.de
2023-05-15 13:44:53 +02:00
Thomas Gleixner
9d349d47f0 x86/smpboot: Make TSC synchronization function call based
Spin-waiting on the control CPU until the AP reaches the TSC
synchronization is just a waste especially in the case that there is no
synchronization required.

As the synchronization has to run with interrupts disabled the control CPU
part can just be done from a SMP function call. The upcoming AP issues that
call async only in the case that synchronization is required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de
2023-05-15 13:44:53 +02:00
Thomas Gleixner
d4f28f07c2 x86/smpboot: Move synchronization masks to SMP boot code
The usage is in smpboot.c and not in the CPU initialization code.

The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer
waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be
made static too.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de
2023-05-15 13:44:52 +02:00
Thomas Gleixner
e94cd1503b x86/smpboot: Get rid of cpu_init_secondary()
The synchronization of the AP with the control CPU is a SMP boot problem
and has nothing to do with cpu_init().

Open code cpu_init_secondary() in start_secondary() and move
wait_for_master_cpu() into the SMP boot code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de
2023-05-15 13:44:51 +02:00
David Woodhouse
2b3be65d2e x86/smpboot: Split up native_cpu_up() into separate phases and document them
There are four logical parts to what native_cpu_up() does on the BSP (or
on the controlling CPU for a later hotplug):

 1) Wake the AP by sending the INIT/SIPI/SIPI sequence.

 2) Wait for the AP to make it as far as wait_for_master_cpu() which
    sets that CPU's bit in cpu_initialized_mask, then sets the bit in
    cpu_callout_mask to let the AP proceed through cpu_init().

 3) Wait for the AP to finish cpu_init() and get as far as the
    smp_callin() call, which sets that CPU's bit in cpu_callin_mask.

 4) Perform the TSC synchronization and wait for the AP to actually
    mark itself online in cpu_online_mask.

In preparation to allow these phases to operate in parallel on multiple
APs, split them out into separate functions and document the interactions
a little more clearly in both the BP and AP code paths.

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.928917242@linutronix.de
2023-05-15 13:44:51 +02:00
Thomas Gleixner
c7f15dd3f0 x86/smpboot: Remove unnecessary barrier()
Peter stumbled over the barrier() after the invocation of smp_callin() in
start_secondary():

  "...this barrier() and it's comment seem weird vs smp_callin(). That
   function ends with an atomic bitop (it has to, at the very least it must
   not be weaker than store-release) but also has an explicit wmb() to order
   setup vs CPU_STARTING.

   There is no way the smp_processor_id() referred to in this comment can land
   before cpu_init() even without the barrier()."

The barrier() along with the comment was added in 2003 with commit
d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those
well documented combo patches of that time which changes world and some
more. The context back then was:

	/*
	 * Dont put anything before smp_callin(), SMP
	 * booting is too fragile that we want to limit the
	 * things done here to the most necessary things.
	 */
	cpu_init();
	smp_callin();

+	/* otherwise gcc will move up smp_processor_id before the cpu_init */
+ 	barrier();

	Dprintk("cpu %d: waiting for commence\n", smp_processor_id());

Even back in 2003 the compiler was not allowed to reorder that
smp_processor_id() invocation before the cpu_init() function call.
Especially not as smp_processor_id() resolved to:

  asm volatile("movl %%gs:%c1,%0":"=r" (ret__):"i"(pda_offset(field)):"memory");

There is no trace of this change in any mailing list archive including the
back then official x86_64 list discuss@x86-64.org, which would explain the
problem this change solved.

The debug prints are gone by now and the the only smp_processor_id()
invocation today is farther down in start_secondary() after locking
vector_lock which itself prevents reordering.

Even if the compiler would be allowed to reorder this, the code would still
be correct as GSBASE is set up early in the assembly code and is valid when
the CPU reaches start_secondary(), while the code at the time when this
barrier was added did the GSBASE setup in cpu_init().

As the barrier has zero value, remove it.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.875713771@linutronix.de
2023-05-15 13:44:50 +02:00
Thomas Gleixner
5475abbde7 x86/smpboot: Remove the CPU0 hotplug kludge
This was introduced with commit e1c467e69040 ("x86, hotplug: Wake up CPU0
via NMI instead of INIT, SIPI, SIPI") to eventually support physical
hotplug of CPU0:

 "We'll change this code in the future to wake up hard offlined CPU0 if
  real platform and request are available."

11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.768845190@linutronix.de
2023-05-15 13:44:49 +02:00
Thomas Gleixner
134a12827b x86/smpboot: Avoid pointless delay calibration if TSC is synchronized
When TSC is synchronized across sockets then there is no reason to
calibrate the delay for the first CPU which comes up on a socket.

Just reuse the existing calibration value.

This removes 100ms pointlessly wasted time from CPU hotplug per socket.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.608773568@linutronix.de
2023-05-15 13:44:48 +02:00
Thomas Gleixner
ba831b7b1a cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __init
No point in keeping them around.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de
2023-05-15 13:44:47 +02:00
Thomas Gleixner
5107e3ebb8 x86/smpboot: Cleanup topology_phys_to_logical_pkg()/die()
Make topology_phys_to_logical_pkg_die() static as it's only used in
smpboot.c and fixup the kernel-doc warnings for both functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de
2023-05-15 13:44:47 +02:00
Chen Yu
044f0e27de x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the SD_ASYM_
PACKING flag to give higher priority to the die with CPUs of higher maximum
frequency.

Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:38 +02:00
Ricardo Neri
995998ebde x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:37 +02:00
Linus Torvalds
2aff7c706c Objtool changes for v6.4:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did
    this inconsistently follow this new, common convention, and fix all the fallout
    that objtool can now detect statically.
 
  - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity,
    split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it.
 
  - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code.
 
  - Generate ORC data for __pfx code
 
  - Add more __noreturn annotations to various kernel startup/shutdown/panic functions.
 
  - Misc improvements & fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM
 pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul
 TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd
 LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP
 c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT
 /PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a
 1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1
 0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ
 SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p
 vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo
 x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi
 fRho4CqzrtM=
 =NwEV
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Mark arch_cpu_idle_dead() __noreturn, make all architectures &
   drivers that did this inconsistently follow this new, common
   convention, and fix all the fallout that objtool can now detect
   statically

 - Fix/improve the ORC unwinder becoming unreliable due to
   UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
   and UNWIND_HINT_UNDEFINED to resolve it

 - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code

 - Generate ORC data for __pfx code

 - Add more __noreturn annotations to various kernel startup/shutdown
   and panic functions

 - Misc improvements & fixes

* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/hyperv: Mark hv_ghcb_terminate() as noreturn
  scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
  x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
  btrfs: Mark btrfs_assertfail() __noreturn
  objtool: Include weak functions in global_noreturns check
  cpu: Mark nmi_panic_self_stop() __noreturn
  cpu: Mark panic_smp_self_stop() __noreturn
  arm64/cpu: Mark cpu_park_loop() and friends __noreturn
  x86/head: Mark *_start_kernel() __noreturn
  init: Mark start_kernel() __noreturn
  init: Mark [arch_call_]rest_init() __noreturn
  objtool: Generate ORC data for __pfx code
  x86/linkage: Fix padding for typed functions
  objtool: Separate prefix code from stack validation code
  objtool: Remove superfluous dead_end_function() check
  objtool: Add symbol iteration helpers
  objtool: Add WARN_INSN()
  scripts/objdump-func: Support multiple functions
  context_tracking: Fix KCSAN noinstr violation
  objtool: Add stackleak instrumentation to uaccess safe list
  ...
2023-04-28 14:02:54 -07:00
Josh Poimboeuf
52668badd3 x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
Fixes the following warning:

  vmlinux.o: warning: objtool: resume_play_dead+0x21: unreachable instruction

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/ce1407c4bf88b1334fe40413126343792a77ca50.1681342859.git.jpoimboe@kernel.org
2023-04-14 17:31:27 +02:00
David Woodhouse
805ae9dc3b x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
When bringing up a secondary CPU from do_boot_cpu(), the warm reset flag
is set in CMOS and the starting IP for the trampoline written inside the
BDA at 0x467. Once the CPU is running, the CMOS flag is unset and the
value in the BDA cleared.

To allow for parallel bringup of CPUs, add a reference count to track the
number of CPUs currently bring brought up, and clear the state only when
the count reaches zero.

Since the RTC spinlock is required to write to the CMOS, it can be used
for mutual exclusion on the refcount too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://lore.kernel.org/r/20230316222109.1940300-5-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
8f6be6d870 x86/smpboot: Remove initial_gs
Given its CPU#, each CPU can find its own per-cpu offset, and directly set
GSBASE accordingly. The global variable can be eliminated.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-9-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
c253b64020 x86/smpboot: Remove early_gdt_descr on 64-bit
Build the GDT descriptor on the stack instead.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-8-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
3adee777ad x86/smpboot: Remove initial_stack on 64-bit
In order to facilitate parallel startup, start to eliminate some of the
global variables passing information to CPUs in the startup path.

However, start by introducing one more: smpboot_control. For now this
merely holds the CPU# of the CPU which is coming up. Each CPU can then
find its own per-cpu data, and everything else it needs can be found
from there, allowing the other global variables to be removed.

First to be removed is initial_stack. Each CPU can load %rsp from its
current_task->thread.sp instead. That is already set up with the correct
idle thread for APs. Set up the .sp field in INIT_THREAD on x86 so that
the BSP also finds a suitable stack pointer in the static per-cpu data
when coming up on first boot.

On resume from S3, the CPU needs a temporary stack because its idle task
is already active. Instead of setting initial_stack, the sleep code can
simply set its own current->thread.sp to point to the temporary stack.
Nobody else cares about ->thread.sp for a thread which is currently on
a CPU, because the true value is actually in the %rsp register. Which
is restored with the rest of the CPU context in do_suspend_lowlevel().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-7-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Srivatsa S. Bhat (VMware)
fcb3a81d22 x86/hotplug: Remove incorrect comment about mwait_play_dead()
The comment that says mwait_play_dead() returns only on failure is a bit
misleading because mwait_play_dead() could actually return for valid
reasons (such as mwait not being supported by the platform) that do not
indicate a failure of the CPU offline operation. So, remove the comment.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230128003751.141317-1-srivatsa@csail.mit.edu
2023-02-14 23:44:34 +01:00
Linus Torvalds
94a855111e - Add the call depth tracking mitigation for Retbleed which has
been long in the making. It is a lighterweight software-only fix for
 Skylake-based cores where enabling IBRS is a big hammer and causes a
 significant performance impact.
 
 What it basically does is, it aligns all kernel functions to 16 bytes
 boundary and adds a 16-byte padding before the function, objtool
 collects all functions' locations and when the mitigation gets applied,
 it patches a call accounting thunk which is used to track the call depth
 of the stack at any time.
 
 When that call depth reaches a magical, microarchitecture-specific value
 for the Return Stack Buffer, the code stuffs that RSB and avoids its
 underflow which could otherwise lead to the Intel variant of Retbleed.
 
 This software-only solution brings a lot of the lost performance back,
 as benchmarks suggest:
 
   https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
 
 That page above also contains a lot more detailed explanation of the
 whole mechanism
 
 - Implement a new control flow integrity scheme called FineIBT which is
 based on the software kCFI implementation and uses hardware IBT support
 where present to annotate and track indirect branches using a hash to
 validate them
 
 - Other misc fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
 VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
 VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
 gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
 iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
 Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
 ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
 ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
 73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
 i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
 Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
 zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
 =cRy1
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Add the call depth tracking mitigation for Retbleed which has been
   long in the making. It is a lighterweight software-only fix for
   Skylake-based cores where enabling IBRS is a big hammer and causes a
   significant performance impact.

   What it basically does is, it aligns all kernel functions to 16 bytes
   boundary and adds a 16-byte padding before the function, objtool
   collects all functions' locations and when the mitigation gets
   applied, it patches a call accounting thunk which is used to track
   the call depth of the stack at any time.

   When that call depth reaches a magical, microarchitecture-specific
   value for the Return Stack Buffer, the code stuffs that RSB and
   avoids its underflow which could otherwise lead to the Intel variant
   of Retbleed.

   This software-only solution brings a lot of the lost performance
   back, as benchmarks suggest:

       https://lore.kernel.org/all/20220915111039.092790446@infradead.org/

   That page above also contains a lot more detailed explanation of the
   whole mechanism

 - Implement a new control flow integrity scheme called FineIBT which is
   based on the software kCFI implementation and uses hardware IBT
   support where present to annotate and track indirect branches using a
   hash to validate them

 - Other misc fixes and cleanups

* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  x86/paravirt: Use common macro for creating simple asm paravirt functions
  x86/paravirt: Remove clobber bitmask from .parainstructions
  x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
  x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
  x86/Kconfig: Enable kernel IBT by default
  x86,pm: Force out-of-line memcpy()
  objtool: Fix weak hole vs prefix symbol
  objtool: Optimize elf_dirty_reloc_sym()
  x86/cfi: Add boot time hash randomization
  x86/cfi: Boot time selection of CFI scheme
  x86/ibt: Implement FineIBT
  objtool: Add --cfi to generate the .cfi_sites section
  x86: Add prefix symbols for function padding
  objtool: Add option to generate prefix symbols
  objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
  objtool: Slice up elf_create_section_symbol()
  kallsyms: Revert "Take callthunks into account"
  x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
  x86/retpoline: Fix crash printing warning
  x86/paravirt: Fix a !PARAVIRT build warning
  ...
2022-12-14 15:03:00 -08:00
Linus Torvalds
3ef3ace4e2 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do not
 support the cache disabling dance when setting up MTRRs so they fall
 under the same category.) This is a cleanup work to remove all the ugly
 workarounds for such guests and init things separately (Juergen Gross)
 
 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
 Performance Bias, leading to power savings
 
 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur
 CPUs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe
 VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt
 cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3
 kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I
 9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD
 fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd
 exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp
 S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC
 0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ
 8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq
 An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX
 KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ=
 =Zmgh
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
   guests which do not get MTRRs exposed but only PAT. (TDX guests do
   not support the cache disabling dance when setting up MTRRs so they
   fall under the same category)

   This is a cleanup work to remove all the ugly workarounds for such
   guests and init things separately (Juergen Gross)

 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
   Performance Bias, leading to power savings

 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern
   Centaur CPUs

* tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  x86/mtrr: Make message for disabled MTRRs more descriptive
  x86/pat: Handle TDX guest PAT initialization
  x86/cpuid: Carve out all CPUID functionality
  x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV
  x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
  x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack()
  x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()
  x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h
  x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs
  x86/mtrr: Simplify mtrr_ops initialization
  x86/cacheinfo: Switch cache_ap_init() to hotplug callback
  x86: Decouple PAT and MTRR handling
  x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
  x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
  x86/mtrr: Get rid of __mtrr_enabled bool
  x86/mtrr: Simplify mtrr_bp_init()
  x86/mtrr: Remove set_all callback from struct mtrr_ops
  x86/mtrr: Disentangle MTRR init from PAT init
  x86/mtrr: Move cache control code to cacheinfo.c
  x86/mtrr: Split MTRR-specific handling from cache dis/enabling
  ...
2022-12-13 14:56:56 -08:00
Jason A. Donenfeld
b3883a9a1f stackprotector: move get_random_canary() into stackprotector.h
This has nothing to do with random.c and everything to do with stack
protectors. Yes, it uses randomness. But many things use randomness.
random.h and random.c are concerned with the generation of randomness,
not with each and every use. So move this function into the more
specific stackprotector.h file where it belongs.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:18:10 +01:00
Juergen Gross
30f89e524b x86/cacheinfo: Switch cache_ap_init() to hotplug callback
Instead of explicitly calling cache_ap_init() in
identify_secondary_cpu() use a CPU hotplug callback instead. By
registering the callback only after having started the non-boot CPUs
and initializing cache_aps_delayed_init with "true", calling
set_cache_aps_delayed_init() at boot time can be dropped.

It should be noted that this change results in cache_ap_init() being
called a little bit later when hotplugging CPUs. By using a new
hotplug slot right at the start of the low level bringup this is not
problematic, as no operations requiring a specific caching mode are
performed that early in CPU initialization.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-15-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Juergen Gross
0b9a6a8bed x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
Instead of having a stop_machine() handler for either a specific
MTRR register or all state at once, add a handler just for calling
cache_cpu_init() if appropriate.

Add functions for calling stop_machine() with this handler as well.

Add a generic replacement for mtrr_bp_restore() and a wrapper for
mtrr_bp_init().

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-13-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Juergen Gross
955d0e0805 x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
In order to prepare decoupling MTRR and PAT replace the MTRR-specific
mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init
one.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-12-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Thomas Gleixner
c063a217bc x86/percpu: Move current_top_of_stack next to current_task
Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
2022-10-17 16:41:05 +02:00
Thomas Gleixner
e57ef2ed97 x86: Put hot per CPU variables into a struct
The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
2022-10-17 16:41:03 +02:00
Thomas Gleixner
1f19e2d50b x86/cpu: Get rid of redundant switch_to_new_gdt() invocations
The only place where switch_to_new_gdt() is required is early boot to
switch from the early GDT to the direct GDT. Any other invocation is
completely redundant because it does not change anything.

Secondary CPUs come out of the ASM code with GDT and GSBASE correctly set
up. The same is true for XEN_PV.

Remove all the voodoo invocations which are left overs from the ancient
past, rename the function to switch_gdt_and_percpu_base() and mark it init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.198076128@infradead.org
2022-10-17 16:40:56 +02:00
Yury Norov
38bef8e57f smp: add set_nr_cpu_ids()
In preparation to support compile-time nr_cpu_ids, add a setter for
the variable.

This is a no-op for all arches.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19 17:51:53 -07:00
Sander Vanheule
adbcaef840 x86/cacheinfo: move shared cache map definitions
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4.

On uniprocessor builds, it is currently assumed that any cpumask will
contain the single CPU: cpu0.  This assumption is used to provide
optimised implementations.

The current assumption also appears to be wrong, by ignoring the fact that
users can provide empty cpumasks.  This can result in bugs as explained in
[1] - for_each_cpu() will run one iteration of the loop even when passed
an empty cpumask.

This series introduces some basic tests, and updates the optimisations for
uniprocessor builds.

The x86 patch was written after the kernel test robot [2] ran into a
failed build.  I have tried to list the files potentially affected by the
changes to cpumask.h, in an attempt to find any other cases that fail on
!SMP.  I've gone through some of the files manually, and ran a few cross
builds, but nothing else popped up.  I (build) checked about half of the
potientally affected files, but I do not have the resources to do them
all.  I hope we can fix other issues if/when they pop up later.

[1] https://lore.kernel.org/all/20220530082552.46113-1-sander@svanheule.net/
[2] https://lore.kernel.org/all/202206060858.wA0FOzRy-lkp@intel.com/


This patch (of 5):

The maps to keep track of shared caches between CPUs on SMP systems are
declared in asm/smp.h, among them specifically cpu_llc_shared_map.  These
maps are externally defined in cpu/smpboot.c.  The latter is only compiled
on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do
not have a corresponding definition on uniprocessor builds.

The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map
declaration mentioned above.  This function is referenced in cacheinfo.c
inside for_each_cpu() loop macros, to provide cpumask for the loop.  On
uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist.
However, the current implementation of for_each_cpu() also (wrongly)
ignores the provided mask.

By sheer luck, the compiler thus optimises out this unused reference to
cpu_llc_shared_map, and the linker therefore does not require the
cpu_llc_shared_mask to actually exist on uniprocessor builds.  Only on SMP
bulids does smpboot.o exist to provide the required symbols.

To no longer rely on compiler optimisations for successful uniprocessor
builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map
from smpboot.c to cacheinfo.c.

Link: https://lkml.kernel.org/r/cover.1656777646.git.sander@svanheule.net
Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:31:40 -07:00
Linus Torvalds
a13dc4d409 - Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for unnecessary IPIs
 
 - Finally remove a.out support
 
 - The usual trivial cleanups and fixes all over x86
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
 VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
 XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
 qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
 /3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
 0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
 BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
 M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
 xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
 2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
 hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
 TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
 =Dks/
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Serious sanitization and cleanup of the whole APERF/MPERF and
   frequency invariance code along with removing the need for
   unnecessary IPIs

 - Finally remove a.out support

 - The usual trivial cleanups and fixes all over x86

* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86: Remove empty files
  x86/speculation: Add missing srbds=off to the mitigations= help text
  x86/prctl: Remove pointless task argument
  x86/aperfperf: Make it correct on 32bit and UP kernels
  x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  x86/aperfmperf: Replace arch_freq_get_on_cpu()
  x86/aperfmperf: Replace aperfmperf_get_khz()
  x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  x86/aperfmperf: Make parts of the frequency invariance code unconditional
  x86/aperfmperf: Restructure arch_scale_freq_tick()
  x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  x86/aperfmperf: Separate AP/BP frequency invariance init
  x86/smp: Move APERF/MPERF code where it belongs
  x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  x86/process: Fix kernel-doc warning due to a changed function name
  x86: Remove a.out support
  x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
  x86: Replace cpumask_weight() with cpumask_empty() where appropriate
  x86/pkeys: Remove __arch_set_user_pkey_access() declaration
  ...
2022-05-23 18:17:09 -07:00
Linus Torvalds
3a755ebcc2 Intel Trust Domain Extensions
This is the Intel version of a confidential computing solution called
 Trust Domain Extensions (TDX). This series adds support to run the
 kernel as part of a TDX guest. It provides similar guest protections to
 AMD's SEV-SNP like guest memory and register state encryption, memory
 integrity protection and a lot more.
 
 Design-wise, it differs from AMD's solution considerably: it uses
 a software module which runs in a special CPU mode called (Secure
 Arbitration Mode) SEAM. As the name suggests, this module serves as sort
 of an arbiter which the confidential guest calls for services it needs
 during its lifetime.
 
 Just like AMD's SNP set, this series reworks and streamlines certain
 parts of x86 arch code so that this feature can be properly accomodated.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
 VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
 kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
 pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
 khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
 r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
 fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
 HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
 IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
 tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
 HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
 VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
 =ac9K
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
 "Intel Trust Domain Extensions (TDX) support.

  This is the Intel version of a confidential computing solution called
  Trust Domain Extensions (TDX). This series adds support to run the
  kernel as part of a TDX guest. It provides similar guest protections
  to AMD's SEV-SNP like guest memory and register state encryption,
  memory integrity protection and a lot more.

  Design-wise, it differs from AMD's solution considerably: it uses a
  software module which runs in a special CPU mode called (Secure
  Arbitration Mode) SEAM. As the name suggests, this module serves as
  sort of an arbiter which the confidential guest calls for services it
  needs during its lifetime.

  Just like AMD's SNP set, this series reworks and streamlines certain
  parts of x86 arch code so that this feature can be properly
  accomodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/tdx: Fix RETs in TDX asm
  x86/tdx: Annotate a noreturn function
  x86/mm: Fix spacing within memory encryption features message
  x86/kaslr: Fix build warning in KASLR code in boot stub
  Documentation/x86: Document TDX kernel architecture
  ACPICA: Avoid cache flush inside virtual machines
  x86/tdx/ioapic: Add shared bit for IOAPIC base address
  x86/mm: Make DMA memory shared for TD guest
  x86/mm/cpa: Add support for TDX shared memory
  x86/tdx: Make pages shared in ioremap()
  x86/topology: Disable CPU online/offline control for TDX guests
  x86/boot: Avoid #VE during boot for TDX platforms
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/acpi/x86/boot: Add multiprocessor wake-up support
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Port I/O: Add early boot support
  x86/tdx: Port I/O: Add runtime hypercalls
  x86/boot: Port I/O: Add decompression-time support for TDX
  x86/boot: Port I/O: Allow to hook up alternative helpers
  ...
2022-05-23 17:51:12 -07:00
Thomas Gleixner
bb6e89df90 x86/aperfmperf: Make parts of the frequency invariance code unconditional
The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getters function do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step split out the non frequency invariance specific
initialization code and the read MSR portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de
2022-04-27 20:22:19 +02:00