44245 Commits

Author SHA1 Message Date
Linus Torvalds
3e5822e0f9 - Implement a rename operation in resctrlfs to facilitate handling
of application containers with dynamically changing task lists
 
 - When reading the tasks file, show the tasks' pid which are only in
   the current namespace as opposed to showing the pids from the init
   namespace too
 
 - Other fixes and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZzAwACgkQEsHwGGHe
 VUqp6g//dJ3OMAj8q0g9TO5M9caPGtY67tP488dvIhcemvRwwsSFr/qiXHB35l0c
 sCbmXVvF0lhaGbEIE94VZyKN7GXpvSKof29lJ2zaB/cgc4qbkm0wzkDA6e3j11xT
 OXB8R/cU4FXm1nQ0irT9Bf8w4KrpWr8f3SVbLQkGsc9+vYaSMZHbFIvZ1RmDaFBU
 T7WtpmRgfr97updmd4QkkBsHfIUNK/4HamGVBUKsdYX/seYuffRLKHzuiBqr7kDr
 WKqdR3iVOqbLFQbH2QIXuAL9+29Z2lfMVUD04e0I62TU6KCckrdObBQ67WgXGAgG
 GzCsbsNf1So7nIQmaMZSbR3OeuifXOzQPlFPtIh52SSVyafl1I66nw1tkTsMAqkd
 waqaB2dSLFHTND8hE7pUHdz84RFaGoE9/O6JiSxt0qkXQIycJY6hBthLxw76XQe7
 HnUqyL/0t7H5FT7TEwbQ26cNGFghA87x4fCc2AolIrZFK/chtPnvyFz3oTvSyLW2
 J1YhdJ+mcid70cvGmhe9w9OjFI1e4O22l1uc9jJyTPwWPk6/IxKeT9mx+bJusT7d
 mHvglK60L/x9NQHeV7FZIM9NXKhePLGi84aaE+Ly8ZOhoWcRmJTuEN/CGq+Qyuks
 KmDhvGIfd4GKRxOwMuKHQEn8WfyRvX5YDvhU24V2Zzb8I3nyFQQ=
 =nu66
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:

 - Implement a rename operation in resctrlfs to facilitate handling of
   application containers with dynamically changing task lists

 - When reading the tasks file, show the tasks' pid which are only in
   the current namespace as opposed to showing the pids from the init
   namespace too

 - Other fixes and improvements

* tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/x86: Documentation for MON group move feature
  x86/resctrl: Implement rename op for mon groups
  x86/resctrl: Factor rdtgroup lock for multi-file ops
  x86/resctrl: Only show tasks' pid in current pid namespace
2023-06-26 15:29:21 -07:00
Linus Torvalds
59035135b3 - Remove relocation information from vmlinux as it is not needed by
other tooling and thus a slimmer binary is generated. This is
   important for distros who have to distribute vmlinux blobs with their
   kernel packages too and that extraneous unnecessary data bloats them
   for no good reason
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZv+wACgkQEsHwGGHe
 VUpdzhAAvUjNy0I0+kezbE1+o0bA1CKQt06oFrJYUZGk0FEjNrgEkr4Tlgrv+blF
 RU/SNBqvFwsonKtLZkB7nTse+aKu6LO1SKqgw0y67Hi98VL3j6GqTqvuaWALGgAs
 Zh2hcOLmkVdKoj1vnc2VP8Z6nqifsJhDANAkmrKGypJq+FjEubMfu+/zKTRgi0sv
 n0yOQNRB0aXtyMwa5nv5y2fU/Cp0ibvdkxEtCuGh8OKG+yth6lGxqR+mnlrqmVxR
 2ATdgMZfnPSNs5MoggAY8zahsdG+npWkcyTzXjk4KfFRVQM269JN5uHPPrRkdMAm
 RDN+4aZmvQ8NxsT2U151IhkAiuGTCGlAe5x97KX+33juUZoc67/UhIdytmU3jNyl
 33QnkeB8F+s/iNfFbCg9itThvTcWd0ER5zAaNGeoZTaIhpIU6TmBWuJJnbr/hqLa
 ckozuIk1wSXKWJheMRPbTBMxCqI3AmP1L3kIgfayHInS/J8QNMT8YmYUDyRb7GkJ
 bhsiN/GWtMqTt+bJxYkHHNuDBPCH8tKKneO2LG/KFTUnW7FYLOgQZQf0bO3mNe+C
 SVmGjUuSZhM47RiVEmaaLPppBu82CU0SZkt6pp4FuQWkVnwCZgPA/4MF0vTVc5M+
 f2V99h3FkTXi75RP7GfVCKnwsFalu7NOyJer6KIm8Hh8x4RTTHo=
 =k9Pz
 -----END PGP SIGNATURE-----

Merge tag 'x86_build_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 build update from Borislav Petkov:

 - Remove relocation information from vmlinux as it is not needed by
   other tooling and thus a slimmer binary is generated.

   This is important for distros who have to distribute vmlinux blobs
   with their kernel packages too and that extraneous unnecessary data
   bloats them for no good reason

* tag 'x86_build_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Avoid relocation information in final vmlinux
2023-06-26 15:25:07 -07:00
Linus Torvalds
8c69e7afe9 - Up until now the Fast Short Rep Mov optimizations implied the presence
of the ERMS CPUID flag. AMD decoupled them with a BIOS setting so decouple
   that dependency in the kernel code too
 
 - Teach the alternatives machinery to handle relocations
 
 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in
 
 - Other fixes, cleanups and optimizations to the patching code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZi2AACgkQEsHwGGHe
 VUqhGw/9EC/m5HTFBlCy9PS5Qy6pPLzmHR5Tuy4meqlnB1gN+5wzfxdYEwHm46hH
 SR6WqR12yVaCMIzh66y8nTJyMbIykaBbfFJb3WesdDrBIYUZ9f+7O+Xd0JS6Jykd
 2HBHOyaVS1/W75+y6w9JhTExBH5xieCpJVIYyAvifbn/pB8XmuTTwJ1Z3EJ8DzkK
 AN16i46bUiKNBdTYZUMhtKL4vHVfqLYMskgWe6IG7DmRLOwikR0uRVhuVqP/bmUj
 U128cUacGJT2AYbZarTAKmOa42nDj3TpJqRp1qit3y6Cun4vxKH+1A91UPd7IHTa
 M5H1bNSgfXMm8rU+JgfvXKqrCTckGn2OqlCkJfPV3RBeP9IcQBBF0vE3dnM/X2We
 dwbXeDfJvc+1s4/M41MOhyahTUbW+4iRK5UCZEt1mprTbtzHTlN7RROo7QLpFsWx
 T0Jqvsd1raAutPTgTjU7ToQwDpSQNnn4Y/KoEdpvOCXR8wU7Wo5/+Qa4tEkIY3W6
 mUFpJcgFC9QEKLuaNAofPIhMuZ/vzRVtpK7wbLn4KR5JZA8AxznenMFVg8YPWRFI
 4oga0kMFJ7t6z/CXHtrxFaLQ9e7WAUSRU6gPiz8As1F/K9N0JWMUfjuTJcgjUsF8
 bwdCNinwG8y3rrPUCrqbO5N766ZkLYd6NksKlmIyUvtCcS0ksbg=
 =mH38
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 instruction alternatives updates from Borislav Petkov:

 - Up until now the Fast Short Rep Mov optimizations implied the
   presence of the ERMS CPUID flag. AMD decoupled them with a BIOS
   setting so decouple that dependency in the kernel code too

 - Teach the alternatives machinery to handle relocations

 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in

 - Other fixes, cleanups and optimizations to the patching code

* tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternative: PAUSE is not a NOP
  x86/alternatives: Add cond_resched() to text_poke_bp_batch()
  x86/nospec: Shorten RESET_CALL_DEPTH
  x86/alternatives: Add longer 64-bit NOPs
  x86/alternatives: Fix section mismatch warnings
  x86/alternative: Optimize returns patching
  x86/alternative: Complicate optimize_nops() some more
  x86/alternative: Rewrite optimize_nops() some
  x86/lib/memmove: Decouple ERMS from FSRM
  x86/alternative: Support relocations in alternatives
  x86/alternative: Make debug-alternative selective
2023-06-26 15:14:55 -07:00
Linus Torvalds
aa35a4835e - Add initial support for RAS hardware found on AMD server GPUs (MI200).
Those GPUs and CPUs are connected together through the coherent fabric
   and the GPU memory controllers report errors through x86's MCA so EDAC
   needs to support them. The amd64_edac driver supports now HBM (High
   Bandwidth Memory) and thus such heterogeneous memory controller
   systems
 
 - Other small cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZiUwACgkQEsHwGGHe
 VUphSQ/+JLXTAQ06CNos98MR8iCGdThVujhWt1pBIgjhQFJuf4JlEEtKs9htjbud
 9HZvgnGbHahRoO8pMCB0jwtz0ATrPbaOvz4BofVp3SIRiR5jMI0tfmyl8iSrnA3Q
 m5pbMh6uiIAlH8aPqQXret2iwp7JXOjnBWksgbmUWkI7d2qseKu98ikXyC4QoCaD
 AGRJJ6OCA3P85rdT9qabOuXh6yoELOPKw3j243s22sTLiqn+EuoTE+QX5ZjrQ8Ts
 DyXN/pYI/vGVP7sECkWf7PsEf1BkL6m5KeXDB4Ij2YJesQnBlBZQdAcxdGdY8z3M
 f/qpLdrYvpcLHQy42Jm5VnnISOvMvAl8YWqCEyUmBjXcLwSPNIKHN9LQuznhnQHr
 vssRVqQUg1J+/UWAoIzHdrAQ6zvgv1xlX2dG2YOw3t1WMDnMhztW3eoQv04etD3d
 fqQH3MrkGHI4qeq1Mice1Gz+NWQG/PXVhgBzbTBDDCiRJkg1Dhxce1OMRUiM4tUW
 0JABoU+KS0RZAKXAwine6v5duYmwK36Vl1SSCCWjqFMeR7XMwWWHA9d7t8+wdT1l
 KBIEiRTcRnXaZXyLUPSPRbEF5ALS25RgWVPCA3ibuSUnJjGU7Z7/rbwlQryAefVB
 nqjATed0zat4fbL9bvnDuOKQEzkuySvUWpU+Eozxbct6oRu5ms0=
 =Vcif
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Add initial support for RAS hardware found on AMD server GPUs (MI200).

   Those GPUs and CPUs are connected together through the coherent
   fabric and the GPU memory controllers report errors through x86's MCA
   so EDAC needs to support them. The amd64_edac driver supports now HBM
   (High Bandwidth Memory) and thus such heterogeneous memory controller
   systems

 - Other small cleanups and improvements

* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/amd64: Cache and use GPU node map
  EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
  EDAC/amd64: Document heterogeneous system enumeration
  x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
  x86/amd_nb: Re-sort and re-indent PCI defines
  x86/amd_nb: Add MI200 PCI IDs
  ras/debugfs: Fix error checking for debugfs_create_dir()
  x86/MCE: Check a hw error's address to determine proper recovery action
2023-06-26 15:09:18 -07:00
Linus Torvalds
88afbb21d4 A set of fixes for kexec(), reboot and shutdown issues
- Ensure that the WBINVD in stop_this_cpu() has been completed before the
    control CPU proceedes.
 
    stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs
    in a HLT loop.
 
    The control CPU sends an IPI to the APs and waits for their CPU online bits
    to be cleared. Once they all are marked "offline" it proceeds.
 
    But stop_this_cpu() clears the CPU online bit before issuing WBINVD,
    which means there is no guarantee that the AP has reached the HLT loop.
 
    This was reported to cause intermittent reboot/shutdown failures due to
    some dubious interaction with the firmware.
 
    This is not only a problem of WBINVD. The code to actually "stop" the
    CPU which runs between clearing the online bit and reaching the HLT loop
    can cause large enough delays on its own (think virtualization). That's
    especially dangerous for kexec() as kexec() expects that all APs are in
    a safe state and not executing code while the boot CPU jumps to the new
    kernel. There are more issues vs. kexec() which are addressed separately.
 
    Cure this by implementing an explicit synchronization point right before
    the AP reaches HLT. This guarantees that the AP has completed the full
    stop proceedure.
 
  - Fix the condition for WBINVD in stop_this_cpu().
 
    The WBINVD in stop_this_cpu() is required for ensuring that when
    switching to or from memory encryption no dirty data is left in the
    cache lines which might cause a write back in the wrong more later.
 
    This checks CPUID directly because the feature bit might have been
    cleared due to a command line option.
 
    But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel
    CPUs return the content of the highest supported leaf when a non-existing
    leaf is read, while AMD CPUs return all zeros for unsupported leafs.
 
    So the result of the test on Intel CPUs is lottery and on AMD its just
    correct by chance.
 
    While harmless it's incorrect and causes the conditional wbinvd() to be
    issued where not required, which caused the above issue to be unearthed.
 
  - Make kexec() robust against AP code execution
 
    Ashok observed triple faults when doing kexec() on a system which had
    been booted with "nosmt".
 
    It turned out that the SMT siblings which had been brought up partially
    are parked in mwait_play_dead() to enable power savings.
 
    mwait_play_dead() is monitoring the thread flags of the AP's idle task,
    which has been chosen as it's unlikely to be written to.
 
    But kexec() can overwrite the previous kernel text and data including
    page tables etc. When it overwrites the cache lines monitored by an AP
    that AP resumes execution after the MWAIT on eventually overwritten
    text, stack and page tables, which obviously might end up in a triple
    fault easily.
 
    Make this more robust in several steps:
 
     1) Use an explicit per CPU cache line for monitoring.
 
     2) Write a command to these cache lines to kick APs out of MWAIT before
        proceeding with kexec(), shutdown or reboot.
 
        The APs confirm the wakeup by writing status back and then enter a
        HLT loop.
 
     3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs
        in INIT state.
 
        HLT is not a guarantee that an AP won't wake up and resume
        execution. HLT is woken up by NMI and SMI. SMI puts the CPU back
        into HLT (+/- firmware bugs), but NMI is delivered to the CPU which
        executes the NMI handler. Same issue as the MWAIT scenario described
        above.
 
        Sending an INIT/INIT sequence to the APs puts them into wait for
        STARTUP state, which is safe against NMI.
 
     There is still an issue remaining which can't be fixed: #MCE
 
     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.
 
     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to
     shut down the machine.
 
     So there is a choice between fire (HLT) and frying pan (INIT). Frying
     pan has been chosen as it's at least preventing the NMI issue.
 
     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZfpQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeZpD/9gSJN2qtGqoOgE8bWAenEeqppmBGFE
 EAhuhsvN1qG9JosUFo4KzxsGD/aWt2P6XglBDrGti8mFNol67jutmwWklntL3/ZR
 m8D6D+Pl7/CaDgACDTDbrnVC3lOGyMhD301yJrnBigS/SEoHeHI9UtadbHukuLQj
 TlKt5KtAnap15bE6QL846cDIptB9SjYLLPULo3i4azXEis/l6eAkffwAR6dmKlBh
 2RbhLK1xPPG9nqWYjqZXnex09acKwD9xY9xHj4+GampV4UqHJRWfW0YtFs5ENi01
 r3FVCdKEcvMkUw0zh0IAviBRs2vCI/R3YSfEc7P0264yn5WzMhAT+OGCovNjByiW
 sB4Iqa+Yf6aoBWwux6W4d22xu7uYhmFk/jiLyRZJPW/gvGZCZATT/x/T2hRoaYA8
 3S0Rs7n/gbfvynQETgniifuM0bXRW0lEJAmn840GwyVQwlpDEPBJSwW4El49kbkc
 +dHxnmpMCfnBxfVLS1YDd4WOmkWBeECNcW330FShlQQ8mM3UG31+Q8Jc55Ze9SW0
 w1h+IgIOHlA0DpQUUM8DJTSuxFx2piQsZxjOtzd70+BiKZpCsHqVLIp4qfnf+/GO
 gyP0cCQLbafpABbV9uVy8A/qgUGi0Qii0GJfCTy0OdmU+JX3C2C/gsM3uN0g3qAj
 vUhkuCXEGL5k1w==
 =KgZ0
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Thomas Gleixner:
 "A set of fixes for kexec(), reboot and shutdown issues:

   - Ensure that the WBINVD in stop_this_cpu() has been completed before
     the control CPU proceedes.

     stop_this_cpu() is used for kexec(), reboot and shutdown to park
     the APs in a HLT loop.

     The control CPU sends an IPI to the APs and waits for their CPU
     online bits to be cleared. Once they all are marked "offline" it
     proceeds.

     But stop_this_cpu() clears the CPU online bit before issuing
     WBINVD, which means there is no guarantee that the AP has reached
     the HLT loop.

     This was reported to cause intermittent reboot/shutdown failures
     due to some dubious interaction with the firmware.

     This is not only a problem of WBINVD. The code to actually "stop"
     the CPU which runs between clearing the online bit and reaching the
     HLT loop can cause large enough delays on its own (think
     virtualization). That's especially dangerous for kexec() as kexec()
     expects that all APs are in a safe state and not executing code
     while the boot CPU jumps to the new kernel. There are more issues
     vs kexec() which are addressed separately.

     Cure this by implementing an explicit synchronization point right
     before the AP reaches HLT. This guarantees that the AP has
     completed the full stop proceedure.

   - Fix the condition for WBINVD in stop_this_cpu().

     The WBINVD in stop_this_cpu() is required for ensuring that when
     switching to or from memory encryption no dirty data is left in the
     cache lines which might cause a write back in the wrong more later.

     This checks CPUID directly because the feature bit might have been
     cleared due to a command line option.

     But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
     Intel CPUs return the content of the highest supported leaf when a
     non-existing leaf is read, while AMD CPUs return all zeros for
     unsupported leafs.

     So the result of the test on Intel CPUs is lottery and on AMD its
     just correct by chance.

     While harmless it's incorrect and causes the conditional wbinvd()
     to be issued where not required, which caused the above issue to be
     unearthed.

   - Make kexec() robust against AP code execution

     Ashok observed triple faults when doing kexec() on a system which
     had been booted with "nosmt".

     It turned out that the SMT siblings which had been brought up
     partially are parked in mwait_play_dead() to enable power savings.

     mwait_play_dead() is monitoring the thread flags of the AP's idle
     task, which has been chosen as it's unlikely to be written to.

     But kexec() can overwrite the previous kernel text and data
     including page tables etc. When it overwrites the cache lines
     monitored by an AP that AP resumes execution after the MWAIT on
     eventually overwritten text, stack and page tables, which obviously
     might end up in a triple fault easily.

     Make this more robust in several steps:

      1) Use an explicit per CPU cache line for monitoring.

      2) Write a command to these cache lines to kick APs out of MWAIT
         before proceeding with kexec(), shutdown or reboot.

         The APs confirm the wakeup by writing status back and then
         enter a HLT loop.

      3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
         APs in INIT state.

         HLT is not a guarantee that an AP won't wake up and resume
         execution. HLT is woken up by NMI and SMI. SMI puts the CPU
         back into HLT (+/- firmware bugs), but NMI is delivered to the
         CPU which executes the NMI handler. Same issue as the MWAIT
         scenario described above.

         Sending an INIT/INIT sequence to the APs puts them into wait
         for STARTUP state, which is safe against NMI.

     There is still an issue remaining which can't be fixed: #MCE

     If the AP sits in HLT and receives a broadcast #MCE it will try to
     handle it with the obvious consequences.

     INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
     #MCE to shut down the machine.

     So there is a choice between fire (HLT) and frying pan (INIT).
     Frying pan has been chosen as it's at least preventing the NMI
     issue.

     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed"

* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Put CPUs into INIT on shutdown if possible
  x86/smp: Split sending INIT IPI out into a helper function
  x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  x86/smp: Use dedicated cache-line for mwait_play_dead()
  x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  x86/smp: Dont access non-existing CPUID leaf
  x86/smp: Make stop_other_cpus() more robust
2023-06-26 14:45:53 -07:00
Linus Torvalds
9244724fbf A large update for SMP management:
- Parallel CPU bringup
 
     The reason why people are interested in parallel bringup is to shorten
     the (kexec) reboot time of cloud servers to reduce the downtime of the
     VM tenants.
 
     The current fully serialized bringup does the following per AP:
 
       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state
 
     There are two significant delays:
 
       #3 The time for an AP to report alive state in start_secondary() on
          x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.
 
       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending on
          the microcode patch size to apply.
 
     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to come
     up and apply microcode. That's more than 80% of the actual onlining
     procedure.
 
     This can be reduced significantly by splitting the bringup mechanism
     into two parts:
 
       1) Run the prepare callbacks and kick the AP alive for each AP which
       	 needs to be brought up.
 
 	 The APs wake up, do their firmware initialization and run the low
       	 level kernel startup code including microcode loading in parallel
       	 up to the first synchronization point. (#1 and #2 above)
 
       2) Run the rest of the bringup code strictly serialized per CPU
       	 (#3 - #5 above) as it's done today.
 
 	 Parallelizing that stage of the CPU bringup might be possible in
 	 theory, but it's questionable whether required surgery would be
 	 justified for a pretty small gain.
 
     If the system is large enough the first AP is already waiting at the
     first synchronization point when the boot CPU finished the wake-up of
     the last AP. That reduces the AP bringup time on that SKL from ~800ms
     to ~80ms, i.e. by a factor ~10x.
 
     The actual gain varies wildly depending on the system, CPU, microcode
     patch size and other factors. There are some opportunities to reduce
     the overhead further, but that needs some deep surgery in the x86 CPU
     bringup code.
 
     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.
 
   - Enhancements for SMP function call tracing so it is possible to locate
     the scheduling and the actual execution points. That allows to measure
     IPI delivery time precisely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
 FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
 zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
 wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
 K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
 a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
 dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
 TRiSzvssbYYmaw==
 =Y8if
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:
 "A large update for SMP management:

   - Parallel CPU bringup

     The reason why people are interested in parallel bringup is to
     shorten the (kexec) reboot time of cloud servers to reduce the
     downtime of the VM tenants.

     The current fully serialized bringup does the following per AP:

       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state

     There are two significant delays:

       #3 The time for an AP to report alive state in start_secondary()
          on x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.

       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending
          on the microcode patch size to apply.

     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to
     come up and apply microcode. That's more than 80% of the actual
     onlining procedure.

     This can be reduced significantly by splitting the bringup
     mechanism into two parts:

       1) Run the prepare callbacks and kick the AP alive for each AP
          which needs to be brought up.

          The APs wake up, do their firmware initialization and run the
          low level kernel startup code including microcode loading in
          parallel up to the first synchronization point. (#1 and #2
          above)

       2) Run the rest of the bringup code strictly serialized per CPU
          (#3 - #5 above) as it's done today.

          Parallelizing that stage of the CPU bringup might be possible
          in theory, but it's questionable whether required surgery
          would be justified for a pretty small gain.

     If the system is large enough the first AP is already waiting at
     the first synchronization point when the boot CPU finished the
     wake-up of the last AP. That reduces the AP bringup time on that
     SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

     The actual gain varies wildly depending on the system, CPU,
     microcode patch size and other factors. There are some
     opportunities to reduce the overhead further, but that needs some
     deep surgery in the x86 CPU bringup code.

     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.

   - Enhancements for SMP function call tracing so it is possible to
     locate the scheduling and the actual execution points. That allows
     to measure IPI delivery time precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
2023-06-26 13:59:56 -07:00
Linus Torvalds
7cffdbe360 Updates for the x86 boot process:
- Initialize FPU late.
 
    Right now FPU is initialized very early during boot. There is no real
    requirement to do so. The only requirement is to have it done before
    alternatives are patched.
 
    That's done in check_bugs() which does way more than what the function
    name suggests.
 
    So first rename check_bugs() to arch_cpu_finalize_init() which makes it
    clear what this is about.
 
    Move the invocation of arch_cpu_finalize_init() earlier in
    start_kernel() as it has to be done before fork_init() which needs to
    know the FPU register buffer size.
 
    With those prerequisites the FPU initialization can be moved into
    arch_cpu_finalize_init(), which removes it from the early and fragile
    part of the x86 bringup.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZdNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaNBEACWtVd1uhqQldIFgSvZYujsrWXlmkU+
 pok6gDzKQNwZADiXW/tn5fP8SBLWT0pgLM9d+oZ5mEaLaOW7HcZLEHcVrn74e3TT
 53xN8e1zCzyjCJ/x22vrKH4sn/bU+bQyzSNVu9Disqn9Fl+ts37FqAHDv/ExbneD
 DaYXXCLgQsyGbPLD8B7yGOpJTGBUTJxNQS1ZFElBaRsAaw0mYZOEoPvuTFK4o7Uz
 GUB2vGefmeNfX+EgLYKG9QoS0F3SMS9X2IYswy1H76ZnV/eXmTsA1S3u3X9yX7kC
 XBnPtCC+iX+7o3xFkTpa0oQUdzEyGOItExZZgce6jEQu4Fl7NoIJxhlMg9/Y+vcF
 ntipEKSWFLAi1GkZzeKRwSSsoWqRaFxOKLy8qhn9kud09k+UtMBkNrF1CSp9laAz
 QParu3B1oHPEzx/jS0bSOCMN+AQZH8rX7LxRp4kpBOeBSZNCnfaBUzfIvmccPls+
 EJTO/0JUpRm5LsPSDiJhypPRoOOIP26IloR6OoZTcI3p76NrnYblRvisvuFAgDU6
 bk7Belf+GDx0kBZugqQgok7nDaHIBR7vEmca1NV8507UrffVyxLAiI4CiWPcFdOq
 ovhO8K+gP4xvzZx4cXZBwYwusjvl/oxKy8yQiGgoftDiWU4sdUCSrwX3x27+hUYL
 2P1OLDOXSGwESQ==
 =yxMj
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Thomas Gleixner:
 "Initialize FPU late.

  Right now FPU is initialized very early during boot. There is no real
  requirement to do so. The only requirement is to have it done before
  alternatives are patched.

  That's done in check_bugs() which does way more than what the function
  name suggests.

  So first rename check_bugs() to arch_cpu_finalize_init() which makes
  it clear what this is about.

  Move the invocation of arch_cpu_finalize_init() earlier in
  start_kernel() as it has to be done before fork_init() which needs to
  know the FPU register buffer size.

  With those prerequisites the FPU initialization can be moved into
  arch_cpu_finalize_init(), which removes it from the early and fragile
  part of the x86 bringup"

* tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
  x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
  x86/fpu: Mark init functions __init
  x86/fpu: Remove cpuinfo argument from init functions
  x86/init: Initialize signal frame size late
  init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
  init: Invoke arch_cpu_finalize_init() earlier
  init: Remove check_bugs() leftovers
  um/cpu: Switch to arch_cpu_finalize_init()
  sparc/cpu: Switch to arch_cpu_finalize_init()
  sh/cpu: Switch to arch_cpu_finalize_init()
  mips/cpu: Switch to arch_cpu_finalize_init()
  m68k/cpu: Switch to arch_cpu_finalize_init()
  loongarch/cpu: Switch to arch_cpu_finalize_init()
  ia64/cpu: Switch to arch_cpu_finalize_init()
  ARM: cpu: Switch to arch_cpu_finalize_init()
  x86/cpu: Switch to arch_cpu_finalize_init()
  init: Provide arch_cpu_finalize_init()
2023-06-26 13:39:10 -07:00
Linus Torvalds
64bf6ae93e v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
 ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
 =zdfx
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Miscellaneous features, cleanups, and fixes for vfs and individual fs

  Features:

   - Use mode 0600 for file created by cachefilesd so it can be run by
     unprivileged users. This aligns them with directories which are
     already created with mode 0700 by cachefilesd

   - Reorder a few members in struct file to prevent some false sharing
     scenarios

   - Indicate that an eventfd is used a semaphore in the eventfd's
     fdinfo procfs file

   - Add a missing uapi header for eventfd exposing relevant uapi
     defines

   - Let the VFS protect transitions of a superblock from read-only to
     read-write in addition to the protection it already provides for
     transitions from read-write to read-only. Protecting read-only to
     read-write transitions allows filesystems such as ext4 to perform
     internal writes, keeping writers away until the transition is
     completed

  Cleanups:

   - Arnd removed the architecture specific arch_report_meminfo()
     prototypes and added a generic one into procfs.h. Note, we got a
     report about a warning in amdpgpu codepaths that suggested this was
     bisectable to this change but we concluded it was a false positive

   - Remove unused parameters from split_fs_names()

   - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
     reflect the order of the cleanup operation that has to unmap before
     the actual put

   - Unexport buffer_check_dirty_writeback() as it is not used outside
     of block device aops

   - Stop allocating aio rings from highmem

   - Protecting read-{only,write} transitions in the VFS used open-coded
     barriers in various places. Replace them with proper little helpers
     and document both the helpers and all barrier interactions involved
     when transitioning between read-{only,write} states

   - Use flexible array members in old readdir codepaths

  Fixes:

   - Use the correct type __poll_t for epoll and eventfd

   - Replace all deprecated strlcpy() invocations, whose return value
     isn't checked with an equivalent strscpy() call

   - Fix some kernel-doc warnings in fs/open.c

   - Reduce the stack usage in jffs2's xattr codepaths finally getting
     rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
     bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
     royally annoying compilation warning

   - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
     fmode_t is required to avoid fmode_t to integer degradation
     warnings

   - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
     explanation in that commit how O_RDWR is actually a bug which we
     found out with the help of Linus and git archeology

   - Fix "no previous prototype" warnings in the pipe codepaths

   - Add overflow calculations for remap_verify_area() as a signed
     addition overflow could be triggered in xfstests

   - Fix a null pointer dereference in sysv

   - Use an unsigned variable for length calculations in jfs avoiding
     compilation warnings with gcc 13

   - Fix a dangling pipe pointer in the watch queue codepath

   - The legacy mount option parser provided as a fallback by the VFS
     for filesystems not yet converted to the new mount api did prefix
     the generated mount option string with a leading ',' causing issues
     for some filesystems

   - Fix a repeated word in a comment in fs.h

   - autofs: Update the ctime when mtime is updated as mandated by
     POSIX"

* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  readdir: Replace one-element arrays with flexible-array members
  fs: Provide helpers for manipulating sb->s_readonly_remount
  fs: Protect reconfiguration of sb read-write from racing writes
  eventfd: add a uapi header for eventfd userspace APIs
  autofs: set ctime as well when mtime changes on a dir
  eventfd: show the EFD_SEMAPHORE flag in fdinfo
  fs/aio: Stop allocating aio rings from HIGHMEM
  fs: Fix comment typo
  fs: unexport buffer_check_dirty_writeback
  fs: avoid empty option when generating legacy mount string
  watch_queue: prevent dangling pipe pointer
  fs.h: Optimize file struct to prevent false sharing
  highmem: Rename put_and_unmap_page() to unmap_and_put_page()
  cachefiles: Allow the cache to be non-root
  init: remove unused names parameter in split_fs_names()
  jfs: Use unsigned variable for length calculations
  fs/sysv: Null check to prevent null-ptr-deref bug
  fs: use UB-safe check for signed addition overflow in remap_verify_area
  procfs: consolidate arch_report_meminfo declaration
  fs: pipe: reveal missing function protoypes
  ...
2023-06-26 09:50:21 -07:00
Arnd Bergmann
fb9b7b4b2b x86: xen: add missing prototypes
These function are all called from assembler files, or from inline
assembler, so there is no immediate need for a prototype in a header,
but if -Wmissing-prototypes is enabled, the compiler warns about them:

arch/x86/xen/efi.c:130:13: error: no previous prototype for 'xen_efi_init' [-Werror=missing-prototypes]
arch/x86/platform/pvh/enlighten.c:120:13: error: no previous prototype for 'xen_prepare_pvh' [-Werror=missing-prototypes]
arch/x86/xen/enlighten_pv.c:1233:34: error: no previous prototype for 'xen_start_kernel' [-Werror=missing-prototypes]
arch/x86/xen/irq.c:22:14: error: no previous prototype for 'xen_force_evtchn_callback' [-Werror=missing-prototypes]
arch/x86/entry/common.c:302:24: error: no previous prototype for 'xen_pv_evtchn_do_upcall' [-Werror=missing-prototypes]

Declare all of them in an appropriate header file to avoid the warnings.
For consistency, also move the asm_cpu_bringup_and_idle() declaration
out of smp_pv.c.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230614073501.10101-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-06-26 07:47:11 +02:00
Juergen Gross
3d013424de x86/xen: add prototypes for paravirt mmu functions
The paravirt MMU functions called via the PV_CALLEE_SAVE_REGS_THUNK()
macro can't be defined to be static, as the macro is generating a
function via asm() statement calling the paravirt MMU function.

In order to avoid warnings when specifying "-Wmissing-prototypes" for
the build, add local prototypes (there should never be any external
caller of those functions).

Reported-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230614073501.10101-2-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-06-26 07:47:11 +02:00
Ross Lagerwall
9338c2233b iscsi_ibft: Fix finding the iBFT under Xen Dom 0
To facilitate diskless iSCSI boot, the firmware can place a table of
configuration details in memory called the iBFT. The presence of this
table is not specified, nor is the precise location (and it's not in the
E820) so the kernel has to search for a magic marker to find it.

When running under Xen, Dom 0 does not have access to the entire host's
memory, only certain regions which are identity-mapped which means that
the pseudo-physical address in Dom0 == real host physical address.
Add the iBFT search bounds as a reserved region which causes it to be
identity-mapped in xen_set_identity_and_remap_chunk() which allows Dom0
access to the specific physical memory to correctly search for the iBFT
magic marker (and later access the full table).

This necessitates moving the call to reserve_ibft_region() somewhat
later so that it is called after e820__memory_setup() which is when the
Xen identity mapping adjustments are applied. The precise location of
the call is not too important so I've put it alongside dmi_setup() which
does similar scanning of memory for configuration tables.

Finally in the iBFT find code, instead of using isa_bus_to_virt() which
doesn't do the right thing under Xen, use early_memremap() like the
dmi_setup() code does.

The result of these changes is that it is possible to boot a diskless
Xen + Dom0 running off an iSCSI disk whereas previously it would fail to
find the iBFT and consequently, the iSCSI root disk.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Link: https://lore.kernel.org/r/20230605102840.1521549-1-ross.lagerwall@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-06-26 07:47:11 +02:00
Arnd Bergmann
04d684875b xen: xen_debug_interrupt prototype to global header
The xen_debug_interrupt() function is only called on x86, which has a
prototype in an architecture specific header, but the definition also
exists on others, where the lack of a prototype causes a W=1 warning:

drivers/xen/events/events_2l.c:264:13: error: no previous prototype for 'xen_debug_interrupt' [-Werror=missing-prototypes]

Move the prototype into a global header instead to avoid this warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/20230517124525.929201-1-arnd@kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-06-26 07:47:11 +02:00
Linus Torvalds
547cc9be86 - Drop the __weak attribute from a function prototype as it otherwise
leads to the function getting replaced by a dummy stub
 
 - Fix the umask value setup of the frontend event as former is different
   on two Intel cores
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSYChQACgkQEsHwGGHe
 VUq7cQ/7BSDauaiRGPvovvolPUarJ6Ezidrq0y24p92eHjqjfiZLZVR53AItTeFr
 5Naib991Wy7PdqvEgBElNlWdTvhefdSXSOdII7Zo8bHRtJAFAgbIaIZN6W2qRlDO
 v9scHnwjIbkjIB6mOd4zn0Ttwjnk3cvArSgxhmIa5K1EY8C7LG8Npviadi6tkhxF
 /0Zw+K/X7mmLpnaKbvKu+hY+0IxJBnpO319xk28XxrugZLbdykjuhCiLwJ4P6QxY
 FbrDqMEp1PjLQQ725sJBwVYQ+bEfBGT7qt5A6gBjglFstsPsYOhyfSPfk4rh7mvK
 DdtW11mfLyAlKSlb6GtXWajFt2KeTclNHnrpgI7Qp11S0CyeXvl88okSjYvLNeJn
 PKT6fwDyhiE35hodkQGKQYqkOijTrwInIO8wsf3KbmyPLSbZINY0boNzjPp5eCZC
 lSNwUCPh/JiJewnmiLxbalpt/yLzQI/fII4ibBqxl7hAGwIb0KdjnZwM6tZ9AdoW
 M1PZtVZJr8j9N6yI7Y+Wbxj9oVoNmH5ie90lq0Di9niGpkytGoD2VSn5LIOVSZhR
 jmFbloTR+BUWh32e54NUKnVew2I1lkaklQK8OgECw2wRrFcNYx73tu0lOXJY1RwV
 zmsmQDnM56SxHFhpKcgy2c3vwKhR5HeuLd2MJtgtuB4KIje7row=
 =hJ0i
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Drop the __weak attribute from a function prototype as it otherwise
   leads to the function getting replaced by a dummy stub

 - Fix the umask value setup of the frontend event as former is
   different on two Intel cores

* tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL
  perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype
2023-06-25 10:13:17 -07:00
Linus Torvalds
300edd751b - Add a ORC format hash to vmlinux and modules in order for other tools
which use it, to detect changes to it and adapt accordingly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSYCDUACgkQEsHwGGHe
 VUp/mg//ed9w/x1b/pdeM8WUtQ/jSXWMwntKiJqlDLaexl8e+LKNPS20/yZjGJ6k
 tw+Hxwvgv0lTYVzeGowTehAsFKqbGPQCmW+RO77Fvrf6DYQmPdXr7cEiZWwYJZgr
 nZTYTZ97DxtGrpkRDTNYx9kqya3sbgOTma6tU+K/8l1NaLDgdl0OoNnYbGFmJEem
 ucaoIO36gPzkMafMmmB/SKu96OH+E2mhzVbPnmBWR/36JE/wUgGmcTtfqLCzuClO
 JOvHj4ylg6UQkXrJpxNqCcjLI4nzpSfYNGpIVy+bHHEmGlQ9HtmpIE60+ooYLrrZ
 YNJRvkHMtbrXizkZIOOkOm6ZjEgKtgiyxLKyXTgQu8sE1rNAXWQiFr6lbt2GdHrn
 pwZ/FXp+KKan0K28x34yHMO5B6v0TGQKS0VqafdfrYe8b/vZsscMPpKkss6I7X2O
 sh8OHOAydyjFG9tplxK6sspA1xM/Qeqh0lSeHvqbiBrd8cGGR6em5pGIwmEOgGmX
 RlvdcdQLNhXP6RDsXsNltqG2uOqPKPIqV9b3WpP616Gl2RV7wOhOT6nChXbGm//Z
 NZ4uigx3eokgsoCSDVilgQdHPdZAulbfYcnjPlLDHbcPqOhOdQObcvFCeAe5HG7v
 QhsZ//WnV7u0OjXVl2Da56/J/k1snYwStXt1xHXRkwCSVaU2Bj0=
 =n42+
 -----END PGP SIGNATURE-----

Merge tag 'objtool_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fix from Borislav Petkov:

 - Add a ORC format hash to vmlinux and modules in order for other tools
   which use it, to detect changes to it and adapt accordingly

* tag 'objtool_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Add ELF section with ORC version identifier
2023-06-25 10:00:17 -07:00
Linus Torvalds
661e723b6f - Do not use set_pgd() when updating the KASLR trampoline pgd entry
because that updates the user PGD too on KPTI builds, resulting in
   memory corruption
 
 - Prevent a panic in the IO-APIC setup code due to conflicting command
   line parameters
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSYBYwACgkQEsHwGGHe
 VUqLng//dI7c4KnGbQlVsi+4jWgUEvggEvDDIW9HuhVYLhx1YZw8rAsobi40hM8t
 vaV/cp2qhAYQFbwZsZKSgStVuZisewIby6tTOGyHG76IZGEitbdwNP1ISi3u5oDb
 c1jCn5qRcyIx6V2BEzeSwf4h+dt3QGMlIny1/TGf4f7X6JaP3MSnISiwDvlhmrkT
 t71SEH2JZ9ah7QMdy2D9A2H0vVS0PL7tEBZ9GD5d+eRNguBkbnLeJE1bKTTxU60a
 KqXTKGyFfUMgLS4icCTWsMBh7e5+OeUN866R8GdeoSfoqlRYwUM/63UsbAFbRK8+
 H6/c/yOdGlngKyFRr4UmsgmmaXfyRYeWkMZZOrGSzS9pvktoShQ85+8iw9b4Fbjp
 W9CjHJ/lA4atvxjnh2N7z/2kKZwfDLhJJdf6YuCPI7QLush2rukNJJr0ghBjzKDV
 2Wh1/ccq8qPm7BQ26VasDKO9b1ZXEhQ6mTyIIbGsx6xoCmT7cdJIplvstHCT485D
 4yuWLS+WV4T+ZAqAXjmFoADrQ/M79mdGNakNLTHKbAOb8RNAZEMIwsgLM4wAi//3
 It7kYSbYcnNtSa/WMmKQH56pbjKLGJVNnfVU6HDZ9MIiRKvj+GEJsbzqFynz2K5I
 kOqb4M5XxMlcM6GcV7Y9hcNWGvaMNZFbBR5gZpleksXEvG5ObRw=
 =9rbq
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Do not use set_pgd() when updating the KASLR trampoline pgd entry
   because that updates the user PGD too on KPTI builds, resulting in
   memory corruption

 - Prevent a panic in the IO-APIC setup code due to conflicting command
   line parameters

* tag 'x86_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Fix kernel panic when booting with intremap=off and x2apic_phys
  x86/mm: Avoid using set_pgd() outside of real PGD pages
2023-06-25 09:47:04 -07:00
Linus Torvalds
c2508ec5a5 mm: introduce new 'lock_mm_and_find_vma()' page fault helper
.. and make x86 use it.

This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.

We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.

That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.

So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.

Note: I say "common helper", but it really only handles the normal
stack-grows-down case.  Which is all architectures except for PA-RISC
and IA64.  So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.

It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.

But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24 14:12:54 -07:00
Andrew Morton
63773d2b59 Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
Linus Torvalds
8a28a0b6f1 Networking fixes for 6.4-rc8, including fixes from ipsec, bpf,
mptcp and netfilter.
 
 Current release - regressions:
 
   - netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
 
   - eth: mlx5e:
     - fix scheduling of IPsec ASO query while in atomic
     - free IRQ rmap and notifier on kernel shutdown
 
 Current release - new code bugs:
 
   - phy: manual remove LEDs to ensure correct ordering
 
 Previous releases - regressions:
 
   - mptcp: fix possible divide by zero in recvmsg()
 
   - dsa: revert "net: phy: dp83867: perform soft reset and retain established link"
 
 Previous releases - always broken:
 
   - sched: netem: acquire qdisc lock in netem_change()
 
   - bpf:
     - fix verifier id tracking of scalars on spill
     - fix NULL dereference on exceptions
     - accept function names that contain dots
 
   - netfilter: disallow element updates of bound anonymous sets
 
   - mptcp: ensure listener is unhashed before updating the sk status
 
   - xfrm:
     - add missed call to delete offloaded policies
     - fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets
 
   - selftests: fixes for FIPS mode
 
   - dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling
 
   - eth: sfc: use budget for TX completions
 
 Misc:
 
   - wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmSUZO0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkBjAP/RfTUYdlPqz9jSvz0HmQt2Er39HyVb9I
 pzEpJSQGfO+eyIrlxmleu8cAaW5HdvyfMcBgr04uh+Jf06s+VJrD95IO9zDHHKoC
 86itYNKMS3fSt1ivzg49i5uq66MhjtAcfIOB9HMOAQ2Jd+DYlzyWOOHw28ZAxsBZ
 Q6TU97YEMuU4FdLkoKob1aVswC5cPxNx2IH9NagfbtijaYZqeN9ZX9EI5yMUyH8f
 5gboqOhXUQK0MQLM5TFySHeoayyQ+tRBz24nF0/6lWiRr+xzMTEKdkFpRza7Mxzj
 S8NxN3C+zOf96gic6kYOXmM6y0sOlbwC9JoeWTp8Tuh6DEYi6xLC2XkiYJ51idZg
 PElgRpkM1ddqvvFWFgZlNik5z0vbGnJH7pt0VuOSNntxE60cdQwvWEOr09vvPcS5
 0nMVD0uc8pds2h4hit+sdLltcVnOgoNUYr1/sI6oydofa1BrLnhFPF7z/gUs9foD
 NuCchiaBF11yBGKufcNBNEB4w35g3Kcu6TGhHb168OJi+UnSnwlI0Ccw7iO10pkv
 RjefhR60+wZC6+leo57nZeYqaLQJuALY0QYFsyeM+T0MGSYkbH24CmbNdSmO4MRr
 +VX2CwIqeIds4Hx31o0Feu+FaJqXw46/2nrSDxel/hlCJnGSMXZTw+b/4pFEHLP+
 l71ijZpJqV1S
 =GH2b
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from ipsec, bpf, mptcp and netfilter.

  Current release - regressions:

   - netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain

   - eth: mlx5e:
      - fix scheduling of IPsec ASO query while in atomic
      - free IRQ rmap and notifier on kernel shutdown

  Current release - new code bugs:

   - phy: manual remove LEDs to ensure correct ordering

  Previous releases - regressions:

   - mptcp: fix possible divide by zero in recvmsg()

   - dsa: revert "net: phy: dp83867: perform soft reset and retain
     established link"

  Previous releases - always broken:

   - sched: netem: acquire qdisc lock in netem_change()

   - bpf:
      - fix verifier id tracking of scalars on spill
      - fix NULL dereference on exceptions
      - accept function names that contain dots

   - netfilter: disallow element updates of bound anonymous sets

   - mptcp: ensure listener is unhashed before updating the sk status

   - xfrm:
      - add missed call to delete offloaded policies
      - fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets

   - selftests: fixes for FIPS mode

   - dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling

   - eth: sfc: use budget for TX completions

  Misc:

   - wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0"

* tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
  revert "net: align SO_RCVMARK required privileges with SO_MARK"
  net: wwan: iosm: Convert single instance struct member to flexible array
  sch_netem: acquire qdisc lock in netem_change()
  selftests: forwarding: Fix race condition in mirror installation
  wifi: mac80211: report all unusable beacon frames
  mptcp: ensure listener is unhashed before updating the sk status
  mptcp: drop legacy code around RX EOF
  mptcp: consolidate fallback and non fallback state machine
  mptcp: fix possible list corruption on passive MPJ
  mptcp: fix possible divide by zero in recvmsg()
  mptcp: handle correctly disconnect() failures
  bpf: Force kprobe multi expected_attach_type for kprobe_multi link
  bpf/btf: Accept function names that contain dots
  Revert "net: phy: dp83867: perform soft reset and retain established link"
  net: mdio: fix the wrong parameters
  netfilter: nf_tables: Fix for deleting base chains with payload
  netfilter: nfnetlink_osf: fix module autoload
  netfilter: nf_tables: drop module reference after updating chain
  netfilter: nf_tables: disallow timeout for anonymous sets
  netfilter: nf_tables: disallow updates of anonymous sets
  ...
2023-06-22 17:59:51 -07:00
Jakub Kicinski
59bb14bda2 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJK5DwAKCRDbK58LschI
 gyUtAQD4gT4BEVHRqvniw9yyqYo0BvElAznutDq7o9kFHFep2gEAoksEWS84OdZj
 0L5mSKjXrpHKzmY/jlMrVIcTb3VzOw0=
 =gAYE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-06-21

We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 181 insertions(+), 15 deletions(-).

The main changes are:

1) Fix a verifier id tracking issue with scalars upon spill,
   from Maxim Mikityanskiy.

2) Fix NULL dereference if an exception is generated while a BPF
   subprogram is running, from Krister Johansen.

3) Fix a BTF verification failure when compiling kernel with LLVM_IAS=0,
   from Florent Revest.

4) Fix expected_attach_type enforcement for kprobe_multi link,
   from Jiri Olsa.

5) Fix a bpf_jit_dump issue for x86_64 to pick the correct JITed image,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Force kprobe multi expected_attach_type for kprobe_multi link
  bpf/btf: Accept function names that contain dots
  selftests/bpf: add a test for subprogram extables
  bpf: ensure main program has an extable
  bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable.
  selftests/bpf: Add test cases to assert proper ID tracking on spill
  bpf: Fix verifier id tracking of scalars on spill
====================

Link: https://lore.kernel.org/r/20230621101116.16122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21 13:59:46 -07:00
YueHaibing
b360cbd254 x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
This is now unused, so can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20230620094519.15300-1-yuehaibing%40huawei.com
2023-06-21 10:57:54 -07:00
Donglin Peng
d938ba1768 x86/ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
The previous patch ("function_graph: Support recording and printing
the return value of function") has laid the groundwork for the for
the funcgraph-retval, and this modification makes it available on
the x86 platform.

We introduce a new structure called fgraph_ret_regs for the x86
platform to hold return registers and the frame pointer. We then
fill its content in the return_to_handler and pass its address
to the function ftrace_return_to_handler to record the return
value.

Link: https://lkml.kernel.org/r/53a506f0f18ff4b7aeb0feb762f1c9a5e9b83ee9.1680954589.git.pengdonglin@sangfor.com.cn

Signed-off-by: Donglin Peng <pengdonglin@sangfor.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-06-20 18:38:38 -04:00
Thomas Gleixner
45e34c8af5 x86/smp: Put CPUs into INIT on shutdown if possible
Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
6087dd5e86 x86/smp: Split sending INIT IPI out into a helper function
Putting CPUs into INIT is a safer place during kexec() to park CPUs.

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20230615193330.551157083@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
d7893093a7 x86/smp: Cure kexec() vs. mwait_play_dead() breakage
TLDR: It's a mess.

When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offlined CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
f9c9987bf5 x86/smp: Use dedicated cache-line for mwait_play_dead()
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all what is needed is a cache line which is not written
by other CPUs.

But there is a use case where a "dead" CPU needs to be brought out of
MWAIT: kexec().

This is required as kexec() can overwrite text, pagetables, stacks and the
monitored cacheline of the original kernel. The latter causes MWAIT to
resume execution which obviously causes havoc on the kexec kernel which
results usually in triple faults.

Use a dedicated per CPU storage to prepare for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de
2023-06-20 14:51:47 +02:00
Thomas Gleixner
2affa6d6db x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
The wmb()s before sending the IPIs are not synchronizing anything.

If at all then the apic IPI functions have to provide or act as appropriate
barriers.

Remove these cargo cult barriers which have no explanation of what they are
synchronizing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.378358382@linutronix.de
2023-06-20 14:51:46 +02:00
Tony Battersby
9b040453d4 x86/smp: Dont access non-existing CPUID leaf
stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. Intel
CPUs return the content of the highest supported leaf when a non-existing
leaf is read, while AMD CPUs return all zeros for unsupported leafs.

So the result of the test on Intel CPUs is lottery.

While harmless it's incorrect and causes the conditional wbinvd() to be
issued where not required.

Check whether the leaf is supported before reading it.

[ tglx: Adjusted changelog ]

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/20230615193330.322186388@linutronix.de
2023-06-20 14:51:46 +02:00
Thomas Gleixner
1f5e7eb786 x86/smp: Make stop_other_cpus() more robust
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.

That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.

That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:

CPU0					CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);			stop_this_cpu()
   while (num_online_cpus() > 1);         set_online(false);
   proceed... -> hang
				          wbinvd()

WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.

This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.

Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().

The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
2023-06-20 14:51:46 +02:00
Linus Torvalds
692b7dc87c hyperv-fixes for 6.4-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmSQ3ioTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXpREB/9nMJ5PbgsxpqKiV3ckodXZp7wLkFAv
 VK12KBZcjAr8kbZON0CHXWssC/QLBV9+UYDjvA7ciEjkzBZoIY8GMAjFZ4NNveTm
 ssZPaxg0DHX7SzVO6qDrZBwjyGmjPh8vH5TDsb6QPYk8WMuwYy+QZMWTEcxr7QU4
 o3GRbt+JShS05s5Q1B3pSeztyxDxJh1potyoTfaY1sbih0c+r6mtewlpRW3KgoSc
 ukssybTmNyRRpDos/PlT2e0gRpIzlYQnzE+sj4mGOOQFh4wGOR8wGcNPr4yirrcI
 gy/4nvIwxp0uLW0C30FBlqzNt9dirOSRflXq/Pp4MdQcSM3hpeONbDxx
 =0aJ5
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20230619' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Fix races in Hyper-V PCI controller (Dexuan Cui)

 - Fix handling of hyperv_pcpu_input_arg (Michael Kelley)

 - Fix vmbus_wait_for_unload to scan present CPUs (Michael Kelley)

 - Call hv_synic_free in the failure path of hv_synic_alloc (Dexuan Cui)

 - Add noop for real mode handlers for virtual trust level code (Saurabh
   Sengar)

* tag 'hyperv-fixes-signed-20230619' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  PCI: hv: Add a per-bus mutex state_lock
  Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
  PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
  PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
  PCI: hv: Fix a race condition bug in hv_pci_query_relations()
  arm64/hyperv: Use CPUHP_AP_HYPERV_ONLINE state to fix CPU online sequencing
  x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
  Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs
  Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails
  x86/hyperv/vtl: Add noop for realmode pointers
2023-06-19 17:05:43 -07:00
Hugh Dickins
653ba8108d x86: sme_populate_pgd() use pte_offset_kernel()
sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Link: https://lkml.kernel.org/r/497d7777-736e-85f2-c37-aa6bcf155e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:10 -07:00
Hugh Dickins
975ca3986b x86: allow get_locked_pte() to fail
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Link: https://lkml.kernel.org/r/b7fa8547-4f28-ec82-9893-1b2eb58e40b4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:10 -07:00
Dheeraj Kumar Srivastava
85d38d5810 x86/apic: Fix kernel panic when booting with intremap=off and x2apic_phys
When booting with "intremap=off" and "x2apic_phys" on the kernel command
line, the physical x2APIC driver ends up being used even when x2APIC
mode is disabled ("intremap=off" disables x2APIC mode). This happens
because the first compound condition check in x2apic_phys_probe() is
false due to x2apic_mode == 0 and so the following one returns true
after default_acpi_madt_oem_check() having already selected the physical
x2APIC driver.

This results in the following panic:

   kernel BUG at arch/x86/kernel/apic/io_apic.c:2409!
   invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
   CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc2-ver4.1rc2 #2
   Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021
   RIP: 0010:setup_IO_APIC+0x9c/0xaf0
   Call Trace:
    <TASK>
    ? native_read_msr
    apic_intr_mode_init
    x86_late_time_init
    start_kernel
    x86_64_start_reservations
    x86_64_start_kernel
    secondary_startup_64_no_verify
    </TASK>

which is:

setup_IO_APIC:
  apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n");
  for_each_ioapic(ioapic)
  	BUG_ON(mp_irqdomain_create(ioapic));

Return 0 to denote that x2APIC has not been enabled when probing the
physical x2APIC driver.

  [ bp: Massage commit message heavily. ]

Fixes: 9ebd680bd029 ("x86, apic: Use probe routines to simplify apic selection")
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230616212236.1389-1-dheerajkumar.srivastava@amd.com
2023-06-19 20:59:40 +02:00
Dave Airlie
cce3b573a5 Linux 6.4-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSPcdMeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWrQH/3KmuvZsWMC4PpJY
 VcF9VfF9i+Zv7DoG8sjD5VpNh47e87RsR6WNOFnKol5SUrM6vsBAb5i2rfQahNIv
 NSj0fPCE4/Nj9LMecKVC9WD8CitxYdbR+CF9Is21AQj1VihUl9eHXGcAWxuaMyhk
 TjPUwmbOOsRVMXXdGJzjX78cvLsxqpSv8A/5OTh16IBimbh7p+YjKJFkbfj/PMWf
 aF1quFkIEXgzJcHCpP6KDZHr2KbpY+jIN9hUENnGKJxHYNso5u+KrIW1kAm8meP1
 x26ETSquM0T70OAzovOWg+BeVkLDac/3Rh30ztLAI4AtajrlSzycvFsU9UNEJCc2
 BnM2IZI=
 =ANT5
 -----END PGP SIGNATURE-----

Backmerge tag 'v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next

Linux 6.4-rc7

Need this to pull in the msm work.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-06-19 16:01:25 +10:00
Michael Kelley
9636be85cc x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
These commits

a494aef23dfc ("PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg")
2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")

update the Hyper-V virtual PCI driver to use the hyperv_pcpu_input_arg
because that memory will be correctly marked as decrypted or encrypted
for all VM types (CoCo or normal). But problems ensue when CPUs in the
VM go online or offline after virtual PCI devices have been configured.

When a CPU is brought online, the hyperv_pcpu_input_arg for that CPU is
initialized by hv_cpu_init() running under state CPUHP_AP_ONLINE_DYN.
But this state occurs after state CPUHP_AP_IRQ_AFFINITY_ONLINE, which
may call the virtual PCI driver and fault trying to use the as yet
uninitialized hyperv_pcpu_input_arg. A similar problem occurs in a CoCo
VM if the MMIO read and write hypercalls are used from state
CPUHP_AP_IRQ_AFFINITY_ONLINE.

When a CPU is taken offline, IRQs may be reassigned in state
CPUHP_TEARDOWN_CPU. Again, the virtual PCI driver may fault trying to
use the hyperv_pcpu_input_arg that has already been freed by a
higher state.

Fix the onlining problem by adding state CPUHP_AP_HYPERV_ONLINE
immediately after CPUHP_AP_ONLINE_IDLE (similar to CPUHP_AP_KVM_ONLINE)
and before CPUHP_AP_IRQ_AFFINITY_ONLINE. Use this new state for
Hyper-V initialization so that hyperv_pcpu_input_arg is allocated
early enough.

Fix the offlining problem by not freeing hyperv_pcpu_input_arg when
a CPU goes offline. Retain the allocated memory, and reuse it if
the CPU comes back online later.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1684862062-51576-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-17 23:09:47 +00:00
Thomas Gleixner
0a9567ac5e x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
Moving mem_encrypt_init() broke the AMD_MEM_ENCRYPT=n because the
declaration of that function was under #ifdef CONFIG_AMD_MEM_ENCRYPT and
the obvious placement for the inline stub was the #else path.

This is a leftover of commit 20f07a044a76 ("x86/sev: Move common memory
encryption code to mem_encrypt.c") which made mem_encrypt_init() depend on
X86_MEM_ENCRYPT without moving the prototype. That did not fail back then
because there was no stub inline as the core init code had a weak function.

Move both the declaration and the stub out of the CONFIG_AMD_MEM_ENCRYPT
section and guard it with CONFIG_X86_MEM_ENCRYPT.

Fixes: 439e17576eb4 ("init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/oe-kbuild-all/202306170247.eQtCJPE8-lkp@intel.com/
2023-06-16 22:23:27 +02:00
Andy Shevchenko
fb1273635f KVM: x86: Remove PRIx* definitions as they are solely for user space
In the Linux kernel we do not support PRI.64 specifiers.
Moreover they seem not to be used anyway here. Drop them.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230616150233.83813-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-16 12:12:10 -07:00
Lee Jones
d082d48737 x86/mm: Avoid using set_pgd() outside of real PGD pages
KPTI keeps around two PGDs: one for userspace and another for the
kernel. Among other things, set_pgd() contains infrastructure to
ensure that updates to the kernel PGD are reflected in the user PGD
as well.

One side-effect of this is that set_pgd() expects to be passed whole
pages.  Unfortunately, init_trampoline_kaslr() passes in a single entry:
'trampoline_pgd_entry'.

When KPTI is on, set_pgd() will update 'trampoline_pgd_entry' (an
8-Byte globally stored [.bss] variable) and will then proceed to
replicate that value into the non-existent neighboring user page
(located +4k away), leading to the corruption of other global [.bss]
stored variables.

Fix it by directly assigning 'trampoline_pgd_entry' and avoiding
set_pgd().

[ dhansen: tweak subject and changelog ]

Fixes: 0925dda5962e ("x86/mm/KASLR: Use only one PUD entry for real mode trampoline")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20230614163859.924309-1-lee@kernel.org/g
2023-06-16 11:46:42 -07:00
Omar Sandoval
b9f174c811 x86/unwind/orc: Add ELF section with ORC version identifier
Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC
metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in
two") changed the ORC format. Although ORC is internal to the kernel,
it's the only way for external tools to get reliable kernel stack traces
on x86-64. In particular, the drgn debugger [1] uses ORC for stack
unwinding, and these format changes broke it [2]. As the drgn
maintainer, I don't care how often or how much the kernel changes the
ORC format as long as I have a way to detect the change.

It suffices to store a version identifier in the vmlinux and kernel
module ELF files (to use when parsing ORC sections from ELF), and in
kernel memory (to use when parsing ORC from a core dump+symbol table).
Rather than hard-coding a version number that needs to be manually
bumped, Peterz suggested hashing the definitions from orc_types.h. If
there is a format change that isn't caught by this, the hashing script
can be updated.

This patch adds an .orc_header allocated ELF section containing the
20-byte hash to vmlinux and kernel modules, along with the corresponding
__start_orc_header and __stop_orc_header symbols in vmlinux.

1: https://github.com/osandov/drgn
2: https://github.com/osandov/drgn/issues/303

Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata")
Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
2023-06-16 17:17:42 +02:00
Kan Liang
a6742cb90b perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL
When counting a FRONTEND event, the MSR_PEBS_FRONTEND is not correctly
set on GNR and MTL p-core.

The umask value for the FRONTEND events is changed on GNR and MTL. The
new umask is missing in the extra_regs[] table.

Add a dedicated intel_gnr_extra_regs[] for GNR and MTL p-core.

Fixes: bc4000fdb009 ("perf/x86/intel: Add Granite Rapids")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230615173242.3726364-1-kan.liang@linux.intel.com
2023-06-16 16:46:33 +02:00
Juergen Gross
30d65d1b19 x86/xen: Set default memory type for PV guests to WB
When running as an unprivileged PV guest under Xen (not dom0), the
default MTRR memory type should be write-back.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230615123959.12298-1-jgross@suse.com
2023-06-16 11:22:33 +02:00
Borislav Petkov (AMD)
013fdeb07a x86/mm: Remove unused current_untag_mask()
e0bddc19ba95 ("x86/mm: Reduce untagged_addr() overhead for systems without LAM")

removed its only usage site so drop it.

Move the tlbstate_untag_mask up in the header and drop the ugly
ifdeffery as the unused declaration should be properly discarded.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230614174148.5439-1-bp@alien8.de
2023-06-16 10:50:16 +02:00
Thomas Gleixner
b81fac906a x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
Initializing the FPU during the early boot process is a pointless
exercise. Early boot is convoluted and fragile enough.

Nothing requires that the FPU is set up early. It has to be initialized
before fork_init() because the task_struct size depends on the FPU register
buffer size.

Move the initialization to arch_cpu_finalize_init() which is the perfect
place to do so.

No functional change.

This allows to remove quite some of the custom early command line parsing,
but that's subject to the next installment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.902376621@linutronix.de
2023-06-16 10:16:01 +02:00
Thomas Gleixner
1703db2b90 x86/fpu: Mark init functions __init
No point in keeping them around.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.841685728@linutronix.de
2023-06-16 10:16:01 +02:00
Thomas Gleixner
1f34bb2a24 x86/fpu: Remove cpuinfo argument from init functions
Nothing in the call chain requires it

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.783704297@linutronix.de
2023-06-16 10:16:01 +02:00
Thomas Gleixner
54d9a91a3d x86/init: Initialize signal frame size late
No point in doing this during really early boot. Move it to an early
initcall so that it is set up before possible user mode helpers are started
during device initialization.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.727330699@linutronix.de
2023-06-16 10:16:00 +02:00
Thomas Gleixner
439e17576e init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and
remove the weak fallback from the core code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.670360645@linutronix.de
2023-06-16 10:16:00 +02:00
Thomas Gleixner
7c7077a726 x86/cpu: Switch to arch_cpu_finalize_init()
check_bugs() is a dumping ground for finalizing the CPU bringup. Only parts of
it has to do with actual CPU bugs.

Split it apart into arch_cpu_finalize_init() and cpu_select_mitigations().

Fixup the bogus 32bit comments while at it.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613224545.019583869@linutronix.de
2023-06-16 10:15:59 +02:00
Petr Pavlu
9d9173e9ce x86/build: Avoid relocation information in final vmlinux
The Linux build process on x86 roughly consists of compiling all input
files, statically linking them into a vmlinux ELF file, and then taking
and turning this file into an actual bzImage bootable file.

vmlinux has in this process two main purposes:
1) It is an intermediate build target on the way to produce the final
   bootable image.
2) It is a file that is expected to be used by debuggers and standard
   ELF tooling to work with the built kernel.

For the second purpose, a vmlinux file is typically collected by various
package build recipes, such as distribution spec files, including the
kernel's own tar-pkg target.

When building a kernel supporting KASLR with CONFIG_X86_NEED_RELOCS,
vmlinux contains also relocation information produced by using the
--emit-relocs linker option. This is utilized by subsequent build steps
to create vmlinux.relocs and produce a relocatable image. However, the
information is not needed by debuggers and other standard ELF tooling.

The issue is then that the collected vmlinux file and hence distribution
packages end up unnecessarily large because of this extra data. The
following is a size comparison of vmlinux v6.0 with and without the
relocation information:

  | Configuration      | With relocs | Stripped relocs |
  | x86_64_defconfig   |       70 MB |           43 MB |
  | +CONFIG_DEBUG_INFO |      818 MB |          367 MB |

Optimize a resulting vmlinux by adding a postlink step that splits the
relocation information into vmlinux.relocs and then strips it from the
vmlinux binary.

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220927084632.14531-1-petr.pavlu@suse.com
2023-06-14 19:54:40 +02:00
Peter Zijlstra
2bd4aa9325 x86/alternative: PAUSE is not a NOP
While chasing ghosts, I did notice that optimize_nops() was replacing
'REP NOP' aka 'PAUSE' with NOP2. This is clearly not right.

Fixes: 6c480f222128 ("x86/alternative: Rewrite optimize_nops() some")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/linux-next/20230524130104.GR83892@hirez.programming.kicks-ass.net/
2023-06-14 19:02:54 +02:00
Steven Rostedt (Google)
9350a629e8 x86/alternatives: Add cond_resched() to text_poke_bp_batch()
Debugging in the kernel has started slowing down the kernel by a
noticeable amount. The ftrace start up tests are triggering the softlockup
watchdog on some boxes. This is caused by the start up tests that enable
function and function graph tracing several times. Sprinkling
cond_resched() just in the start up test code was not enough to stop the
softlockup from triggering. It would sometimes trigger in the
text_poke_bp_batch() code.

When function tracing enables all functions, it will call
text_poke_queue() to queue the places that need to be patched. Every
256 entries will do a "flush" that calls text_poke_bp_batch() to do the
update of the 256 locations. As this is in a scheduleable context,
calling cond_resched() at the start of text_poke_bp_batch() will ensure
that other tasks could get a chance to run while the patching is
happening. This keeps the softlockup from triggering in the start up
tests.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230531092419.4d051374@rorschach.local.home
2023-06-14 18:50:00 +02:00