17008 Commits

Author SHA1 Message Date
Mike Galbraith
5849cdf8c1 x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
Commit in Fixes: added support for kexec-ing a kernel on panic using a
new system call. As part of it, it does prepare a memory map for the new
kernel.

However, while doing so, it wrongly accesses memory it has not
allocated: it accesses the first element of the cmem->ranges[] array in
memmap_exclude_ranges() but it has not allocated the memory for it in
crash_setup_memmap_entries(). As KASAN reports:

  BUG: KASAN: vmalloc-out-of-bounds in crash_setup_memmap_entries+0x17e/0x3a0
  Write of size 8 at addr ffffc90000426008 by task kexec/1187

  (gdb) list *crash_setup_memmap_entries+0x17e
  0xffffffff8107cafe is in crash_setup_memmap_entries (arch/x86/kernel/crash.c:322).
  317                                      unsigned long long mend)
  318     {
  319             unsigned long start, end;
  320
  321             cmem->ranges[0].start = mstart;
  322             cmem->ranges[0].end = mend;
  323             cmem->nr_ranges = 1;
  324
  325             /* Exclude elf header region */
  326             start = image->arch.elf_load_addr;
  (gdb)

Make sure the ranges array becomes a single element allocated.

 [ bp: Write a proper commit message. ]

Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call")
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/725fa3dc1da2737f0f6188a1a9701bead257ea9d.camel@gmx.de
2021-04-20 17:32:46 +02:00
Wanpeng Li
2b519b5797 x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMP
Enable PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's
move it inside CONFIG_SMP. In addition, we can avoid define and
alloc __pv_cpu_mask when !CONFIG_SMP and get rid of 'alloc' variable
in kvm_alloc_cpumask.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:45 -04:00
Ricardo Neri
250b3c0d79 x86/cpu: Add helper function to get the type of the current hybrid CPU
On processors with Intel Hybrid Technology (i.e., one having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features on CPUID. Thus, all
software can run on any CPU without restrictions. However, there may be
model-specific differences among types of CPUs. For instance, each type
of CPU may support a different number of performance counters. Also,
machine check error banks may be wired differently. Even though most
software will not care about these differences, kernel subsystems
dealing with these differences must know.

Add and expose a new helper function get_this_hybrid_cpu_type() to query
the type of the current hybrid CPU. The function will be used later in
the perf subsystem.

The Intel Software Developer's Manual defines the CPU type as 8-bit
identifier.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:23 +02:00
Walter Wu
02c587733c kasan: remove redundant config option
CONFIG_KASAN_STACK and CONFIG_KASAN_STACK_ENABLE both enable KASAN stack
instrumentation, but we should only need one config, so that we remove
CONFIG_KASAN_STACK_ENABLE and make CONFIG_KASAN_STACK workable.  see [1].

When enable KASAN stack instrumentation, then for gcc we could do no
prompt and default value y, and for clang prompt and default value n.

This patch fixes the following compilation warning:

  include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not defined, evaluates to 0 [-Wundef]

[akpm@linux-foundation.org: fix merge snafu]

Link: https://bugzilla.kernel.org/show_bug.cgi?id=210221 [1]
Link: https://lkml.kernel.org/r/20210226012531.29231-1-walter-zh.wu@mediatek.com
Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16 16:10:36 -07:00
Marco Elver
fb6cc127e0 signal: Introduce TRAP_PERF si_code and si_perf to siginfo
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
2021-04-16 16:32:41 +02:00
Mike Travis
26d4be3ea1 x86/platform/uv: Use x2apic enabled bit as set by BIOS to indicate APIC mode
BIOS now sets the x2apic enabled bit (and the ACPI table) for extended
APIC modes. Use that bit to indicate if extended mode is set.

 [ bp: Fixup subject prefix, merge subsequent fix
   https://lkml.kernel.org/r/20210415220626.223955-1-mike.travis@hpe.com ]

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210408160047.1703-1-mike.travis@hpe.com
2021-04-16 12:51:03 +02:00
Alison Schofield
2c88d45edb x86, sched: Treat Intel SNC topology as default, COD as exception
Commit 1340ccfa9a9a ("x86,sched: Allow topologies where NUMA nodes
share an LLC") added a vendor and model specific check to never
call topology_sane() for Intel Skylake Server systems where NUMA
nodes share an LLC.

Intel Ice Lake and Sapphire Rapids CPUs also enumerate an LLC that is
shared by multiple NUMA nodes. The LLC on these CPUs is shared for
off-package data access but private to the NUMA node for on-package
access. Rather than managing a list of allowable SNC topologies, make
this SNC topology the default, and treat Intel's Cluster-On-Die (COD)
topology as the exception.

In SNC mode, Sky Lake, Ice Lake, and Sapphire Rapids servers do not
emit this warning:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210310190233.31752-1-alison.schofield@intel.com
2021-04-15 18:34:20 +02:00
Mike Rapoport
c361e5d4d0 x86/setup: Move trim_snb_memory() later in setup_arch() to fix boot hangs
Commit

  a799c2bd29d1 ("x86/setup: Consolidate early memory reservations")

moved reservation of the memory inaccessible by Sandy Bride integrated
graphics very early, and, as a result, on systems with such devices
the first 1M was reserved by trim_snb_memory() which prevented the
allocation of the real mode trampoline and made the boot hang very
early.

Since the purpose of trim_snb_memory() is to prevent problematic pages
ever reaching the graphics device, it is safe to reserve these pages
after memblock allocations are possible.

Move trim_snb_memory() later in boot so that it will be called after
reserve_real_mode() and make comments describing trim_snb_memory()
operation more elaborate.

 [ bp: Massage a bit. ]

Fixes: a799c2bd29d1 ("x86/setup: Consolidate early memory reservations")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/f67d3e03-af90-f790-baf4-8d412fe055af@infradead.org
2021-04-14 08:16:48 +02:00
Daniel Vetter
213cc929cb Merge drm/drm-fixes into drm-next
msm-next pull request has a baseline with stuff from -fixes, roll
forward first.

Some simple conflicts in amdgpu, ttm and one in i915 where git gets
confused and tries to add the same function twice.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2021-04-13 23:15:09 +02:00
Rafael J. Wysocki
6998a8800d ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
Commit 1a1c130ab757 ("ACPI: tables: x86: Reserve memory occupied by
ACPI tables") attempted to address an issue with reserving the memory
occupied by ACPI tables, but it broke the initrd-based table override
mechanism relied on by multiple users.

To restore the initrd-based ACPI table override functionality, move
the acpi_boot_table_init() invocation in setup_arch() on x86 after
the acpi_table_upgrade() one.

Fixes: 1a1c130ab757 ("ACPI: tables: x86: Reserve memory occupied by ACPI tables")
Reported-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-13 16:03:01 +02:00
Wei Yongjun
523caed9ef x86/sgx: Mark sgx_vepc_vm_ops static
Fix the following sparse warning:

  arch/x86/kernel/cpu/sgx/virt.c:95:35: warning:
    symbol 'sgx_vepc_vm_ops' was not declared. Should it be static?

This symbol is not used outside of virt.c so mark it static.

 [ bp: Massage commit message. ]

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210412160023.193850-1-weiyongjun1@huawei.com
2021-04-12 19:48:32 +02:00
Linus Torvalds
06f838e02d - Fix the vDSO exception handling return path to disable interrupts
again.
 
 - A fix for the CE collector to return the proper return values to its
 callers which are used to convey what the collector has done with the
 error address.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmByunsACgkQEsHwGGHe
 VUo+rRAAmhs1CRMKMcha3KoQM5e3QUk8dA8xYuuHa9UJve6r2HXzSwldAGpYmSKS
 v3Pcdeue0INovp+HWSe1UJa/U6ugQ6KcjGy+xMx01VHAuWjAv/O7wMDRfxMDOnJI
 XmgXJG6IhjZUlRuD7BNkFRkUnsk5dABFTlm3OXcpmOyXBsvRPm2M6n4/ILjIlYI+
 kZCyPf0wmR2VpmwCAkhye1tdWBBmT3I3DNwgq15bhAGf6Eh7fqcieqRmBgwYpHhJ
 bOKx7WeRJa4VayV7uvRId9MAyhi9MY66Mb+CIsK0sxkcza2KizquwapN5zUNKpu2
 i24huaNDljB8n0EV8ZJZpI9Xs9QJUBYL10w3LvaSwEySwnN7QrTWzEn5/gYAS7+J
 wR4og5eDMGzgojZi56adQdnrg3thkGPviikU2lUbXo0mpeoT5I6zaQYdkbBq9r9/
 g6LhM86dOeXqpFDPwSRKCoUgiARDoj+woi+4GF1Hc+bIaffP46K4FnOEUODePS3c
 EXWEpJC2DGZq+QfXBViJKcrQi+0/n9jDD6hY5N4TBsyxuN4iUX60rLiMwNJiphmI
 xMwd7Gcr92K3yiEd7zkav2ncuqBk/OCSadubaDyMQFb0F95evBv09yQKN/RImmZq
 Ywt83UG4x+OXIlbQpAXkgLGMhFkH1GtQJ2DOssT6zrw2PFpjP5w=
 =aV+H
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the vDSO exception handling return path to disable interrupts
   again.

 - A fix for the CE collector to return the proper return values to its
   callers which are used to convey what the collector has done with the
   error address.

* tag 'x86_urgent_for_v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Correct exc_general_protection() and math_error() return paths
  RAS/CEC: Correct ce_add_elem()'s returned values
2021-04-11 11:42:18 -07:00
Thomas Tai
632a1c209b x86/traps: Correct exc_general_protection() and math_error() return paths
Commit

  334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling")

added return statements which bypass calling cond_local_irq_disable().

According to

  ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code"),

cond_local_irq_disable() is needed because the asm return code no longer
disables interrupts. Follow the existing code as an example to use "goto
exit" instead of "return" statement.

 [ bp: Massage commit message. ]

Fixes: 334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling")
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/1617902914-83245-1-git-send-email-thomas.tai@oracle.com
2021-04-09 13:45:09 +02:00
Jarkko Sakkinen
ae40aaf6bd x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()
The commit in Fixes: changed the SGX EPC page sanitization to end up in
sgx_free_epc_page() which puts clean and sanitized pages on the free
list.

This was done for the reason that it is best to keep the logic to assign
available-for-use EPC pages to the correct NUMA lists in a single
location.

sgx_nr_free_pages is also incremented by sgx_free_epc_pages() but those
pages which are being added there per EPC section do not belong to the
free list yet because they haven't been sanitized yet - they land on the
dirty list first and the sanitization happens later when ksgxd starts
massaging them.

So remove that addition there and have sgx_free_epc_page() do that
solely.

 [ bp: Sanitize commit message too. ]

Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list")
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210408092924.7032-1-jarkko@kernel.org
2021-04-08 17:24:42 +02:00
Yang Li
dda451f391 x86/cacheinfo: Remove unneeded dead-store initialization
$ make CC=clang clang-analyzer

(needs clang-tidy installed on the system too)

on x86_64 defconfig triggers:

  arch/x86/kernel/cpu/cacheinfo.c:880:24: warning: Value stored to 'this_cpu_ci' \
	  during its initialization is never read [clang-analyzer-deadcode.DeadStores]
        struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
                              ^
  arch/x86/kernel/cpu/cacheinfo.c:880:24: note: Value stored to 'this_cpu_ci' \
	during its initialization is never read

So simply remove this unneeded dead-store initialization.

As compilers will detect this unneeded assignment and optimize this
anyway the resulting object code is identical before and after this
change.

No functional change. No change to object code.

 [ bp: Massage commit message. ]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lkml.kernel.org/r/1617177624-24670-1-git-send-email-yang.lee@linux.alibaba.com
2021-04-07 21:12:12 +02:00
Vitaly Kuznetsov
fa26d0c778 ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182bd7 ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails.

The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the later from assembly) but it seems putting the whole pattern into a
new function and exporting it instead is better.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8cdddd182bd7 ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:02:43 +02:00
Sean Christopherson
b3754e5d3d x86/sgx: Move provisioning device creation out of SGX driver
And extract sgx_set_attribute() out of sgx_ioc_enclave_provision() and
export it as symbol for KVM to use.

The provisioning key is sensitive. The SGX driver only allows to create
an enclave which can access the provisioning key when the enclave
creator has permission to open /dev/sgx_provision. It should apply to
a VM as well, as the provisioning key is platform-specific, thus an
unrestricted VM can also potentially compromise the provisioning key.

Move the provisioning device creation out of sgx_drv_init() to
sgx_init() as a preparation for adding SGX virtualization support,
so that even if the SGX driver is not enabled due to flexible launch
control not being available, SGX virtualization can still be enabled,
and use it to restrict a VM's capability of being able to access the
provisioning key.

 [ bp: Massage commit message. ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/0f4d044d621561f26d5f4ef73e8dc6cd18cc7e79.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:46 +02:00
Sean Christopherson
d155030b1e x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
The host kernel must intercept ECREATE to impose policies on guests, and
intercept EINIT to be able to write guest's virtual SGX_LEPUBKEYHASH MSR
values to hardware before running guest's EINIT so it can run correctly
according to hardware behavior.

Provide wrappers around __ecreate() and __einit() to hide the ugliness
of overloading the ENCLS return value to encode multiple error formats
in a single int.  KVM will trap-and-execute ECREATE and EINIT as part
of SGX virtualization, and reflect ENCLS execution result to guest by
setting up guest's GPRs, or on an exception, injecting the correct fault
based on return value of __ecreate() and __einit().

Use host userspace addresses (provided by KVM based on guest physical
address of ENCLS parameters) to execute ENCLS/EINIT when possible.
Accesses to both EPC and memory originating from ENCLS are subject to
segmentation and paging mechanisms.  It's also possible to generate
kernel mappings for ENCLS parameters by resolving PFN but using
__uaccess_xx() is simpler.

 [ bp: Return early if the __user memory accesses fail, use
   cpu_feature_enabled(). ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20e09daf559aa5e9e680a0b4b5fba940f1bad86e.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:27 +02:00
Kai Huang
73916b6a0c x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRs
Add a helper to update SGX_LEPUBKEYHASHn MSRs.  SGX virtualization also
needs to update those MSRs based on guest's "virtual" SGX_LEPUBKEYHASHn
before EINIT from guest.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/dfb7cd39d4dd62ea27703b64afdd8bccb579f623.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson
a67136b458 x86/sgx: Add encls_faulted() helper
Add a helper to extract the fault indicator from an encoded ENCLS return
value.  SGX virtualization will also need to detect ENCLS faults.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/c1f955898110de2f669da536fc6cf62e003dff88.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson
9c55c78a73 x86/sgx: Move ENCLS leaf definitions to sgx.h
Move the ENCLS leaf definitions to sgx.h so that they can be used by
KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/2e6cd7c5c1ced620cfcd292c3c6c382827fde6b2.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson
8ca52cc38d x86/sgx: Expose SGX architectural definitions to the kernel
Expose SGX architectural structures, as KVM will use many of the
architectural constants and structs to virtualize SGX.

Name the new header file as asm/sgx.h, rather than asm/sgx_arch.h, to
have single header to provide SGX facilities to share with other kernel
componments. Also update MAINTAINERS to include asm/sgx.h.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/6bf47acd91ab4d709e66ad1692c7803e4c9063a0.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Kai Huang
faa7d3e6f3 x86/sgx: Initialize virtual EPC driver even when SGX driver is disabled
Modify sgx_init() to always try to initialize the virtual EPC driver,
even if the SGX driver is disabled.  The SGX driver might be disabled
if SGX Launch Control is in locked mode, or not supported in the
hardware at all.  This allows (non-Linux) guests that support non-LC
configurations to use SGX.

 [ bp: De-silli-fy the test. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/d35d17a02bbf8feef83a536cec8b43746d4ea557.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson
332bfc7bec x86/cpu/intel: Allow SGX virtualization without Launch Control support
The kernel will currently disable all SGX support if the hardware does
not support launch control.  Make it more permissive to allow SGX
virtualization on systems without Launch Control support.  This will
allow KVM to expose SGX to guests that have less-strict requirements on
the availability of flexible launch control.

Improve error message to distinguish between three cases.  There are two
cases where SGX support is completely disabled:
1) SGX has been disabled completely by the BIOS
2) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is unavailable (because of
   Kconfig).
One where it is partially available:
3) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is supported.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/b3329777076509b3b601550da288c8f3c406a865.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson
540745ddbc x86/sgx: Introduce virtual EPC for use by KVM guests
Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virutal EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.

 [ bp:
   - Massage commit message and comments
   - use cpu_feature_enabled()
   - vertically align struct members init
   - massage Virtual EPC clarification text
   - move Kconfig prompt to Virtualization ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:17 +02:00
Rafael J. Wysocki
91463ebff3 Merge branches 'acpi-tables' and 'acpi-scan'
* acpi-tables:
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

* acpi-scan:
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies
2021-04-02 16:57:56 +02:00
Peter Zijlstra
23c1ad538f x86/alternatives: Optimize optimize_nops()
Currently, optimize_nops() scans to see if the alternative starts with
NOPs. However, the emit pattern is:

  141:	\oldinstr
  142:	.skip (len-(142b-141b)), 0x90

That is, when 'oldinstr' is short, the tail is padded with NOPs. This case
never gets optimized.

Rewrite optimize_nops() to replace any trailing string of NOPs inside
the alternative to larger NOPs. Also run it irrespective of patching,
replacing NOPs in both the original and replaced code.

A direct consequence is that 'padlen' becomes superfluous, so remove it.

 [ bp:
   - Adjust commit message
   - remove a stale comment about needing to pad
   - add a comment in optimize_nops()
   - exit early if the NOP verif. loop catches a mismatch - function
     should not not add NOPs in that case
   - fix the "optimized NOPs" offsets output ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.442992235@infradead.org
2021-04-02 12:41:17 +02:00
Ingo Molnar
b1f480bc06 Merge branch 'x86/cpu' into WIP.x86/core, to merge the NOP changes & resolve a semantic conflict
Conflict-merge this main commit in essence:

  a89dfde3dc3c: ("x86: Remove dynamic NOP selection")

With this upstream commit:

  b90829704780: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

Semantic merge conflict:

  arch/x86/net/bpf_jit_comp.c

  - memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
  + memcpy(prog, x86_nops[5], X86_PATCH_SIZE);

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:36:30 +02:00
Ingo Molnar
e855e80d00 Linux 5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBhB7AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCPUH+KKkSoOlN2YNu1oc
 iy2nznwZoSQTk5ZLz7PypO/WWmmtgzudkObG7yqIURdrncsAkHR17Wu2P7rdBr1j
 Ma+VhF9MQ+xx+r86upH7c3gYfhyfdUMvzuLy0rwLQ1Yrzrb7xFcVkj3BHk54TAQA
 w05sRPuVJ3/c/HPYV2iXkkdnnMbXSTCebeDDwjFb9D3qagr4vcd/PjDHmGbfNF8R
 o6gLpbK5Ly6ww1nth9gGGUjzrW95yVItvcroP6vQWljxhuy+NE1lXRm8LsGhxqtW
 foFFptJup5nhSNJXWtQt/U3huVD6mZ3W3y9cOThPjXZRy2wva3I1IpBKoEFReUpG
 /Tq8EA==
 =tPUY
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc5' into WIP.x86/core, to pick up recent NOP related changes

In particular we want to have this upstream commit:

  b90829704780: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

... before merging in x86/cpu changes and the removal of the NOP optimizations, and
applying PeterZ's !retpoline objtool series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:33:16 +02:00
Vitaly Kuznetsov
8cdddd182b ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit 496121c02127 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):

 # echo 0 > /sys/devices/system/cpu/cpu0/online
 # echo 1 > /sys/devices/system/cpu/cpu0/online
 <10 seconds delay>
 -bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, to wakeup CPU an NMI is being used and
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
 - NMI is sent to CPU0
 - wakeup_cpu0_nmi() works as expected
 - we get back to while (1) loop in acpi_idle_play_dead()
 - safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is add the special handling for CPU0 on x86
and that's what the patch is doing.

Fixes: 496121c02127 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-01 13:37:55 +02:00
Borislav Petkov
f2ac256b9a Merge 'x86/alternatives'
Pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-03-31 18:04:19 +02:00
Peter Zijlstra
52fa82c21f x86: Add insn_decode_kernel()
Add a helper to decode kernel instructions; there's no point in
endlessly repeating those last two arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210326151259.379242587@infradead.org
2021-03-31 16:20:22 +02:00
Thomas Gleixner
9a98bc2cf0 x86/vector: Add a sanity check to prevent IRQ2 allocations
To prevent another incidental removal of the IRQ2 ignore logic in the
IO/APIC code going unnoticed add a sanity check. Add some commentry at the
other place which ignores IRQ2 while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210318192819.795280387@linutronix.de
2021-03-30 00:39:12 +02:00
Rafael J. Wysocki
1a1c130ab7 ACPI: tables: x86: Reserve memory occupied by ACPI tables
The following problem has been reported by George Kennedy:

 Since commit 7fef431be9c9 ("mm/page_alloc: place pages to tail
 in __free_pages_core()") the following use after free occurs
 intermittently when ACPI tables are accessed.

 BUG: KASAN: use-after-free in ibft_init+0x134/0xc49
 Read of size 4 at addr ffff8880be453004 by task swapper/0/1
 CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc1-7a7fd0d #1
 Call Trace:
  dump_stack+0xf6/0x158
  print_address_description.constprop.9+0x41/0x60
  kasan_report.cold.14+0x7b/0xd4
  __asan_report_load_n_noabort+0xf/0x20
  ibft_init+0x134/0xc49
  do_one_initcall+0xc4/0x3e0
  kernel_init_freeable+0x5af/0x66b
  kernel_init+0x16/0x1d0
  ret_from_fork+0x22/0x30

 ACPI tables mapped via kmap() do not have their mapped pages
 reserved and the pages can be "stolen" by the buddy allocator.

Apparently, on the affected system, the ACPI table in question is
not located in "reserved" memory, like ACPI NVS or ACPI Data, that
will not be used by the buddy allocator, so the memory occupied by
that table has to be explicitly reserved to prevent the buddy
allocator from using it.

In order to address this problem, rearrange the initialization of the
ACPI tables on x86 to locate the initial tables earlier and reserve
the memory occupied by them.

The other architectures using ACPI should not be affected by this
change.

Link: https://lore.kernel.org/linux-acpi/1614802160-29362-1-git-send-email-george.kennedy@oracle.com/
Reported-by: George Kennedy <george.kennedy@oracle.com>
Tested-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
2021-03-29 19:26:04 +02:00
Fenghua Yu
ebb1064e7c x86/traps: Handle #DB for bus lock
Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.

split_lock_detect=
		#AC for split lock		#DB for bus lock

off		Do nothing			Do nothing

warn		Kernel OOPs			Warn once per task and
		Warn once per task and		and continues to run.
		disable future checking
	 	When both features are
		supported, warn in #AC

fatal		Kernel OOPs			Send SIGBUS to user.
		Send SIGBUS to user
		When both features are
		supported, fatal in #AC

ratelimit:N	Do nothing			Limit bus lock rate to
						N per second in the
						current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detect when CPL>0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.
So no need to handle #DB for bus lock detected in the kernel.

#DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR
while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for bus lock in userspace clears DR6[11], which is set by
the #DB handler right after reading DR6.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com
2021-03-28 22:52:15 +02:00
Lai Jiangshan
1591584e2e x86/process/64: Move cpu_current_top_of_stack out of TSS
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
through the cpu_entry_area which is visible with user CR3 when PTI is
enabled and active.

This makes it a coveted fruit for attackers.  An attacker can fetch the
kernel stack top from it and continue next steps of actions based on the
kernel stack.

But it is actualy not necessary to be stored in the TSS.  It is only
accessed after the entry code switched to kernel CR3 and kernel GS_BASE
which means it can be in any regular percpu variable.

The reason why it is in TSS is historical (pre PTI) because TSS is also
used as scratch space in SYSCALL_64 and therefore cache hot.

A syscall also needs the per CPU variable current_task and eventually
__preempt_count, so placing cpu_current_top_of_stack next to them makes it
likely that they end up in the same cache line which should avoid
performance regressions. This is not enforced as the compiler is free to
place these variables, so these entry relevant variables should move into
a data structure to make this enforceable.

The seccomp_benchmark doesn't show any performance loss in the "getpid
native" test result.  Actually, the result changes from 93ns before to 92ns
with this change when KPTI is disabled. The test is very stable and
although the test doesn't show a higher degree of precision it gives enough
confidence that moving cpu_current_top_of_stack does not cause a
regression.

[ tglx: Removed unneeded export. Massaged changelog ]

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
2021-03-28 22:40:10 +02:00
Alexey Makhalov
0b4a285e2c x86/vmware: Avoid TSC recalibration when frequency is known
When the TSC frequency is known because it is retrieved from the
hypervisor, skip TSC refined calibration by setting X86_FEATURE_TSC_KNOWN_FREQ.

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210105004752.131069-1-amakhalov@vmware.com
2021-03-28 21:11:43 +02:00
Sean Christopherson
231d3dbdda x86/sgx: Add SGX_CHILD_PRESENT hardware error code
SGX driver can accurately track how enclave pages are used.  This
enables SECS to be specifically targeted and EREMOVE'd only after all
child pages have been EREMOVE'd.  This ensures that SGX driver will
never encounter SGX_CHILD_PRESENT in normal operation.

Virtual EPC is different.  The host does not track how EPC pages are
used by the guest, so it cannot guarantee EREMOVE success.  It might,
for instance, encounter a SECS with a non-zero child count.

Add a definition of SGX_CHILD_PRESENT.  It will be used exclusively by
the SGX virtualization driver to handle recoverable EREMOVE errors when
saniziting EPC pages after they are freed.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/050b198e882afde7e6eba8e6a0d4da39161dbb5a.1616136308.git.kai.huang@intel.com
2021-03-26 22:51:36 +01:00
Kai Huang
b0c7459be0 x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
EREMOVE takes a page and removes any association between that page and
an enclave. It must be run on a page before it can be added into another
enclave. Currently, EREMOVE is run as part of pages being freed into the
SGX page allocator. It is not expected to fail, as it would indicate a
use-after-free of EPC pages. Rather than add the page back to the pool
of available EPC pages, the kernel intentionally leaks the page to avoid
additional errors in the future.

However, KVM does not track how guest pages are used, which means that
SGX virtualization use of EREMOVE might fail. Specifically, it is
legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
KVM guest, because KVM/kernel doesn't track SECS pages.

To allow SGX/KVM to introduce a more permissive EREMOVE helper and
to let the SGX virtualization code use the allocator directly, break
out the EREMOVE call from the SGX page allocator. Rename the original
sgx_free_epc_page() to sgx_encl_free_epc_page(), indicating that
it is used to free an EPC page assigned to a host enclave. Replace
sgx_free_epc_page() with sgx_encl_free_epc_page() in all call sites so
there's no functional change.

At the same time, improve the error message when EREMOVE fails, and
add documentation to explain to the user what that failure means and
to suggest to the user what to do when this bug happens in the case it
happens.

 [ bp: Massage commit message, fix typos and sanitize text, simplify. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210325093057.122834-1-kai.huang@intel.com
2021-03-26 22:51:23 +01:00
Sean Christopherson
b8921dccf3 x86/cpufeatures: Add SGX1 and SGX2 sub-features
Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
features, since adding a new leaf for only two bits would be wasteful.
As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
guest, and to do so correctly needs to query hardware and kernel support
for SGX1 and SGX2.

Suppress both SGX1 and SGX2 from /proc/cpuinfo. SGX1 basically means
SGX, and for SGX2 there is no concrete use case of using it in
/proc/cpuinfo.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d787827dbfca6b3210ac3e432e3ac1202727e786.1616136308.git.kai.huang@intel.com
2021-03-25 17:33:11 +01:00
Kai Huang
e9a15a40e8 x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
Move SGX_LC feature bit to CPUID dependency table to make clearing all
SGX feature bits easier. Also remove clear_sgx_caps() since it is just
a wrapper of setup_clear_cpu_cap(X86_FEATURE_SGX) now.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/5d4220fd0a39f52af024d3fa166231c1d498dd10.1616136308.git.kai.huang@intel.com
2021-03-25 17:33:11 +01:00
Wei Yongjun
2304d14db6 x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
Address this GCC warning:

  arch/x86/kernel/kprobes/core.c:940:1:
   warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
    940 | static int nokprobe_inline kprobe_is_ss(struct kprobe_ctlblk *kcb)
        | ^~~~~~

[ mingo: Tidied up the changelog. ]

Fixes: 6256e668b7af: ("x86/kprobes: Use int3 instead of debug trap for single-step")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210324144502.1154883-1-weiyongjun1@huawei.com
2021-03-25 13:07:58 +01:00
Masami Hiramatsu
2f706e0e5e x86/kprobes: Fix to identify indirect jmp and others using range case
Fix can_boost() to identify indirect jmp and others using range case
correctly.

Since the condition in switch statement is opcode & 0xf0, it can not
evaluate to 0xff case. This should be under the 0xf0 case. However,
there is no reason to use the conbinations of the bit-masked condition
and lower bit checking.

Use range case to clean up the switch statement too.

Fixes: 6256e668b7 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/161666692308.1120877.4675552834049546493.stgit@devnote2
2021-03-25 11:37:22 +01:00
Masami Hiramatsu
6dd3b8c9f5 x86/kprobes: Fix to check non boostable prefixes correctly
There are 2 bugs in the can_boost() function because of using
x86 insn decoder. Since the insn->opcode never has a prefix byte,
it can not find CS override prefix in it. And the insn->attr is
the attribute of the opcode, thus inat_is_address_size_prefix(
insn->attr) always returns false.

Fix those by checking each prefix bytes with for_each_insn_prefix
loop and getting the correct attribute for each prefix byte.
Also, this removes unlikely, because this is a slow path.

Fixes: a8d11cd0714f ("kprobes/x86: Consolidate insn decoder users for copying code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/161666691162.1120877.2808435205294352583.stgit@devnote2
2021-03-25 11:37:21 +01:00
Ira Weiny
633b0616cf x86/sgx: Remove unnecessary kmap() from sgx_ioc_enclave_init()
kmap() is inefficient and is being replaced by kmap_local_page(), if
possible. There is no readily apparent reason why initp_page needs to be
allocated and kmap'ed() except that 'sigstruct' needs to be page-aligned
and 'token' 512 byte-aligned.

Rather than change it to kmap_local_page(), use kmalloc() instead
because kmalloc() can give this alignment when allocating PAGE_SIZE
bytes.

Remove the alloc_page()/kmap() and replace with kmalloc(PAGE_SIZE, ...)
to get a page aligned kernel address.

In addition, add a comment to document the alignment requirements so that
others don't attempt to 'fix' this again.

 [ bp: Massage commit message. ]

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210324182246.2484875-1-ira.weiny@intel.com
2021-03-25 09:50:32 +01:00
Sunil Muthuswamy
6dc2a774cb x86/Hyper-V: Support for free page reporting
Linux has support for free page reporting now (36e66c554b5c) for
virtualized environment. On Hyper-V when virtually backed VMs are
configured, Hyper-V will advertise cold memory discard capability,
when supported. This patch adds the support to hook into the free
page reporting infrastructure and leverage the Hyper-V cold memory
discard hint hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Tested-by: Matheus Castello <matheus@castello.eng.br>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/SN4PR2101MB0880121FA4E2FEC67F35C1DCC0649@SN4PR2101MB0880.namprd21.prod.outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-24 11:35:24 +00:00
Borislav Petkov
2ffdc2c344 x86/mce/inject: Add IPID for injection too
Add an injection file in order to specify the IPID too when injecting
an error. One use case example is using the machinery to decode MCEs
collected from other machines.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210314201806.12798-1-bp@alien8.de
2021-03-24 00:04:45 +01:00
Mike Rapoport
4c674481dc x86/setup: Merge several reservations of start of memory
Currently, the first several pages are reserved both to avoid leaking
their contents on systems with L1TF and to avoid corrupting BIOS memory.

Merge the two memory reservations.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210302100406.22059-3-rppt@kernel.org
2021-03-23 17:17:36 +01:00
Mike Rapoport
a799c2bd29 x86/setup: Consolidate early memory reservations
The early reservations of memory areas used by the firmware, bootloader,
kernel text and data are spread over setup_arch(). Moreover, some of them
happen *after* memblock allocations, e.g trim_platform_memory_ranges() and
trim_low_memory_range() are called after reserve_real_mode() that allocates
memory.

There was no corruption of these memory regions because memblock always
allocates memory either from the end of memory (in top-down mode) or above
the kernel image (in bottom-up mode). However, the bottom up mode is going
to be updated to span the entire memory [1] to avoid limitations caused by
KASLR.

Consolidate early memory reservations in a dedicated function to improve
robustness against future changes. Having the early reservations in one
place also makes it clearer what memory must be reserved before memblock
allocations are allowed.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: [1] https://lore.kernel.org/lkml/20201217201214.3414100-2-guro@fb.com
Link: https://lkml.kernel.org/r/20210302100406.22059-2-rppt@kernel.org
2021-03-23 17:13:17 +01:00
Masami Hiramatsu
6256e668b7 x86/kprobes: Use int3 instead of debug trap for single-step
Use int3 instead of debug trap exception for single-stepping the
probed instructions. Some instructions which change the ip
registers or modify IF flags are emulated because those are not
able to be single-stepped by int3 or may allow the interrupt
while single-stepping.

This actually changes the kprobes behavior.

- kprobes can not probe following instructions; int3, iret,
  far jmp/call which get absolute address as immediate,
  indirect far jmp/call, indirect near jmp/call with addressing
  by memory (register-based indirect jmp/call are OK), and
  vmcall/vmlaunch/vmresume/vmxoff.

- If the kprobe post_handler doesn't set before registering,
  it may not be called in some case even if you set it afterwards.
  (IOW, kprobe booster is enabled at registration, user can not
   change it)

But both are rare issue, unsupported instructions will not be
used in the kernel (or rarely used), and post_handlers are
rarely used (I don't see it except for the test code).

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/161469874601.49483.11985325887166921076.stgit@devnote2
2021-03-23 16:07:56 +01:00