IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
- Follow up fixes for the BHI mitigations code.
- Fix !SPECULATION_MITIGATIONS bug not turning off
mitigations as expected.
- Work around an APIC emulation bug when the kernel is built with
Clang and run as a SEV guest.
- Follow up x86 topology fixes.
Note that there's minor cleanups included in the BHI fixes,
which we'd normally delay to the next merge window, but the
BHI mitigations code is new and will be backported widely,
so we thought it would be better to have a unified codebase
at this stage. (Let me know if that assumption is wrong and
I'll rebase it.)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbmmkRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iArg//R2VYSVgAfUf4cN3rTm8RuX8U/8V4A1C/
FklDWziEWTw8DhQCidiIt7ETuMjvOf7HFzVTfLP+oOiBxHGMCEjxtf74mHRaeZ+f
avPvwOfp3mrIdn/h2+t7pSYxZDf310EDvarliIXq0z2hmNjmQjKXo3dyWNiWJe9i
LFzST8dRyR0Tg4MjLuY2g9ZEauRHpY6aAVk9UrRi7uaiLyoHnLBASn1DCL1pmMhT
cqKfHhilkeUMYwTXbDTu1iIQwBHqcpCmUp2h6VPpxPkTxJXwzo0E4lVuFVu7tzUM
yrMZrNxhtC5Cg9NVF56AqZocKjutlJfWXnmuvpc+dM7z1dF/EwFbx2ZM+3PDuw8Z
uODs6bVzYlhzWxTMb9Obsp3RvHe1B7ZCFCZ8uo3G9lMFXqu047UwfuwqnZw4YpX6
CDEKz3zLbV4s64HPlvbju3CpX+m9CXhg5cR6HfCJiwYCytb1bxZU0ottnPnYxVCb
9Th7wi2f1sCZvtQ/T8aZpF7FhYe7abt/CDvlDoDxiRg1f+Z2nduznfDJH81nn1KS
duZEicG0O+iYi9OCPVBTVutRxSWA6D8D4F0VN7cDRr4QHyXpOFJ1KsZyHBbZo9I6
ubp9RiFNaocgdWiWVgEpvvmlYpVPDnR38oVHNPpNAyIYHu3mlCoS76znZTNX8WfJ
GlvHqOp+EM0=
=2ig7
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Follow up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
data in certain circumstances.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbkE8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gZURAAjtG4+vnthjGFIC9HSKPFa95aOhAd0tCE
NfW8HDTuA1zhK0MoKU2cQkiE/1LVcrSS60UomYUhwB7Hrdx+HCz81W9CMrwyzkY1
0MhiUiq2cSYo0FsWflzipOoFffyY2THrlimqILQmhzo5JlbYA8WNTxAWs6Th651e
an5GZJvPq98LhhWEbULiiT+2GGYi4CtbVtZvo+FyCJvYcgxF70EeJ/jvxBUax4yW
23r+uxruo3kuiJoBAs3+OBWt/C+Ij/YF9tKqOW1XdXpPxYAbGKtYw2Ck0EIkSVHS
kilU0ig4TMRCUmUXXSqWnTxheZBF7eXwu9cMKdnqydLjazuO5M8uh+yiZEuA/UFA
I5YGBmuYe/Q+CaZmobFtRyvRZZE3kF1xTxyyw2vJMDjbW1FOppXPrdFrnbav5/h0
pDxBz7H5f6L/Egyi173cRY8dzcKTjP0adtczb1M5Q6BxnuCO4I0P7RUIg41tabxg
YuojtmIGAjhMXVXmC9VUabVDzB2o5vYXbZ4xTM6uG5XG7/2IX0nOJidwhjotw+1D
6LK4U1kcaY8kqolYiUt5zyEc96CPvFwM7w8592Ohxs5arROYEgz0QnjGnmRu8OsP
yV5U8lz+aoX2Cwh6n06XJB5GJqWX5BzkJ+D9q/mRqPRL8/BDscD2WcHkKdRE89T6
YkecmTrHYD0=
=L86s
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
"Fix the x86 PMU multi-counter code returning invalid data in certain
circumstances"
* tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix out of range data
The topology rework missed that early_init_amd() tries to re-enable the
Topology Extensions when the BIOS disabled them.
The new parser is invoked before early_init_amd() so the re-enable attempt
happens too late.
Move it into the AMD specific topology parser code where it belongs.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
A system with NODEID_MSR was reported to crash during early boot without
any output.
The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.
As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.
The magic world of C, unions and bitfields:
union {
u64 bita : 3,
bitb : 3;
u64 all;
} x;
x.all = foo();
a = x.bita;
b = x.bitb;
results in the effective executable code of:
a = b = x.bita;
because bita and bitb are treated as union members and therefore both end
up at bit offset 0.
Wrapping the bitfield into an anonymous struct:
union {
struct {
u64 bita : 3,
bitb : 3;
};
u64 all;
} x;
works like expected.
Rework the NODEID_MSR union in exactly that way to cure the problem.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de
Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.
That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.
Preset the SMT domain level with the assumption of one thread per core,
which is correct ifrt here are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
For consistency with the other CONFIG_MITIGATION_* options, replace the
CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
CONFIG_MITIGATION_SPECTRE_BHI option.
[ mingo: Fix ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org
Unlike most other mitigations' "auto" options, spectre_bhi=auto only
mitigates newer systems, which is confusing and not particularly useful.
Remove it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmYYYPkTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXnxhB/4/8c7lFT53VbujFmVA5sNvpP5Ji5Xg
ERhVID7tzDyaVRPr+tpPJIW0Oj/t34SH9seoTYCBCM/UABWe/Gxceg2JaoOzIx+l
LHi73T4BBaqExiXbCCFj8N7gLO5P4Xz6ZZRgwHws1KmMXsiWYmYsbv36eSv9x6qK
+z/n6p9/ubKFNj2/vsvfiGmY0XHayD3NM4Y4toMbYE/tuRT8uZ7D5sqWdRf+UhW/
goRDA5qppeSfuaQu2LNVoz1e6wRmeJFv8OHgaPvQqAjTRLzPwwss28HICmKc8gh3
HDDUUJCHSs1XItSGDFip6rIFso5X/ZHO0d6pV75hOKCisd7lV0qH6NIZ
=k62H
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Some cosmetic changes (Erni Sri Satya Vennela, Li Zhijian)
- Introduce hv_numa_node_to_pxm_info() (Nuno Das Neves)
- Fix KVP daemon to handle IPv4 and IPv6 combination for keyfile format
(Shradha Gupta)
- Avoid freeing decrypted memory in a confidential VM (Rick Edgecombe
and Michael Kelley)
* tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Don't free ring buffers that couldn't be re-encrypted
uio_hv_generic: Don't free decrypted memory
hv_netvsc: Don't free decrypted memory
Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl
Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails
hv/hv_kvp_daemon: Handle IPv4 and Ipv6 combination for keyfile format
hv: vmbus: Convert sprintf() family to sysfs_emit() family
mshyperv: Introduce hv_numa_node_to_pxm_info()
x86/hyperv: Cosmetic changes for hv_apic.c
While syscall hardening helps prevent some BHI attacks, there's still
other low-hanging fruit remaining. Don't classify it as a mitigation
and make it clear that the system may still be vulnerable if it doesn't
have a HW or SW mitigation enabled.
Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been
disabled by the Spectre v2 mitigation (or can otherwise be disabled by
the BHI mitigation itself if needed). In that case retpolines are fine.
Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org
So we are using the 'ia32_cap' value in a number of places,
which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register.
But there's very little 'IA32' about it - this isn't 32-bit only
code, nor does it originate from there, it's just a historic
quirk that many Intel MSR names are prefixed with IA32_.
This is already clear from the helper method around the MSR:
x86_read_arch_cap_msr(), which doesn't have the IA32 prefix.
So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with
its role and with the naming of the helper function.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and
over. It's even read in the BHI sysfs function which is a big no-no.
Just read it once and cache it.
Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
topo_set_cpuids() updates cpu_present_map and cpu_possible map. It is
invoked during enumeration and "physical hotplug" operations. In the
latter case this results in a kernel crash because cpu_possible_map is
marked read only after init completes.
There is no reason to update cpu_possible_map in that function. During
enumeration cpu_possible_map is not relevant and gets fully initialized
after enumeration completed. On "physical hotplug" the bit is already set
because the kernel allows only CPUs to be plugged which have been
enumerated and associated to a CPU number during early boot.
Remove the bogus update of cpu_possible_map.
Fixes: 0e53e7b656cf ("x86/cpu/topology: Sanitize the APIC admission logic")
Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx
The definition of spectre_bhi_state() incorrectly returns a const char
* const. This causes the a compiler warning when building with W=1:
warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2812 | static const char * const spectre_bhi_state(void)
Remove the const qualifier from the pointer.
Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com
On x86 each struct cpu_hw_events maintains a table for counter assignment but
it missed to update one for the deleted event in x86_pmu_del(). This
can make perf_clear_dirty_counters() reset used counter if it's called
before event scheduling or enabling. Then it would return out of range
data which doesn't make sense.
The following code can reproduce the problem.
$ cat repro.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.disabled = 1,
};
void *worker(void *arg)
{
int cpu = (long)arg;
int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
void *p;
do {
ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0);
ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0);
munmap(p, 4096);
ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0);
} while (1);
return NULL;
}
int main(void)
{
int i;
int n = sysconf(_SC_NPROCESSORS_ONLN);
pthread_t *th = calloc(n, sizeof(*th));
for (i = 0; i < n; i++)
pthread_create(&th[i], NULL, worker, (void *)(long)i);
for (i = 0; i < n; i++)
pthread_join(th[i], NULL);
free(th);
return 0;
}
And you can see the out of range data using perf stat like this.
Probably it'd be easier to see on a large machine.
$ gcc -o repro repro.c -pthread
$ ./repro &
$ sudo perf stat -A -I 1000 2>&1 | awk '{ if (length($3) > 15) print }'
1.001028462 CPU6 196,719,295,683,763 cycles # 194290.996 GHz (71.54%)
1.001028462 CPU3 396,077,485,787,730 branch-misses # 15804359784.80% of all branches (71.07%)
1.001028462 CPU17 197,608,350,727,877 branch-misses # 14594186554.56% of all branches (71.22%)
2.020064073 CPU4 198,372,472,612,140 cycles # 194681.113 GHz (70.95%)
2.020064073 CPU6 199,419,277,896,696 cycles # 195720.007 GHz (70.57%)
2.020064073 CPU20 198,147,174,025,639 cycles # 194474.654 GHz (71.03%)
2.020064073 CPU20 198,421,240,580,145 stalled-cycles-frontend # 100.14% frontend cycles idle (70.93%)
3.037443155 CPU4 197,382,689,923,416 cycles # 194043.065 GHz (71.30%)
3.037443155 CPU20 196,324,797,879,414 cycles # 193003.773 GHz (71.69%)
3.037443155 CPU5 197,679,956,608,205 stalled-cycles-backend # 1315606428.66% backend cycles idle (71.19%)
3.037443155 CPU5 198,571,860,474,851 instructions # 13215422.58 insn per cycle
It should move the contents in the cpuc->assign as well.
Fixes: 5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0. The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.
Add mitigations against it either with the help of microcode or with
software sequences for the affected CPUs.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYUKPMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofT8EACJJix+GzGUcJjOvfWFZcxwziY152hO
5XSzHOZZL6oz5Yk/Rye/S9RVTN7aDjn1CEvI0cD/ULxaTP869sS9dDdUcHhEJ//5
6hjqWsWiKc1QmLjBy3Pcb97GZHQXM5a9D1f6jXnJD+0FMLbQHpzSEBit0H4tv/TC
75myGgYihvUbhN9/bL10M5fz+UADU42nChvPWDMr9ukljjCqa46tPTmKUIAW5TWj
/xsyf+Nk+4kZpdaidKGhpof6KCV2rNeevvzUGN8Pv5y13iAmvlyplqTcQ6dlubnZ
CuDX5Ji9spNF9WmhKpLgy5N+Ocb64oVHov98N2zw1sT1N8XOYcSM0fBj7SQIFURs
L7T4jBZS+1c3ZGJPPFWIaGjV8w1ZMhelglwJxjY7ZgRD6fK3mwRx/ks54J8H4HjE
FbirXaZLeKlscDIOKtnxxKoIGwpdGwLKQYi/wEw7F9NhCLSj9wMia+j3uYIUEEHr
6xEiYEtyjcV3ocxagH7eiHyrasOKG64vjx2h1XodusBA2Wrvgm/jXlchUu+wb6B4
LiiZJt+DmOdQ1h5j3r2rt3hw7+nWa7kyq34qfN6NSUCHiedp6q7BClueSaKiOCGk
RoNibNiS+CqaxwGxj/RGuvajEJeEMCsLuCxzT3aeaDBsqscW6Ka/HkGA76Tpb5nJ
E3JyjYE7AlG4rw==
=W0W3
-----END PGP SIGNATURE-----
Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mitigations from Thomas Gleixner:
"Mitigations for the native BHI hardware vulnerabilty:
Branch History Injection (BHI) attacks may allow a malicious
application to influence indirect branch prediction in kernel by
poisoning the branch history. eIBRS isolates indirect branch targets
in ring0. The BHB can still influence the choice of indirect branch
predictor entry, and although branch predictor entries are isolated
between modes when eIBRS is enabled, the BHB itself is not isolated
between modes.
Add mitigations against it either with the help of microcode or with
software sequences for the affected CPUs"
[ This also ends up enabling the full mitigation by default despite the
system call hardening, because apparently there are other indirect
calls that are still sufficiently reachable, and the 'auto' case just
isn't hardened enough.
We'll have some more inevitable tweaking in the future - Linus ]
* tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM: x86: Add BHI_NO
x86/bhi: Mitigate KVM by default
x86/bhi: Add BHI mitigation knob
x86/bhi: Enumerate Branch History Injection (BHI) bug
x86/bhi: Define SPEC_CTRL_BHI_DIS_S
x86/bhi: Add support for clearing branch history at syscall entry
x86/syscall: Don't force use of indirect calls for system calls
x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
Intel processors that aren't vulnerable to BHI will set
MSR_IA32_ARCH_CAPABILITIES[BHI_NO] = 1;. Guests may use this BHI_NO bit to
determine if they need to implement BHI mitigations or not. Allow this bit
to be passed to the guests.
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.
Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.
Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Branch history clearing software sequences and hardware control
BHI_DIS_S were defined to mitigate Branch History Injection (BHI).
Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation:
auto - Deploy the hardware mitigation BHI_DIS_S, if available.
on - Deploy the hardware mitigation BHI_DIS_S, if available,
otherwise deploy the software sequence at syscall entry and
VMexit.
off - Turn off BHI mitigation.
The default is auto mode which does not deploy the software sequence
mitigation. This is because of the hardening done in the syscall
dispatch path, which is the likely target of BHI.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Mitigation for BHI is selected based on the bug enumeration. Add bits
needed to enumerate BHI bug.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Newer processors supports a hardware control BHI_DIS_S to mitigate
Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel
from userspace BHI attacks without having to manually overwrite the
branch history.
Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL.
Mitigation is enabled later.
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0. The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.
Alder Lake and new processors supports a hardware control BHI_DIS_S to
mitigate BHI. For older processors Intel has released a software sequence
to clear the branch history on parts that don't support BHI_DIS_S. Add
support to execute the software sequence at syscall entry and VMexit to
overwrite the branch history.
For now, branch history is not cleared at interrupt entry, as malicious
applications are not believed to have sufficient control over the
registers, since previous register state is cleared at interrupt
entry. Researchers continue to poke at this area and it may become
necessary to clear at interrupt entry as well in the future.
This mitigation is only defined here. It is enabled later.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Make <asm/syscall.h> build a switch statement instead, and the compiler can
either decide to generate an indirect jump, or - more likely these days due
to mitigations - just a series of conditional branches.
Yes, the conditional branches also have branch prediction, but the branch
prediction is much more controlled, in that it just causes speculatively
running the wrong system call (harmless), rather than speculatively running
possibly wrong random less controlled code gadgets.
This doesn't mitigate other indirect calls, but the system call indirection
is the first and most easily triggered case.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Change the format of the 'spectre_v2' vulnerabilities sysfs file
slightly by converting the commas to semicolons, so that mitigations for
future variants can be grouped together and separated by commas.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When done from a virtual machine, instructions that touch APIC memory
must be emulated. By convention, MMIO accesses are typically performed
via io.h helpers such as readl() or writeq() to simplify instruction
emulation/decoding (ex: in KVM hosts and SEV guests) [0].
Currently, native_apic_mem_read() does not follow this convention,
allowing the compiler to emit instructions other than the MOV
instruction generated by readl(). In particular, when the kernel is
compiled with clang and run as a SEV-ES or SEV-SNP guest, the compiler
would emit a TESTL instruction which is not supported by the SEV-ES
emulator, causing a boot failure in that environment. It is likely the
same problem would happen in a TDX guest as that uses the same
instruction emulator as SEV-ES.
To make sure all emulators can emulate APIC memory reads via MOV, use
the readl() function in native_apic_mem_read(). It is expected that any
emulator would support MOV in any addressing mode as it is the most
generic and is what is usually emitted currently.
The TESTL instruction is emitted when native_apic_mem_read() is inlined
into apic_mem_wait_icr_idle(). The emulator comes from
insn_decode_mmio() in arch/x86/lib/insn-eval.c. It's not worth it to
extend insn_decode_mmio() to support more instructions since, in theory,
the compiler could choose to output nearly any instruction for such
reads which would bloat the emulator beyond reason.
[0] https://lore.kernel.org/all/20220405232939.73860-12-kirill.shutemov@linux.intel.com/
[ bp: Massage commit message, fix typos. ]
Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Kevin Loughlin <kevinloughlin@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20240318230927.2191933-1-acdunlap@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSRmYRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gGLg/9GQS5B/kub1ydjoFx2sCuHe5Zl9vLXpLg
515yOSKRWnpzjDJaMlAJg9MHiNlcR2Df5rcnVLaKpQSomajFQBKpsJkX4skqAUm3
wuuCwP0Wc+XJFNDzHGb5yqWaCo06XI80BnZyGHDHDD6jJ5fypzKVyJERnGBpLwjO
h2YTomtLP5j5Gq7em2d+A9pVRpKwcBOCB8K2sBnJRlPNVR190MWTOm1wEQ+vYQeR
x8nIx7WaaA6SInqWVGkhloasTmeWEH89Q2wjCltZpYFnRiEa1yS/VHVT6ZQKrpOy
+mBqr92tXdxT2Y+8LNAMUg5PgRVMbZoY+Glin0Q0N4Cg92BZIl8NX8wtb3oacYgd
XhiRyRWaw8JDCC2mEmhlEa01M2Y7PXtcBjvOVQwoZLS/711Zyf+fHjyX4FUG8Vcb
T0PgaQoterlVnN4H2uWq8Za8ubjI0TW0nRBw2oQKlSv/5ldJ2IKJsQdsbl1q4wQr
TtYJY2bq5Hrn+qlFZi6jFB2KvBOUV3molXlZAPJ0Nr/Y9mkMBRVcq6ufrAunpgUB
l62Ls61HHZ9+hVNIIpM8/p/rTYjeVilA7vjHGiCJFcsclvPNBBkobtbpri/ioE0t
4+pH60LsMBIwAhOAKlJ6Jzf4LaWEikJfDPpj8yMKixGOxDT542rUN7A3NwdP2H6b
2fV3Nyr7sEw=
=v7jY
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fix from Ingo Molnar:
"Fix a combined PEBS events bug on x86 Intel CPUs"
* tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:
vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0
Fixes: 4535e1a4174c ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
We want to fix:
0e110732473e ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")
So merge in Linus's latest into x86/urgent to have it available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a couple of fixups for this cycle's vmalloc changes and one for
the stackdepot changes. And a fix for a very old x86 PAT issue which can
cause a warning splat.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZhBEXAAKCRDdBJ7gKXxA
ju9fAQCxPdqApKQ49IAJpUtMcRJmI594dmB+/CfrFgiS+GaQcwEA1+2SidI9fQWT
R/fcKrRr4+zlgQw0T0aSDR1HBLUPxw0=
=rkxT
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"8 hotfixes, 3 are cc:stable
There are a couple of fixups for this cycle's vmalloc changes and one
for the stackdepot changes. And a fix for a very old x86 PAT issue
which can cause a warning splat"
* tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
stackdepot: rename pool_index to pool_index_plus_1
x86/mm/pat: fix VM_PAT handling in COW mappings
MAINTAINERS: change vmware.com addresses to broadcom.com
selftests/mm: include strings.h for ffsl
mm: vmalloc: fix lockdep warning
mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
init: open output files from cpio unpacking with O_LARGEFILE
mm/secretmem: fix GUP-fast succeeding on secretmem folios
PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.
Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().
In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.
To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.
We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.
For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.
Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():
<--- C reproducer --->
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>
int main(void)
{
struct io_uring_params p = {};
int ring_fd;
size_t size;
char *map;
ring_fd = io_uring_setup(1, &p);
if (ring_fd < 0) {
perror("io_uring_setup");
return 1;
}
size = p.sq_off.array + p.sq_entries * sizeof(unsigned);
/* Map the submission queue ring MAP_PRIVATE */
map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
ring_fd, IORING_OFF_SQ_RING);
if (map == MAP_FAILED) {
perror("mmap");
return 1;
}
/* We have at least one page. Let's COW it. */
*map = 0;
pause();
return 0;
}
<--- C reproducer --->
On a system with 16 GiB RAM and swap configured:
# ./iouring &
# memhog 16G
# killall iouring
[ 301.552930] ------------[ cut here ]------------
[ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[ 301.565725] PKRU: 55555554
[ 301.565944] Call Trace:
[ 301.566148] <TASK>
[ 301.566325] ? untrack_pfn+0xf4/0x100
[ 301.566618] ? __warn+0x81/0x130
[ 301.566876] ? untrack_pfn+0xf4/0x100
[ 301.567163] ? report_bug+0x171/0x1a0
[ 301.567466] ? handle_bug+0x3c/0x80
[ 301.567743] ? exc_invalid_op+0x17/0x70
[ 301.568038] ? asm_exc_invalid_op+0x1a/0x20
[ 301.568363] ? untrack_pfn+0xf4/0x100
[ 301.568660] ? untrack_pfn+0x65/0x100
[ 301.568947] unmap_single_vma+0xa6/0xe0
[ 301.569247] unmap_vmas+0xb5/0x190
[ 301.569532] exit_mmap+0xec/0x340
[ 301.569801] __mmput+0x3e/0x130
[ 301.570051] do_exit+0x305/0xaf0
...
Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5. KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.
Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.
No functional change intended.
Fixes: 7f274e609f3d ("x86/cpufeatures: Add new word for scattered features")
Cc: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of
the newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
with better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting
to some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing problems,
incorrect checksum conversions, and incorrect packet transformations
which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
=6bQI
-----END PGP SIGNATURE-----
Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of the
newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting to
some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing
problems, incorrect checksum conversions, and incorrect packet
transformations which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled"
* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: validate user input for expected length
net/sched: act_skbmod: prevent kernel-infoleak
net: usb: ax88179_178a: avoid the interface always configured as random address
net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
net: ravb: Always update error counters
net: ravb: Always process TX descriptor ring
netfilter: nf_tables: discard table flag update with pending basechain deletion
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
netfilter: nf_tables: reject new basechain after table flag update
netfilter: nf_tables: flush pending destroy work before exit_net release
netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
netfilter: nf_tables: release batch on table validation from abort path
Revert "tg3: Remove residual error handling in tg3_suspend"
tg3: Remove residual error handling in tg3_suspend
net: mana: Fix Rx DMA datasize and skb_over_panic
net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
net: stmmac: fix rx queue priority assignment
net: txgbe: fix i2c dev name cannot match clkdev
net: fec: Set mac_managed_pm during probe
...
Modifying a MCA bank's MCA_CTL bits which control which error types to
be reported is done over
/sys/devices/system/machinecheck/
├── machinecheck0
│ ├── bank0
│ ├── bank1
│ ├── bank10
│ ├── bank11
...
sysfs nodes by writing the new bit mask of events to enable.
When the write is accepted, the kernel deletes all current timers and
reinits all banks.
Doing that in parallel can lead to initializing a timer which is already
armed and in the timer wheel, i.e., in use already:
ODEBUG: init active (active state 0) object: ffff888063a28000 object
type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642
WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514
debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514
Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code
does.
Reported by: Yue Sun <samsun1006219@gmail.com>
Reported by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.
Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.
Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.
Move kdump_sev_callback() to its rightful place, while at it.
Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
Add functionality to set and/or clear different attributes of the
machine as a confidential computing platform. Add the first one too:
whether the machine is running as a host for SEV-SNP guests.
Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-5-bp@alien8.de
The functionality to load SEV-SNP guests by the host will soon rely on
cc_platform* helpers because the cpu_feature* API with the early
patching is insufficient when SNP support needs to be disabled late.
Therefore, pull that functionality in.
Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-4-bp@alien8.de
There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.
If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue on gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.
So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.
Add this deliberately to be "just a CoCo x86 driver feature" and not
part of the RNG itself. Many device drivers and platforms have some
desire to contribute something to the RNG, and add_device_randomness()
is specifically meant for this purpose.
Any driver can call it with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse.
Rather than trying to build something into the core of the RNG, consider
the particular CoCo issue just a CoCo issue, and therefore separate it
all out into driver (well, arch/platform) code.
[ bp: Massage commit message. ]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
The __vmalloc_start_set declaration is in a header that is not included
in numa_32.c in current linux-next:
arch/x86/mm/numa_32.c: In function 'initmem_init':
arch/x86/mm/numa_32.c:57:9: error: '__vmalloc_start_set' undeclared (first use in this function)
57 | __vmalloc_start_set = true;
| ^~~~~~~~~~~~~~~~~~~
arch/x86/mm/numa_32.c:57:9: note: each undeclared identifier is reported only once for each function it appears in
Add an explicit #include.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240403202344.3463169-1-arnd@kernel.org
- Ensure perf events programmed to count during guest execution
are actually enabled before entering the guest in the nVHE
configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is
appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible
configurations.
RISC-V:
- Remove redundant semicolon in num_isa_ext_regs().
- Fix APLIC setipnum_le/be write emulation.
- Fix APLIC in_clrip[x] read emulation.
x86:
- Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting
disabled.
- Documentation fixes for SEV.
- Fix compat ABI for KVM_MEMORY_ENCRYPT_OP.
- Fix a 14-year-old goof in a declaration shared by host and guest; the enabled
field used by Linux when running as a guest pushes the size of "struct
kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is really unconsequential
because KVM never consumes anything beyond the first 64 bytes, but the
resulting struct does not match the documentation.
Selftests:
- Fix spelling mistake in arch_timer selftest.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYMOJYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroP2zAf/Z7/cK0+yFSvm7/tsbWtjnWofad/p
82puu0V+8lZSjGVs3AydiDCV+FahvLS0QIwgrffVr4XA10Km5ZZMjZyJ3uH4xki/
VFFsDnZPdKuj55T0wwN7JFn0YVOMdtgcP0b+F8aMbkL0uoJXjutOMKNhssuW12kw
9cmPjaBWm/bfrfoTUUB9mCh0Ub3HKpguYwTLQuf6Fyn2FK7oORpt87Zi+oIKUn6H
pFXFtZYduLg6M2LXvZqsXZLXnvABPjANNWEhiiwrvuF/wmXXTwTpvRXlYXhCvpAN
q0AhxPhPm3NnsmRhEB6SmoMjXyZIByezcEiqAspBrUvEqs/2u6VyzFMrXw==
=PlsI
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM:
- Ensure perf events programmed to count during guest execution are
actually enabled before entering the guest in the nVHE
configuration
- Restore out-of-range handler for stage-2 translation faults
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches
- Fix early handling of architectural VHE-only systems to ensure E2H
is appropriately set
- Correct a format specifier warning in the arch_timer selftest
- Make the KVM banner message correctly handle all of the possible
configurations
RISC-V:
- Remove redundant semicolon in num_isa_ext_regs()
- Fix APLIC setipnum_le/be write emulation
- Fix APLIC in_clrip[x] read emulation
x86:
- Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID
entries (old vs. new) and ultimately neglects to clear PV_UNHALT
from vCPUs with HLT-exiting disabled
- Documentation fixes for SEV
- Fix compat ABI for KVM_MEMORY_ENCRYPT_OP
- Fix a 14-year-old goof in a declaration shared by host and guest;
the enabled field used by Linux when running as a guest pushes the
size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is
really unconsequential because KVM never consumes anything beyond
the first 64 bytes, but the resulting struct does not match the
documentation
Selftests:
- Fix spelling mistake in arch_timer selftest"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: arm64: Rationalise KVM banner output
arm64: Fix early handling of FEAT_E2H0 not being implemented
KVM: arm64: Ensure target address is granule-aligned for range TLBI
KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
KVM: arm64: Don't pass a TLBI level hint when zapping table entries
KVM: arm64: Don't defer TLB invalidation when zapping table entries
KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test
KVM: arm64: Fix out-of-IPA space translation fault handling
KVM: arm64: Fix host-programmed guest events in nVHE
RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
RISC-V: KVM: Fix APLIC setipnum_le/be write emulation
RISC-V: KVM: Remove second semicolon
KVM: selftests: Fix spelling mistake "trigged" -> "triggered"
Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP
Documentation: kvm/sev: separate description of firmware
KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled
KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
...
The srso_alias_untrain_ret() dummy thunk in the !CONFIG_MITIGATION_SRSO
case is there only for the altenative in CALL_UNTRAIN_RET to have
a symbol to resolve.
However, testing with kernels which don't have CONFIG_MITIGATION_SRSO
enabled, leads to the warning in patch_return() to fire:
missing return thunk: srso_alias_untrain_ret+0x0/0x10-0x0: eb 0e 66 66 2e
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:826 apply_returns (arch/x86/kernel/alternative.c:826
Put in a plain "ret" there so that gcc doesn't put a return thunk in
in its place which special and gets checked.
In addition:
ERROR: modpost: "srso_alias_untrain_ret" [arch/x86/kvm/kvm-amd.ko] undefined!
make[2]: *** [scripts/Makefile.modpost:145: Module.symvers] Chyba 1
make[1]: *** [/usr/src/linux-6.8.3/Makefile:1873: modpost] Chyba 2
make: *** [Makefile:240: __sub-make] Chyba 2
since !SRSO builds would use the dummy return thunk as reported by
petr.pisar@atlas.cz, https://bugzilla.kernel.org/show_bug.cgi?id=218679.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404020901.da75a60f-oliver.sang@intel.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/202404020901.da75a60f-oliver.sang@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MSR_PEBS_DATA_CFG MSR register is used to configure which data groups
should be generated into a PEBS record, and it's shared among all counters.
If there are different configurations among counters, perf combines all the
configurations.
The first perf command as below requires a complete PEBS record
(including memory info, GPRs, XMMs, and LBRs). The second perf command
only requires a basic group. However, after the second perf command is
running, the MSR_PEBS_DATA_CFG register is cleared. Only a basic group is
generated in a PEBS record, which is wrong. The required information
for the first perf command is missed.
$ perf record --intr-regs=AX,SP,XMM0 -a -C 8 -b -W -d -c 100000003 -o /dev/null -e cpu/event=0xd0,umask=0x81/upp &
$ sleep 5
$ perf record --per-thread -c 1 -e cycles:pp --no-timestamp --no-tid taskset -c 8 ./noploop 1000
The first PEBS event is a system-wide PEBS event. The second PEBS event
is a per-thread event. When the thread is scheduled out, the
intel_pmu_pebs_del() function is invoked to update the PEBS state.
Since the system-wide event is still available, the cpuc->n_pebs is 1.
The cpuc->pebs_data_cfg is cleared. The data configuration for the
system-wide PEBS event is lost.
The (cpuc->n_pebs == 1) check was introduced in commit:
b6a32f023fcc ("perf/x86: Fix PEBS threshold initialization")
At that time, it indeed didn't hurt whether the state was updated
during the removal, because only the threshold is updated.
The calculation of the threshold takes the last PEBS event into
account.
However, since commit:
b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
we delay the threshold update, and clear the PEBS data config, which triggers
the bug.
The PEBS data config update scope should not be shrunk during removal.
[ mingo: Improved the changelog & comments. ]
Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240401133320.703971-1-kan.liang@linux.intel.com
Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
...
RIP: 0010:__find_nth_andnot_bit+0x66/0x110
...
Call Trace:
<TASK>
? __die()
? page_fault_oops()
? exc_page_fault()
? asm_exc_page_fault()
cpumask_any_housekeeping()
mbm_setup_overflow_handler()
resctrl_offline_cpu()
resctrl_arch_offline_cpu()
cpuhp_invoke_callback()
cpuhp_thread_fun()
smpboot_thread_fn()
kthread()
ret_from_fork()
ret_from_fork_asm()
</TASK>
The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.
Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().
Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.
[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]
Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
- Ensure perf events programmed to count during guest execution
are actually enabled before entering the guest in the nVHE
configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is
appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible
configurations.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZgtpWBccb2xpdmVyLnVw
dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFoilAQCQk6kLIeuih5QOe50fK4XkNsyg
PGcxw0a0BP8cfjtJsgEArwLlfHQOTE4tRWtXyEHvapJfe/bE1hjLmzUJx7BwLQ4=
=6hNq
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
are actually enabled before entering the guest in the nVHE
configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is
appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible
configurations.
The commit:
59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
made PER_CPU_VAR() to use rip-relative addressing, hence
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of it. A follow up commit:
17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting, emitted by the BPF JIT, so the calculated address was wrong,
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.
Pass the destination IP when the BPF JIT emits call depth accounting.
Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adjust the IP passed to `emit_patch` so it calculates the correct offset
for the CALL instruction if `x86_call_depth_emit_accounting` emits code.
Otherwise we will skip some instructions and most likely crash.
Fixes: b2e9dfe54be4 ("x86/bpf: Emit call depth accounting if required")
Link: https://lore.kernel.org/lkml/20230105214922.250473-1-joanbrugueram@gmail.com/
Co-developed-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-2-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
- Fix detection of the LBR freeze feature on AMD
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJR3oACgkQEsHwGGHe
VUo/eRAAnGUq0rSi4ZUqsTtbu/jNepKbaeS7jR3/p3V6iSwfmjEoJ2xE6uIdN5vD
fnL6UkeDRMc8LaKHIdLD4ZbN8NRa3hOyzf5K7wwVp5bwle0NeyrcG5wVK8LgT/X/
rPSk7YxoR5frkYcA6zZwezJOv3HGYt8RMr5bKMD3YiJ35/XCdPsKnbHJTHb+F23Y
tYFBeyzRzOebQu0fFKP8ML9LbqvELESqJ5Smwu/jQ25aBW7sFsUNAxseGU2tYahX
c6pm8ytIlpZFwmi1HzXmMICF7lWugFO/KkP/ndCM1IpmujVGy56hrpLEy5gT3gzh
NE/nZDoqJAO2zhg2FuKybh3akdT+IgXUTjxYMYGUOkJIChzie3o4p9OqichgTIv5
+ngAq5qzjAHfC7cZ5nA96XWkw1fFU6BqlA3KPs1mzQU9uTDz7tSkyxIitp3C8L0B
JlilTr6yHUprJzFwCDk4hb+hfP5A9qYnrNeacMlldZmbH1jLYHEzB9FudK82MeM+
tIKFnM2jyRaRs/s8n+/UdrOVFNGk/+scX8GQllEBF451a8J5x1CYeHB7dGW+4pf/
cx5TupHg8dDRgNMsbaeEvwERoPu4h/VRozfBi6r1WjjskVm24lIdFFKTSm3BDbLk
EH3cflv/h8KE19cr0XLb7aYYw/9jb4cpnb0WBMw1gQOSvUMXzxU=
=gmta
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:
- Define the correct set of default hw events on AMD Zen4
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
- Fix detection of the LBR freeze feature on AMD
* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
perf/x86/amd/lbr: Use freeze based on availability
x86/cpufeatures: Add new word for scattered features
make ... arch/x86/virt/vmx/tdx/seamcall.o
work again
- Do not do ROM range scans and memory validation when the kernel is
running as a SEV-SNP guest as those can get problematic and, before
that, are not really needed in such a guest
- Exclude the build-time generated vdso-image-x32.o object from objtool
validation and in particular the return sites in there due to
a warning which fires when an unpatched return thunk is being used
- Improve the NMI CPUs stall message to show additional information
about the state of each CPU wrt the NMI handler
- Enable gcc named address spaces support only on !KCSAN configs due to
compiler options incompatibility
- Revert a change which was trying to use GB pages for mapping regions
only when the regions would be large enough but that change lead to
kexec failing
- A documentation fixlet
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJN64ACgkQEsHwGGHe
VUo9KA/48ELqeKhCAFQdkZn8nnYf57g7bSMibMCnI5/ndNjyuShdZO92SGtBV7fV
4L77vJRJXx2ytz4MztOPVClXaBEs4uDEa0NyZFSuEdvRIvJuglibsleQOFTbKfkD
HdL0geI5exyfCC3BmslCS857sd6aBanqmYPddVk1+5mwhTGcAcwy9aiNTCzxxE22
KuaO12A0+UNOdXNLAuUythmYL2V0Xn2Z9sXpDRchwsRnj12C1S2flIhPsWxg9AU+
3ws9PnfTFTcfcViEug5p1nyN0gWCgYhMxaN8i3IS4smEc09Yq0pxVsitwYqJB1D6
+neC27FTF4DCBr4G/8yu/x5BVHS92s9VK4u9Wo4nf9M4XBlMVU+If0bn9jtvE6jI
GbdoHzoF7e2SEgzVvIjOHpdiBxEzW6S4lZGEtj6oaYR+cI0Fc7vZXoHy8kXohtTz
QT2M1Hl9tMh2HjkwvoQLtNMYVHMWnuDy1X9k25Qk4nnCQJy9DbGiPnpxaGuPtaJ7
s8kha4r+U/zpEiEa0S5AoIozVD/syXe1lOtaN2dUpOWjqBJAMGr1QkK3NnOxEcO3
CeKOCcSfjvvAKPyTnk1Vrtec5KwHmUgD/VN2fn7EdHZVwxlbaoCybpP3g4CT4Jpl
TQakg0du9TyG9BXituN4XaZgGskDYFRlXv+U+Lity4wvEjWgmg==
=JrjO
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure single object builds in arch/x86/virt/ ala
make ... arch/x86/virt/vmx/tdx/seamcall.o
work again
- Do not do ROM range scans and memory validation when the kernel is
running as a SEV-SNP guest as those can get problematic and, before
that, are not really needed in such a guest
- Exclude the build-time generated vdso-image-x32.o object from objtool
validation and in particular the return sites in there due to a
warning which fires when an unpatched return thunk is being used
- Improve the NMI CPUs stall message to show additional information
about the state of each CPU wrt the NMI handler
- Enable gcc named address spaces support only on !KCSAN configs due to
compiler options incompatibility
- Revert a change which was trying to use GB pages for mapping regions
only when the regions would be large enough but that change lead to
kexec failing
- A documentation fixlet
* tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Use obj-y to descend into arch/x86/virt/
x86/sev: Skip ROM range scans and validation for SEV-SNP guests
x86/vdso: Fix rethunk patching for vdso-image-x32.o too
x86/nmi: Upgrade NMI backtrace stall checks & messages
x86/percpu: Disable named address spaces for KCSAN
Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
Documentation/x86: Fix title underline length
Commit c33621b4c5ad ("x86/virt/tdx: Wire up basic SEAMCALL functions")
introduced a new instance of core-y instead of the standardized obj-y
syntax.
X86 Makefiles descend into subdirectories of arch/x86/virt inconsistently;
into arch/x86/virt/ via core-y defined in arch/x86/Makefile, but into
arch/x86/virt/svm/ via obj-y defined in arch/x86/Kbuild.
This is problematic when you build a single object in parallel because
multiple threads attempt to build the same file.
$ make -j$(nproc) arch/x86/virt/vmx/tdx/seamcall.o
[ snip ]
AS arch/x86/virt/vmx/tdx/seamcall.o
AS arch/x86/virt/vmx/tdx/seamcall.o
fixdep: error opening file: arch/x86/virt/vmx/tdx/.seamcall.o.d: No such file or directory
make[4]: *** [scripts/Makefile.build:362: arch/x86/virt/vmx/tdx/seamcall.o] Error 2
Use the obj-y syntax, as it works correctly.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240330060554.18524-1-masahiroy@kernel.org