IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
guest_owns_fp_regs() will be used to check fpsimd state ownership
across kvm/arm64. Therefore, move it to kvm_host.h to widen its
scope.
Moreover, the host state is not per-vcpu anymore, the vcpu
parameter isn't used, so remove it as well.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since the host_fpsimd_state has been removed from kvm_vcpu_arch,
it isn't pointing to the hyp's version of the host fp_regs in
protected mode.
Initialize the host_data fpsimd_state point to the host_data's
context fp_regs on pKVM initialization.
Fixes: 51e09b5572d6 ("KVM: arm64: Exclude host_fpsimd_state pointer from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Commit d5a32b60dc18 ("KVM: arm64: Allow userspace to change
ID_AA64MMFR{0-2}_EL1") made certain fields in these registers writable,
but in doing so, ID_AA64MMFR1_EL1_XNX was listed twice. Remove the
duplication.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Zenghui Yu <zenghui.yu@linux.dev>
Link: https://lore.kernel.org/r/E1s2AxF-00AWLv-03@rmk-PC.armlinux.org.uk
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last genuine use case for the lpi_list_lock was the global LPI
translation cache, which has been removed in favor of a per-ITS xarray.
Remove a layer from the locking puzzle by getting rid of it.
vgic_add_lpi() still has a critical section that needs to protect
against the insertion of other LPIs; change it to take the LPI xarray's
xa_lock to retain this property.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Everything is in place to switch to per-ITS translation caches. Start
using the per-ITS cache to avoid the lock serialization related to the
global translation cache. Explicitly check for out-of-range device and
event IDs as the cache index is packed based on the range the ITS
actually supports.
Take the RCU read lock to protect against the returned descriptor being
freed while trying to take a reference on it, as it is no longer
necessary to acquire the lpi_list_lock.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The fast path will soon need to find an ITS by doorbell address, as the
translation caches will become local to an ITS. Spin off a helper to do
just that.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-10-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Within the context of a single ITS, it is possible to use an xarray to
cache the device ID & event ID translation to a particular irq
descriptor. Take advantage of this to build a translation cache capable
of fitting all valid translations for a given ITS.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
As the current LPI translation cache is global, the corresponding
invalidation helpers are also globally-scoped. In anticipation of
constructing a translation cache per ITS, add a helper for scoped cache
invalidations.
We still need to support global invalidations when LPIs are toggled on
a redistributor, as a property of the translation cache is that all
stored LPIs are known to be delieverable.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-8-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last user has been transitioned to walking the LPI xarray directly.
Cut the wart off, and get rid of the now unneeded lpi_count while doing
so.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The vgic debug iterator is the final user of vgic_copy_lpi_list(), but
is a bit more complicated to transition to something else. Use a mark
in the LPI xarray to record the indices 'known' to the debug iterator.
Protect against the LPIs from being freed by associating an additional
reference with the xarray mark.
Rework iter_next() to let the xarray walk 'drive' the iteration after
visiting all of the SGIs, PPIs, and SPIs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_cmd_handle_movall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_invall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in its_sync_lpi_pending_table().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
vgic_v2_parse_attr() is responsible for finding the vCPU that matches
the user-provided CPUID, which (of course) may not be valid. If the ID
is invalid, kvm_get_vcpu_by_id() returns NULL, which isn't handled
gracefully.
Similar to the GICv3 uaccess flow, check that kvm_get_vcpu_by_id()
actually returns something and fail the ioctl if not.
Cc: stable@vger.kernel.org
Fixes: 7d450e282171 ("KVM: arm/arm64: vgic-new: Add userland access to VGIC dist registers")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240424173959.3776798-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We still support GCC 8.x, and it appears that this toolchain
usually comes with an assembler that does not understand "pauth"
as a valid architectural extension.
This results in the NV ERETAx code breaking the build, as it relies
on this extention to make use of the PACGA instruction (required
by assemblers such as LLVM's).
Work around it by hand-assembling the instruction, which removes the
requirement for any assembler directive.
Fixes: 6ccc971ee2c6 ("KVM: arm64: nv: Add emulation for ERETAx instructions")
Reported-by: Linaro Kernel Functional Testing <lkft@linaro.org>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
We currently insist on disabling PAuth on vcpu_load(), and get to
enable it on first guest use of an instruction or a key (ignoring
the NV case for now).
It isn't clear at all what this is trying to achieve: guests tend
to use PAuth when available, and nothing forces you to expose it
to the guest if you don't want to. This also isn't totally free:
we take a full GPR save/restore between host and guest, only to
write ten 64bit registers. The "value proposition" escapes me.
So let's forget this stuff and enable PAuth eagerly if exposed to
the guest. This results in much simpler code. Performance wise,
that's not bad either (tested on M2 Pro running a fully automated
Debian installer as the workload):
- On a non-NV guest, I can see reduction of 0.24% in the number
of cycles (measured with perf over 10 consecutive runs)
- On a NV guest (L2), I see a 2% reduction in wall-clock time
(measured with 'time', as M2 doesn't have a PMUv3 and NV
doesn't support it either)
So overall, a much reduced complexity and a (small) performance
improvement.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-16-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we (hopefully) correctly handle ERETAx, drop the masking
of the PAuth feature (something that was not even complete, as
APA3 and AGA3 were still exposed).
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-15-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we have some emulation in place for ERETA[AB], we can
plug it into the exception handling machinery.
As for a bare ERET, an "easy" ERETAx instruction is processed as
a fixup, while something that requires a translation regime
transition or an exception delivery is left to the slow path.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-14-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
FEAT_NV has the interesting property of relying on ERET being
trapped. An added complexity is that it also traps ERETAA and
ERETAB, meaning that the Pointer Authentication aspect of these
instruction must be emulated.
Add an emulation of Pointer Authentication, limited to ERETAx
(always using SP_EL2 as the modifier and ELR_EL2 as the pointer),
using the Generic Authentication instructions.
The emulation, however small, is placed in its own compilation
unit so that it can be avoided if the configuration doesn't
include it (or the toolchan in not up to the task).
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-13-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order for a L1 hypervisor to correctly handle PAuth instructions,
it must observe traps caused by a L1 PAuth instruction when
HCR_EL2.API==0. Since we already handle the case for API==1 as
a fixup, only the exception injection case needs to be handled.
Rework the kvm_handle_ptrauth() callback to reinject the trap
in this case. Note that APK==0 is already handled by the exising
triage_sysreg_trap() helper.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Although KVM couples API and APK for simplicity, the architecture
makes no such requirement, and the two can be independently set or
cleared.
Check for which of the two possible reasons we have trapped here,
and if the corresponding L1 control bit isn't set, delegate the
handling for forwarding.
Otherwise, set this exact bit in HCR_EL2 and resume the guest.
Of course, in the non-NV case, we keep setting both bits and
be done with it. Note that the entry core already saves/restores
the keys should any of the two control bits be set.
This results in a bit of rework, and the removal of the (trivial)
vcpu_ptrauth_enable() helper.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
If the L1 hypervisor decides to trap ERETs while running L2,
make sure we don't try to emulate it, just like we wouldn't
if it had its NV bit set.
The exception will be reinjected from the core handler.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
A significant part of the FEAT_NV extension is to trap ERET
instructions so that the hypervisor gets a chance to switch
from a vEL2 L1 guest to an EL1 L2 guest.
But this also has the unfortunate consequence of trapping ERET
in unsuspecting circumstances, such as staying at vEL2 (interrupt
handling while being in the guest hypervisor), or returning to host
userspace in the case of a VHE guest.
Although we already make some effort to handle these ERET quicker
by not doing the put/load dance, it is still way too far down the
line for it to be efficient enough.
For these cases, it would ideal to ERET directly, no question asked.
Of course, we can't do that. But the next best thing is to do it as
early as possible, in fixup_guest_exit(), much as we would handle
FPSIMD exceptions.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Honor the trap forwarding bits for both ERET and SMC, using a new
helper that checks for common conditions.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add the HCR_EL2 configuration for FEAT_NV2, adding the required
bits for running a guest hypervisor, and overall merging the
allowed bits provided by the guest.
This heavily replies on unavaliable features being sanitised
when the HCR_EL2 shadow register is accessed, and only a couple
of bits must be explicitly disabled.
Non-NV guests are completely unaffected by any of this.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
It has become obvious that HCR_EL2.NV serves the exact same use
as VCPU_HYP_CONTEXT, only in an architectural way. So just drop
the flag for good.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
PAuth comes it two parts: address authentication, and generic
authentication. So far, KVM mandates that both are implemented.
PAuth also comes in three flavours: Q5, Q3, and IMPDEF. Only one
can be implemented for any of address and generic authentication.
Crucially, the architecture doesn't mandate that address and generic
authentication implement the *same* flavour. This would make
implementing ERETAx very difficult for NV, something we are not
terribly keen on.
So only allow PAuth support for KVM on systems that are not totally
insane. Which is so far 100% of the known HW.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The ESR_ELx_ERET_ISS_ERET* macros are a bit confusing:
- ESR_ELx_ERET_ISS_ERET really indicates that we have trapped an
ERETA* instruction, as opposed to an ERET
- ESR_ELx_ERET_ISS_ERETA really indicates that we have trapped
an ERETAB instruction, as opposed to an ERETAA.
We could repaint those to make more sense, but these are the
names that are present in the ARM ARM, and we are sentimentally
attached to those.
Instead, add two new helpers:
- esr_iss_is_eretax() being true tells you that you need to
authenticate the ERET
- esr_iss_is_eretab() tells you that you need to use the B key
instead of the A key
Following patches will make use of these primitives.
Suggested-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240419102935.1935571-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The denylist is blocking the 32 bit version of the call but is allowing
the 64 bit version of it. There is no reason for blocking only one of
them and the hypervisor should support these calls.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20240411135700.2140550-1-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Adding new entries to our system register tables is a painful exercise,
as we require them to be ordered by Op0,Op1,CRn,CRm,Op2.
If an entry is misordered, we output an error that indicates the
pointer to the entry and the number *of the last valid one*.
That's not very helpful, and would be much better if we printed the
number of the *offending* entry as well as its name (which is present
in the vast majority of the cases).
This makes debugging new additions to the tables much easier.
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20240410152503.3593890-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
In retrospect, it is fairly obvious that the FP state ownership
is only meaningful for a given CPU, and that locating this
information in the vcpu was just a mistake.
Move the ownership tracking into the host data structure, and
rename it from fp_state to fp_owner, which is a better description
(name suggested by Mark Brown).
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
As the name of the field indicates, host_fpsimd_state is strictly
a host piece of data, and we reset this pointer on each PID change.
So let's move it where it belongs, and set it at load-time. Although
this is slightly more often, it is a well defined life-cycle which
matches other pieces of data.
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
As for the rest of the host debug state, the host copy of mdcr_el2
has little to do in the vcpu, and is better placed in the host_data
structure.
Reviewed-by : Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Keeping host_debug_state on a per-vcpu basis is completely
pointless. The lifetime of this data is only that of the inner
run-loop, which means it is never accessed outside of the core
EL2 code.
Move the structure into kvm_host_data, and save over 500 bytes
per vcpu.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to facilitate the introduction of new per-CPU state,
add a new host_data_ptr() helped that hides some of the per-CPU
verbosity, and make it easier to move that state around in the
future.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We are not very consistent when it comes to displaying which mode
we're in (VHE, {n,h}VHE, protected or not). For example, booting
in protected mode with hVHE results in:
[ 0.969545] kvm [1]: Protected nVHE mode initialized successfully
which is mildly amusing considering that the machine is VHE only.
We already cleaned this up a bit with commit 1f3ca7023fe6 ("KVM:
arm64: print Hyp mode"), but that's still unsatisfactory.
Unify the three strings into one and use a mess of conditional
statements to sort it out (yes, it's a slow day).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240321173706.3280796-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When zapping a table entry in stage2_try_break_pte(), we issue range
TLB invalidation for the region that was mapped by the table. However,
we neglect to align the base address down to the granule size and so
if we ended up reaching the table entry via a misaligned address then
we will accidentally skip invalidation for some prefix of the affected
address range.
Align 'ctx->addr' down to the granule size when performing TLB
invalidation for an unmapped table in stage2_try_break_pte().
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Fixes: defc8cc7abf0 ("KVM: arm64: Invalidate the table entries upon a range")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-5-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
FEAT_LPA2") updated the __tlbi_level() macro to take the target level
as an argument, with TLBI_TTL_UNKNOWN (rather than 0) indicating that
the caller cannot provide level information. Unfortunately, the two
implementations of __kvm_tlb_flush_vmid_range() were not updated and so
now ask for an level 0 invalidation if FEAT_LPA2 is implemented.
Fix the problem by passing TLBI_TTL_UNKNOWN instead of 0 as the level
argument to __flush_s2_tlb_range_op() in __kvm_tlb_flush_vmid_range().
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-4-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The TLBI level hints are for leaf entries only, so take care not to pass
them incorrectly after clearing a table entry.
Cc: Gavin Shan <gshan@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Quentin Perret <qperret@google.com>
Fixes: 82bb02445de5 ("KVM: arm64: Implement kvm_pgtable_hyp_unmap() at EL2")
Fixes: 6d9d2115c480 ("KVM: arm64: Add support for stage-2 map()/unmap() in generic page-table")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-3-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches. For example,
stage2_unmap_walker() will free page-table pages:
if (childp)
mm_ops->put_page(childp);
and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():
if (stage2_unmap_defer_tlb_flush(pgt))
/* Perform the deferred TLB invalidations */
kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);
For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry. Note, however, that the existing level
hint passed to __kvm_tlb_flush_vmid_ipa() is incorrect and will be
fixed in a subsequent patch.
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-2-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Commit 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception
types reported via ESR") tried to abstract the translation fault
check when handling an out-of IPA space condition, but incorrectly
replaced it with a permission fault check.
Restore the previous translation fault check.
Fixes: 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception types reported via ESR")
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wujie Duan <wjduan@linx-info.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/kvmarm/864jd3269g.wl-maz@kernel.org/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
This reverts commits 99101dda29e3186b1356b0dc4dbb835c02c71ac9 and
b80b701d5a67d07f4df4a21e09cb31f6bc1feeca.
Linus reports that the sysreg reserved bit checks in KVM have led to
build failures, arising from commit fdd867fe9b32 ("arm64/sysreg: Add
register fields for ID_AA64DFR1_EL1") giving meaning to fields that were
previously RES0.
Of course, this is a genuine issue, since KVM's sysreg emulation depends
heavily on the definition of reserved fields. But at this point the
build breakage is far more offensive, and the right course of action is
to revert and retry later.
All of these build-time assertions were on by default before
commit 99101dda29e3 ("KVM: arm64: Make build-time check of RES0/RES1
bits optional"), so deliberately revert it all atomically to avoid
introducing further breakage of bisection.
Link: https://lore.kernel.org/all/CAHk-=whCvkhc8BbFOUf1ddOsgSGgEjwoKv77=HEY1UiVCydGqw@mail.gmail.com/
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Changes to FPU handling came in via the main s390 pull request
* Only deliver to the guest the SCLP events that userspace has
requested.
* More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same).
* Fix selftests undefined behavior.
x86:
* Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can be
programmed *using the architectural encoding*. The enumeration does
NOT say anything about the encoding when the CPU doesn't report support
the event *in general*. It might support it, and it might support it
using the same encoding that made it into the architectural PMU spec.
* Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly emulates
RDMPC, counter availability, and a variety of other PMC-related
behaviors that depend on guest CPUID and therefore are easier to
validate with selftests than with custom guests (aka kvm-unit-tests).
* Zero out PMU state on AMD if the virtual PMU is disabled, it does not
cause any bug but it wastes time in various cases where KVM would check
if a PMC event needs to be synthesized.
* Optimize triggering of emulated events, with a nice ~10% performance
improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
guest.
* Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
* Fix a bug where KVM would report stale/bogus exit qualification information
when exiting to userspace with an internal error exit code.
* Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
* Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
* Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM
doesn't support yielding in the middle of processing a zap, and 1GiB
granularity resulted in multi-millisecond lags that are quite impolite
for CONFIG_PREEMPT kernels.
* Allocate write-tracking metadata on-demand to avoid the memory overhead when
a kernel is built with i915 virtualization support but the workloads use
neither shadow paging nor i915 virtualization.
* Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives.
* Fix the debugregs ABI for 32-bit KVM.
* Rework the "force immediate exit" code so that vendor code ultimately decides
how and when to force the exit, which allowed some optimization for both
Intel and AMD.
* Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, causing extra unnecessary work.
* Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
* Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere in the
kernel) are detected earlier and are less likely to hang the kernel.
x86 Xen emulation:
* Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the
gpa but the underlying host virtual address remains the same.
* When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
* Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
* Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
RISC-V:
* Support exception and interrupt handling in selftests
* New self test for RISC-V architectural timer (Sstc extension)
* New extension support (Ztso, Zacas)
* Support userspace emulation of random number seed CSRs.
ARM:
* Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
* Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
* Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
* Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
* Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
* Misc cleanups and fixes as usual.
Generic:
* cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
true on all architectures except MIPS (where Kconfig determines the
available depending on CPU capabilities). It is replaced either by
an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
everywhere else.
* Factor common "select" statements in common code instead of requiring
each architecture to specify it
* Remove thoroughly obsolete APIs from the uapi headers.
* Move architecture-dependent stuff to uapi/asm/kvm.h
* Always flush the async page fault workqueue when a work item is being
removed, especially during vCPU destruction, to ensure that there are no
workers running in KVM code when all references to KVM-the-module are gone,
i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
* Grab a reference to the VM's mm_struct in the async #PF worker itself instead
of gifting the worker a reference, so that there's no need to remember
to *conditionally* clean up after the worker.
Selftests:
* Reduce boilerplate especially when utilize selftest TAP infrastructure.
* Add basic smoke tests for SEV and SEV-ES, along with a pile of library
support for handling private/encrypted/protected memory.
* Fix benign bugs where tests neglect to close() guest_memfd files.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
=mqOV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- Changes to FPU handling came in via the main s390 pull request
- Only deliver to the guest the SCLP events that userspace has
requested
- More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same)
- Fix selftests undefined behavior
x86:
- Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can
be programmed *using the architectural encoding*. The enumeration
does NOT say anything about the encoding when the CPU doesn't
report support the event *in general*. It might support it, and it
might support it using the same encoding that made it into the
architectural PMU spec
- Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly
emulates RDMPC, counter availability, and a variety of other
PMC-related behaviors that depend on guest CPUID and therefore are
easier to validate with selftests than with custom guests (aka
kvm-unit-tests)
- Zero out PMU state on AMD if the virtual PMU is disabled, it does
not cause any bug but it wastes time in various cases where KVM
would check if a PMC event needs to be synthesized
- Optimize triggering of emulated events, with a nice ~10%
performance improvement in VM-Exit microbenchmarks when a vPMU is
exposed to the guest
- Tighten the check for "PMI in guest" to reduce false positives if
an NMI arrives in the host while KVM is handling an IRQ VM-Exit
- Fix a bug where KVM would report stale/bogus exit qualification
information when exiting to userspace with an internal error exit
code
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock
held for read, e.g. to avoid serializing vCPUs when userspace
deletes a memslot
- Tear down TDP MMU page tables at 4KiB granularity (used to be
1GiB). KVM doesn't support yielding in the middle of processing a
zap, and 1GiB granularity resulted in multi-millisecond lags that
are quite impolite for CONFIG_PREEMPT kernels
- Allocate write-tracking metadata on-demand to avoid the memory
overhead when a kernel is built with i915 virtualization support
but the workloads use neither shadow paging nor i915 virtualization
- Explicitly initialize a variety of on-stack variables in the
emulator that triggered KMSAN false positives
- Fix the debugregs ABI for 32-bit KVM
- Rework the "force immediate exit" code so that vendor code
ultimately decides how and when to force the exit, which allowed
some optimization for both Intel and AMD
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left
elevated if vCPU creation ultimately failed, causing extra
unnecessary work
- Cleanup the logic for checking if the currently loaded vCPU is
in-kernel
- Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere
in the kernel) are detected earlier and are less likely to hang the
kernel
x86 Xen emulation:
- Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the gpa
but the underlying host virtual address remains the same
- When possible, use a single host TSC value when computing the
deadline for Xen timers in order to improve the accuracy of the
timer emulation
- Inject pending upcall events when the vCPU software-enables its
APIC to fix a bug where an upcall can be lost (and to follow Xen's
behavior)
- Fall back to the slow path instead of warning if "fast" IRQ
delivery of Xen events fails, e.g. if the guest has aliased xAPIC
IDs
RISC-V:
- Support exception and interrupt handling in selftests
- New self test for RISC-V architectural timer (Sstc extension)
- New extension support (Ztso, Zacas)
- Support userspace emulation of random number seed CSRs
ARM:
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized
to address serialization some of the serialization on the LPI
injection path
- Support for _architectural_ VHE-only systems, advertised through
the absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
- Set reserved bits as zero in CPUCFG
- Start SW timer only when vcpu is blocking
- Do not restart SW timer when it is expired
- Remove unnecessary CSR register saving during enter guest
- Misc cleanups and fixes as usual
Generic:
- Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
always true on all architectures except MIPS (where Kconfig
determines the available depending on CPU capabilities). It is
replaced either by an architecture-dependent symbol for MIPS, and
IS_ENABLED(CONFIG_KVM) everywhere else
- Factor common "select" statements in common code instead of
requiring each architecture to specify it
- Remove thoroughly obsolete APIs from the uapi headers
- Move architecture-dependent stuff to uapi/asm/kvm.h
- Always flush the async page fault workqueue when a work item is
being removed, especially during vCPU destruction, to ensure that
there are no workers running in KVM code when all references to
KVM-the-module are gone, i.e. to prevent a very unlikely
use-after-free if kvm.ko is unloaded
- Grab a reference to the VM's mm_struct in the async #PF worker
itself instead of gifting the worker a reference, so that there's
no need to remember to *conditionally* clean up after the worker
Selftests:
- Reduce boilerplate especially when utilize selftest TAP
infrastructure
- Add basic smoke tests for SEV and SEV-ES, along with a pile of
library support for handling private/encrypted/protected memory
- Fix benign bugs where tests neglect to close() guest_memfd files"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
selftests: kvm: remove meaningless assignments in Makefiles
KVM: riscv: selftests: Add Zacas extension to get-reg-list test
RISC-V: KVM: Allow Zacas extension for Guest/VM
KVM: riscv: selftests: Add Ztso extension to get-reg-list test
RISC-V: KVM: Allow Ztso extension for Guest/VM
RISC-V: KVM: Forward SEED CSR access to user space
KVM: riscv: selftests: Add sstc timer test
KVM: riscv: selftests: Change vcpu_has_ext to a common function
KVM: riscv: selftests: Add guest helper to get vcpu id
KVM: riscv: selftests: Add exception handling support
LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
LoongArch: KVM: Do not restart SW timer when it is expired
LoongArch: KVM: Start SW timer only when vcpu is blocking
LoongArch: KVM: Set reserved bits as zero in CPUCFG
KVM: selftests: Explicitly close guest_memfd files in some gmem tests
KVM: x86/xen: fix recursive deadlock in timer injection
KVM: pfncache: simplify locking and make more self-contained
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
KVM: x86/xen: improve accuracy of Xen timers
...
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is hotplugged
as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving policy
wherein we allocate memory across nodes in a weighted fashion rather
than uniformly. This is beneficial in heterogeneous memory environments
appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the process
has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown situations.
The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's series
"Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page faults.
He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test",
Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
our handling of DAX on archiecctures which have virtually aliasing data
caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
improvements in worst-case mmap_lock hold times during certain
userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements
in his series "Mitigate a vmap lock contention". It realizes a 12x
improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging of
large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages() to
an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which are
configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
=TG55
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
* Reorganise the arm64 kernel VA space and add support for LPA2 (at
stage 1, KVM stage 2 was merged earlier) - 52-bit VA/PA address range
with 4KB and 16KB pages
* Enable Rust on arm64
* Support for the 2023 dpISA extensions (data processing ISA), host only
* arm64 perf updates:
- StarFive's StarLink (integrates one or more CPU cores with a shared
L3 memory system) PMU support
- Enable HiSilicon Erratum 162700402 quirk for HIP09
- Several updates for the HiSilicon PCIe PMU driver
- Arm CoreSight PMU support
- Convert all drivers under drivers/perf/ to use .remove_new()
* Miscellaneous:
- Don't enable workarounds for "rare" errata by default
- Clean up the DAIF flags handling for EL0 returns (in preparation for
NMI support)
- Kselftest update for ptrace()
- Update some of the sysreg field definitions
- Slight improvement in the code generation for inline asm I/O
accessors to permit offset addressing
- kretprobes: acquire regs via a BRK exception (previously done via a
trampoline handler)
- SVE/SME cleanups, comment updates
- Allow CALL_OPS+CC_OPTIMIZE_FOR_SIZE with clang (previously disabled
due to gcc silently ignoring -falign-functions=N)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmXxiSgACgkQa9axLQDI
XvHd7hAAjQrQqxJogPT2ahM5/gxct8qTrXpIgX0B1Y7bb5R8ztvOUN9MJNuDyRsj
0s28SSZw387LReM5OUu+U6G/iahcuNAyP/8d9qeac32Tidd255fV3KPEh4C4eC+u
0HeOqLBZ+stmNoa71tBC2K6SmchizhYyYduvRnri8km8K4OMDawHWqWRTXl0PNRT
RMVJvZTDJMPfMBFeD4+B7EnSFOoP14tKCw9MZvlbpT2PEV0kINjhCQiojW2jJgqv
w36vm/dhwsg1avSzT1xhy3KE+m+7n28+IC/wr1HB7c1WumvYKv7Z84ieCp3PlO3Z
owvVO7dKJC6X3RkoY6Kge5p2RHU6poDerDVHYiAvG+Zi57nrDmHyAubskThsGTGR
AibSEeJ5nQ0yM6hx7zAIQa5XEo4l0svD1ZM7NynY+5JR44W9cdAH3SnEsvIBMGIf
/ja+iZ1W4ZQnIESQXD5uDPSxILfqQ8Ebhdorpw+Qg3rB7OhdTdGSSGQCi6V2PcJH
d/ErFO+i0lFRBPJtBbUAN4EEu3HJcVYEoEnVJYQahC+6KyNGLxO+7L6sH0YO7Pag
P1LRa6h8ktuBMrbCrOPWdmJYNDYCbb5rRtmcCwO0ItZ4g5tYWp9djFc8pyctCaNB
MZxxRrUCNwXTOcFTDiYzyk+JCvpf3EvXfvj8AH+P8BMjFWgqHqw=
=KTD/
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The major features are support for LPA2 (52-bit VA/PA with 4K and 16K
pages), the dpISA extension and Rust enabled on arm64. The changes are
mostly contained within the usual arch/arm64/, drivers/perf, the arm64
Documentation and kselftests. The exception is the Rust support which
touches some generic build files.
Summary:
- Reorganise the arm64 kernel VA space and add support for LPA2 (at
stage 1, KVM stage 2 was merged earlier) - 52-bit VA/PA address
range with 4KB and 16KB pages
- Enable Rust on arm64
- Support for the 2023 dpISA extensions (data processing ISA), host
only
- arm64 perf updates:
- StarFive's StarLink (integrates one or more CPU cores with a
shared L3 memory system) PMU support
- Enable HiSilicon Erratum 162700402 quirk for HIP09
- Several updates for the HiSilicon PCIe PMU driver
- Arm CoreSight PMU support
- Convert all drivers under drivers/perf/ to use .remove_new()
- Miscellaneous:
- Don't enable workarounds for "rare" errata by default
- Clean up the DAIF flags handling for EL0 returns (in preparation
for NMI support)
- Kselftest update for ptrace()
- Update some of the sysreg field definitions
- Slight improvement in the code generation for inline asm I/O
accessors to permit offset addressing
- kretprobes: acquire regs via a BRK exception (previously done
via a trampoline handler)
- SVE/SME cleanups, comment updates
- Allow CALL_OPS+CC_OPTIMIZE_FOR_SIZE with clang (previously
disabled due to gcc silently ignoring -falign-functions=N)"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (134 commits)
Revert "mm: add arch hook to validate mmap() prot flags"
Revert "arm64: mm: add support for WXN memory translation attribute"
Revert "ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512"
ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512
kselftest/arm64: Add 2023 DPISA hwcap test coverage
kselftest/arm64: Add basic FPMR test
kselftest/arm64: Handle FPMR context in generic signal frame parser
arm64/hwcap: Define hwcaps for 2023 DPISA features
arm64/ptrace: Expose FPMR via ptrace
arm64/signal: Add FPMR signal handling
arm64/fpsimd: Support FEAT_FPMR
arm64/fpsimd: Enable host kernel access to FPMR
arm64/cpufeature: Hook new identification registers up to cpufeature
docs: perf: Fix build warning of hisi-pcie-pmu.rst
perf: starfive: Only allow COMPILE_TEST for 64-bit architectures
MAINTAINERS: Add entry for StarFive StarLink PMU
docs: perf: Add description for StarFive's StarLink PMU
dt-bindings: perf: starfive: Add JH8100 StarLink PMU
perf: starfive: Add StarLink PMU support
docs: perf: Update usage for target filter of hisi-pcie-pmu
...
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZepBjgAKCRCivnWIJHzd
FnngAP93VxjCkJ+5qSmYpFNG6r0ECVIbLHFQ59nKn0+GgvbPEgEAwt8svdLdW06h
njFTpdzvl4Po+aD/V9xHgqVz3kVvZwE=
=1FbW
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
avoid creating ABI that KVM can't sanely support.
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely a development and testing vehicle, and
come with zero guarantees.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
is to support confidential VMs with deterministic private memory (SNP
and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXZB/8ACgkQOlYIJqCj
N/3XlQ//RIsvqr38k7kELSKhCMyWgF4J57itABrHpMqAZu3gaAo5sETX8AGcHEe5
mxmquxyNQSf4cthhWy1kzxjGCy6+fk+Z0Z7wzfz0Yd5D+FI6vpo3HhkjovLb2gpt
kSrHuhJyuj2vkftNvdaz0nHX1QalVyIEnXnR3oqTmxUUsg6lp1x/zr5SP0KBXjo8
ZzJtyFd0fkRXWpA792T7XPRBWrzPV31HYZBLX8sPlYmJATcbIx9rYSThgCN6XuVN
bfE6wATsC+mwv5BpCoDFpCKmFcqSqamag9NGe5qE5mOby5DQGYTCRMCQB8YXXBR0
97ppaY9ZJV4nOVjrYJn6IMOSMVNfoG7nTRFfcd0eFP4tlPEgHwGr5BGDaBtQPkrd
KcgWJw8nS02eCA2iOE+FtCXvGJwKhTTjQ45w7rU4EcfUk603L5J4GO1ddmjMhPcP
upGGcWDK9vCGrSUFTm8pyWp/NKRJPvAQEiQd/BweSk9+isQHTX2RYCQgPAQnwlTS
wTg7ZPNSLoUkRYmd6r+TUT32ELJGNc8GLftMnxIwweq6V7AgNMi0HE60eMovuBNO
7DAWWzfBEZmJv+0mNNZPGXczHVv4YvMWysRdKkhztBc3+sO7P3AL1zWIDlm5qwoG
LpFeeI3qo3o5ZNaqGzkSop2pUUGNGpWCH46WmP0AG7RpzW/Natw=
=M0td
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
avoid creating ABI that KVM can't sanely support.
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely a development and testing vehicle, and
come with zero guarantees.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
is to support confidential VMs with deterministic private memory (SNP
and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
* for-next/stage1-lpa2: (48 commits)
: Add support for LPA2 and WXN and stage 1
arm64/mm: Avoid ID mapping of kpti flag if it is no longer needed
arm64/mm: Use generic __pud_free() helper in pud_free() implementation
arm64: gitignore: ignore relacheck
arm64: Use Signed/Unsigned enums for TGRAN{4,16,64} and VARange
arm64: mm: Make PUD folding check in set_pud() a runtime check
arm64: mm: add support for WXN memory translation attribute
mm: add arch hook to validate mmap() prot flags
arm64: defconfig: Enable LPA2 support
arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs
arm64: kvm: avoid CONFIG_PGTABLE_LEVELS for runtime levels
arm64: ptdump: Deal with translation levels folded at runtime
arm64: ptdump: Disregard unaddressable VA space
arm64: mm: Add support for folding PUDs at runtime
arm64: kasan: Reduce minimum shadow alignment and enable 5 level paging
arm64: mm: Add 5 level paging support to fixmap and swapper handling
arm64: Enable LPA2 at boot if supported by the system
arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversion
arm64: mm: Add definitions to support 5 levels of paging
arm64: mm: Add LPA2 support to phys<->pte conversion routines
arm64: mm: Wire up TCR.DS bit to PTE shareability fields
...