IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
* Add a capability for enabling secure guests under the Protected
Execution Framework ultravisor
* Various bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJegnq3AAoJEJ2a6ncsY3GfrU8IANpaxS7kTAsW3ZN0wJnP2PHq
i8j7ZPlCVLpkQWArrMyMimDLdiN9VP7lvaWonAWjG0HJrxlmMUUVnwVSMuPTXhWY
vSwEqhUXwSF5KKxN7DYkgxqaKFxkElJIubVE/AYJjH9zFpu9ca4vM5sCxnzWcvS3
hxPNe756nKhFH9xhrC/9NRUZWmDAiv75wvzq+5DRKbSVPJGJugchdIPBbi3Yrr7P
qxnUCPZdCxzXXU94bfQl038wrjSMR3S7b4FvekJ12go2FalujqzsL2lVtift4Rf0
jvu+RIINcmTiaQqclY332j7l24LZ6Pni456RygT5OwFxuYoxWKZRafN16JmaMCo=
=rpJL
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
KVM PPC update for 5.7
* Add a capability for enabling secure guests under the Protected
Execution Framework ultravisor
* Various bug fixes and cleanups.
At present, on Power systems with Protected Execution Facility
hardware and an ultravisor, a KVM guest can transition to being a
secure guest at will. Userspace (QEMU) has no way of knowing
whether a host system is capable of running secure guests. This
will present a problem in future when the ultravisor is capable of
migrating secure guests from one host to another, because
virtualization management software will have no way to ensure that
secure guests only run in domains where all of the hosts can
support secure guests.
This adds a VM capability which has two functions: (a) userspace
can query it to find out whether the host can support secure guests,
and (b) userspace can enable it for a guest, which allows that
guest to become a secure guest. If userspace does not enable it,
KVM will return an error when the ultravisor does the hypercall
that indicates that the guest is starting to transition to a
secure guest. The ultravisor will then abort the transition and
the guest will terminate.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
When the call to UV_REGISTER_MEM_SLOT is failing, for instance because
there is not enough free secured memory, the Hypervisor (HV) has to call
UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.
If the kvm->arch.secure_guest is not set, in the return path rfid is called
but there is no valid context to get back to the SVM since the Hcall has
been routed by the Ultravisor.
Move the setting of kvm->arch.secure_guest earlier in
kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
instead of rfid.
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The Hcall named H_SVM_* are reserved to the Ultravisor. However, nothing
prevent a malicious VM or SVM to call them. This could lead to weird result
and should be filtered out.
Checking the Secure bit of the calling MSR ensure that the call is coming
from either the Ultravisor or a SVM. But any system call made from a SVM
are going through the Ultravisor, and the Ultravisor should filter out
these malicious call. This way, only the Ultravisor is able to make such a
Hcall.
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibnm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These are only used by HV KVM and BookE, and in both cases they are
nops.
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This is only relevant to PR KVM. Make it obvious by moving the
function declaration to the Book3s header and rename it with
a _pr suffix.
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With PR KVM, shutting down a VM causes the host kernel to crash:
[ 314.219284] BUG: Unable to handle kernel data access on read at 0xc00800000176c638
[ 314.219299] Faulting instruction address: 0xc008000000d4ddb0
cpu 0x0: Vector: 300 (Data Access) at [c00000036da077a0]
pc: c008000000d4ddb0: kvmppc_mmu_pte_flush_all+0x68/0xd0 [kvm_pr]
lr: c008000000d4dd94: kvmppc_mmu_pte_flush_all+0x4c/0xd0 [kvm_pr]
sp: c00000036da07a30
msr: 900000010280b033
dar: c00800000176c638
dsisr: 40000000
current = 0xc00000036d4c0000
paca = 0xc000000001a00000 irqmask: 0x03 irq_happened: 0x01
pid = 1992, comm = qemu-system-ppc
Linux version 5.6.0-master-gku+ (greg@palmb) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #17 SMP Wed Mar 18 13:49:29 CET 2020
enter ? for help
[c00000036da07ab0] c008000000d4fbe0 kvmppc_mmu_destroy_pr+0x28/0x60 [kvm_pr]
[c00000036da07ae0] c0080000009eab8c kvmppc_mmu_destroy+0x34/0x50 [kvm]
[c00000036da07b00] c0080000009e50c0 kvm_arch_vcpu_destroy+0x108/0x140 [kvm]
[c00000036da07b30] c0080000009d1b50 kvm_vcpu_destroy+0x28/0x80 [kvm]
[c00000036da07b60] c0080000009e4434 kvm_arch_destroy_vm+0xbc/0x190 [kvm]
[c00000036da07ba0] c0080000009d9c2c kvm_put_kvm+0x1d4/0x3f0 [kvm]
[c00000036da07c00] c0080000009da760 kvm_vm_release+0x38/0x60 [kvm]
[c00000036da07c30] c000000000420be0 __fput+0xe0/0x310
[c00000036da07c90] c0000000001747a0 task_work_run+0x150/0x1c0
[c00000036da07cf0] c00000000014896c do_exit+0x44c/0xd00
[c00000036da07dc0] c0000000001492f4 do_group_exit+0x64/0xd0
[c00000036da07e00] c000000000149384 sys_exit_group+0x24/0x30
[c00000036da07e20] c00000000000b9d0 system_call+0x5c/0x68
This is caused by a use-after-free in kvmppc_mmu_pte_flush_all()
which dereferences vcpu->arch.book3s which was previously freed by
kvmppc_core_vcpu_free_pr(). This happens because kvmppc_mmu_destroy()
is called after kvmppc_core_vcpu_free() since commit ff030fdf5573
("KVM: PPC: Move kvm_vcpu_init() invocation to common code").
The kvmppc_mmu_destroy() helper calls one of the following depending
on the KVM backend:
- kvmppc_mmu_destroy_hv() which does nothing (Book3s HV)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 32-bit)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 64-bit)
- kvmppc_mmu_destroy_e500() which does nothing (BookE e500/e500mc)
It turns out that this is only relevant to PR KVM actually. And both
32 and 64 backends need vcpu->arch.book3s to be valid when calling
kvmppc_mmu_destroy_pr(). So instead of calling kvmppc_mmu_destroy()
from kvm_arch_vcpu_destroy(), call kvmppc_mmu_destroy_pr() at the
beginning of kvmppc_core_vcpu_free_pr(). This is consistent with
kvmppc_mmu_init() being the last call in kvmppc_core_vcpu_create_pr().
For the same reason, if kvmppc_core_vcpu_create_pr() returns an
error then this means that kvmppc_mmu_init() was either not called
or failed, in which case kvmppc_mmu_destroy() should not be called.
Drop the line in the error path of kvm_arch_vcpu_create().
Fixes: ff030fdf5573 ("KVM: PPC: Move kvm_vcpu_init() invocation to common code")
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
via the guest/nested hypervisor.
./run-tests.sh -v
...
TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 2,threads=2 -machine cap-htm=on -append "h_cede_tm"
FAIL h_cede_tm (2 tests, 1 unexpected failures)
While the test relates to transactional memory instructions, the actual
failure is due to the return code of the H_CEDE hypercall, which is
reported as 224 instead of 0. This happens even when no TM instructions
are issued.
224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
is where the caller expects the return code to be placed upon return.
In the case of guest running under a nested hypervisor, issuing H_CEDE
causes a return from H_ENTER_NESTED. In this case H_CEDE is
specially-handled immediately rather than later in
kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
set the return code for the caller, hence why kvm-unit-test sees the
224 return code and reports an error.
Guest kernels generally don't check the return value of H_CEDE, so
that likely explains why this hasn't caused issues outside of
kvm-unit-tests so far.
Fix this by setting r3 to 0 after we finish processing the H_CEDE.
RHBZ: 1778556
Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when nested")
Cc: linuxppc-dev@ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On P9 DD2.2 due to a CPU defect some TM instructions need to be emulated by
KVM. This is handled at first by the hardware raising a softpatch interrupt
when certain TM instructions that need KVM assistance are executed in the
guest. Althought some TM instructions per Power ISA are invalid forms they
can raise a softpatch interrupt too. For instance, 'tresume.' instruction
as defined in the ISA must have bit 31 set (1), but an instruction that
matches 'tresume.' PO and XO opcode fields but has bit 31 not set (0), like
0x7cfe9ddc, also raises a softpatch interrupt. Similarly for 'treclaim.'
and 'trechkpt.' instructions with bit 31 = 0, i.e. 0x7c00075c and
0x7c0007dc, respectively. Hence, if a code like the following is executed
in the guest it will raise a softpatch interrupt just like a 'tresume.'
when the TM facility is enabled ('tabort. 0' in the example is used only
to enable the TM facility):
int main() { asm("tabort. 0; .long 0x7cfe9ddc;"); }
Currently in such a case KVM throws a complete trace like:
[345523.705984] WARNING: CPU: 24 PID: 64413 at arch/powerpc/kvm/book3s_hv_tm.c:211 kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv]
[345523.705985] Modules linked in: kvm_hv(E) xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat
iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter bridge stp llc sch_fq_codel ipmi_powernv at24 vmx_crypto ipmi_devintf ipmi_msghandler
ibmpowernv uio_pdrv_genirq kvm opal_prd uio leds_powernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear tg3
crct10dif_vpmsum crc32c_vpmsum ipr [last unloaded: kvm_hv]
[345523.706030] CPU: 24 PID: 64413 Comm: CPU 0/KVM Tainted: G W E 5.5.0+ #1
[345523.706031] NIP: c0080000072cb9c0 LR: c0080000072b5e80 CTR: c0080000085c7850
[345523.706034] REGS: c000000399467680 TRAP: 0700 Tainted: G W E (5.5.0+)
[345523.706034] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24022428 XER: 00000000
[345523.706042] CFAR: c0080000072b5e7c IRQMASK: 0
GPR00: c0080000072b5e80 c000000399467910 c0080000072db500 c000000375ccc720
GPR04: c000000375ccc720 00000003fbec0000 0000a10395dda5a6 0000000000000000
GPR08: 000000007cfe9ddc 7cfe9ddc000005dc 7cfe9ddc7c0005dc c0080000072cd530
GPR12: c0080000085c7850 c0000003fffeb800 0000000000000001 00007dfb737f0000
GPR16: c0002001edcca558 0000000000000000 0000000000000000 0000000000000001
GPR20: c000000001b21258 c0002001edcca558 0000000000000018 0000000000000000
GPR24: 0000000001000000 ffffffffffffffff 0000000000000001 0000000000001500
GPR28: c0002001edcc4278 c00000037dd80000 800000050280f033 c000000375ccc720
[345523.706062] NIP [c0080000072cb9c0] kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv]
[345523.706065] LR [c0080000072b5e80] kvmppc_handle_exit_hv.isra.53+0x3e8/0x798 [kvm_hv]
[345523.706066] Call Trace:
[345523.706069] [c000000399467910] [c000000399467940] 0xc000000399467940 (unreliable)
[345523.706071] [c000000399467950] [c000000399467980] 0xc000000399467980
[345523.706075] [c0000003994679f0] [c0080000072bd1c4] kvmhv_run_single_vcpu+0xa1c/0xb80 [kvm_hv]
[345523.706079] [c000000399467ac0] [c0080000072bd8e0] kvmppc_vcpu_run_hv+0x5b8/0xb00 [kvm_hv]
[345523.706087] [c000000399467b90] [c0080000085c93cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[345523.706095] [c000000399467bb0] [c0080000085c582c] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[345523.706101] [c000000399467c40] [c0080000085b7498] kvm_vcpu_ioctl+0x3d0/0x7b0 [kvm]
[345523.706105] [c000000399467db0] [c0000000004adf9c] ksys_ioctl+0x13c/0x170
[345523.706107] [c000000399467e00] [c0000000004adff8] sys_ioctl+0x28/0x80
[345523.706111] [c000000399467e20] [c00000000000b278] system_call+0x5c/0x68
[345523.706112] Instruction dump:
[345523.706114] 419e0390 7f8a4840 409d0048 6d497c00 2f89075d 419e021c 6d497c00 2f8907dd
[345523.706119] 419e01c0 6d497c00 2f8905dd 419e00a4 <0fe00000> 38210040 38600000 ebc1fff0
and then treats the executed instruction as a 'nop'.
However the POWER9 User's Manual, in section "4.6.10 Book II Invalid
Forms", informs that for TM instructions bit 31 is in fact ignored, thus
for the TM-related invalid forms ignoring bit 31 and handling them like the
valid forms is an acceptable way to handle them. POWER8 behaves the same
way too.
This commit changes the handling of the cases here described by treating
the TM-related invalid forms that can generate a softpatch interrupt
just like their valid forms (w/ bit 31 = 1) instead of as a 'nop' and by
gently reporting any other unrecognized case to the host and treating it as
illegal instruction instead of throwing a trace and treating it as a 'nop'.
Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Acked-By: Michael Neuling <mikey@neuling.org>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In kvmppc_unmap_free_pte() in book3s_64_mmu_radix.c, we use the
non-constant value PTE_INDEX_SIZE to clear a PTE page.
We can instead use the constant RADIX_PTE_INDEX_SIZE, because we know
this code will only be running when the Radix MMU is active.
Note that we already use RADIX_PTE_INDEX_SIZE for the allocation of
kvm_pte_cache.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the same changes in the page fault handler for HPT guests
that commits 31c8b0d0694a ("KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot()
in page fault handler", 2018-03-01), 71d29f43b633 ("KVM: PPC: Book3S HV:
Don't use compound_order to determine host mapping size", 2018-09-11)
and 6579804c4317 ("KVM: PPC: Book3S HV: Avoid crash from THP collapse
during radix page fault", 2018-10-04) made for the page fault handler
for radix guests.
In summary, where we used to call get_user_pages_fast() and then do
special handling for VM_PFNMAP vmas, we now call __get_user_pages_fast()
and then __gfn_to_pfn_memslot() if that fails, followed by reading the
Linux PTE to get the host PFN, host page size and mapping attributes.
This also brings in the change from SetPageDirty() to set_page_dirty_lock()
which was done for the radix page fault handler in commit c3856aeb2940
("KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault
handler", 2018-02-23).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Remove includes of asm/kvm_host.h from files that already include
linux/kvm_host.h to make it more obvious that there is no ordering issue
between the two headers. linux/kvm_host.h includes asm/kvm_host.h to
pick up architecture specific settings, and this will never change, i.e.
including asm/kvm_host.h after linux/kvm_host.h may seem problematic,
but in practice is simply redundant.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor memslot handling to treat the number of used slots as the de
facto size of the memslot array, e.g. return NULL from id_to_memslot()
when an invalid index is provided instead of relying on npages==0 to
detect an invalid memslot. Rework the sorting and walking of memslots
in advance of dynamically sizing memslots to aid bisection and debug,
e.g. with luck, a bug in the refactoring will bisect here and/or hit a
WARN instead of randomly corrupting memory.
Alternatively, a global null/invalid memslot could be returned, i.e. so
callers of id_to_memslot() don't have to explicitly check for a NULL
memslot, but that approach runs the risk of introducing difficult-to-
debug issues, e.g. if the global null slot is modified. Constifying
the return from id_to_memslot() to combat such issues is possible, but
would require a massive refactoring of arch specific code and would
still be susceptible to casting shenanigans.
Add function comments to update_memslots() and search_memslots() to
explicitly (and loudly) state how memslots are sorted.
Opportunistically stuff @hva with a non-canonical value when deleting a
private memslot on x86 to detect bogus usage of the freed slot.
No functional change intended.
Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG
for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code.
The arch specific implemenations are extremely similar, differing
only in whether the dirty log needs to be sync'd from hardware (x86)
and how the TLBs are flushed. Add new arch hooks to handle sync
and TLB flush; the sync will also be used for non-generic dirty log
support in a future patch (s390).
The ulterior motive for providing a common implementation is to
eliminate the dependency between arch and common code with respect to
the memslot referenced by the dirty log, i.e. to make it obvious in the
code that the validity of the memslot is guaranteed, as a future patch
will rework memslot handling such that id_to_memslot() can return NULL.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that all callers of kvm_free_memslot() pass NULL for @dont, remove
the param from the top-level routine and all arch's implementations.
No functional change intended.
Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the "const" attribute from @old in kvm_arch_commit_memory_region()
to allow arch specific code to free arch specific resources in the old
memslot without having to cast away the attribute. Freeing resources in
kvm_arch_commit_memory_region() paves the way for simplifying
kvm_free_memslot() by eliminating the last usage of its @dont param.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove kvm_arch_create_memslot() now that all arch implementations are
effectively nops. Removing kvm_arch_create_memslot() eliminates the
possibility for arch specific code to allocate memory prior to setting
a memslot, which sets the stage for simplifying kvm_free_memslot().
Cc: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allocate the rmap array during kvm_arch_prepare_memory_region() to pave
the way for removing kvm_arch_create_memslot() altogether. Moving PPC's
memory allocation only changes the order of kernel memory allocations
between PPC and common KVM code.
No functional change intended.
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ls (lookup symbol) and zr (reboot) commands use xmon's getstring()
helper to read a string argument from the xmon prompt. This function
skips over leading whitespace, but doesn't check if the first
"non-whitespace" character is a newline which causes some odd
behaviour (<enter> indicates a the enter key was pressed):
0:mon> ls printk<enter>
printk: c0000000001680c4
0:mon> ls<enter>
printk<enter>
Symbol '
printk' not found.
0:mon>
With commit 2d9b332d99b ("powerpc/xmon: Allow passing an argument to
ppc_md.restart()") we have a similar problem with the zr command.
Previously zr took no arguments so "zr<enter> would trigger a reboot.
With that patch applied a second newline needs to be sent in order for
the reboot to occur. Fix this by checking if the leading whitespace
ended on a newline:
0:mon> ls<enter>
Symbol '' not found.
Fixes: 2d9b332d99b2 ("powerpc/xmon: Allow passing an argument to ppc_md.restart()")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200217041343.2454-1-oohall@gmail.com
power_save_ppc32_restore() is called during exception entry, before
re-enabling the MMU. It substracts KERNELBASE from the address
of nap_save_msscr0 to access it.
With CONFIG_VMAP_STACK enabled, data MMU translation has already been
re-enabled, so power_save_ppc32_restore() has to access
nap_save_msscr0 by its virtual address.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7bce32ccbab3ba3e3e0f27da6961bf6313df97ed.1581663140.git.christophe.leroy@c-s.fr
hash_page() needs to read page tables from kernel memory. When entire
kernel memory is mapped by BATs, which is normally the case when
CONFIG_STRICT_KERNEL_RWX is not set, it works even if the page hosting
the page table is not referenced in the MMU hash table.
However, if the page where the page table resides is not covered by
a BAT, a DSI fault can be encountered from hash_page(), and it loops
forever. This can happen when CONFIG_STRICT_KERNEL_RWX is selected
and the alignment of the different regions is too small to allow
covering the entire memory with BATs. This also happens when
CONFIG_DEBUG_PAGEALLOC is selected or when booting with 'nobats'
flag.
Also, if the page containing the kernel stack is not present in the
MMU hash table, registers cannot be saved and a recursive DSI fault
is encountered.
To allow hash_page() to properly do its job at all time and load the
MMU hash table whenever needed, it must run with data MMU disabled.
This means it must be called before re-enabling data MMU. To allow
this, registers clobbered by hash_page() and create_hpte() have to
be saved in the thread struct together with SRR0, SSR1, DAR and DSISR.
It is also necessary to ensure that DSI prolog doesn't overwrite
regs saved by prolog of the current running exception. That means:
- DSI can only use SPRN_SPRG_SCRATCH0
- Exceptions must free SPRN_SPRG_SCRATCH0 before writing to the stack.
This also fixes the Oops reported by Erhard when create_hpte() is
called by add_hash_page().
Due to prolog size increase, a few more exceptions had to get split
in two parts.
Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Erhard F. <erhard_f@mailbox.org>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206501
Link: https://lore.kernel.org/r/64a4aa44686e9fd4b01333401367029771d9b231.1581761633.git.christophe.leroy@c-s.fr
After a treclaim, we expect to be in non-transactional state. If we
don't clear the current thread's MSR[TS] before we get preempted, then
tm_recheckpoint_new_task() will recheckpoint and we get rescheduled in
suspended transaction state.
When handling a signal caught in transactional state,
handle_rt_signal64() calls get_tm_stackpointer() that treclaims the
transaction using tm_reclaim_current() but without clearing the
thread's MSR[TS]. This can cause the TM Bad Thing exception below if
later we pagefault and get preempted trying to access the user's
sigframe, using __put_user(). Afterwards, when we are rescheduled back
into do_page_fault() (but now in suspended state since the thread's
MSR[TS] was not cleared), upon executing 'rfid' after completion of
the page fault handling, the exception is raised because a transition
from suspended to non-transactional state is invalid.
Unexpected TM Bad Thing exception at c00000000000de44 (msr 0x8000000302a03031) tm_scratch=800000010280b033
Oops: Unrecoverable exception, sig: 6 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 25 PID: 15547 Comm: a.out Not tainted 5.4.0-rc2 #32
NIP: c00000000000de44 LR: c000000000034728 CTR: 0000000000000000
REGS: c00000003fe7bd70 TRAP: 0700 Not tainted (5.4.0-rc2)
MSR: 8000000302a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[SE]> CR: 44000884 XER: 00000000
CFAR: c00000000000dda4 IRQMASK: 0
PACATMSCRATCH: 800000010280b033
GPR00: c000000000034728 c000000f65a17c80 c000000001662800 00007fffacf3fd78
GPR04: 0000000000001000 0000000000001000 0000000000000000 c000000f611f8af0
GPR08: 0000000000000000 0000000078006001 0000000000000000 000c000000000000
GPR12: c000000f611f84b0 c00000003ffcb200 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000f611f8140
GPR24: 0000000000000000 00007fffacf3fd68 c000000f65a17d90 c000000f611f7800
GPR28: c000000f65a17e90 c000000f65a17e90 c000000001685e18 00007fffacf3f000
NIP [c00000000000de44] fast_exception_return+0xf4/0x1b0
LR [c000000000034728] handle_rt_signal64+0x78/0xc50
Call Trace:
[c000000f65a17c80] [c000000000034710] handle_rt_signal64+0x60/0xc50 (unreliable)
[c000000f65a17d30] [c000000000023640] do_notify_resume+0x330/0x460
[c000000f65a17e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74
Instruction dump:
7c4ff120 e8410170 7c5a03a6 38400000 f8410060 e8010070 e8410080 e8610088
60000000 60000000 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed0989
---[ end trace 93094aa44b442f87 ]---
The simplified sequence of events that triggers the above exception is:
... # userspace in NON-TRANSACTIONAL state
tbegin # userspace in TRANSACTIONAL state
signal delivery # kernelspace in SUSPENDED state
handle_rt_signal64()
get_tm_stackpointer()
treclaim # kernelspace in NON-TRANSACTIONAL state
__put_user()
page fault happens. We will never get back here because of the TM Bad Thing exception.
page fault handling kicks in and we voluntarily preempt ourselves
do_page_fault()
__schedule()
__switch_to(other_task)
our task is rescheduled and we recheckpoint because the thread's MSR[TS] was not cleared
__switch_to(our_task)
switch_to_tm()
tm_recheckpoint_new_task()
trechkpt # kernelspace in SUSPENDED state
The page fault handling resumes, but now we are in suspended transaction state
do_page_fault() completes
rfid <----- trying to get back where the page fault happened (we were non-transactional back then)
TM Bad Thing # illegal transition from suspended to non-transactional
This patch fixes that issue by clearing the current thread's MSR[TS]
just after treclaim in get_tm_stackpointer() so that we stay in
non-transactional state in case we are preempted. In order to make
treclaim and clearing the thread's MSR[TS] atomic from a preemption
perspective when CONFIG_PREEMPT is set, preempt_disable/enable() is
used. It's also necessary to save the previous value of the thread's
MSR before get_tm_stackpointer() is called so that it can be exposed
to the signal handler later in setup_tm_sigcontexts() to inform the
userspace MSR at the moment of the signal delivery.
Found with tm-signal-context-force-tm kernel selftest.
Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Gustavo Luiz Duarte <gustavold@linux.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200211033831.11165-1-gustavold@linux.ibm.com
In ITLB miss handled the line supposed to clear bits 20-23 on the L2
ITLB entry is buggy and does indeed nothing, leading to undefined
value which could allow execution when it shouldn't.
Properly do the clearing with the relevant instruction.
Fixes: 74fabcadfd43 ("powerpc/8xx: don't use r12/SPRN_SPRG_SCRATCH2 in TLB Miss handlers")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4f70c2778163affce8508a210f65d140e84524b4.1581272050.git.christophe.leroy@c-s.fr
With HW assistance all page tables must be 4k aligned, the 8xx drops
the last 12 bits during the walk.
Redefine HUGEPD_SHIFT_MASK to mask last 12 bits out. HUGEPD_SHIFT_MASK
is used to for alignment of page table cache.
Fixes: 22569b881d37 ("powerpc/8xx: Enable 8M hugepage support with HW assistance")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/778b1a248c4c7ca79640eeff7740044da6a220a0.1581264115.git.christophe.leroy@c-s.fr
Commit 55c8fc3f4930 ("powerpc/8xx: reintroduce 16K pages with HW
assistance") redefined pte_t as a struct of 4 pte_basic_t, because
in 16K pages mode there are four identical entries in the
page table. But the size of hugepage tables is calculated based
of the size of (void *). Therefore, we end up with page tables
of size 1k instead of 4k for 512k pages.
As 512k hugepage tables are the same size as standard page tables,
ie 4k, use the standard page tables instead of PGT_CACHE tables.
Fixes: 3fb69c6a1a13 ("powerpc/8xx: Enable 512k hugepage support with HW assistance")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/90ec56a2315be602494619ed0223bba3b0b8d619.1580997007.git.christophe.leroy@c-s.fr
Recovering a dead PHB can currently cause a deadlock as the PCI
rescan/remove lock is taken twice.
This is caused as part of an existing bug in
eeh_handle_special_event(). The pe is processed while traversing the
PHBs even though the pe is unrelated to the loop. This causes the pe
to be, incorrectly, processed more than once.
Untangling this section can move the pe processing out of the loop and
also outside the locked section, correcting both problems.
Fixes: 2e25505147b8 ("powerpc/eeh: Fix crash when edev->pdev changes")
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Tested-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0547e82dbf90ee0729a2979a8cac5c91665c621f.1581051445.git.sbobroff@linux.ibm.com
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are
more natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl47NfMVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRGwP/3AHO8P0wGEeFKs3ziSMjs2W7/Pj
lN08Kuxm0u3LnyEEcHVUveoi+xBYqvrw0RsGgYf5S8q0Mpep7MPqbfkDUxV/0Zkj
QP2CsvOTbjdBjH7q3ojkwLcDl0Pxu9mg3eZMRXZ2WQeNXuMRw6Bicoh7ElvB1Bv/
HC+j30i2Me3cf/riQGSAsstvlXyIR8RaerR8PfRGESTysiiN76+JcHTatJHhOJL9
O6XKkzo8/CXMYKKVF4Ae4NP+WFg6E96/pAPx0Rf47RbPX9UG35L9rkzTDnk70Ms6
OhKiu3hXsRX7mkqApuoTqjge4+iiQcKZxYmMXU1vGlIRzjwg19/4YFP6pDSCcnIu
kKb8KN4o4N41N7MFS3OLZWwISA8Vw6RbtwDZ3AghDWb7EHb9oNW42mGfcAPr1+wZ
/KH6RHTzaz+5q2MgyMY1NhADFrhIT9CvDM+UJECgbokblnw7PHAnPmbsuVak9ZOH
u9ojO1HpTTuIYO6N6v4K5zQBZF1N+RvkmBnhHd8j6SksppsCoC/G62QxgXhF2YK3
FQMpATCpuyengLxWAmPEjsyyPOlrrdu9UxqNsXVy5ol40+7zpxuHwKcQKCa9urJR
rcpbIwLaBcLhHU4BmvBxUk5aZxxGV2F0O0gXTOAbT2xhd6BipZSMhUmN49SErhQm
NC/coUmQX7McxMXh
=sv4U
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are more
natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: make multiple directory targets work
kconfig: Invalidate all symbols after changing to y or m.
kallsyms: fix type of kallsyms_token_table[]
scripts/kallsyms: change table to store (strcut sym_entry *)
scripts/kallsyms: rename local variables in read_symbol()
kbuild: rename hostprogs-y/always to hostprogs/always-y
kbuild: fix the document to use extra-y for vmlinux.lds
kconfig: fix broken dependency in randconfig-generated .config
Fix an existing bug in our user access handling, exposed by one of the bug fixes
we merged this cycle.
A fix for a boot hang on 32-bit with CONFIG_TRACE_IRQFLAGS and the recently
added CONFIG_VMAP_STACK.
Thanks to:
Christophe Leroy, Guenter Roeck.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl4+qP8THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgA0xEACAciGc2VvFMxMJ+M59Sd76/KLDeV38
VE2Q9BukGREby9ekjDUy7j94nnXIWrhPabaK0qIfIl2TtgkIjccBrkj7uGjA0pol
9ri0uGU2BtEvEqklsrJRXHcukHGQcKNPtKm9CRKSmuoK335x9BS7HhLOhyudVURQ
/1lB8mM81UgmZ88j07Ws0Wa6sxaUWvCrBkRGmea5JIabOoRqELvUHZ9ZwFmMD9wL
MW2LDFOTIFOAoVes4K2JZB5n4Es3xsXA9IP079dF5mH9bh9RjUHv4dBsrnQEvNC5
Yna5cwYJn8N1rRRX5Zh7jHh1BICC+Z5yXJfkW8WUs7bf8BqEC4ZdXcqiWBo1jTb0
OW8uM/syOApXVmxJC2H9zWcU576zoc3dDzW29LITMgEde1BlgtkX6Ezk17TZ4d8C
jOt3LTNavsk5z4pu/11mRX/7bRKQ0A4MONnAtSzWWaIzWzaHkrVM226IS7kha4i6
GMyjHO7eDr+wRBPJyGh1QPou9d5sLacJ4TRECtP7AcPoafWY1Zpk61FDBc5OYTQp
csxNzG5R0S6bGaty6VmvsCiPlyHW8gdQP0YRSqmZ6aAts5vipNQ3WJzOzh28CAEj
A66E0L+nTbcNMmivgm4d23bXbqg0tH1vB2by5VJTj3QXAAIj+G68EwuqX/fIUqqh
62BAopZCeMYgSA==
=+Urh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix an existing bug in our user access handling, exposed by one of
the bug fixes we merged this cycle.
- A fix for a boot hang on 32-bit with CONFIG_TRACE_IRQFLAGS and the
recently added CONFIG_VMAP_STACK.
Thanks to: Christophe Leroy, Guenter Roeck.
* tag 'powerpc-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix CONFIG_TRACE_IRQFLAGS with CONFIG_VMAP_STACK
powerpc/futex: Fix incorrect user access blocking
Various driver updates for platforms:
- Nvidia: Fuse support for Tegra194, continued memory controller pieces
for Tegra30
- NXP/FSL: Refactorings of QuickEngine drivers to support ARM/ARM64/PPC
- NXP/FSL: i.MX8MP SoC driver pieces
- TI Keystone: ring accelerator driver
- Qualcomm: SCM driver cleanup/refactoring + support for new SoCs.
- Xilinx ZynqMP: feature checking interface for firmware. Mailbox
communication for power management
- Overall support patch set for cpuidle on more complex hierarchies
(PSCI-based)
+ Misc cleanups, refactorings of Marvell, TI, other platforms.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAl4+lTYPHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx3nQcQAJm91+6hZbmMjlBySGS7ISjYvOcrI/hMgiOl
uhhEP0Dcylvf9A9x3wcIbLwixe+2pvie9DQh2u5F80ShYimidtFi/2xCfuTb9fKu
sxxKjrXWyVKhkpW0z+tedY08ftVhkwwcyD4m2C7uVl6AwTP7c367vFeU7XjF2APn
drfgmgbjm8U3XbSyAqv+k6z6tyqaCnFM7vbPupSKHgHJ3mfByxOa+XyBN2RdgBbs
0KrVfbXGv80zFIFrMPwaWG7G52bu7K68nVdgy44MpKdRZ6QTjhnR+kerFxHsYgV4
bM55Fya52nTCSTGdKaQakDtKwbAUdCDTSkxgOHGcQoyFi0R/VaEUJtcysnvLbI6c
+n/yFIzGyEdXcvIzfv2SoDYhogw19I6RR/M9K5Ni29eazkDVYx2z3rI+2QYeqCiF
u7cq52gW6JLP0SI/9kuUrRFiR8v19Ixap7qokAxgqQwYB3NzT8a7WsYPkzdpDZGQ
ETSDFMyBWT6UvBe/HWkQluBabbet53rG8BF0OHFrQuMK0u/ieKgSGuTB9XN2djEW
PHMOMz2vhi+8XTfpkskhF2tTxlA/k4R6QwCdIMpIkMRVnVQCh1XdPr3Fi2NrgB+S
kIXHD4vV6zLYh04zHyKewSPHAXWgraFpg2qKnvL5+KWMTnW6QH+RNjOt9xKDNXOd
+iDXpOad
=ONtb
-----END PGP SIGNATURE-----
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC-related driver updates from Olof Johansson:
"Various driver updates for platforms:
- Nvidia: Fuse support for Tegra194, continued memory controller
pieces for Tegra30
- NXP/FSL: Refactorings of QuickEngine drivers to support
ARM/ARM64/PPC
- NXP/FSL: i.MX8MP SoC driver pieces
- TI Keystone: ring accelerator driver
- Qualcomm: SCM driver cleanup/refactoring + support for new SoCs.
- Xilinx ZynqMP: feature checking interface for firmware. Mailbox
communication for power management
- Overall support patch set for cpuidle on more complex hierarchies
(PSCI-based)
and misc cleanups, refactorings of Marvell, TI, other platforms"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (166 commits)
drivers: soc: xilinx: Use mailbox IPI callback
dt-bindings: power: reset: xilinx: Add bindings for ipi mailbox
drivers: soc: ti: knav_qmss_queue: Pass lockdep expression to RCU lists
MAINTAINERS: Add brcmstb PCIe controller entry
soc/tegra: fuse: Unmap registers once they are not needed anymore
soc/tegra: fuse: Correct straps' address for older Tegra124 device trees
soc/tegra: fuse: Warn if straps are not ready
soc/tegra: fuse: Cache values of straps and Chip ID registers
memory: tegra30-emc: Correct error message for timed out auto calibration
memory: tegra30-emc: Firm up hardware programming sequence
memory: tegra30-emc: Firm up suspend/resume sequence
soc/tegra: regulators: Do nothing if voltage is unchanged
memory: tegra: Correct reset value of xusb_hostr
soc/tegra: fuse: Add APB DMA dependency for Tegra20
bus: tegra-aconnect: Remove PM_CLK dependency
dt-bindings: mediatek: add MT6765 power dt-bindings
soc: mediatek: cmdq: delete not used define
memory: tegra: Add support for the Tegra194 memory controller
memory: tegra: Only include support for enabled SoCs
memory: tegra: Support DVFS on Tegra186 and later
...
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
When CONFIG_PROVE_LOCKING is selected together with (now default)
CONFIG_VMAP_STACK, kernel enter deadlock during boot.
At the point of checking whether interrupts are enabled or not, the
value of MSR saved on stack is read using the physical address of the
stack. But at this point, when using VMAP stack the DATA MMU
translation has already been re-enabled, leading to deadlock.
Don't use the physical address of the stack when
CONFIG_VMAP_STACK is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 028474876f47 ("powerpc/32: prepare for CONFIG_VMAP_STACK")
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/daeacdc0dec0416d1c587cc9f9e7191ad3068dc0.1581095957.git.christophe.leroy@c-s.fr
The early versions of our kernel user access prevention (KUAP) were
written by Russell and Christophe, and didn't have separate
read/write access.
At some point I picked up the series and added the read/write access,
but I failed to update the usages in futex.h to correctly allow read
and write.
However we didn't notice because of another bug which was causing the
low-level code to always enable read and write. That bug was fixed
recently in commit 1d8f739b07bd ("powerpc/kuap: Fix set direction in
allow/prevent_user_access()").
futex_atomic_cmpxchg_inatomic() is passed the user address as %3 and
does:
1: lwarx %1, 0, %3
cmpw 0, %1, %4
bne- 3f
2: stwcx. %5, 0, %3
Which clearly loads and stores from/to %3. The logic in
arch_futex_atomic_op_inuser() is similar, so fix both of them to use
allow_read_write_user().
Without this fix, and with PPC_KUAP_DEBUG=y, we see eg:
Bug: Read fault blocked by AMR!
WARNING: CPU: 94 PID: 149215 at arch/powerpc/include/asm/book3s/64/kup-radix.h:126 __do_page_fault+0x600/0xf30
CPU: 94 PID: 149215 Comm: futex_requeue_p Tainted: G W 5.5.0-rc7-gcc9x-g4c25df5640ae #1
...
NIP [c000000000070680] __do_page_fault+0x600/0xf30
LR [c00000000007067c] __do_page_fault+0x5fc/0xf30
Call Trace:
[c00020138e5637e0] [c00000000007067c] __do_page_fault+0x5fc/0xf30 (unreliable)
[c00020138e5638c0] [c00000000000ada8] handle_page_fault+0x10/0x30
--- interrupt: 301 at cmpxchg_futex_value_locked+0x68/0xd0
LR = futex_lock_pi_atomic+0xe0/0x1f0
[c00020138e563bc0] [c000000000217b50] futex_lock_pi_atomic+0x80/0x1f0 (unreliable)
[c00020138e563c30] [c00000000021b668] futex_requeue+0x438/0xb60
[c00020138e563d60] [c00000000021c6cc] do_futex+0x1ec/0x2b0
[c00020138e563d90] [c00000000021c8b8] sys_futex+0x128/0x200
[c00020138e563e20] [c00000000000b7ac] system_call+0x5c/0x68
Fixes: de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace Access Protection")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: syzbot+e808452bad7c375cbee6@syzkaller-ppc64.appspotmail.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Link: https://lore.kernel.org/r/20200207122145.11928-1-mpe@ellerman.id.au
Some bug fixes/cleanups.
Deprecated scsi passthrough for blk removed.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl49E/4PHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpyecH/AlBzCOlv9kBHKvx30h2QTgbvZlZM++SRQ18
XAuvU/gRVTPLeSsXnJGz0hMD8hxBti6esqvxHzSzs2a6DqkqLrRdnMXsjs6QlAdX
6NwP4VesL7RNKTAjjrtmXQMr8iADtTy8FKCw/sZM+6sqhPeKAzFbBrjfH6amINru
orEF+eGwNXLkegK4+QVQx8f1rlIm7+/Z4lAP75FsaisYWLxklvn3VjZ7YjsCNexi
4zMxv64W8AHCRJK8k7/+vluedwwTghY9ayubw4zeRWmcfRw568bxlCZUQBmNWspC
lj4/ZWzGmd60UQx9fUvEd7T6QCRzaaUYVzXC0Mh8I3pLV+V5tKY=
=uqtg
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Some bug fixes/cleanups.
The deprecated scsi passthrough for virtio_blk is removed"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_balloon: Fix memory leaks on errors in virtballoon_probe()
virtio-balloon: Fix memory leak when unloading while hinting is in progress
virtio_balloon: prevent pfn array overflow
virtio-blk: remove VIRTIO_BLK_F_SCSI support
virtio-pci: check name when counting MSI-X vectors
virtio-balloon: initialize all vq callbacks
virtio-mmio: convert to devm_platform_ioremap_resource
Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since the need for a special flag to support SCSI passthrough on a
block device was added in May 2017 the SCSI passthrough support in
virtio-blk has been disabled. It has always been a bad idea
(just ask the original author..) and we have virtio-scsi for proper
passthrough. The feature also never made it into the virtio 1.0
or later specifications.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
- Implement user_access_begin() and friends for our platforms that support
controlling kernel access to userspace.
- Enable CONFIG_VMAP_STACK on 32-bit Book3S and 8xx.
- Some tweaks to our pseries IOMMU code to allow SVMs ("secure" virtual
machines) to use the IOMMU.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 32-bit VDSO, and
some other improvements.
- A series to use the PCI hotplug framework to control opencapi card's so that
they can be reset and re-read after flashing a new FPGA image.
As well as other minor fixes and improvements as usual.
Thanks to:
Alastair D'Silva, Alexandre Ghiti, Alexey Kardashevskiy, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Bai Yingjie, Chen Zhou, Christophe Leroy,
Frederic Barrat, Greg Kurz, Jason A. Donenfeld, Joel Stanley, Jordan Niethe,
Julia Lawall, Krzysztof Kozlowski, Laurent Dufour, Laurentiu Tudor, Linus
Walleij, Michael Bringmann, Nathan Chancellor, Nicholas Piggin, Nick
Desaulniers, Oliver O'Halloran, Peter Ujfalusi, Pingfan Liu, Ram Pai, Randy
Dunlap, Russell Currey, Sam Bobroff, Sebastian Andrzej Siewior, Shawn
Anastasio, Stephen Rothwell, Steve Best, Sukadev Bhattiprolu, Thiago Jung
Bauermann, Tyrel Datwyler, Vaibhav Jain.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl44uJgTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgGIcD/9U3R2BK3trEPOStcUbYPte9sMqkyYq
bcq4o2qrVc5deMvPhcHOQ4j28RUZOKoRODvSbXzGEGKIDlesmKjuP7AicE5qUjjV
jRtsSOlRElXmPojAgrrlWrFDJOKbW5mFSj2TY/0sjVa06Wcu1Oi6WiQs/TazvZV/
yzKh5lBL6xyQrmgH0h1VWWbblMbsA1bAL/D7m9Pgimpz0W6fOSRWgXILDUXPLBAy
Rtt7p1218xPfhe66EgbLhWLIBJb70r+Z9yJNuVbp9NMJbDAhpfOuyMNXpRCELzXD
5hwm0mFLOwxfSyBgIyIGokLRGFO6XL0uiZIG1Kp+tMxjgnNCmLlRs2R3EF1hoIWi
49DHRAdK+IEggi6S4dXG5aglz6Rsun8pb/lN7uW+M68t3wp2IYQ+H8MQh4cxPTLu
wX6KZr28lNG25yyp97nJq2Vld0xTxSSty92P8f588rkolyxzggUy0Xfen41szNrW
9/bu8NWgt7qVtHmeUoCdWqiIiuMT1k3Of7AN4uAuS6aJHx2Fxr+03ZU5yNr8WIkm
IOf27z8sUx3F8JL9cIuwAIPB0lSDPw1owvfiTYQ1VkzJa4Ko+kgv5wQ5Ors6V+ve
XspE4osSP9T9PoHK2MVlu8mOjLpoo3Ibr849J0lGHQZDP6U3kHNILGfcXA8WP/9b
Fgfh5Wj22cQe8A==
=xpG+
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"A pretty small batch for us, and apologies for it being a bit late, I
wanted to sneak Christophe's user_access_begin() series in.
Summary:
- Implement user_access_begin() and friends for our platforms that
support controlling kernel access to userspace.
- Enable CONFIG_VMAP_STACK on 32-bit Book3S and 8xx.
- Some tweaks to our pseries IOMMU code to allow SVMs ("secure"
virtual machines) to use the IOMMU.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 32-bit
VDSO, and some other improvements.
- A series to use the PCI hotplug framework to control opencapi
card's so that they can be reset and re-read after flashing a new
FPGA image.
As well as other minor fixes and improvements as usual.
Thanks to: Alastair D'Silva, Alexandre Ghiti, Alexey Kardashevskiy,
Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Bai Yingjie, Chen
Zhou, Christophe Leroy, Frederic Barrat, Greg Kurz, Jason A.
Donenfeld, Joel Stanley, Jordan Niethe, Julia Lawall, Krzysztof
Kozlowski, Laurent Dufour, Laurentiu Tudor, Linus Walleij, Michael
Bringmann, Nathan Chancellor, Nicholas Piggin, Nick Desaulniers,
Oliver O'Halloran, Peter Ujfalusi, Pingfan Liu, Ram Pai, Randy Dunlap,
Russell Currey, Sam Bobroff, Sebastian Andrzej Siewior, Shawn
Anastasio, Stephen Rothwell, Steve Best, Sukadev Bhattiprolu, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain"
* tag 'powerpc-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (131 commits)
powerpc: configs: Cleanup old Kconfig options
powerpc/configs/skiroot: Enable some more hardening options
powerpc/configs/skiroot: Disable xmon default & enable reboot on panic
powerpc/configs/skiroot: Enable security features
powerpc/configs/skiroot: Update for symbol movement only
powerpc/configs/skiroot: Drop default n CONFIG_CRYPTO_ECHAINIV
powerpc/configs/skiroot: Drop HID_LOGITECH
powerpc/configs: Drop NET_VENDOR_HP which moved to staging
powerpc/configs: NET_CADENCE became NET_VENDOR_CADENCE
powerpc/configs: Drop CONFIG_QLGE which moved to staging
powerpc: Do not consider weak unresolved symbol relocations as bad
powerpc/32s: Fix kasan_early_hash_table() for CONFIG_VMAP_STACK
powerpc: indent to improve Kconfig readability
powerpc: Provide initial documentation for PAPR hcalls
powerpc: Implement user_access_save() and user_access_restore()
powerpc: Implement user_access_begin and friends
powerpc/32s: Prepare prevent_user_access() for user_access_end()
powerpc/32s: Drop NULL addr verification
powerpc/kuap: Fix set direction in allow/prevent_user_access()
powerpc/32s: Fix bad_kuap_fault()
...
Towards a more consistent naming scheme.
Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI. This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table. With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.
More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")
The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fixup page directory freeing", v4.
This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped. ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes. This makes it easy to backport
these changes. Only the first 2 patches need to be backported to stable.
The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:
1) unhook page/directory
2) TLB invalidate
3) free page/directory
Without this, any concurrent page-table walk could end up with a
Use-after-Free. This is esp. trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie. caches page directories).
Even on UP this might give issues since mmu_gather is preemptible these
days. An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.
This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures. I haven't added patches
w.r.t other architecture because they are yet to be acked.
This patch (of 9):
A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages. In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch
!SMP case is right now broken for radix translation w.r.t page walk
cache flush. We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already. Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."
Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For powerpc p?d_is_leaf() functions already exist. Export them using the
new p?d_leaf() name.
Link: http://lkml.kernel.org/r/20191218162402.45610-7-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.
It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.
This commit renames like follows:
always -> always-y
hostprogs-y -> hostprogs
So, scripts/Makefile will look like this:
always-$(CONFIG_BUILD_BIN2C) += ...
always-$(CONFIG_KALLSYMS) += ...
...
hostprogs := $(always-y) $(always-m)
I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.
The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>