4748 Commits

Author SHA1 Message Date
David Hildenbrand
717c05554a s390/mm: limit number of real-space gmap shadows
We have no known user of real-space designation and only support it to
be architecture compliant.

Gmap shadows with real-space designation are never unshadowed
automatically, as there is nothing to protect for the top level table.

So let's simply limit the number of such shadows to one by removing
existing ones on creation of another one.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:07 +02:00
David Hildenbrand
3218f7094b s390/mm: support real-space for gmap shadows
We can easily support real-space designation just like EDAT1 and EDAT2.
So guest2 can provide for guest3 an asce with the real-space control being
set.

We simply have to allocate the biggest page table possible and fake all
levels.

There is no protection to consider. If we exceed guest memory, vsie code
will inject an addressing exception (via program intercept). In the future,
we could limit the fake table level to the gmap page table.

As the top level page table can never go away, such gmap shadows will never
get unshadowed, so we'll have to come up with another way to limit the number
of kept gmap shadows.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:02 +02:00
David Hildenbrand
1c65781b56 s390/mm: push rte protection down to shadow pte
Just like we already do with ste protection, let's take rte protection
into account. This way, the host pte doesn't have to be mapped writable.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:00 +02:00
David Hildenbrand
18b8980988 s390/mm: support EDAT2 for gmap shadows
If the guest is enabled for EDAT2, we can easily create shadows for
guest2 -> guest3 provided tables that make use of EDAT2.

If guest2 references a 2GB page, this memory looks consecutive for guest2,
but it does not have to be so for us. Therefore we have to create fake
segment and page tables.

This works just like EDAT1 support, so page tables are removed when the
parent table (r3t table entry) is changed.

We don't have to care about:
- ACCF-Validity Control in RTTE
- Access-Control Bits in RTTE
- Fetch-Protection Bit in RTTE
- Common-Region Bit in RTTE

Just like for EDAT1, all bits might be dropped and there is no guarantee
that they are active.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:56 +02:00
David Hildenbrand
fd8d4e3ab6 s390/mm: support EDAT1 for gmap shadows
If the guest is enabled for EDAT1, we can easily create shadows for
guest2 -> guest3 provided tables that make use of EDAT1.

If guest2 references a 1MB page, this memory looks consecutive for guest2,
but it might not be so for us. Therefore we have to create fake page tables.
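
A rough sketch of the fake page table idea, not the kernel's gmap code: a 1 MB
frame referenced by guest2 is described by 256 entries of 4 KB each, one per
consecutive guest2 page. The function and constant names are made up for this
illustration, and real entries would also carry protection and invalid bits.

```c
#include <assert.h>
#include <stdint.h>

/* s390 geometry: a segment covers 1 MB and a page 4 KB, so one page
 * table holds 256 entries. Constant names are illustrative only. */
#define SKETCH_PAGE_SIZE     4096UL
#define SKETCH_SEGMENT_SIZE  (1UL << 20)
#define SKETCH_PTES_PER_PGT  (SKETCH_SEGMENT_SIZE / SKETCH_PAGE_SIZE)

/* Fill a fake page table so that entry i refers to the i-th 4 KB page
 * of the 1 MB frame starting at guest2 address 'frame'. */
static void sketch_fake_pgt_fill(uint64_t *pgt, uint64_t frame)
{
	assert((frame & (SKETCH_SEGMENT_SIZE - 1)) == 0); /* 1 MB aligned */
	for (unsigned long i = 0; i < SKETCH_PTES_PER_PGT; i++)
		pgt[i] = frame + i * SKETCH_PAGE_SIZE;
}
```

Each of these guest2 pages is then translated to a host page on its own, which
is why the memory does not have to be consecutive for the host.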

We can easily add that to our existing infrastructure. The invalidation
mechanism will make sure that fake page tables are removed when the parent
table (sgt table entry) is changed.

As EDAT1 also introduced protection on all page table levels, we have to
also shadow these correctly.

We don't have to care about:
- ACCF-Validity Control in STE
- Access-Control Bits in STE
- Fetch-Protection Bit in STE
- Common-Segment Bit in STE

As all bits might be dropped and there is no guarantee that they are
active ("unpredictable whether the CPU uses these bits", "may be used").
Without using EDAT1 in the shadow ourselves (STE-format control == 0),
simply shadowing these bits would not be enough. They would be ignored.

Please note that we are using the "fake" flag to make this look consistent
with further changes (EDAT2, real-space designation support) and don't let
the shadow functions handle fc=1 stes.

In the future, with huge pages in the host, gmap_shadow_pgt() could simply
try to map a huge host page if "fake" is set to one and indicate via return
value that no lower fake tables / shadow ptes are required.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:51 +02:00
David Hildenbrand
5b062bd494 s390/mm: prepare for EDAT1/EDAT2 support in gmap shadow
In preparation for EDAT1/EDAT2 support for gmap shadows, we have to store
the requested edat level in the gmap shadow.

The edat level used during shadow translation is a property of the gmap
shadow. Depending on that level, the gmap shadow will look different for
the same guest tables. We have to store it internally in order to support
it later.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:47 +02:00
David Hildenbrand
00fc062d53 s390/mm: push ste protection down to shadow pte
If a guest ste is read-only, it doesn't make sense to force the ptes in as
writable in the host. If the source page is read-only in the host, it won't
have to be made writable. Please note that if the source page is not
available, it will still be faulted in writable. This can be changed
internally later on.

If ste protection is removed, underlying shadow tables are also removed,
therefore this change does not affect the guest.
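
The rule described above can be sketched as follows. This is a simplified
model under the assumption that only the two protection bits matter; the
struct and function names are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the guest entries used for one
 * translation; only the protection bits are modelled. */
struct sketch_walk {
	bool ste_protect;  /* protection bit of the guest segment-table entry */
	bool pte_protect;  /* protection bit of the guest page-table entry */
};

/* The shadow pte is read-only if either level asked for protection,
 * so the host pte never has to be writable for a read-only ste. */
static bool sketch_shadow_pte_protected(const struct sketch_walk *w)
{
	assert(w != NULL);
	return w->ste_protect || w->pte_protect;
}
```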

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:45 +02:00
David Hildenbrand
f4debb4090 s390/mm: take ipte_lock during shadow faults
Let's take the ipte_lock while working on guest 2 provided page table, just
like the other gaccess functions.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:40 +02:00
David Hildenbrand
7a6741576b s390/mm: protection exceptions are correctly shadowed
As gmap shadows contain correct protection permissions, protection
exceptions can directly be forwarded to guest 3. If we encounter
a protection exception while faulting, the next guest 3 run will
automatically handle that for us.

Keep the dat_protection logic in place, as it will be helpful later.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:34 +02:00
David Hildenbrand
e52f8b6112 s390/mm: take the mmap_sem in kvm_s390_shadow_fault()
Instead of doing it in the caller, let's just take the mmap_sem
in kvm_s390_shadow_fault(). By taking it as read, we allow parallel
faulting on shadow page tables; the gmap shadow code is prepared for that.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:33 +02:00
David Hildenbrand
0f7f848915 s390/mm: fix races on gmap_shadow creation
Before any thread is allowed to use a gmap_shadow, it has to be fully
initialized. However, for invalidation to work properly, we have to
register the new gmap_shadow before we protect the parent gmap table.

Because locking is tricky, and we have to avoid duplicate gmaps, let's
introduce an initialized field that signals to other threads whether that
gmap_shadow can already be used or whether they have to retry.

Let's properly return errors using ERR_PTR() instead of simply returning
NULL, so a caller can properly react to the error.
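
A minimal userspace sketch of the two ideas in this message: an initialized
flag that tells other threads to retry, and error codes carried in the
returned pointer. The `sketch_*` helpers are simplified stand-ins written for
this example, not the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and returning
-EAGAIN for an unfinished shadow is an assumption about the retry signal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for ERR_PTR()/PTR_ERR()/IS_ERR(): a small negative errno
 * is encoded in the pointer value itself. */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
	assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_shadow {
	bool initialized;  /* set once the shadow is fully set up */
};

/* Return the shadow if it may be used, NULL if none exists, or an
 * error pointer asking the caller to retry while setup is ongoing. */
static struct sketch_shadow *sketch_shadow_find(struct sketch_shadow *sg)
{
	if (!sg)
		return NULL;
	if (!sg->initialized)
		return sketch_err_ptr(-EAGAIN);
	return sg;
}
```

With this shape a caller can tell "no shadow yet" (NULL) apart from "shadow
exists but is not ready" (an error pointer), which a plain NULL return cannot
express.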

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:28 +02:00
David Hildenbrand
998f637cc4 s390/mm: avoid races on region/segment/page table shadowing
We have to unlock sg->guest_table_lock in order to call
gmap_protect_rmap(). If we sleep just before that call, another VCPU
might pick up that shadowed page table (while it is not protected yet)
and use it.

In order to avoid these races, we have to introduce a third state -
"origin set but still invalid" for an entry. This way, we can avoid
another thread already using the entry before the table is fully protected.
As soon as everything is set up, we can clear the invalid bit - if we
had no race with the unshadowing code.
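
The three states can be sketched like this. The bit value and all names are
illustrative assumptions, and the real code performs these steps under
sg->guest_table_lock; the sketch only shows the state transitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative invalid bit; table origins are aligned, so low bits
 * of an entry are free to carry flags. */
#define SKETCH_INVALID 0x20ULL

enum sketch_state { SKETCH_EMPTY, SKETCH_PENDING, SKETCH_VALID };

static enum sketch_state sketch_entry_state(uint64_t entry)
{
	if (entry == SKETCH_INVALID)
		return SKETCH_EMPTY;    /* invalid, no origin */
	if (entry & SKETCH_INVALID)
		return SKETCH_PENDING;  /* origin set but still invalid */
	return SKETCH_VALID;        /* origin set and usable */
}

/* Step 1: publish the origin but keep the entry invalid. */
static uint64_t sketch_entry_begin(uint64_t origin)
{
	assert(origin != 0 && (origin & SKETCH_INVALID) == 0);
	return origin | SKETCH_INVALID;
}

/* Step 2: once the parent table is protected, clear the invalid bit,
 * but only if unshadowing did not reset the entry in the meantime. */
static uint64_t sketch_entry_finish(uint64_t entry, uint64_t origin)
{
	if (entry == (origin | SKETCH_INVALID))
		return origin;
	return entry;
}
```

Another VCPU that finds an entry in the pending state treats it as not yet
usable, which closes the window between setting the origin and protecting the
table.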

Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:27 +02:00
David Hildenbrand
a9d23e71d7 s390/mm: shadow pages with real guest requested protection
We really want to avoid manually handling protection for nested
virtualization. By shadowing pages with the protection the guest asked us
for, the SIE can handle most protection-related actions for us (e.g.
special handling for MVPG) and we can directly forward protection
exceptions to the guest.

PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
Unshadowing will take care of any guest changes to the parent PTE and
any host changes to the host PTE. If the host PTE doesn't have the
fitting access rights or is not available, we have to fix it up.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:19 +02:00
David Hildenbrand
eea3678d43 s390/mm: flush tlb of shadows in all situations
For now, the tlb of shadow gmap is only flushed when the parent is removed,
not when it is removed upfront. Therefore other shadow gmaps can reuse the
tables without the tlb getting flushed.

Fix this by simply flushing the tlb
1. Before the shadow tables are removed (analogous to other unshadow functions)
2. When the gmap is freed and therefore the top level pages are freed.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:18 +02:00
Martin Schwidefsky
aa17aa57cf s390/mm: add kvm shadow fault function
This patch introduces function kvm_s390_shadow_fault() used to resolve a
fault on a shadow gmap. This function will do validity checking and
build up the shadow page table hierarchy in order to fault in the
requested page into the shadow page table structure.

If an exception occurs while shadowing, guest 2 has to be notified about
it using either an exception or a program interrupt intercept. If
concurrent unshadowing occurs, this function will simply return with
-EAGAIN and the caller has to retry.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:12 +02:00
Martin Schwidefsky
4be130a084 s390/mm: add shadow gmap support
For a nested KVM guest the outer KVM host needs to create shadow
page tables for the nested guest. This patch adds the basic support
to the guest address space (gmap) code.

For each guest address space the inner KVM host creates, the first
outer KVM host needs to create shadow page tables. The address space
is identified by the ASCE loaded into the control register 1 at the
time the inner SIE instruction for the second nested KVM guest is
executed. The outer KVM host creates the shadow tables starting with
the table identified by the ASCE on an on-demand basis. The outer KVM
host will get repeated faults for all the shadow tables needed to
run the second KVM guest.

While a shadow page table for the second KVM guest is active the access
to the origin region, segment and page tables needs to be restricted
for the first KVM guest. For region, segment and page tables the first
KVM guest may read the memory, but a write attempt has to lead to an
unshadow.  This is done using the page invalid and read-only bits in the
page table of the first KVM guest. If the first guest re-accesses one of
the origin pages of a shadow, it gets a fault and the affected parts of
the shadow page table hierarchy need to be removed again.

PGSTE tables don't have to be shadowed, as all interpretation assist can't
deal with the invalid bits in the shadow pte being set differently than
the original ones provided by the first KVM guest.

Many bug fixes and improvements by David Hildenbrand.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:04 +02:00
Martin Schwidefsky
6ea427bbbd s390/mm: add reference counter to gmap structure
Let's use a reference counter mechanism to control the lifetime of
gmap structures. This will be needed for further changes related to
gmap shadows.
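
A minimal sketch of such a lifetime rule using C11 atomics. The struct layout
and the get/put helper names are assumptions made for this example and are not
taken from the patch; the `freed` flag stands in for the actual teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_gmap {
	atomic_int ref_count;
	bool freed;  /* stands in for releasing the page tables and memory */
};

static void sketch_gmap_init(struct sketch_gmap *g)
{
	atomic_init(&g->ref_count, 1);  /* the creator holds the first reference */
	g->freed = false;
}

/* Take an additional reference, e.g. before handing the gmap to a shadow. */
static struct sketch_gmap *sketch_gmap_get(struct sketch_gmap *g)
{
	atomic_fetch_add(&g->ref_count, 1);
	return g;
}

/* Drop a reference; whoever drops the last one tears the gmap down. */
static void sketch_gmap_put(struct sketch_gmap *g)
{
	int old = atomic_fetch_sub(&g->ref_count, 1);

	assert(old > 0);  /* a put without a matching get is a bug */
	if (old == 1)
		g->freed = true;
}
```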

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:53:59 +02:00
Martin Schwidefsky
b2d73b2a0a s390/mm: extended gmap pte notifier
The current gmap pte notifier forces a pte into a read-write state.
If the pte is invalidated the gmap notifier is called to inform KVM
that the mapping will go away.

Extend this approach to allow read-write, read-only and no-access
as possible target states and call the pte notifier for any change
to the pte.

This mechanism is used to temporarily set specific access rights for
a pte without doing the heavy work of a true mprotect call.
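
The extended behaviour might be modelled roughly as below. This is a toy model
with hypothetical names; it ignores locking and the case where the requested
state cannot be established, and a counter stands in for the call into KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_prot { SKETCH_PROT_NONE, SKETCH_PROT_READ, SKETCH_PROT_RW };

struct sketch_pte {
	enum sketch_prot prot;  /* access the pte currently allows */
	bool armed;             /* notification requested for this pte */
	int notifications;      /* stands in for calls into KVM */
};

/* Force the pte into the requested target state and arm the notifier. */
static void sketch_force_prot(struct sketch_pte *pte, enum sketch_prot target)
{
	assert(pte != NULL);
	pte->prot = target;
	pte->armed = true;
}

/* Any later change to an armed pte triggers exactly one notification. */
static void sketch_pte_change(struct sketch_pte *pte, enum sketch_prot prot)
{
	assert(pte != NULL);
	if (pte->armed) {
		pte->armed = false;
		pte->notifications++;
	}
	pte->prot = prot;
}
```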

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:49 +02:00
Martin Schwidefsky
8ecb1a59d6 s390/mm: use RCU for gmap notifier list and the per-mm gmap list
The gmap notifier list and the gmap list in the mm_struct change rarely.
Use RCU to optimize the reader of these lists.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:49 +02:00
Martin Schwidefsky
414d3b0749 s390/kvm: page table invalidation notifier
Pass an address range to the page table invalidation notifier
for KVM. This allows notification of changes that affect a larger
virtual memory area, e.g. for 1MB pages.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:48 +02:00
Hendrik Brueckner
9254e70c4e s390/cpum_cf: use perf software context for hardware counters
On s390, there are two different hardware PMUs for counting and
sampling.  Previously, both PMUs have shared the perf_hw_context
which is not correct and, recently, results in this warning:

    ------------[ cut here ]------------
    WARNING: CPU: 5 PID: 1 at kernel/events/core.c:8485 perf_pmu_register+0x420/0x428
    Modules linked in:
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc1+ #2
    task: 00000009c5240000 ti: 00000009c5234000 task.ti: 00000009c5234000
    Krnl PSW : 0704c00180000000 0000000000220c50 (perf_pmu_register+0x420/0x428)
               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
    Krnl GPRS: ffffffffffffffff 0000000000b15ac6 0000000000000000 00000009cb440000
               000000000022087a 0000000000000000 0000000000b78fa0 0000000000000000
               0000000000a9aa90 0000000000000084 0000000000000005 000000000088a97a
               0000000000000004 0000000000749dd0 000000000022087a 00000009c5237cc0
    Krnl Code: 0000000000220c44: a7f4ff54            brc     15,220aec
               0000000000220c48: 92011000           mvi     0(%r1),1
              #0000000000220c4c: a7f40001           brc     15,220c4e
              >0000000000220c50: a7f4ff12           brc     15,220a74
               0000000000220c54: 0707               bcr     0,%r7
               0000000000220c56: 0707               bcr     0,%r7
               0000000000220c58: ebdff0800024       stmg    %r13,%r15,128(%r15)
               0000000000220c5e: a7f13fe0           tmll    %r15,16352
    Call Trace:
    ([<000000000022087a>] perf_pmu_register+0x4a/0x428)
    ([<0000000000b2c25c>] init_cpum_sampling_pmu+0x14c/0x1f8)
    ([<0000000000100248>] do_one_initcall+0x48/0x140)
    ([<0000000000b25d26>] kernel_init_freeable+0x1e6/0x2a0)
    ([<000000000072bda4>] kernel_init+0x24/0x138)
    ([<000000000073495e>] kernel_thread_starter+0x6/0xc)
    ([<0000000000734958>] kernel_thread_starter+0x0/0xc)
    Last Breaking-Event-Address:
     [<0000000000220c4c>] perf_pmu_register+0x41c/0x428
    ---[ end trace 0c6ef9f5b771ad97 ]---

Using the perf_sw_context is an option because the cpum_cf PMU does
not use interrupts.  To make this more clear, initialize the
capabilities in the PMU structure.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-16 12:08:49 +02:00
Peter Zijlstra
b53d6bedbe locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
Since all architectures have this implemented now natively, remove this
dead code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-16 10:48:32 +02:00
Peter Zijlstra
56fefbbc3f locking/atomic, arch/s390: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
Implement FETCH-OP atomic primitives, these are very similar to the
existing OP-RETURN primitives we already have, except they return the
value of the atomic variable _before_ modification.

This is especially useful for irreversible operations -- such as
bitops (because it becomes impossible to reconstruct the state prior
to modification).
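
To illustrate why the old value matters, here is a small userspace sketch
built on C11's atomic_fetch_or() from <stdatomic.h>. It is not the kernel
primitive added by this patch (which has its own types and argument order),
and the helper name is made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set 'mask' in '*flags' and report whether it was already set.
 * The answer needs the value from before the OR: once the bit is set,
 * the new value alone cannot tell whether this call set it. */
static bool sketch_test_and_set(atomic_uint *flags, unsigned int mask)
{
	unsigned int old;

	assert(mask != 0);
	old = atomic_fetch_or(flags, mask);
	return (old & mask) != 0;
}
```

An OP-RETURN style primitive would hand back the value after the OR, in which
the bit is set in both cases, so the two callers could not be told apart.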

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-16 10:48:29 +02:00
Paolo Bonzini
a03825bbd0 KVM: s390: use kvm->created_vcpus
The new created_vcpus field avoids possible races between enabling
capabilities and creating VCPUs.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:37 +02:00
Heiko Carstens
11a7752e01 s390: remove math emulation code
The last in-kernel user is gone so we can finally remove this code.

Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-15 16:37:11 +02:00
Heiko Carstens
9c203239c5 s390: calculate loops_per_jiffies with fp instructions
Implement calculation of loops_per_jiffies with fp instructions which
are available on all 64 bit machines.
To save and restore floating point register context use the new vx support
functions.

Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-15 16:37:10 +02:00
Hendrik Brueckner
a086171ad8 s390: Updated kernel config files
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-15 16:37:07 +02:00
Hendrik Brueckner
f848dbd3bc s390/crc32-vx: add crypto API module for optimized CRC-32 algorithms
Add a crypto API module to access the vector extension based CRC-32
implementations.  Users can request the optimized implementation through
the shash crypto API interface.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-15 16:36:34 +02:00
Paolo Bonzini
f26ed98326 KVM: s390: Features and fixes for 4.8 part1
Four bigger things:
 1. The implementation of the STHYI opcode in the kernel. This is used
    in libraries like qclib [1] to provide enough information for a
    capacity and usage based software licence pricing. The STHYI content
    is defined by the related z/VM documentation [2]. Its data can be
    composed by accessing several other interfaces provided by LPAR or
    the machine. This information is partially sensitive or root-only
    so the kernel does the necessary filtering.
 2. Preparation for nested virtualization (VSIE). KVM should query the
    proper sclp interfaces for the availability of some features before
    using it. In the past we have been sloppy and simply assumed that
    several features are available. With this we should be able to handle
    most cases of a missing feature.
 3. CPU model interfaces extended by some additional features that are
    not covered by a facility bit in STFLE. For example all the crypto
    instructions of the coprocessor provide a query function. As reality
    tends to be more complex (e.g. export regulations might block some
    algorithms) we have to provide additional interfaces to query or
    set these non-stfle features.
 4. Several fixes and changes detected and fixed when doing 1-3.
 
 All features change base s390 code. All relevant patches have an ACK
 from the s390 or component maintainers.
 
 The next pull request for 4.8 (part2) will contain the implementation
 of VSIE.
 
 [1] http://www.ibm.com/developerworks/linux/linux390/qclib.html
 [2] https://www.ibm.com/support/knowledgecenter/SSB27U_6.3.0/com.ibm.zvm.v630.hcpb4/hcpb4sth.htm

Merge tag 'kvm-s390-next-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

2016-06-15 09:21:46 +02:00
Kees Cook
0208b9445b s390/ptrace: run seccomp after ptrace
Close the hole where ptrace can change a syscall out from under seccomp.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
2016-06-14 10:54:45 -07:00
Andy Lutomirski
2f275de5d1 seccomp: Add a seccomp_data parameter secure_computing()
Currently, if arch code wants to supply seccomp_data directly to
seccomp (which is generally much faster than having seccomp do it
using the syscall_get_xyz() API), it has to use the two-phase
seccomp hooks. Add it to the easy hooks, too.

Cc: linux-arch@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-06-14 10:54:39 -07:00
Hendrik Brueckner
19c93787f5 s390/crc32-vx: use vector instructions to optimize CRC-32 computation
Use vector instructions to optimize the computation of CRC-32 checksums.
An optimized version is provided for CRC-32 (IEEE 802.3 Ethernet) in
normal and bitreflected domain, as well as, for bitreflected CRC-32C
(Castagnoli).
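
For reference, the checksums named above can be computed bit by bit as in the
following scalar sketch. It only illustrates what "normal" and "bitreflected"
mean; it is not the vector algorithm from this patch, and the function and
macro names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CRC32_POLY_BE   0x04C11DB7u  /* IEEE 802.3 */
#define SKETCH_CRC32_POLY_LE   0xEDB88320u  /* IEEE 802.3, bit-reversed */
#define SKETCH_CRC32C_POLY_LE  0x82F63B78u  /* Castagnoli, bit-reversed */

/* Bitreflected domain: data bits are consumed least significant bit
 * first and the polynomial is stored bit-reversed. */
static uint32_t sketch_crc_le(uint32_t crc, const uint8_t *p, size_t len,
			      uint32_t poly_reflected)
{
	assert(p != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ ((crc & 1u) ? poly_reflected : 0u);
	}
	return crc;
}

/* Normal domain: data bits are consumed most significant bit first. */
static uint32_t sketch_crc_be(uint32_t crc, const uint8_t *p, size_t len,
			      uint32_t poly)
{
	assert(p != NULL || len == 0);
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int k = 0; k < 8; k++)
			crc = (crc << 1) ^ ((crc & 0x80000000u) ? poly : 0u);
	}
	return crc;
}
```

Seed and final inversion are left to the caller here, since conventions differ
between users of the same polynomial.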

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-14 16:54:16 +02:00
Hendrik Brueckner
0486480802 s390/vx: add support functions for in-kernel FPU use
Introduce the kernel_fpu_begin() and kernel_fpu_end() functions
to enclose any in-kernel use of FPU instructions and registers.
In enclosed sections, you can perform floating-point or vector
(SIMD) computations.  The functions take care of saving and
restoring FPU register contents and controls.

For usage details, see the guidelines in arch/s390/include/asm/fpu/api.h

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-14 16:54:11 +02:00
Heiko Carstens
de3fa841e4 s390/mm: fix compile for PAGE_DEFAULT_KEY != 0
The usual problem for code that is ifdef'ed out is that it doesn't
compile after a while. That's also the case for the storage key
initialisation code, if it would be used (set PAGE_DEFAULT_KEY to
something not zero):

./arch/s390/include/asm/page.h: In function 'storage_key_init_range':
./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range'

Since the code itself has been useful for debugging purposes several
times, remove the ifdefs and make sure the code gets compiler
coverage. The cost for this is eight bytes.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-14 16:54:05 +02:00
Peter Zijlstra
726328d92a locking/spinlock, arch: Update and fix spin_unlock_wait() implementations
This patch updates/fixes all spin_unlock_wait() implementations.

The update is in semantics; where it previously was only a control
dependency, we now upgrade to a full load-acquire to match the
store-release from the spin_unlock() we waited on. This ensures that
when spin_unlock_wait() returns, we're guaranteed to observe the full
critical section we waited on.
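
The acquire/release pairing can be sketched with C11 atomics on a hypothetical
test-and-set lock. This is a userspace model with made-up names, not any
architecture's actual implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical test-and-set lock: 0 = free, 1 = held. */
struct sketch_lock {
	atomic_int locked;
};

static void sketch_lock_acquire(struct sketch_lock *l)
{
	while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
		;  /* spin until the previous value was 0 */
}

static void sketch_unlock(struct sketch_lock *l)
{
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
	/* store-release: publishes everything done in the critical section */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

static void sketch_unlock_wait(struct sketch_lock *l)
{
	/* load-acquire: pairs with the store-release in sketch_unlock(), so
	 * the waiter observes the critical section it waited on */
	while (atomic_load_explicit(&l->locked, memory_order_acquire))
		;
}
```

With only a control dependency in the wait loop, later loads in the waiter
could be satisfied before the lock was seen free; the acquire load rules that
out.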

This fixes a number of spin_unlock_wait() users that (not
unreasonably) rely on this.

I also fixed a number of ticket lock versions to only wait on the
current lock holder, instead of for a full unlock, as this is
sufficient.

Furthermore; again for ticket locks; I added an smp_rmb() in between
the initial ticket load and the spin loop testing the current value
because I could not convince myself the address dependency is
sufficient, esp. if the loads are of different sizes.

I'm more than happy to remove this smp_rmb() again if people are
certain the address dependency does indeed work as expected.

Note: PPC32 will be fixed independently

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: chris@zankel.net
Cc: cmetcalf@mellanox.com
Cc: davem@davemloft.net
Cc: dhowells@redhat.com
Cc: james.hogan@imgtec.com
Cc: jejb@parisc-linux.org
Cc: linux@armlinux.org.uk
Cc: mpe@ellerman.id.au
Cc: ralf@linux-mips.org
Cc: realmz6@gmail.com
Cc: rkuo@codeaurora.org
Cc: rth@twiddle.net
Cc: schwidefsky@de.ibm.com
Cc: tony.luck@intel.com
Cc: vgupta@synopsys.com
Cc: ysato@users.sourceforge.jp
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14 11:55:15 +02:00
Andrea Gelmini
960cb306e6 KVM: S390: Fix typo
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-14 11:16:27 +02:00
Heiko Carstens
86d18a55dd s390/topology: remove z10 special handling
I don't have a z10 to test this anymore, so I have no idea if the code
works at all or even crashes. I can try to emulate, but it is just
guess work.

Nor do we know if the z10 special handling is performance wise still
better than the generic handling. There have been a lot of changes to
the scheduler.

Therefore let's play safe and remove the special handling.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:27 +02:00
Heiko Carstens
adac0f1e8c s390/topology: add drawer scheduling domain level
The z13 machine added a fourth level to the cpu topology
information. The new top level is called drawer.

A drawer contains two books, which used to be the top level.

Adding this additional scheduling domain did show performance
improvements for some workloads of up to 8%, while there don't
seem to be any workloads impacted in a negative way.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:27 +02:00
Heiko Carstens
0599eead58 s390/ipl: rename diagnose enums
Rename DIAG308_IPL and DIAG308_DUMP to DIAG308_LOAD_CLEAR and
DIAG308_LOAD_NORMAL_DUMP to better reflect the associated IPL
functions.

Suggested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:26 +02:00
Heiko Carstens
0f7451ff3a s390/ipl: use load normal for LPAR re-ipl
Avoid clearing memory for CCW-type re-ipl within a logical
partition. This can save a significant amount of time if a logical
partition contains a lot of memory.

On the other hand we still clear memory if running within a second
level hypervisor, since the hypervisor can simply free all memory that
was used for the guest.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:26 +02:00
Heiko Carstens
6c22c98637 s390: avoid extable collisions
We have some inline assemblies where the extable entry points to a
label at the end of an inline assembly which is not followed by an
instruction.

On the other hand we also have inline assemblies where the extable
entry points to the first instruction of an inline assembly.

If a first type inline asm (extable points to an empty label at the
end) were directly followed by a second type inline asm (extable points
to the first instruction) then we would have two different extable
entries that point to the same instruction but have different target
addresses.

This can lead to quite random behaviour, depending on sorting order.

I verified that we currently do not have such collisions within the
kernel. However, to avoid such subtle bugs, add a couple of nop
instructions to those inline assemblies which contain an extable entry
that points to an empty label.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:26 +02:00
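
The collision described in the extable commit above can be illustrated with a small user-space sketch. The struct layout and the helper name `extable_has_collision` are hypothetical and do not reflect the kernel's actual exception table format; the sketch only assumes a table that is sorted by the address of the faulting instruction.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified exception table entry (not the kernel layout). */
struct ex_entry {
	unsigned long insn;  /* address of the instruction that may fault */
	unsigned long fixup; /* address to continue at after a fault */
};

/* qsort comparator: order entries by faulting instruction address. */
static int cmp_insn(const void *a, const void *b)
{
	const struct ex_entry *x = a, *y = b;

	if (x->insn < y->insn)
		return -1;
	return x->insn > y->insn;
}

/*
 * Sort the table, then report whether two entries name the same
 * instruction address but different fixup targets. For such a pair,
 * which fixup a lookup finds depends on the sort order.
 */
static int extable_has_collision(struct ex_entry *tab, size_t n)
{
	size_t i;

	qsort(tab, n, sizeof(*tab), cmp_insn);
	for (i = 1; i < n; i++)
		if (tab[i - 1].insn == tab[i].insn &&
		    tab[i - 1].fixup != tab[i].fixup)
			return 1;
	return 0;
}
```

With a nop placed after the empty label, that label no longer shares an address with the first instruction of whatever inline assembly follows, so a pair like this cannot arise.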
Heiko Carstens
ee64baf4ea s390/uaccess: use __builtin_expect for get_user/put_user
We always expect that get_user and put_user return with zero. Give the
compiler a hint so it can slightly optimize the code and avoid
branches.
This is the same as what x86 got with commit a76cf66e948a ("x86/uaccess:
Tell the compiler that uaccess is unlikely to fault").

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:25 +02:00
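
As a rough illustration of the hint described in the uaccess commit above, the sketch below wraps a return code in `__builtin_expect` (a GCC/Clang builtin). The names `copy_word`, `fake_get_user` and `read_value` are hypothetical stand-ins for this example; the real get_user/put_user macros are more involved.

```c
#include <stddef.h>

/* Hypothetical stand-in for the copy itself: 0 on success, -1 on a bad pointer. */
static int copy_word(int *dst, const int *src)
{
	if (src == NULL)
		return -1;
	*dst = *src;
	return 0;
}

/*
 * Wrap the return code so the compiler assumes it is zero. The value is
 * passed through unchanged; only code layout and branch hints differ.
 */
#define fake_get_user(dst, src) __builtin_expect(copy_word((dst), (src)), 0)

/* Caller: the error branch is now treated as the unlikely path. */
static int read_value(int *out, const int *src)
{
	if (fake_get_user(out, src))
		return -1;
	return 0;
}
```

Placing the hint inside the macro means every caller benefits without having to annotate its own error check.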
Heiko Carstens
b8ac5e2f4d s390/uaccess: fix whitespace damage
Fix some whitespace damage that was introduced by me with a
query-replace when removing 31 bit support.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:25 +02:00
Sebastian Ott
8ee2db3cf1 s390/pci: ensure to not cross a dma segment boundary
When we use the iommu_area_alloc helper to get dma addresses
we specify the boundary_size parameter but not the offset (called
shift in this context).

As long as the offset (start_dma) is a multiple of the boundary
we're ok (on current machines start_dma always seems to be 4GB).

Don't leave this to chance and specify the offset for iommu_area_alloc.

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:24 +02:00
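
To see why the offset matters for the boundary check in the commit above, here is a minimal sketch. The function name `spans_boundary` is hypothetical, all values are counted in pages, and the arithmetic is a simplified model of the idea rather than the kernel's helper.

```c
#include <assert.h>

/*
 * Return nonzero if the range [index, index + nr) would cross a boundary
 * of boundary_size pages once the start offset of the area (shift) is
 * added. boundary_size must be a power of two.
 */
static int spans_boundary(unsigned long index, unsigned long nr,
			  unsigned long shift, unsigned long boundary_size)
{
	unsigned long pos;

	assert(boundary_size && (boundary_size & (boundary_size - 1)) == 0);
	/* Position of the first page relative to the enclosing boundary. */
	pos = (shift + index) & (boundary_size - 1);
	return pos + nr > boundary_size;
}
```

With a boundary of 16 pages, a 4-page range at index 6 is fine when the shift is 0 or 16, since both are multiples of the boundary. With a shift of 8 the same range crosses a boundary, and that case is only detected when the shift is passed in.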
Sebastian Ott
53b1bc9aba s390/pci: ensure page aligned dma start address
We don't have an architectural guarantee on the value of
the dma offset but rely on it to be at least page aligned.
Enforce page alignment of start_dma.

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:24 +02:00
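
Enforcing page alignment as in the commit above usually comes down to a single mask operation. The sketch below assumes a 4 KiB page size and rounds up. The commit message does not say whether the address is rounded up or down, so the direction is an assumption, as are the names `SKETCH_PAGE_SIZE` and `page_align_up`.

```c
/* Assumed page size for this sketch (4 KiB). */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round addr up to the next multiple of the page size. */
static unsigned long long page_align_up(unsigned long long addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```

An address that is already aligned, such as 4GB, is returned unchanged.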
Sebastian Ott
bb98f396f1 s390: use SPARSE_IRQ
Use dynamically allocated irq descriptors on s390 which allows
us to get rid of the s390 specific config option PCI_NR_MSI and
exploit more MSI interrupts. Also the size of the kernel image
is reduced by 131K (using performance_defconfig).

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:24 +02:00
Heiko Carstens
72a9b02d3b s390: use __section macro everywhere
Small cleanup patch to use the shorter __section macro everywhere.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:23 +02:00
Heiko Carstens
d07a980c1b s390: add proper __ro_after_init support
On s390 __ro_after_init is currently mapped to __read_mostly which
means that data marked as __ro_after_init will not be protected.

The reason for this is that the common code __ro_after_init implementation
is x86 centric: the ro_after_init data section was added to rodata,
since x86 enables write protection to kernel text and rodata very
late. On s390 we have write protection for these sections enabled with
the initial page tables. So adding the ro_after_init data section to
rodata does not work on s390.

In order to make __ro_after_init work properly on s390 move the
ro_after_init data right behind rodata. Unlike the rodata section it
will be marked read-only later, after all init calls have happened.

This s390 specific implementation adds new __start_ro_after_init and
__end_ro_after_init labels. Everything in between will be marked
read-only after the init calls have happened. In addition to the
__ro_after_init data, also move the exception table there, since from a
practical point of view it fits the __ro_after_init requirements.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:23 +02:00
Martin Schwidefsky
64f31d5802 s390/mm: simplify the TLB flushing code
ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
decide between a lazy TLB flush vs an immediate TLB flush. The field
contains two 16-bit counters, the number of CPUs that have the mm
attached and can create TLB entries for it and the number of CPUs in
the middle of a page table update.

The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
use the attach counter and a mask check with mm_cpumask(mm) to decide
between a local flush of the current CPU and a global flush.

For all these functions the decision between lazy vs immediate and
local vs global TLB flush can be based on CPU masks. There are two
masks: the mm->context.cpu_attach_mask with the CPUs that are actively
using the mm, and the mm_cpumask(mm) with the CPUs that have used the
mm since the last full flush. The decision between lazy vs immediate
flush is based on the mm->context.cpu_attach_mask, while the decision
between local vs global flush is based on the mm_cpumask(mm).

With this patch all checks use the CPU masks. The old counter
mm->context.attach_count with its two 16-bit values is turned into a
single counter mm->context.flush_count that keeps track of the number
of CPUs with incomplete page table updates. The sole user of this
counter is finish_arch_post_lock_switch() which waits for the end of
all page table updates.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:22 +02:00
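
The two mask-based decisions described in the TLB flushing commit above can be sketched with plain bitmasks. The struct and function names here are hypothetical, and the exact rule (a mask that contains only the current CPU permits the cheaper variant) is an assumption made for illustration; the commit message only states which mask each decision is based on.

```c
#include <stdint.h>

/* Hypothetical per-mm state; each bit in a mask stands for one CPU. */
struct mm_sketch {
	uint64_t cpu_attach_mask; /* CPUs actively using the mm */
	uint64_t cpumask;         /* CPUs that used the mm since the last full flush */
};

/* Mask with only the bit for the given CPU set. */
static uint64_t cpu_bit(unsigned int cpu)
{
	return UINT64_C(1) << cpu;
}

/* Assumed rule: a lazy flush is possible if no other CPU actively uses the mm. */
static int flush_can_be_lazy(const struct mm_sketch *mm, unsigned int cpu)
{
	return mm->cpu_attach_mask == cpu_bit(cpu);
}

/*
 * Assumed rule: a local flush suffices if no other CPU has used the mm
 * since the last full flush.
 */
static int flush_can_be_local(const struct mm_sketch *mm, unsigned int cpu)
{
	return mm->cpumask == cpu_bit(cpu);
}
```

In this model neither decision needs a counter. The remaining flush_count only has to track page table updates that are still in progress.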
Martin Schwidefsky
a9809407f6 s390/mm: fix vunmap vs finish_arch_post_lock_switch
The vunmap_pte_range() function calls ptep_get_and_clear() without any
locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
for the page table update. ptep_flush_direct requires that preemption
is disabled, but without any locking this is not the case. If the kernel
preempts the task while the attach_count is increased, an endless loop
in finish_arch_post_lock_switch() will occur the next time the task is
scheduled.

Add explicit preempt_disable()/preempt_enable() calls to the relevant
functions in arch/s390/mm/pgtable.c.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:21 +02:00