Commit Graph

9895 Commits

Author SHA1 Message Date
c98d2ecae0 s390/mm: Uncouple physical vs virtual address spaces
The uncoupling physical vs virtual address spaces brings
the following benefits to s390:

- virtual memory layout flexibility;
- closes the address gap between kernel and modules, it
  caused s390-only problems in the past (e.g. 'perf' bugs);
- allows getting rid of trampolines used for module calls
  into kernel;
- allows simplifying BPF trampoline;
- minor performance improvement in branch prediction;
- kernel randomization entropy is magnitude bigger, as it is
  derived from the amount of available virtual, not physical
  memory;

The whole change could be described in two pictures below:
before and after the change.

Some aspects of the virtual memory layout setup are not
clarified (number of page levels, alignment, DMA memory),
since these are not a part of this change or secondary
with regard to how the uncoupling itself is implemented.

The focus of the pictures is to explain why __va() and __pa()
macros are implemented the way they are.

        Memory layout in V==R mode:

|    Physical      |    Virtual       |
+- 0 --------------+- 0 --------------+ identity mapping start
|                  | S390_lowcore     | Low-address memory
|                  +- 8 KB -----------+
|                  |                  |
|                  | identity         | phys == virt
|                  | mapping          | virt == phys
|                  |                  |
+- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
|.amode31 text/data|.amode31 text/data|
+- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt start
|                  |                  |
|                  |                  |
+- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start
|                  |                  |
| kernel text/data | kernel text/data | phys == kvirt
|                  |                  |
+------------------+------------------+ kernel phys/virt end
|                  |                  |
|                  |                  |
|                  |                  |
|                  |                  |
+- ident_map_size -+- ident_map_size -+ identity mapping end
                   |                  |
                   |  ... unused gap  |
                   |                  |
                   +---- vmemmap -----+ 'struct page' array start
                   |                  |
                   | virtually mapped |
                   | memory map       |
                   |                  |
                   +- __abs_lowcore --+
                   |                  |
                   | Absolute Lowcore |
                   |                  |
                   +- __memcpy_real_area
                   |                  |
                   |  Real Memory Copy|
                   |                  |
                   +- VMALLOC_START --+ vmalloc area start
                   |                  |
                   |  vmalloc area    |
                   |                  |
                   +- MODULES_VADDR --+ modules area start
                   |                  |
                   |  modules area    |
                   |                  |
                   +------------------+ UltraVisor Secure Storage limit
                   |                  |
                   |  ... unused gap  |
                   |                  |
                   +KASAN_SHADOW_START+ KASAN shadow memory start
                   |                  |
                   |   KASAN shadow   |
                   |                  |
                   +------------------+ ASCE limit

        Memory layout in V!=R mode:

|    Physical      |    Virtual       |
+- 0 --------------+- 0 --------------+
|                  | S390_lowcore     | Low-address memory
|                  +- 8 KB -----------+
|                  |                  |
|                  |                  |
|                  | ... unused gap   |
|                  |                  |
+- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
|.amode31 text/data|.amode31 text/data|
+- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
|                  |                  |
|                  |                  |
+- __kaslr_offset_phys		     | kernel rand. phys start
|                  |                  |
| kernel text/data |                  |
|                  |                  |
+------------------+		     | kernel phys end
|                  |                  |
|                  |                  |
|                  |                  |
|                  |                  |
+- ident_map_size -+		     |
                   |                  |
                   |  ... unused gap  |
                   |                  |
                   +- __identity_base + identity mapping start (>= 2GB)
                   |                  |
                   | identity         | phys == virt - __identity_base
                   | mapping          | virt == phys + __identity_base
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   |                  |
                   +---- vmemmap -----+ 'struct page' array start
                   |                  |
                   | virtually mapped |
                   | memory map       |
                   |                  |
                   +- __abs_lowcore --+
                   |                  |
                   | Absolute Lowcore |
                   |                  |
                   +- __memcpy_real_area
                   |                  |
                   |  Real Memory Copy|
                   |                  |
                   +- VMALLOC_START --+ vmalloc area start
                   |                  |
                   |  vmalloc area    |
                   |                  |
                   +- MODULES_VADDR --+ modules area start
                   |                  |
                   |  modules area    |
                   |                  |
                   +- __kaslr_offset -+ kernel rand. virt start
                   |                  |
                   | kernel text/data | phys == (kvirt - __kaslr_offset) +
                   |                  |         __kaslr_offset_phys
                   +- kernel .bss end + kernel rand. virt end
                   |                  |
                   |  ... unused gap  |
                   |                  |
                   +------------------+ UltraVisor Secure Storage limit
                   |                  |
                   |  ... unused gap  |
                   |                  |
                   +KASAN_SHADOW_START+ KASAN shadow memory start
                   |                  |
                   |   KASAN shadow   |
                   |                  |
                   +------------------+ ASCE limit

Unused gaps in the virtual memory layout could be present
or not - depending on how partucular system is configured.
No page tables are created for the unused gaps.

The relative order of vmalloc, modules and kernel image in
virtual memory is defined by following considerations:

- start of the modules area and end of the kernel should reside
  within 4GB to accommodate relative 32-bit jumps. The best way
  to achieve that is to place kernel next to modules;

- vmalloc and module areas should locate next to each other
  to prevent failures and extra reworks in user level tools
  (makedumpfile, crash, etc.) which treat vmalloc and module
  addresses similarily;

- kernel needs to be the last area in the virtual memory
  layout to easily distinguish between kernel and non-kernel
  virtual addresses. That is needed to (again) simplify
  handling of addresses in user level tools and make __pa()
  macro faster (see below);

Concluding the above, the relative order of the considered
virtual areas in memory is: vmalloc - modules - kernel.
Therefore, the only change to the current memory layout is
moving kernel to the end of virtual address space.

With that approach the implementation of __pa() macro is
straightforward - all linear virtual addresses less than
kernel base are considered identity mapping:

	phys == virt - __identity_base

All addresses greater than kernel base are kernel ones:

	phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys

By contrast, __va() macro deals only with identity mapping
addresses:

	virt == phys + __identity_base

.amode31 section is mapped separately and is not covered by
__pa() macro. In fact, it could have been handled easily by
checking whether a virtual address is within the section or
not, but there is no need for that. Thus, let __pa() code
do as little machine cycles as possible.

The KASAN shadow memory is located at the very end of the
virtual memory layout, at addresses higher than the kernel.
However, that is not a linear mapping and no code other than
KASAN instrumentation or API is expected to access it.

When KASLR mode is enabled the kernel base address randomized
within a memory window that spans whole unused virtual address
space. The size of that window depends from the amount of
physical memory available to the system, the limit imposed by
UltraVisor (if present) and the vmalloc area size as provided
by vmalloc= kernel command line parameter.

In case the virtual memory is exhausted the minimum size of
the randomization window is forcefully set to 2GB, which
amounts to in 15 bits of entropy if KASAN is enabled or 17
bits of entropy in default configuration.

The default kernel offset 0x100000 is used as a magic value
both in the decompressor code and vmlinux linker script, but
it will be removed with a follow-up change.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
f4cac27dc0 s390/crash: Use old os_info to create PT_LOAD headers
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The vmcore ELF program headers describe virtual memory
regions of a crashed kernel. User level tools use that
information for the kernel text and data analysis (e.g
vmcore-dmesg extracts the kernel log).

Currently the kernel image is covered by program headers
describing the identity mapping regions. But in the future
the kernel image will be mapped into separate region outside
of the identity mapping. Create the additional ELF program
header that covers kernel image only, so that vmcore tools
could locate kernel text and data.

Further, the identity mapping in crashed and capture kernels
will have different base address. Due to that __va() macro
can not be used in the capture kernel. Instead, read crashed
kernel identity mapping base address from os_info and use
it for PT_LOAD type program headers creation.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
378e32aa81 s390/vmcoreinfo: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The virtual memory layout is needed for address translation
by crash tool when /proc/kcore device is used as the memory
image.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
8572f52518 s390/os_info: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The virtual memory layout will be read out by makedumpfile,
crash and other user tools for virtual address translation.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
88702793c5 s390/os_info: Introduce value entries
Introduce entries that do not reference any data in memory,
but rather provide values. Set the size of such entries to
zero and do not compute checksum for them, since there is no
data which integrity needs to be checked. The integrity of
the value entries itself is still covered by the os_info
checksum.

Reserve the lowest unused entry index OS_INFO_RESERVED for
future use - presumably for the number of entries present.
That could later be used by user level tools. The existing
tools would not notice any difference.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
5fb50fa66a s390/boot: Make .amode31 section address range explicit
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Introduce .amode31 section address range AMODE31_START
and AMODE31_END macros for later use.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
7de0446f0b s390/boot: Make identity mapping base address explicit
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Currently the identity mapping base address is implicit
and is always set to zero. Make it explicit by putting
into __identity_base persistent boot variable and use it
in proper context - which is the value of PAGE_OFFSET.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
3bb11234b1 s390/boot: Uncouple virtual and physical kernel offsets
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Currently __kaslr_offset is the kernel offset in both
physical memory on boot and in virtual memory after DAT
mode is enabled.

Uncouple these offsets and rename the physical address
space variant to __kaslr_offset_phys while keep the name
__kaslr_offset for the offset in virtual address space.

Do not use __kaslr_offset_phys after DAT mode is enabled
just yet, but still make it a persistent boot variable
for later use.

Use __kaslr_offset and __kaslr_offset_phys offsets in
proper contexts and alter handle_relocs() function to
distinguish between the two.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
236f324b74 s390/mm: Create virtual memory layout structure
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Put virtual memory layout information into a structure
to improve code generation when accessing the structure
members, which are currently only ident_map_size and
__kaslr_offset.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
bbe72f3902 s390/mm: Move KASLR related to <asm/page.h>
Move everyting KASLR related to <asm/page.h>,
similarly to many other architectures.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
c8aef260c8 s390/boot: Swap vmalloc and Lowcore/Real Memory Copy areas
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Currently the order of virtual memory areas is (the lowcore
and .amode31 section are skipped, as it is irrelevant):

	identity mapping (the kernel is contained within)
	vmemmap
	vmalloc
	modules
	Absolute Lowcore
	Real Memory Copy

In the future the kernel will be mapped separately and placed
to the end of the virtual address space, so the layout would
turn like this:

	identity mapping
	vmemmap
	vmalloc
	modules
	Absolute Lowcore
	Real Memory Copy
	kernel

However, the distance between kernel and modules needs to be as
little as possible, ideally - none. Thus, the Absolute Lowcore
and Real Memory Copy areas would stay in the way and therefore
need to be moved as well:

	identity mapping
	vmemmap
	Absolute Lowcore
	Real Memory Copy
	vmalloc
	modules
	kernel

To facilitate such layout swap the vmalloc and Absolute Lowcore
together with Real Memory Copy areas. As result, the current
layout turns into:

	identity mapping (the kernel is contained within)
	vmemmap
	Absolute Lowcore
	Real Memory Copy
	vmalloc
	modules

This will allow to locate the kernel directly next to the
modules once it gets mapped separately.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
ecf74da64d s390/boot: Reduce size of identity mapping on overlap
In case vmemmap array could overlap with vmalloc area on
virtual memory layout setup, the size of vmalloc area
is decreased. That could result in less memory than user
requested with vmalloc= kernel command line parameter.
Instead, reduce the size of identity mapping (and the
size of vmemmap array as result) to avoid such overlap.

Further, currently the virtual memmory allocation "rolls"
from top to bottom and it is only VMALLOC_START that could
get increased due to the overlap. Change that to decrease-
only, which makes the whole allocation algorithm more easy
to comprehend.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
b2b15f079c s390/boot: Consider DCSS segments on memory layout setup
The maximum mappable physical address (as returned by
arch_get_mappable_range() callback) is limited by the
value of (1UL << MAX_PHYSMEM_BITS).

The maximum physical address available to a DCSS segment
is 512GB.

In case the available online or offline memory size is less
than the DCSS limit arch_get_mappable_range() would include
never used [512GB..(1UL << MAX_PHYSMEM_BITS)] range.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
47bf817672 s390/boot: Do not force vmemmap to start at MAX_PHYSMEM_BITS
vmemmap is forcefully set to start at MAX_PHYSMEM_BITS at most.
That could be needed in the past to limit ident_map_size to
MAX_PHYSMEM_BITS. However since commit 75eba6ec0de1 ("s390:
unify identity mapping limits handling") ident_map_size is
limited in setup_ident_map_size() function, which is called
earlier.

Another reason to limit vmemmap start to MAX_PHYSMEM_BITS is
because it was returned by arch_get_mappable_range() as the
maximum mappable physical address. Since commit f641679dfe55
("s390/mm: rework arch_get_mappable_range() callback") that
is not required anymore.

As result, there is no neccessity to limit vmemmap starting
address with MAX_PHYSMEM_BITS.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
22fdd8ba61 KVM: s390: vsie: Use virt_to_phys for facility control block
In order for SIE to interpretively execute STFLE, it requires the real
or absolute address of a facility-list control block.
Before writing the location into the shadow SIE control block, convert
it from a virtual address.
We currently do not run into this bug because the lower 31 bits are the
same for virtual and physical addresses.

Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
036cbbafbd s390/irq,nmi: Include <asm/vtime.h> header directly
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but included
via <linux/vtime.h> header implicitly. Avoid the extra loop and
include <asm/vtime.h> header directly.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/3fb696637c0eb7e9d6ffd6cbf9e647d7c5986b3d.1712760275.git.agordeev@linux.ibm.com
2024-04-17 13:37:22 +02:00
60b8edba14 s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/b1055852eab0ffea33ad16c92d6a825c83037c3e.1712760275.git.agordeev@linux.ibm.com
2024-04-17 13:37:21 +02:00
94426ed213 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/unix/garbage.c
  47d8ac011f ("af_unix: Fix garbage collector racing against connect()")
  4090fa373f ("af_unix: Replace garbage collection algorithm.")

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  faa12ca245 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
  b3d0083caf ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")

drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  7ac10c7d72 ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
  194fad5b27 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  958f56e483 ("net/mlx5e: Un-expose functions in en.h")
  49e6c93870 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 14:23:47 -07:00
d35c34bb32 s390/mm: Convert gmap_make_secure to use a folio
Remove uses of deprecated page APIs, and move the check for large
folios to here to avoid taking the folio lock if the folio is too large.
We could do better here by attempting to split the large folio, but I'll
leave that improvement for someone who can test it.

Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240322161149.2327518-3-willy@infradead.org
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:57 +02:00
259e660d91 s390/mm: Convert make_page_secure to use a folio
These page APIs are deprecated, so convert the incoming page to a folio
and use the folio APIs instead.  The ultravisor API cannot handle large
folios, so return -EINVAL if one has slipped through.

Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240322161149.2327518-2-willy@infradead.org
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:57 +02:00
f10933cbd2 s390/cpum_cf: make crypto counters upward compatible across machine types
The CPU Measurement facility crypto counter set functionality
is defined by the Second Counter Version Number. This number
varies between machine types, but is upward compatible.
Lessen the checks to reflect this behavior.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:56 +02:00
4f00d4ef66 s390: adjust indentation of RELOCS command build step out
Common pattern in non-verbose build output for quiet commands is that the
shorthand of a command including whitespace contains at least eight
characters. Adjust this for the RELOCS command, which comes only with seven
characters.

Before:
  SORTTAB vmlinux
  CC      arch/s390/boot/version.o
  RELOCS arch/s390/boot/relocs.S
  OBJCOPY arch/s390/boot/info.bin

After:
  SORTTAB vmlinux
  CC      arch/s390/boot/version.o
  RELOCS  arch/s390/boot/relocs.S
  OBJCOPY arch/s390/boot/info.bin

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:56 +02:00
b3840c8bfc s390/ap: rename ap debug configuration option
The configuration option ZCRYPT_DEBUG is used only in ap queue code,
so rename it to AP_DEBUG. It also no longer depends on ZCRYPT but on
AP. While at it, also update the help text.

Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:56 +02:00
123760841a s390/ap: modularize ap bus
There is no hard requirement to have the ap bus statically in the
kernel, so add an option to compile it as module.

Cc: Tony Krowiak <akrowiak@linux.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:56 +02:00
2a483d333f s390/chsc: use notifier for AP configuration changes
The direct dependency of chsc and the AP bus prevents the
modularization of ap bus. Introduce a notifier interface for AP
changes, which decouples the producer of the change events (chsc) from
the consumer (ap_bus).

Remove the ap_cfg_chg() interface and replace it with the notifier
invocation. The ap bus module registers a notification handler, which
triggers the AP bus scan.

Cc: Vineeth Vijayan <vneethv@linux.ibm.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:55 +02:00
05272aa499 s390/uv: export prot_virt_guest symbol in uv
The inline function is_prot_virt_guest() in asm/uv.h makes use of the
prot_virt_guest symbol. As this inline function can be called by other
parts of the kernel (modules and built-in), the symbol should be
exported, similar to the prot_virt_host symbol.

One consumer of is_prot_virt_guest() will be the ap bus code.

Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:55 +02:00
8dec9cb9f5 s390/ap: use static qci information
Since qci is available on most of the current machines, move away from
the dynamic buffers for qci information and store it instead in a
statically defined buffer.

The new flags member in struct ap_config_info is now used as an
indicator, if qci is available in the system (at least one of these
bits is set).

Suggested-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-09 17:29:55 +02:00
c8e3a8b6f2 vdso: Consolidate vdso_calc_delta()
Consolidate vdso_calc_delta(), in preparation for further simplification.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-2-adrian.hunter@intel.com
2024-04-08 15:03:06 +02:00
378ca2d2ad s390/entry: align system call table on 8 bytes
Align system call table on 8 bytes. With sys_call_table entry size
of 8 bytes that eliminates the possibility of a system call pointer
crossing cache line boundary.

Cc: stable@kernel.org
Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-03 15:00:20 +02:00
e9f3af02f6 s390/pai: fix sampling event removal for PMU device driver
In case of a sampling event, the PAI PMU device drivers need a
reference to this event.  Currently to PMU device driver reference
is removed when a sampling event is destroyed. This may lead to
situations where the reference of the PMU device driver is removed
while being used by a different sampling event.
Reset the event reference pointer of the PMU device driver when
a sampling event is deleted and before the next one might be added.

Fixes: 39d62336f5 ("s390/pai: add support for cryptography counters")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-03 15:00:20 +02:00
c9c260681f s390/preempt: mark all functions __always_inline
preempt_count-related functions are quite ubiquitous and may be called
by noinstr ones, introducing unwanted instrumentation. Here is one
example call chain:

  irqentry_nmi_enter()  # noinstr
    lockdep_hardirqs_enabled()
      this_cpu_read()
        __pcpu_size_call_return()
          this_cpu_read_*()
            this_cpu_generic_read()
              __this_cpu_generic_read_nopreempt()
                preempt_disable_notrace()
                  __preempt_count_inc()
                    __preempt_count_add()

They are very small, so there are no significant downsides to
force-inlining them.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20240320230007.4782-3-iii@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-03 15:00:20 +02:00
01cac82ae0 s390/atomic: mark all functions __always_inline
Atomic functions are quite ubiquitous and may be called by noinstr
ones, introducing unwanted instrumentation. They are very small, so
there are no significant downsides to force-inlining them.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20240320230007.4782-2-iii@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-03 15:00:19 +02:00
e6ec07dc6d s390/mm: fix NULL pointer dereference
The recently added check to figure out if a fault happened on gmap ASCE
dereferences the gmap pointer in lowcore without checking that it is not
NULL. For all non-KVM processes the pointer is NULL, so that some value
from lowcore will be read. With the current layouts of struct gmap and
struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead
to any observable bug; at least currently.

Fix this by adding the missing NULL pointer check.

Fixes: 64c3431808 ("s390/entry: compare gmap asce to determine guest/host fault")
Acked-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-04-03 15:00:19 +02:00
29ce50e078 crypto: remove CONFIG_CRYPTO_STATS
Remove support for the "Crypto usage statistics" feature
(CONFIG_CRYPTO_STATS).  This feature does not appear to have ever been
used, and it is harmful because it significantly reduces performance and
is a large maintenance burden.

Covering each of these points in detail:

1. Feature is not being used

Since these generic crypto statistics are only readable using netlink,
it's fairly straightforward to look for programs that use them.  I'm
unable to find any evidence that any such programs exist.  For example,
Debian Code Search returns no hits except the kernel header and kernel
code itself and translations of the kernel header:
https://codesearch.debian.net/search?q=CRYPTOCFGA_STAT&literal=1&perpkg=1

The patch series that added this feature in 2018
(https://lore.kernel.org/linux-crypto/1537351855-16618-1-git-send-email-clabbe@baylibre.com/)
said "The goal is to have an ifconfig for crypto device."  This doesn't
appear to have happened.

It's not clear that there is real demand for crypto statistics.  Just
because the kernel provides other types of statistics such as I/O and
networking statistics and some people find those useful does not mean
that crypto statistics are useful too.

Further evidence that programs are not using CONFIG_CRYPTO_STATS is that
it was able to be disabled in RHEL and Fedora as a bug fix
(https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2947).

Even further evidence comes from the fact that there are and have been
bugs in how the stats work, but they were never reported.  For example,
before Linux v6.7 hash stats were double-counted in most cases.

There has also never been any documentation for this feature, so it
might be hard to use even if someone wanted to.

2. CONFIG_CRYPTO_STATS significantly reduces performance

Enabling CONFIG_CRYPTO_STATS significantly reduces the performance of
the crypto API, even if no program ever retrieves the statistics.  This
primarily affects systems with a large number of CPUs.  For example,
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039576 reported
that Lustre client encryption performance improved from 21.7GB/s to
48.2GB/s by disabling CONFIG_CRYPTO_STATS.

It can be argued that this means that CONFIG_CRYPTO_STATS should be
optimized with per-cpu counters similar to many of the networking
counters.  But no one has done this in 5+ years.  This is consistent
with the fact that the feature appears to be unused, so there seems to
be little interest in improving it as opposed to just disabling it.

It can be argued that because CONFIG_CRYPTO_STATS is off by default,
performance doesn't matter.  But Linux distros tend to error on the side
of enabling options.  The option is enabled in Ubuntu and Arch Linux,
and until recently was enabled in RHEL and Fedora (see above).  So, even
just having the option available is harmful to users.

3. CONFIG_CRYPTO_STATS is a large maintenance burden

There are over 1000 lines of code associated with CONFIG_CRYPTO_STATS,
spread among 32 files.  It significantly complicates much of the
implementation of the crypto API.  After the initial submission, many
fixes and refactorings have consumed effort of multiple people to keep
this feature "working".  We should be spending this effort elsewhere.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-02 10:49:38 +08:00
5e47fbe5ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-28 17:25:57 -07:00
2a702c2e57 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-03-25

We've added 38 non-merge commits during the last 13 day(s) which contain
a total of 50 files changed, 867 insertions(+), 274 deletions(-).

The main changes are:

1) Add the ability to specify and retrieve BPF cookie also for raw
   tracepoint programs in order to ease migration from classic to raw
   tracepoints, from Andrii Nakryiko.

2) Allow the use of bpf_get_{ns_,}current_pid_tgid() helper for all
   program types and add additional BPF selftests, from Yonghong Song.

3) Several improvements to bpftool and its build, for example, enabling
   libbpf logs when loading pid_iter in debug mode, from Quentin Monnet.

4) Check the return code of all BPF-related set_memory_*() functions during
   load and bail out in case they fail, from Christophe Leroy.

5) Avoid a goto in regs_refine_cond_op() such that the verifier can
   be better integrated into Agni tool which doesn't support backedges
   yet, from Harishankar Vishwanathan.

6) Add a small BPF trie perf improvement by always inlining
   longest_prefix_match, from Jesper Dangaard Brouer.

7) Small BPF selftest refactor in bpf_tcp_ca.c to utilize start_server()
   helper instead of open-coding it, from Geliang Tang.

8) Improve test_tc_tunnel.sh BPF selftest to prevent client connect
   before the server bind, from Alessandro Carminati.

9) Fix BPF selftest benchmark for older glibc and use syscall(SYS_gettid)
   instead of gettid(), from Alan Maguire.

10) Implement a backward-compatible method for struct_ops types with
    additional fields which are not present in older kernels,
    from Kui-Feng Lee.

11) Add a small helper to check if an instruction is addr_space_cast
    from as(0) to as(1) and utilize it in x86-64 JIT, from Puranjay Mohan.

12) Small cleanup to remove unnecessary error check in
    bpf_struct_ops_map_update_elem, from Martin KaFai Lau.

13) Improvements to libbpf fd validity checks for BPF map/programs,
    from Mykyta Yatsenko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (38 commits)
  selftests/bpf: Fix flaky test btf_map_in_map/lookup_update
  bpf: implement insn_is_cast_user() helper for JITs
  bpf: Avoid get_kernel_nofault() to fetch kprobe entry IP
  selftests/bpf: Use start_server in bpf_tcp_ca
  bpf: Sync uapi bpf.h to tools directory
  libbpf: Add new sec_def "sk_skb/verdict"
  selftests/bpf: Mark uprobe trigger functions with nocf_check attribute
  selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in bench
  bpf-next: Avoid goto in regs_refine_cond_op()
  bpftool: Clean up HOST_CFLAGS, HOST_LDFLAGS for bootstrap bpftool
  selftests/bpf: scale benchmark counting by using per-CPU counters
  bpftool: Remove unnecessary source files from bootstrap version
  bpftool: Enable libbpf logs when loading pid_iter in debug mode
  selftests/bpf: add raw_tp/tp_btf BPF cookie subtests
  libbpf: add support for BPF cookie for raw_tp/tp_btf programs
  bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs
  bpf: pass whole link instead of prog when triggering raw tracepoint
  bpf: flatten bpf_probe_register call chain
  selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.sh
  selftests/bpf: Add a sk_msg prog bpf_get_ns_current_pid_tgid() test
  ...
====================

Link: https://lore.kernel.org/r/20240325233940.7154-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-27 07:52:34 -07:00
37ccdf7f11 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2024-03-25

The following pull-request contains BPF updates for your *net* tree.

We've added 17 non-merge commits during the last 12 day(s) which contain
a total of 19 files changed, 184 insertions(+), 61 deletions(-).

The main changes are:

1) Fix an arm64 BPF JIT bug in BPF_LDX_MEMSX implementation's offset handling
   found via test_bpf module, from Puranjay Mohan.

2) Various fixups to the BPF arena code in particular in the BPF verifier and
   around BPF selftests to match latest corresponding LLVM implementation,
   from Puranjay Mohan and Alexei Starovoitov.

3) Fix xsk to not assume that metadata is always requested in TX completion,
   from Stanislav Fomichev.

4) Fix riscv BPF JIT's kfunc parameter incompatibility between BPF and the riscv
   ABI which requires sign-extension on int/uint, from Pu Lehui.

5) Fix s390x BPF JIT's bpf_plt pointer arithmetic which triggered a crash when
   testing struct_ops, from Ilya Leoshkevich.

6) Fix libbpf's arena mmap handling which had incorrect u64-to-pointer cast on
   32-bit architectures, from Andrii Nakryiko.

7) Fix libbpf to define MFD_CLOEXEC when not available, from Arnaldo Carvalho de Melo.

8) Fix arm64 BPF JIT implementation for 32bit unconditional bswap which
   resulted in an incorrect swap as indicated by test_bpf, from Artem Savkov.

9) Fix BPF man page build script to use silent mode, from Hangbin Liu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi
  bpf: verifier: reject addr_space_cast insn without arena
  selftests/bpf: verifier_arena: fix mmap address for arm64
  bpf: verifier: fix addr_space_cast from as(1) to as(0)
  libbpf: Define MFD_CLOEXEC if not available
  arm64: bpf: fix 32bit unconditional bswap
  bpf, arm64: fix bug in BPF_LDX_MEMSX
  libbpf: fix u64-to-pointer cast on 32-bit arches
  s390/bpf: Fix bpf_plt pointer arithmetic
  xsk: Don't assume metadata is always requested in TX completion
  selftests/bpf: Add arena test case for 4Gbyte corner case
  selftests/bpf: Remove hard coded PAGE_SIZE macro.
  libbpf, selftests/bpf: Adjust libbpf, bpftool, selftests to match LLVM
  bpf: Clarify bpf_arena comments.
  MAINTAINERS: Update email address for Quentin Monnet
  scripts/bpf_doc: Use silent mode when exec make cmd
  bpf: Temporarily disable atomic operations in BPF arena
====================

Link: https://lore.kernel.org/r/20240325213520.26688-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-26 12:55:18 +01:00
7ded842b35 s390/bpf: Fix bpf_plt pointer arithmetic
Kui-Feng Lee reported a crash on s390x triggered by the
dummy_st_ops/dummy_init_ptr_arg test [1]:

  [<0000000000000002>] 0x2
  [<00000000009d5cde>] bpf_struct_ops_test_run+0x156/0x250
  [<000000000033145a>] __sys_bpf+0xa1a/0xd00
  [<00000000003319dc>] __s390x_sys_bpf+0x44/0x50
  [<0000000000c4382c>] __do_syscall+0x244/0x300
  [<0000000000c59a40>] system_call+0x70/0x98

This is caused by GCC moving memcpy() after assignments in
bpf_jit_plt(), resulting in NULL pointers being written instead of
the return and the target addresses.

Looking at the GCC internals, the reordering is allowed because the
alias analysis thinks that the memcpy() destination and the assignments'
left-hand-sides are based on different objects: new_plt and
bpf_plt_ret/bpf_plt_target respectively, and therefore they cannot
alias.

This is in turn due to a violation of the C standard:

  When two pointers are subtracted, both shall point to elements of the
  same array object, or one past the last element of the array object
  ...

From the C's perspective, bpf_plt_ret and bpf_plt are distinct objects
and cannot be subtracted. In the practical terms, doing so confuses the
GCC's alias analysis.

The code was written this way in order to let the C side know a few
offsets defined in the assembly. While nice, this is by no means
necessary. Fix the noncompliance by hardcoding these offsets.

[1] https://lore.kernel.org/bpf/c9923c1d-971d-4022-8dc8-1364e929d34c@gmail.com/

Fixes: f1d5df84cd ("s390/bpf: Implement bpf_arch_text_poke()")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Message-ID: <20240320015515.11883-1-iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 22:52:43 -07:00
f9c035492f Merge tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:

 - Various virtual vs physical address usage fixes

 - Add new bitwise types and helper functions and use them in s390
   specific drivers and code to make it easier to find virtual vs
   physical address usage bugs.

   Right now virtual and physical addresses are identical for s390,
   except for module, vmalloc, and similar areas. This will be changed,
   hopefully with the next merge window, so that e.g. the kernel image
   and modules will be located close to each other, allowing for direct
   branches and also for some other simplifications.

   As a prerequisite this requires to fix all misuses of virtual and
   physical addresses. As it turned out people are so used to the
   concept that virtual and physical addresses are the same, that new
   bugs got added to code which was already fixed. In order to avoid
   that even more code gets merged which adds such bugs add and use new
   bitwise types, so that sparse can be used to find such usage bugs.

   Most likely the new types can go away again after some time

 - Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation

 - Fix kprobe branch handling: if an out-of-line single stepped relative
   branch instruction has a target address within a certain address area
   in the entry code, the program check handler may incorrectly execute
   cleanup code as if KVM code was executed, leading to crashes

 - Fix reference counting of zcrypt card objects

 - Various other small fixes and cleanups

* tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
  s390/entry: compare gmap asce to determine guest/host fault
  s390/entry: remove OUTSIDE macro
  s390/entry: add CIF_SIE flag and remove sie64a() address check
  s390/cio: use while (i--) pattern to clean up
  s390/raw3270: make class3270 constant
  s390/raw3270: improve raw3270_init() readability
  s390/tape: make tape_class constant
  s390/vmlogrdr: make vmlogrdr_class constant
  s390/vmur: make vmur_class constant
  s390/zcrypt: make zcrypt_class constant
  s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support
  s390/vfio_ccw_cp: use new address translation helpers
  s390/iucv: use new address translation helpers
  s390/ctcm: use new address translation helpers
  s390/lcs: use new address translation helpers
  s390/qeth: use new address translation helpers
  s390/zfcp: use new address translation helpers
  s390/tape: fix virtual vs physical address confusion
  s390/3270: use new address translation helpers
  s390/3215: use new address translation helpers
  ...
2024-03-19 11:38:27 -07:00
64c3431808 s390/entry: compare gmap asce to determine guest/host fault
With the current implementation, there are some cornercases where
a host fault would be treated as a guest fault, for example
when the sie instruction causes a program check. Therefore store
the gmap asce in ptregs, and use that to compare the primary asce
from the fault instead of matching instruction addresses.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-17 19:08:50 +01:00
29e5bc0f02 s390/entry: remove OUTSIDE macro
With only one OUTSIDE user left, remove the macro and move the code
directly to the machine check handler. This has the advantage that
it is much easier to determine which registers are used.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-17 19:08:49 +01:00
c239c83ed5 s390/entry: add CIF_SIE flag and remove sie64a() address check
When a program check, interrupt or machine check is triggered, the
PSW address is compared to a certain range of the sie64a() function
to figure out whether SIE was interrupted and a cleanup of SIE is
needed.

This doesn't work with kprobes: If kprobes probes an instruction, it
copies the instruction to the kprobes instruction page and overwrites the
original instruction with an undefind instruction (Opcode 00). When this
instruction is hit later, kprobes single-steps the instruction on the
kprobes_instruction page.

However, if this instruction is a relative branch instruction it will now
point to a different location in memory due to being moved to the kprobes
instruction page. If the new branch target points into sie64a() the kernel
assumes it interrupted SIE when processing the breakpoint and will crash
trying to access the SIE control block.

Instead of comparing the address, introduce a new CIF_SIE flag which
indicates whether SIE was interrupted.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-17 19:08:49 +01:00
4f712ee0cb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
e60adf5132 bpf: Take return from set_memory_rox() into account with bpf_jit_binary_lock_ro()
set_memory_rox() can fail, leaving memory unprotected.

Check return and bail out when bpf_jit_binary_lock_ro() returns
an error.

Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linux-hardening@vger.kernel.org <linux-hardening@vger.kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>  # s390x
Acked-by: Tiezhu Yang <yangtiezhu@loongson.cn>  # LoongArch
Reviewed-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> # MIPS Part
Message-ID: <036b6393f23a2032ce75a1c92220b2afcb798d5d.1709850515.git.christophe.leroy@csgroup.eu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-14 19:28:52 -07:00
e5eb28f6d1 Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
   heap optimizations".

 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".

 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits. The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".

 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".

 - Ryusuke Konishi continues nilfs2 maintenance work in the series

	"nilfs2: eliminate kmap and kmap_atomic calls"
	"nilfs2: fix kernel bug at submit_bh_wbc()"

 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".

 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".

 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".

Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.

* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  nilfs2: prevent kernel bug at submit_bh_wbc()
  nilfs2: fix failure to detect DAT corruption in btree and direct mappings
  ocfs2: enable ocfs2_listxattr for special files
  ocfs2: remove SLAB_MEM_SPREAD flag usage
  assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
  buildid: use kmap_local_page()
  watchdog/core: remove sysctl handlers from public header
  nilfs2: use div64_ul() instead of do_div()
  mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
  kexec: copy only happens before uchunk goes to zero
  get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
  get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
  get_signal: don't abuse ksig->info.si_signo and ksig->sig
  const_structs.checkpatch: add device_type
  Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
  dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
  list: leverage list_is_head() for list_entry_is_head()
  nilfs2: MAINTAINERS: drop unreachable project mirror site
  smp: make __smp_processor_id() 0-argument macro
  fat: fix uninitialized field in nostale filehandles
  ...
2024-03-14 18:03:09 -07:00
902861e34c Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
17193ced2d Merge tag 'kvm-s390-next-6.9-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- Memop selftest rotate fix
- SCLP event bits over indication fix
- Missing virt_to_phys for the CRYCB fix
2024-03-14 14:47:56 -04:00
5f58bde726 s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support
Provide a very simple ARCH_HAS_DEBUG_VIRTUAL implementation.
For now errors are only reported for the following cases:

- Trying to translate a vmalloc or module address to a physical address

- Translating a supposed to be ZONE_DMA virtual address into a physical
  address, and the resulting physical address is larger than two GiB

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-13 09:23:49 +01:00
e1f51be68d s390/cio,idal: fix virtual vs physical address confusion
Fix virtual vs physical address confusion. This does not fix a bug since
virtual and physical address spaces are currently the same.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-13 09:23:47 +01:00
0c9580cebb s390/cio,idal: remove superfluous virt_to_phys() conversion
Only the last 12 bits of virtual / physical addresses are used when masking
with IDA_BLOCK_SIZE - 1. Given that the bits are the same regardless of
virtual or physical address, remove the virtual to physical address
conversion.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-03-13 09:23:47 +01:00