2019-06-03 07:44:46 +02:00
// SPDX-License-Identifier: GPL-2.0-only
2015-09-09 15:38:55 -07:00
/*
 * kexec.c - kexec system call core code.
 * Copyright (C) 2002-2004 Eric Biederman <ebiederm@xmission.com>
 */
kexec: use file name as the output message prefix
The kexec output messages lost the "kexec" prefix when Dave Young split the
kexec code. Use the file name as the output message prefix instead.
Current format of the output messages:
[ 140.290795] SYSC_kexec_load: hello, world
[ 140.291534] kexec: sanity_check_segment_list: hello, world
Desired format of the output messages:
[ 30.791503] kexec: SYSC_kexec_load, Hello, world
[ 79.182752] kexec_core: sanity_check_segment_list, Hello, world
Remove the custom "kexec" prefix from the output messages.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
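The prefixing described in this commit message relies on string-literal concatenation in the preprocessor. The following is a minimal user-space sketch of that mechanism, not kernel code: MODNAME is a hypothetical stand-in for KBUILD_MODNAME (which the kernel build system supplies), and format_msg() is a hypothetical helper that plays the role of a pr_*() call. It uses the GNU `##__VA_ARGS__` extension, as the kernel itself does.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for KBUILD_MODNAME, which kbuild normally defines. */
#define MODNAME "kexec_core"

/* Same shape as the pr_fmt() definition in this file: prepend "<name>: ". */
#define pr_fmt(fmt) MODNAME ": " fmt

/*
 * Hypothetical helper: format a message into buf the way a pr_*() call
 * would after pr_fmt() has been applied to its format string.
 */
#define format_msg(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix is glued onto the format string at compile time, every message from the file gets the same prefix without each call site spelling it out.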
2015-11-06 16:32:45 -08:00
#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
2015-09-09 15:38:55 -07:00
2023-02-01 11:30:15 -06:00
#include <linux/btf.h>
2015-09-09 15:38:55 -07:00
#include <linux/capability.h>
#include <linux/mm.h>
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/kexec.h>
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/highmem.h>
#include <linux/syscalls.h>
#include <linux/reboot.h>
#include <linux/ioport.h>
#include <linux/hardirq.h>
#include <linux/elf.h>
#include <linux/elfcore.h>
#include <linux/utsname.h>
#include <linux/numa.h>
#include <linux/suspend.h>
#include <linux/device.h>
#include <linux/freezer.h>
2021-06-30 18:54:59 -07:00
#include <linux/panic_notifier.h>
2015-09-09 15:38:55 -07:00
#include <linux/pm.h>
#include <linux/cpu.h>
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/console.h>
#include <linux/vmalloc.h>
#include <linux/swap.h>
#include <linux/syscore_ops.h>
#include <linux/compiler.h>
#include <linux/hugetlb.h>
2020-09-04 16:30:25 +01:00
#include <linux/objtool.h>
2021-05-06 18:04:41 -07:00
#include <linux/kmsg_dump.h>
2015-09-09 15:38:55 -07:00
#include <asm/page.h>
#include <asm/sections.h>
#include <crypto/hash.h>
#include "kexec_internal.h"
2022-06-30 23:32:58 +01:00
atomic_t __kexec_lock = ATOMIC_INIT(0);
2015-09-09 15:38:55 -07:00
/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;
kexec_file: add kexec_file flag to control debug printing
Patch series "kexec_file: print out debugging message if required", v4.
Currently, specifying '-d' on the kexec command line prints a lot of debugging
information about kexec/kdump loading with the kexec_load interface.
However, kexec_file_load prints nothing even when '-d' is specified.
This makes it very inconvenient to debug or analyze kexec/kdump loading when
something goes wrong with kexec/kdump itself, or when a developer wants to
check the kexec/kdump loading.
In this patchset, a kexec_file flag, KEXEC_FILE_DEBUG, is added and checked
in the code. If it is passed in, the debugging messages of the kexec_file code
are printed and can be seen on the console and in dmesg. Otherwise, the
debugging messages are printed as before, via pr_debug().
Note:
****
=====
1) The code in the kexec-tools utility also needs to be changed to support
passing KEXEC_FILE_DEBUG to the kernel when 'kexec -s -d' is specified.
The patch link is here:
=========
[PATCH] kexec_file: add kexec_file flag to support debug printing
http://lists.infradead.org/pipermail/kexec/2023-November/028505.html
2) s390 also has kexec_file code, but I am not sure what debugging
information is necessary there, so that is left to the s390 developers.
Test:
****
====
Testing was done in v1 on x86_64 and arm64. For v4, tested on x86_64
again. And on x86_64, the printed messages look like below:
--------------------------------------------------------------
kexec measurement buffer for the loaded kernel at 0x207fffe000.
Loaded purgatory at 0x207fff9000
Loaded boot_param, command line and misc at 0x207fff3000 bufsz=0x1180 memsz=0x1180
Loaded 64bit kernel at 0x207c000000 bufsz=0xc88200 memsz=0x3c4a000
Loaded initrd at 0x2079e79000 bufsz=0x2186280 memsz=0x2186280
Final command line is: root=/dev/mapper/fedora_intel--knightslanding--lb--02-root ro
rd.lvm.lv=fedora_intel-knightslanding-lb-02/root console=ttyS0,115200N81 crashkernel=256M
E820 memmap:
0000000000000000-000000000009a3ff (1)
000000000009a400-000000000009ffff (2)
00000000000e0000-00000000000fffff (2)
0000000000100000-000000006ff83fff (1)
000000006ff84000-000000007ac50fff (2)
......
000000207fff6150-000000207fff615f (128)
000000207fff6160-000000207fff714f (1)
000000207fff7150-000000207fff715f (128)
000000207fff7160-000000207fff814f (1)
000000207fff8150-000000207fff815f (128)
000000207fff8160-000000207fffffff (1)
nr_segments = 5
segment[0]: buf=0x000000004e5ece74 bufsz=0x211 mem=0x207fffe000 memsz=0x1000
segment[1]: buf=0x000000009e871498 bufsz=0x4000 mem=0x207fff9000 memsz=0x5000
segment[2]: buf=0x00000000d879f1fe bufsz=0x1180 mem=0x207fff3000 memsz=0x2000
segment[3]: buf=0x000000001101cd86 bufsz=0xc88200 mem=0x207c000000 memsz=0x3c4a000
segment[4]: buf=0x00000000c6e38ac7 bufsz=0x2186280 mem=0x2079e79000 memsz=0x2187000
kexec_file_load: type:0, start:0x207fff91a0 head:0x109e004002 flags:0x8
---------------------------------------------------------------------------
This patch (of 7):
When specifying 'kexec -c -d', the kexec_load interface prints loading
information, e.g. the regions where the kernel/initrd/purgatory/cmdline are
placed, the memmap passed to the 2nd kernel as system RAM ranges, and
all contents of struct kexec_segment. These are very helpful for
analyzing or locating what happened when kexec/kdump itself fails.
The debug printing for the kexec_load interface is done in the
user space utility kexec-tools.
With the kexec_file_load interface, however, 'kexec -s -d' prints nothing,
because the kexec_file code is mostly implemented in kernel space and the
debug printing functionality is missing. This is inconvenient when
debugging kexec/kdump loading and jumping with the kexec_file_load interface.
Now add KEXEC_FILE_DEBUG to the kexec_file flags to control the debugging
message printing, and add the global variable kexec_file_dbg_print and the
macro kexec_dprintk() to facilitate the printing.
This is a preparation; later, kexec_dprintk() will be used to replace the
existing pr_debug() calls. Once 'kexec -s -d' is specified, it will print out
kexec/kdump loading information. If '-d' is not specified, it falls back
to pr_debug().
Link: https://lkml.kernel.org/r/20231213055747.61826-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231213055747.61826-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
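The flag-controlled printing described in this commit message can be modelled in user space roughly as follows. This is only a sketch of the idea, not the kernel's kexec_dprintk() definition: dbg_print stands in for kexec_file_dbg_print, and model_dprintk(), info_count and debug_count are hypothetical names introduced for illustration. The "debug" path is modelled as a counter because pr_debug() output is normally suppressed.

```c
#include <stdbool.h>
#include <stdio.h>

/* Models kexec_file_dbg_print: set when the debug flag is passed in. */
static bool dbg_print;

/* Hypothetical counters so the chosen path can be observed. */
static int info_count;
static int debug_count;

/*
 * Hypothetical model of a flag-controlled debug print: when the flag is
 * set the message is always emitted (info level); otherwise it takes the
 * pr_debug()-like path, which is silent here.
 */
#define model_dprintk(fmt, ...)					\
	do {							\
		if (dbg_print) {				\
			info_count++;				\
			fprintf(stderr, fmt, ##__VA_ARGS__);	\
		} else {					\
			debug_count++;				\
		}						\
	} while (0)
```

The do/while (0) wrapper keeps the macro safe to use as a single statement, for example in an unbraced if/else.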
2023-12-13 13:57:41 +08:00
bool kexec_file_dbg_print;
2015-09-09 15:38:55 -07:00
/*
 * When kexec transitions to the new kernel there is a one-to-one
 * mapping between physical and virtual addresses.  On processors
 * where you can disable the MMU this is trivial, and easy.  For
 * others it is still a simple predictable page table to setup.
 *
 * In that environment kexec copies the new kernel to its final
 * resting place.  This means I can only support memory whose
 * physical address can fit in an unsigned long.  In particular
 * addresses where (pfn << PAGE_SHIFT) > ULONG_MAX cannot be handled.
 * If the assembly stub has more restrictive requirements
 * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DEST_MEMORY_LIMIT can be
 * defined more restrictively in <asm/kexec.h>.
 *
 * The code for the transition from the current kernel to the
2020-10-15 20:10:28 -07:00
 * new kernel is placed in the control_code_buffer, whose size
2015-09-09 15:38:55 -07:00
 * is given by KEXEC_CONTROL_PAGE_SIZE.  In the best case only a single
 * page of memory is necessary, but some architectures require more.
 * Because this memory must be identity mapped in the transition from
 * virtual to physical addresses it must live in the range
 * 0 - TASK_SIZE, as only the user space mappings are arbitrarily
 * modifiable.
 *
 * The assembly stub in the control code buffer is passed a linked list
 * of descriptor pages detailing the source pages of the new kernel,
 * and the destination addresses of those source pages.  As this data
 * structure is not used in the context of the current OS, it must
 * be self-contained.
 *
 * The code has been made to work with highmem pages and will use a
 * destination page in its final resting place (if it happens
 * to allocate it).  The end product of this is that most of the
 * physical address space, and most of RAM can be used.
 *
 * Future directions include:
 *  - allocating a page table with the control code buffer identity
 *    mapped, to simplify machine_kexec and make kexec_on_panic more
 *    reliable.
 */
/*
 * KIMAGE_NO_DEST is an impossible destination address..., for
 * allocating pages whose destination address we do not care about.
 */
#define KIMAGE_NO_DEST (-1UL)
2016-08-02 14:06:22 -07:00
#define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
2015-09-09 15:38:55 -07:00
static struct page *kimage_alloc_page(struct kimage *image,
				      gfp_t gfp_mask,
				      unsigned long dest);
int sanity_check_segment_list(struct kimage *image)
{
2016-08-02 14:05:45 -07:00
	int i;
2015-09-09 15:38:55 -07:00
	unsigned long nr_segments = image->nr_segments;
2016-08-02 14:06:22 -07:00
	unsigned long total_pages = 0;
2018-12-28 00:34:29 -08:00
	unsigned long nr_pages = totalram_pages();
2015-09-09 15:38:55 -07:00
	/*
	 * Verify we have good destination addresses.  The caller is
	 * responsible for making certain we don't attempt to load
	 * the new image into invalid or reserved areas of RAM.  This
	 * just verifies it is an address we can use.
	 *
	 * Since the kernel does everything in page size chunks ensure
	 * the destination addresses are page aligned.  Too many
	 * special cases crop up when we don't do this.  The most
	 * insidious is getting overlapping destination addresses
	 * simply because addresses are changed to page size
	 * granularity.
	 */
	for (i = 0; i < nr_segments; i++) {
		unsigned long mstart, mend;
		mstart = image->segment[i].mem;
		mend = mstart + image->segment[i].memsz;
2016-08-02 14:05:57 -07:00
		if (mstart > mend)
			return -EADDRNOTAVAIL;
2015-09-09 15:38:55 -07:00
		if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
2016-08-02 14:05:45 -07:00
			return -EADDRNOTAVAIL;
2015-09-09 15:38:55 -07:00
		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
2016-08-02 14:05:45 -07:00
			return -EADDRNOTAVAIL;
2015-09-09 15:38:55 -07:00
	}
	/* Verify our destination addresses do not overlap.
	 * If we allowed overlapping destination addresses
	 * through, very weird things can happen with no
	 * easy explanation as one segment stomps on another.
	 */
	for (i = 0; i < nr_segments; i++) {
		unsigned long mstart, mend;
		unsigned long j;
		mstart = image->segment[i].mem;
		mend = mstart + image->segment[i].memsz;
		for (j = 0; j < i; j++) {
			unsigned long pstart, pend;
			pstart = image->segment[j].mem;
			pend = pstart + image->segment[j].memsz;
			/* Do the segments overlap ? */
			if ((mend > pstart) && (mstart < pend))
2016-08-02 14:05:45 -07:00
				return -EINVAL;
2015-09-09 15:38:55 -07:00
		}
	}
	/* Ensure our buffer sizes are no larger than
	 * our memory sizes.  This should always be the case,
	 * and it is easier to check up front than to be surprised
	 * later on.
	 */
	for (i = 0; i < nr_segments; i++) {
		if (image->segment[i].bufsz > image->segment[i].memsz)
2016-08-02 14:05:45 -07:00
			return -EINVAL;
2015-09-09 15:38:55 -07:00
	}
2016-08-02 14:06:22 -07:00
	/*
	 * Verify that no more than half of memory will be consumed. If the
	 * request from userspace is too large, a large amount of time will be
	 * wasted allocating pages, which can cause a soft lockup.
	 */
	for (i = 0; i < nr_segments; i++) {
2018-12-28 00:34:20 -08:00
		if (PAGE_COUNT(image->segment[i].memsz) > nr_pages / 2)
2016-08-02 14:06:22 -07:00
			return -EINVAL;
		total_pages += PAGE_COUNT(image->segment[i].memsz);
	}
2018-12-28 00:34:20 -08:00
	if (total_pages > nr_pages / 2)
2016-08-02 14:06:22 -07:00
		return -EINVAL;
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
2015-09-09 15:38:55 -07:00
	/*
	 * Verify we have good destination addresses.  Normally
	 * the caller is responsible for making certain we don't
	 * attempt to load the new image into invalid or reserved
	 * areas of RAM.  But crash kernels are preloaded into a
	 * reserved area of ram.  We must ensure the addresses
	 * are in the reserved area otherwise preloading the
	 * kernel could corrupt things.
	 */
	if (image->type == KEXEC_TYPE_CRASH) {
		for (i = 0; i < nr_segments; i++) {
			unsigned long mstart, mend;
			mstart = image->segment[i].mem;
			mend = mstart + image->segment[i].memsz - 1;
			/* Ensure we are within the crash kernel limits */
2016-08-02 14:06:04 -07:00
			if ((mstart < phys_to_boot_phys(crashk_res.start)) ||
			    (mend > phys_to_boot_phys(crashk_res.end)))
2016-08-02 14:05:45 -07:00
				return -EADDRNOTAVAIL;
2015-09-09 15:38:55 -07:00
		}
	}
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
	return 0;
}
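The alignment and overlap checks in sanity_check_segment_list() can be tried out in isolation with a small user-space sketch. It is a simplified model under stated assumptions: a 4 KiB page size is assumed, struct seg is a hypothetical stand-in for the kernel's segment descriptor, and seg_is_page_aligned() and segs_overlap() are hypothetical helper names. Only PAGE_COUNT() mirrors a macro from this file.

```c
#include <stdbool.h>

#define PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
/* Same rounding as PAGE_COUNT() in this file: bytes to whole pages. */
#define PAGE_COUNT(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* Hypothetical, simplified stand-in for a kexec segment. */
struct seg {
	unsigned long mem;	/* destination start address */
	unsigned long memsz;	/* destination size in bytes */
};

/* Both ends of [mem, mem + memsz) must sit on a page boundary. */
static bool seg_is_page_aligned(const struct seg *s)
{
	unsigned long mend = s->mem + s->memsz;

	return !(s->mem & ~PAGE_MASK) && !(mend & ~PAGE_MASK);
}

/* Half-open ranges overlap when each one starts before the other ends. */
static bool segs_overlap(const struct seg *a, const struct seg *b)
{
	return (a->mem + a->memsz > b->mem) && (a->mem < b->mem + b->memsz);
}
```

Note that these checks treat the end address as exclusive (mend = mstart + memsz), whereas the crash-kernel range check and kimage_is_destination_range() use an inclusive end (memsz - 1), so the comparison operators differ accordingly.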
struct kimage *do_kimage_alloc_init(void)
{
	struct kimage *image;
	/* Allocate a controlling structure */
	image = kzalloc(sizeof(*image), GFP_KERNEL);
	if (!image)
		return NULL;
	image->head = 0;
	image->entry = &image->head;
	image->last_entry = &image->head;
	image->control_page = ~0; /* By default this does not apply */
	image->type = KEXEC_TYPE_DEFAULT;
	/* Initialize the list of control pages */
	INIT_LIST_HEAD(&image->control_pages);
	/* Initialize the list of destination pages */
	INIT_LIST_HEAD(&image->dest_pages);
	/* Initialize the list of unusable pages */
	INIT_LIST_HEAD(&image->unusable_pages);
crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into the
vmcore.
To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to other
cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN.
CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a
new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state
in the PREPARE group, just prior to the STARTING group, which is very
close to the CPU starting up in a plug/online situation, or stopping in an
unplug/offline situation. This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the elfcorehdr
would be inaccurate. Note that for a CPU being unplugged or offlined, the
CPU will still be present in the list of CPUs generated by
crash_prepare_elf64_headers(). However, there is no need to explicitly
omit the CPU, see justification in 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()'.
To track memory changes, a notifier is registered to capture the memblock
MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the process,
the kexec_lock is held.
Link: https://lkml.kernel.org/r/20230814214446.6659-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-14 17:44:40 -04:00
#ifdef CONFIG_CRASH_HOTPLUG
	image->hp_action = KEXEC_CRASH_HP_NONE;
	image->elfcorehdr_index = -1;
	image->elfcorehdr_updated = false;
#endif
2015-09-09 15:38:55 -07:00
	return image;
}
int kimage_is_destination_range(struct kimage *image,
				unsigned long start,
				unsigned long end)
{
	unsigned long i;
	for (i = 0; i < image->nr_segments; i++) {
		unsigned long mstart, mend;
		mstart = image->segment[i].mem;
2023-12-17 11:35:26 +08:00
		mend = mstart + image->segment[i].memsz - 1;
		if ((end >= mstart) && (start <= mend))
2015-09-09 15:38:55 -07:00
			return 1;
	}
	return 0;
}
static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
{
	struct page *pages;
2019-09-25 16:47:33 -07:00
	if (fatal_signal_pending(current))
		return NULL;
2017-07-17 16:10:28 -05:00
	pages = alloc_pages(gfp_mask & ~__GFP_ZERO, order);
2015-09-09 15:38:55 -07:00
	if (pages) {
		unsigned int count, i;
		pages->mapping = NULL;
		set_page_private(pages, order);
		count = 1 << order;
		for (i = 0; i < count; i++)
			SetPageReserved(pages + i);
2017-07-17 16:10:28 -05:00
		arch_kexec_post_alloc_pages(page_address(pages), count,
					    gfp_mask);
		if (gfp_mask & __GFP_ZERO)
			for (i = 0; i < count; i++)
				clear_highpage(pages + i);
2015-09-09 15:38:55 -07:00
	}
	return pages;
}
static void kimage_free_pages(struct page *page)
{
	unsigned int order, count, i;
	order = page_private(page);
	count = 1 << order;
2017-07-17 16:10:28 -05:00
	arch_kexec_pre_free_pages(page_address(page), count);
2015-09-09 15:38:55 -07:00
	for (i = 0; i < count; i++)
		ClearPageReserved(page + i);
	__free_pages(page, order);
}
void kimage_free_page_list(struct list_head *list)
{
2016-01-20 15:00:34 -08:00
	struct page *page, *next;
2015-09-09 15:38:55 -07:00
2016-01-20 15:00:34 -08:00
	list_for_each_entry_safe(page, next, list, lru) {
2015-09-09 15:38:55 -07:00
		list_del(&page->lru);
		kimage_free_pages(page);
	}
}
static struct page *kimage_alloc_normal_control_pages(struct kimage *image,
						       unsigned int order)
{
	/* Control pages are special, they are the intermediaries
	 * that are needed while we copy the rest of the pages
	 * to their final resting place.  As such they must
	 * not conflict with either the destination addresses
	 * or memory the kernel is already using.
	 *
	 * The only case where we really need more than one of
	 * these are for architectures where we cannot disable
	 * the MMU and must instead generate an identity mapped
	 * page table for all of the memory.
	 *
	 * At worst this runs in O(N) of the image size.
	 */
	struct list_head extra_pages;
	struct page *pages;
	unsigned int count;
	count = 1 << order;
	INIT_LIST_HEAD(&extra_pages);
	/* Loop while I can allocate a page and the page allocated
	 * is a destination page.
	 */
	do {
		unsigned long pfn, epfn, addr, eaddr;
		pages = kimage_alloc_pages(KEXEC_CONTROL_MEMORY_GFP, order);
		if (!pages)
			break;
2016-08-02 14:06:04 -07:00
		pfn = page_to_boot_pfn(pages);
2015-09-09 15:38:55 -07:00
		epfn = pfn + count;
		addr = pfn << PAGE_SHIFT;
2023-12-17 11:35:26 +08:00
		eaddr = (epfn << PAGE_SHIFT) - 1;
2015-09-09 15:38:55 -07:00
		if ((epfn >= (KEXEC_CONTROL_MEMORY_LIMIT >> PAGE_SHIFT)) ||
		    kimage_is_destination_range(image, addr, eaddr)) {
			list_add(&pages->lru, &extra_pages);
			pages = NULL;
		}
	} while (!pages);
	if (pages) {
		/* Remember the allocated page... */
		list_add(&pages->lru, &image->control_pages);
		/* Because the page is already in its destination
		 * location we will never allocate another page at
		 * that address.  Therefore kimage_alloc_pages
		 * will not return it (again) and we don't need
		 * to give it an entry in image->segment[].
		 */
	}
	/* Deal with the destination pages I have inadvertently allocated.
	 *
	 * Ideally I would convert multi-page allocations into single
	 * page allocations, and add everything to image->dest_pages.
	 *
	 * For now it is simpler to just free the pages.
	 */
	kimage_free_page_list(&extra_pages);
	return pages;
}
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
2015-09-09 15:38:55 -07:00
static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
						      unsigned int order)
{
	/* Control pages are special, they are the intermediaries
	 * that are needed while we copy the rest of the pages
	 * to their final resting place.  As such they must
	 * not conflict with either the destination addresses
	 * or memory the kernel is already using.
	 *
	 * Control pages are also the only pages we must allocate
	 * when loading a crash kernel.  All of the other pages
	 * are specified by the segments and we just memcpy
	 * into them directly.
	 *
	 * The only case where we really need more than one of
	 * these are for architectures where we cannot disable
	 * the MMU and must instead generate an identity mapped
	 * page table for all of the memory.
	 *
	 * Given the low demand this implements a very simple
	 * allocator that finds the first hole of the appropriate
	 * size in the reserved memory region, and allocates all
	 * of the memory up to and including the hole.
	 */
	unsigned long hole_start, hole_end, size;
	struct page *pages;
	pages = NULL;
	size = (1 << order) << PAGE_SHIFT;
2023-12-12 22:27:06 +08:00
	hole_start = ALIGN(image->control_page, size);
2015-09-09 15:38:55 -07:00
	hole_end = hole_start + size - 1;
	while (hole_end <= crashk_res.end) {
		unsigned long i;
2016-12-14 15:04:23 -08:00
		cond_resched();
2015-09-09 15:38:55 -07:00
		if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
			break;
		/* See if I overlap any of the segments */
		for (i = 0; i < image->nr_segments; i++) {
			unsigned long mstart, mend;
			mstart = image->segment[i].mem;
			mend = mstart + image->segment[i].memsz - 1;
			if ((hole_end >= mstart) && (hole_start <= mend)) {
				/* Advance the hole to the end of the segment */
2023-12-12 22:27:06 +08:00
				hole_start = ALIGN(mend, size);
2015-09-09 15:38:55 -07:00
				hole_end = hole_start + size - 1;
				break;
			}
		}
		/* If I don't overlap any segments I have found my hole! */
		if (i == image->nr_segments) {
			pages = pfn_to_page(hole_start >> PAGE_SHIFT);
kexec_core: fix the assignment to kimage->control_page
image->control_page represents the starting address for allocating the
next control page, while hole_end represents the address of the last valid
byte of the currently allocated control page.
This bug actually does not affect the correctness of allocating control
pages, because image->control_page is currently only used in
kimage_alloc_crash_control_pages(), and this function, when allocating
control pages, will first align image->control_page up to the nearest
`(1 << order) << PAGE_SHIFT` boundary, then use this value as the
starting address of the next control page. This ensures that the newly
allocated control page will use the correct starting address and not
overlap with previously allocated control pages.
Although it does not affect the correctness of the final result, it is
better for us to set image->control_page to the correct value, in case
it might be used elsewhere in the future, potentially causing errors.
Therefore, after successfully allocating a control page,
image->control_page should be updated to `hole_end + 1`, rather than
hole_end.
Link: https://lkml.kernel.org/r/20231221042308.11076-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
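The claim in this commit message, that storing hole_end instead of hole_end + 1 did not change where the next control page starts, follows from the round-up arithmetic. A minimal user-space sketch of that arithmetic is given below; ALIGN_UP() is a hypothetical local macro with the usual power-of-two round-up behaviour (not the kernel's ALIGN() definition itself), and next_hole_start() is a hypothetical helper name.

```c
/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/*
 * Hypothetical helper: where the next control-page search would begin,
 * given the saved control_page value and the allocation size in bytes,
 * i.e. (1 << order) << PAGE_SHIFT.
 */
static unsigned long next_hole_start(unsigned long control_page,
				     unsigned long size)
{
	return ALIGN_UP(control_page, size);
}
```

For an aligned hole_start, hole_end = hole_start + size - 1 is the last byte of the hole, so rounding either hole_end or hole_end + 1 up to a multiple of size lands on the same address, hole_start + size.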
2023-12-21 12:23:08 +08:00
			image->control_page = hole_end + 1;
2015-09-09 15:38:55 -07:00
			break;
		}
	}
2018-09-30 11:10:31 +08:00
	/* Ensure that these pages are decrypted if SME is enabled. */
	if (pages)
		arch_kexec_post_alloc_pages(page_address(pages), 1 << order, 0);
2015-09-09 15:38:55 -07:00
	return pages;
}
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
struct page *kimage_alloc_control_pages(struct kimage *image,
					unsigned int order)
{
	struct page *pages = NULL;
	switch (image->type) {
	case KEXEC_TYPE_DEFAULT:
		pages = kimage_alloc_normal_control_pages(image, order);
		break;
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
2015-09-09 15:38:55 -07:00
	case KEXEC_TYPE_CRASH:
		pages = kimage_alloc_crash_control_pages(image, order);
		break;
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
	}
	return pages;
}
static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
{
	if (*image->entry != 0)
		image->entry++;
	if (image->entry == image->last_entry) {
		kimage_entry_t *ind_page;
		struct page *page;
		page = kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
		if (!page)
			return -ENOMEM;
		ind_page = page_address(page);
2016-08-02 14:06:04 -07:00
		*image->entry = virt_to_boot_phys(ind_page) | IND_INDIRECTION;
2015-09-09 15:38:55 -07:00
		image->entry = ind_page;
		image->last_entry = ind_page +
				      ((PAGE_SIZE / sizeof(kimage_entry_t)) - 1);
	}
	*image->entry = entry;
	image->entry++;
	*image->entry = 0;
	return 0;
}
static int kimage_set_destination(struct kimage *image,
				  unsigned long destination)
{
	destination &= PAGE_MASK;
2022-09-29 12:29:34 +08:00
	return kimage_add_entry(image, destination | IND_DESTINATION);
2015-09-09 15:38:55 -07:00
}
static int kimage_add_page(struct kimage *image, unsigned long page)
{
	page &= PAGE_MASK;
2022-09-29 12:29:34 +08:00
	return kimage_add_entry(image, page | IND_SOURCE);
2015-09-09 15:38:55 -07:00
}
static void kimage_free_extra_pages(struct kimage *image)
{
	/* Walk through and free any extra destination pages I may have */
	kimage_free_page_list(&image->dest_pages);
	/* Walk through and free any unusable pages I have cached */
	kimage_free_page_list(&image->unusable_pages);
}
2019-12-04 10:59:15 -05:00
2015-09-09 15:38:55 -07:00
void kimage_terminate(struct kimage *image)
{
	if (*image->entry != 0)
		image->entry++;
	*image->entry = IND_DONE;
}
#define for_each_kimage_entry(image, ptr, entry) \
	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
		ptr = (entry & IND_INDIRECTION) ? \
2016-08-02 14:06:04 -07:00
			boot_phys_to_virt((entry & PAGE_MASK)) : ptr + 1)
2015-09-09 15:38:55 -07:00
static void kimage_free_entry(kimage_entry_t entry)
{
	struct page *page;
2016-08-02 14:06:04 -07:00
	page = boot_pfn_to_page(entry >> PAGE_SHIFT);
2015-09-09 15:38:55 -07:00
	kimage_free_pages(page);
}
void kimage_free(struct kimage *image)
{
	kimage_entry_t *ptr, entry;
	kimage_entry_t ind = 0;
	if (!image)
		return;
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
kdump: protect vmcoreinfo data under the crash memory
Currently vmcoreinfo data is updated at boot time in a subsys_initcall(), so
it is at risk of being modified by buggy code while the system is
running.
As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
when using "crash", "makedumpfile", etc utility to parse this vmcore, we
probably will get "Segmentation fault" or other unexpected errors.
E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
system; 3) trigger kdump, then we obviously will fail to recognize the
crash context correctly due to the corrupted vmcoreinfo.
Now except for vmcoreinfo, all the crash data is well
protected(including the cpu note which is fully updated in the crash
path, thus its correctness is guaranteed). Given that vmcoreinfo data
is a large chunk prepared for kdump, we better protect it as well.
To solve this, we relocate and copy vmcoreinfo_data to the crash memory
when kdump is loading via kexec syscalls. Because the whole crash
memory will be protected by existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write(even read)
access under kernel direct mapping after kdump is loaded.
Since kdump is usually loaded at the very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.
On the other hand, we still need to operate the vmcoreinfo safe copy
when crash happens to generate vmcoreinfo_note again, we rely on vmap()
to map out a new kernel virtual address and update to use this new one
instead in the following crash_save_vmcoreinfo().
BTW, we do not touch vmcoreinfo_note, because it will be fully updated
using the protected vmcoreinfo_data after crash which is surely correct
just like the cpu crash note.
Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 14:33:21 -07:00
	if (image->vmcoreinfo_data_copy) {
		crash_update_vmcoreinfo_safecopy(NULL);
		vunmap(image->vmcoreinfo_data_copy);
	}
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
	kimage_free_extra_pages(image);
	for_each_kimage_entry(image, ptr, entry) {
		if (entry & IND_INDIRECTION) {
			/* Free the previous indirection page */
			if (ind & IND_INDIRECTION)
				kimage_free_entry(ind);
			/* Save this indirection page until we are
			 * done with it.
			 */
			ind = entry;
		} else if (entry & IND_SOURCE)
			kimage_free_entry(entry);
	}
	/* Free the final indirection page */
	if (ind & IND_INDIRECTION)
		kimage_free_entry(ind);
	/* Handle any machine specific cleanup */
	machine_kexec_cleanup(image);
	/* Free the kexec control pages... */
	kimage_free_page_list(&image->control_pages);
	/*
	 * Free up any temporary buffers allocated. This path may be hit
	 * if an error occurred well after buffer allocation.
	 */
	if (image->file_mode)
		kimage_file_post_load_cleanup(image);
	kfree(image);
}
static kimage_entry_t *kimage_dst_used(struct kimage *image,
					unsigned long page)
{
	kimage_entry_t *ptr, entry;
	unsigned long destination = 0;

	for_each_kimage_entry(image, ptr, entry) {
		if (entry & IND_DESTINATION)
			destination = entry & PAGE_MASK;
		else if (entry & IND_SOURCE) {
			if (page == destination)
				return ptr;
			destination += PAGE_SIZE;
		}
	}
	return NULL;
}
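
The lookup above can be tried out in user space. The following is a minimal sketch, assuming a flat, DONE-terminated entry array instead of the kernel's chained indirection pages; the `SK_*` constants and `sk_dst_used()` are hypothetical stand-ins introduced for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PAGE_SIZE, PAGE_MASK and the IND_* flags. */
#define SK_PAGE_SIZE   4096UL
#define SK_PAGE_MASK   (~(SK_PAGE_SIZE - 1))
#define SK_DESTINATION 0x1UL	/* entry sets the current destination address */
#define SK_DONE        0x4UL	/* entry terminates the list */
#define SK_SOURCE      0x8UL	/* entry names a source page for the current destination */

/*
 * Walk a flat entry list and return the SOURCE entry whose contents
 * will end up at physical address 'page', or NULL if there is none.
 * Assumption: no indirection pages, the array simply ends with SK_DONE.
 */
unsigned long *sk_dst_used(unsigned long *entries, unsigned long page)
{
	unsigned long destination = 0;
	unsigned long *ptr;

	assert(entries != NULL);
	for (ptr = entries; !(*ptr & SK_DONE); ptr++) {
		unsigned long entry = *ptr;

		if (entry & SK_DESTINATION) {
			/* A DESTINATION entry restarts the running address. */
			destination = entry & SK_PAGE_MASK;
		} else if (entry & SK_SOURCE) {
			if (page == destination)
				return ptr;
			/* Each SOURCE entry consumes one destination page. */
			destination += SK_PAGE_SIZE;
		}
	}
	return NULL;
}
```

The destination address is implicit: only the first page of a run is stored, and every following SOURCE entry advances it by one page, which is why the walk has to start from the head of the list each time.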
static struct page *kimage_alloc_page(struct kimage *image,
					gfp_t gfp_mask,
					unsigned long destination)
{
	/*
	 * Here we implement safeguards to ensure that a source page
	 * is not copied to its destination page before the data on
	 * the destination page is no longer useful.
	 *
	 * To do this we maintain the invariant that a source page is
	 * either its own destination page, or it is not a
	 * destination page at all.
	 *
	 * That is slightly stronger than required, but the proof
	 * that no problems will occur is trivial, and the
	 * implementation is simple to verify.
	 *
	 * When allocating all pages normally this algorithm will run
	 * in O(N) time, but in the worst case it will run in O(N^2)
	 * time. If the runtime is a problem the data structures can
	 * be fixed.
	 */
	struct page *page;
	unsigned long addr;
	/*
	 * Walk through the list of destination pages, and see if I
	 * have a match.
	 */
	list_for_each_entry(page, &image->dest_pages, lru) {
2016-08-02 14:06:04 -07:00
		addr = page_to_boot_pfn(page) << PAGE_SHIFT;
2015-09-09 15:38:55 -07:00
		if (addr == destination) {
			list_del(&page->lru);
			return page;
		}
	}
	page = NULL;
	while (1) {
		kimage_entry_t *old;
		/* Allocate a page; if we run out of memory, give up */
		page = kimage_alloc_pages(gfp_mask, 0);
		if (!page)
			return NULL;
		/* If the page cannot be used, file it away */
2016-08-02 14:06:04 -07:00
		if (page_to_boot_pfn(page) >
2015-09-09 15:38:55 -07:00
				(KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
			list_add(&page->lru, &image->unusable_pages);
			continue;
		}
2016-08-02 14:06:04 -07:00
		addr = page_to_boot_pfn(page) << PAGE_SHIFT;
2015-09-09 15:38:55 -07:00
		/* If it is the destination page we want, use it */
		if (addr == destination)
			break;
		/* If the page is not a destination page, use it */
		if (!kimage_is_destination_range(image, addr,
2023-12-17 11:35:26 +08:00
						 addr + PAGE_SIZE - 1))
2015-09-09 15:38:55 -07:00
			break;
		/*
		 * I know that the page is someone's destination page.
		 * See if there is already a source page for this
		 * destination page. And if so, swap the source pages.
		 */
		old = kimage_dst_used(image, addr);
		if (old) {
			/* If so move it */
			unsigned long old_addr;
			struct page *old_page;

			old_addr = *old & PAGE_MASK;
2016-08-02 14:06:04 -07:00
			old_page = boot_pfn_to_page(old_addr >> PAGE_SHIFT);
2015-09-09 15:38:55 -07:00
			copy_highpage(page, old_page);
			*old = addr | (*old & ~PAGE_MASK);
			/* The old page I have found cannot be a
			 * destination page, so return it if its
			 * gfp_flags honor the ones passed in.
			 */
			if (!(gfp_mask & __GFP_HIGHMEM) &&
			    PageHighMem(old_page)) {
				kimage_free_pages(old_page);
				continue;
			}
			page = old_page;
			break;
		}
		/* Place the page on the destination list, to be used later */
		list_add(&page->lru, &image->dest_pages);
	}
	return page;
}
static int kimage_load_normal_segment(struct kimage *image,
					 struct kexec_segment *segment)
{
	unsigned long maddr;
	size_t ubytes, mbytes;
	int result;
	unsigned char __user *buf = NULL;
	unsigned char *kbuf = NULL;

	if (image->file_mode)
		kbuf = segment->kbuf;
	else
		buf = segment->buf;
	ubytes = segment->bufsz;
	mbytes = segment->memsz;
	maddr = segment->mem;
	result = kimage_set_destination(image, maddr);
	if (result < 0)
		goto out;
	while (mbytes) {
		struct page *page;
		char *ptr;
		size_t uchunk, mchunk;

		page = kimage_alloc_page(image, GFP_HIGHUSER, maddr);
		if (!page) {
			result = -ENOMEM;
			goto out;
		}
2016-08-02 14:06:04 -07:00
		result = kimage_add_page(image, page_to_boot_pfn(page)
2015-09-09 15:38:55 -07:00
								<< PAGE_SHIFT);
		if (result < 0)
			goto out;
kexec: replace kmap() with kmap_local_page()
kmap() is being deprecated in favor of kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap's pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since its use in kexec_core.c is safe everywhere, it should be preferred.
Therefore, replace kmap() with kmap_local_page() in kexec_core.c.
Tested on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with
HIGHMEM64GB enabled.
Link: https://lkml.kernel.org/r/20220821182519.9483-1-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21 20:25:19 +02:00
		ptr = kmap_local_page(page);
2015-09-09 15:38:55 -07:00
		/* Start with a clear page */
		clear_page(ptr);
		ptr += maddr & ~PAGE_MASK;
		mchunk = min_t(size_t, mbytes,
				PAGE_SIZE - (maddr & ~PAGE_MASK));
		uchunk = min(ubytes, mchunk);
2024-02-22 17:21:19 +08:00
		if (uchunk) {
			/* For file based kexec, source pages are in kernel memory */
			if (image->file_mode)
				memcpy(ptr, kbuf, uchunk);
			else
				result = copy_from_user(ptr, buf, uchunk);
			ubytes -= uchunk;
			if (image->file_mode)
				kbuf += uchunk;
			else
				buf += uchunk;
		}
2022-08-21 20:25:19 +02:00
		kunmap_local(ptr);
2015-09-09 15:38:55 -07:00
		if (result) {
			result = -EFAULT;
			goto out;
		}
		maddr += mchunk;
		mbytes -= mchunk;
2018-06-14 15:26:31 -07:00
		cond_resched();
2015-09-09 15:38:55 -07:00
	}
out:
	return result;
}
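
The per-page chunk arithmetic in the loop above can be checked in isolation. This is a rough user-space sketch, assuming a 4 KiB page; `sk_next_chunk()` and `sk_count_pages()` are hypothetical helpers written for illustration and only mirror the `mchunk`/`uchunk` computation, not the page allocation or copying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PAGE_SIZE and PAGE_MASK (4 KiB assumed). */
#define SK_PAGE_SIZE 4096UL
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

struct sk_chunk {
	size_t mchunk;	/* bytes of segment memory covered by this page */
	size_t uchunk;	/* bytes actually copied from the source buffer */
};

/* Compute one iteration's chunk sizes, as the loop body does. */
struct sk_chunk sk_next_chunk(unsigned long maddr, size_t mbytes, size_t ubytes)
{
	struct sk_chunk c;
	/* Room left in the page that contains maddr. */
	size_t room = SK_PAGE_SIZE - (maddr & ~SK_PAGE_MASK);

	c.mchunk = mbytes < room ? mbytes : room;
	c.uchunk = ubytes < c.mchunk ? ubytes : c.mchunk;
	return c;
}

/* Count how many pages the loop would touch for a whole segment. */
size_t sk_count_pages(unsigned long maddr, size_t mbytes, size_t ubytes)
{
	size_t pages = 0;

	while (mbytes) {
		struct sk_chunk c = sk_next_chunk(maddr, mbytes, ubytes);

		assert(c.mchunk > 0);	/* guarantees forward progress */
		ubytes -= c.uchunk;
		maddr += c.mchunk;
		mbytes -= c.mchunk;
		pages++;
	}
	return pages;
}
```

Only the first chunk can be shorter than a page because of alignment; once `ubytes` reaches zero, the remaining pages of the segment receive no source data.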
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
2015-09-09 15:38:55 -07:00
static int kimage_load_crash_segment(struct kimage *image,
					struct kexec_segment *segment)
{
	/* For crash dump kernels we simply copy the data from
	 * user space to its destination.
	 * We do things a page at a time for the sake of kmap.
	 */
	unsigned long maddr;
	size_t ubytes, mbytes;
	int result;
	unsigned char __user *buf = NULL;
	unsigned char *kbuf = NULL;

	result = 0;
	if (image->file_mode)
		kbuf = segment->kbuf;
	else
		buf = segment->buf;
	ubytes = segment->bufsz;
	mbytes = segment->memsz;
	maddr = segment->mem;
	while (mbytes) {
		struct page *page;
		char *ptr;
		size_t uchunk, mchunk;

2016-08-02 14:06:04 -07:00
		page = boot_pfn_to_page(maddr >> PAGE_SHIFT);
2015-09-09 15:38:55 -07:00
		if (!page) {
			result = -ENOMEM;
			goto out;
		}
2018-09-30 11:10:31 +08:00
		arch_kexec_post_alloc_pages(page_address(page), 1, 0);
2022-08-21 20:25:19 +02:00
		ptr = kmap_local_page(page);
2015-09-09 15:38:55 -07:00
		ptr += maddr & ~PAGE_MASK;
		mchunk = min_t(size_t, mbytes,
				PAGE_SIZE - (maddr & ~PAGE_MASK));
		uchunk = min(ubytes, mchunk);
		if (mchunk > uchunk) {
			/* Zero the trailing part of the page */
			memset(ptr + uchunk, 0, mchunk - uchunk);
		}
2024-02-22 17:21:19 +08:00
		if (uchunk) {
			/* For file based kexec, source pages are in kernel memory */
			if (image->file_mode)
				memcpy(ptr, kbuf, uchunk);
			else
				result = copy_from_user(ptr, buf, uchunk);
			ubytes -= uchunk;
			if (image->file_mode)
				kbuf += uchunk;
			else
				buf += uchunk;
		}
2015-09-09 15:38:55 -07:00
		kexec_flush_icache_page(page);
2022-08-21 20:25:19 +02:00
		kunmap_local(ptr);
2018-09-30 11:10:31 +08:00
		arch_kexec_pre_free_pages(page_address(page), 1);
2015-09-09 15:38:55 -07:00
		if (result) {
			result = -EFAULT;
			goto out;
		}
		maddr += mchunk;
		mbytes -= mchunk;
2018-06-14 15:26:31 -07:00
		cond_resched();
2015-09-09 15:38:55 -07:00
	}
out:
	return result;
}
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
int kimage_load_segment(struct kimage *image,
				struct kexec_segment *segment)
{
	int result = -ENOMEM;

	switch (image->type) {
	case KEXEC_TYPE_DEFAULT:
		result = kimage_load_normal_segment(image, segment);
		break;
2024-01-24 13:12:44 +08:00
#ifdef CONFIG_CRASH_DUMP
2015-09-09 15:38:55 -07:00
	case KEXEC_TYPE_CRASH:
		result = kimage_load_crash_segment(image, segment);
		break;
2024-01-24 13:12:44 +08:00
#endif
2015-09-09 15:38:55 -07:00
	}
	return result;
}
2023-01-04 15:38:48 +01:00
struct kexec_load_limit {
	/* Mutex protects the limit count. */
	struct mutex mutex;
	int limit;
};

static struct kexec_load_limit load_limit_reboot = {
	.mutex = __MUTEX_INITIALIZER(load_limit_reboot.mutex),
	.limit = -1,
};

static struct kexec_load_limit load_limit_panic = {
	.mutex = __MUTEX_INITIALIZER(load_limit_panic.mutex),
	.limit = -1,
};
2015-09-09 15:38:55 -07:00
struct kimage *kexec_image;
struct kimage *kexec_crash_image;
2023-01-04 15:38:47 +01:00
static int kexec_load_disabled;
2023-01-04 15:38:48 +01:00
2022-04-24 10:57:40 +08:00
#ifdef CONFIG_SYSCTL
2023-01-04 15:38:48 +01:00
static int kexec_limit_handler(struct ctl_table *table, int write,
			       void *buffer, size_t *lenp, loff_t *ppos)
{
	struct kexec_load_limit *limit = table->data;
	int val;
	struct ctl_table tmp = {
		.data = &val,
		.maxlen = sizeof(val),
		.mode = table->mode,
	};
	int ret;

	if (write) {
		ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
		if (ret)
			return ret;
		if (val < 0)
			return -EINVAL;
		mutex_lock(&limit->mutex);
		if (limit->limit != -1 && val >= limit->limit)
			ret = -EINVAL;
		else
			limit->limit = val;
		mutex_unlock(&limit->mutex);
		return ret;
	}
	mutex_lock(&limit->mutex);
	val = limit->limit;
	mutex_unlock(&limit->mutex);
	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
}
2022-04-24 10:57:40 +08:00
static struct ctl_table kexec_core_sysctls[] = {
	{
		.procname	= "kexec_load_disabled",
		.data		= &kexec_load_disabled,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		/* only handle a transition from default "0" to "1" */
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE,
		.extra2		= SYSCTL_ONE,
	},
2023-01-04 15:38:48 +01:00
	{
		.procname	= "kexec_load_limit_panic",
		.data		= &load_limit_panic,
		.mode		= 0644,
		.proc_handler	= kexec_limit_handler,
	},
	{
		.procname	= "kexec_load_limit_reboot",
		.data		= &load_limit_reboot,
		.mode		= 0644,
		.proc_handler	= kexec_limit_handler,
	},
2022-04-24 10:57:40 +08:00
};

static int __init kexec_core_sysctl_init(void)
{
	register_sysctl_init("kernel", kexec_core_sysctls);
	return 0;
}
late_initcall(kexec_core_sysctl_init);
#endif
2015-09-09 15:38:55 -07:00
2023-01-04 15:38:48 +01:00
bool kexec_load_permitted(int kexec_image_type)
2023-01-04 15:38:47 +01:00
{
2023-01-04 15:38:48 +01:00
	struct kexec_load_limit *limit;
2023-01-04 15:38:47 +01:00
	/*
	 * The kexec syscall is available only to callers with
	 * CAP_SYS_BOOT, and only while it has not been disabled.
	 */
2023-01-04 15:38:48 +01:00
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return false;
	/* Check the limit counter and decrease it. */
	limit = (kexec_image_type == KEXEC_TYPE_CRASH) ?
		&load_limit_panic : &load_limit_reboot;
	mutex_lock(&limit->mutex);
	if (!limit->limit) {
		mutex_unlock(&limit->mutex);
		return false;
	}
	if (limit->limit != -1)
		limit->limit--;
	mutex_unlock(&limit->mutex);
	return true;
2023-01-04 15:38:47 +01:00
}
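
The load-limit counter semantics (a value of -1 means unlimited, a written value may only lower the limit, and each permitted load consumes one unit) can be modelled in user space. The following is a minimal sketch, assuming single-threaded use so the mutex is left out; `sk_load_limit`, `sk_limit_write()` and `sk_limit_consume()` are hypothetical names introduced for illustration, and the capability and `kexec_load_disabled` checks are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the counter; -1 means "no limit". */
struct sk_load_limit {
	int limit;
};

/* Mirrors the write path of the sysctl handler (locking omitted). */
int sk_limit_write(struct sk_load_limit *l, int val)
{
	assert(l != NULL);
	if (val < 0)
		return -EINVAL;
	/* Once a limit is set it can only be lowered, never kept or raised. */
	if (l->limit != -1 && val >= l->limit)
		return -EINVAL;
	l->limit = val;
	return 0;
}

/* Mirrors the counter check in the permission function (locking omitted). */
bool sk_limit_consume(struct sk_load_limit *l)
{
	assert(l != NULL);
	if (!l->limit)
		return false;	/* budget exhausted */
	if (l->limit != -1)
		l->limit--;	/* unlimited counters are left untouched */
	return true;
}
```

One consequence of the `>=` comparison is that writing the current value back is rejected as well, so an administrator can tighten the budget but cannot refresh it.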
2015-09-09 15:38:55 -07:00
/*
 * Move into place and start executing a preloaded standalone
 * executable. If nothing was preloaded return an error.
 */
int kernel_kexec(void)
{
	int error = 0;
2022-06-30 23:32:58 +01:00
	if (!kexec_trylock())
2015-09-09 15:38:55 -07:00
		return -EBUSY;
	if (!kexec_image) {
		error = -EINVAL;
		goto Unlock;
	}
#ifdef CONFIG_KEXEC_JUMP
	if (kexec_image->preserve_context) {
		pm_prepare_console();
		error = freeze_processes();
		if (error) {
			error = -EBUSY;
			goto Restore_console;
		}
		suspend_console();
		error = dpm_suspend_start(PMSG_FREEZE);
		if (error)
			goto Resume_console;
		/* At this point, dpm_suspend_start() has been called,
		 * but *not* dpm_suspend_end(). We *must* call
		 * dpm_suspend_end() now. Otherwise, drivers for
		 * some devices (e.g. interrupt controllers) become
		 * desynchronized with the actual state of the
		 * hardware at resume time, and evil weirdness ensues.
		 */
		error = dpm_suspend_end(PMSG_FREEZE);
		if (error)
			goto Resume_devices;
2019-04-11 13:34:45 +10:00
		error = suspend_disable_secondary_cpus();
2015-09-09 15:38:55 -07:00
		if (error)
			goto Enable_cpus;
		local_irq_disable();
		error = syscore_suspend();
		if (error)
			goto Enable_irqs;
	} else
#endif
	{
		kexec_in_progress = true;
2021-05-06 18:04:35 -07:00
		kernel_restart_prepare("kexec reboot");
2015-09-09 15:38:55 -07:00
		migrate_to_reboot_cpu();
kexec: do syscore_shutdown() in kernel_kexec
syscore_shutdown() runs driver and module callbacks to get the system into
a state where it can be correctly shut down. In commit 6f389a8f1dd2 ("PM
/ reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence got
(incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to
hook restart/shutdown") changed the way that KVM registered its shutdown
callbacks, switching from reboot notifiers to syscore_ops.shutdown. As
syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
and virtualisation is left enabled on the boot CPU which results in triple
faults when switching to the new kernel on Intel x86 VT-x with VMXE
enabled.
Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: in commit 6f389a8f1dd2 ("PM / reboot: call
syscore_shutdown() after disable_nonboot_cpus()") it is stated that
"syscore_ops operations should be carried with one CPU on-line and
interrupts disabled." APs are only offlined later in machine_shutdown(),
so this syscore_shutdown() is being run while APs are still online. This
seems to be the correct place as it matches where syscore_shutdown() is
run in the reboot and halt flows - they also run it before APs are shut
down. The assumption is that the commit message in commit 6f389a8f1dd2
("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
no longer valid.
KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow. Looking at some of them like x86 MCE it
is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on CC
for visibility. They are:
arch/powerpc/platforms/cell/spu_base.c .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c .shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c .shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c .shutdown = kvm_shutdown,
This has been tested by doing a kexec on x86_64 and aarch64.
Link: https://lkml.kernel.org/r/20231213064004.2419447-1-jgowans@amazon.com
Fixes: 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
Cc: Samuel Holland <samuel@sholland.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Orson Zhai <orsonzhai@gmail.com>
Cc: Alexander Graf <graf@amazon.de>
Cc: Jan H. Schoenherr <jschoenh@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-13 08:40:04 +02:00
		syscore_shutdown();
2015-09-09 15:38:55 -07:00
		/*
		 * migrate_to_reboot_cpu() disables CPU hotplug assuming that
		 * no further code needs to use CPU hotplug (which is true in
		 * the reboot case). However, the kexec path depends on using
		 * CPU hotplug again; so re-enable it here.
		 */
		cpu_hotplug_enable();
2019-12-04 10:59:14 -05:00
		pr_notice("Starting new kernel\n");
2015-09-09 15:38:55 -07:00
		machine_shutdown();
	}
2021-05-06 18:04:41 -07:00
	kmsg_dump(KMSG_DUMP_SHUTDOWN);
2015-09-09 15:38:55 -07:00
	machine_kexec(kexec_image);
#ifdef CONFIG_KEXEC_JUMP
	if (kexec_image->preserve_context) {
		syscore_resume();
 Enable_irqs:
		local_irq_enable();
 Enable_cpus:
2019-04-11 13:34:45 +10:00
		suspend_enable_secondary_cpus();
2015-09-09 15:38:55 -07:00
		dpm_resume_start(PMSG_RESTORE);
 Resume_devices:
		dpm_resume_end(PMSG_RESTORE);
 Resume_console:
		resume_console();
		thaw_processes();
 Restore_console:
		pm_restore_console();
	}
#endif
 Unlock:
2022-06-30 23:32:58 +01:00
	kexec_unlock();
2015-09-09 15:38:55 -07:00
	return error;
}
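
The preserve-context path above relies on a staged setup with labels that unwind in reverse order, so a failure at any stage jumps to the label that undoes only what has already been done. The following is a simplified user-space sketch of that control-flow pattern, assuming three generic stages; `sk_trace`, `sk_note()` and `sk_staged()` are hypothetical names, and the stages do not correspond one-to-one to the kernel's suspend steps.

```c
#include <assert.h>
#include <stddef.h>

/* Records which steps ran, in order, purely for illustration. */
struct sk_trace {
	char log[16];
	size_t len;
};

static void sk_note(struct sk_trace *t, char c)
{
	assert(t->len < sizeof(t->log) - 1);
	t->log[t->len++] = c;
	t->log[t->len] = '\0';
}

/*
 * Run stages A, B, C, then the main work ('!'), then undo c, b, a.
 * fail_at selects a stage (1..3) that fails; 0 means nothing fails.
 * A failing stage is not undone itself; only earlier stages are.
 */
int sk_staged(struct sk_trace *t, int fail_at)
{
	int error = 0;

	if (fail_at == 1) {
		error = -1;
		goto out;
	}
	sk_note(t, 'A');
	if (fail_at == 2) {
		error = -2;
		goto undo_a;
	}
	sk_note(t, 'B');
	if (fail_at == 3) {
		error = -3;
		goto undo_b;
	}
	sk_note(t, 'C');

	sk_note(t, '!');	/* the main work */

	sk_note(t, 'c');
undo_b:
	sk_note(t, 'b');
undo_a:
	sk_note(t, 'a');
out:
	return error;
}
```

Because the success path falls through the same labels, the teardown code exists only once, which is the property the kernel function depends on when it resumes after a context-preserving kexec.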