2019-05-19 15:08:55 +03:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-17 02:20:36 +04:00
/*
* Copyright ( C ) 1995 Linus Torvalds
*
2019-11-18 17:49:22 +03:00
* This file contains the setup_arch ( ) code , which handles the architecture - dependent
* parts of early kernel initialization .
2005-04-17 02:20:36 +04:00
*/
# include <linux/console.h>
2019-11-18 17:49:22 +03:00
# include <linux/crash_dump.h>
2020-09-11 11:56:52 +03:00
# include <linux/dma-map-ops.h>
2019-11-18 17:49:22 +03:00
# include <linux/dmi.h>
2005-04-17 02:20:36 +04:00
# include <linux/efi.h>
2019-11-18 17:49:22 +03:00
# include <linux/init_ohci1394_dma.h>
# include <linux/initrd.h>
2008-04-10 06:50:41 +04:00
# include <linux/iscsi_ibft.h>
2019-11-18 17:49:22 +03:00
# include <linux/memblock.h>
2008-01-30 15:30:16 +03:00
# include <linux/pci.h>
2019-11-18 17:49:22 +03:00
# include <linux/root_dev.h>
mm: hugetlb: optionally allocate gigantic hugepages using cma
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.
However it actually works only at early stages of the system loading,
when the majority of memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero. Even dropping caches manually doesn't
help a lot.
At large scale rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.
The following solution can solve the problem:
1) On boot time a dedicated cma area* is reserved. The size is passed
as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
cma allocator and the dedicated cma area
In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.
* On a multi-node machine a per-node cma area is allocated on each node.
Subsequent gigantic hugetlb allocations use the first available
numa node if the mask isn't specified by a user.
Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
pass hugetlb_cma=10G as a kernel argument
2) allocate hugetlb pages as usual, e.g.
echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.
x86 and arm-64 are covered by this patch, other architectures can be
trivially added later.
The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap. It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-11 00:32:45 +03:00
# include <linux/hugetlb.h>
2009-09-02 05:25:07 +04:00
# include <linux/tboot.h>
2017-03-21 11:01:31 +03:00
# include <linux/usb/xhci-dbgp.h>
2020-08-18 16:57:51 +03:00
# include <linux/static_call.h>
2020-09-22 16:31:03 +03:00
# include <linux/swiotlb.h>
2005-06-26 01:58:01 +04:00
2019-11-18 17:49:22 +03:00
# include <uapi/linux/mount.h>
# include <xen/xen.h>
2005-06-26 01:57:41 +04:00
# include <asm/apic.h>
2020-08-06 15:34:32 +03:00
# include <asm/numa.h>
2008-03-17 22:08:17 +03:00
# include <asm/bios_ebda.h>
2008-06-17 03:11:08 +04:00
# include <asm/bugs.h>
2009-01-07 15:41:35 +03:00
# include <asm/cpu.h>
2019-11-18 17:49:22 +03:00
# include <asm/efi.h>
2008-11-27 20:39:15 +03:00
# include <asm/gart.h>
2008-10-27 20:41:46 +03:00
# include <asm/hypervisor.h>
2019-11-18 17:49:22 +03:00
# include <asm/io_apic.h>
# include <asm/kasan.h>
# include <asm/kaslr.h>
2009-11-10 04:38:24 +03:00
# include <asm/mce.h>
2019-11-18 17:49:22 +03:00
# include <asm/mtrr.h>
2019-11-26 19:54:07 +03:00
# include <asm/realmode.h>
2019-11-18 17:49:22 +03:00
# include <asm/olpc_ofw.h>
# include <asm/pci-direct.h>
2011-02-22 23:07:37 +03:00
# include <asm/prom.h>
2019-11-18 17:49:22 +03:00
# include <asm/proto.h>
x86/thermal: Fix LVT thermal setup for SMI delivery mode
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):
[ 0.033858] read lvtthmr: 0x330, val: 0x200
which means:
- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.
Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:
setup_local_APIC:
...
/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
which means (from the SDM):
"10.4.7.2 Local APIC State After It Has Been Software Disabled
...
* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."
And this happens too:
[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't fire at all,
leading to CPU 0 stuck in SMM forever...
Now, before
4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.
But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.
Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.
Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
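The register values quoted in the message above (0x200 before the soft-disable, 0x10200 after) can be decoded with a few bit operations. The following is a minimal userspace sketch, not kernel code: the macro and function names are hypothetical, and only the bit positions (mask bit 16, delivery mode in bits [10:8], 010b meaning SMI) are taken from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; bit positions follow the commit text. */
#define LVT_MASK_BIT       (UINT32_C(1) << 16)  /* set = interrupt masked */
#define LVT_DELIVERY_SHIFT 8
#define LVT_DELIVERY_MASK  UINT32_C(0x7)        /* bits [10:8] */
#define LVT_DELIVERY_SMI   UINT32_C(0x2)        /* 010b */

/* True when the LVT entry's mask bit is set. */
static bool lvt_is_masked(uint32_t lvt)
{
	return (lvt & LVT_MASK_BIT) != 0;
}

/* Extract the 3-bit delivery mode field. */
static uint32_t lvt_delivery_mode(uint32_t lvt)
{
	return (lvt >> LVT_DELIVERY_SHIFT) & LVT_DELIVERY_MASK;
}

/* True when the entry is unmasked and programmed for SMI delivery. */
static bool lvt_smi_enabled(uint32_t lvt)
{
	return !lvt_is_masked(lvt) && lvt_delivery_mode(lvt) == LVT_DELIVERY_SMI;
}
```

With these helpers, 0x200 reads as "SMI delivery, enabled" while 0x10200 keeps the SMI delivery mode but is masked, which is the state the late read observed.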
2021-05-27 12:02:26 +03:00
# include <asm/thermal.h>
2017-07-25 02:36:57 +03:00
# include <asm/unwind.h>
2019-11-18 17:49:22 +03:00
# include <asm/vsyscall.h>
2019-11-29 10:17:25 +03:00
# include <linux/vmalloc.h>
2008-06-26 04:51:29 +04:00
2009-04-28 17:00:49 +04:00
/*
2019-11-18 18:03:39 +03:00
* max_low_pfn_mapped : highest directly mapped pfn < 4 GB
* max_pfn_mapped : highest directly mapped pfn > 4 GB
2012-11-17 07:38:52 +04:00
*
2017-01-28 19:09:33 +03:00
* The direct mapping only covers E820_TYPE_RAM regions , so the ranges and gaps are
2019-11-18 18:03:39 +03:00
* represented by pfn_mapped [ ] .
2009-04-28 17:00:49 +04:00
*/
unsigned long max_low_pfn_mapped ;
unsigned long max_pfn_mapped ;
2010-02-10 02:38:45 +03:00
# ifdef CONFIG_DMI
2009-03-13 02:09:49 +03:00
RESERVE_BRK ( dmi_alloc , 65536 ) ;
2010-02-10 02:38:45 +03:00
# endif
2009-03-13 02:09:49 +03:00
2009-01-27 19:13:05 +03:00
2019-11-18 18:03:39 +03:00
/*
* Range of the BSS area . The size of the BSS area is determined
2021-03-11 11:39:19 +03:00
* at link time , with RESERVE_BRK ( ) facility reserving additional
2019-11-18 18:03:39 +03:00
* chunks .
*/
unsigned long _brk_start = ( unsigned long ) __brk_base ;
unsigned long _brk_end = ( unsigned long ) __brk_base ;
x86: add brk allocation for very, very early allocations
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
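The brk-like allocator described above is a bump allocator over a fixed area. The following is a minimal userspace sketch of that idea, assuming a static buffer in place of the __brk_base to __brk_limit range and assert() in place of BUG_ON(); the names sketch_brk_init, sketch_extend_brk and sketch_reserve_brk and the 4096-byte area size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BRK_SIZE 4096	/* hypothetical area size for this sketch */

/* Stands in for the brk area that follows the bss in the kernel image. */
static _Alignas(64) unsigned char sketch_brk_area[SKETCH_BRK_SIZE];
static uintptr_t sketch_brk_start, sketch_brk_end;	/* start == 0: locked */

static void sketch_brk_init(void)
{
	sketch_brk_start = sketch_brk_end = (uintptr_t)sketch_brk_area;
}

/* Round the end pointer up to 'align', hand out 'size' zeroed bytes. */
static void *sketch_extend_brk(size_t size, size_t align)
{
	uintptr_t mask = align - 1;
	void *ret;

	assert(sketch_brk_start != 0);		/* not locked down yet */
	assert((align & mask) == 0);		/* align is a power of two */

	sketch_brk_end = (sketch_brk_end + mask) & ~mask;
	assert(sketch_brk_end + size <=
	       (uintptr_t)sketch_brk_area + SKETCH_BRK_SIZE);

	ret = (void *)sketch_brk_end;
	sketch_brk_end += size;
	memset(ret, 0, size);
	return ret;
}

/* Report how many bytes were used and refuse further allocations. */
static size_t sketch_reserve_brk(void)
{
	size_t used = sketch_brk_end - sketch_brk_start;

	sketch_brk_start = 0;
	return used;
}
```

Only the used part is reported at lock-down time, which mirrors the point made above that unused brk memory can return to the general pool.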
2009-02-27 04:35:44 +03:00
2008-06-26 04:55:20 +04:00
struct boot_params boot_params ;
2016-04-14 21:18:57 +03:00
/*
2019-11-18 18:03:39 +03:00
* These are the four main kernel memory regions . We put them into
* the resource tree so that kdump tools and other debugging tools
* can recover them :
2016-04-14 21:18:57 +03:00
*/
2019-11-18 18:03:39 +03:00
2019-10-30 00:13:50 +03:00
static struct resource rodata_resource = {
. name = "Kernel rodata" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
2016-04-14 21:18:57 +03:00
static struct resource data_resource = {
. name = "Kernel data" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
static struct resource code_resource = {
. name = "Kernel code" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
static struct resource bss_resource = {
. name = "Kernel bss" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
2008-06-26 04:50:06 +04:00
# ifdef CONFIG_X86_32
2019-11-18 18:03:39 +03:00
/* CPU data as detected by the assembly code in head_32.S */
2017-02-13 00:12:08 +03:00
struct cpuinfo_x86 new_cpu_data ;
2019-11-18 18:03:39 +03:00
/* Common CPU data for all CPUs */
2017-02-13 00:12:08 +03:00
struct cpuinfo_x86 boot_cpu_data __read_mostly ;
2005-06-23 11:08:33 +04:00
EXPORT_SYMBOL ( boot_cpu_data ) ;
2005-04-17 02:20:36 +04:00
2008-03-27 23:55:04 +03:00
unsigned int def_to_bigsmp ;
2008-06-26 04:50:06 +04:00
struct apm_info apm_info ;
EXPORT_SYMBOL ( apm_info ) ;
# if defined(CONFIG_X86_SPEEDSTEP_SMI) || \
defined ( CONFIG_X86_SPEEDSTEP_SMI_MODULE )
struct ist_info ist_info ;
EXPORT_SYMBOL ( ist_info ) ;
# else
struct ist_info ist_info ;
# endif
# else
2018-02-14 14:16:54 +03:00
struct cpuinfo_x86 boot_cpu_data __read_mostly ;
2008-06-26 04:50:06 +04:00
EXPORT_SYMBOL ( boot_cpu_data ) ;
# endif
# if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
2016-08-09 02:29:06 +03:00
__visible unsigned long mmu_cr4_features __ro_after_init ;
2008-06-26 04:50:06 +04:00
# else
2016-08-09 02:29:06 +03:00
__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE ;
2008-06-26 04:50:06 +04:00
# endif
2009-05-08 03:54:11 +04:00
/* Boot loader ID and version as integers, for the benefit of proc_dointvec */
int bootloader_type , bootloader_version ;
2005-04-17 02:20:36 +04:00
/*
* Setup options
*/
struct screen_info screen_info ;
2005-06-23 11:08:33 +04:00
EXPORT_SYMBOL ( screen_info ) ;
2005-04-17 02:20:36 +04:00
struct edid_info edid_info ;
2005-09-10 00:04:34 +04:00
EXPORT_SYMBOL_GPL ( edid_info ) ;
2005-04-17 02:20:36 +04:00
extern int root_mountflags ;
2008-04-11 01:28:10 +04:00
unsigned long saved_video_mode ;
2005-04-17 02:20:36 +04:00
2008-01-30 15:32:51 +03:00
# define RAMDISK_IMAGE_START_MASK 0x07FF
2005-04-17 02:20:36 +04:00
# define RAMDISK_PROMPT_FLAG 0x8000
2008-01-30 15:32:51 +03:00
# define RAMDISK_LOAD_FLAG 0x4000
2005-04-17 02:20:36 +04:00
2007-02-12 11:54:11 +03:00
static char __initdata command_line [ COMMAND_LINE_SIZE ] ;
2008-08-12 23:52:36 +04:00
# ifdef CONFIG_CMDLINE_BOOL
static char __initdata builtin_cmdline [ COMMAND_LINE_SIZE ] = CONFIG_CMDLINE ;
# endif
2005-04-17 02:20:36 +04:00
# if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
struct edd edd ;
# ifdef CONFIG_EDD_MODULE
EXPORT_SYMBOL ( edd ) ;
# endif
/**
* copy_edd ( ) - Copy the BIOS EDD information
* from boot_params into a safe place .
*
*/
2009-11-30 13:33:51 +03:00
static inline void __init copy_edd ( void )
2005-04-17 02:20:36 +04:00
{
2007-10-16 04:13:22 +04:00
memcpy ( edd . mbr_signature , boot_params . edd_mbr_sig_buffer ,
sizeof ( edd . mbr_signature ) ) ;
memcpy ( edd . edd_info , boot_params . eddbuf , sizeof ( edd . edd_info ) ) ;
edd . mbr_signature_nr = boot_params . edd_mbr_sig_buf_entries ;
edd . edd_info_nr = boot_params . eddbuf_entries ;
2005-04-17 02:20:36 +04:00
}
# else
2009-11-30 13:33:51 +03:00
static inline void __init copy_edd ( void )
2005-04-17 02:20:36 +04:00
{
}
# endif
2009-03-15 03:19:51 +03:00
void * __init extend_brk ( size_t size , size_t align )
{
size_t mask = align - 1 ;
void * ret ;
BUG_ON ( _brk_start == 0 ) ;
BUG_ON ( align & mask ) ;
_brk_end = ( _brk_end + mask ) & ~ mask ;
BUG_ON ( ( char * ) ( _brk_end + size ) > __brk_limit ) ;
ret = ( void * ) _brk_end ;
_brk_end += size ;
memset ( ret , 0 , size ) ;
return ret ;
}
2012-11-17 07:39:08 +04:00
# ifdef CONFIG_X86_32
2011-02-18 14:30:30 +03:00
static void __init cleanup_highmap ( void )
2010-12-28 03:48:32 +03:00
{
}
2009-06-22 18:39:41 +04:00
# endif
2009-03-15 03:19:51 +03:00
static void __init reserve_brk ( void )
{
if ( _brk_end > _brk_start )
2012-11-17 01:57:13 +04:00
memblock_reserve ( __pa_symbol ( _brk_start ) ,
_brk_end - _brk_start ) ;
2009-03-15 03:19:51 +03:00
/* Mark brk area as locked down and no longer taking any
new allocations */
_brk_start = 0 ;
}
2013-12-04 23:50:42 +04:00
u64 relocated_ramdisk ;
2008-01-30 15:32:51 +03:00
# ifdef CONFIG_BLK_DEV_INITRD
2013-01-25 00:19:56 +04:00
static u64 __init get_ramdisk_image ( void )
{
u64 ramdisk_image = boot_params . hdr . ramdisk_image ;
2013-01-29 08:16:44 +04:00
ramdisk_image |= ( u64 ) boot_params . ext_ramdisk_image << 32 ;
x86/setup: Add an initrdmem= option to specify initrd physical address
Add the initrdmem option:
initrdmem=ss[KMG],nn[KMG]
which is used to specify the physical address of the initrd, almost
always an address in FLASH. Also add code for x86 to use the existing
phys_initrd_start and phys_initrd_size variables in the kernel.
This is useful in cases where a kernel and an initrd is placed in FLASH,
but there is no firmware file system structure in the FLASH.
One such situation occurs when unused FLASH space on UEFI systems has
been reclaimed by, e.g., taking it from the Management Engine. For
example, on many systems, the ME is given half the FLASH part; not only
is 2.75M of an 8M part unused; but 10.75M of a 16M part is unused. This
space can be used to contain an initrd, but Linux needs to be told where
it is.
This space is "raw": due to, e.g., UEFI limitations: it can not be added
to UEFI firmware volumes without rebuilding UEFI from source or writing
a UEFI device driver. It can be referenced only as a physical address
and size.
At the same time, if a kernel can be "netbooted" or loaded from GRUB or
syslinux, the option of not using the physical address specification
should be available.
Then, it is easy to boot the kernel and provide an initrd; or boot the
kernel and let it use the initrd in FLASH. In practice, this has
proven to be very helpful when integrating Linux into FLASH on x86.
Hence, the most flexible and convenient path is to enable the initrdmem
command line option in a way that it is the last choice tried.
For example, on the DigitalLoggers Atomic Pi, an image into FLASH can be
burnt in with a built-in command line which includes:
initrdmem=0xff968000,0x200000
which specifies a location and size.
[ bp: Massage commit message, make it passive. ]
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: http://lkml.kernel.org/r/CAP6exYLK11rhreX=6QPyDQmW7wPHsKNEFtXE47pjx41xS6O7-A@mail.gmail.com
Link: https://lkml.kernel.org/r/20200426011021.1cskg0AGd%akpm@linux-foundation.org
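The initrdmem=ss[KMG],nn[KMG] syntax described above can be illustrated with a small userspace parser. This is a sketch only: the kernel's actual option handler is not shown in this file, so the function names sketch_parse_size and sketch_parse_initrdmem are hypothetical, and strtoull() stands in for whatever the kernel uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K, M or G suffix; *end is left after it. */
static uint64_t sketch_parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);	/* base 0 accepts 0x... too */

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "start,size"; returns 0 on success and -1 on malformed input. */
static int sketch_parse_initrdmem(const char *arg, uint64_t *start,
				  uint64_t *size)
{
	char *end;

	*start = sketch_parse_size(arg, &end);
	if (*end != ',')
		return -1;
	*size = sketch_parse_size(end + 1, &end);
	return *end == '\0' ? 0 : -1;
}
```

Given the Atomic Pi example from the message, "0xff968000,0x200000" yields a start of 0xff968000 and a size of 0x200000.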
2020-04-26 04:10:21 +03:00
if ( ramdisk_image == 0 )
ramdisk_image = phys_initrd_start ;
2013-01-25 00:19:56 +04:00
return ramdisk_image ;
}
static u64 __init get_ramdisk_size ( void )
{
u64 ramdisk_size = boot_params . hdr . ramdisk_size ;
2013-01-29 08:16:44 +04:00
ramdisk_size |= ( u64 ) boot_params . ext_ramdisk_size << 32 ;
2020-04-26 04:10:21 +03:00
if ( ramdisk_size == 0 )
ramdisk_size = phys_initrd_size ;
2013-01-25 00:19:56 +04:00
return ramdisk_size ;
}
2008-06-26 04:49:26 +04:00
static void __init relocate_initrd ( void )
2008-01-30 15:32:51 +03:00
{
x86: Make sure free_init_pages() frees pages on page boundary
When CONFIG_NO_BOOTMEM=y, it could use memory more efficiently, or
in a more compact fashion.
Example:
Allocated new RAMDISK: 00ec2000 - 0248ce57
Move RAMDISK from 000000002ea04000 - 000000002ffcee56 to 00ec2000 - 0248ce56
The new RAMDISK's end is not page aligned.
Last page could be shared with other users.
When free_init_pages are called for initrd or .init, the page
could be freed and we could corrupt other data.
code segment in free_init_pages():
| for (; addr < end; addr += PAGE_SIZE) {
| ClearPageReserved(virt_to_page(addr));
| init_page_count(virt_to_page(addr));
| memset((void *)(addr & ~(PAGE_SIZE-1)),
| POISON_FREE_INITMEM, PAGE_SIZE);
| free_page(addr);
| totalram_pages++;
| }
last half page could be used as one whole free page.
So page align the boundaries.
-v2: make the original initramdisk to be aligned, according to
Johannes, otherwise we have the chance to lose one page.
we still need to keep initrd_end not aligned, otherwise it could
confuse decompressor.
-v3: change to WARN_ON instead, suggested by Johannes.
-v4: use PAGE_ALIGN, suggested by Johannes.
We may fix that macro name later to PAGE_ALIGN_UP, and PAGE_ALIGN_DOWN
Add comments about assuming ramdisk start is aligned
in relocate_initrd(), change to re get ramdisk_image instead of save it
to make diff smaller. Add warning for wrong range, suggested by Johannes.
-v6: remove one WARN()
We need to align beginning in free_init_pages()
do not copy more than ramdisk_size, noticed by Johannes
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1269830604-26214-3-git-send-email-yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
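The page-alignment fix described above comes down to rounding the end of the ramdisk up to the next page boundary before reserving or freeing it. The following is a minimal userspace sketch, assuming 4 KiB pages; the SKETCH_ macros, the struct and the function name are hypothetical stand-ins for the kernel's PAGE_SIZE and PAGE_ALIGN().

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

struct sketch_range {
	uint64_t start;	/* inclusive, assumed page aligned already */
	uint64_t end;	/* exclusive, rounded up to a page boundary */
};

/* Range that must be reserved so the last page is not shared with others. */
static struct sketch_range sketch_initrd_range(uint64_t image, uint64_t size)
{
	struct sketch_range r = { image, SKETCH_PAGE_ALIGN(image + size) };

	return r;
}
```

For the RAMDISK in the example log (start 0x00ec2000, end 0x0248ce57), the reserved range ends at 0x0248d000, so the partially used last page belongs entirely to the ramdisk.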
2010-03-29 06:42:55 +04:00
/* Assume only end is not page aligned */
2013-01-25 00:19:56 +04:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2010-03-29 06:42:55 +04:00
u64 area_size = PAGE_ALIGN ( ramdisk_size ) ;
2008-01-30 15:32:51 +03:00
2012-11-17 07:38:51 +04:00
/* We need to move the initrd down into directly mapped mem */
2020-10-14 02:58:12 +03:00
relocated_ramdisk = memblock_phys_alloc_range ( area_size , PAGE_SIZE , 0 ,
PFN_PHYS ( max_pfn_mapped ) ) ;
2013-12-04 23:50:42 +04:00
if ( ! relocated_ramdisk )
2008-05-25 21:00:09 +04:00
panic ( "Cannot find place for new RAMDISK of size %lld\n" ,
2013-12-04 23:50:42 +04:00
ramdisk_size ) ;
2008-05-25 21:00:09 +04:00
2013-12-04 23:50:42 +04:00
initrd_start = relocated_ramdisk + PAGE_OFFSET ;
2008-01-30 15:32:51 +03:00
initrd_end = initrd_start + ramdisk_size ;
2012-05-30 02:06:29 +04:00
printk ( KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n" ,
2013-12-04 23:50:42 +04:00
relocated_ramdisk , relocated_ramdisk + ramdisk_size - 1 ) ;
2008-01-30 15:32:51 +03:00
2015-09-09 01:03:07 +03:00
copy_from_early_mem ( ( void * ) initrd_start , ramdisk_image , ramdisk_size ) ;
2012-05-30 02:06:29 +04:00
printk ( KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n" ,
2008-05-22 05:40:18 +04:00
ramdisk_image , ramdisk_image + ramdisk_size - 1 ,
2013-12-04 23:50:42 +04:00
relocated_ramdisk , relocated_ramdisk + ramdisk_size - 1 ) ;
2008-06-26 04:49:26 +04:00
}
2008-06-14 07:07:03 +04:00
2013-01-25 00:19:55 +04:00
static void __init early_reserve_initrd ( void )
{
/* Assume only end is not page aligned */
2013-01-25 00:19:56 +04:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2013-01-25 00:19:55 +04:00
u64 ramdisk_end = PAGE_ALIGN ( ramdisk_image + ramdisk_size ) ;
if ( ! boot_params . hdr . type_of_loader ||
! ramdisk_image || ! ramdisk_size )
return ; /* No initrd provided by bootloader */
memblock_reserve ( ramdisk_image , ramdisk_end - ramdisk_image ) ;
}
2020-10-14 02:58:12 +03:00
2008-06-26 04:49:26 +04:00
static void __init reserve_initrd ( void )
{
2010-03-29 06:42:55 +04:00
/* Assume only end is not page aligned */
2013-01-25 00:19:56 +04:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2010-03-29 06:42:55 +04:00
u64 ramdisk_end = PAGE_ALIGN ( ramdisk_image + ramdisk_size ) ;
2008-06-26 04:49:26 +04:00
if ( ! boot_params . hdr . type_of_loader ||
! ramdisk_image || ! ramdisk_size )
return ; /* No initrd provided by bootloader */
initrd_start = 0 ;
2012-05-30 02:06:29 +04:00
printk ( KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n" , ramdisk_image ,
ramdisk_end - 1 ) ;
2008-06-26 04:49:26 +04:00
2012-11-17 07:38:53 +04:00
if ( pfn_range_is_mapped ( PFN_DOWN ( ramdisk_image ) ,
2012-11-17 07:38:51 +04:00
PFN_DOWN ( ramdisk_end ) ) ) {
/* All are mapped, easy case */
2008-06-26 04:49:26 +04:00
initrd_start = ramdisk_image + PAGE_OFFSET ;
initrd_end = initrd_start + ramdisk_size ;
return ;
}
relocate_initrd ( ) ;
2009-06-05 06:14:22 +04:00
2011-07-12 13:16:06 +04:00
memblock_free ( ramdisk_image , ramdisk_end - ramdisk_image ) ;
2008-01-30 15:32:51 +03:00
}
2016-04-11 05:13:27 +03:00
2008-06-22 13:46:58 +04:00
# else
2013-01-25 00:19:55 +04:00
static void __init early_reserve_initrd ( void )
{
}
2008-06-26 04:49:26 +04:00
static void __init reserve_initrd ( void )
2008-06-22 13:46:58 +04:00
{
}
2008-01-30 15:32:51 +03:00
# endif /* CONFIG_BLK_DEV_INITRD */
2008-06-26 05:00:22 +04:00
static void __init parse_setup_data ( void )
2008-06-26 04:56:22 +04:00
{
struct setup_data * data ;
2013-08-14 01:46:41 +04:00
u64 pa_data , pa_next ;
2008-06-26 04:56:22 +04:00
pa_data = boot_params . hdr . setup_data ;
while ( pa_data ) {
2015-01-07 13:55:48 +03:00
u32 data_len , data_type ;
2011-02-22 23:07:36 +03:00
2015-01-07 13:55:48 +03:00
data = early_memremap ( pa_data , sizeof ( * data ) ) ;
2011-02-22 23:07:36 +03:00
data_len = data - > len + sizeof ( struct setup_data ) ;
2013-08-14 01:46:41 +04:00
data_type = data - > type ;
pa_next = data - > next ;
2015-02-24 12:13:28 +03:00
early_memunmap ( data , sizeof ( * data ) ) ;
2011-02-22 23:07:36 +03:00
2013-08-14 01:46:41 +04:00
switch ( data_type ) {
2008-06-26 04:56:22 +04:00
case SETUP_E820_EXT :
2017-01-28 15:18:40 +03:00
e820__memory_setup_extended ( pa_data , data_len ) ;
2008-06-26 04:56:22 +04:00
break ;
2011-02-22 23:07:37 +03:00
case SETUP_DTB :
add_dtb ( pa_data ) ;
2008-06-26 04:56:22 +04:00
break ;
2013-12-20 14:02:19 +04:00
case SETUP_EFI :
parse_efi_setup ( pa_data , data_len ) ;
break ;
2008-06-26 04:56:22 +04:00
default :
break ;
}
2013-08-14 01:46:41 +04:00
pa_data = pa_next ;
2008-06-26 04:56:22 +04:00
}
}
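The walk in parse_setup_data() copies len, type and next out of each node before unmapping it, so nothing is dereferenced after early_memunmap(). The user-space sketch below illustrates that pattern under stated assumptions: demo_phys is a hypothetical byte array standing in for physical memory, struct demo_setup_data is a simplified stand-in for the header of struct setup_data, and "mapping" is modelled as a memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct setup_data (header only). */
struct demo_setup_data {
	uint64_t next;	/* "physical" address of the next node, 0 ends the list */
	uint32_t type;
	uint32_t len;	/* payload length, not counting this header */
};

/* Hypothetical stand-in for physical memory; offset 0 is never a valid node. */
static uint8_t demo_phys[256];

/* Helper for building chains: write a node header at offset pa. */
static void demo_put(uint64_t pa, uint64_t next, uint32_t type, uint32_t len)
{
	struct demo_setup_data hdr = { .next = next, .type = type, .len = len };

	assert(pa != 0 && pa <= sizeof(demo_phys) - sizeof(hdr));
	memcpy(&demo_phys[pa], &hdr, sizeof(hdr));
}

/*
 * Walk the chain starting at pa_data, counting nodes of type `want` and
 * summing header plus payload sizes of the matches into *total_len.
 */
static unsigned int demo_count_type(uint64_t pa_data, uint32_t want,
				    uint64_t *total_len)
{
	unsigned int hits = 0;

	assert(total_len != NULL);
	*total_len = 0;

	while (pa_data) {
		struct demo_setup_data hdr;
		uint64_t pa_next;

		if (pa_data > sizeof(demo_phys) - sizeof(hdr))
			break;	/* chain points outside the fake memory */

		/* "Map" the header by copying it out, then read every field. */
		memcpy(&hdr, &demo_phys[pa_data], sizeof(hdr));
		pa_next = hdr.next;
		/* The kernel unmaps at this point; only the copies are used below. */

		if (hdr.type == want) {
			hits++;
			*total_len += (uint64_t)hdr.len + sizeof(hdr);
		}
		pa_data = pa_next;
	}
	return hits;
}
```

The sketch does not guard against cyclic chains; neither does the loop above, which trusts the boot loader to terminate the list with a zero next pointer.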
2010-08-26 00:39:17 +04:00
static void __init memblock_x86_reserve_range_setup_data ( void )
2008-07-03 22:37:13 +04:00
{
struct setup_data * data ;
u64 pa_data ;
pa_data = boot_params . hdr . setup_data ;
while ( pa_data ) {
2008-09-08 02:21:16 +04:00
data = early_memremap ( pa_data , sizeof ( * data ) ) ;
2011-07-12 13:16:06 +04:00
memblock_reserve ( pa_data , sizeof ( * data ) + data - > len ) ;
2019-11-12 16:46:40 +03:00
if (data->type == SETUP_INDIRECT &&
((struct setup_indirect *)data->data)->type != SETUP_INDIRECT)
memblock_reserve(((struct setup_indirect *)data->data)->addr,
((struct setup_indirect *)data->data)->len);
2008-07-03 22:37:13 +04:00
pa_data = data - > next ;
2015-02-24 12:13:28 +03:00
early_memunmap ( data , sizeof ( * data ) ) ;
2008-07-03 22:37:13 +04:00
}
}
2008-06-26 04:57:13 +04:00
/*
* - - - - - - - - - Crashkernel reservation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
*/
2015-09-10 01:38:55 +03:00
# ifdef CONFIG_KEXEC_CORE
2008-06-26 23:54:08 +04:00
2015-10-19 12:17:44 +03:00
/* 16M alignment for crash kernel regions */
2019-04-21 06:50:59 +03:00
# define CRASH_ALIGN SZ_16M
2015-10-19 12:17:44 +03:00
2010-12-17 06:20:41 +03:00
/*
2019-05-24 10:38:10 +03:00
* Keep the crash kernel below this limit .
*
2019-11-18 18:03:39 +03:00
* Earlier 32-bit kernels would limit the kernel to the low 512 MB range
2019-05-24 10:38:10 +03:00
* due to mapping restrictions .
*
2019-11-18 18:03:39 +03:00
* 64 - bit kdump kernels need to be restricted to be under 64 TB , which is
2019-11-18 10:00:12 +03:00
* the upper limit of system RAM in 4 - level paging mode . Since the kdump
2019-11-18 18:03:39 +03:00
* jump could be from 5 - level paging to 4 - level paging , the jump will fail if
* the kernel is put above 64 TB , and during the 1 st kernel bootup there ' s
* no good way to detect the paging mode of the target kernel which will be
* loaded for dumping .
2010-12-17 06:20:41 +03:00
*/
# ifdef CONFIG_X86_32
2019-04-21 06:50:59 +03:00
# define CRASH_ADDR_LOW_MAX SZ_512M
# define CRASH_ADDR_HIGH_MAX SZ_512M
2010-12-17 06:20:41 +03:00
# else
2019-04-21 06:50:59 +03:00
# define CRASH_ADDR_LOW_MAX SZ_4G
2019-05-24 10:38:10 +03:00
# define CRASH_ADDR_HIGH_MAX SZ_64T
2010-12-17 06:20:41 +03:00
# endif
2015-10-19 12:17:41 +03:00
static int __init reserve_crashkernel_low ( void )
2013-01-25 00:20:11 +04:00
{
# ifdef CONFIG_X86_64
2015-10-19 12:17:45 +03:00
unsigned long long base , low_base = 0 , low_size = 0 ;
2020-10-14 02:58:16 +03:00
unsigned long low_mem_limit ;
2013-01-25 00:20:11 +04:00
int ret ;
2020-10-14 02:58:16 +03:00
low_mem_limit = min ( memblock_phys_mem_size ( ) , CRASH_ADDR_LOW_MAX ) ;
2015-10-19 12:17:43 +03:00
x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_high=X crashkernel_low=Y, as that could be extended.
-v2: according to Vivek, change delimiter to ;
-v3: let high and low only handle the simple form, so that it conforms to
the description in kernel-parameters.txt
still keep crashkernel=X overriding any crashkernel=X,high
crashkernel=Y,low
-v4: update the get_last_crashkernel return value and add stricter
checking in parse_crashkernel_simple(), found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
also separate parse_suffix from parse_simple according to Vivek,
so we can avoid @pos in that path.
-v6: Tighten the checking for crashkernel=X,highblahblah,high,
found by HATAYAMA.
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-16 09:23:48 +04:00
/* crashkernel=Y,low */
2020-10-14 02:58:16 +03:00
ret = parse_crashkernel_low ( boot_command_line , low_mem_limit , & low_size , & base ) ;
2015-10-19 12:17:45 +03:00
if ( ret ) {
2013-04-16 09:23:45 +04:00
/*
2019-06-19 17:19:55 +03:00
* Two parts from kernel/dma/swiotlb.c:
2015-06-10 18:49:42 +03:00
* - swiotlb size: user-specified with swiotlb= or default.
*
* - swiotlb overflow buffer: now hardcoded to 32k. We round it
* to 8M for other buffers that may need to stay low too. Also
* make sure we allocate enough extra low memory so that we
* don't run out of DMA buffers for 32-bit devices.
2013-04-16 09:23:45 +04:00
*/
2015-10-19 12:17:43 +03:00
low_size = max ( swiotlb_size_or_default ( ) + ( 8UL < < 20 ) , 256UL < < 20 ) ;
2013-04-16 09:23:45 +04:00
} else {
2013-04-16 09:23:48 +04:00
/* passed with crashkernel=0,low ? */
2013-04-16 09:23:45 +04:00
if ( ! low_size )
2015-10-19 12:17:41 +03:00
return 0 ;
2013-04-16 09:23:45 +04:00
}
2013-01-25 00:20:11 +04:00
2020-10-14 02:58:16 +03:00
low_base = memblock_phys_alloc_range ( low_size , CRASH_ALIGN , 0 , CRASH_ADDR_LOW_MAX ) ;
2013-01-25 00:20:11 +04:00
if ( ! low_base ) {
2015-10-19 12:17:41 +03:00
pr_err ( " Cannot reserve %ldMB crashkernel low memory, please try smaller size. \n " ,
( unsigned long ) ( low_size > > 20 ) ) ;
return - ENOMEM ;
2013-01-25 00:20:11 +04:00
}
2020-10-14 02:58:16 +03:00
pr_info ( " Reserving %ldMB of low memory at %ldMB for crashkernel (low RAM limit: %ldMB) \n " ,
2015-10-19 12:17:43 +03:00
( unsigned long ) ( low_size > > 20 ) ,
( unsigned long ) ( low_base > > 20 ) ,
2020-10-14 02:58:16 +03:00
( unsigned long ) ( low_mem_limit > > 20 ) ) ;
2015-10-19 12:17:43 +03:00
2013-01-25 00:20:11 +04:00
crashk_low_res . start = low_base ;
crashk_low_res . end = low_base + low_size - 1 ;
insert_resource ( & iomem_resource , & crashk_low_res ) ;
2010-12-17 06:20:41 +03:00
# endif
2015-10-19 12:17:41 +03:00
return 0 ;
2013-01-25 00:20:11 +04:00
}
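When no crashkernel=Y,low is given, reserve_crashkernel_low() sizes the low reservation as the larger of (swiotlb size + 8M) and 256M. A minimal sketch of that arithmetic follows; the function and macro names are hypothetical, and the swiotlb size is passed in as a plain parameter rather than obtained from swiotlb_size_or_default().

```c
#include <assert.h>
#include <limits.h>

#define DEMO_EXTRA_LOW	(8ULL << 20)	/* 8M allowance for other low buffers */
#define DEMO_MIN_LOW	(256ULL << 20)	/* 256M lower bound */

/* Default low reservation size in bytes for a given swiotlb size. */
static unsigned long long demo_default_low_size(unsigned long long swiotlb_bytes)
{
	unsigned long long want;

	assert(swiotlb_bytes <= ULLONG_MAX - DEMO_EXTRA_LOW);	/* no overflow */
	want = swiotlb_bytes + DEMO_EXTRA_LOW;
	return want > DEMO_MIN_LOW ? want : DEMO_MIN_LOW;
}
```

With a 64M swiotlb the 256M lower bound applies; the swiotlb term only takes over once the swiotlb itself exceeds 248M, for example 512M gives 520M.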
2010-12-17 06:20:41 +03:00
2008-06-26 05:00:22 +04:00
static void __init reserve_crashkernel ( void )
2008-06-26 04:57:13 +04:00
{
2015-10-19 12:17:45 +03:00
unsigned long long crash_size , crash_base , total_mem ;
2013-04-16 09:23:47 +04:00
bool high = false ;
2008-06-26 04:57:13 +04:00
int ret ;
2012-03-29 01:42:47 +04:00
total_mem = memblock_phys_mem_size ( ) ;
2008-06-26 04:57:13 +04:00
2013-04-16 09:23:47 +04:00
/* crashkernel=XM */
2015-10-19 12:17:43 +03:00
ret = parse_crashkernel ( boot_command_line , total_mem , & crash_size , & crash_base ) ;
2013-04-16 09:23:47 +04:00
if ( ret ! = 0 | | crash_size < = 0 ) {
2013-04-16 09:23:48 +04:00
/* crashkernel=X,high */
2013-04-16 09:23:47 +04:00
ret = parse_crashkernel_high ( boot_command_line , total_mem ,
2015-10-19 12:17:43 +03:00
& crash_size , & crash_base ) ;
2013-04-16 09:23:47 +04:00
if ( ret ! = 0 | | crash_size < = 0 )
return ;
high = true ;
}
2008-06-26 23:54:08 +04:00
2018-04-25 13:08:35 +03:00
if ( xen_pv_domain ( ) ) {
pr_info ( " Ignoring crashkernel for a Xen PV domain \n " ) ;
return ;
}
2008-06-26 23:54:08 +04:00
/* 0 means: find the address automatically */
2019-04-22 06:19:05 +03:00
if ( ! crash_base ) {
2010-10-06 03:05:14 +04:00
/*
2017-01-23 09:48:23 +03:00
* Set CRASH_ADDR_LOW_MAX upper bound for crash memory ,
2019-04-22 06:19:05 +03:00
* crashkernel=X,high reserves memory above 4G and also allocates
* 256M of extra low memory for DMA buffers and swiotlb.
* That extra memory is not required on all machines, so try low
* memory first and fall back to high memory unless
* "crashkernel=size[KMG],high" is specified.
2010-10-06 03:05:14 +04:00
*/
2019-04-22 06:19:05 +03:00
if ( ! high )
2020-10-14 02:58:16 +03:00
crash_base = memblock_phys_alloc_range ( crash_size ,
CRASH_ALIGN , CRASH_ALIGN ,
CRASH_ADDR_LOW_MAX ) ;
2019-04-22 06:19:05 +03:00
if ( ! crash_base )
2020-10-14 02:58:16 +03:00
crash_base = memblock_phys_alloc_range ( crash_size ,
CRASH_ALIGN , CRASH_ALIGN ,
CRASH_ADDR_HIGH_MAX ) ;
2011-07-12 11:58:09 +04:00
if ( ! crash_base ) {
2009-11-23 04:18:49 +03:00
pr_info ( " crashkernel reservation failed - No suitable area found. \n " ) ;
2008-06-26 04:57:13 +04:00
return ;
}
2008-06-26 23:54:08 +04:00
} else {
2009-11-23 04:18:49 +03:00
unsigned long long start ;
2020-10-14 02:58:16 +03:00
start = memblock_phys_alloc_range ( crash_size , SZ_1M , crash_base ,
crash_base + crash_size ) ;
2009-11-23 04:18:49 +03:00
if ( start ! = crash_base ) {
pr_info ( " crashkernel reservation failed - memory is in use. \n " ) ;
2008-06-26 04:57:13 +04:00
return ;
}
2008-06-26 23:54:08 +04:00
}
2008-06-26 04:57:13 +04:00
2015-10-19 12:17:41 +03:00
if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
memblock_free(crash_base, crash_size);
return;
}
2008-06-26 04:57:13 +04:00
2015-10-19 12:17:45 +03:00
pr_info ( " Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB) \n " ,
( unsigned long ) ( crash_size > > 20 ) ,
( unsigned long ) ( crash_base > > 20 ) ,
( unsigned long ) ( total_mem > > 20 ) ) ;
2008-06-26 04:57:13 +04:00
2008-06-26 23:54:08 +04:00
crashk_res . start = crash_base ;
crashk_res . end = crash_base + crash_size - 1 ;
insert_resource ( & iomem_resource , & crashk_res ) ;
2008-06-26 04:57:13 +04:00
}
# else
2008-06-26 05:00:22 +04:00
static void __init reserve_crashkernel ( void )
2008-06-26 04:57:13 +04:00
{
}
# endif
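The automatic placement in reserve_crashkernel() tries a range capped at CRASH_ADDR_LOW_MAX first (unless ",high" was requested) and then retries with CRASH_ADDR_HIGH_MAX. The sketch below models only that fallback order. It is an illustration, not memblock: demo_alloc_range() is a hypothetical allocator over a sorted array of free ranges, and its top-down search policy is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ALIGN	(16ULL << 20)	/* mirrors CRASH_ALIGN */
#define DEMO_LOW_MAX	(4ULL << 30)	/* mirrors the 64-bit CRASH_ADDR_LOW_MAX */
#define DEMO_HIGH_MAX	(64ULL << 40)	/* mirrors the 64-bit CRASH_ADDR_HIGH_MAX */

/* A free physical range [start, end); arrays are sorted by address. */
struct demo_region {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical top-down allocator: return the highest aligned base in
 * [lo_limit, hi_limit) where `size` bytes fit in a free region, or 0.
 */
static uint64_t demo_alloc_range(const struct demo_region *avail, size_t n,
				 uint64_t size, uint64_t align,
				 uint64_t lo_limit, uint64_t hi_limit)
{
	assert(size != 0 && align != 0);

	for (size_t i = n; i-- > 0; ) {
		uint64_t lo = avail[i].start > lo_limit ? avail[i].start : lo_limit;
		uint64_t hi = avail[i].end < hi_limit ? avail[i].end : hi_limit;
		uint64_t base;

		if (hi < lo || hi - lo < size)
			continue;
		base = (hi - size) / align * align;	/* align down */
		if (base >= lo)
			return base;
	}
	return 0;
}

/* Low-first placement with a fallback to the high limit. */
static uint64_t demo_pick_crash_base(const struct demo_region *avail, size_t n,
				     uint64_t size, bool high)
{
	uint64_t base = 0;

	if (!high)
		base = demo_alloc_range(avail, n, size, DEMO_ALIGN,
					DEMO_ALIGN, DEMO_LOW_MAX);
	if (!base)
		base = demo_alloc_range(avail, n, size, DEMO_ALIGN,
					DEMO_ALIGN, DEMO_HIGH_MAX);
	return base;
}
```

A base at or above 4G returned by the second attempt corresponds to the case where the function goes on to call reserve_crashkernel_low() for the extra low memory.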
2008-06-26 04:58:02 +04:00
static struct resource standard_io_resources [ ] = {
{ . name = " dma1 " , . start = 0x00 , . end = 0x1f ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " pic1 " , . start = 0x20 , . end = 0x21 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " timer0 " , . start = 0x40 , . end = 0x43 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " timer1 " , . start = 0x50 , . end = 0x53 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " keyboard " , . start = 0x60 , . end = 0x60 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " keyboard " , . start = 0x64 , . end = 0x64 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " dma page reg " , . start = 0x80 , . end = 0x8f ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " pic2 " , . start = 0xa0 , . end = 0xa1 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " dma2 " , . start = 0xc0 , . end = 0xdf ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " fpu " , . start = 0xf0 , . end = 0xff ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO }
} ;
2009-08-19 16:55:50 +04:00
void __init reserve_standard_io_resources ( void )
2008-06-26 04:58:02 +04:00
{
int i ;
/* request I/O space for devices used on all i[345]86 PCs */
for (i = 0; i < ARRAY_SIZE(standard_io_resources); i++)
request_resource(&ioport_resource, &standard_io_resources[i]);
}
2010-04-02 01:32:43 +04:00
static __init void reserve_ibft_region ( void )
{
unsigned long addr , size = 0 ;
addr = find_ibft_region ( & size ) ;
if ( size )
2011-07-12 13:16:06 +04:00
memblock_reserve ( addr , size ) ;
2010-04-02 01:32:43 +04:00
}
2012-11-15 00:43:31 +04:00
static bool __init snb_gfx_workaround_needed ( void )
{
2013-01-14 08:56:41 +04:00
# ifdef CONFIG_PCI
2012-11-15 00:43:31 +04:00
int i ;
u16 vendor , devid ;
2013-01-14 08:36:39 +04:00
static const __initconst u16 snb_ids [ ] = {
2012-11-15 00:43:31 +04:00
0x0102 ,
0x0112 ,
0x0122 ,
0x0106 ,
0x0116 ,
0x0126 ,
0x010a ,
} ;
/* Assume no if something weird is going on with PCI */
if ( ! early_pci_allowed ( ) )
return false ;
vendor = read_pci_config_16(0, 2, 0, PCI_VENDOR_ID);
if (vendor != 0x8086)
return false;
devid = read_pci_config_16(0, 2, 0, PCI_DEVICE_ID);
for (i = 0; i < ARRAY_SIZE(snb_ids); i++)
if (devid == snb_ids[i])
return true;
2013-01-14 08:56:41 +04:00
# endif
2012-11-15 00:43:31 +04:00
return false ;
}
/*
* Sandy Bridge graphics has trouble with certain ranges , exclude
* them from allocation .
*/
static void __init trim_snb_memory ( void )
{
2013-01-14 08:36:39 +04:00
static const __initconst unsigned long bad_pages [ ] = {
2012-11-15 00:43:31 +04:00
0x20050000 ,
0x20110000 ,
0x20130000 ,
0x20138000 ,
0x40004000 ,
} ;
int i ;
if ( ! snb_gfx_workaround_needed ( ) )
return ;
printk ( KERN_DEBUG " reserving inaccessible SNB gfx pages \n " ) ;
/*
2021-04-13 21:08:39 +03:00
* SandyBridge integrated graphics devices have a bug that prevents
* them from accessing certain memory ranges , namely anything below
* 1 M and in the pages listed in bad_pages [ ] above .
*
2021-06-01 10:53:52 +03:00
* To keep SNB gfx devices from ever accessing these pages, reserve
* any bad_pages that have not already been reserved at boot time.
* All memory below the 1 MB mark is reserved later during
* setup_arch() anyway, so there is no need to reserve it here.
2012-11-15 00:43:31 +04:00
*/
2021-04-13 21:08:39 +03:00
2012-11-15 00:43:31 +04:00
for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
if (memblock_reserve(bad_pages[i], PAGE_SIZE))
printk(KERN_WARNING "failed to reserve 0x%08lx\n",
bad_pages[i]);
}
}
2010-01-22 06:21:04 +03:00
static void __init trim_bios_range ( void )
{
2021-02-04 21:12:37 +03:00
/*
* A special case is the first 4 Kb of memory ;
* This is a BIOS owned area , not kernel ram , but generally
* not listed as such in the E820 table .
*
* This typically reserves additional memory ( 64 KiB by default )
* since some BIOSes are known to corrupt low memory . See the
* Kconfig help text for X86_RESERVE_LOW .
*/
e820__range_update ( 0 , PAGE_SIZE , E820_TYPE_RAM , E820_TYPE_RESERVED ) ;
2010-01-22 06:21:04 +03:00
/*
2019-11-18 10:00:12 +03:00
* Special case: some BIOSes report the PC BIOS
* area (640 KB -> 1 MB) as RAM even though it is not.
2010-01-22 06:21:04 +03:00
* Take it out.
*/
2017-01-28 19:09:33 +03:00
e820__range_remove ( BIOS_BEGIN , BIOS_END - BIOS_BEGIN , E820_TYPE_RAM , 1 ) ;
2012-11-15 00:43:31 +04:00
x86/boot/e820: Simplify the e820__update_table() interface
The e820__update_table() parameters are pretty complex:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);
But 90% of the usage is trivial:
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries))
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries) < 0)
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
arch/x86/xen/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
as it only uses an existing struct e820_table's entries array, its size and
its current number of entries as input and output arguments.
Only one use is non-trivial:
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
... which call updates the E820 table in the zeropage in-situ, and the layout there does not
match that of 'struct e820_table' (in particular nr_entries is at a different offset,
hardcoded by the boot protocol).
Simplify all this by introducing a low level __e820__update_table() API that
the zeropage update call can use, and simplifying the main e820__update_table()
call signature down to:
int e820__update_table(struct e820_table *table);
This visibly simplifies all the call sites:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_table *table);
arch/x86/include/asm/e820/types.h: * call to e820__update_table() to remove duplicates. The allowance
arch/x86/kernel/e820.c: * The return value from e820__update_table() is zero if it
arch/x86/kernel/e820.c:int __init e820__update_table(struct e820_table *table)
arch/x86/kernel/e820.c: if (e820__update_table(e820_table))
arch/x86/kernel/e820.c: e820__update_table(e820_table_firmware);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table) < 0)
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
arch/x86/xen/setup.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 20:00:35 +03:00
e820__update_table ( e820_table ) ;
2010-01-22 06:21:04 +03:00
}
2013-01-25 00:19:45 +04:00
/* Called before trim_bios_range() to avoid an extra sanitize pass */
static void __init e820_add_kernel_range ( void )
{
u64 start = __pa_symbol ( _text ) ;
u64 size = __pa_symbol ( _end ) - start ;
/*
2017-01-28 19:09:33 +03:00
* Complain if . text . data and . bss are not marked as E820_TYPE_RAM and
2013-01-25 00:19:45 +04:00
* attempt to fix it by adding the range. We may have a confused BIOS,
* or the user may have used memmap=exactmap or memmap=xxM$yyM to
* exclude the kernel range. If we really are running on top of non-RAM,
* we will crash later anyway.
*/
2017-01-28 19:09:33 +03:00
if ( e820__mapped_all ( start , start + size , E820_TYPE_RAM ) )
2013-01-25 00:19:45 +04:00
return ;
2017-01-28 19:09:33 +03:00
pr_warn ( " .text .data .bss are not marked as E820_TYPE_RAM! \n " ) ;
e820__range_remove ( start , size , E820_TYPE_RAM , 0 ) ;
e820__range_add ( start , size , E820_TYPE_RAM ) ;
2013-01-25 00:19:45 +04:00
}
2021-03-02 13:04:05 +03:00
static void __init early_reserve_memory ( void )
2013-02-15 02:02:52 +04:00
{
2021-03-02 13:04:05 +03:00
/*
* Reserve the memory occupied by the kernel between _text and
* __end_of_kernel_reserve symbols . Any kernel sections after the
* __end_of_kernel_reserve symbol must be explicitly reserved with a
* separate memblock_reserve ( ) or they will be discarded .
*/
memblock_reserve ( __pa_symbol ( _text ) ,
( unsigned long ) __end_of_kernel_reserve - ( unsigned long ) _text ) ;
/*
2021-03-02 13:04:06 +03:00
* The first 4 Kb of memory is a BIOS owned area , but generally it is
* not listed as such in the E820 table .
*
2021-06-01 10:53:52 +03:00
* Reserve the first 64 K of memory since some BIOSes are known to
* corrupt low memory . After the real mode trampoline is allocated the
* rest of the memory below 640 k is reserved .
2021-03-02 13:04:06 +03:00
*
* In addition , make sure page 0 is always reserved because on
* systems with L1TF its contents can be leaked to user processes .
2021-03-02 13:04:05 +03:00
*/
2021-06-01 10:53:52 +03:00
memblock_reserve ( 0 , SZ_64K ) ;
2021-03-02 13:04:05 +03:00
early_reserve_initrd ( ) ;
if ( efi_enabled ( EFI_BOOT ) )
efi_memblock_x86_reserve_range ( ) ;
memblock_x86_reserve_range_setup_data ( ) ;
reserve_ibft_region ( ) ;
reserve_bios_regions ( ) ;
2021-06-01 10:53:52 +03:00
trim_snb_memory ( ) ;
2013-02-15 02:02:52 +04:00
}
2021-03-02 13:04:05 +03:00
2013-10-11 04:18:17 +04:00
/*
* Dump out kernel offset information on panic .
*/
static int
dump_kernel_offset ( struct notifier_block * self , unsigned long v , void * p )
{
2015-04-01 13:49:52 +03:00
if ( kaslr_enabled ( ) ) {
pr_emerg ( " Kernel Offset: 0x%lx from 0x%lx (relocation range: 0x%lx-0x%lx) \n " ,
2015-04-27 14:17:19 +03:00
kaslr_offset ( ) ,
2015-04-01 13:49:52 +03:00
__START_KERNEL ,
__START_KERNEL_map ,
MODULES_VADDR - 1 ) ;
} else {
pr_emerg ( " Kernel Offset: disabled \n " ) ;
}
2013-10-11 04:18:17 +04:00
return 0 ;
}
2005-04-17 02:20:36 +04:00
/*
* Determine if we were loaded by an EFI loader . If so , then we have also been
* passed the efi memmap , systab , etc . , so we should use these data structures
* for initialization . Note , the efi init code path is determined by the
* global efi_enabled . This allows the same kernel image to be used on existing
* systems ( with a traditional BIOS ) as well as on EFI systems .
*/
2008-06-26 04:52:35 +04:00
/*
* setup_arch - architecture - specific boot - time initializations
*
* Note : On x86_64 , fixmaps are ready for use even before this is called .
*/
2005-04-17 02:20:36 +04:00
void __init setup_arch ( char * * cmdline_p )
{
2008-06-26 04:52:35 +04:00
# ifdef CONFIG_X86_32
2005-04-17 02:20:36 +04:00
memcpy ( & boot_cpu_data , & new_cpu_data , sizeof ( new_cpu_data ) ) ;
2010-08-28 17:58:33 +04:00
/*
* copy kernel address range established so far and switch
* to the proper swapper page table
*/
clone_pgd_range ( swapper_pg_dir + KERNEL_PGD_BOUNDARY ,
initial_page_table + KERNEL_PGD_BOUNDARY ,
KERNEL_PGD_PTRS ) ;
load_cr3 ( swapper_pg_dir ) ;
2014-10-07 04:19:48 +04:00
/*
* Note : Quark X1000 CPUs advertise PGE incorrectly and require
* a cr3 based tlb flush , so the following __flush_tlb_all ( )
2019-11-18 18:03:39 +03:00
* will not flush anything because the CPU quirk which clears
2014-10-07 04:19:48 +04:00
* X86_FEATURE_PGE has not been invoked yet . Though due to the
* load_cr3 ( ) above the TLB has been flushed already . The
* quirk is invoked before subsequent calls to __flush_tlb_all ( )
* so proper operation is guaranteed .
*/
2010-08-28 17:58:33 +04:00
__flush_tlb_all ( ) ;
2008-06-26 04:52:35 +04:00
# else
printk ( KERN_INFO " Command line: %s \n " , boot_command_line ) ;
2018-02-14 14:16:54 +03:00
boot_cpu_data . x86_phys_bits = MAX_PHYSMEM_BITS ;
2008-06-26 04:52:35 +04:00
# endif
2005-04-17 02:20:36 +04:00
2010-08-24 01:49:11 +04:00
/*
* If we have OLPC OFW , we might end up relocating the fixmap due to
* reserve_top ( ) , so do this before touching the ioremap area .
*/
2010-06-19 01:46:53 +04:00
olpc_ofw_detect ( ) ;
2017-08-28 09:47:50 +03:00
idt_setup_early_traps ( ) ;
2008-07-22 03:49:54 +04:00
early_cpu_init ( ) ;
2018-07-19 23:55:28 +03:00
jump_label_init ( ) ;
2020-08-18 16:57:51 +03:00
static_call_init ( ) ;
2008-06-30 07:02:44 +04:00
early_ioremap_init ( ) ;
2010-06-19 01:46:53 +04:00
setup_olpc_ofw_pgd ( ) ;
2007-10-16 04:13:22 +04:00
ROOT_DEV = old_decode_dev ( boot_params . hdr . root_dev ) ;
screen_info = boot_params . screen_info ;
edid_info = boot_params . edid_info ;
2008-06-26 04:52:35 +04:00
# ifdef CONFIG_X86_32
2007-10-16 04:13:22 +04:00
apm_info . bios = boot_params . apm_bios_info ;
ist_info = boot_params . ist_info ;
2008-06-26 04:52:35 +04:00
# endif
saved_video_mode = boot_params . hdr . vid_mode ;
2007-10-16 04:13:22 +04:00
bootloader_type = boot_params . hdr . type_of_loader ;
2009-05-08 03:54:11 +04:00
if ((bootloader_type >> 4) == 0xe) {
bootloader_type &= 0xf;
bootloader_type |= (boot_params.hdr.ext_loader_type + 0x10) << 4;
}
bootloader_version = bootloader_type & 0xf;
bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_BLK_DEV_RAM
2007-10-16 04:13:22 +04:00
rd_image_start = boot_params . hdr . ram_size & RAMDISK_IMAGE_START_MASK ;
2005-04-17 02:20:36 +04:00
# endif
2008-06-24 06:53:33 +04:00
# ifdef CONFIG_EFI
if ( ! strncmp ( ( char * ) & boot_params . efi_info . efi_loader_signature ,
2014-06-30 21:53:03 +04:00
EFI32_LOADER_SIGNATURE , 4 ) ) {
2014-01-15 17:21:22 +04:00
set_bit ( EFI_BOOT , & efi . flags ) ;
2012-02-13 01:24:29 +04:00
} else if ( ! strncmp ( ( char * ) & boot_params . efi_info . efi_loader_signature ,
2014-06-30 21:53:03 +04:00
EFI64_LOADER_SIGNATURE , 4 ) ) {
2014-01-15 17:21:22 +04:00
set_bit ( EFI_BOOT , & efi . flags ) ;
set_bit ( EFI_64BIT , & efi . flags ) ;
2008-06-24 06:53:33 +04:00
}
# endif
2009-08-20 15:04:10 +04:00
x86_init . oem . arch_setup ( ) ;
2008-01-30 15:31:19 +03:00
2010-10-27 01:41:49 +04:00
iomem_resource . end = ( 1ULL < < boot_cpu_data . x86_phys_bits ) - 1 ;
2017-01-28 11:58:49 +03:00
e820__memory_setup ( ) ;
2008-07-01 03:20:54 +04:00
parse_setup_data ( ) ;
2005-04-17 02:20:36 +04:00
copy_edd ( ) ;
2007-10-16 04:13:22 +04:00
if ( ! boot_params . hdr . root_flags )
2005-04-17 02:20:36 +04:00
root_mountflags & = ~ MS_RDONLY ;
init_mm . start_code = ( unsigned long ) _text ;
init_mm . end_code = ( unsigned long ) _etext ;
init_mm . end_data = ( unsigned long ) _edata ;
x86: add brk allocation for very, very early allocations
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss are reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-27 04:35:44 +03:00
init_mm . brk = _brk_end ;
x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to enable or disable the
management of bounds tables in the kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
#BR exceptions are similar conceptually to a page fault and will be raised by
the MPX hardware both on bounds violations and when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal process's
address space (essentially calling the new mmap() interface introduced
earlier in this patch set) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-populated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
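The preallocation arithmetic in the FAQ above can be checked with a small sketch. It only encodes the two figures the commit message states (4 bytes of bounds-table space per byte of covered address space, plus a 2GB bounds directory). The names `DEMO_BT_RATIO`, `DEMO_BD_SIZE` and `demo_mpx_reserve_bytes()` are hypothetical and are not part of the kernel.

```c
#include <stdint.h>

/* Figures taken from the commit message; the names are illustrative only. */
#define DEMO_BT_RATIO UINT64_C(4)          /* bounds-table bytes per covered byte */
#define DEMO_BD_SIZE  (UINT64_C(2) << 30)  /* 2GB bounds directory */

/*
 * Virtual space that would have to be reserved up front to cover
 * 'covered' bytes of user address space with bounds tables.
 */
static uint64_t demo_mpx_reserve_bytes(uint64_t covered)
{
    return covered * DEMO_BT_RATIO + DEMO_BD_SIZE;
}
```

For the 128TB user address space mentioned above, `demo_mpx_reserve_bytes(UINT64_C(128) << 40)` gives 512TB plus 2GB, which is the number the FAQ uses to rule out reserving the tables ahead of time.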
2014-11-14 18:18:29 +03:00
2016-04-14 21:18:57 +03:00
code_resource . start = __pa_symbol ( _text ) ;
code_resource . end = __pa_symbol ( _etext ) - 1 ;
2019-10-30 00:13:50 +03:00
rodata_resource . start = __pa_symbol ( __start_rodata ) ;
rodata_resource . end = __pa_symbol ( __end_rodata ) - 1 ;
data_resource . start = __pa_symbol ( _sdata ) ;
2016-04-14 21:18:57 +03:00
data_resource . end = __pa_symbol ( _edata ) - 1 ;
bss_resource . start = __pa_symbol ( __bss_start ) ;
bss_resource . end = __pa_symbol ( __bss_stop ) - 1 ;
2008-08-12 23:52:36 +04:00
# ifdef CONFIG_CMDLINE_BOOL
# ifdef CONFIG_CMDLINE_OVERRIDE
strlcpy ( boot_command_line , builtin_cmdline , COMMAND_LINE_SIZE ) ;
# else
if ( builtin_cmdline [ 0 ] ) {
/* append boot loader cmdline to builtin */
strlcat ( builtin_cmdline , " " , COMMAND_LINE_SIZE ) ;
strlcat ( builtin_cmdline , boot_command_line , COMMAND_LINE_SIZE ) ;
strlcpy ( boot_command_line , builtin_cmdline , COMMAND_LINE_SIZE ) ;
}
# endif
# endif
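The built-in command line handling above can be modelled in userspace as a rough sketch. `demo_strlcpy()` and `demo_strlcat()` are hypothetical stand-ins for the kernel's bounded string helpers, and `demo_build_cmdline()` with its `override` flag is an assumed wrapper that plays the role of the CONFIG_CMDLINE_OVERRIDE preprocessor branch. Both buffers are assumed to be `size` bytes long.

```c
#include <stddef.h>
#include <string.h>

/* Bounded copy: always NUL-terminates when size > 0, returns strlen(src). */
static size_t demo_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size) {
        size_t n = len >= size ? size - 1 : len;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Bounded append: never writes past dst[size - 1]. */
static size_t demo_strlcat(char *dst, const char *src, size_t size)
{
    size_t dlen = 0;

    while (dlen < size && dst[dlen])
        dlen++;
    if (dlen == size)
        return size + strlen(src);
    return dlen + demo_strlcpy(dst + dlen, src, size - dlen);
}

/*
 * override != 0: the built-in string replaces the boot loader string.
 * override == 0: a non-empty built-in string is prepended to the boot
 *                loader string, separated by a single space.
 */
static void demo_build_cmdline(char *boot, char *builtin, size_t size, int override)
{
    if (override) {
        demo_strlcpy(boot, builtin, size);
    } else if (builtin[0]) {
        demo_strlcat(builtin, " ", size);
        demo_strlcat(builtin, boot, size);
        demo_strlcpy(boot, builtin, size);
    }
}
```

With a built-in string of "console=ttyS0" and a boot loader string of "root=/dev/sda1", the non-override path leaves "console=ttyS0 root=/dev/sda1" in the boot buffer, so built-in options come first.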
2009-09-19 22:07:57 +04:00
strlcpy ( command_line , boot_command_line , COMMAND_LINE_SIZE ) ;
* cmdline_p = command_line ;
/*
2009-11-14 02:28:17 +03:00
* x86_configure_nx ( ) is called before parse_early_param ( ) to detect
* whether hardware doesn ' t support NX ( so that the early EHCI debug
* console setup can safely call set_fixmap ( ) ) . It may then be called
* again from within noexec_setup ( ) during parsing early parameters
* to honor the respective command line option .
2009-09-19 22:07:57 +04:00
*/
2009-11-14 02:28:16 +03:00
x86_configure_nx ( ) ;
2009-09-19 22:07:57 +04:00
parse_early_param ( ) ;
2021-03-02 13:04:05 +03:00
/*
* Do some memory reservations * before * memory is added to
* memblock , so memblock allocations won ' t overwrite it .
* Do it after early param , so we could get ( unlikely ) panic from
* serial .
*
* After this point everything still needed from the boot loader or
* firmware or kernel text should be early reserved or marked not
* RAM in e820 . All other memory is free game .
*/
early_reserve_memory ( ) ;
mm: remove x86-only restriction of movable_node
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:
1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().
Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.
This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced. (1) will happen for them, but (2) will not.
This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable. Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.
So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.
Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
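The effect of the direction toggle discussed above can be illustrated with a toy allocator. This is a simplified model with a single free range, not memblock's real implementation. `struct demo_range`, `demo_set_bottom_up()` and `demo_alloc()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* One free physical range [start, end); a stand-in for memblock's region list. */
struct demo_range {
    uint64_t start;
    uint64_t end;
};

static bool demo_bottom_up;    /* false: top-down (default), true: bottom-up */

static void demo_set_bottom_up(bool enable)
{
    demo_bottom_up = enable;
}

/*
 * Carve 'size' bytes out of the range and return the base address,
 * or 0 if the request cannot be satisfied. (0 doubles as the error
 * value, so this sketch assumes the range does not start at 0.)
 */
static uint64_t demo_alloc(struct demo_range *r, uint64_t size)
{
    uint64_t base;

    if (size == 0 || r->end - r->start < size)
        return 0;

    if (demo_bottom_up) {
        base = r->start;       /* lowest free address, near the kernel image */
        r->start += size;
    } else {
        r->end -= size;        /* highest free address */
        base = r->end;
    }
    return base;
}
```

In bottom-up mode successive allocations grow upward from the low end of the range, which is the behaviour the movable_node option relies on to keep early allocations close to the kernel image.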
2016-12-13 03:42:55 +03:00
# ifdef CONFIG_MEMORY_HOTPLUG
/*
* Memory used by the kernel cannot be hot - removed because Linux
* cannot migrate the kernel pages . When memory hotplug is
* enabled , we should prevent memblock from allocating memory
* for the kernel .
*
* ACPI SRAT records all hotpluggable memory ranges . But before
* SRAT is parsed , we don ' t know about it .
*
* The kernel image is loaded into memory at very early time . We
* cannot prevent this anyway . So on NUMA system , we set any
* node the kernel resides in as un - hotpluggable .
*
* Since on modern servers , one node could have double - digit
* gigabytes memory , we can assume the memory around the kernel
* image is also un - hotpluggable . So before SRAT is parsed , just
* allocate memory near the kernel image to try the best to keep
* the kernel away from hotpluggable memory .
*/
if ( movable_node_is_enabled ( ) )
memblock_set_bottom_up ( true ) ;
# endif
2009-11-14 02:28:17 +03:00
x86_report_nx ( ) ;
2008-09-12 03:42:00 +04:00
2008-06-26 04:52:35 +04:00
if ( acpi_mps_check ( ) ) {
2008-06-24 00:19:22 +04:00
# ifdef CONFIG_X86_LOCAL_APIC
2008-06-26 04:52:35 +04:00
disable_apic = 1 ;
2008-06-24 00:19:22 +04:00
# endif
2008-07-21 22:21:43 +04:00
setup_clear_cpu_cap ( X86_FEATURE_APIC ) ;
2008-06-21 03:11:20 +04:00
}
2017-01-29 00:27:28 +03:00
e820__reserve_setup_data ( ) ;
2017-01-28 15:37:17 +03:00
e820__finish_early_params ( ) ;
2006-09-26 12:52:32 +04:00
2012-11-14 13:42:35 +04:00
if ( efi_enabled ( EFI_BOOT ) )
2009-03-04 05:55:31 +03:00
efi_init ( ) ;
2019-03-28 22:34:28 +03:00
dmi_setup ( ) ;
2008-09-22 13:52:26 +04:00
2008-10-27 20:41:46 +03:00
/*
* VMware detection requires dmi to be available , so this
2019-03-28 22:34:28 +03:00
* needs to be done after dmi_setup ( ) , for the boot CPU .
2008-10-27 20:41:46 +03:00
*/
2009-08-20 19:06:25 +04:00
init_hypervisor_platform ( ) ;
2008-10-27 20:41:46 +03:00
2018-07-19 23:55:38 +03:00
tsc_early_init ( ) ;
2009-08-19 16:43:56 +04:00
x86_init . resources . probe_roms ( ) ;
2008-06-17 00:03:31 +04:00
2016-04-14 21:18:57 +03:00
/* after parse_early_param, so could debug it */
insert_resource ( & iomem_resource , & code_resource ) ;
2019-10-30 00:13:50 +03:00
insert_resource ( & iomem_resource , & rodata_resource ) ;
2016-04-14 21:18:57 +03:00
insert_resource ( & iomem_resource , & data_resource ) ;
insert_resource ( & iomem_resource , & bss_resource ) ;
2013-01-25 00:19:45 +04:00
e820_add_kernel_range ( ) ;
2010-01-22 06:21:04 +03:00
trim_bios_range ( ) ;
2008-06-26 04:52:35 +04:00
# ifdef CONFIG_X86_32
2008-06-17 03:11:08 +04:00
if ( ppro_with_ram_bug ( ) ) {
2017-01-28 19:09:33 +03:00
e820__range_update ( 0x70000000ULL , 0x40000ULL , E820_TYPE_RAM ,
E820_TYPE_RESERVED ) ;
x86/boot/e820: Simplify the e820__update_table() interface
The e820__update_table() parameters are pretty complex:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);
But 90% of the usage is trivial:
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries))
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries) < 0)
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
arch/x86/xen/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
as it only uses an existing struct e820_table's entries array, its size and
its current number of entries as input and output arguments.
Only one use is non-trivial:
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
... which call updates the E820 table in the zeropage in-situ, and the layout there does not
match that of 'struct e820_table' (in particular nr_entries is at a different offset,
hardcoded by the boot protocol).
Simplify all this by introducing a low level __e820__update_table() API that
the zeropage update call can use, and simplifying the main e820__update_table()
call signature down to:
int e820__update_table(struct e820_table *table);
This visibly simplifies all the call sites:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_table *table);
arch/x86/include/asm/e820/types.h: * call to e820__update_table() to remove duplicates. The allowance
arch/x86/kernel/e820.c: * The return value from e820__update_table() is zero if it
arch/x86/kernel/e820.c:int __init e820__update_table(struct e820_table *table)
arch/x86/kernel/e820.c: if (e820__update_table(e820_table))
arch/x86/kernel/e820.c: e820__update_table(e820_table_firmware);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table) < 0)
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
arch/x86/xen/setup.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
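The API split described in this commit message (a low-level function taking an entry array, its capacity and a count pointer, plus a thin wrapper taking the whole table) can be sketched as below. All names are hypothetical, and the body only drops empty entries and sorts by address. It is a placeholder for, not a copy of, the kernel's overlap resolution.

```c
#include <stdint.h>

#define DEMO_MAX_ENTRIES 8

struct demo_entry {
    uint64_t addr;
    uint64_t size;
};

struct demo_table {
    uint32_t nr_entries;
    struct demo_entry entries[DEMO_MAX_ENTRIES];
};

/*
 * Low-level form: works on any entry array with a separately stored
 * count, which is what a caller with a different layout would need.
 */
static int demo_update_entries(struct demo_entry *entries, uint32_t max_nr,
                               uint32_t *pnr)
{
    uint32_t i, j, n = 0;

    if (*pnr > max_nr)
        return -1;

    /* Drop zero-sized entries. */
    for (i = 0; i < *pnr; i++)
        if (entries[i].size)
            entries[n++] = entries[i];

    /* Insertion sort by start address. */
    for (i = 1; i < n; i++) {
        struct demo_entry key = entries[i];

        for (j = i; j > 0 && entries[j - 1].addr > key.addr; j--)
            entries[j] = entries[j - 1];
        entries[j] = key;
    }

    *pnr = n;
    return 0;
}

/* Simple form: the common case, where everything lives in one table. */
static int demo_update_table(struct demo_table *table)
{
    return demo_update_entries(table->entries, DEMO_MAX_ENTRIES,
                               &table->nr_entries);
}
```

Most callers then only pass the table pointer, while a caller whose count lives at a different offset can still use the low-level form directly.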
2017-01-28 20:00:35 +03:00
e820__update_table ( e820_table ) ;
2008-06-17 03:11:08 +04:00
printk ( KERN_INFO " fixed physical RAM map: \n " ) ;
2017-01-28 16:24:02 +03:00
e820__print_table ( " bad_ppro " ) ;
2008-06-17 03:11:08 +04:00
}
2008-06-26 04:52:35 +04:00
# else
early_gart_iommu_check ( ) ;
# endif
2008-06-17 03:11:08 +04:00
2008-06-04 06:35:04 +04:00
/*
* partially used pages are not usable - thus
* we are rounding upwards :
*/
2017-01-29 00:52:16 +03:00
max_pfn = e820__end_of_ram_pfn ( ) ;
2008-06-04 06:35:04 +04:00
2008-01-30 15:33:32 +03:00
/* update e820 for memory not covered by WB MTRRs */
mtrr_bp_init ( ) ;
2008-07-09 05:56:38 +04:00
if ( mtrr_trim_uncached_memory ( max_pfn ) )
2017-01-29 00:52:16 +03:00
max_pfn = e820__end_of_ram_pfn ( ) ;
2008-03-23 10:16:49 +03:00
2015-12-04 16:07:05 +03:00
max_possible_pfn = max_pfn ;
2017-07-05 02:04:23 +03:00
/*
* This call is required when the CPU does not support PAT . If
* mtrr_bp_init ( ) invoked it already via pat_init ( ) the call has no
* effect .
*/
init_cache_modes ( ) ;
2016-08-09 20:11:04 +03:00
/*
* Define random base addresses for memory sections after max_pfn is
* defined and before each memory section base is used .
*/
kernel_randomize_memory ( ) ;
2008-06-26 04:52:35 +04:00
# ifdef CONFIG_X86_32
2008-06-24 23:18:14 +04:00
/* max_low_pfn gets updated here */
2008-06-23 14:05:30 +04:00
find_low_pfn_range ( ) ;
2008-06-26 04:52:35 +04:00
# else
2009-02-17 04:29:58 +03:00
check_x2apic ( ) ;
2008-06-26 04:52:35 +04:00
/* How many end-of-memory variables you have, grandma! */
/* need this before calling reserve_initrd */
2008-07-11 07:38:26 +04:00
if ( max_pfn > ( 1UL < < ( 32 - PAGE_SHIFT ) ) )
2017-01-29 00:52:16 +03:00
max_low_pfn = e820__end_of_low_ram_pfn ( ) ;
2008-07-11 07:38:26 +04:00
else
max_low_pfn = max_pfn ;
2008-06-26 04:52:35 +04:00
high_memory = ( void * ) __va ( max_pfn * PAGE_SIZE - 1 ) + 1 ;
2008-09-07 12:51:32 +04:00
# endif
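The 64-bit branch above compares max_pfn against the page frame number of the 4GB boundary. A minimal sketch of that decision follows, assuming 4KB pages (a page shift of 12). `DEMO_PAGE_SHIFT`, `DEMO_PFN_4G` and `demo_max_low_pfn()` are hypothetical names, and the end of low RAM is passed in as a parameter instead of being read from the E820 table.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                                    /* assumed 4KB pages */
#define DEMO_PFN_4G     (UINT64_C(1) << (32 - DEMO_PAGE_SHIFT)) /* first pfn at 4GB */

/*
 * If RAM extends above 4GB, the low limit is the end of RAM below 4GB
 * (supplied by the caller here); otherwise it is simply max_pfn.
 */
static uint64_t demo_max_low_pfn(uint64_t max_pfn, uint64_t low_ram_end_pfn)
{
    if (max_pfn > DEMO_PFN_4G)
        return low_ram_end_pfn;
    return max_pfn;
}
```

With 4KB pages the boundary is pfn 0x100000, so a machine whose RAM ends at pfn 0x80000 (2GB) keeps max_low_pfn equal to max_pfn.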
2009-12-11 00:07:22 +03:00
/*
* Find and reserve possible boot - time SMP configuration :
*/
find_smp_config ( ) ;
2012-11-17 07:38:58 +04:00
early_alloc_pgt_buf ( ) ;
2010-08-26 00:39:17 +04:00
/*
x86/boot/e820: Rename memblock_x86_fill() to e820__memblock_setup() and improve the explanations
So memblock_x86_fill() is another E820 code misnomer:
- nothing in its name tells us that it's part of the E820 subsystem ...
- The 'fill' wording is ambiguous and doesn't tell us whether it's a single
entry or some process - while the _real_ purpose of the function is hidden,
which is to do a complete setup of the (platform independent) memblock regions.
So rename it accordingly, to e820__memblock_setup().
Also translate this incomprehensible and misleading comment:
/*
* EFI may have more than 128 entries
* We are safe to enable resizing, beause memblock_x86_fill()
* is rather later for x86
*/
memblock_allow_resize();
The worst aspect of this comment isn't even the sloppy typos, but that it
casually mentions a '128' number with no explanation, which leads one
to assume that this is related to the well-known limit of a maximum
of 128 E820 entries passed via legacy bootloaders.
But no, the _real_ meaning of 128 here is that of the memblock subsystem,
which too happens to have a 128 entries limit for very early memblock
regions (which is unrelated to E820), via INIT_MEMBLOCK_REGIONS ...
So change the comment to a more comprehensible version:
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
*
* This is safe, because this call happens pretty late during x86 setup,
* so we know about reserved memory regions already. (This is important
* so that memblock resizing does not stomp over reserved areas.)
*/
memblock_allow_resize();
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 13:37:42 +03:00
* Need to conclude brk , before e820__memblock_setup ( )
2021-03-11 11:39:19 +03:00
* it could use memblock_find_in_range , could overlap with
* brk area .
2010-08-26 00:39:17 +04:00
*/
reserve_brk ( ) ;
2011-02-18 14:30:30 +03:00
cleanup_highmap ( ) ;
2013-08-14 07:44:04 +04:00
memblock_set_current_limit ( ISA_END_ADDRESS ) ;
2017-01-28 13:37:42 +03:00
e820__memblock_setup ( ) ;
2010-08-26 00:39:17 +04:00
2020-12-10 04:25:15 +03:00
/*
* Needs to run after memblock setup because it needs the physical
* memory size .
*/
sev_setup_arch ( ) ;
2019-11-07 04:43:05 +03:00
efi_fake_memmap ( ) ;
efi_find_mirror ( ) ;
efi_esrt_init ( ) ;
2020-09-05 04:31:05 +03:00
efi_mokvar_table_init ( ) ;
2016-08-10 12:29:13 +03:00
2019-11-07 04:43:05 +03:00
/*
* The EFI specification says that boot service code won ' t be
* called after ExitBootServices ( ) . This is , in fact , a lie .
*/
efi_reserve_boot_services ( ) ;
x86, efi: Retain boot service code until after switching to virtual mode
UEFI stands for "Unified Extensible Firmware Interface", where "Firmware"
is an ancient African word meaning "Why do something right when you can
do it so wrong that children will weep and brave adults will cower before
you", and "UEI" is Celtic for "We missed DOS so we burned it into your
ROMs". The UEFI specification provides for runtime services (ie, another
way for the operating system to be forced to depend on the firmware) and
we rely on these for certain trivial tasks such as setting up the
bootloader. But some hardware fails to work if we attempt to use these
runtime services from physical mode, and so we have to switch into virtual
mode. So far so dreadful.
The specification makes it clear that the operating system is free to do
whatever it wants with boot services code after ExitBootServices() has been
called. SetVirtualAddressMap() can't be called until ExitBootServices() has
been. So, obviously, a whole bunch of EFI implementations call into boot
services code when we do that. Since we've been charmingly naive and
trusted that the specification may be somehow relevant to the real world,
we've already stuffed a picture of a penguin or something in that address
space. And just to make things more entertaining, we've also marked it
non-executable.
This patch allocates the boot services regions during EFI init and makes
sure that they're executable. Then, after SetVirtualAddressMap(), it
discards them and everyone lives happily ever after. Except for the ones
who have to work on EFI, who live sad lives haunted by the knowledge that
someone's eventually going to write yet another firmware specification.
[ hpa: adding this to urgent with a stable tag since it fixes currently-broken
hardware. However, I do not know what the dependencies are and so I do
not know which -stable versions this may be a candidate for. ]
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1306331593-28715-1-git-send-email-mjg@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@kernel.org>
2011-05-25 17:53:13 +04:00
2010-08-26 00:39:17 +04:00
/* preallocate 4k for mptable mpc */
2017-01-28 15:46:28 +03:00
e820__memblock_alloc_reserved_mpc_new ( ) ;
2010-08-26 00:39:17 +04:00
# ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check ( ) ;
# endif
2013-01-25 00:19:54 +04:00
# ifdef CONFIG_X86_32
2012-05-30 02:06:29 +04:00
printk ( KERN_DEBUG " initial memory mapped: [mem 0x00000000-%#010lx] \n " ,
( max_pfn_mapped < < PAGE_SHIFT ) - 1 ) ;
2013-01-25 00:19:54 +04:00
# endif
2010-08-26 00:39:17 +04:00
2021-04-13 21:08:39 +03:00
/*
2021-06-08 23:17:10 +03:00
* Find free memory for the real mode trampoline and place it there . If
* there is not enough free memory under 1 M , on EFI - enabled systems
* there will be an additional attempt to reclaim the memory for the real
* mode trampoline at efi_free_boot_services ( ) .
2021-06-01 10:53:52 +03:00
*
2021-06-08 23:17:10 +03:00
* Unconditionally reserve the entire first 1 M of RAM because BIOSes
* are known to corrupt low memory and several hundred kilobytes are not
* worth complex detection of what memory gets clobbered . Windows does the
* same thing for very similar reasons .
*
* Moreover , on machines with SandyBridge graphics or in setups that use
* crashkernel the entire 1 M is reserved anyway .
2021-04-13 21:08:39 +03:00
*/
2021-06-01 10:53:52 +03:00
reserve_real_mode ( ) ;
2012-11-15 00:43:31 +04:00
2012-11-17 07:38:41 +04:00
init_mem_mapping ( ) ;
2011-10-21 01:15:26 +04:00
2017-08-28 09:47:50 +03:00
idt_setup_early_pf ( ) ;
2011-10-21 01:15:26 +04:00
2016-08-10 12:29:14 +03:00
/*
* Update mmu_cr4_features ( and , indirectly , trampoline_cr4_features )
* with the current CR4 value . This may not be necessary , but
* auditing all the early - boot CR4 manipulation would be needed to
* rule it out .
2017-09-11 03:48:27 +03:00
*
* Mask off features that don ' t work outside long mode ( just
* PCIDE for now ) .
2016-08-10 12:29:14 +03:00
*/
2017-09-11 03:48:27 +03:00
mmu_cr4_features = __read_cr4 ( ) & ~ X86_CR4_PCIDE ;
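The masking in the statement above is a plain bit clear. The sketch below assumes PCIDE is CR4 bit 17, as documented for x86. `DEMO_CR4_PCIDE` and `demo_mask_cr4()` are hypothetical names, and the CR4 value is passed in as an argument because the register cannot be read from userspace.

```c
#include <stdint.h>

#define DEMO_CR4_PCIDE (UINT64_C(1) << 17)   /* CR4.PCIDE, only valid in long mode */

/* Return the CR4 value with PCIDE cleared; all other bits are preserved. */
static uint64_t demo_mask_cr4(uint64_t cr4)
{
    return cr4 & ~DEMO_CR4_PCIDE;
}
```

A value that already has PCIDE clear passes through unchanged, so applying the mask unconditionally is harmless.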
2016-08-10 12:29:14 +03:00
2014-01-28 05:06:50 +04:00
memblock_set_current_limit ( get_max_mapped ( ) ) ;
2008-06-24 23:18:14 +04:00
2008-06-26 08:51:28 +04:00
/*
* NOTE : On x86 - 32 , only from this point on , fixmaps are ready for use .
*/
# ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
if ( init_ohci1394_dma_early )
init_ohci1394_dma_on_all_controllers ( ) ;
# endif
2011-05-25 04:13:20 +04:00
/* Allocate bigger log buffer */
setup_log_buf ( 1 ) ;
2008-06-26 08:51:28 +04:00
2017-02-06 14:22:45 +03:00
if ( efi_enabled ( EFI_BOOT ) ) {
switch ( boot_params . secure_boot ) {
case efi_secureboot_mode_disabled :
pr_info ( " Secure boot disabled \n " ) ;
break ;
case efi_secureboot_mode_enabled :
pr_info ( " Secure boot enabled \n " ) ;
break ;
default :
pr_info ( " Secure boot could not be determined \n " ) ;
break ;
}
}
2008-06-23 14:05:30 +04:00
reserve_initrd ( ) ;
2016-06-20 13:56:10 +03:00
acpi_table_upgrade ( ) ;
2021-04-13 17:01:00 +03:00
/* Look for ACPI tables and reserve memory occupied by them. */
acpi_boot_table_init ( ) ;
2012-10-01 02:23:54 +04:00
2008-06-26 04:52:35 +04:00
vsmp_init ( ) ;
2008-06-18 02:41:45 +04:00
io_delay_init ( ) ;
2017-08-01 15:10:41 +03:00
early_platform_quirks ( ) ;
x86, ACPI, mm: Revert movablemem_map support
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5
Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")
It turns out movable_map has some problems, and it breaks several things
1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy.
and make fall back path working.
2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.
3. that patch TITLE is totally misleading, there is NO x86 in the title,
but it changes critical x86 code. As a result, the x86 maintainers did not
notice the problem early. Those patches really should
be routed via tip/x86/mm.
4. after that commit, following range can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.
If a node is hotpluggable, the mem related ranges like page table and
vmemmap could be on that node without problem and should be on that
node.
We have workaround patch that could fix some problems, but some can not
be fixed.
So just remove that offending commit and related ones including:
f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")
01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")
27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")
e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")
fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")
42f47e27e761 ("page_alloc: make movablemem_map have higher priority")
6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")
34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")
4d59a75125d5 ("x86: get pg_data_t's memory from other node")
Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0. Also
need to find way to put other kernel used ram to local node ram.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-02 02:51:27 +04:00
early_acpi_boot_init ( ) ;
2011-02-16 14:13:06 +03:00
initmem_init ( ) ;
2014-10-24 13:00:34 +04:00
dma_contiguous_reserve ( max_pfn_mapped << PAGE_SHIFT ) ;
2013-11-13 03:08:07 +04:00
mm: hugetlb: optionally allocate gigantic hugepages using cma
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.
However it actually works only at early stages of the system loading,
when the majority of memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero. Even dropping caches manually doesn't
help a lot.
At large scale rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.
The following solution can solve the problem:
1) On boot time a dedicated cma area* is reserved. The size is passed
as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
cma allocator and the dedicated cma area
In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.
* On a multi-node machine a per-node cma area is allocated on each node.
Subsequent gigantic hugetlb allocations use the first available
numa node if the mask isn't specified by a user.
Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
pass hugetlb_cma=10G as a kernel argument
2) allocate hugetlb pages as usual, e.g.
echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.
x86 and arm64 are covered by this patch; other architectures can be
trivially added later.
The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap. It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-11 00:32:45 +03:00
if ( boot_cpu_has ( X86_FEATURE_GBPAGES ) )
hugetlb_cma_reserve ( PUD_SHIFT - PAGE_SHIFT ) ;
2013-11-13 03:08:07 +04:00
/*
* Reserve memory for crash kernel after SRAT is parsed so that it
* won't consume hotpluggable memory .
*/
reserve_crashkernel ( ) ;
2010-08-26 00:39:18 +04:00
memblock_find_dma_reserve ( ) ;
2008-07-18 21:07:53 +04:00
2017-09-11 21:51:11 +03:00
if ( ! early_xdbc_setup_hardware ( ) )
early_xdbc_register_console ( ) ;
2012-08-22 00:22:38 +04:00
x86_init . paging . pagetable_init ( ) ;
x86: early boot debugging via FireWire (ohci1394_dma=early)
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues with ACPI or other subsystems which are executed very early.
If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can even be enabled
in standard, non-debug kernels.
With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit. Even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so it does not matter if the machine has crashed early.
In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.
A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.
A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt
It also has links to all the tools which are available to make use of it.
Another copy of the patch is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff
Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 15:34:11 +03:00
2015-02-14 01:39:25 +03:00
kasan_init ( ) ;
2017-05-09 03:09:10 +03:00
/*
2018-02-28 23:14:26 +03:00
* Sync back kernel address range .
*
* FIXME : Can the later sync in setup_cpu_entry_areas ( ) replace
* this call ?
2017-05-09 03:09:10 +03:00
*/
2018-02-28 23:14:26 +03:00
sync_initial_page_table ( ) ;
2017-05-09 03:09:10 +03:00
x86, intel_txt: Intel TXT boot support
This patch adds kernel configuration and boot support for Intel Trusted
Execution Technology (Intel TXT).
Intel's technology for safer computing, Intel Trusted Execution
Technology (Intel TXT), defines platform-level enhancements that
provide the building blocks for creating trusted platforms.
Intel TXT was formerly known by the code name LaGrande Technology (LT).
Intel TXT in Brief:
o Provides dynamic root of trust for measurement (DRTM)
o Data protection in case of improper shutdown
o Measurement and verification of launched environment
Intel TXT is part of the vPro(TM) brand and is also available on some
non-vPro systems. It is currently available on desktop systems based on
the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell Optiplex 755, HP
dc7800, etc.) and mobile systems based on the GM45, PM45, and GS45
Express chipsets.
For more information, see http://www.intel.com/technology/security/.
This site also has a link to the Intel TXT MLE Developers Manual, which
has been updated for the new released platforms.
A much more complete description of how these patches support TXT, how to
configure a system for it, etc. is in the Documentation/intel_txt.txt file
in this patch.
This patch provides the TXT support routines for complete functionality,
documentation for TXT support and for the changes to the boot_params structure,
and boot detection of a TXT launch. Attempts to shut down (reboot, Sx) the system
will result in platform resets; subsequent patches will support these shutdown modes
properly.
Documentation/intel_txt.txt | 210 +++++++++++++++++++++
Documentation/x86/zero-page.txt | 1
arch/x86/include/asm/bootparam.h | 3
arch/x86/include/asm/fixmap.h | 3
arch/x86/include/asm/tboot.h | 197 ++++++++++++++++++++
arch/x86/kernel/Makefile | 1
arch/x86/kernel/setup.c | 4
arch/x86/kernel/tboot.c | 379 +++++++++++++++++++++++++++++++++++++++
security/Kconfig | 30 +++
9 files changed, 827 insertions(+), 1 deletion(-)
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Gang Wei <gang.wei@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-07-01 06:30:59 +04:00
tboot_probe ( ) ;
2008-06-26 04:52:35 +04:00
map_vsyscall ( ) ;
2006-09-26 12:52:32 +04:00
generic_apic_probe ( ) ;
2005-04-17 02:20:36 +04:00
2007-10-19 22:35:03 +04:00
early_quirks ( ) ;
2006-06-08 11:43:38 +04:00
2008-06-24 06:55:05 +04:00
/*
* Read APIC and some other early information from ACPI tables .
*/
2005-04-17 02:20:36 +04:00
acpi_boot_init ( ) ;
2011-02-25 18:09:31 +03:00
x86_dtb_init ( ) ;
2008-06-21 12:38:41 +04:00
2008-06-24 06:55:05 +04:00
/*
* get boot-time SMP configuration :
*/
2016-08-12 09:57:12 +03:00
get_smp_config ( ) ;
2008-06-26 04:52:35 +04:00
x86/smpboot: Init apic mapping before usage
The recent changes, which forced the registration of the boot cpu on UP
systems, which do not have ACPI tables, have been fixed for systems w/o
local APIC, but left a wreckage for systems which have neither ACPI nor
mptables, but the CPU has an APIC, e.g. virtualbox.
The boot process crashes in prefill_possible_map() as it wants to register
the boot cpu, which needs to access the local apic, but the local APIC is
not yet mapped.
There is no reason why init_apic_mapping() can't be invoked before
prefill_possible_map(). So instead of playing another silly early mapping
game, as the ACPI/mptables code does, we just move init_apic_mapping()
before the call to prefill_possible_map().
In hindsight, I should have noticed that combination earlier.
Sorry for the churn (also in stable)!
Fixes: ff8560512b8d ("x86/boot/smp: Don't try to poke disabled/non-existent APIC")
Reported-and-debugged-by: Michal Necasek <michal.necasek@oracle.com>
Reported-and-tested-by: Wolfgang Bauer <wbauer@tmo.at>
Cc: prarit@redhat.com
Cc: ville.syrjala@linux.intel.com
Cc: michael.thayer@oracle.com
Cc: knut.osmundsen@oracle.com
Cc: frank.mehnert@oracle.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610282114380.5053@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-29 14:42:42 +03:00
/*
* Systems w/o ACPI and mptables might not have mapped the local
* APIC yet , but prefill_possible_map ( ) might need to access it .
*/
init_apic_mappings ( ) ;
2008-07-03 05:54:40 +04:00
prefill_possible_map ( ) ;
2008-08-20 07:50:02 +04:00
2008-07-03 05:53:44 +04:00
init_cpu_to_node ( ) ;
2020-09-30 17:05:43 +03:00
init_gi_nodes ( ) ;
2008-07-03 05:53:44 +04:00
2015-04-24 14:57:48 +03:00
io_apic_init_mappings ( ) ;
2008-08-20 07:50:52 +04:00
2017-11-09 16:27:38 +03:00
x86_init . hyper . guest_late_init ( ) ;
2005-04-17 02:20:36 +04:00
2017-01-29 00:41:14 +03:00
e820__reserve_resources ( ) ;
2018-09-21 09:26:24 +03:00
e820__register_nosave_regions ( max_pfn ) ;
2005-04-17 02:20:36 +04:00
2009-08-19 16:55:50 +04:00
x86_init . resources . reserve_resources ( ) ;
2008-06-17 00:03:31 +04:00
2017-01-28 16:16:38 +03:00
e820__setup_pci_gap ( ) ;
2008-06-17 00:03:31 +04:00
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_VT
# if defined(CONFIG_VGA_CONSOLE)
2012-11-14 13:42:35 +04:00
if ( ! efi_enabled ( EFI_BOOT ) || ( efi_mem_type ( 0xa0000 ) != EFI_CONVENTIONAL_MEMORY ) )
2005-04-17 02:20:36 +04:00
conswitchp = & vga_con ;
# endif
# endif
2009-08-20 15:19:57 +04:00
x86_init . oem . banner ( ) ;
2009-11-10 04:38:24 +03:00
2011-02-14 19:13:31 +03:00
x86_init . timers . wallclock_init ( ) ;
x86/thermal: Fix LVT thermal setup for SMI delivery mode
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):
[ 0.033858] read lvtthmr: 0x330, val: 0x200
which means:
- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.
Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:
setup_local_APIC:
...
/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
which means (from the SDM):
"10.4.7.2 Local APIC State After It Has Been Software Disabled
...
* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."
And this happens too:
[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't fire at all,
leading to CPU 0 stuck in SMM forever...
Now, before
4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.
But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.
Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.
Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
2021-05-27 12:02:26 +03:00
/*
* This needs to run before setup_local_APIC ( ) which soft-disables the
* local APIC temporarily and that masks the thermal LVT interrupt ,
* leading to softlockups on machines which have configured SMI
* interrupt delivery .
*/
therm_lvt_init ( ) ;
2009-11-10 04:38:24 +03:00
mcheck_init ( ) ;
2010-09-17 19:08:51 +04:00
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
CLOCK_TICK_RATE is used to accurately calculate exactly how
long a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
whose granularity error is negligible.
Even so, we have some complicated calculations that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock being assumed to have a perfect HZ
freq, and allows architectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refined_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. It's likely these will not be noticeable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refined_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-04 20:42:27 +04:00
register_refined_jiffies ( CLOCK_TICK_RATE ) ;
2012-10-24 21:00:44 +04:00
# ifdef CONFIG_EFI
2014-03-04 20:02:17 +04:00
if ( efi_enabled ( EFI_BOOT ) )
efi_apply_memmap_quirks ( ) ;
2012-10-24 21:00:44 +04:00
# endif
2017-07-25 02:36:57 +03:00
unwind_init ( ) ;
2005-04-17 02:20:36 +04:00
}
2008-09-16 11:29:09 +04:00
2009-02-18 01:12:48 +03:00
# ifdef CONFIG_X86_32
2009-08-19 16:55:50 +04:00
static struct resource video_ram_resource = {
. name = " Video RAM area " ,
. start = 0xa0000 ,
. end = 0xbffff ,
. flags = IORESOURCE_BUSY | IORESOURCE_MEM
2009-02-18 01:12:48 +03:00
} ;
2009-08-19 16:55:50 +04:00
void __init i386_reserve_resources ( void )
2009-02-18 01:12:48 +03:00
{
2009-08-19 16:55:50 +04:00
request_resource ( & iomem_resource , & video_ram_resource ) ;
reserve_standard_io_resources ( ) ;
2009-02-18 01:12:48 +03:00
}
# endif /* CONFIG_X86_32 */
2013-10-11 04:18:17 +04:00
static struct notifier_block kernel_offset_notifier = {
. notifier_call = dump_kernel_offset
} ;
static int __init register_kernel_offset_dumper ( void )
{
atomic_notifier_chain_register ( & panic_notifier_list ,
& kernel_offset_notifier ) ;
return 0 ;
}
__initcall ( register_kernel_offset_dumper ) ;