2005-04-16 15:20:36 -07:00
/*
 * Copyright (C) 1995 Linus Torvalds
 *
 * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
 *
 * Memory region support
 * David Parsons <orc@pell.chi.il.us>, July-August 1999
 *
 * Added E820 sanitization routine (removes overlapping memory regions);
 * Brian Moyle <bmoyle@mvista.com>, February 2001
 *
 * Moved CPU detection code to cpu/${cpu}.c
 * Patrick Mochel <mochel@osdl.org>, March 2002
 *
 * Provisions for empty E820 memory regions (reported by certain BIOSes).
 * Alex Achenbach <xela@slit.de>, December 2002.
 *
 */
/*
 * This file handles the architecture-dependent parts of initialization
 */
# include <linux/sched.h>
# include <linux/mm.h>
2005-06-23 00:07:57 -07:00
# include <linux/mmzone.h>
2006-07-10 04:44:13 -07:00
# include <linux/screen_info.h>
2005-04-16 15:20:36 -07:00
# include <linux/ioport.h>
# include <linux/acpi.h>
2009-08-14 15:23:29 -04:00
# include <linux/sfi.h>
2005-04-16 15:20:36 -07:00
# include <linux/apm_bios.h>
# include <linux/initrd.h>
# include <linux/bootmem.h>
2010-08-25 13:39:17 -07:00
# include <linux/memblock.h>
2005-04-16 15:20:36 -07:00
# include <linux/seq_file.h>
# include <linux/console.h>
# include <linux/root_dev.h>
# include <linux/highmem.h>
2016-07-13 20:18:56 -04:00
# include <linux/export.h>
2005-04-16 15:20:36 -07:00
# include <linux/efi.h>
# include <linux/init.h>
# include <linux/edd.h>
2008-04-09 19:50:41 -07:00
# include <linux/iscsi_ibft.h>
2005-04-16 15:20:36 -07:00
# include <linux/nodemask.h>
2005-06-25 14:58:01 -07:00
# include <linux/kexec.h>
2006-01-11 22:43:33 +01:00
# include <linux/dmi.h>
2006-03-27 01:16:04 -08:00
# include <linux/pfn.h>
2008-01-30 13:30:16 +01:00
# include <linux/pci.h>
2008-06-25 17:51:29 -07:00
# include <asm/pci-direct.h>
x86: early boot debugging via FireWire (ohci1394_dma=early)
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues in ACPI or other subsystems which are executed very early.
If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.
With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.
In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.
A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.
A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt
It also has links to all the tools which are available to make use of it.
Another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff
Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:34:11 +01:00
# include <linux/init_ohci1394_dma.h>
2008-02-15 17:52:48 -02:00
# include <linux/kvm_para.h>
2011-12-29 13:09:51 +01:00
# include <linux/dma-contiguous.h>
2005-06-25 14:58:01 -07:00
2008-06-25 17:51:29 -07:00
# include <linux/errno.h>
# include <linux/kernel.h>
# include <linux/stddef.h>
# include <linux/unistd.h>
# include <linux/ptrace.h>
# include <linux/user.h>
# include <linux/delay.h>
# include <linux/kallsyms.h>
# include <linux/cpufreq.h>
# include <linux/dma-mapping.h>
# include <linux/ctype.h>
# include <linux/uaccess.h>
# include <linux/percpu.h>
# include <linux/crash_dump.h>
2009-09-01 18:25:07 -07:00
# include <linux/tboot.h>
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
CLOCK_TICK_RATE is used to accurately calculate exactly how
long a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
whose granularity error is negligible.
Even so, we have some complicated calculations that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. As a result, the
default jiffies clock is assumed to run at a perfect HZ frequency,
and architectures that care about jiffies accuracy can call
register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refined_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. It's likely these will not be noticeable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refined_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-04 12:42:27 -04:00
# include <linux/jiffies.h>
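The rounding effect described in the commit message above can be illustrated with a small user-space sketch. It assumes a timer that can only be programmed with a whole number of input counts per tick; the names `demo_latch` and `demo_actual_tick_nsec` are hypothetical, and this is the idea only, not the kernel's fixed-point computation.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/*
 * Timer input counts per tick, rounded to nearest. The hardware can only
 * be programmed with a whole number of counts, which is where the error
 * relative to a perfect HZ comes from.
 */
static uint64_t demo_latch(uint64_t tick_rate, uint64_t hz)
{
	assert(hz != 0);
	return (tick_rate + hz / 2) / hz;
}

/* Length in nanoseconds (truncated) of the tick the timer really delivers. */
static uint64_t demo_actual_tick_nsec(uint64_t tick_rate, uint64_t hz)
{
	assert(tick_rate != 0);
	return demo_latch(tick_rate, hz) * DEMO_NSEC_PER_SEC / tick_rate;
}
```

With an illustrative input rate of 1193182 Hz (the classic PC interval timer frequency) and HZ = 1000, `demo_latch()` gives 1193 counts and `demo_actual_tick_nsec()` gives 999847 ns, roughly 153 ns short of the nominal 1000000 ns per tick.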
2008-06-25 17:51:29 -07:00
2005-04-16 15:20:36 -07:00
# include <video/edid.h>
2005-06-25 14:58:01 -07:00
2008-01-30 13:33:32 +01:00
# include <asm/mtrr.h>
2005-06-25 14:57:41 -07:00
# include <asm/apic.h>
2012-05-08 21:22:26 +03:00
# include <asm/realmode.h>
2017-01-27 10:27:10 +01:00
# include <asm/e820/api.h>
2005-04-16 15:20:36 -07:00
# include <asm/mpspec.h>
# include <asm/setup.h>
2008-06-25 17:54:23 -07:00
# include <asm/efi.h>
2009-02-23 00:34:39 +01:00
# include <asm/timer.h>
# include <asm/i8259.h>
2005-04-16 15:20:36 -07:00
# include <asm/sections.h>
# include <asm/io_apic.h>
# include <asm/ist.h>
2009-01-28 19:34:09 +01:00
# include <asm/setup_arch.h>
2008-03-17 22:08:17 +03:00
# include <asm/bios_ebda.h>
2007-10-21 16:42:01 -07:00
# include <asm/cacheflush.h>
2008-03-04 19:57:42 +01:00
# include <asm/processor.h>
2008-06-16 16:11:08 -07:00
# include <asm/bugs.h>
2015-02-13 14:39:25 -08:00
# include <asm/kasan.h>
2005-04-16 15:20:36 -07:00
2008-06-25 17:51:29 -07:00
# include <asm/vsyscall.h>
2009-01-07 18:11:35 +05:30
# include <asm/cpu.h>
2008-06-25 17:51:29 -07:00
# include <asm/desc.h>
# include <asm/dma.h>
2008-07-11 10:23:42 +09:00
# include <asm/iommu.h>
2008-11-27 18:39:15 +01:00
# include <asm/gart.h>
2008-06-25 17:51:29 -07:00
# include <asm/mmu_context.h>
# include <asm/proto.h>
# include <asm/paravirt.h>
2008-10-27 10:41:46 -07:00
# include <asm/hypervisor.h>
2010-06-18 17:46:53 -04:00
# include <asm/olpc_ofw.h>
2008-06-25 17:51:29 -07:00
# include <asm/percpu.h>
# include <asm/topology.h>
# include <asm/apicdef.h>
2010-09-17 18:03:43 +02:00
# include <asm/amd_nb.h>
2009-11-10 09:38:24 +08:00
# include <asm/mce.h>
2010-09-17 11:08:51 -04:00
# include <asm/alternative.h>
2011-02-22 21:07:37 +01:00
# include <asm/prom.h>
2015-10-20 11:54:44 +02:00
# include <asm/microcode.h>
2016-02-12 13:02:27 -08:00
# include <asm/mmu_context.h>
x86/mm: Implement ASLR for kernel memory regions
Randomizes the virtual address space of kernel memory regions for
x86_64. This first patch adds the infrastructure and does not randomize
any region. The following patches will randomize the physical memory
mapping, vmalloc and vmemmap regions.
This security feature mitigates exploits relying on predictable kernel
addresses. These addresses can be used to disclose the kernel modules
base addresses or corrupt specific structures to elevate privileges
bypassing the current implementation of KASLR. This feature can be
enabled with the CONFIG_RANDOMIZE_MEMORY option.
The order of each memory region is not changed. The feature looks at the
available space for the regions based on different configuration options
and randomizes the base and space between each. The size of the physical
memory mapping is the available physical memory. No performance impact
was detected while testing the feature.
Entropy is generated using the KASLR early boot functions now shared in
the lib directory (originally written by Kees Cook). Randomization is
done on PGD & PUD page table levels to increase possible addresses. The
physical memory mapping code was adapted to support PUD level virtual
addresses. This implementation on the best configuration provides 30,000
possible virtual addresses on average for each memory region. An
additional low memory page is used to ensure each CPU can start with a
PGD aligned virtual address (for realmode).
x86/dump_pagetable was updated to correctly display each region.
Updated documentation on x86_64 memory layout accordingly.
Performance data, after all patches in the series:
Kernbench shows almost no difference (-+ less than 1%):
Before:
Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.63 (1.2695)
User Time 1034.89 (1.18115) System Time 87.056 (0.456416) Percent CPU 1092.9
(13.892) Context Switches 199805 (3455.33) Sleeps 97907.8 (900.636)
After:
Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.489 (1.10636)
User Time 1034.86 (1.36053) System Time 87.764 (0.49345) Percent CPU 1095
(12.7715) Context Switches 199036 (4298.1) Sleeps 97681.6 (1031.11)
Hackbench shows 0% difference on average (hackbench 90 repeated 10 times):
attempt,before,after
1,0.076,0.069
2,0.072,0.069
3,0.066,0.066
4,0.066,0.068
5,0.066,0.067
6,0.066,0.069
7,0.067,0.066
8,0.063,0.067
9,0.067,0.065
10,0.068,0.071
average,0.0677,0.0677
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/1466556426-32664-6-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-21 17:47:02 -07:00
# include <asm/kaslr.h>
2008-06-25 17:51:29 -07:00
2009-04-28 16:00:49 +03:00
/*
2012-11-16 19:38:52 -08:00
* max_low_pfn_mapped: highest direct mapped pfn under 4GB
* max_pfn_mapped: highest direct mapped pfn over 4GB
*
2017-01-28 17:09:33 +01:00
* The direct mapping only covers E820_TYPE_RAM regions, so the ranges and gaps are
2012-11-16 19:38:52 -08:00
* represented by pfn_mapped
2009-04-28 16:00:49 +03:00
*/
unsigned long max_low_pfn_mapped ;
unsigned long max_pfn_mapped ;
2010-02-09 21:38:45 -02:00
# ifdef CONFIG_DMI
2009-03-12 16:09:49 -07:00
RESERVE_BRK ( dmi_alloc , 65536 ) ;
2010-02-09 21:38:45 -02:00
# endif
2009-03-12 16:09:49 -07:00
2009-01-27 17:13:05 +01:00
x86: add brk allocation for very, very early allocations
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss are reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-26 17:35:44 -08:00
static __initdata unsigned long _brk_start = ( unsigned long ) __brk_base ;
unsigned long _brk_end = ( unsigned long ) __brk_base ;
2009-01-27 17:13:05 +01:00
# ifdef CONFIG_X86_64
int default_cpu_present_to_apicid ( int mps_cpu )
{
return __default_cpu_present_to_apicid ( mps_cpu ) ;
}
2009-08-31 15:18:40 +02:00
int default_check_phys_apicid_present ( int phys_apicid )
2009-01-27 17:13:05 +01:00
{
2009-08-31 15:18:40 +02:00
return __default_check_phys_apicid_present ( phys_apicid ) ;
2009-01-27 17:13:05 +01:00
}
# endif
2008-06-25 17:55:20 -07:00
struct boot_params boot_params ;
2016-04-14 11:18:57 -07:00
/*
* Machine setup . .
*/
static struct resource data_resource = {
. name = "Kernel data" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
static struct resource code_resource = {
. name = "Kernel code" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
static struct resource bss_resource = {
. name = "Kernel bss" ,
. start = 0 ,
. end = 0 ,
. flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
} ;
2008-06-25 17:50:06 -07:00
# ifdef CONFIG_X86_32
2005-04-16 15:20:36 -07:00
/* cpu data as detected by the assembly code in head.S */
x86: delete __cpuinit usage from all x86 files
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up(), which are arch independent
(kernel/cpu.c), are flagged as __cpuinit -- so if we remove the __cpuinit
from arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings. In any case, they are temporary and harmless.
This removes all the arch/x86 uses of the __cpuinit macros from
all C files. x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-06-18 18:23:59 -04:00
struct cpuinfo_x86 new_cpu_data = {
2013-03-03 00:14:42 +01:00
. wp_works_ok = - 1 ,
} ;
2005-04-16 15:20:36 -07:00
/* common cpu data for all cpus */
2013-03-03 00:14:42 +01:00
struct cpuinfo_x86 boot_cpu_data __read_mostly = {
. wp_works_ok = - 1 ,
} ;
2005-06-23 00:08:33 -07:00
EXPORT_SYMBOL ( boot_cpu_data ) ;
2005-04-16 15:20:36 -07:00
2008-03-27 23:55:04 +03:00
unsigned int def_to_bigsmp ;
2005-04-16 15:20:36 -07:00
/* for MCA, but anyone else can use it if they want */
unsigned int machine_id ;
unsigned int machine_submodel_id ;
unsigned int BIOS_revision ;
2008-06-25 17:50:06 -07:00
struct apm_info apm_info ;
EXPORT_SYMBOL ( apm_info ) ;
# if defined(CONFIG_X86_SPEEDSTEP_SMI) || \
defined ( CONFIG_X86_SPEEDSTEP_SMI_MODULE )
struct ist_info ist_info ;
EXPORT_SYMBOL ( ist_info ) ;
# else
struct ist_info ist_info ;
# endif
# else
2009-03-04 16:16:51 -08:00
struct cpuinfo_x86 boot_cpu_data __read_mostly = {
. x86_phys_bits = MAX_PHYSMEM_BITS ,
} ;
2008-06-25 17:50:06 -07:00
EXPORT_SYMBOL ( boot_cpu_data ) ;
# endif
# if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
2016-08-08 16:29:06 -07:00
__visible unsigned long mmu_cr4_features __ro_after_init ;
2008-06-25 17:50:06 -07:00
# else
2016-08-08 16:29:06 -07:00
__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE ;
2008-06-25 17:50:06 -07:00
# endif
2009-05-07 16:54:11 -07:00
/* Boot loader ID and version as integers, for the benefit of proc_dointvec */
int bootloader_type , bootloader_version ;
2005-04-16 15:20:36 -07:00
/*
* Setup options
*/
struct screen_info screen_info ;
2005-06-23 00:08:33 -07:00
EXPORT_SYMBOL ( screen_info ) ;
2005-04-16 15:20:36 -07:00
struct edid_info edid_info ;
2005-09-09 13:04:34 -07:00
EXPORT_SYMBOL_GPL ( edid_info ) ;
2005-04-16 15:20:36 -07:00
extern int root_mountflags ;
2008-04-10 23:28:10 +02:00
unsigned long saved_video_mode ;
2005-04-16 15:20:36 -07:00
2008-01-30 13:32:51 +01:00
# define RAMDISK_IMAGE_START_MASK 0x07FF
2005-04-16 15:20:36 -07:00
# define RAMDISK_PROMPT_FLAG 0x8000
2008-01-30 13:32:51 +01:00
# define RAMDISK_LOAD_FLAG 0x4000
2005-04-16 15:20:36 -07:00
2007-02-12 00:54:11 -08:00
static char __initdata command_line [ COMMAND_LINE_SIZE ] ;
2008-08-12 12:52:36 -07:00
# ifdef CONFIG_CMDLINE_BOOL
static char __initdata builtin_cmdline [ COMMAND_LINE_SIZE ] = CONFIG_CMDLINE ;
# endif
2005-04-16 15:20:36 -07:00
# if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
struct edd edd ;
# ifdef CONFIG_EDD_MODULE
EXPORT_SYMBOL ( edd ) ;
# endif
/**
 * copy_edd() - Copy the BIOS EDD information
 *              from boot_params into a safe place.
 *
 */
2009-11-30 18:33:51 +08:00
static inline void __init copy_edd ( void )
2005-04-16 15:20:36 -07:00
{
2007-10-15 17:13:22 -07:00
memcpy ( edd . mbr_signature , boot_params . edd_mbr_sig_buffer ,
sizeof ( edd . mbr_signature ) ) ;
memcpy ( edd . edd_info , boot_params . eddbuf , sizeof ( edd . edd_info ) ) ;
edd . mbr_signature_nr = boot_params . edd_mbr_sig_buf_entries ;
edd . edd_info_nr = boot_params . eddbuf_entries ;
2005-04-16 15:20:36 -07:00
}
# else
2009-11-30 18:33:51 +08:00
static inline void __init copy_edd ( void )
2005-04-16 15:20:36 -07:00
{
}
# endif
2009-03-14 17:19:51 -07:00
void * __init extend_brk ( size_t size , size_t align )
{
size_t mask = align - 1 ;
void * ret ;
BUG_ON ( _brk_start == 0 ) ;
BUG_ON ( align & mask ) ;
_brk_end = ( _brk_end + mask ) & ~ mask ;
BUG_ON ( ( char * ) ( _brk_end + size ) > __brk_limit ) ;
ret = ( void * ) _brk_end ;
_brk_end += size ;
memset ( ret , 0 , size ) ;
return ret ;
}
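The allocation scheme of extend_brk() can be tried outside the kernel with a minimal user-space sketch. It assumes a fixed static array standing in for the __brk_base to __brk_limit region; `demo_brk_area`, `demo_brk_used` and `demo_extend_brk` are hypothetical names, and unlike the kernel function this version returns NULL on exhaustion instead of calling BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the __brk_base..__brk_limit region. */
#define DEMO_BRK_SIZE ((size_t)4096)
static _Alignas(64) unsigned char demo_brk_area[DEMO_BRK_SIZE];
static size_t demo_brk_used;	/* offset of the first free byte */

/*
 * Return 'size' zeroed bytes aligned to 'align', or NULL when the area
 * is exhausted. 'align' must be a power of two no larger than 64, the
 * alignment of the backing array in this sketch.
 */
static void *demo_extend_brk(size_t size, size_t align)
{
	size_t mask = align - 1;
	size_t start;

	assert(align != 0 && (align & mask) == 0);

	/* Round the current end up to the requested alignment. */
	start = (demo_brk_used + mask) & ~mask;
	if (start > DEMO_BRK_SIZE || size > DEMO_BRK_SIZE - start)
		return NULL;

	demo_brk_used = start + size;
	memset(demo_brk_area + start, 0, size);
	return demo_brk_area + start;
}
```

A request for 10 bytes followed by a request for 16 bytes at 16-byte alignment places the second block at offset 16, leaving 6 bytes of padding between the two.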
2012-11-16 19:39:08 -08:00
# ifdef CONFIG_X86_32
2011-02-18 11:30:30 +00:00
static void __init cleanup_highmap ( void )
2010-12-27 16:48:32 -08:00
{
}
2009-06-22 17:39:41 +03:00
# endif
2009-03-14 17:19:51 -07:00
static void __init reserve_brk ( void )
{
if ( _brk_end > _brk_start )
2012-11-16 13:57:13 -08:00
memblock_reserve ( __pa_symbol ( _brk_start ) ,
_brk_end - _brk_start ) ;
2009-03-14 17:19:51 -07:00
/* Mark brk area as locked down and no longer taking any
new allocations */
_brk_start = 0 ;
}
2013-12-04 20:50:42 +01:00
u64 relocated_ramdisk ;
2008-01-30 13:32:51 +01:00
# ifdef CONFIG_BLK_DEV_INITRD
2013-01-24 12:19:56 -08:00
static u64 __init get_ramdisk_image ( void )
{
u64 ramdisk_image = boot_params . hdr . ramdisk_image ;
2013-01-28 20:16:44 -08:00
ramdisk_image |= ( u64 ) boot_params . ext_ramdisk_image << 32 ;
2013-01-24 12:19:56 -08:00
return ramdisk_image ;
}
static u64 __init get_ramdisk_size ( void )
{
u64 ramdisk_size = boot_params . hdr . ramdisk_size ;
2013-01-28 20:16:44 -08:00
ramdisk_size |= ( u64 ) boot_params . ext_ramdisk_size << 32 ;
2013-01-24 12:19:56 -08:00
return ramdisk_size ;
}
2008-06-25 17:49:26 -07:00
static void __init relocate_initrd ( void )
2008-01-30 13:32:51 +01:00
{
x86: Make sure free_init_pages() frees pages on page boundary
With CONFIG_NO_BOOTMEM=y, memory can be used more efficiently, or
in a more compact fashion.
Example:
Allocated new RAMDISK: 00ec2000 - 0248ce57
Move RAMDISK from 000000002ea04000 - 000000002ffcee56 to 00ec2000 - 0248ce56
The new RAMDISK's end is not page aligned.
Last page could be shared with other users.
When free_init_pages() is called for initrd or .init, the page
could be freed and we could corrupt other data.
code segment in free_init_pages():
| for (; addr < end; addr += PAGE_SIZE) {
| ClearPageReserved(virt_to_page(addr));
| init_page_count(virt_to_page(addr));
| memset((void *)(addr & ~(PAGE_SIZE-1)),
| POISON_FREE_INITMEM, PAGE_SIZE);
| free_page(addr);
| totalram_pages++;
| }
last half page could be used as one whole free page.
So page align the boundaries.
-v2: make the original initramdisk to be aligned, according to
Johannes, otherwise we have the chance to lose one page.
we still need to keep initrd_end not aligned, otherwise it could
confuse decompressor.
-v3: change to WARN_ON instead, suggested by Johannes.
-v4: use PAGE_ALIGN, suggested by Johannes.
We may fix that macro name later to PAGE_ALIGN_UP, and PAGE_ALIGN_DOWN
Add comments about assuming ramdisk start is aligned
in relocate_initrd(), change to re-get ramdisk_image instead of saving it
to make diff smaller. Add warning for wrong range, suggested by Johannes.
-v6: remove one WARN()
We need to align beginning in free_init_pages()
do not copy more than ramdisk_size, noticed by Johannes
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1269830604-26214-3-git-send-email-yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-28 19:42:55 -07:00
/* Assume only end is not page aligned */
2013-01-24 12:19:56 -08:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2010-03-28 19:42:55 -07:00
u64 area_size = PAGE_ALIGN ( ramdisk_size ) ;
2008-01-30 13:32:51 +01:00
2012-11-16 19:38:51 -08:00
/* We need to move the initrd down into directly mapped mem */
2013-12-04 20:50:42 +01:00
relocated_ramdisk = memblock_find_in_range ( 0 , PFN_PHYS ( max_pfn_mapped ) ,
area_size , PAGE_SIZE ) ;
2008-01-30 13:32:51 +01:00
2013-12-04 20:50:42 +01:00
if ( ! relocated_ramdisk )
2008-05-25 10:00:09 -07:00
panic ( "Cannot find place for new RAMDISK of size %lld\n" ,
2013-12-04 20:50:42 +01:00
ramdisk_size ) ;
2008-05-25 10:00:09 -07:00
2012-11-16 19:38:51 -08:00
/* Note: this includes all the mem currently occupied by
2008-01-30 13:32:51 +01:00
the initrd , we rely on that fact to keep the data intact . */
2013-12-04 20:50:42 +01:00
memblock_reserve ( relocated_ramdisk , area_size ) ;
initrd_start = relocated_ramdisk + PAGE_OFFSET ;
2008-01-30 13:32:51 +01:00
initrd_end = initrd_start + ramdisk_size ;
2012-05-29 15:06:29 -07:00
printk ( KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n" ,
2013-12-04 20:50:42 +01:00
relocated_ramdisk , relocated_ramdisk + ramdisk_size - 1 ) ;
2008-01-30 13:32:51 +01:00
2015-09-08 15:03:07 -07:00
copy_from_early_mem ( ( void * ) initrd_start , ramdisk_image , ramdisk_size ) ;
2012-05-29 15:06:29 -07:00
printk ( KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n" ,
2008-05-21 18:40:18 -07:00
ramdisk_image , ramdisk_image + ramdisk_size - 1 ,
2013-12-04 20:50:42 +01:00
relocated_ramdisk , relocated_ramdisk + ramdisk_size - 1 ) ;
2008-06-25 17:49:26 -07:00
}
2008-06-13 20:07:03 -07:00
2013-01-24 12:19:55 -08:00
static void __init early_reserve_initrd ( void )
{
/* Assume only end is not page aligned */
2013-01-24 12:19:56 -08:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2013-01-24 12:19:55 -08:00
u64 ramdisk_end = PAGE_ALIGN ( ramdisk_image + ramdisk_size ) ;
if ( ! boot_params . hdr . type_of_loader ||
! ramdisk_image || ! ramdisk_size )
return ; /* No initrd provided by bootloader */
memblock_reserve ( ramdisk_image , ramdisk_end - ramdisk_image ) ;
}
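The address arithmetic used by get_ramdisk_image(), get_ramdisk_size() and early_reserve_initrd() can be checked with a small user-space sketch. It assumes 4 KiB pages and uses hypothetical helper names (`demo_join_halves`, `demo_page_align`, `demo_reserved_len`); the kernel's PAGE_ALIGN and boot_params fields are not used here.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/*
 * Join the 32-bit low and high halves of a 64-bit value. The cast must
 * happen before the shift: shifting a 32-bit value by 32 is undefined.
 */
static uint64_t demo_join_halves(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/* Round x up to the next multiple of DEMO_PAGE_SIZE. */
static uint64_t demo_page_align(uint64_t x)
{
	return (x + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/*
 * Bytes to reserve for an image of 'size' bytes at 'start'. Mirrors the
 * "assume only end is not page aligned" comment: the start is expected
 * to be page aligned already, only the end is rounded up.
 */
static uint64_t demo_reserved_len(uint64_t start, uint64_t size)
{
	assert(start % DEMO_PAGE_SIZE == 0);
	return demo_page_align(start + size) - start;
}
```

For the ramdisk in the commit message example (start 0x2ea04000, last byte 0x2ffcee56, so size 0x15cae57), `demo_reserved_len()` returns 0x15cb000, which covers the partially used last page in full.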
2008-06-25 17:49:26 -07:00
static void __init reserve_initrd ( void )
{
2010-03-28 19:42:55 -07:00
/* Assume only end is not page aligned */
2013-01-24 12:19:56 -08:00
u64 ramdisk_image = get_ramdisk_image ( ) ;
u64 ramdisk_size = get_ramdisk_size ( ) ;
2010-03-28 19:42:55 -07:00
u64 ramdisk_end = PAGE_ALIGN ( ramdisk_image + ramdisk_size ) ;
2012-11-16 19:38:51 -08:00
u64 mapped_size ;
2008-06-25 17:49:26 -07:00
if ( ! boot_params . hdr . type_of_loader | |
! ramdisk_image | | ! ramdisk_size )
return ; /* No initrd provided by bootloader */
initrd_start = 0 ;
2013-01-24 12:20:09 -08:00
mapped_size = memblock_mem_size ( max_pfn_mapped ) ;
2012-11-16 19:38:51 -08:00
if ( ramdisk_size > = ( mapped_size > > 1 ) )
2012-05-16 13:43:26 -04:00
panic ( " initrd too large to handle, "
" disabling initrd (%lld needed, %lld available) \n " ,
2012-11-16 19:38:51 -08:00
ramdisk_size , mapped_size > > 1 ) ;
2008-06-25 17:49:26 -07:00
2012-05-29 15:06:29 -07:00
printk ( KERN_INFO " RAMDISK: [mem %#010llx-%#010llx] \n " , ramdisk_image ,
ramdisk_end - 1 ) ;
2008-06-25 17:49:26 -07:00
2012-11-16 19:38:53 -08:00
if ( pfn_range_is_mapped ( PFN_DOWN ( ramdisk_image ) ,
2012-11-16 19:38:51 -08:00
PFN_DOWN ( ramdisk_end ) ) ) {
/* All are mapped, easy case */
2008-06-25 17:49:26 -07:00
initrd_start = ramdisk_image + PAGE_OFFSET ;
initrd_end = initrd_start + ramdisk_size ;
return ;
}
relocate_initrd ( ) ;
2009-06-04 19:14:22 -07:00
2011-07-12 11:16:06 +02:00
memblock_free ( ramdisk_image , ramdisk_end - ramdisk_image ) ;
2008-01-30 13:32:51 +01:00
}
2016-04-11 10:13:27 +08:00
2008-06-22 02:46:58 -07:00
# else
2013-01-24 12:19:55 -08:00
static void __init early_reserve_initrd ( void )
{
}
2008-06-25 17:49:26 -07:00
static void __init reserve_initrd ( void )
2008-06-22 02:46:58 -07:00
{
}
2008-01-30 13:32:51 +01:00
# endif /* CONFIG_BLK_DEV_INITRD */
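The page-alignment argument in the commit message above can be checked with a small userspace sketch. It assumes 4 KiB pages, and the `sketch_` names are hypothetical helpers for illustration, not kernel functions; the kernel itself uses PAGE_ALIGN() and PAGE_SIZE.

```c
#include <stdint.h>

/* Assumed 4 KiB pages, as is usual on x86; the kernel derives this from PAGE_SHIFT. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round addr up to the next page boundary, which is what PAGE_ALIGN() does. */
static uint64_t sketch_page_align_up(uint64_t addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Round addr down to the start of the page that contains it. */
static uint64_t sketch_page_align_down(uint64_t addr)
{
	return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Number of whole pages touched by [start, end) once both ends are page-aligned. */
static uint64_t sketch_pages_spanned(uint64_t start, uint64_t end)
{
	return (sketch_page_align_up(end) - sketch_page_align_down(start))
		/ SKETCH_PAGE_SIZE;
}
```

With the addresses from the commit message, the unaligned end 0x0248ce57 rounds up to 0x0248d000, so the page that holds the tail of the ramdisk is accounted for as a whole instead of being half-owned by another user.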
2008-06-25 18:00:22 -07:00
static void __init parse_setup_data ( void )
2008-06-25 17:56:22 -07:00
{
struct setup_data * data ;
2013-08-13 15:46:41 -06:00
u64 pa_data , pa_next ;
2008-06-25 17:56:22 -07:00
pa_data = boot_params . hdr . setup_data ;
while ( pa_data ) {
2015-01-07 18:55:48 +08:00
u32 data_len , data_type ;
2011-02-22 21:07:36 +01:00
2015-01-07 18:55:48 +08:00
data = early_memremap ( pa_data , sizeof ( * data ) ) ;
2011-02-22 21:07:36 +01:00
data_len = data - > len + sizeof ( struct setup_data ) ;
2013-08-13 15:46:41 -06:00
data_type = data - > type ;
pa_next = data - > next ;
2015-02-24 10:13:28 +01:00
early_memunmap ( data , sizeof ( * data ) ) ;
2011-02-22 21:07:36 +01:00
2013-08-13 15:46:41 -06:00
switch ( data_type ) {
2008-06-25 17:56:22 -07:00
case SETUP_E820_EXT :
2017-01-28 13:18:40 +01:00
e820__memory_setup_extended ( pa_data , data_len ) ;
2008-06-25 17:56:22 -07:00
break ;
2011-02-22 21:07:37 +01:00
case SETUP_DTB :
add_dtb ( pa_data ) ;
2008-06-25 17:56:22 -07:00
break ;
2013-12-20 18:02:19 +08:00
case SETUP_EFI :
parse_efi_setup ( pa_data , data_len ) ;
break ;
2008-06-25 17:56:22 -07:00
default :
break ;
}
2013-08-13 15:46:41 -06:00
pa_data = pa_next ;
2008-06-25 17:56:22 -07:00
}
}
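parse_setup_data() above copies len, type and next out of each node before unmapping it, so nothing is read through a stale mapping. A minimal userspace sketch of that pattern follows. It assumes "physical addresses" are plain offsets into a byte buffer; `sketch_phys_mem` and `sketch_count_setup_data` are hypothetical names, and only the header field order is taken from the boot protocol's struct setup_data.

```c
#include <stdint.h>
#include <string.h>

/* Same field order as the header of the boot protocol's struct setup_data. */
struct sketch_setup_data {
	uint64_t next;	/* "physical address" of the next node, 0 terminates */
	uint32_t type;
	uint32_t len;	/* payload length, header excluded */
};

/* Hypothetical stand-in for physical memory: addresses are offsets into buf. */
struct sketch_phys_mem {
	const uint8_t *buf;
	size_t size;
};

/*
 * Walk the list starting at pa_data. Each header is copied out before
 * ->next is followed, mirroring how parse_setup_data() reads the fields
 * it needs before calling early_memunmap(). Returns the number of nodes
 * of wanted_type and adds their total size (header + payload) to *total_len.
 */
static unsigned int sketch_count_setup_data(const struct sketch_phys_mem *mem,
					    uint64_t pa_data, uint32_t wanted_type,
					    uint64_t *total_len)
{
	unsigned int count = 0;

	while (pa_data) {
		struct sketch_setup_data hdr;

		/* Malformed list: stop rather than read out of bounds. */
		if (pa_data > mem->size || mem->size - pa_data < sizeof(hdr))
			break;

		memcpy(&hdr, mem->buf + pa_data, sizeof(hdr));	/* "map" and copy */
		if (hdr.type == wanted_type) {
			count++;
			*total_len += (uint64_t)hdr.len + sizeof(hdr);
		}
		pa_data = hdr.next;	/* safe: hdr is a private copy */
	}
	return count;
}
```

The bounds check is an addition for the userspace setting; the kernel code shown above trusts the boot loader's list.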
2010-08-25 13:39:17 -07:00
static void __init memblock_x86_reserve_range_setup_data ( void )
2008-07-03 11:37:13 -07:00
{
struct setup_data * data ;
u64 pa_data ;
pa_data = boot_params . hdr . setup_data ;
while ( pa_data ) {
2008-09-07 15:21:16 -07:00
data = early_memremap ( pa_data , sizeof ( * data ) ) ;
2011-07-12 11:16:06 +02:00
memblock_reserve ( pa_data , sizeof ( * data ) + data - > len ) ;
2008-07-03 11:37:13 -07:00
pa_data = data - > next ;
2015-02-24 10:13:28 +01:00
early_memunmap ( data , sizeof ( * data ) ) ;
2008-07-03 11:37:13 -07:00
}
}
2008-06-25 17:57:13 -07:00
/*
* - - - - - - - - - Crashkernel reservation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
*/
2015-09-09 15:38:55 -07:00
# ifdef CONFIG_KEXEC_CORE
2008-06-26 21:54:08 +02:00
2015-10-19 11:17:44 +02:00
/* 16M alignment for crash kernel regions */
# define CRASH_ALIGN (16 << 20)
2010-12-16 19:20:41 -08:00
/*
* Keep the crash kernel below this limit . On 32 bits earlier kernels
* would limit the kernel to the low 512 MiB due to mapping restrictions .
2013-04-15 22:23:47 -07:00
* On 64 bit , old kexec - tools need to be under 896 MiB .
2010-12-16 19:20:41 -08:00
*/
# ifdef CONFIG_X86_32
2015-10-19 11:17:43 +02:00
# define CRASH_ADDR_LOW_MAX (512 << 20)
# define CRASH_ADDR_HIGH_MAX (512 << 20)
2010-12-16 19:20:41 -08:00
# else
2015-10-19 11:17:43 +02:00
# define CRASH_ADDR_LOW_MAX (896UL << 20)
# define CRASH_ADDR_HIGH_MAX MAXMEM
2010-12-16 19:20:41 -08:00
# endif
2015-10-19 11:17:41 +02:00
static int __init reserve_crashkernel_low ( void )
2013-01-24 12:20:11 -08:00
{
# ifdef CONFIG_X86_64
2015-10-19 11:17:45 +02:00
unsigned long long base , low_base = 0 , low_size = 0 ;
2013-01-24 12:20:11 -08:00
unsigned long total_low_mem ;
int ret ;
2015-10-19 11:17:43 +02:00
total_low_mem = memblock_mem_size ( 1UL < < ( 32 - PAGE_SHIFT ) ) ;
x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_high=X crashkernel_low=Y, as that is more extensible.
-v2: according to Vivek, change delimiter to ;
-v3: let high and low only handle the simple form, so that it conforms to
the description in kernel-parameters.txt
still keep crashkernel=X overriding any crashkernel=X,high
crashkernel=Y,low
-v4: update what get_last_crashkernel returns and add stricter
checking in parse_crashkernel_simple(), found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
also separate parse_suffix from parse_simple according to Vivek,
so we can avoid @pos in that path.
-v6: Tighten the checking of crashkernel=X,highblahblah,high,
found by HATAYAMA.
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-15 22:23:48 -07:00
/* crashkernel=Y,low */
2015-10-19 11:17:43 +02:00
ret = parse_crashkernel_low ( boot_command_line , total_low_mem , & low_size , & base ) ;
2015-10-19 11:17:45 +02:00
if ( ret ) {
2013-04-15 22:23:45 -07:00
/*
* two parts from lib / swiotlb . c :
2015-06-10 17:49:42 +02:00
* - swiotlb size : user - specified with swiotlb = or default .
*
* - swiotlb overflow buffer : now hardcoded to 32 k . We round it
* to 8 M for other buffers that may need to stay low too . Also
* make sure we allocate enough extra low memory so that we
* don ' t run out of DMA buffers for 32 - bit devices .
2013-04-15 22:23:45 -07:00
*/
2015-10-19 11:17:43 +02:00
low_size = max ( swiotlb_size_or_default ( ) + ( 8UL < < 20 ) , 256UL < < 20 ) ;
2013-04-15 22:23:45 -07:00
} else {
2013-04-15 22:23:48 -07:00
/* passed with crashkernel=0,low ? */
2013-04-15 22:23:45 -07:00
if ( ! low_size )
2015-10-19 11:17:41 +02:00
return 0 ;
2013-04-15 22:23:45 -07:00
}
2013-01-24 12:20:11 -08:00
2015-10-19 11:17:44 +02:00
low_base = memblock_find_in_range ( low_size , 1ULL < < 32 , low_size , CRASH_ALIGN ) ;
2013-01-24 12:20:11 -08:00
if ( ! low_base ) {
2015-10-19 11:17:41 +02:00
pr_err ( " Cannot reserve %ldMB crashkernel low memory, please try smaller size. \n " ,
( unsigned long ) ( low_size > > 20 ) ) ;
return - ENOMEM ;
2013-01-24 12:20:11 -08:00
}
2015-10-19 11:17:46 +02:00
ret = memblock_reserve ( low_base , low_size ) ;
if ( ret ) {
pr_err ( " %s: Error reserving crashkernel low memblock. \n " , __func__ ) ;
return ret ;
2013-01-24 12:20:11 -08:00
}
pr_info ( " Reserving %ldMB of low memory at %ldMB for crashkernel (System low RAM: %ldMB) \n " ,
2015-10-19 11:17:43 +02:00
( unsigned long ) ( low_size > > 20 ) ,
( unsigned long ) ( low_base > > 20 ) ,
( unsigned long ) ( total_low_mem > > 20 ) ) ;
2013-01-24 12:20:11 -08:00
crashk_low_res . start = low_base ;
crashk_low_res . end = low_base + low_size - 1 ;
insert_resource ( & iomem_resource , & crashk_low_res ) ;
2010-12-16 19:20:41 -08:00
# endif
2015-10-19 11:17:41 +02:00
return 0 ;
2013-01-24 12:20:11 -08:00
}
2010-12-16 19:20:41 -08:00
2008-06-25 18:00:22 -07:00
static void __init reserve_crashkernel ( void )
2008-06-25 17:57:13 -07:00
{
2015-10-19 11:17:45 +02:00
unsigned long long crash_size , crash_base , total_mem ;
2013-04-15 22:23:47 -07:00
bool high = false ;
2008-06-25 17:57:13 -07:00
int ret ;
2012-03-28 14:42:47 -07:00
total_mem = memblock_phys_mem_size ( ) ;
2008-06-25 17:57:13 -07:00
2013-04-15 22:23:47 -07:00
/* crashkernel=XM */
2015-10-19 11:17:43 +02:00
ret = parse_crashkernel ( boot_command_line , total_mem , & crash_size , & crash_base ) ;
2013-04-15 22:23:47 -07:00
if ( ret ! = 0 | | crash_size < = 0 ) {
2013-04-15 22:23:48 -07:00
/* crashkernel=X,high */
2013-04-15 22:23:47 -07:00
ret = parse_crashkernel_high ( boot_command_line , total_mem ,
2015-10-19 11:17:43 +02:00
& crash_size , & crash_base ) ;
2013-04-15 22:23:47 -07:00
if ( ret ! = 0 | | crash_size < = 0 )
return ;
high = true ;
}
2008-06-26 21:54:08 +02:00
/* 0 means: find the address automatically */
if ( crash_base < = 0 ) {
2010-10-05 16:05:14 -07:00
/*
2010-12-16 19:20:41 -08:00
* kexec wants the bzImage to be below CRASH_ADDR_LOW_MAX , or below
* CRASH_ADDR_HIGH_MAX when crashkernel = X , high was given
2010-10-05 16:05:14 -07:00
*/
2015-10-19 11:17:44 +02:00
crash_base = memblock_find_in_range ( CRASH_ALIGN ,
2015-10-19 11:17:43 +02:00
high ? CRASH_ADDR_HIGH_MAX
: CRASH_ADDR_LOW_MAX ,
2015-10-19 11:17:44 +02:00
crash_size , CRASH_ALIGN ) ;
2011-07-12 09:58:09 +02:00
if ( ! crash_base ) {
2009-11-22 17:18:49 -08:00
pr_info ( " crashkernel reservation failed - No suitable area found. \n " ) ;
2008-06-25 17:57:13 -07:00
return ;
}
2013-01-24 12:20:11 -08:00
2008-06-26 21:54:08 +02:00
} else {
2009-11-22 17:18:49 -08:00
unsigned long long start ;
2010-10-05 16:05:14 -07:00
start = memblock_find_in_range ( crash_base ,
2015-10-19 11:17:43 +02:00
crash_base + crash_size ,
crash_size , 1 < < 20 ) ;
2009-11-22 17:18:49 -08:00
if ( start ! = crash_base ) {
pr_info ( " crashkernel reservation failed - memory is in use. \n " ) ;
2008-06-25 17:57:13 -07:00
return ;
}
2008-06-26 21:54:08 +02:00
}
2015-10-19 11:17:46 +02:00
ret = memblock_reserve ( crash_base , crash_size ) ;
if ( ret ) {
pr_err ( " %s: Error reserving crashkernel memblock. \n " , __func__ ) ;
return ;
}
2008-06-25 17:57:13 -07:00
2015-10-19 11:17:41 +02:00
if ( crash_base > = ( 1ULL < < 32 ) & & reserve_crashkernel_low ( ) ) {
memblock_free ( crash_base , crash_size ) ;
return ;
}
2008-06-25 17:57:13 -07:00
2015-10-19 11:17:45 +02:00
pr_info ( " Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB) \n " ,
( unsigned long ) ( crash_size > > 20 ) ,
( unsigned long ) ( crash_base > > 20 ) ,
( unsigned long ) ( total_mem > > 20 ) ) ;
2008-06-25 17:57:13 -07:00
2008-06-26 21:54:08 +02:00
crashk_res . start = crash_base ;
crashk_res . end = crash_base + crash_size - 1 ;
insert_resource ( & iomem_resource , & crashk_res ) ;
2008-06-25 17:57:13 -07:00
}
# else
2008-06-25 18:00:22 -07:00
static void __init reserve_crashkernel ( void )
2008-06-25 17:57:13 -07:00
{
}
# endif
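The sizing decisions made by reserve_crashkernel() and reserve_crashkernel_low() above can be restated as three small pure functions. This is a userspace sketch of the x86_64 case only; the `sketch_` names are hypothetical, and the swiotlb size is passed in by the caller instead of coming from swiotlb_size_or_default().

```c
#include <stdint.h>

#define SKETCH_MIB(x) ((uint64_t)(x) << 20)

/*
 * Default size of the low region when no crashkernel=Y,low is given:
 * max(swiotlb size + 8 MiB, 256 MiB), as in reserve_crashkernel_low().
 */
static uint64_t sketch_default_low_size(uint64_t swiotlb_bytes)
{
	uint64_t want = swiotlb_bytes + SKETCH_MIB(8);
	uint64_t minimum = SKETCH_MIB(256);

	return want > minimum ? want : minimum;
}

/*
 * Upper bound for the automatic search of the main region on x86_64:
 * 896 MiB (CRASH_ADDR_LOW_MAX) unless crashkernel=X,high was used, in
 * which case the caller-supplied maxmem stands in for MAXMEM.
 */
static uint64_t sketch_search_limit(int high, uint64_t maxmem)
{
	return high ? maxmem : SKETCH_MIB(896);
}

/* A separate low region is only reserved when the main one is at or above 4 GiB. */
static int sketch_needs_low_region(uint64_t crash_base)
{
	return crash_base >= (1ULL << 32);
}
```

For example, a swiotlb of 64 MiB (an assumed value for illustration) yields the 256 MiB floor, while a 512 MiB swiotlb yields 520 MiB.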
2008-06-25 17:58:02 -07:00
static struct resource standard_io_resources [ ] = {
{ . name = " dma1 " , . start = 0x00 , . end = 0x1f ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " pic1 " , . start = 0x20 , . end = 0x21 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " timer0 " , . start = 0x40 , . end = 0x43 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " timer1 " , . start = 0x50 , . end = 0x53 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " keyboard " , . start = 0x60 , . end = 0x60 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " keyboard " , . start = 0x64 , . end = 0x64 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " dma page reg " , . start = 0x80 , . end = 0x8f ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " pic2 " , . start = 0xa0 , . end = 0xa1 ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " dma2 " , . start = 0xc0 , . end = 0xdf ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO } ,
{ . name = " fpu " , . start = 0xf0 , . end = 0xff ,
. flags = IORESOURCE_BUSY | IORESOURCE_IO }
} ;
2009-08-19 14:55:50 +02:00
void __init reserve_standard_io_resources ( void )
2008-06-25 17:58:02 -07:00
{
int i ;
/* request I/O space for devices used on all i[345]86 PCs */
for ( i = 0 ; i < ARRAY_SIZE ( standard_io_resources ) ; i + + )
request_resource ( & ioport_resource , & standard_io_resources [ i ] ) ;
}
2010-04-01 14:32:43 -07:00
static __init void reserve_ibft_region ( void )
{
unsigned long addr , size = 0 ;
addr = find_ibft_region ( & size ) ;
if ( size )
2011-07-12 11:16:06 +02:00
memblock_reserve ( addr , size ) ;
2010-04-01 14:32:43 -07:00
}
2012-11-14 20:43:31 +00:00
static bool __init snb_gfx_workaround_needed ( void )
{
2013-01-13 20:56:41 -08:00
# ifdef CONFIG_PCI
2012-11-14 20:43:31 +00:00
int i ;
u16 vendor , devid ;
2013-01-13 20:36:39 -08:00
static const __initconst u16 snb_ids [ ] = {
2012-11-14 20:43:31 +00:00
0x0102 ,
0x0112 ,
0x0122 ,
0x0106 ,
0x0116 ,
0x0126 ,
0x010a ,
} ;
/* Assume no if something weird is going on with PCI */
if ( ! early_pci_allowed ( ) )
return false ;
vendor = read_pci_config_16 ( 0 , 2 , 0 , PCI_VENDOR_ID ) ;
if ( vendor ! = 0x8086 )
return false ;
devid = read_pci_config_16 ( 0 , 2 , 0 , PCI_DEVICE_ID ) ;
for ( i = 0 ; i < ARRAY_SIZE ( snb_ids ) ; i + + )
if ( devid = = snb_ids [ i ] )
return true ;
2013-01-13 20:56:41 -08:00
# endif
2012-11-14 20:43:31 +00:00
return false ;
}
/*
* Sandy Bridge graphics has trouble with certain ranges , exclude
* them from allocation .
*/
static void __init trim_snb_memory ( void )
{
2013-01-13 20:36:39 -08:00
static const __initconst unsigned long bad_pages [ ] = {
2012-11-14 20:43:31 +00:00
0x20050000 ,
0x20110000 ,
0x20130000 ,
0x20138000 ,
0x40004000 ,
} ;
int i ;
if ( ! snb_gfx_workaround_needed ( ) )
return ;
printk ( KERN_DEBUG " reserving inaccessible SNB gfx pages \n " ) ;
/*
* Reserve all memory below the 1 MB mark that has not
* already been reserved .
*/
memblock_reserve ( 0 , 1 < < 20 ) ;
for ( i = 0 ; i < ARRAY_SIZE ( bad_pages ) ; i + + ) {
if ( memblock_reserve ( bad_pages [ i ] , PAGE_SIZE ) )
printk ( KERN_WARNING " failed to reserve 0x%08lx \n " ,
bad_pages [ i ] ) ;
}
}
/*
* Here we put platform - specific memory range workarounds , i . e .
* memory known to be corrupt or otherwise in need of reservation on
* specific platforms .
*
* If this gets used more widely it could use a real dispatch mechanism .
*/
static void __init trim_platform_memory_ranges ( void )
{
trim_snb_memory ( ) ;
}
2010-01-22 11:21:04 +08:00
static void __init trim_bios_range ( void )
{
/*
* A special case is the first 4 Kb of memory ;
* This is a BIOS owned area , not kernel ram , but generally
* not listed as such in the E820 table .
2010-08-24 17:32:04 -07:00
*
* This typically reserves additional memory ( 64 KiB by default )
* since some BIOSes are known to corrupt low memory . See the
2010-08-25 16:38:20 -07:00
* Kconfig help text for X86_RESERVE_LOW .
2010-01-22 11:21:04 +08:00
*/
2017-01-28 17:09:33 +01:00
e820__range_update ( 0 , PAGE_SIZE , E820_TYPE_RAM , E820_TYPE_RESERVED ) ;
2010-08-24 17:32:04 -07:00
2010-01-22 11:21:04 +08:00
/*
* Special case : some BIOSes report the PC BIOS
* area ( 640 KiB - > 1 MiB ) as RAM even though it is not .
* Take it out .
*/
2017-01-28 17:09:33 +01:00
e820__range_remove ( BIOS_BEGIN , BIOS_END - BIOS_BEGIN , E820_TYPE_RAM , 1 ) ;
2012-11-14 20:43:31 +00:00
x86/boot/e820: Simplify the e820__update_table() interface
The e820__update_table() parameters are pretty complex:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);
But 90% of the usage is trivial:
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries))
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries) < 0)
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
arch/x86/xen/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
as it only uses an existing struct e820_table's entries array, its size and
its current number of entries as input and output arguments.
Only one use is non-trivial:
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
... which call updates the E820 table in the zeropage in-situ, and the layout there does not
match that of 'struct e820_table' (in particular nr_entries is at a different offset,
hardcoded by the boot protocol).
Simplify all this by introducing a low level __e820__update_table() API that
the zeropage update call can use, and simplifying the main e820__update_table()
call signature down to:
int e820__update_table(struct e820_table *table);
This visibly simplifies all the call sites:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_table *table);
arch/x86/include/asm/e820/types.h: * call to e820__update_table() to remove duplicates. The allowance
arch/x86/kernel/e820.c: * The return value from e820__update_table() is zero if it
arch/x86/kernel/e820.c:int __init e820__update_table(struct e820_table *table)
arch/x86/kernel/e820.c: if (e820__update_table(e820_table))
arch/x86/kernel/e820.c: e820__update_table(e820_table_firmware);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table) < 0)
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
arch/x86/xen/setup.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 18:00:35 +01:00
e820__update_table ( e820_table ) ;
2010-01-22 11:21:04 +08:00
}
2013-01-24 12:19:45 -08:00
/* called before trim_bios_range() to spare an extra sanitization pass */
static void __init e820_add_kernel_range ( void )
{
u64 start = __pa_symbol ( _text ) ;
u64 size = __pa_symbol ( _end ) - start ;
/*
2017-01-28 17:09:33 +01:00
* Complain if . text . data and . bss are not marked as E820_TYPE_RAM and
2013-01-24 12:19:45 -08:00
* attempt to fix it by adding the range . We may have a confused BIOS ,
* or the user may have used memmap = exactmap or memmap = xxM $ yyM to
* exclude kernel range . If we really are running on top of non - RAM ,
* we will crash later anyway .
*/
2017-01-28 17:09:33 +01:00
if ( e820__mapped_all ( start , start + size , E820_TYPE_RAM ) )
2013-01-24 12:19:45 -08:00
return ;
2017-01-28 17:09:33 +01:00
pr_warn ( " .text .data .bss are not marked as E820_TYPE_RAM! \n " ) ;
e820__range_remove ( start , size , E820_TYPE_RAM , 0 ) ;
e820__range_add ( start , size , E820_TYPE_RAM ) ;
2013-01-24 12:19:45 -08:00
}
2013-02-14 14:02:52 -08:00
static unsigned reserve_low = CONFIG_X86_RESERVE_LOW < < 10 ;
2010-08-25 16:38:20 -07:00
static int __init parse_reservelow ( char * p )
{
unsigned long long size ;
if ( ! p )
return - EINVAL ;
size = memparse ( p , & p ) ;
if ( size < 4096 )
size = 4096 ;
if ( size > 640 * 1024 )
size = 640 * 1024 ;
reserve_low = size ;
return 0 ;
}
early_param ( " reservelow " , parse_reservelow ) ;
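The clamping done by parse_reservelow() above can be tried out in userspace. In this sketch `sketch_memparse` is a simplified, hypothetical stand-in for the kernel's memparse() that only understands the K, M and G suffixes; the clamp to the range 4 KiB to 640 KiB is taken from the code above.

```c
#include <stdlib.h>

/*
 * Simplified stand-in for the kernel's memparse(): a number followed by
 * an optional K, M or G suffix. The real helper accepts more suffixes
 * and also reports where parsing stopped.
 */
static unsigned long long sketch_memparse(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}

/* Clamp the requested size to [4 KiB, 640 KiB], as parse_reservelow() does. */
static unsigned long long sketch_reservelow(const char *arg)
{
	unsigned long long size = sketch_memparse(arg);

	if (size < 4096)
		size = 4096;
	if (size > 640 * 1024)
		size = 640 * 1024;
	return size;
}
```

So reservelow=1K is raised to 4096 bytes, reservelow=64K is kept as 65536, and reservelow=1M is cut down to 655360.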
2013-02-14 14:02:52 -08:00
static void __init trim_low_memory_range ( void )
{
memblock_reserve ( 0 , ALIGN ( reserve_low , PAGE_SIZE ) ) ;
}
2013-10-10 17:18:17 -07:00
/*
* Dump out kernel offset information on panic .
*/
static int
dump_kernel_offset ( struct notifier_block * self , unsigned long v , void * p )
{
2015-04-01 12:49:52 +02:00
if ( kaslr_enabled ( ) ) {
pr_emerg ( " Kernel Offset: 0x%lx from 0x%lx (relocation range: 0x%lx-0x%lx) \n " ,
2015-04-27 13:17:19 +02:00
kaslr_offset ( ) ,
2015-04-01 12:49:52 +02:00
__START_KERNEL ,
__START_KERNEL_map ,
MODULES_VADDR - 1 ) ;
} else {
pr_emerg ( " Kernel Offset: disabled \n " ) ;
}
2013-10-10 17:18:17 -07:00
return 0 ;
}
2005-04-16 15:20:36 -07:00
/*
* Determine if we were loaded by an EFI loader . If so , then we have also been
* passed the efi memmap , systab , etc . , so we should use these data structures
* for initialization . Note , the efi init code path is determined by the
* global efi_enabled . This allows the same kernel image to be used on existing
* systems ( with a traditional BIOS ) as well as on EFI systems .
*/
2008-06-25 17:52:35 -07:00
/*
* setup_arch - architecture - specific boot - time initializations
*
* Note : On x86_64 , fixmaps are ready for use even before this is called .
*/
2005-04-16 15:20:36 -07:00
void __init setup_arch ( char * * cmdline_p )
{
2013-01-24 12:20:12 -08:00
memblock_reserve ( __pa_symbol ( _text ) ,
( unsigned long ) __bss_stop - ( unsigned long ) _text ) ;
2013-01-24 12:19:55 -08:00
early_reserve_initrd ( ) ;
2013-01-24 12:20:12 -08:00
/*
* At this point everything still needed from the boot loader
* or BIOS or kernel text should be early reserved or marked not
* RAM in e820 . All other memory is free game .
*/
2008-06-25 17:52:35 -07:00
# ifdef CONFIG_X86_32
2005-04-16 15:20:36 -07:00
memcpy ( & boot_cpu_data , & new_cpu_data , sizeof ( new_cpu_data ) ) ;
2010-08-28 15:58:33 +02:00
/*
* copy kernel address range established so far and switch
* to the proper swapper page table
*/
clone_pgd_range ( swapper_pg_dir + KERNEL_PGD_BOUNDARY ,
initial_page_table + KERNEL_PGD_BOUNDARY ,
KERNEL_PGD_PTRS ) ;
load_cr3 ( swapper_pg_dir ) ;
2014-10-07 01:19:48 +01:00
/*
* Note : Quark X1000 CPUs advertise PGE incorrectly and require
* a cr3 based tlb flush , so the following __flush_tlb_all ( )
* will not flush anything because the cpu quirk which clears
* X86_FEATURE_PGE has not been invoked yet . Though due to the
* load_cr3 ( ) above the TLB has been flushed already . The
* quirk is invoked before subsequent calls to __flush_tlb_all ( )
* so proper operation is guaranteed .
*/
2010-08-28 15:58:33 +02:00
__flush_tlb_all ( ) ;
2008-06-25 17:52:35 -07:00
# else
printk ( KERN_INFO " Command line: %s \n " , boot_command_line ) ;
# endif
2005-04-16 15:20:36 -07:00
2010-08-23 14:49:11 -07:00
/*
* If we have OLPC OFW , we might end up relocating the fixmap due to
* reserve_top ( ) , so do this before touching the ioremap area .
*/
2010-06-18 17:46:53 -04:00
olpc_ofw_detect ( ) ;
2010-05-20 21:04:29 -05:00
early_trap_init ( ) ;
2008-07-21 16:49:54 -07:00
early_cpu_init ( ) ;
2008-06-29 20:02:44 -07:00
early_ioremap_init ( ) ;
2010-06-18 17:46:53 -04:00
setup_olpc_ofw_pgd ( ) ;
2007-10-15 17:13:22 -07:00
ROOT_DEV = old_decode_dev ( boot_params . hdr . root_dev ) ;
screen_info = boot_params . screen_info ;
edid_info = boot_params . edid_info ;
2008-06-25 17:52:35 -07:00
# ifdef CONFIG_X86_32
2007-10-15 17:13:22 -07:00
apm_info . bios = boot_params . apm_bios_info ;
ist_info = boot_params . ist_info ;
2008-06-25 17:52:35 -07:00
# endif
saved_video_mode = boot_params . hdr . vid_mode ;
2007-10-15 17:13:22 -07:00
bootloader_type = boot_params . hdr . type_of_loader ;
2009-05-07 16:54:11 -07:00
if ( ( bootloader_type > > 4 ) = = 0xe ) {
bootloader_type & = 0xf ;
bootloader_type | = ( boot_params . hdr . ext_loader_type + 0x10 ) < < 4 ;
}
bootloader_version = bootloader_type & 0xf ;
bootloader_version | = boot_params . hdr . ext_loader_ver < < 4 ;
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_BLK_DEV_RAM
2007-10-15 17:13:22 -07:00
rd_image_start = boot_params . hdr . ram_size & RAMDISK_IMAGE_START_MASK ;
rd_prompt = ( ( boot_params . hdr . ram_size & RAMDISK_PROMPT_FLAG ) ! = 0 ) ;
rd_doload = ( ( boot_params . hdr . ram_size & RAMDISK_LOAD_FLAG ) ! = 0 ) ;
2005-04-16 15:20:36 -07:00
# endif
2008-06-23 19:53:33 -07:00
# ifdef CONFIG_EFI
if ( ! strncmp ( ( char * ) & boot_params . efi_info . efi_loader_signature ,
2014-06-30 19:53:03 +02:00
EFI32_LOADER_SIGNATURE , 4 ) ) {
2014-01-15 13:21:22 +00:00
set_bit ( EFI_BOOT , & efi . flags ) ;
2012-02-12 13:24:29 -08:00
} else if ( ! strncmp ( ( char * ) & boot_params . efi_info . efi_loader_signature ,
2014-06-30 19:53:03 +02:00
EFI64_LOADER_SIGNATURE , 4 ) ) {
2014-01-15 13:21:22 +00:00
set_bit ( EFI_BOOT , & efi . flags ) ;
set_bit ( EFI_64BIT , & efi . flags ) ;
2008-06-23 19:53:33 -07:00
}
2012-11-14 09:42:35 +00:00
if ( efi_enabled ( EFI_BOOT ) )
efi_memblock_x86_reserve_range ( ) ;
2008-06-23 19:53:33 -07:00
# endif
2009-08-20 13:04:10 +02:00
x86_init . oem . arch_setup ( ) ;
2008-01-30 13:31:19 +01:00
2010-10-26 15:41:49 -06:00
iomem_resource . end = ( 1ULL < < boot_cpu_data . x86_phys_bits ) - 1 ;
2017-01-28 09:58:49 +01:00
e820__memory_setup ( ) ;
2008-06-30 16:20:54 -07:00
parse_setup_data ( ) ;
2005-04-16 15:20:36 -07:00
copy_edd ( ) ;
2007-10-15 17:13:22 -07:00
if ( ! boot_params . hdr . root_flags )
2005-04-16 15:20:36 -07:00
root_mountflags & = ~ MS_RDONLY ;
init_mm . start_code = ( unsigned long ) _text ;
init_mm . end_code = ( unsigned long ) _etext ;
init_mm . end_data = ( unsigned long ) _edata ;
x86: add brk allocation for very, very early allocations
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss are reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-26 17:35:44 -08:00
init_mm . brk = _brk_end ;
x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to enable or disable the
management of bounds tables in the kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
request the kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
#BR exceptions are similar conceptually to a page fault and will be raised by
the MPX hardware both on bounds violations and when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal process's
address space (essentially calling the new mmap() interface introduced
earlier in this patch set) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-populated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
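The sizing argument in the first answer above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text (4 bytes of bounds table per byte covered, plus a 2GB bounds directory); the helper name and constants are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_GB (1ULL << 30)
#define DEMO_TB (1ULL << 40)

/*
 * Virtual space needed to preallocate bounds tables for covered_bytes
 * of address space: 4x the covered area plus a 2GB bounds directory,
 * per the figures quoted in the text.
 */
static uint64_t demo_mpx_reservation(uint64_t covered_bytes)
{
	/* Guard against 64-bit overflow in the multiplication below. */
	assert(covered_bytes <= (UINT64_MAX - 2 * DEMO_GB) / 4);

	return 4 * covered_bytes + 2 * DEMO_GB;
}
```

For the 128TB user address space mentioned above, demo_mpx_reservation(128 * DEMO_TB) yields 512TB + 2GB, which is the figure the answer gives.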
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-14 07:18:29 -08:00
mpx_mm_init ( & init_mm ) ;
2005-04-16 15:20:36 -07:00
2016-04-14 11:18:57 -07:00
code_resource . start = __pa_symbol ( _text ) ;
code_resource . end = __pa_symbol ( _etext ) - 1 ;
data_resource . start = __pa_symbol ( _etext ) ;
data_resource . end = __pa_symbol ( _edata ) - 1 ;
bss_resource . start = __pa_symbol ( __bss_start ) ;
bss_resource . end = __pa_symbol ( __bss_stop ) - 1 ;
2008-08-12 12:52:36 -07:00
# ifdef CONFIG_CMDLINE_BOOL
# ifdef CONFIG_CMDLINE_OVERRIDE
strlcpy ( boot_command_line , builtin_cmdline , COMMAND_LINE_SIZE ) ;
# else
if ( builtin_cmdline [ 0 ] ) {
/* append boot loader cmdline to builtin */
strlcat ( builtin_cmdline , " " , COMMAND_LINE_SIZE ) ;
strlcat ( builtin_cmdline , boot_command_line , COMMAND_LINE_SIZE ) ;
strlcpy ( boot_command_line , builtin_cmdline , COMMAND_LINE_SIZE ) ;
}
# endif
# endif
2009-09-19 11:07:57 -07:00
strlcpy ( command_line , boot_command_line , COMMAND_LINE_SIZE ) ;
* cmdline_p = command_line ;
/*
2009-11-13 15:28:17 -08:00
* x86_configure_nx ( ) is called before parse_early_param ( ) to detect
* whether hardware doesn ' t support NX ( so that the early EHCI debug
* console setup can safely call set_fixmap ( ) ) . It may then be called
* again from within noexec_setup ( ) during parsing early parameters
* to honor the respective command line option .
2009-09-19 11:07:57 -07:00
*/
2009-11-13 15:28:16 -08:00
x86_configure_nx ( ) ;
2009-09-19 11:07:57 -07:00
parse_early_param ( ) ;
mm: remove x86-only restriction of movable_node
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:
1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().
Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.
This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced. (1) will happen for them, but (2) will not.
This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable. Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.
So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.
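To make the effect of the direction toggle concrete, here is a toy model with one free range and one direction flag. It is not memblock code: the struct, the flag and both functions are hypothetical, and real memblock handles many regions, alignment and reservations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single free range [start, end). */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

static bool demo_bottom_up;	/* mirrors the direction flag the text toggles */

static void demo_set_bottom_up(bool enable)
{
	demo_bottom_up = enable;
}

/* Returns the base of a size-byte block, or UINT64_MAX if it does not fit. */
static uint64_t demo_find(const struct demo_range *free_range, uint64_t size)
{
	assert(free_range->start <= free_range->end);

	if (size == 0 || size > free_range->end - free_range->start)
		return UINT64_MAX;

	/* Bottom-up takes the lowest address, top-down the highest. */
	return demo_bottom_up ? free_range->start : free_range->end - size;
}
```

In this model an arch that sets the flag and never clears it keeps allocating from the low end, which is the imbalance between (1) and (2) that the message describes.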
Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 16:42:55 -08:00
# ifdef CONFIG_MEMORY_HOTPLUG
/*
* Memory used by the kernel cannot be hot - removed because Linux
* cannot migrate the kernel pages . When memory hotplug is
* enabled , we should prevent memblock from allocating memory
* for the kernel .
*
* ACPI SRAT records all hotpluggable memory ranges . But before
* SRAT is parsed , we don ' t know about it .
*
* The kernel image is loaded into memory at very early time . We
* cannot prevent this anyway . So on NUMA system , we set any
* node the kernel resides in as un - hotpluggable .
*
* Since on modern servers , one node could have double - digit
* gigabytes memory , we can assume the memory around the kernel
* image is also un - hotpluggable . So before SRAT is parsed , just
* allocate memory near the kernel image to try the best to keep
* the kernel away from hotpluggable memory .
*/
if ( movable_node_is_enabled ( ) )
memblock_set_bottom_up ( true ) ;
# endif
2009-11-13 15:28:17 -08:00
x86_report_nx ( ) ;
2008-09-11 16:42:00 -07:00
2008-06-30 16:20:54 -07:00
/* after early param, so could get panic from serial */
2010-08-25 13:39:17 -07:00
memblock_x86_reserve_range_setup_data ( ) ;
2008-06-30 16:20:54 -07:00
2008-06-25 17:52:35 -07:00
if ( acpi_mps_check ( ) ) {
2008-06-23 22:19:22 +02:00
# ifdef CONFIG_X86_LOCAL_APIC
2008-06-25 17:52:35 -07:00
disable_apic = 1 ;
2008-06-23 22:19:22 +02:00
# endif
2008-07-21 11:21:43 -07:00
setup_clear_cpu_cap ( X86_FEATURE_APIC ) ;
2008-06-20 16:11:20 -07:00
}
2008-07-16 17:25:46 -07:00
# ifdef CONFIG_PCI
if ( pci_early_dump_regs )
early_dump_pci_devices ( ) ;
# endif
2017-01-28 22:27:28 +01:00
e820__reserve_setup_data ( ) ;
2017-01-28 13:37:17 +01:00
e820__finish_early_params ( ) ;
2006-09-26 10:52:32 +02:00
2012-11-14 09:42:35 +00:00
if ( efi_enabled ( EFI_BOOT ) )
2009-03-03 21:55:31 -05:00
efi_init ( ) ;
2008-09-22 02:52:26 -07:00
dmi_scan_machine ( ) ;
2013-10-18 14:29:25 -07:00
dmi_memdev_walk ( ) ;
2013-04-30 15:27:15 -07:00
dmi_set_dump_stack_arch_desc ( ) ;
2008-09-22 02:52:26 -07:00
2008-10-27 10:41:46 -07:00
/*
* VMware detection requires dmi to be available , so this
* needs to be done after dmi_scan_machine , for the BP .
*/
2009-08-20 17:06:25 +02:00
init_hypervisor_platform ( ) ;
2008-10-27 10:41:46 -07:00
2009-08-19 14:43:56 +02:00
x86_init . resources . probe_roms ( ) ;
2008-06-16 13:03:31 -07:00
2016-04-14 11:18:57 -07:00
/* after parse_early_param, so could debug it */
insert_resource ( & iomem_resource , & code_resource ) ;
insert_resource ( & iomem_resource , & data_resource ) ;
insert_resource ( & iomem_resource , & bss_resource ) ;
2013-01-24 12:19:45 -08:00
e820_add_kernel_range ( ) ;
2010-01-22 11:21:04 +08:00
trim_bios_range ( ) ;
2008-06-25 17:52:35 -07:00
# ifdef CONFIG_X86_32
2008-06-16 16:11:08 -07:00
if ( ppro_with_ram_bug ( ) ) {
2017-01-28 17:09:33 +01:00
e820__range_update ( 0x70000000ULL , 0x40000ULL , E820_TYPE_RAM ,
E820_TYPE_RESERVED ) ;
x86/boot/e820: Simplify the e820__update_table() interface
The e820__update_table() parameters are pretty complex:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);
But 90% of the usage is trivial:
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries))
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries) < 0)
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/kernel/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
arch/x86/xen/setup.c: e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
arch/x86/xen/setup.c: e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
as it only uses an existing struct e820_table's entries array, its size and
its current number of entries as input and output arguments.
Only one use is non-trivial:
arch/x86/kernel/e820.c: e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
... which call updates the E820 table in the zeropage in-situ, and the layout there does not
match that of 'struct e820_table' (in particular nr_entries is at a different offset,
hardcoded by the boot protocol).
Simplify all this by introducing a low level __e820__update_table() API that
the zeropage update call can use, and simplifying the main e820__update_table()
call signature down to:
int e820__update_table(struct e820_table *table);
This visibly simplifies all the call sites:
arch/x86/include/asm/e820/api.h:extern int e820__update_table(struct e820_table *table);
arch/x86/include/asm/e820/types.h: * call to e820__update_table() to remove duplicates. The allowance
arch/x86/kernel/e820.c: * The return value from e820__update_table() is zero if it
arch/x86/kernel/e820.c:int __init e820__update_table(struct e820_table *table)
arch/x86/kernel/e820.c: if (e820__update_table(e820_table))
arch/x86/kernel/e820.c: e820__update_table(e820_table_firmware);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: e820__update_table(e820_table);
arch/x86/kernel/e820.c: if (e820__update_table(e820_table) < 0)
arch/x86/kernel/early-quirks.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/kernel/setup.c: e820__update_table(e820_table);
arch/x86/platform/efi/efi.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
arch/x86/xen/setup.c: e820__update_table(e820_table);
arch/x86/xen/setup.c: e820__update_table(&xen_e820_table);
No change in functionality.
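The shape of this refactoring, a low-level function taking (array, capacity, count pointer) plus a thin wrapper taking the whole table, can be sketched as follows. The types and function names are invented for the example, and the stand-in "update" merely drops empty entries; it does not reproduce the real sanitizing logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES 8

struct demo_entry {
	uint64_t addr;
	uint64_t size;
};

struct demo_table {
	uint32_t nr_entries;
	struct demo_entry entries[DEMO_MAX_ENTRIES];
};

/*
 * Low-level form for callers whose count lives outside a demo_table.
 * Stand-in behaviour: compacts the array in place, dropping empty entries.
 * Returns 0 on success, -1 if the count exceeds the capacity.
 */
static int demo_update_raw(struct demo_entry *map, size_t max_nr, uint32_t *pnr)
{
	uint32_t in, out = 0;

	assert(map != NULL && pnr != NULL);

	if (*pnr > max_nr)
		return -1;

	for (in = 0; in < *pnr; in++) {
		if (map[in].size != 0)
			map[out++] = map[in];
	}
	*pnr = out;
	return 0;
}

/* Simplified form: the table already knows its array, capacity and count. */
static int demo_update(struct demo_table *table)
{
	return demo_update_raw(table->entries,
			       sizeof(table->entries) / sizeof(table->entries[0]),
			       &table->nr_entries);
}
```

Callers that hold a full table pass one pointer, while a caller with a differently laid out count field can still use the raw form directly.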
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 18:00:35 +01:00
e820__update_table ( e820_table ) ;
2008-06-16 16:11:08 -07:00
printk ( KERN_INFO " fixed physical RAM map: \n " ) ;
2017-01-28 14:24:02 +01:00
e820__print_table ( " bad_ppro " ) ;
2008-06-16 16:11:08 -07:00
}
2008-06-25 17:52:35 -07:00
# else
early_gart_iommu_check ( ) ;
# endif
2008-06-16 16:11:08 -07:00
2008-06-03 19:35:04 -07:00
/*
* partially used pages are not usable - thus
* we are rounding upwards :
*/
2008-07-10 20:38:26 -07:00
max_pfn = e820_end_of_ram_pfn ( ) ;
2008-06-03 19:35:04 -07:00
2008-01-30 13:33:32 +01:00
/* update e820 for memory not covered by WB MTRRs */
mtrr_bp_init ( ) ;
2008-07-08 18:56:38 -07:00
if ( mtrr_trim_uncached_memory ( max_pfn ) )
2008-07-10 20:38:26 -07:00
max_pfn = e820_end_of_ram_pfn ( ) ;
2008-03-23 00:16:49 -07:00
2015-12-04 14:07:05 +01:00
max_possible_pfn = max_pfn ;
2016-08-09 10:11:04 -07:00
/*
* Define random base addresses for memory sections after max_pfn is
* defined and before each memory section base is used .
*/
kernel_randomize_memory ( ) ;
2008-06-25 17:52:35 -07:00
# ifdef CONFIG_X86_32
2008-06-24 12:18:14 -07:00
/* max_low_pfn gets updated here */
2008-06-23 03:05:30 -07:00
find_low_pfn_range ( ) ;
2008-06-25 17:52:35 -07:00
# else
2009-02-16 17:29:58 -08:00
check_x2apic ( ) ;
2008-06-25 17:52:35 -07:00
/* How many end-of-memory variables you have, grandma! */
/* need this before calling reserve_initrd */
2008-07-10 20:38:26 -07:00
if ( max_pfn > ( 1UL < < ( 32 - PAGE_SHIFT ) ) )
max_low_pfn = e820_end_of_low_ram_pfn ( ) ;
else
max_low_pfn = max_pfn ;
2008-06-25 17:52:35 -07:00
high_memory = ( void * ) __va ( max_pfn * PAGE_SIZE - 1 ) + 1 ;
2008-09-07 01:51:32 -07:00
# endif
2009-12-10 13:07:22 -08:00
/*
* Find and reserve possible boot - time SMP configuration :
*/
find_smp_config ( ) ;
2010-04-01 14:32:43 -07:00
reserve_ibft_region ( ) ;
2012-11-16 19:38:58 -08:00
early_alloc_pgt_buf ( ) ;
2010-08-25 13:39:17 -07:00
/*
x86/boot/e820: Rename memblock_x86_fill() to e820__memblock_setup() and improve the explanations
So memblock_x86_fill() is another E820 code misnomer:
- nothing in its name tells us that it's part of the E820 subsystem ...
- The 'fill' wording is ambiguous and doesn't tell us whether it's a single
entry or some process - while the _real_ purpose of the function is hidden,
which is to do a complete setup of the (platform independent) memblock regions.
So rename it accordingly, to e820__memblock_setup().
Also translate this incomprehensible and misleading comment:
/*
* EFI may have more than 128 entries
* We are safe to enable resizing, beause memblock_x86_fill()
* is rather later for x86
*/
memblock_allow_resize();
The worst aspect of this comment isn't even the sloppy typos, but that it
casually mentions a '128' number with no explanation, which leads one
to assume that this is related to the well-known limit of a maximum
of 128 E820 entries passed via legacy bootloaders.
But no, the _real_ meaning of 128 here is that of the memblock subsystem,
which too happens to have a 128 entries limit for very early memblock
regions (which is unrelated to E820), via INIT_MEMBLOCK_REGIONS ...
So change the comment to a more comprehensible version:
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
*
* This is safe, because this call happens pretty late during x86 setup,
* so we know about reserved memory regions already. (This is important
* so that memblock resizing does not stomp over reserved areas.)
*/
memblock_allow_resize();
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 11:37:42 +01:00
* Need to conclude brk , before e820__memblock_setup ( )
2010-08-25 13:39:17 -07:00
* it could use memblock_find_in_range , could overlap with
* brk area .
*/
reserve_brk ( ) ;
2011-02-18 11:30:30 +00:00
cleanup_highmap ( ) ;
2013-08-14 11:44:04 +08:00
memblock_set_current_limit ( ISA_END_ADDRESS ) ;
2017-01-28 11:37:42 +01:00
e820__memblock_setup ( ) ;
2010-08-25 13:39:17 -07:00
2016-06-21 23:11:38 +01:00
reserve_bios_regions ( ) ;
if ( efi_enabled ( EFI_MEMMAP ) ) {
2015-09-30 23:01:56 +09:00
efi_fake_memmap ( ) ;
2015-06-24 16:58:15 -07:00
efi_find_mirror ( ) ;
x86/efi: Defer efi_esrt_init until after memblock_x86_fill
Commit 7b02d53e7852 ("efi: Allow drivers to reserve boot services forever")
introduced a new efi_mem_reserve to reserve the boot services memory
regions forever. This reservation involves allocating a new EFI memory
range descriptor. However, allocation can only succeed if there is memory
available for the allocation. Otherwise, an error such as the following may
occur:
esrt: Reserving ESRT space from 0x000000003dd6a000 to 0x000000003dd6a010.
Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0 bytes below \
0x0.
CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc5+ #503
0000000000000000 ffffffff81e03ce0 ffffffff8131dae8 ffffffff81bb6c50
ffffffff81e03d70 ffffffff81e03d60 ffffffff8111f4df 0000000000000018
ffffffff81e03d70 ffffffff81e03d08 00000000000009f0 00000000000009f0
Call Trace:
[<ffffffff8131dae8>] dump_stack+0x4d/0x65
[<ffffffff8111f4df>] panic+0xc5/0x206
[<ffffffff81f7c6d3>] memblock_alloc_base+0x29/0x2e
[<ffffffff81f7c6e3>] memblock_alloc+0xb/0xd
[<ffffffff81f6c86d>] efi_arch_mem_reserve+0xbc/0x134
[<ffffffff81fa3280>] efi_mem_reserve+0x2c/0x31
[<ffffffff81fa3280>] ? efi_mem_reserve+0x2c/0x31
[<ffffffff81fa40d3>] efi_esrt_init+0x19e/0x1b4
[<ffffffff81f6d2dd>] efi_init+0x398/0x44a
[<ffffffff81f5c782>] setup_arch+0x415/0xc30
[<ffffffff81f55af1>] start_kernel+0x5b/0x3ef
[<ffffffff81f55434>] x86_64_start_reservations+0x2f/0x31
[<ffffffff81f55520>] x86_64_start_kernel+0xea/0xed
---[ end Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0
bytes below 0x0.
An inspection of the memblock configuration reveals that there is no memory
available for the allocation:
MEMBLOCK configuration:
memory size = 0x0 reserved size = 0x4f339c0
memory.cnt = 0x1
memory[0x0] [0x00000000000000-0xffffffffffffffff], 0x0 bytes on node 0\
flags: 0x0
reserved.cnt = 0x4
reserved[0x0] [0x0000000008c000-0x0000000008c9bf], 0x9c0 bytes flags: 0x0
reserved[0x1] [0x0000000009f000-0x000000000fffff], 0x61000 bytes\
flags: 0x0
reserved[0x2] [0x00000002800000-0x0000000394bfff], 0x114c000 bytes\
flags: 0x0
reserved[0x3] [0x000000304e4000-0x00000034269fff], 0x3d86000 bytes\
flags: 0x0
This situation can be avoided if we call efi_esrt_init after memblock has
memory regions for the allocation.
Also, the EFI ESRT driver makes use of early_memremap'pings. Therefore, we
do not want to defer efi_esrt_init for too long. We must call this function
while calls to early_memremap are still valid.
A good place to meet the two aforementioned conditions is right after
memblock_x86_fill, grouped with other EFI-related functions.
Reported-by: Scott Lawson <scott.lawson@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-08-16 17:32:31 -07:00
efi_esrt_init ( ) ;
2016-08-10 02:29:13 -07:00
2016-06-21 23:11:38 +01:00
/*
* The EFI specification says that boot service code won ' t be
* called after ExitBootServices ( ) . This is , in fact , a lie .
*/
x86, efi: Retain boot service code until after switching to virtual mode
UEFI stands for "Unified Extensible Firmware Interface", where "Firmware"
is an ancient African word meaning "Why do something right when you can
do it so wrong that children will weep and brave adults will cower before
you", and "UEI" is Celtic for "We missed DOS so we burned it into your
ROMs". The UEFI specification provides for runtime services (ie, another
way for the operating system to be forced to depend on the firmware) and
we rely on these for certain trivial tasks such as setting up the
bootloader. But some hardware fails to work if we attempt to use these
runtime services from physical mode, and so we have to switch into virtual
mode. So far so dreadful.
The specification makes it clear that the operating system is free to do
whatever it wants with boot services code after ExitBootServices() has been
called. SetVirtualAddressMap() can't be called until ExitBootServices() has
been. So, obviously, a whole bunch of EFI implementations call into boot
services code when we do that. Since we've been charmingly naive and
trusted that the specification may be somehow relevant to the real world,
we've already stuffed a picture of a penguin or something in that address
space. And just to make things more entertaining, we've also marked it
non-executable.
This patch allocates the boot services regions during EFI init and makes
sure that they're executable. Then, after SetVirtualAddressMap(), it
discards them and everyone lives happily ever after. Except for the ones
who have to work on EFI, who live sad lives haunted by the knowledge that
someone's eventually going to write yet another firmware specification.
[ hpa: adding this to urgent with a stable tag since it fixes currently-broken
hardware. However, I do not know what the dependencies are and so I do
not know which -stable versions this may be a candidate for. ]
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1306331593-28715-1-git-send-email-mjg@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@kernel.org>
2011-05-25 09:53:13 -04:00
efi_reserve_boot_services ( ) ;
2016-06-21 23:11:38 +01:00
}
2011-05-25 09:53:13 -04:00
2010-08-25 13:39:17 -07:00
/* preallocate 4k for mptable mpc */
2017-01-28 13:46:28 +01:00
e820__memblock_alloc_reserved_mpc_new ( ) ;
2010-08-25 13:39:17 -07:00
# ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check ( ) ;
# endif
2013-01-24 12:19:54 -08:00
# ifdef CONFIG_X86_32
2012-05-29 15:06:29 -07:00
printk ( KERN_DEBUG " initial memory mapped: [mem 0x00000000-%#010lx] \n " ,
( max_pfn_mapped < < PAGE_SHIFT ) - 1 ) ;
2013-01-24 12:19:54 -08:00
# endif
2010-08-25 13:39:17 -07:00
2013-01-24 12:19:51 -08:00
reserve_real_mode ( ) ;
2009-12-10 13:07:22 -08:00
2012-11-14 20:43:31 +00:00
trim_platform_memory_ranges ( ) ;
2013-02-14 14:02:52 -08:00
trim_low_memory_range ( ) ;
2012-11-14 20:43:31 +00:00
2012-11-16 19:38:41 -08:00
init_mem_mapping ( ) ;
2011-10-20 16:15:26 -05:00
x86, 64bit: Use a #PF handler to materialize early mappings on demand
Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
64-bit code has to use page tables. This makes it awkward before we
have first set up properly all-covering page tables to access objects
that are outside the static kernel range.
So far we have dealt with that simply by mapping a fixed amount of
low memory, but that fails in at least two upcoming use cases:
1. We will support loading and running the kernel, struct boot_params, ramdisk,
command line, etc. above the 4 GiB mark.
2. We need to access the ramdisk early to get the microcode and apply the
update as early as possible.
We could use early_iomap to access them too, but that would make the code too
messy and hard to unify with 32 bit.
Hence, set up a #PF table and use a fixed number of buffers to set up
page tables on demand. If the buffers fill up then we simply flush
them and start over. These buffers are all in __initdata, so it does
not increase RAM usage at runtime.
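The "fixed number of buffers, flush and start over" idea above can be sketched in plain C. This is only an illustration under stated assumptions: the names pool_get_page and pool_reset and the pool size are hypothetical and not the kernel's. A real handler would also have to drop the top-level entries that point into the buffers, which is reduced to zeroing the pool here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_PAGES 4       /* hypothetical pool size, chosen for illustration */
#define PAGE_BYTES 4096

static unsigned char pool[POOL_PAGES][PAGE_BYTES];
static size_t pool_next;       /* index of the next unused buffer */
static unsigned pool_resets;   /* number of times the pool was flushed */

/* Forget every table built so far and start again from the first buffer. */
static void pool_reset(void)
{
	memset(pool, 0, sizeof(pool));
	pool_next = 0;
	pool_resets++;
}

/* Hand out one zeroed page-table page; flush the pool when it is exhausted. */
static void *pool_get_page(void)
{
	if (pool_next == POOL_PAGES)
		pool_reset();
	assert(pool_next < POOL_PAGES);
	return pool[pool_next++];
}
```

Because a later fault simply rebuilds whatever mapping was thrown away, losing the earlier tables costs time but not correctness, which is why a small fixed pool is enough.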
Thus, with the help of the #PF handler, we can set the final kernel
mapping from blank, and switch to init_level4_pgt later.
During the switchover in head_64.S, before #PF handler is available,
we use three pages to handle kernel crossing 1G, 512G boundaries with
sharing page by playing games with page aliasing: the same page is
mapped twice in the higher-level tables with appropriate wraparound.
The kernel region itself will be properly mapped; other mappings may
be spurious.
early_make_pgtable uses the kernel high-mapping address to access the pages
that hold the page tables.
-v4: Add phys_base offset to make kexec happy, and add
init_mapping_kernel() - Yinghai
-v5: fix compiling with xen, and add back ident level3 and level2 for xen
also move back init_level4_pgt from BSS to DATA again.
because we have to clear it anyway. - Yinghai
-v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai
-v7: remove not needed clear_page for init_level4_page
it is with fill 512,8,0 already in head_64.S - Yinghai
-v8: we need to keep that handler alive until init_mem_mapping and don't
let early_trap_init to trash that early #PF handler.
So split early_trap_pf_init out and move it down. - Yinghai
-v9: switchover only cover kernel space instead of 1G so could avoid
touch possible mem holes. - Yinghai
-v11: change far jmp back to far return to initial_code, that is needed
to fix failure that is reported by Konrad on AMD systems. - Yinghai
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-24 12:19:52 -08:00
early_trap_pf_init ( ) ;
2011-10-20 16:15:26 -05:00
2016-08-10 02:29:14 -07:00
/*
* Update mmu_cr4_features ( and , indirectly , trampoline_cr4_features )
* with the current CR4 value . This may not be necessary , but
* auditing all the early - boot CR4 manipulation would be needed to
* rule it out .
*/
2016-09-29 12:48:12 -07:00
mmu_cr4_features = __read_cr4 ( ) ;
2016-08-10 02:29:14 -07:00
2014-01-27 17:06:50 -08:00
memblock_set_current_limit ( get_max_mapped ( ) ) ;
2008-06-24 12:18:14 -07:00
2008-06-25 21:51:28 -07:00
/*
* NOTE : On x86 - 32 , only from this point on , fixmaps are ready for use .
*/
# ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
if ( init_ohci1394_dma_early )
init_ohci1394_dma_on_all_controllers ( ) ;
# endif
2011-05-24 17:13:20 -07:00
/* Allocate bigger log buffer */
setup_log_buf ( 1 ) ;
2008-06-25 21:51:28 -07:00
2008-06-23 03:05:30 -07:00
reserve_initrd ( ) ;
2016-06-20 13:56:10 +03:00
acpi_table_upgrade ( ) ;
2012-10-01 00:23:54 +02:00
2008-06-25 17:52:35 -07:00
vsmp_init ( ) ;
2008-06-17 15:41:45 -07:00
io_delay_init ( ) ;
/*
* Parse the ACPI tables for possible boot - time SMP configuration .
*/
x86, ACPI, mm: Revert movablemem_map support
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5
Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")
It turns out movable_map has some problems, and it breaks several things
1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy.
and make fall back path working.
2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.
3. that patch TITLE is totally misleading, there is NO x86 in the title,
but it changes critical x86 code. As a result, the x86 guys did not
pay attention and did not find the problem early. Those patches really should
be routed via tip/x86/mm.
4. after that commit, following range can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.
If a node is hotpluggable, the mem related ranges like page table and
vmemmap could be on that node without problem and should be on that
node.
We have workaround patch that could fix some problems, but some can not
be fixed.
So just remove that offending commit and related ones including:
f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")
01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")
27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")
e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")
fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")
42f47e27e761 ("page_alloc: make movablemem_map have higher priority")
6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")
34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")
4d59a75125d5 ("x86: get pg_data_t's memory from other node")
Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0. Also
need to find way to put other kernel used ram to local node ram.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-01 14:51:27 -08:00
acpi_boot_table_init ( ) ;
early_acpi_boot_init ( ) ;
2011-02-16 12:13:06 +01:00
initmem_init ( ) ;
2014-10-24 17:00:34 +08:00
dma_contiguous_reserve ( max_pfn_mapped < < PAGE_SHIFT ) ;
2013-11-12 15:08:07 -08:00
/*
* Reserve memory for crash kernel after SRAT is parsed so that it
* won ' t consume hotpluggable memory .
*/
reserve_crashkernel ( ) ;
2010-08-25 13:39:18 -07:00
memblock_find_dma_reserve ( ) ;
2008-07-18 19:07:53 +02:00
2012-08-16 17:00:19 -03:00
# ifdef CONFIG_KVM_GUEST
2008-02-15 17:52:48 -02:00
kvmclock_init ( ) ;
# endif
2012-08-21 21:22:38 +01:00
x86_init . paging . pagetable_init ( ) ;
x86: early boot debugging via FireWire (ohci1394_dma=early)
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues in ACPI or other subsystems which are executed very early.
If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.
With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.
In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.
A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.
A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt
It also has links to all the tools which are available to make use of it.
Another copy of this patch is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff
Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:34:11 +01:00
2015-02-13 14:39:25 -08:00
kasan_init ( ) ;
2010-08-28 15:58:33 +02:00
# ifdef CONFIG_X86_32
/* sync back kernel address range */
clone_pgd_range ( initial_page_table + KERNEL_PGD_BOUNDARY ,
swapper_pg_dir + KERNEL_PGD_BOUNDARY ,
KERNEL_PGD_PTRS ) ;
x86/setup: Extend low identity map to cover whole kernel range
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap. efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.
For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table. This makes the EFI stub's trick "just work".
However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault). Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.
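The double copy described above can be modelled with ordinary arrays. This is a sketch assuming a non-PAE layout (1024 PGD entries, PAGE_OFFSET at 0xC0000000); clone_range, min_size and sync_kernel_mappings are hypothetical stand-ins for illustration, not the kernel's helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed non-PAE layout: 1024 PGD entries, kernel mapped at 0xC0000000. */
#define PTRS_PER_PGD        1024u
#define PGDIR_SHIFT         22
#define PAGE_OFFSET         0xC0000000u
#define KERNEL_PGD_BOUNDARY (PAGE_OFFSET >> PGDIR_SHIFT)          /* 768 */
#define KERNEL_PGD_PTRS     (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY)  /* 256 */

typedef uint32_t pgd_t;

/* Copy count top-level entries from src to dst. */
static void clone_range(pgd_t *dst, const pgd_t *src, size_t count)
{
	memcpy(dst, src, count * sizeof(pgd_t));
}

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Mirror the kernel mappings twice: at PAGE_OFFSET and at address 0. */
static void sync_kernel_mappings(pgd_t *initial, const pgd_t *swapper)
{
	assert(KERNEL_PGD_BOUNDARY + KERNEL_PGD_PTRS == PTRS_PER_PGD);

	/* Kernel range, same slots in both tables. */
	clone_range(initial + KERNEL_PGD_BOUNDARY,
		    swapper + KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS);

	/*
	 * Low identity map: the same entries again, starting at slot 0.
	 * The min() keeps the copy from running into the kernel slots.
	 */
	clone_range(initial, swapper + KERNEL_PGD_BOUNDARY,
		    min_size(KERNEL_PGD_PTRS, KERNEL_PGD_BOUNDARY));
}
```

With this layout, the entry that maps physical 0x0724e000 through the kernel mapping sits in slot 768 + 28, and after the second copy it is also reachable through slot 28, which is where an identity-mapped access to that address looks it up.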
For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM. However, even under KVM one can clearly
see that the page table is bogus:
$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.
Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 000000ff
IDT= fffbb000 000007ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
...
The page directory is sane:
(gdb) x/4wx 0x32b7000
0x32b7000: 0x03398063 0x03399063 0x0339a063 0x0339b063
(gdb) x/4wx 0x3398000
0x3398000: 0x00000163 0x00001163 0x00002163 0x00003163
(gdb) x/4wx 0x3399000
0x3399000: 0x00400003 0x00401003 0x00402003 0x00403003
but our particular page directory entry is empty:
(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070: 0x00000000
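The address arithmetic in the last gdb command can be written out as a small helper. This is a sketch for the non-PAE case only (4-byte entries, each covering 4 MiB); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 22    /* non-PAE: one page directory entry maps 4 MiB */
#define PDE_SIZE    4u    /* non-PAE: entries are 32 bits wide */

/* Index of the page directory entry that covers vaddr. */
static uint32_t pde_index(uint32_t vaddr)
{
	return vaddr >> PGDIR_SHIFT;
}

/* Physical address of that entry, given the directory base from CR3. */
static uint32_t pde_address(uint32_t cr3, uint32_t vaddr)
{
	uint32_t idx = pde_index(vaddr);

	assert(idx < 1024u);                  /* a 32-bit address always fits */
	return (cr3 & ~0xfffu) + idx * PDE_SIZE;
}
```

For the values in the session above, pde_address(0x032b7000, 0x0724e000) selects index 28 and yields 0x032b7070, the word gdb shows as zero.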
[ It appears that you can skate past this issue if you don't receive
any interrupts while the bogus GDT pointer is loaded, or if you avoid
reloading the segment registers in general.
Andy Lutomirski provides some additional insight:
"AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory. Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."
Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
calls, not in the epilog/prolog calls") interrupts were disabled
around the prolog and epilog calls, and the functional GDT was
re-installed before interrupts were re-enabled.
Which explains why no one has hit this issue until now. ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
[ Updated changelog. ]
2015-10-14 13:30:45 +02:00
/*
* sync back low identity map too . It is used for example
* in the 32 - bit EFI stub .
*/
clone_pgd_range ( initial_page_table ,
swapper_pg_dir + KERNEL_PGD_BOUNDARY ,
2015-11-06 14:18:36 +01:00
min ( KERNEL_PGD_PTRS , KERNEL_PGD_BOUNDARY ) ) ;
2010-08-28 15:58:33 +02:00
# endif
x86-32: Separate 1:1 pagetables from swapper_pg_dir
This patch fixes machine crashes which occur when heavily exercising the
CPU hotplug codepaths on a 32-bit kernel. These crashes are caused by
AMD Erratum 383 and result in a fatal machine check exception. Here's
the scenario:
1. On 32-bit, the swapper_pg_dir page table is used as the initial page
table for booting a secondary CPU.
2. To make this work, swapper_pg_dir needs a direct mapping of physical
memory in it (the low mappings). By adding those low, large page (2M)
mappings (PAE kernel), we create the necessary conditions for Erratum
383 to occur.
3. Other CPUs which do not participate in the off- and onlining game may
use swapper_pg_dir while the low mappings are present (when leave_mm is
called). For all steps below, the CPU referred to is a CPU that is using
swapper_pg_dir, and not the CPU which is being onlined.
4. The presence of the low mappings in swapper_pg_dir can result
in TLB entries for addresses below __PAGE_OFFSET to be established
speculatively. These TLB entries are marked global and large.
5. When the CPU with such TLB entry switches to another page table, this
TLB entry remains because it is global.
6. The process then generates an access to an address covered by the
above TLB entry but there is a permission mismatch - the TLB entry
covers a large global page not accessible to userspace.
7. Due to this permission mismatch a new 4kb, user TLB entry gets
established. Further, Erratum 383 provides for a small window of time
where both TLB entries are present. This results in an uncorrectable
machine check exception signalling a TLB multimatch which panics the
machine.
There are two ways to fix this issue:
1. Always do a global TLB flush when a new cr3 is loaded and the
old page table was swapper_pg_dir. I consider this a hack hard
to understand and with performance implications
2. Do not use swapper_pg_dir to boot secondary CPUs like 64-bit
does.
This patch implements solution 2. It introduces a trampoline_pg_dir
which has the same layout as swapper_pg_dir with low_mappings. This page
table is used as the initial page table of the booting CPU. Later in the
bringup process, it switches to swapper_pg_dir and does a global TLB
flush. This fixes the crashes in our test cases.
-v2: switch to swapper_pg_dir right after entering start_secondary() so
that we are able to access percpu data which might not be mapped in the
trampoline page table.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
LKML-Reference: <20100816123833.GB28147@aftab>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-16 14:38:33 +02:00
x86, intel_txt: Intel TXT boot support
This patch adds kernel configuration and boot support for Intel Trusted
Execution Technology (Intel TXT).
Intel's technology for safer computing, Intel Trusted Execution
Technology (Intel TXT), defines platform-level enhancements that
provide the building blocks for creating trusted platforms.
Intel TXT was formerly known by the code name LaGrande Technology (LT).
Intel TXT in Brief:
o Provides dynamic root of trust for measurement (DRTM)
o Data protection in case of improper shutdown
o Measurement and verification of launched environment
Intel TXT is part of the vPro(TM) brand and is also available on some
non-vPro systems. It is currently available on desktop systems based on
the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell Optiplex 755, HP
dc7800, etc.) and mobile systems based on the GM45, PM45, and GS45
Express chipsets.
For more information, see http://www.intel.com/technology/security/.
This site also has a link to the Intel TXT MLE Developers Manual, which
has been updated for the new released platforms.
A much more complete description of how these patches support TXT, how to
configure a system for it, etc. is in the Documentation/intel_txt.txt file
in this patch.
This patch provides the TXT support routines for complete functionality,
documentation for TXT support and for the changes to the boot_params structure,
and boot detection of a TXT launch. Attempts to shutdown (reboot, Sx) the system
will result in platform resets; subsequent patches will support these shutdown modes
properly.
Documentation/intel_txt.txt | 210 +++++++++++++++++++++
Documentation/x86/zero-page.txt | 1
arch/x86/include/asm/bootparam.h | 3
arch/x86/include/asm/fixmap.h | 3
arch/x86/include/asm/tboot.h | 197 ++++++++++++++++++++
arch/x86/kernel/Makefile | 1
arch/x86/kernel/setup.c | 4
arch/x86/kernel/tboot.c | 379 +++++++++++++++++++++++++++++++++++++++
security/Kconfig | 30 +++
9 files changed, 827 insertions(+), 1 deletion(-)
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Gang Wei <gang.wei@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-30 19:30:59 -07:00
tboot_probe ( ) ;
2008-06-25 17:52:35 -07:00
map_vsyscall ( ) ;
2006-09-26 10:52:32 +02:00
generic_apic_probe ( ) ;
2005-04-16 15:20:36 -07:00
2007-10-19 20:35:03 +02:00
early_quirks ( ) ;
2006-06-08 00:43:38 -07:00
2008-06-23 19:55:05 -07:00
/*
* Read APIC and some other early information from ACPI tables .
*/
2005-04-16 15:20:36 -07:00
acpi_boot_init ( ) ;
2009-08-14 15:23:29 -04:00
sfi_init ( ) ;
2011-02-25 16:09:31 +01:00
x86_dtb_init ( ) ;
2008-06-21 01:38:41 -07:00
2008-06-23 19:55:05 -07:00
/*
* get boot - time SMP configuration :
*/
2016-08-12 14:57:12 +08:00
get_smp_config ( ) ;
2008-06-25 17:52:35 -07:00
x86/smpboot: Init apic mapping before usage
The recent changes, which forced the registration of the boot cpu on UP
systems, which do not have ACPI tables, have been fixed for systems w/o
local APIC, but left a wreckage for systems which have neither ACPI nor
mptables, but the CPU has an APIC, e.g. virtualbox.
The boot process crashes in prefill_possible_map() as it wants to register
the boot cpu, which needs to access the local apic, but the local APIC is
not yet mapped.
There is no reason why init_apic_mapping() can't be invoked before
prefill_possible_map(). So instead of playing another silly early mapping
game, as the ACPI/mptables code does, we just move init_apic_mapping()
before the call to prefill_possible_map().
In hindsight, I should have noticed that combination earlier.
Sorry for the churn (also in stable)!
Fixes: ff8560512b8d ("x86/boot/smp: Don't try to poke disabled/non-existent APIC")
Reported-and-debugged-by: Michal Necasek <michal.necasek@oracle.com>
Reported-and-tested-by: Wolfgang Bauer <wbauer@tmo.at>
Cc: prarit@redhat.com
Cc: ville.syrjala@linux.intel.com
Cc: michael.thayer@oracle.com
Cc: knut.osmundsen@oracle.com
Cc: frank.mehnert@oracle.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610282114380.5053@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-29 13:42:42 +02:00
/*
* Systems w / o ACPI and mptables might not have mapped the local
* APIC yet , but prefill_possible_map ( ) might need to access it .
*/
init_apic_mappings ( ) ;
2008-07-02 18:54:40 -07:00
prefill_possible_map ( ) ;
2008-08-19 20:50:02 -07:00
2008-07-02 18:53:44 -07:00
init_cpu_to_node ( ) ;
2015-04-24 13:57:48 +02:00
io_apic_init_mappings ( ) ;
2008-08-19 20:50:52 -07:00
2008-06-23 19:55:05 -07:00
kvm_guest_init ( ) ;
2005-04-16 15:20:36 -07:00
2008-06-16 13:03:31 -07:00
e820_reserve_resources ( ) ;
2008-05-20 20:10:58 -07:00
e820_mark_nosave_regions ( max_low_pfn ) ;
2005-04-16 15:20:36 -07:00
2009-08-19 14:55:50 +02:00
x86_init . resources . reserve_resources ( ) ;
2008-06-16 13:03:31 -07:00
2017-01-28 14:16:38 +01:00
e820__setup_pci_gap ( ) ;
2008-06-16 13:03:31 -07:00
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_VT
# if defined(CONFIG_VGA_CONSOLE)
2012-11-14 09:42:35 +00:00
if ( ! efi_enabled ( EFI_BOOT ) | | ( efi_mem_type ( 0xa0000 ) ! = EFI_CONVENTIONAL_MEMORY ) )
2005-04-16 15:20:36 -07:00
conswitchp = & vga_con ;
# elif defined(CONFIG_DUMMY_CONSOLE)
conswitchp = & dummy_con ;
# endif
# endif
2009-08-20 13:19:57 +02:00
x86_init . oem . banner ( ) ;
2009-11-10 09:38:24 +08:00
2011-02-15 00:13:31 +08:00
x86_init . timers . wallclock_init ( ) ;
2009-11-10 09:38:24 +08:00
mcheck_init ( ) ;
2010-09-17 11:08:51 -04:00
2011-04-18 15:19:51 -07:00
arch_init_ideal_nops ( ) ;
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
CLOCK_TICK_RATE is used to accurately calculate exactly how
long a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
whose granularity error is negligible.
Even so, we have some complicated calculations that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock as being assumed a perfect HZ
freq, and allows architectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refined_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. It's likely these will not be noticeable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refined_jiffies(CLOCK_TICK_RATE)
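The granularity error this message describes can be seen with a little integer arithmetic. The sketch below is not the kernel's actual computation (which works with shifted values); it only contrasts the ideal tick length NSEC_PER_SEC/HZ with the length obtained when the timer can only count whole input cycles. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Tick length if the hardware could deliver exactly hz ticks per second. */
static uint64_t ideal_tick_ns(uint32_t hz)
{
	return NSEC_PER_SEC / hz;
}

/*
 * Tick length when the timer is programmed with a whole number of input
 * cycles per tick: round tick_rate / hz to the nearest integer, then
 * convert that cycle count back to nanoseconds.
 */
static uint64_t refined_tick_ns(uint32_t tick_rate, uint32_t hz)
{
	uint64_t cycles_per_tick;

	assert(tick_rate > 0 && hz > 0);
	cycles_per_tick = ((uint64_t)tick_rate + hz / 2) / hz;
	return cycles_per_tick * NSEC_PER_SEC / tick_rate;
}
```

With the PC timer's 1193182 Hz input and HZ = 1000, the timer counts 1193 cycles per tick, so a tick lasts about 999847 ns rather than the ideal 1000000 ns. That difference of roughly 153 ns per tick is the slight error a refined jiffies clocksource is meant to account for.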
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-04 12:42:27 -04:00
register_refined_jiffies ( CLOCK_TICK_RATE ) ;
2012-10-24 10:00:44 -07:00
# ifdef CONFIG_EFI
2014-03-04 17:02:17 +01:00
if ( efi_enabled ( EFI_BOOT ) )
efi_apply_memmap_quirks ( ) ;
2012-10-24 10:00:44 -07:00
# endif
2005-04-16 15:20:36 -07:00
}
2008-09-16 09:29:09 +02:00
2009-02-17 23:12:48 +01:00
# ifdef CONFIG_X86_32
2009-08-19 14:55:50 +02:00
static struct resource video_ram_resource = {
. name = " Video RAM area " ,
. start = 0xa0000 ,
. end = 0xbffff ,
. flags = IORESOURCE_BUSY | IORESOURCE_MEM
2009-02-17 23:12:48 +01:00
} ;
2009-08-19 14:55:50 +02:00
void __init i386_reserve_resources ( void )
2009-02-17 23:12:48 +01:00
{
2009-08-19 14:55:50 +02:00
request_resource ( & iomem_resource , & video_ram_resource ) ;
reserve_standard_io_resources ( ) ;
2009-02-17 23:12:48 +01:00
}
# endif /* CONFIG_X86_32 */
2013-10-10 17:18:17 -07:00
static struct notifier_block kernel_offset_notifier = {
. notifier_call = dump_kernel_offset
} ;
static int __init register_kernel_offset_dumper ( void )
{
atomic_notifier_chain_register ( & panic_notifier_list ,
& kernel_offset_notifier ) ;
return 0 ;
}
__initcall ( register_kernel_offset_dumper ) ;
2016-02-12 13:02:27 -08:00
void arch_show_smap ( struct seq_file * m , struct vm_area_struct * vma )
{
if ( ! boot_cpu_has ( X86_FEATURE_OSPKE ) )
return ;
seq_printf ( m , " ProtectionKey: %8u \n " , vma_pkey ( vma ) ) ;
}