322 Commits

Author SHA1 Message Date
Kees Cook
06486d6c97 x86/KASLR: Return earliest overlap when avoiding regions
In preparation for being able to detect where to split up contiguous
memory regions that overlap with memory regions to avoid, we need to
pass back what the earliest overlapping region was. This modifies the
overlap checker to return that information.

Based on a separate mem_min_overlap() implementation by Baoquan He.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462825332-10505-5-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:12:04 +02:00
Baoquan He
c401cf1524 x86/KASLR: Add 'struct slot_area' to manage random_addr slots
In order to support KASLR moving the kernel anywhere in physical memory
(which could be up to 64TB), we need to handle counting the potential
randomization locations in a more efficient manner.

In the worst case with 64TB, there could be roughly 32 * 1024 * 1024
randomization slots if CONFIG_PHYSICAL_ALIGN is 0x1000000. Currently
the starting address of candidate positions is stored into the slots[]
array, one at a time. This method would cost too much memory and it's
also very inefficient to get and save the slot information into the slot
array one by one.

This patch introduces 'struct slot_area' to manage each contiguous region
of randomization slots. Each slot_area will contain the starting address
and how many available slots are in this area. As with the original code,
the slot_areas[] will avoid the mem_avoid[] regions.

Since setup_data is a linked list, it could contain an unknown number
of memory regions to be avoided, which could cause us to fragment
the contiguous memory that the slot_area array is tracking. In normal
operation this level of fragmentation will be extremely rare, but we
choose a suitably large value (100) for the array. If setup_data forces
the slot_area array to become highly fragmented and there are more
slots available beyond the first 100 found, the rest will be ignored
for KASLR selection.

The function store_slot_info() is used to calculate the number of slots
available in the passed-in memory region and stores it into slot_areas[]
after adjusting for alignment and size requirements.

Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote changelog, squashed with new functions. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462825332-10505-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:12:04 +02:00
Kees Cook
cb18ef0da2 x86/boot: Add missing file header comments
There were some files with missing header comments. Since they are
included from both compressed and regular kernels, make note of that.
Also corrects a typo in the mem_avoid comments.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462825332-10505-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:12:03 +02:00
Kees Cook
434a6c9f90 x86/KASLR: Initialize mapping_info every time
As it turns out, mapping_info DOES need to be initialized every
time, because pgt_data address could be changed during kernel
relocation. So it can not be build time assigned.

Without this, page tables were not being corrected updated, which
could cause reboots when a physical address beyond 2G was chosen.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462825332-10505-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:12:02 +02:00
Borislav Petkov
36a39ac967 x86/boot: Comment what finalize_identity_maps() does
So it is not really obvious that finalize_identity_maps() doesn't do any
finalization but it *actually* writes CR3 with the ident PGD. Comment
that at the call site.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: bhe@redhat.com
Cc: dyoung@redhat.com
Cc: jkosina@suse.cz
Cc: linux-tip-commits@vger.kernel.org
Cc: luto@kernel.org
Cc: vgoyal@redhat.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20160507100541.GA24613@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:12:02 +02:00
Kees Cook
3a94707d7a x86/KASLR: Build identity mappings on demand
Currently KASLR only supports relocation in a small physical range (from
16M to 1G), due to using the initial kernel page table identity mapping.
To support ranges above this, we need to have an identity mapping for the
desired memory range before we can decompress (and later run) the kernel.

32-bit kernels already have the needed identity mapping. This patch adds
identity mappings for the needed memory ranges on 64-bit kernels. This
happens in two possible boot paths:

If loaded via startup_32(), we need to set up the needed identity map.

If loaded from a 64-bit bootloader, the bootloader will have already
set up an identity mapping, and we'll start via the compressed kernel's
startup_64(). In this case, the bootloader's page tables need to be
avoided while selecting the new uncompressed kernel location. If not,
the decompressor could overwrite them during decompression.

To accomplish this, we could walk the pagetable and find every page
that is used, and add them to mem_avoid, but this needs extra code and
will require increasing the size of the mem_avoid array.

Instead, we can create a new set of page tables for our own identity
mapping instead. The pages for the new page table will come from the
_pagetable section of the compressed kernel, which means they are
already contained by in mem_avoid array. To do this, we reuse the code
from the uncompressed kernel's identity mapping routines.

The _pgtable will be shared by both the 32-bit and 64-bit paths to reduce
init_size, as now the compressed kernel's _rodata to _end will contribute
to init_size.

To handle the possible mappings, we need to increase the existing page
table buffer size:

When booting via startup_64(), we need to cover the old VO, params,
cmdline and uncompressed kernel. In an extreme case we could have them
all beyond the 512G boundary, which needs (2+2)*4 pages with 2M mappings.
And we'll need 2 for first 2M for VGA RAM. One more is needed for level4.
This gets us to 19 pages total.

When booting via startup_32(), KASLR could move the uncompressed kernel
above 4G, so we need to create extra identity mappings, which should only
need (2+2) pages at most when it is beyond the 512G boundary. So 19
pages is sufficient for this case as well.

The resulting BOOT_*PGT_SIZE defines use the "_SIZE" suffix on their
names to maintain logical consistency with the existing BOOT_HEAP_SIZE
and BOOT_STACK_SIZE defines.

This patch is based on earlier patches from Yinghai Lu and Baoquan He.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462572095-11754-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:38:39 +02:00
Kees Cook
ed09acde44 x86/KASLR: Improve comments around the mem_avoid[] logic
This attempts to improve the comments that describe how the memory
range used for decompression is avoided. Additionally uses an enum
instead of raw numbers for the mem_avoid[] indexing.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20160506194459.GA16480@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:38:38 +02:00
Borislav Petkov
549f90db68 x86/boot: Simplify pointer casting in choose_random_location()
Pass them down as 'unsigned long' directly and get rid of more casting and
assignments.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: bhe@redhat.com
Cc: dyoung@redhat.com
Cc: linux-tip-commits@vger.kernel.org
Cc: luto@kernel.org
Cc: vgoyal@redhat.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20160506115015.GI24044@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:38:38 +02:00
Yinghai Lu
9dc1969c24 x86/KASLR: Consolidate mem_avoid[] entries
The mem_avoid[] array is used to track positions that should be avoided (like
the compressed kernel, decompression code, etc) when selecting a memory
position for the randomly relocated kernel. Since ZO is now at the end of
the decompression buffer and the decompression code (and its heap and
stack) are at the front, we can safely consolidate the decompression entry,
the heap entry, and the stack entry. The boot_params memory, however, could
be elsewhere, so it should be explicitly included.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rwrote changelog, cleaned up code comments. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462486436-3707-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-06 09:00:59 +02:00
Kees Cook
2bc1cd39fa x86/boot: Clean up pointer casting
Currently extract_kernel() defines the input and output buffer pointers
as "unsigned char *" since that's effectively what they are. It passes
these to the decompressor routine and to the ELF parser, which both
logically deal with buffer pointers too. There is some casting ("unsigned
long") done to validate the numerical value of the pointers, but it is
relatively limited.

However, choose_random_location() operates almost exclusively on the
numerical representation of these pointers, so it ended up carrying
a lot of "unsigned long" casts. With the future physical/virtual split
these casts were going to multiply, so this attempts to solve the
problem by doing all the casting in choose_random_location()'s entry
and return instead of through-out the code. Adjusts argument names to
be more meaningful, and changes one us of "choice" to "output" to make
the future physical/virtual split more clear (i.e. "choice" should be
strictly a function return value and not used as an intermediate).

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462486436-3707-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-06 09:00:59 +02:00
Kees Cook
00ec2c3703 x86/boot: Warn on future overlapping memcpy() use
If an overlapping memcpy() is ever attempted, we should at least report
it, in case it might lead to problems, so it could be changed to a
memmove() call instead.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Lasse Collin <lasse.collin@tukaani.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1462229461-3370-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-03 08:15:58 +02:00
Kees Cook
dc425a6e14 x86/boot: Extract error reporting functions
Currently to use warn(), a caller would need to include misc.h. However,
this means they would get the (unavailable during compressed boot)
gcc built-in memcpy family of functions. But since string.c is defining
these memcpy functions for use by misc.c, we end up in a weird circular
dependency.

To break this loop, move the error reporting functions outside of misc.c
with their own header so that they can be independently included by
other sources. Since the screen-writing routines use memmove(), keep the
low-level *_putstr() functions in misc.c.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Lasse Collin <lasse.collin@tukaani.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1462229461-3370-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-03 08:15:58 +02:00
Yinghai Lu
4abf061bf8 x86/boot: Correctly bounds-check relocations
Relocation handling performs bounds checking on the resulting calculated
addresses. The existing code uses output_len (VO size plus relocs size) as
the max address. This is not right since the max_addr check should stop at
the end of VO and exclude bss, brk, etc, which follows.  The valid range
should be VO [_text, __bss_start] in the loaded physical address space.

This patch adds an export for __bss_start in voffset.h and uses it to
set the correct limit for max_addr.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
[ Rewrote the changelog. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1461888548-32439-7-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:03:30 +02:00
Yinghai Lu
4d2d542482 x86/KASLR: Clean up unused code from old 'run_size' and rename it to 'kernel_total_size'
Since 'run_size' is now calculated in misc.c, the old script and associated
argument passing is no longer needed. This patch removes them, and renames
'run_size' to the more descriptive 'kernel_total_size'.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote the changelog, renamed 'run_size' to 'kernel_total_size' ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1461888548-32439-6-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:03:30 +02:00
Yinghai Lu
67b6662559 x86/boot: Fix "run_size" calculation
Currently, the "run_size" variable holds the total kernel size
(size of code plus brk and bss) and is calculated via the shell script
arch/x86/tools/calc_run_size.sh. It gets the file offset and mem size
of the .bss and .brk sections from the vmlinux, and adds them as follows:

  run_size = $(( $offsetA + $sizeA + $sizeB ))

However, this is not correct (it is too large). To illustrate, here's
a walk-through of the script's calculation, compared to the correct way
to find it.

First, offsetA is found as the starting address of the first .bss or
.brk section seen in the ELF file. The sizeA and sizeB values are the
respective section sizes.

 [bhe@x1 linux]$ objdump -h vmlinux

 vmlinux:     file format elf64-x86-64

 Sections:
 Idx Name    Size      VMA               LMA               File off  Algn
  27 .bss    00170000  ffffffff81ec8000  0000000001ec8000  012c8000  2**12
             ALLOC
  28 .brk    00027000  ffffffff82038000  0000000002038000  012c8000  2**0
             ALLOC

Here, offsetA is 0x012c8000, with sizeA at 0x00170000 and sizeB at
0x00027000. The resulting run_size is 0x145f000:

 0x012c8000 + 0x00170000 + 0x00027000 = 0x145f000

However, if we instead examine the ELF LOAD program headers, we see a
different picture.

 [bhe@x1 linux]$ readelf -l vmlinux

 Elf file type is EXEC (Executable file)
 Entry point 0x1000000
 There are 5 program headers, starting at offset 64

 Program Headers:
  Type        Offset             VirtAddr           PhysAddr
              FileSiz            MemSiz              Flags  Align
  LOAD        0x0000000000200000 0xffffffff81000000 0x0000000001000000
              0x0000000000b5e000 0x0000000000b5e000  R E    200000
  LOAD        0x0000000000e00000 0xffffffff81c00000 0x0000000001c00000
              0x0000000000145000 0x0000000000145000  RW     200000
  LOAD        0x0000000001000000 0x0000000000000000 0x0000000001d45000
              0x0000000000018158 0x0000000000018158  RW     200000
  LOAD        0x000000000115e000 0xffffffff81d5e000 0x0000000001d5e000
              0x000000000016a000 0x0000000000301000  RWE    200000
  NOTE        0x000000000099bcac 0xffffffff8179bcac 0x000000000179bcac
              0x00000000000001bc 0x00000000000001bc         4

 Section to Segment mapping:
  Segment Sections...
   00     .text .notes __ex_table .rodata __bug_table .pci_fixup .tracedata
          __ksymtab __ksymtab_gpl __ksymtab_strings __init_rodata __param
          __modver
   01     .data .vvar
   02     .data..percpu
   03     .init.text .init.data .x86_cpu_dev.init .parainstructions
          .altinstructions .altinstr_replacement .iommu_table .apicdrivers
          .exit.text .smp_locks .bss .brk
   04     .notes

As mentioned, run_size needs to be the size of the running kernel
including .bss and .brk. We can see from the Section/Segment mapping
above that .bss and .brk are included in segment 03 (which corresponds
to the final LOAD program header). To find the run_size, we calculate
the end of the LOAD segment from its PhysAddr start (0x0000000001d5e000)
and its MemSiz (0x0000000000301000), minus the physical load address of
the kernel (the first LOAD segment's PhysAddr: 0x0000000001000000). The
resulting run_size is 0x105f000:

 0x0000000001d5e000 + 0x0000000000301000 - 0x0000000001000000 = 0x105f000

So, from this we can see that the existing run_size calculation is
0x400000 too high. And, as it turns out, the correct run_size is
actually equal to VO_end - VO_text, which is certainly easier to calculate.
_end: 0xffffffff8205f000
_text:0xffffffff81000000

 0xffffffff8205f000 - 0xffffffff81000000 = 0x105f000

As a result, run_size is a simple constant, so we don't need to pass it
around; we already have voffset.h for such things. We can share voffset.h
between misc.c and header.S instead of getting run_size in other ways.
This patch moves voffset.h creation code to boot/compressed/Makefile,
and switches misc.c to use the VO_end - VO_text calculation for run_size.

Dependence before:

 boot/header.S ==> boot/voffset.h ==> vmlinux
 boot/header.S ==> compressed/vmlinux ==> compressed/misc.c

Dependence after:

 boot/header.S ==> compressed/vmlinux ==> compressed/misc.c ==> boot/voffset.h ==> vmlinux

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote the changelog. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: lasse.collin@tukaani.org
Fixes: e6023367d779 ("x86, kaslr: Prevent .bss from overlaping initrd")
Link: http://lkml.kernel.org/r/1461888548-32439-5-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:03:30 +02:00
Yinghai Lu
d607251ba9 x86/boot: Calculate decompression size during boot not build
Currently z_extract_offset is calculated in boot/compressed/mkpiggy.c.
This doesn't work well because mkpiggy.c doesn't know the details of the
decompressor in use. As a result, it can only make an estimation, which
has risks:

 - output + output_len (VO) could be much bigger than input + input_len
   (ZO). In this case, the decompressed kernel plus relocs could overwrite
   the decompression code while it is running.

 - The head code of ZO could be bigger than z_extract_offset. In this case
   an overwrite could happen when the head code is running to move ZO to
   the end of buffer. Though currently the size of the head code is very
   small it's still a potential risk. Since there is no rule to limit the
   size of the head code of ZO, it runs the risk of suddenly becoming a
   (hard to find) bug.

Instead, this moves the z_extract_offset calculation into header.S, and
makes adjustments to be sure that the above two cases can never happen,
and further corrects the comments describing the calculations.

Since we have (in the previous patch) made ZO always be located against
the end of decompression buffer, z_extract_offset is only used here to
calculate an appropriate buffer size (INIT_SIZE), and is not longer used
elsewhere. As such, it can be removed from voffset.h.

Additionally clean up #if/#else #define to improve readability.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote the changelog and comments. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1461888548-32439-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:03:29 +02:00
Yinghai Lu
974f221c84 x86/boot: Move compressed kernel to the end of the decompression buffer
This change makes later calculations about where the kernel is located
easier to reason about. To better understand this change, we must first
clarify what 'VO' and 'ZO' are. These values were introduced in commits
by hpa:

  77d1a4999502 ("x86, boot: make symbols from the main vmlinux available")
  37ba7ab5e33c ("x86, boot: make kernel_alignment adjustable; new bzImage fields")

Specifically:

All names prefixed with 'VO_':

 - relate to the uncompressed kernel image

 - the size of the VO image is: VO__end-VO__text ("VO_INIT_SIZE" define)

All names prefixed with 'ZO_':

 - relate to the bootable compressed kernel image (boot/compressed/vmlinux),
   which is composed of the following memory areas:
     - head text
     - compressed kernel (VO image and relocs table)
     - decompressor code

 - the size of the ZO image is: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below)

The 'INIT_SIZE' value is used to find the larger of the two image sizes:

 #define ZO_INIT_SIZE    (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
 #define VO_INIT_SIZE    (VO__end - VO__text)

 #if ZO_INIT_SIZE > VO_INIT_SIZE
 # define INIT_SIZE ZO_INIT_SIZE
 #else
 # define INIT_SIZE VO_INIT_SIZE
 #endif

The current code uses extract_offset to decide where to position the
copied ZO (i.e. ZO starts at extract_offset). (This is why ZO_INIT_SIZE
currently includes the extract_offset.)

Why does z_extract_offset exist? It's needed because we are trying to minimize
the amount of RAM used for the whole act of creating an uncompressed, executable,
properly relocation-linked kernel image in system memory. We do this so that
kernels can be booted on even very small systems.

To achieve the goal of minimal memory consumption we have implemented an in-place
decompression strategy: instead of cleanly separating the VO and ZO images and
also allocating some memory for the decompression code's runtime needs, we instead
create this elaborate layout of memory buffers where the output (decompressed)
stream, as it progresses, overlaps with and destroys the input (compressed)
stream. This can only be done safely if the ZO image is placed to the end of the
VO range, plus a certain amount of safety distance to make sure that when the last
bytes of the VO range are decompressed, the compressed stream pointer is safely
beyond the end of the VO range.

z_extract_offset is calculated in arch/x86/boot/compressed/mkpiggy.c during
the build process, at a point when we know the exact compressed and
uncompressed size of the kernel images and can calculate this safe minimum
offset value. (Note that the mkpiggy.c calculation is not perfect, because
we don't know the decompressor used at that stage, so the z_extract_offset
calculation is necessarily imprecise and is mostly based on gzip internals -
we'll improve that in the next patch.)

When INIT_SIZE is bigger than VO_INIT_SIZE (uncommon but possible),
the copied ZO occupies the memory from extract_offset to the end of
decompression buffer. It overlaps with the soon-to-be-uncompressed kernel
like this:

                            |-----compressed kernel image------|
                            V                                  V
0                       extract_offset                      +INIT_SIZE
|-----------|---------------|-------------------------|--------|
            |               |                         |        |
          VO__text      startup_32 of ZO          VO__end    ZO__end
            ^                                         ^
            |-------uncompressed kernel image---------|

When INIT_SIZE is equal to VO_INIT_SIZE (likely) there's still space
left from end of ZO to the end of decompressing buffer, like below.

                            |-compressed kernel image-|
                            V                         V
0                       extract_offset                      +INIT_SIZE
|-----------|---------------|-------------------------|--------|
            |               |                         |        |
          VO__text      startup_32 of ZO          ZO__end    VO__end
            ^                                                  ^
            |------------uncompressed kernel image-------------|

To simplify calculations and avoid special cases, it is cleaner to
always place the compressed kernel image in memory so that ZO__end
is at the end of the decompression buffer, instead of placing t at
the start of extract_offset as is currently done.

This patch adds BP_init_size (which is the INIT_SIZE as passed in from
the boot_params) into asm-offsets.c to make it visible to the assembly
code.

Then when moving the ZO, it calculates the starting position of
the copied ZO (via BP_init_size and the ZO run size) so that the VO__end
will be at the end of the decompression buffer. To make the position
calculation safe, the end of ZO is page aligned (and a comment is added
to the existing VO alignment for good measure).

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
[ Rewrote changelog and comments. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1461888548-32439-3-git-send-email-keescook@chromium.org
[ Rewrote the changelog some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:03:29 +02:00
Baoquan He
6f9af75faa x86/KASLR: Handle kernel relocations above 2G correctly
When processing the relocation table, the offset used to calculate the
relocation is an 'int'. This is sufficient for calculating the physical
address of the relocs entry on 32-bit systems and on 64-bit systems when
the relocation is under 2G.

To handle relocations above 2G (seen in situations like kexec, netboot, etc),
this offset needs to be calculated using a 'long' to avoid wrapping and
miscalculating the relocation.

Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote the changelog. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1461888548-32439-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 09:58:26 +02:00
Kees Cook
81b785f3e4 x86/boot: Rename overlapping memcpy() to memmove()
Instead of having non-standard memcpy() behavior, explicitly call the new
function memmove(), make it available to the decompressors, and switch
the two overlap cases (screen scrolling and ELF parsing) to use memmove().
Additionally documents the purpose of compressed/string.c.

Suggested-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20160426214606.GA5758@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28 11:02:29 +02:00
Kees Cook
0f8ede1b8c x86/KASLR: Warn when KASLR is disabled
If KASLR is built in but not available at run-time (either due to the
current conflict with hibernation, command-line request, or e820 parsing
failures), announce the state explicitly. To support this, a new "warn"
function is created, based on the existing "error" function.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1461185746-8017-6-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22 10:00:51 +02:00
Kees Cook
bf0118dbba x86/boot: Make memcpy() handle overlaps
Two uses of memcpy() (screen scrolling and ELF parsing) were handling
overlapping memory areas. While there were no explicitly noticed bugs
here (yet), it is best to fix this so that the copying will always be
safe.

Instead of making a new memmove() function that might collide with other
memmove() definitions in the decompressors, this just makes the compressed
boot code's copy of memcpy() overlap-safe.

Suggested-by: Lasse Collin <lasse.collin@tukaani.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461185746-8017-5-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22 10:00:50 +02:00
Kees Cook
1f208de37d x86/boot: Clean up things used by decompressors
This rearranges the pieces needed to include the decompressor code
in misc.c. It wasn't obvious why things were there, so a comment was
added and definitions consolidated.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1461185746-8017-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22 10:00:50 +02:00
Baoquan He
e8581e3d67 x86/KASLR: Drop CONFIG_RANDOMIZE_BASE_MAX_OFFSET
Currently CONFIG_RANDOMIZE_BASE_MAX_OFFSET is used to limit the maximum
offset for kernel randomization. This limit doesn't need to be a CONFIG
since it is tied completely to KERNEL_IMAGE_SIZE, and will make no sense
once physical and virtual offsets are randomized separately. This patch
removes CONFIG_RANDOMIZE_BASE_MAX_OFFSET and consolidates the Kconfig
help text.

[kees: rewrote changelog, dropped KERNEL_IMAGE_SIZE_DEFAULT, rewrote help]
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1461185746-8017-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22 10:00:50 +02:00
Baoquan He
4252db1055 x86/KASLR: Update description for decompressor worst case size
The comment that describes the analysis for the size of the decompressor
code only took gzip into account (there are currently 6 other decompressors
that could be used). The actual z_extract_offset calculation in code was
already handling the correct maximum size, but this documentation hadn't
been updated. This updates the documentation, fixes several typos, moves
the comment to header.S, updates references, and adds a note at the end
of the decompressor include list to remind us about updating the comment
in the future.

(Instead of moving the comment to mkpiggy.c, where the calculation
is currently happening, it is being moved to header.S because
the calculations in mkpiggy.c will be removed in favor of header.S
calculations in a following patch, and it seemed like overkill to move
the giant comment twice, especially when there's already reference to
z_extract_offset in header.S.)

Signed-off-by: Baoquan He <bhe@redhat.com>
[ Rewrote changelog, cleaned up comment style, moved comments around. ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1461185746-8017-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22 10:00:50 +02:00
Kees Cook
9016875df4 x86/KASLR: Rename "random" to "random_addr"
The variable "random" is also the name of a libc function. It's better
coding style to avoid overloading such things, so rename it to the more
accurate "random_addr".

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1460997735-24785-7-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:51 +02:00
Kees Cook
7de828dfe6 x86/KASLR: Clarify purpose of kaslr.c
The name "choose_kernel_location" isn't specific enough, and doesn't
describe the primary thing it does: choosing a random location. This
patch renames it to "choose_random_location", and clarifies the what
routines are contained in the kaslr.c source file.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1460997735-24785-6-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:51 +02:00
Kees Cook
c040288132 x86/boot: Clarify purpose of functions in misc.c
The function "decompress_kernel" now performs many more duties, so this
patch renames it to "extract_kernel" and updates callers and comments.
Additionally the file header comment for misc.c is improved to actually
describe what is contained.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1460997735-24785-5-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:51 +02:00
Kees Cook
6655e0aaf7 x86/boot: Rename "real_mode" to "boot_params"
The non-compressed boot code uses the (much more obvious) name
"boot_params" for the global pointer to the x86 boot parameters. The
compressed kernel loader code, though, was using the legacy name
"real_mode". There is no need to have a different name, and changing it
improves readability.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1460997735-24785-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:50 +02:00
Yinghai Lu
206f25a831 x86/KASLR: Remove unneeded boot_params argument
Since the boot_params can be found using the real_mode global variable,
there is no need to pass around a pointer to it. This slightly simplifies
the choose_kernel_location function and its callers.

[kees: rewrote changelog, tracked file rename]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460997735-24785-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:50 +02:00
Kees Cook
9b238748cb x86/KASLR: Rename aslr.c to kaslr.c
In order to avoid confusion over what this file provides, rename it to
kaslr.c since it is used exclusively for the kernel ASLR, not userspace
ASLR.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1460997735-24785-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:30:50 +02:00
H.J. Lu
6d92bc9d48 x86/build: Build compressed x86 kernels as PIE
The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
relocation to get the symbol address in PIC.  When the compressed x86
kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
to their fixed symbol addresses.  However, when the compressed x86
kernel is loaded at a different address, it leads to the following
load failure:

  Failed to allocate space for phdrs

during the decompression stage.

If the compressed x86 kernel is relocatable at run-time, it should be
compiled with -fPIE, instead of -fPIC, if possible and should be built as
Position Independent Executable (PIE) so that linker won't optimize
R_386_GOT32X relocation to its fixed symbol address.

Older linkers generate R_386_32 relocations against locally defined
symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
R_386_32 relocations when relocating the kernel.  To generate
R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
hidden in both 32-bit and 64-bit x86 kernels.

To build a 64-bit compressed x86 kernel as PIE, we need to disable the
relocation overflow check to avoid relocation overflow errors. We do
this with a new linker command-line option, -z noreloc-overflow, which
got added recently:

 commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
 Author: H.J. Lu <hjl.tools@gmail.com>
 Date:   Tue Mar 15 11:07:06 2016 -0700

    Add -z noreloc-overflow option to x86-64 ld

    Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
    disable relocation overflow check.  This can be used to avoid relocation
    overflow check if there will be no dynamic relocation overflow at
    run-time.

The 64-bit compressed x86 kernel is built as PIE only if the linker supports
-z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
boots fine even when it is built as a normal executable.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Edited the changelog and comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29 12:51:12 +02:00
Dmitry Vyukov
5c9a8750a6 kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on kernel side.  The complimentary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen of basic blocks (e.g.  an invalid
input).  In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M).  Cost of
kcov depends only on number of executed basic blocks/edges.  On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Josh Poimboeuf
c0dd671686 objtool: Mark non-standard object files and directories
Code which runs outside the kernel's normal mode of operation often does
unusual things which can cause a static analysis tool like objtool to
emit false positive warnings:

 - boot image
 - vdso image
 - relocation
 - realmode
 - efi
 - head
 - purgatory
 - modpost

Set OBJECT_FILES_NON_STANDARD for their related files and directories,
which will tell objtool to skip checking them.  It's ok to skip them
because they don't affect runtime stack traces.

Also skip the following code which does the right thing with respect to
frame pointers, but is too "special" to be validated by a tool:

 - entry
 - mcount

Also skip the test_nx module because it modifies its exception handling
table at runtime, which objtool can't understand.  Fortunately it's
just a test module so it doesn't matter much.

Currently objtool is the only user of OBJECT_FILES_NON_STANDARD, but it
might eventually be useful for other tools.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/366c080e3844e8a5b6a0327dc7e8c2b90ca3baeb.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 08:35:02 +01:00
Andrey Ryabinin
c6d308534a UBSAN: run-time undefined behavior sanity checker
UBSAN uses compile-time instrumentation to catch undefined behavior
(UB).  Compiler inserts code that perform certain kinds of checks before
operations that could cause UB.  If check fails (i.e.  UB detected)
__ubsan_handle_* function called to print error message.

So the most of the work is done by compiler.  This patch just implements
ubsan handlers printing errors.

GCC has this capability since 4.9.x [1] (see -fsanitize=undefined
option and its suboptions).
However GCC 5.x has more checkers implemented [2].
Article [3] has a bit more details about UBSAN in the GCC.

[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
[3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/

Issues which UBSAN has found thus far are:

Found bugs:

 * out-of-bounds access - 97840cb67ff5 ("netfilter: nfnetlink: fix
   insufficient validation in nfnetlink_bind")

undefined shifts:

 * d48458d4a768 ("jbd2: use a better hash function for the revoke
   table")

 * 10632008b9e1 ("clockevents: Prevent shift out of bounds")

 * 'x << -1' shift in ext4 -
   http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com>

 * undefined rol32(0) -
   http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com>

 * undefined dirty_ratelimit calculation -
   http://lkml.kernel.org/r/<566594E2.3050306@odin.com>

 * undefined roundown_pow_of_two(0) -
   http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com>

 * [WONTFIX] undefined shift in __bpf_prog_run -
   http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com>

   WONTFIX here because it should be fixed in bpf program, not in kernel.

signed overflows:

 * 32a8df4e0b33f ("sched: Fix odd values in effective_load()
   calculations")

 * mul overflow in ntp -
   http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com>

 * incorrect conversion into rtc_time in rtc_time64_to_tm() -
   http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com>

 * unvalidated timespec in io_getevents() -
   http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com>

 * [NOTABUG] signed overflow in ktime_add_safe() -
   http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com>

[akpm@linux-foundation.org: fix unused local warning]
[akpm@linux-foundation.org: fix __int128 build woes]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Ingo Molnar
790a2ee242 * Make the EFI System Resource Table (ESRT) driver explicitly
non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway - Paul Gortmaker
 
  * Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64 - Leif Lindholm
 
  * Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses - Matt Fleming
 
  * Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel
 
  * Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module - Ben Hutchings
 
  * Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support - Taku Izumi
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWG7OwAAoJEC84WcCNIz1VEEEP/0SsdrwJ66B4MfP5YNjqHYWm
 +OTHR6Ovv2i10kc+NjOV/GN8sWPndnkLfIfJ4EqJ9BoQ9PDEYZilV2aleSQ4DrPm
 H7uGwBXQkfd76tZKX9pMToK76mkhg6M7M2LR3Suv3OGfOEzuozAOt3Ez37lpksTN
 2ByhHr/oGbhu99jC2ki5+k0ySH8PMqDBRxqrPbBzTD+FfB7bM11vAJbSNbSMQ21R
 ZwX0acZBLqb9J2Vf7tDsW+fCfz0TFo8JHW8jdLRFm/y2dpquzxswkkBpODgA8+VM
 0F5UbiUdkaIRug75I6N/OJ8+yLwdzuxm7ul+tbS3JrXGLAlK3850+dP2Pr5zQ2Ce
 zaYGRUy+tD5xMXqOKgzpu+Ia8XnDRLhOlHabiRd5fG6ZC9nR8E9uK52g79voSN07
 pADAJnVB03CGV/HdduDOI4C4UykUKubuArbQVkqWJcecV1Jic/tYI0gjeACmU1VF
 v8FzXpBUe3U3A0jauOz8PBz8M+k5qky/GbIrnEvXreBtKdt999LN9fykTN7rBOpo
 dk/6vTR1Jyv3aYc9EXHmRluktI6KmfWCqmRBOIgQveX1VhdRM+1w2LKC0+8co3dF
 v/DBh19KDyfPI8eOvxKykhn164UeAt03EXqDa46wFGr2nVOm/JiShL/d+QuyYU4G
 8xb/rET4JrhCG4gFMUZ7
 =1Oee
 -----END PGP SIGNATURE-----

Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi

Pull v4.4 EFI updates from Matt Fleming:

  - Make the EFI System Resource Table (ESRT) driver explicitly
    non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway. (Paul Gortmaker)

  - Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64. (Leif Lindholm)

  - Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses. (Matt Fleming)

  - Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel

  - Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module. (Ben Hutchings)

  - Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support. (Taku Izumi)

Note: there is a semantic conflict between the following two commits:

  8a53554e12e9 ("x86/efi: Fix multiple GOP device support")
  ae2ee627dc87 ("efifb: Add support for 64-bit frame buffer addresses")

I fixed up the interaction in the merge commit, changing the type of
current_fb_base from u32 to u64.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-14 16:51:34 +02:00
Kővágó, Zoltán
8a53554e12 x86/efi: Fix multiple GOP device support
When multiple GOP devices exists, but none of them implements
ConOut, the code should just choose the first GOP (according to
the comments). But currently 'fb_base' will refer to the last GOP,
while other parameters to the first GOP, which will likely
result in a garbled display.

I can reliably reproduce this bug using my ASRock Z87M Extreme4
motherboard with CSM and integrated GPU disabled, and two PCIe
video cards (NVidia GT640 and GTX980), booting from efi-stub
(booting from grub works fine).  On the primary display the
ASRock logo remains and on the secondary screen it is garbled
up completely.

Signed-off-by: Kővágó, Zoltán <DirtY.iCE.hu@gmail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444659236-24837-2-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-14 16:02:43 +02:00
Matt Fleming
ae2ee627dc efifb: Add support for 64-bit frame buffer addresses
The EFI Graphics Output Protocol uses 64-bit frame buffer addresses
but these get truncated to 32-bit by the EFI boot stub when storing
the address in the 'lfb_base' field of 'struct screen_info'.

Add a 'ext_lfb_base' field for the upper 32-bits of the frame buffer
address and set VIDEO_TYPE_CAPABILITY_64BIT_BASE when the field is
useable.

It turns out that the reason no one has required this support so far
is that there's actually code in tianocore to "downgrade" PCI
resources that have option ROMs and 64-bit BARS from 64-bit to 32-bit
to cope with legacy option ROMs that can't handle 64-bit addresses.
The upshot is that basically all GOP devices in the wild use a 32-bit
frame buffer address.

Still, it is possible to build firmware that uses a full 64-bit GOP
frame buffer address. Chad did, which led to him reporting this issue.

Add support in anticipation of GOP devices using 64-bit addresses more
widely, and so that efifb works out of the box when that happens.

Reported-by: Chad Page <chad.page@znyx.com>
Cc: Pete Hawkins <pete.hawkins@znyx.com>
Acked-by: Peter Jones <pjones@redhat.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-10-12 14:20:06 +01:00
Yinghai Lu
2d3862d26e lib/decompressors: use real out buf size for gunzip with kernel
When loading x86 64bit kernel above 4GiB with patched grub2, got kernel
gunzip error.

| early console in decompress_kernel
| decompress_kernel:
|       input: [0x807f2143b4-0x807ff61aee]
|      output: [0x807cc00000-0x807f3ea29b] 0x027ea29c: output_len
| boot via startup_64
| KASLR using RDTSC...
|  new output: [0x46fe000000-0x470138cfff] 0x0338d000: output_run_size
|  decompress: [0x46fe000000-0x47007ea29b] <=== [0x807f2143b4-0x807ff61aee]
|
| Decompressing Linux... gz...
|
| uncompression error
|
| -- System halted

the new buffer is at 0x46fe000000ULL, decompressor_gzip is using
0xffffffb901ffffff as out_len.  gunzip in lib/zlib_inflate/inflate.c cap
that len to 0x01ffffff and decompress fails later.

We could hit this problem with crashkernel booting that uses kexec loading
kernel above 4GiB.

We have decompress_* support:
    1. inbuf[]/outbuf[] for kernel preboot.
    2. inbuf[]/flush() for initramfs
    3. fill()/flush() for initrd.
This bug only affect kernel preboot path that use outbuf[].

Add __decompress and take real out_buf_len for gunzip instead of guessing
wrong buf size.

Fixes: 1431574a1c4 (lib/decompressors: fix "no limit" output buffer length)
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Cc: Jon Medhurst <tixy@linaro.org>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Linus Torvalds
11e612ddb4 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
 "The main x86 bootup related changes in this cycle were:

   - more boot time optimizations.  (Len Brown)

   - implement hex output to allow the debugging of early bootup
     parameters.  (Kees Cook)

   - remove obsolete MCA leftovers.  (Paolo Pisati)"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smpboot: Remove APIC.wait_for_init_deassert and atomic init_deasserted
  x86/smpboot: Remove SIPI delays from cpu_up()
  x86/smpboot: Remove udelay(100) when polling cpu_callin_map
  x86/smpboot: Remove udelay(100) when polling cpu_initialized_map
  x86/boot: Obsolete the MCA sys_desc_table
  x86/boot: Add hex output for debugging
2015-09-01 09:04:31 -07:00
Ingo Molnar
5461bd81bf Linux 4.2-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJV0R4AAAoJEHm+PkMAQRiG8xIH/AmiRd+JDrs0qqEy46p6X8Gn
 0lB5/KsGycvIGIBTiy2nZzcT0Ly6LeFUKUjzPytlOhIZPMrxMVMShDaQKCXXIMUr
 1mN6hkvpkLNnUhvL2fR6mm0zkjbz3zZEazFY+Jic8wQrtSkHgfH0DXqSAo8le0f8
 kNrd5BPPhIwvpHGaNGFdTpbgpPcalXyQk/fHyvDGidbyXzY/d7l05QfYJ6XCD4Zm
 IAy48iK5BFts2+z3aOYrOeuuCcm1qFX8YArqzE1rfPp+U/LQpfUfij4cmOqDLn/F
 qnv9E7bRRVovvrgKe4I3Trta8kT53VLJvqpdw2Usqo8zvhs4VyrYpHC+gEE6YUY=
 =9Rd4
 -----END PGP SIGNATURE-----

Merge tag 'v4.2-rc7' into x86/boot, to refresh the branch before merging new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-17 10:41:59 +02:00
Ingo Molnar
5b929bd11d Merge branch 'x86/urgent' into x86/asm, before applying dependent patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-31 10:23:35 +02:00
Dmitry Skorodumov
7cc03e4896 x86/efi: Use all 64 bit of efi_memmap in setup_e820()
The efi_info structure stores low 32 bits of memory map
in efi_memmap and high 32 bits in efi_memmap_hi.

While constructing pointer in the setup_e820(), need
to take into account all 64 bit of the pointer.

It is because on 64bit machine the function
efi_get_memory_map() may return full 64bit pointer and before
the patch that pointer was truncated.

The issue is triggered on Parallles virtual machine and
fixed with this patch.

Signed-off-by: Dmitry Skorodumov <sdmitry@parallels.com>
Cc: Denis V. Lunev <den@openvz.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-07-30 18:07:10 +01:00
Paolo Pisati
949163015c x86/boot: Obsolete the MCA sys_desc_table
The kernel does not support the MCA bus anymroe, so mark sys_desc_table
as obsolete: remove any reference from the code together with the remaining
of MCA logic.

bloat-o-meter output:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-55 (-55)
  function                                     old     new   delta
  i386_start_kernel                            128     119      -9
  setup_arch                                  1421    1375     -46

Signed-off-by: Paolo Pisati <p.pisati@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437409430-8491-1-git-send-email-p.pisati@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-21 10:55:11 +02:00
Kees Cook
79063a7c02 x86/boot: Add hex output for debugging
This is useful for reporting various addresses or other values
while debugging early boot, for example, the recent kernel image
size vs kernel run size. For example, when
CONFIG_X86_VERBOSE_BOOTUP is set, this is now visible at boot
time:

	early console in setup code
	early console in decompress_kernel
	input_data: 0x0000000001e1526e
	input_len: 0x0000000000732236
	output: 0x0000000001000000
	output_len: 0x0000000001535640
	run_size: 0x00000000021fb000
	KASLR using RDTSC...

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20150706230620.GA17501@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:59:05 +02:00
Andy Lutomirski
4ea1636b04 x86/asm/tsc: Rename native_read_tsc() to rdtsc()
Now that there is no paravirt TSC, the "native" is
inappropriate. The function does RDTSC, so give it the obvious
name: rdtsc().

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/fd43e16281991f096c1e4d21574d9e1402c62d39.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-06 15:23:28 +02:00
Andy Lutomirski
87be28aaf1 x86/asm/tsc: Replace rdtscll() with native_read_tsc()
Now that the ->read_tsc() paravirt hook is gone, rdtscll() is
just a wrapper around native_read_tsc(). Unwrap it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/d2449ae62c1b1fb90195bcfb19ef4a35883a04dc.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-06 15:23:26 +02:00
Linus Torvalds
88793e5c77 The libnvdimm sub-system introduces, in addition to the libnvdimm-core,
4 drivers / enabling modules:
 
 NFIT:
 Instantiates an "nvdimm bus" with the core and registers memory devices
 (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware Interface
 table).  After registering NVDIMMs the NFIT driver then registers
 "region" devices.  A libnvdimm-region defines an access mode and the
 boundaries of persistent memory media.  A region may span multiple
 NVDIMMs that are interleaved by the hardware memory controller.  In
 turn, a libnvdimm-region can be carved into a "namespace" device and
 bound to the PMEM or BLK driver which will attach a Linux block device
 (disk) interface to the memory.
 
 PMEM:
 Initially merged in v4.1 this driver for contiguous spans of persistent
 memory address ranges is re-worked to drive PMEM-namespaces emitted by
 the libnvdimm-core.  In this update the PMEM driver, on x86, gains the
 ability to assert that writes to persistent memory have been flushed all
 the way through the caches and buffers in the platform to persistent
 media.  See memcpy_to_pmem() and wmb_pmem().
 
 BLK:
 This new driver enables access to persistent memory media through "Block
 Data Windows" as defined by the NFIT.  The primary difference of this
 driver to PMEM is that only a small window of persistent memory is
 mapped into system address space at any given point in time.  Per-NVDIMM
 windows are reprogrammed at run time, per-I/O, to access different
 portions of the media.  BLK-mode, by definition, does not support DAX.
 
 BTT:
 This is a library, optionally consumed by either PMEM or BLK, that
 converts a byte-accessible namespace into a disk with atomic sector
 update semantics (prevents sector tearing on crash or power loss).  The
 sinister aspect of sector tearing is that most applications do not know
 they have a atomic sector dependency.  At least today's disk's rarely
 ever tear sectors and if they do one almost certainly gets a CRC error
 on access.  NVDIMMs will always tear and always silently.  Until an
 application is audited to be robust in the presence of sector-tearing
 the usage of BTT is recommended.
 
 Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
 Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
 Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
 Wysocki, and Bob Moore.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVjZGBAAoJEB7SkWpmfYgC4fkP/j+k6HmSRNU/yRYPyo7CAWvj
 3P5P1i6R6nMZZbjQrQArAXaIyLlFk4sEQDYsciR6dmslhhFZAkR2eFwVO5rBOyx3
 QN0yxEpyjJbroRFUrV/BLaFK4cq2oyJAFFHs0u7/pLHBJ4MDMqfRKAMtlnBxEkTE
 LFcqXapSlvWitSbjMdIBWKFEvncaiJ2mdsFqT4aZqclBBTj00eWQvEG9WxleJLdv
 +tj7qR/vGcwOb12X5UrbQXgwtMYos7A6IzhHbqwQL8IrOcJ6YB8NopJUpLDd7ZVq
 KAzX6ZYMzNueN4uvv6aDfqDRLyVL7qoxM9XIjGF5R8SV9sF2LMspm1FBpfowo1GT
 h2QMr0ky1nHVT32yspBCpE9zW/mubRIDtXxEmZZ53DIc4N6Dy9jFaNVmhoWtTAqG
 b9pndFnjUzzieCjX5pCvo2M5U6N0AQwsnq76/CasiWyhSa9DNKOg8MVDRg0rbxb0
 UvK0v8JwOCIRcfO3qiKcx+02nKPtjCtHSPqGkFKPySRvAdb+3g6YR26CxTb3VmnF
 etowLiKU7HHalLvqGFOlDoQG6viWes9Zl+ZeANBOCVa6rL2O7ZnXJtYgXf1wDQee
 fzgKB78BcDjXH4jHobbp/WBANQGN/GF34lse8yHa7Ym+28uEihDvSD1wyNLnefmo
 7PJBbN5M5qP5tD0aO7SZ
 =VtWG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm

Pull libnvdimm subsystem from Dan Williams:
 "The libnvdimm sub-system introduces, in addition to the
  libnvdimm-core, 4 drivers / enabling modules:

  NFIT:
    Instantiates an "nvdimm bus" with the core and registers memory
    devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
    Interface table).

    After registering NVDIMMs the NFIT driver then registers "region"
    devices.  A libnvdimm-region defines an access mode and the
    boundaries of persistent memory media.  A region may span multiple
    NVDIMMs that are interleaved by the hardware memory controller.  In
    turn, a libnvdimm-region can be carved into a "namespace" device and
    bound to the PMEM or BLK driver which will attach a Linux block
    device (disk) interface to the memory.

  PMEM:
    Initially merged in v4.1 this driver for contiguous spans of
    persistent memory address ranges is re-worked to drive
    PMEM-namespaces emitted by the libnvdimm-core.

    In this update the PMEM driver, on x86, gains the ability to assert
    that writes to persistent memory have been flushed all the way
    through the caches and buffers in the platform to persistent media.
    See memcpy_to_pmem() and wmb_pmem().

  BLK:
    This new driver enables access to persistent memory media through
    "Block Data Windows" as defined by the NFIT.  The primary difference
    of this driver to PMEM is that only a small window of persistent
    memory is mapped into system address space at any given point in
    time.

    Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
    different portions of the media.  BLK-mode, by definition, does not
    support DAX.

  BTT:
    This is a library, optionally consumed by either PMEM or BLK, that
    converts a byte-accessible namespace into a disk with atomic sector
    update semantics (prevents sector tearing on crash or power loss).

    The sinister aspect of sector tearing is that most applications do
    not know they have a atomic sector dependency.  At least today's
    disk's rarely ever tear sectors and if they do one almost certainly
    gets a CRC error on access.  NVDIMMs will always tear and always
    silently.  Until an application is audited to be robust in the
    presence of sector-tearing the usage of BTT is recommended.

  Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
  Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
  Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
  Wysocki, and Bob Moore"

* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
  arch, x86: pmem api for ensuring durability of persistent memory updates
  libnvdimm: Add sysfs numa_node to NVDIMM devices
  libnvdimm: Set numa_node to NVDIMM devices
  acpi: Add acpi_map_pxm_to_online_node()
  libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
  pmem: flag pmem block devices as non-rotational
  libnvdimm: enable iostat
  pmem: make_request cleanups
  libnvdimm, pmem: fix up max_hw_sectors
  libnvdimm, blk: add support for blk integrity
  libnvdimm, btt: add support for blk integrity
  fs/block_dev.c: skip rw_page if bdev has integrity
  libnvdimm: Non-Volatile Devices
  tools/testing/nvdimm: libnvdimm unit test infrastructure
  libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  nd_btt: atomic sector updates
  libnvdimm: infrastructure for btt devices
  libnvdimm: write blk label set
  libnvdimm: write pmem label set
  libnvdimm: blk labels and namespace instantiation
  ...
2015-06-29 10:34:42 -07:00
Ingo Molnar
927392d73a x86/boot: Add CONFIG_PARAVIRT_SPINLOCKS quirk to arch/x86/boot/compressed/misc.h
Linus reported the following new warning on x86 allmodconfig with GCC 5.1:

  > ./arch/x86/include/asm/spinlock.h: In function ‘arch_spin_lock’:
  > ./arch/x86/include/asm/spinlock.h:119:3: warning: implicit declaration
  > of function ‘__ticket_lock_spinning’ [-Wimplicit-function-declaration]
  >    __ticket_lock_spinning(lock, inc.tail);
  >    ^

This warning triggers because of these hacks in misc.h:

  /*
   * we have to be careful, because no indirections are allowed here, and
   * paravirt_ops is a kind of one. As it will only run in baremetal anyway,
   * we just keep it from happening
   */
  #undef CONFIG_PARAVIRT
  #undef CONFIG_KASAN

But these hacks were not updated when CONFIG_PARAVIRT_SPINLOCKS was added,
and eventually (with the introduction of queued paravirt spinlocks in
recent kernels) this created an invalid Kconfig combination and broke
the build.

So add a CONFIG_PARAVIRT_SPINLOCKS #undef line as well.

Also remove the _ASM_X86_DESC_H quirk: that undocumented quirk
was originally added ages ago, in:

  099e1377269a ("x86: use ELF format in compressed images.")

and I went back to that kernel (and fixed up the main Makefile
which didn't build anymore) and checked what failure it
avoided: it avoided an include file dependencies related
build failure related to our old x86-platforms code.

That old code is long gone, the header dependencies got cleaned
up, and the build does not fail anymore with the totality of
asm/desc.h included - so remove the quirk.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-29 11:06:37 +02:00
Dan Williams
ad5fb870c4 e820, efi: add ACPI 6.0 persistent memory types
ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-05-27 21:46:05 -04:00
Ingo Molnar
c102cb097d * Avoid garbage names in efivarfs due to buggy firmware by zero'ing
EFI variable name - Ross Lagerwall
 
  * Stop erroneously dropping upper 32-bits of boot command line pointer
    in EFI boot stub and stash them in ext_cmd_line_ptr - Roy Franz
 
  * Fix double-free bug in error handling code path of EFI runtime map
    code - Dan Carpenter
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVSOSjAAoJEC84WcCNIz1VXk4P/R4GwmmzZBdYAseiwv6u/NRm
 bTXnK7SN1ZyY8WibEm8ptXJuTIyXZxmQYr4lY97canJy8P7umtoCP7P3tS0Ier8U
 N1AMFGes7xlwBhjIRz2Cr9e5plr5H3qk65JNMuUDp0/MVuPEiNEzi6efbL82dh9S
 RCLxQ94paX+wV6ltQMKWGD3v0WnHkzouuCdETCGaozqQmJx6PGzDmJ51kXYRWDyP
 esTCZpRHlIzKN0u3XEFgswlIev2wab0BtjXYOzUqb0AH1Q13OgQfiswX3WIG6k+c
 3xuMH4JByBIDwOLudgu0D6Sst2QwVJZnw6JavoEgGCFao0n6IPzUGolAWLFMdDhL
 Kparzc6ObHpiqYtqBjJXW+awOENVS4qIrn9MHc9wwsJxXOy++0YnyYCgge0iia47
 F2/pOHvkd52QiQ0gC442W0EdX1VlPCUR04G0s4d3UX3O875yl80QTyLQ4n7ZK074
 3wfi/9+Fuv8wWMJ4HI8FJgaTl57KzAP4ZPh2cy8oPs6bkiiwlnMWH24bEhlxKBK4
 mEIze045kyswz3rV7j1WX3MSXrPA2cM95L5WlvVTxckMn40QwLPBWSDCOJIj3K5K
 yhXNHHfHzG/GRm3SfD2i1EcK4gUW82awl72jJn0F69YMI5a+T1BIppEMP2pzsWE4
 FcwvWDxzWwKxYKJosfkk
 =f7a2
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fixes from Matt Fleming:

 * Avoid garbage names in efivarfs due to buggy firmware by zeroing
   EFI variable name. (Ross Lagerwall)

 * Stop erroneously dropping upper 32 bits of boot command line pointer
   in EFI boot stub and stash them in ext_cmd_line_ptr. (Roy Franz)

 * Fix double-free bug in error handling code path of EFI runtime map
   code. (Dan Carpenter)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-06 08:30:24 +02:00