IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The way page-table folding is implemented on 32 bit, these functions are
not only setting, but also PUDs and even PMDs. Give the function a more
generic name to reflect that.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-16-git-send-email-joro@8bytes.org
Song Liu noticed switch_mm_irqs_off() taking a lot of CPU time in recent
kernels,using 1.8% of a 48 CPU system during a netperf to localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together take about half of the CPU time taken by
switch_mm_irqs_off().
However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which does not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.
Simply skipping changes to mm_cpumask(&init_mm) reduces overhead.
Reported-and-tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-8-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPUs in !is_lazy have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded during that stage of the munmap.
Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, which TLB contents might not yet be up to date yet.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-6-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.
The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.
This patch reduces total CPU use in the system by about 1-2% for a
memcache workload on two socket systems, and by about 1% for a heavily
multi-process netperf between two systems.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-5-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.
No functional changes, since the if (real_prev == next) branch
always returns.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.
The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.
The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Without this change the distance table calculation for emulated nodes
may use the wrong numa node and report an incorrect distance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/153089328103.27680.14778434392225818887.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ioremap() calls pud_free_pmd_page() / pmd_free_pte_page() when it creates
a pud / pmd map. The following preconditions are met at their entry.
- All pte entries for a target pud/pmd address range have been cleared.
- System-wide TLB purges have been peformed for a target pud/pmd address
range.
The preconditions assure that there is no stale TLB entry for the range.
Speculation may not cache TLB entries since it requires all levels of page
entries, including ptes, to have P & A-bits set for an associated address.
However, speculation may cache pud/pmd entries (paging-structure caches)
when they have P-bit set.
Add a system-wide TLB purge (INVLPG) to a single page after clearing
pud/pmd entry's P-bit.
SDM 4.10.4.1, Operation that Invalidate TLBs and Paging-Structure Caches,
states that:
INVLPG invalidates all paging-structure caches associated with the
current PCID regardless of the liner addresses to which they correspond.
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: mhocko@suse.com
Cc: akpm@linux-foundation.org
Cc: hpa@zytor.com
Cc: cpandya@codeaurora.org
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: Joerg Roedel <joro@8bytes.org>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20180627141348.21777-4-toshi.kani@hpe.com
The following kernel panic was observed on ARM64 platform due to a stale
TLB entry.
1. ioremap with 4K size, a valid pte page table is set.
2. iounmap it, its pte entry is set to 0.
3. ioremap the same address with 2M size, update its pmd entry with
a new value.
4. CPU may hit an exception because the old pmd entry is still in TLB,
which leads to a kernel panic.
Commit b6bdb7517c3d ("mm/vmalloc: add interfaces to free unmapped page
table") has addressed this panic by falling to pte mappings in the above
case on ARM64.
To support pmd mappings in all cases, TLB purge needs to be performed
in this case on ARM64.
Add a new arg, 'addr', to pud_free_pmd_page() and pmd_free_pte_page()
so that TLB purge can be added later in seprate patches.
[toshi.kani@hpe.com: merge changes, rewrite patch description]
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: mhocko@suse.com
Cc: akpm@linux-foundation.org
Cc: hpa@zytor.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: Will Deacon <will.deacon@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20180627141348.21777-3-toshi.kani@hpe.com
ioremap() supports pmd mappings on x86-PAE. However, kernel's pmd
tables are not shared among processes on x86-PAE. Therefore, any
update to sync'd pmd entries need re-syncing. Freeing a pte page
also leads to a vmalloc fault and hits the BUG_ON in vmalloc_sync_one().
Disable free page handling on x86-PAE. pud_free_pmd_page() and
pmd_free_pte_page() simply return 0 if a given pud/pmd entry is present.
This assures that ioremap() does not update sync'd pmd entries at the
cost of falling back to pte mappings.
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Reported-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: mhocko@suse.com
Cc: akpm@linux-foundation.org
Cc: hpa@zytor.com
Cc: cpandya@codeaurora.org
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20180627141348.21777-2-toshi.kani@hpe.com
- Remove 'nx_warning' and 'smep_warning', which are just pointless obfuscation.
- Also convert to pr_crit().
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180627090715.28076-1-dvyukov@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, also the swap offfset influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.
By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
KERN_CONT leads to split lines in kernel output
and complicates useful changes to printk like
printing context before each line.
Only acceptable use of continuations is basically
boot-time testing.
Get rid of it.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180625123808.227417-1-dvyukov@gmail.com
[ Removed unnecessary parentheses and prettified the printk statement. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The previous patch has limited swap file size so that large offsets cannot
clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.
It assumed that offsets are encoded starting with bit 12, same as pfn. But
on x86_64, offsets are encoded starting with bit 9.
Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
and 256TB with 46bit MAX_PA.
Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a new function to "adjust" the current fixed UV memory block size
of 2GB so it can be changed to a different physical boundary. This is
out of necessity so arch dependent code can accommodate specific BIOS
requirements which can align these new PMEM modules at less than the
default boundaries.
A "set order" type of function was used to insure that the memory block
size will be a power of two value without requiring a validity check.
64GB was chosen as the upper limit for memory block size values to
accommodate upcoming 4PB systems which have 6 more bits of physical
address space (46 becoming 52).
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Reviewed-by: Andrew Banman <andrew.banman@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dan.j.williams@intel.com
Cc: jgross@suse.com
Cc: kirill.shutemov@linux.intel.com
Cc: mhocko@suse.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/20180524201711.609546602@stormcage.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
With PHYS_ADDR_MAX there is now a type safe variant for all bits set.
Make use of it.
Patch created using a semantic patch as follows:
// <smpl>
@@
typedef phys_addr_t;
@@
-(phys_addr_t)ULLONG_MAX
+PHYS_ADDR_MAX
// </smpl>
Link: http://lkml.kernel.org/r/20180419214204.19322-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hussam reports:
I was poking around and for no real reason, I did cat /dev/mem and
strings /dev/mem. Then I saw the following warning in dmesg. I saved it
and rebooted immediately.
memremap attempted on mixed range 0x000000000009c000 size: 0x1000
------------[ cut here ]------------
WARNING: CPU: 0 PID: 11810 at kernel/memremap.c:98 memremap+0x104/0x170
[..]
Call Trace:
xlate_dev_mem_ptr+0x25/0x40
read_mem+0x89/0x1a0
__vfs_read+0x36/0x170
The memremap() implementation checks for attempts to remap System RAM
with MEMREMAP_WB and instead redirects those mapping attempts to the
linear map. However, that only works if the physical address range
being remapped is page aligned. In low memory we have situations like
the following:
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
...where System RAM intersects Reserved ranges on a sub-page page
granularity.
Given that devmem_is_allowed() special cases any attempt to map System
RAM in the first 1MB of memory, replace page_is_ram() with the more
precise region_intersects() to trap attempts to map disallowed ranges.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199999
Link: http://lkml.kernel.org/r/152856436164.18127.2847888121707136898.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 92281dee825f ("arch: introduce memremap()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Hussam Al-Tayeb <me@hussam.eu.org>
Tested-by: Hussam Al-Tayeb <me@hussam.eu.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 updates and fixes from Thomas Gleixner:
- Fix the (late) fallout from the vector management rework causing
hlist corruption and irq descriptor reference leaks caused by a
missing sanity check.
The straight forward fix triggered another long standing issue to
surface. The pre rework code hid the issue due to being way slower,
but now the chance that user space sees an EBUSY error return when
updating irq affinities is way higher, though quite a bunch of
userspace tools do not handle it properly despite the fact that EBUSY
could be returned for at least 10 years.
It turned out that the EBUSY return can be avoided completely by
utilizing the existing delayed affinity update mechanism for irq
remapped scenarios as well. That's a bit more error handling in the
kernel, but avoids fruitless fingerpointing discussions with tool
developers.
- Decouple PHYSICAL_MASK from AMD SME as its going to be required for
the upcoming Intel memory encryption support as well.
- Handle legacy device ACPI detection properly for newer platforms
- Fix the wrong argument ordering in the vector allocation tracepoint
- Simplify the IDT setup code for the APIC=n case
- Use the proper string helpers in the MTRR code
- Remove a stale unused VDSO source file
- Convert the microcode update lock to a raw spinlock as its used in
atomic context.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Enable CMT and MBM on new Skylake stepping
x86/apic/vector: Print APIC control bits in debugfs
genirq/affinity: Defer affinity setting if irq chip is busy
x86/platform/uv: Use apic_ack_irq()
x86/ioapic: Use apic_ack_irq()
irq_remapping: Use apic_ack_irq()
x86/apic: Provide apic_ack_irq()
genirq/migration: Avoid out of line call if pending is not set
genirq/generic_pending: Do not lose pending affinity update
x86/apic/vector: Prevent hlist corruption and leaks
x86/vector: Fix the args of vector_alloc tracepoint
x86/idt: Simplify the idt_setup_apic_and_irq_gates()
x86/platform/uv: Remove extra parentheses
x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME
x86: Mark native_set_p4d() as __always_inline
x86/microcode: Make the late update update_lock a raw lock for RT
x86/mtrr: Convert to use strncpy_from_user() helper
x86/mtrr: Convert to use match_string() helper
x86/vdso: Remove unused file
x86/i8237: Register device based on FADT legacy boot flag
For pgd page table pages, x86 overloads the page->index field to store a
pointer to the mm_struct. Rename this to pt_mm so it's visible to other
users.
Link: http://lkml.kernel.org/r/20180518194519.3820-13-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AMD SME claims one bit from physical address to indicate whether the
page is encrypted or not. To achieve that we clear out the bit from
__PHYSICAL_MASK.
The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
For instance for upcoming Intel Multi-Key Total Memory Encryption.
Factor it out into a separate feature with own Kconfig handle.
It also helps with overhead of AMD SME. It saves more than 3k in .text
on defconfig + AMD_MEM_ENCRYPT:
add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)
We would need to return to this once we have infrastructure to patch
constants in code. That's good candidate for it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180518113028.79825-1-kirill.shutemov@linux.intel.com
Pull x86 debug updates from Ingo Molnar:
"This contains the x86 oops code printing reorganization and cleanups
from Borislav Betkov, with a particular focus in enhancing opcode
dumping all around"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/dumpstack: Explain the reasoning for the prologue and buffer size
x86/dumpstack: Save first regs set for the executive summary
x86/dumpstack: Add a show_ip() function
x86/fault: Dump user opcode bytes on fatal faults
x86/dumpstack: Add loglevel argument to show_opcodes()
x86/dumpstack: Improve opcodes dumping in the code section
x86/dumpstack: Carve out code-dumping into a function
x86/dumpstack: Unexport oops_begin()
x86/dumpstack: Remove code_bytes
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apm: Fix spelling mistake: "caculate" -> "calculate"
x86/mtrr: Rename main.c to mtrr.c and remove duplicate prefixes
x86: Remove pr_fmt duplicate logging prefixes
x86/early-quirks: Rename duplicate define of dev_err
x86/bpf: Clean up non-standard comments, to make the code more readable
Pull x86 boot updates from Ingo Molnar:
- Centaur CPU updates (David Wang)
- AMD and other CPU topology enumeration improvements and fixes
(Borislav Petkov, Thomas Gleixner, Suravee Suthikulpanit)
- Continued 5-level paging work (Kirill A. Shutemov)
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Mark __pgtable_l5_enabled __initdata
x86/mm: Mark p4d_offset() __always_inline
x86/mm: Introduce the 'no5lvl' kernel parameter
x86/mm: Stop pretending pgtable_l5_enabled is a variable
x86/mm: Unify pgtable_l5_enabled usage in early boot code
x86/boot/compressed/64: Fix trampoline page table address calculation
x86/CPU: Move x86_cpuinfo::x86_max_cores assignment to detect_num_cpu_cores()
x86/Centaur: Report correct CPU/cache topology
x86/CPU: Move cpu_detect_cache_sizes() into init_intel_cacheinfo()
x86/CPU: Make intel_num_cpu_cores() generic
x86/CPU: Move cpu local function declarations to local header
x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available
x86/CPU: Modify detect_extended_topology() to return result
x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
x86/CPU: Rename intel_cacheinfo.c to cacheinfo.c
perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
x86/Centaur: Initialize supported CPU features properly
Pull siginfo updates from Eric Biederman:
"This set of changes close the known issues with setting si_code to an
invalid value, and with not fully initializing struct siginfo. There
remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64
and x86 to get the code that generates siginfo into a simpler and more
maintainable state. Most of that work involves refactoring the signal
handling code and thus careful code review.
Also not included is the work to shrink the in kernel version of
struct siginfo. That depends on getting the number of places that
directly manipulate struct siginfo under control, as it requires the
introduction of struct kernel_siginfo for the in kernel things.
Overall this set of changes looks like it is making good progress, and
with a little luck I will be wrapping up the siginfo work next
development cycle"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
signal/sh: Stop gcc warning about an impossible case in do_divide_error
signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions
signal/um: More carefully relay signals in relay_signal.
signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR}
signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code
signal/signalfd: Add support for SIGSYS
signal/signalfd: Remove __put_user from signalfd_copyinfo
signal/xtensa: Use force_sig_fault where appropriate
signal/xtensa: Consistenly use SIGBUS in do_unaligned_user
signal/um: Use force_sig_fault where appropriate
signal/sparc: Use force_sig_fault where appropriate
signal/sparc: Use send_sig_fault where appropriate
signal/sh: Use force_sig_fault where appropriate
signal/s390: Use force_sig_fault where appropriate
signal/riscv: Replace do_trap_siginfo with force_sig_fault
signal/riscv: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_mceerr where appropriate
signal/openrisc: Use force_sig_fault where appropriate
signal/nios2: Use force_sig_fault where appropriate
...
pgtable_l5_enabled is defined using cpu_feature_enabled() but we refer
to it as a variable. This is misleading.
Make pgtable_l5_enabled() a function.
We cannot literally define it as a function due to circular dependencies
between header files. Function-alike macros is close enough.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Usually pgtable_l5_enabled is defined using cpu_feature_enabled().
cpu_feature_enabled() is not available in early boot code. We use
several different preprocessor tricks to get around it. It's messy.
Unify them all.
If cpu_feature_enabled() is not yet available, USE_EARLY_PGTABLE_L5 can
be defined before all includes. It makes pgtable_l5_enabled rely on
__pgtable_l5_enabled variable instead. This approach fits all early
users.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I got a bug report that the following code (roughly) was
causing a SIGSEGV:
mprotect(ptr, size, PROT_EXEC);
mprotect(ptr, size, PROT_NONE);
mprotect(ptr, size, PROT_READ);
*ptr = 100;
The problem is hit when the mprotect(PROT_EXEC)
is implicitly assigned a protection key to the VMA, and made
that key ACCESS_DENY|WRITE_DENY. The PROT_NONE mprotect()
failed to remove the protection key, and the PROT_NONE->
PROT_READ left the PTE usable, but the pkey still in place
and left the memory inaccessible.
To fix this, we ensure that we always "override" the pkee
at mprotect() if the VMA does not have execute-only
permissions, but the VMA has the execute-only pkey.
We had a check for PROT_READ/WRITE, but it did not work
for PROT_NONE. This entirely removes the PROT_* checks,
which ensures that PROT_NONE now works.
Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Fixes: 62b5f7d013f ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
Link: http://lkml.kernel.org/r/20180509171351.084C5A71@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sometimes it is useful to see which user opcode bytes RIP points to
when a fault happens: be it to rule out RIP corruption, to dump info
early during boot, when doing core dumps is impossible due to not having
a writable filesystem yet.
Sometimes it is useful if debugging an issue and one doesn't have access
to the executable which caused the fault in order to disassemble it.
That last aspect might have some security implications so
show_unhandled_signals could be revisited for that or a new config option
added.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-7-bp@alien8.de
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls clear_siginfo for clarity.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
commit 26d35ca6c3776784f8156e1d6f80cc60d9a2a915
RANDSTRUCT derives its hardening benefits from the attacker's lack of
knowledge about the layout of kernel data structures. Keep the kernel
image non-global in cases where RANDSTRUCT is in use to help keep the
layout a secret.
Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID)
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: https://lkml.kernel.org/r/20180420222026.D0B4AAC9@viggo.jf.intel.com
commit abb67605203687c8b7943d760638d0301787f8d9
Kees reported to me that I made too much of the kernel image global.
It was far more than just text:
I think this is too much set global: _end is after data,
bss, and brk, and all kinds of other stuff that could
hold secrets. I think this should match what
mark_rodata_ro() is doing.
This does exactly that. We use __end_rodata_hpage_align as our
marker both because it is huge-page-aligned and it does not contain
any sections we expect to hold secrets.
Kees's logic was that r/o data is in the kernel image anyway and,
in the case of traditional distributions, can be freely downloaded
from the web, so there's no reason to hide it.
Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID)
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: https://lkml.kernel.org/r/20180420222023.1C8B2B20@viggo.jf.intel.com
commit 231df823c4f04176f607afc4576c989895cff40e
The pageattr.c code attempts to process "faults" when it goes looking
for PTEs to change and finds non-present entries. It allows these
faults in the linear map which is "expected to have holes", but
WARN()s about them elsewhere, like when called on the kernel image.
However, change_page_attr_clear() is now called on the kernel image in the
process of trying to clear the Global bit.
This trips the warning in __cpa_process_fault() if a non-present PTE is
encountered in the kernel image. The "holes" in the kernel image result
from free_init_pages()'s use of set_memory_np(). These holes are totally
fine, and result from normal operation, just as they would be in the kernel
linear map.
Just silence the warning when holes in the kernel image are encountered.
Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
Reported-by: Mariusz Ceier <mceier@gmail.com>
Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Kees Cook <keescook@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: https://lkml.kernel.org/r/20180420222021.1C7D2B3F@viggo.jf.intel.com
commit 16dce603adc9de4237b7bf2ff5c5290f34373e7b
Part of the global bit _setting_ patches also includes clearing the
Global bit when it should not be enabled. That is done with
set_memory_nonglobal(), which uses change_page_attr_clear() in
pageattr.c under the covers.
The TLB flushing code inside pageattr.c has has checks like
BUG_ON(irqs_disabled()), looking for interrupt disabling that might
cause deadlocks. But, these also trip in early boot on certain
preempt configurations. Just copy the existing BUG_ON() sequence from
cpa_flush_range() to the other two sites and check for early boot.
Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
Reported-by: Mariusz Ceier <mceier@gmail.com>
Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Kees Cook <keescook@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: https://lkml.kernel.org/r/20180420222019.20C4A410@viggo.jf.intel.com
Pull x86 pti updates from Thomas Gleixner:
"Another series of PTI related changes:
- Remove the manual stack switch for user entries from the idtentry
code. This debloats entry by 5k+ bytes of text.
- Use the proper types for the asm/bootparam.h defines to prevent
user space compile errors.
- Use PAGE_GLOBAL for !PCID systems to gain back performance
- Prevent setting of huge PUD/PMD entries when the entries are not
leaf entries otherwise the entries to which the PUD/PMD points to
and are populated get lost"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
x86/pti: Leave kernel text global for !PCID
x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image
x86/pti: Enable global pages for shared areas
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
x86/mm: Comment _PAGE_GLOBAL mystery
x86/mm: Remove extra filtering in pageattr code
x86/mm: Do not auto-massage page protections
x86/espfix: Document use of _PAGE_GLOBAL
x86/mm: Introduce "default" kernel PTE mask
x86/mm: Undo double _PAGE_PSE clearing
x86/mm: Factor out pageattr _PAGE_GLOBAL setting
x86/entry/64: Drop idtentry's manual stack switch for user entries
x86/uapi: Fix asm/bootparam.h userspace compilation errors
The pmd_set_huge() and pud_set_huge() functions are used from
the generic ioremap() code to establish large mappings where this
is possible.
But the generic ioremap() code does not check whether the
PMD/PUD entries are already populated with a non-leaf entry,
so that any page-table pages these entries point to will be
lost.
Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a
BUG_ON() in vmalloc_sync_one() when PMD entries are synced
from swapper_pg_dir to the current page-table. This happens
because the PMD entry from swapper_pg_dir was promoted to a
huge-page entry while the current PGD still contains the
non-leaf entry. Because both entries are present and point
to a different page, the BUG_ON() triggers.
This was actually triggered with pti-x32 enabled in a KVM
virtual machine by the graphics driver.
A real and better fix for that would be to improve the
page-table handling in the generic ioremap() code. But that is
out-of-scope for this patch-set and left for later work.
Reported-by: David H. Gutteridge <dhgutteridge@sympatico.ca>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <llong@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Global pages are bad for hardening because they potentially let an
exploit read the kernel image via a Meltdown-style attack which
makes it easier to find gadgets.
But, global pages are good for performance because they reduce TLB
misses when making user/kernel transitions, especially when PCIDs
are not available, such as on older hardware, or where a hypervisor
has disabled them for some reason.
This patch implements a basic, sane policy: If you have PCIDs, you
only map a minimal amount of kernel text global. If you do not have
PCIDs, you map all kernel text global.
This policy effectively makes PCIDs something that not only adds
performance but a little bit of hardening as well.
I ran a simple "lseek" microbenchmark[1] to test the benefit on
a modern Atom microserver. Most of the benefit comes from applying
the series before this patch ("entry only"), but there is still a
signifiant benefit from this patch.
No Global Lines (baseline ): 6077741 lseeks/sec
88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%)
94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%)
[1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205518.E3D989EB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Summary:
In current kernels, with PTI enabled, no pages are marked Global. This
potentially increases TLB misses. But, the mechanism by which the Global
bit is set and cleared is rather haphazard. This patch makes the process
more explicit. In the end, it leaves us with Global entries in the page
tables for the areas truly shared by userspace and kernel and increases
TLB hit rates.
The place this patch really shines in on systems without PCIDs. In this
case, we are using an lseek microbenchmark[1] to see how a reasonably
non-trivial syscall behaves. Higher is better:
No Global pages (baseline): 6077741 lseeks/sec
88 Global Pages (this set): 7528609 lseeks/sec (+23.9%)
On a modern Skylake desktop with PCIDs, the benefits are tangible, but not
huge for a kernel compile (lower is better):
No Global pages (baseline): 186.951 seconds time elapsed ( +- 0.35% )
28 Global pages (this set): 185.756 seconds time elapsed ( +- 0.09% )
-1.195 seconds (-0.64%)
I also re-checked everything using the lseek1 test[1]:
No Global pages (baseline): 15783951 lseeks/sec
28 Global pages (this set): 16054688 lseeks/sec
+270737 lseeks/sec (+1.71%)
The effect is more visible, but still modest.
Details:
The kernel page tables are inherited from head_64.S which rudely marks
them as _PAGE_GLOBAL. For PTI, we have been relying on the grace of
$DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL.
This patch tries to do better.
First, stop filtering out "unsupported" bits from being cleared in the
pageattr code. It's fine to filter out *setting* these bits but it
is insane to keep us from clearing them.
Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map.
Do not rely on pageattr to do it magically.
After this patch, we can see that "GLB" shows up in each copy of the
page tables, that we have the same number of global entries in each
and that they are the *same* entries.
/sys/kernel/debug/page_tables/current_kernel:11
/sys/kernel/debug/page_tables/current_user:11
/sys/kernel/debug/page_tables/kernel:11
9caae8ad6a1fb53aca2407ec037f612d current_kernel.GLB
9caae8ad6a1fb53aca2407ec037f612d current_user.GLB
9caae8ad6a1fb53aca2407ec037f612d kernel.GLB
A quick visual audit also shows that all the entries make sense.
0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000
is the entry/exit text:
0xfffffe0000000000-0xfffffe0000002000 8K ro GLB NX pte
0xfffffe0000002000-0xfffffe0000003000 4K RW GLB NX pte
0xfffffe0000003000-0xfffffe0000006000 12K ro GLB NX pte
0xfffffe0000006000-0xfffffe0000007000 4K ro GLB x pte
0xfffffe0000007000-0xfffffe000000d000 24K RW GLB NX pte
0xfffffe000002d000-0xfffffe000002e000 4K ro GLB NX pte
0xfffffe000002e000-0xfffffe000002f000 4K RW GLB NX pte
0xfffffe000002f000-0xfffffe0000032000 12K ro GLB NX pte
0xfffffe0000032000-0xfffffe0000033000 4K ro GLB x pte
0xfffffe0000033000-0xfffffe0000039000 24K RW GLB NX pte
0xffffffff81c00000-0xffffffff81e00000 2M ro PSE GLB x pmd
[1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205517.C80FBE05@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The entry/exit text and cpu_entry_area are mapped into userspace and
the kernel. But, they are not _PAGE_GLOBAL. This creates unnecessary
TLB misses.
Add the _PAGE_GLOBAL flag for these areas.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205515.2977EE7D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__ro_after_init data gets stuck in the .rodata section. That's normally
fine because the kernel itself manages the R/W properties.
But, if we run __change_page_attr() on an area which is __ro_after_init,
the .rodata checks will trigger and force the area to be immediately
read-only, even if it is early-ish in boot. This caused problems when
trying to clear the _PAGE_GLOBAL bit for these area in the PTI code:
it cleared _PAGE_GLOBAL like I asked, but also took it up on itself
to clear _PAGE_RW. The kernel then oopses the next time it wrote to
a __ro_after_init data structure.
To fix this, add the kernel_set_to_readonly check, just like we have
for kernel text, just a few lines below in this function.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The pageattr code has a mode where it can set or clear PTE bits in
existing PTEs, so the page protections of the *new* PTEs come from
one of two places:
1. The set/clear masks: cpa->mask_clr / cpa->mask_set
2. The existing PTE
We filter ->mask_set/clr for supported PTE bits at entry to
__change_page_attr() so we never need to filter them again.
The only other place permissions can come from is an existing PTE
and those already presumably have good bits. We do not need to filter
them again.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A PTE is constructed from a physical address and a pgprotval_t.
__PAGE_KERNEL, for instance, is a pgprot_t and must be converted
into a pgprotval_t before it can be used to create a PTE. This is
done implicitly within functions like pfn_pte() by massage_pgprot().
However, this makes it very challenging to set bits (and keep them
set) if your bit is being filtered out by massage_pgprot().
This moves the bit filtering out of pfn_pte() and friends. For
users of PAGE_KERNEL*, filtering will be done automatically inside
those macros but for users of __PAGE_KERNEL*, they need to do their
own filtering now.
Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
instead of massage_pgprot(). This way, we still *look* for
unsupported bits and properly warn about them if we find them. This
might happen if an unfiltered __PAGE_KERNEL* value was passed in,
for instance.
- printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
- boot crash fix from: Tom Lendacky <thomas.lendacky@amd.com>
- crash bisected by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de>
Fixed-by: Tom Lendacky <thomas.lendacky@amd.com>
Bisected-by: Mike Galbraith <efault@gmx.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juergen Gross noticed that commit f7f99100d8d ("mm: stop zeroing memory
during allocation in vmemmap") broke XEN PV domains when deferred struct
page initialization is enabled.
This is because the xen's PagePinned() flag is getting erased from
struct pages when they are initialized later in boot.
Juergen fixed this problem by disabling deferred pages on xen pv
domains. It is desirable, however, to have this feature available as it
reduces boot time. This fix re-enables the feature for pv-dmains, and
fixes the problem the following way:
The fix is to delay setting PagePinned flag until struct pages for all
allocated memory are initialized, i.e. until after free_all_bootmem().
A new x86_init.hyper op init_after_bootmem() is called to let xen know
that boot allocator is done, and hence struct pages for all the
allocated memory are now initialized. If deferred page initialization
is enabled, the rest of struct pages are going to be initialized later
in boot once page_alloc_init_late() is called.
xen_after_bootmem() walks page table's pages and marks them pinned.
Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>