IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The debug IST stack is actually two separate debug stacks to handle #DB
recursion. This is required because the CPU starts always at top of stack
on exception entry, which means on #DB recursion the second #DB would
overwrite the stack of the first.
The low level entry code therefore adjusts the top of stack on entry so a
secondary #DB starts from a different stack page. But the stack pages are
adjacent without a guard page between them.
Split the debug stack into 3 stacks which are separated by guard pages. The
3rd stack is never mapped into the cpu_entry_area and is only there to
catch triple #DB nesting:
--- top of DB_stack <- Initial stack
--- end of DB_stack
guard page
--- top of DB1_stack <- Top of stack after entering first #DB
--- end of DB1_stack
guard page
--- top of DB2_stack <- Top of stack after entering second #DB
--- end of DB2_stack
guard page
If DB2 would not act as the final guard hole, a second #DB would point the
top of #DB stack to the stack below #DB1 which would be valid and not catch
the not so desired triple nesting.
The backing store does not allocate any memory for DB2 and its guard page
as it is not going to be mapped into the cpu_entry_area.
- Adjust the low level entry code so it adjusts top of #DB with the offset
between the stacks instead of exception stack size.
- Make the dumpstack code aware of the new stacks.
- Adjust the in_debug_stack() implementation and move it into the NMI code
where it belongs. As this is NMI hotpath code, it just checks the full
area between top of DB_stack and bottom of DB1_stack without checking
for the guard page. That's correct because the NMI cannot hit a
stackpointer pointing to the guard page between DB and DB1 stack. Even
if it would, then the NMI operation still is unaffected, but the resume
of the debug exception on the topmost DB stack will crash by touching
the guard page.
[ bp: Make exception_stack_names static const char * const ]
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: linux-doc@vger.kernel.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
All usage sites which expected that the exception stacks in the CPU entry
area are mapped linearly are fixed up. Enable guard pages between the
IST stacks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160145.349862042@linutronix.de
The entry order of the TSS.IST array and the order of the stack
storage/mapping are not required to be the same.
With the upcoming split of the debug stack this is going to fall apart as
the number of TSS.IST array entries stays the same while the actual stacks
are increasing.
Make them separate so that code like dumpstack can just utilize the mapping
order. The IST index is solely required for the actual TSS.IST array
initialization.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160145.241588113@linutronix.de
Store a pointer to the per cpu entry area exception stack mappings to allow
fast retrieval.
Required for converting various places from using the shadow IST array to
directly doing address calculations on the actual mapping address.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160144.680960459@linutronix.de
At the moment everything assumes a full linear mapping of the various
exception stacks. Adding guard pages to the cpu entry area mapping of the
exception stacks will break that assumption.
As a preparatory step convert both the real storage and the effective
mapping in the cpu entry area from character arrays to structures.
To ensure that both arrays have the same ordering and the same size of the
individual stacks fill the members with a macro. The guard size is the only
difference between the two resulting structures. For now both have guard
size 0 until the preparation of all usage sites is done.
Provide a couple of helper macros which are used in the following
conversions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160144.506807893@linutronix.de
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
addresses which must be visible in any execution context.
So it is required to make these mappings visible to user space when kernel
page table isolation is active.
Provide enough room for the buffer mappings in the cpu_entry_area so the
buffers are available in the user space visible page tables.
At the point where the kernel side entry area is populated there is no
buffer available yet, but the kernel PMD must be populated. To achieve this
set the entries for these buffers to non present.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Put the cpu_entry_area into a separate P4D entry. The fixmap gets too big
and 0-day already hit a case where the fixmap PTEs were cleared by
cleanup_highmap().
Aside of that the fixmap API is a pain as it's all backwards.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Separate the cpu_entry_area code out of cpu/common.c and the fixmap.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>