2005-04-17 02:20:36 +04:00
#
2012-03-25 03:23:57 +04:00
# Automatically generated file; DO NOT EDIT.
# User Mode Linux/i386 3.3.0 Kernel Configuration
2005-04-17 02:20:36 +04:00
#
2008-02-24 02:23:24 +03:00
CONFIG_DEFCONFIG_LIST="arch/$ARCH/defconfig"
2005-04-17 02:20:36 +04:00
CONFIG_UML=y
CONFIG_MMU=y
2008-02-24 02:23:24 +03:00
CONFIG_NO_IOMEM=y
# CONFIG_TRACE_IRQFLAGS_SUPPORT is not set
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_STACKTRACE_SUPPORT is not set
2005-04-17 02:20:36 +04:00
CONFIG_GENERIC_CALIBRATE_DELAY=y
2008-02-24 02:23:24 +03:00
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_CLOCKEVENTS=y
2006-05-01 23:15:59 +04:00
CONFIG_IRQ_RELEASE_METHOD=y
2008-02-24 02:23:24 +03:00
CONFIG_HZ=100
2005-04-17 02:20:36 +04:00
#
# UML-specific options
#
2006-05-01 23:15:59 +04:00
#
# Host processor type and features
#
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
2012-03-25 03:23:57 +04:00
# CONFIG_MELAN is not set
2006-05-01 23:15:59 +04:00
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
2007-05-02 21:27:05 +04:00
# CONFIG_MVIAC7 is not set
2008-02-24 02:23:24 +03:00
# CONFIG_MCORE2 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_MATOM is not set
2006-05-01 23:15:59 +04:00
# CONFIG_X86_GENERIC is not set
2012-03-25 03:23:57 +04:00
CONFIG_X86_INTERNODE_CACHE_SHIFT=5
2006-05-01 23:15:59 +04:00
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=5
2008-02-24 02:23:24 +03:00
CONFIG_X86_XADD=y
2006-05-01 23:15:59 +04:00
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
2012-03-25 03:23:57 +04:00
CONFIG_X86_CMPXCHG64=y
2008-02-24 02:23:24 +03:00
CONFIG_X86_CMOV=y
2012-03-25 03:23:57 +04:00
CONFIG_X86_MINIMUM_CPU_FAMILY=5
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
CONFIG_CPU_SUP_UMC_32=y
2005-05-01 19:58:54 +04:00
CONFIG_UML_X86=y
# CONFIG_64BIT is not set
2012-03-25 03:23:57 +04:00
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_RWSEM_GENERIC_SPINLOCK=y
2005-04-17 02:20:36 +04:00
# CONFIG_3_LEVEL_PGTABLES is not set
CONFIG_ARCH_HAS_SC_SIGNALS=y
CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA=y
2006-05-01 23:15:59 +04:00
CONFIG_GENERIC_HWEIGHT=y
2012-03-25 03:23:57 +04:00
# CONFIG_STATIC_LINK is not set
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
2012-03-25 03:23:57 +04:00
CONFIG_PAGEFLAGS_EXTENDED=y
2006-05-01 23:15:59 +04:00
CONFIG_SPLIT_PTLOCK_CPUS=4
2012-03-25 03:23:57 +04:00
# CONFIG_COMPACTION is not set
# CONFIG_PHYS_ADDR_T_64BIT is not set
2008-02-24 02:23:24 +03:00
CONFIG_ZONE_DMA_FLAG=0
CONFIG_VIRT_TO_BUS=y
2012-03-25 03:23:57 +04:00
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_NEED_PER_CPU_KM=y
# CONFIG_CLEANCACHE is not set
2007-10-16 12:27:25 +04:00
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
2007-10-16 12:27:24 +04:00
CONFIG_HIGH_RES_TIMERS=y
2008-02-24 02:23:24 +03:00
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
CONFIG_LD_SCRIPT_DYN=y
2005-04-17 02:20:36 +04:00
CONFIG_BINFMT_ELF=y
2012-03-25 03:23:57 +04:00
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_HAVE_AOUT=y
2008-02-24 02:23:24 +03:00
# CONFIG_BINFMT_AOUT is not set
2005-04-17 02:20:36 +04:00
CONFIG_BINFMT_MISC=m
2008-02-05 09:31:28 +03:00
CONFIG_HOSTFS=y
2006-05-01 23:15:59 +04:00
# CONFIG_HPPFS is not set
2005-04-17 02:20:36 +04:00
CONFIG_MCONSOLE=y
2007-10-16 12:27:17 +04:00
CONFIG_MAGIC_SYSRQ=y
2007-05-11 09:22:35 +04:00
CONFIG_KERNEL_STACK_ORDER=0
2012-03-25 03:23:57 +04:00
# CONFIG_MMAPPER is not set
CONFIG_NO_DMA=y
2005-04-17 02:20:36 +04:00
#
2008-02-24 02:23:24 +03:00
# General setup
2005-04-17 02:20:36 +04:00
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
2008-02-24 02:23:24 +03:00
CONFIG_INIT_ENV_ARG_LIMIT=128
2012-03-25 03:23:57 +04:00
CONFIG_CROSS_COMPILE=""
2005-04-17 02:20:36 +04:00
CONFIG_LOCALVERSION=""
2006-05-01 23:15:59 +04:00
CONFIG_LOCALVERSION_AUTO=y
2012-03-25 03:23:57 +04:00
CONFIG_DEFAULT_HOSTNAME="(none)"
2005-04-17 02:20:36 +04:00
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
2008-02-24 02:23:24 +03:00
CONFIG_SYSVIPC_SYSCTL=y
2005-04-17 02:20:36 +04:00
CONFIG_POSIX_MQUEUE=y
2012-03-25 03:23:57 +04:00
CONFIG_POSIX_MQUEUE_SYSCTL=y
2005-04-17 02:20:36 +04:00
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_FHANDLE is not set
2008-02-24 02:23:24 +03:00
# CONFIG_TASKSTATS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_AUDIT is not set
2012-03-25 03:23:57 +04:00
CONFIG_HAVE_GENERIC_HARDIRQS=y
#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_SHOW=y
#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_TREE_RCU_TRACE is not set
2005-04-17 02:20:36 +04:00
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
2008-02-24 02:23:24 +03:00
CONFIG_LOG_BUF_SHIFT=14
2012-03-25 03:23:57 +04:00
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
# CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is not set
# CONFIG_CGROUP_MEM_RES_CTLR_KMEM is not set
CONFIG_CGROUP_SCHED=y
2008-02-24 02:23:24 +03:00
CONFIG_FAIR_GROUP_SCHED=y
2012-03-25 03:23:57 +04:00
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_BLK_CGROUP=m
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_NET_NS=y
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_MM_OWNER=y
2008-02-24 02:23:24 +03:00
CONFIG_SYSFS_DEPRECATED=y
2012-03-25 03:23:57 +04:00
# CONFIG_SYSFS_DEPRECATED_V2 is not set
2006-05-01 23:15:59 +04:00
# CONFIG_RELAY is not set
2008-02-24 02:23:24 +03:00
# CONFIG_BLK_DEV_INITRD is not set
2006-05-01 23:15:59 +04:00
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
2008-02-24 02:23:24 +03:00
CONFIG_SYSCTL=y
2012-03-25 03:23:57 +04:00
CONFIG_ANON_INODES=y
2011-01-21 01:44:16 +03:00
# CONFIG_EXPERT is not set
2008-02-24 02:23:24 +03:00
CONFIG_UID16=y
2012-03-25 03:23:57 +04:00
# CONFIG_SYSCTL_SYSCALL is not set
2005-04-17 02:20:36 +04:00
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
2006-05-01 23:15:59 +04:00
CONFIG_HOTPLUG=y
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
CONFIG_PRINTK=y
CONFIG_BUG=y
2006-05-01 23:15:59 +04:00
CONFIG_ELF_CORE=y
2005-04-17 02:20:36 +04:00
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
2008-02-24 02:23:24 +03:00
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
2005-04-17 02:20:36 +04:00
CONFIG_SHMEM=y
2012-03-25 03:23:57 +04:00
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
#
# Kernel Performance Events And Counters
#
2008-02-24 02:23:24 +03:00
CONFIG_VM_EVENT_COUNTERS=y
2012-03-25 03:23:57 +04:00
CONFIG_COMPAT_BRK=y
2006-05-01 23:15:59 +04:00
CONFIG_SLAB=y
2008-02-24 02:23:24 +03:00
# CONFIG_SLUB is not set
# CONFIG_PROFILING is not set
2012-03-25 03:23:57 +04:00
#
# GCOV-based kernel profiling
#
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
2008-02-24 02:23:24 +03:00
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
2005-04-17 02:20:36 +04:00
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
2012-03-25 03:23:57 +04:00
# CONFIG_MODULE_FORCE_LOAD is not set
2005-04-17 02:20:36 +04:00
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
# CONFIG_MODVERSIONS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_MODULE_SRCVERSION_ALL is not set
2008-02-24 02:23:24 +03:00
CONFIG_BLOCK=y
2012-03-25 03:23:57 +04:00
CONFIG_LBDAF=y
2008-02-24 02:23:24 +03:00
# CONFIG_BLK_DEV_BSG is not set
2012-03-25 03:23:57 +04:00
# CONFIG_BLK_DEV_BSGLIB is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
2006-05-01 23:15:59 +04:00
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
2012-03-25 03:23:57 +04:00
CONFIG_IOSCHED_CFQ=m
# CONFIG_CFQ_GROUP_IOSCHED is not set
CONFIG_DEFAULT_DEADLINE=y
2006-05-01 23:15:59 +04:00
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
2012-03-25 03:23:57 +04:00
CONFIG_DEFAULT_IOSCHED="deadline"
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
CONFIG_INLINE_SPIN_UNLOCK=y
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
CONFIG_INLINE_READ_UNLOCK=y
# CONFIG_INLINE_READ_UNLOCK_BH is not set
CONFIG_INLINE_READ_UNLOCK_IRQ=y
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
CONFIG_INLINE_WRITE_UNLOCK=y
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
# CONFIG_MUTEX_SPIN_ON_OWNER is not set
CONFIG_FREEZER=y
2005-04-17 02:20:36 +04:00
#
2012-03-25 03:23:57 +04:00
# UML Character Devices
2005-04-17 02:20:36 +04:00
#
CONFIG_STDERR_CONSOLE=y
CONFIG_STDIO_CONSOLE=y
CONFIG_SSL=y
CONFIG_NULL_CHAN=y
CONFIG_PORT_CHAN=y
CONFIG_PTY_CHAN=y
CONFIG_TTY_CHAN=y
CONFIG_XTERM_CHAN=y
# CONFIG_NOCONFIG_CHAN is not set
CONFIG_CON_ZERO_CHAN="fd:0,fd:1"
CONFIG_CON_CHAN="xterm"
2007-07-16 10:38:54 +04:00
CONFIG_SSL_CHAN="pts"
2005-04-17 02:20:36 +04:00
CONFIG_UML_SOUND=m
CONFIG_SOUND=m
2012-03-25 03:23:57 +04:00
CONFIG_SOUND_OSS_CORE=y
2005-04-17 02:20:36 +04:00
CONFIG_HOSTAUDIO=m
2012-03-25 03:23:57 +04:00
#
# Device Drivers
#
2005-04-17 02:20:36 +04:00
#
2006-05-01 23:15:59 +04:00
# Generic Driver Options
2005-04-17 02:20:36 +04:00
#
2008-02-24 02:23:24 +03:00
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
2012-03-25 03:23:57 +04:00
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
2006-05-01 23:15:59 +04:00
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
2012-03-25 03:23:57 +04:00
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
2006-05-01 23:15:59 +04:00
# CONFIG_DEBUG_DRIVER is not set
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
2012-03-25 03:23:57 +04:00
CONFIG_GENERIC_CPU_DEVICES=y
# CONFIG_DMA_SHARED_BUFFER is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_UBD=y
# CONFIG_BLK_DEV_UBD_SYNC is not set
CONFIG_BLK_DEV_COW_COMMON=y
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
#
# DRBD disabled because PROC_FS, INET or CONNECTOR not selected
#
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_RBD is not set
#
# Misc devices
#
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_C2PORT is not set
#
# EEPROM support
#
# CONFIG_EEPROM_93CX6 is not set
#
# Texas Instruments shared transport line discipline
#
#
# Altera FPGA firmware download module
#
#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
# CONFIG_SCSI is not set
# CONFIG_SCSI_DMA is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_MD is not set
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
CONFIG_DUMMY=m
# CONFIG_EQUALIZER is not set
# CONFIG_MII is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
CONFIG_TUN=m
# CONFIG_VETH is not set
#
# CAIF transport drivers
#
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_MARVELL=y
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NET_VENDOR_8390=y
# CONFIG_PHYLIB is not set
CONFIG_PPP=m
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_DEFLATE is not set
# CONFIG_PPP_FILTER is not set
# CONFIG_PPP_MPPE is not set
# CONFIG_PPP_MULTILINK is not set
# CONFIG_PPPOE is not set
# CONFIG_PPP_ASYNC is not set
# CONFIG_PPP_SYNC_TTY is not set
CONFIG_SLIP=m
CONFIG_SLHC=m
# CONFIG_SLIP_COMPRESSED is not set
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_WLAN=y
# CONFIG_HOSTAP is not set
#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
#
# Character devices
#
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=32
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVKMEM=y
# CONFIG_HW_RANDOM is not set
CONFIG_UML_RANDOM=y
# CONFIG_R3964 is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_RAW_DRIVER is not set
#
# PPS support
#
# CONFIG_PPS is not set
#
# PPS generators support
#
#
# PTP clock support
#
#
# Enable Device Drivers -> PPS to see the PTP clock options.
#
# CONFIG_POWER_SUPPLY is not set
# CONFIG_THERMAL is not set
# CONFIG_WATCHDOG is not set
# CONFIG_REGULATOR is not set
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
#
# Virtio drivers
#
# CONFIG_VIRTIO_BALLOON is not set
#
# Microsoft Hyper-V guest support
#
# CONFIG_STAGING is not set
2005-04-17 02:20:36 +04:00
#
2012-03-25 03:23:57 +04:00
# Hardware Spinlock drivers
2005-04-17 02:20:36 +04:00
#
2012-03-25 03:23:57 +04:00
CONFIG_IOMMU_SUPPORT=y
# CONFIG_VIRT_DRIVERS is not set
# CONFIG_PM_DEVFREQ is not set
2008-02-24 02:23:24 +03:00
CONFIG_NET=y
2005-04-17 02:20:36 +04:00
#
# Networking options
#
CONFIG_PACKET=y
CONFIG_UNIX=y
2012-03-25 03:23:57 +04:00
# CONFIG_UNIX_DIAG is not set
2008-02-24 02:23:24 +03:00
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
2012-03-25 03:23:57 +04:00
# CONFIG_NET_IPGRE_DEMUX is not set
2005-04-17 02:20:36 +04:00
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
2006-05-01 23:15:59 +04:00
# CONFIG_INET_XFRM_TUNNEL is not set
2005-04-17 02:20:36 +04:00
# CONFIG_INET_TUNNEL is not set
2008-02-24 02:23:24 +03:00
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
# CONFIG_INET_LRO is not set
2006-05-01 23:15:59 +04:00
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
2012-03-25 03:23:57 +04:00
# CONFIG_INET_UDP_DIAG is not set
2006-05-01 23:15:59 +04:00
# CONFIG_TCP_CONG_ADVANCED is not set
2008-02-24 02:23:24 +03:00
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
2005-04-17 02:20:36 +04:00
# CONFIG_IPV6 is not set
2008-02-24 02:23:24 +03:00
# CONFIG_NETWORK_SECMARK is not set
2012-03-25 03:23:57 +04:00
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
2005-04-17 02:20:36 +04:00
# CONFIG_NETFILTER is not set
2006-05-01 23:15:59 +04:00
# CONFIG_IP_DCCP is not set
2005-04-17 02:20:36 +04:00
# CONFIG_IP_SCTP is not set
2012-03-25 03:23:57 +04:00
# CONFIG_RDS is not set
2006-05-01 23:15:59 +04:00
# CONFIG_TIPC is not set
2005-04-17 02:20:36 +04:00
# CONFIG_ATM is not set
2012-03-25 03:23:57 +04:00
# CONFIG_L2TP is not set
2005-04-17 02:20:36 +04:00
# CONFIG_BRIDGE is not set
2012-03-25 03:23:57 +04:00
# CONFIG_NET_DSA is not set
2005-04-17 02:20:36 +04:00
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
2012-03-25 03:23:57 +04:00
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
2005-04-17 02:20:36 +04:00
# CONFIG_NET_SCHED is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DCB is not set
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_NETPRIO_CGROUP is not set
CONFIG_BQL=y
2005-04-17 02:20:36 +04:00
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CAN is not set
2005-04-17 02:20:36 +04:00
# CONFIG_IRDA is not set
# CONFIG_BT is not set
2008-02-24 02:23:24 +03:00
# CONFIG_AF_RXRPC is not set
2012-03-25 03:23:57 +04:00
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set
2008-02-24 02:23:24 +03:00
#
2012-03-25 03:23:57 +04:00
# CFG80211 needs to be enabled for MAC80211
2008-02-24 02:23:24 +03:00
#
2012-03-25 03:23:57 +04:00
# CONFIG_WIMAX is not set
2008-02-24 02:23:24 +03:00
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
2006-05-01 23:15:59 +04:00
#
# UML Network Devices
#
CONFIG_UML_NET=y
CONFIG_UML_NET_ETHERTAP=y
CONFIG_UML_NET_TUNTAP=y
CONFIG_UML_NET_SLIP=y
CONFIG_UML_NET_DAEMON=y
2008-02-24 02:23:24 +03:00
# CONFIG_UML_NET_VDE is not set
2006-05-01 23:15:59 +04:00
CONFIG_UML_NET_MCAST=y
# CONFIG_UML_NET_PCAP is not set
CONFIG_UML_NET_SLIRP=y
2005-04-17 02:20:36 +04:00
#
# File systems
#
2012-03-25 03:23:57 +04:00
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_XATTR=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
CONFIG_FS_MBCACHE=y
2005-04-17 02:20:36 +04:00
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
2008-02-24 02:23:24 +03:00
# CONFIG_GFS2_FS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_FS_POSIX_ACL is not set
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
2008-02-24 02:23:24 +03:00
CONFIG_INOTIFY_USER=y
2012-03-25 03:23:57 +04:00
# CONFIG_FANOTIFY is not set
2005-04-17 02:20:36 +04:00
CONFIG_QUOTA=y
2008-02-24 02:23:24 +03:00
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
2012-03-25 03:23:57 +04:00
# CONFIG_QUOTA_DEBUG is not set
2005-04-17 02:20:36 +04:00
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
CONFIG_AUTOFS4_FS=m
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
# CONFIG_FUSE_FS is not set
2012-03-25 03:23:57 +04:00
#
# Caches
#
# CONFIG_FSCACHE is not set
2005-04-17 02:20:36 +04:00
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
# CONFIG_ZISOFS is not set
# CONFIG_UDF_FS is not set
#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
2008-02-24 02:23:24 +03:00
CONFIG_PROC_SYSCTL=y
2012-03-25 03:23:57 +04:00
CONFIG_PROC_PAGE_MONITOR=y
2005-04-17 02:20:36 +04:00
CONFIG_SYSFS=y
CONFIG_TMPFS=y
2008-02-24 02:23:24 +03:00
# CONFIG_TMPFS_POSIX_ACL is not set
2012-03-25 03:23:57 +04:00
# CONFIG_TMPFS_XATTR is not set
2005-04-17 02:20:36 +04:00
# CONFIG_HUGETLB_PAGE is not set
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
# CONFIG_CONFIGFS_FS is not set
2012-03-25 03:23:57 +04:00
CONFIG_MISC_FILESYSTEMS=y
2005-04-17 02:20:36 +04:00
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_LOGFS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_CRAMFS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_SQUASHFS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_VXFS_FS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_ROMFS_FS is not set
# CONFIG_PSTORE is not set
2005-04-17 02:20:36 +04:00
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
2008-02-24 02:23:24 +03:00
CONFIG_NETWORK_FILESYSTEMS=y
2005-04-17 02:20:36 +04:00
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CEPH_FS is not set
2005-04-17 02:20:36 +04:00
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set
#
# Security options
#
# CONFIG_KEYS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_SECURITY_DMESG_RESTRICT is not set
2005-04-17 02:20:36 +04:00
# CONFIG_SECURITY is not set
2012-03-25 03:23:57 +04:00
# CONFIG_SECURITYFS is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
2008-02-24 02:23:24 +03:00
CONFIG_CRYPTO=y
2012-03-25 03:23:57 +04:00
#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=m
CONFIG_CRYPTO_ALGAPI2=m
CONFIG_CRYPTO_RNG=m
CONFIG_CRYPTO_RNG2=m
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_MANAGER is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_MANAGER2 is not set
# CONFIG_CRYPTO_USER is not set
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set
#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set
#
# Block modes
#
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
#
# Hash modes
#
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_VMAC is not set
#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_GHASH is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_MD4 is not set
# CONFIG_CRYPTO_MD5 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_WP512 is not set
#
# Ciphers
#
CONFIG_CRYPTO_AES=m
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_AES_586 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_586 is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
#
# Compression
#
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_DEFLATE is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRYPTO_ZLIB is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRYPTO_LZO is not set
2012-03-25 03:23:57 +04:00
#
# Random Number Generation
#
CONFIG_CRYPTO_ANSI_CPRNG=m
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
2008-02-24 02:23:24 +03:00
CONFIG_CRYPTO_HW=y
2012-03-25 03:23:57 +04:00
# CONFIG_BINARY_PRINTF is not set
2005-04-17 02:20:36 +04:00
#
# Library routines
#
2012-03-25 03:23:57 +04:00
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_IO=y
2005-04-17 02:20:36 +04:00
# CONFIG_CRC_CCITT is not set
2012-03-25 03:23:57 +04:00
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
2008-02-24 02:23:24 +03:00
# CONFIG_CRC_ITU_T is not set
2012-03-25 03:23:57 +04:00
CONFIG_CRC32=y
2008-02-24 02:23:24 +03:00
# CONFIG_CRC7 is not set
2005-04-17 02:20:36 +04:00
# CONFIG_LIBCRC32C is not set
2012-03-25 03:23:57 +04:00
# CONFIG_CRC8 is not set
# CONFIG_XZ_DEC is not set
# CONFIG_XZ_DEC_BCJ is not set
CONFIG_DQL=y
CONFIG_NLATTR=y
# CONFIG_AVERAGE is not set
# CONFIG_CORDIC is not set
2005-04-17 02:20:36 +04:00
#
# Kernel hacking
#
# CONFIG_PRINTK_TIME is not set
2012-03-25 03:23:57 +04:00
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
2008-02-24 02:23:24 +03:00
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
2012-03-25 03:23:57 +04:00
CONFIG_FRAME_WARN=1024
# CONFIG_STRIP_ASM_SYMS is not set
2008-02-24 02:23:24 +03:00
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_SECTION_MISMATCH is not set
2005-04-17 02:20:36 +04:00
CONFIG_DEBUG_KERNEL=y
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_SHIRQ is not set
2012-03-25 03:23:57 +04:00
# CONFIG_LOCKUP_DETECTOR is not set
# CONFIG_HARDLOCKUP_DETECTOR is not set
# CONFIG_DETECT_HUNG_TASK is not set
2008-02-24 02:23:24 +03:00
CONFIG_SCHED_DEBUG=y
2005-04-17 02:20:36 +04:00
# CONFIG_SCHEDSTATS is not set
2008-02-24 02:23:24 +03:00
# CONFIG_TIMER_STATS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_OBJECTS is not set
2008-02-05 09:31:28 +03:00
# CONFIG_DEBUG_SLAB is not set
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
2005-04-17 02:20:36 +04:00
# CONFIG_DEBUG_SPINLOCK is not set
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_MUTEXES is not set
2012-03-25 03:23:57 +04:00
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_STACK_USAGE is not set
2005-04-17 02:20:36 +04:00
# CONFIG_DEBUG_KOBJECT is not set
2008-02-24 02:23:24 +03:00
CONFIG_DEBUG_BUGVERBOSE=y
2005-04-17 02:20:36 +04:00
CONFIG_DEBUG_INFO=y
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_INFO_REDUCED is not set
2006-05-01 23:15:59 +04:00
# CONFIG_DEBUG_VM is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_LIST is not set
2012-03-25 03:23:57 +04:00
# CONFIG_TEST_LIST_SORT is not set
2008-02-24 02:23:24 +03:00
# CONFIG_DEBUG_SG is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
2005-04-17 02:20:36 +04:00
CONFIG_FRAME_POINTER=y
2008-02-24 02:23:24 +03:00
# CONFIG_BOOT_PRINTK_DELAY is not set
2006-05-01 23:15:59 +04:00
# CONFIG_RCU_TORTURE_TEST is not set
2008-02-24 02:23:24 +03:00
# CONFIG_BACKTRACE_SELF_TEST is not set
2012-03-25 03:23:57 +04:00
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
2008-02-24 02:23:24 +03:00
# CONFIG_FAULT_INJECTION is not set
2012-03-25 03:23:57 +04:00
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_ATOMIC64_SELFTEST is not set
2008-02-24 02:23:24 +03:00
# CONFIG_SAMPLES is not set
2012-03-25 03:23:57 +04:00
# CONFIG_TEST_KSTRTOX is not set
[PATCH] uml: skas0 - separate kernel address space on stock hosts
UML has had two modes of operation - an insecure, slow mode (tt mode) in
which the kernel is mapped into every process address space which requires
no host kernel modifications, and a secure, faster mode (skas mode) in
which the UML kernel is in a separate host address space, which requires a
patch to the host kernel.
This patch implements something very close to skas mode for hosts which
don't support skas - I'm calling this skas0. It provides the security of
the skas host patch, and some of the performance gains.
The two main things that are provided by the skas patch, /proc/mm and
PTRACE_FAULTINFO, are implemented in a way that require no host patch.
For the remote address space changing stuff (mmap, munmap, and mprotect),
we set aside two pages in the process above its stack, one of which
contains a little bit of code which can call mmap et al.
To update the address space, the system call information (system call
number and arguments) are written to the stub page above the code. The
%esp is set to the beginning of the data, the %eip is set the the start of
the stub, and it repeatedly pops the information into its registers and
makes the system call until it sees a system call number of zero. This is
to amortize the cost of the context switch across multiple address space
updates.
When the updates are done, it SIGSTOPs itself, and the kernel process
continues what it was doing.
For a PTRACE_FAULTINFO replacement, we set up a SIGSEGV handler in the
child, and let it handle segfaults rather than nullifying them. The
handler is in the same page as the mmap stub. The second page is used as
the stack. The handler reads cr2 and err from the sigcontext, sticks them
at the base of the stack in a faultinfo struct, and SIGSTOPs itself. The
kernel then reads the faultinfo and handles the fault.
A complication on x86_64 is that this involves resetting the registers to
the segfault values when the process is inside the kill system call. This
breaks on x86_64 because %rcx will contain %rip because you tell SYSRET
where to return to by putting the value in %rcx. So, this corrupts $rcx on
return from the segfault. To work around this, I added an
arch_finish_segv, which on x86 does nothing, but which on x86_64 ptraces
the child back through the sigreturn. This causes %rcx to be restored by
sigreturn and avoids the corruption. Ultimately, I think I will replace
this with the trick of having it send itself a blocked signal which will be
unblocked by the sigreturn. This will allow it to be stopped just after
the sigreturn, and PTRACE_SYSCALLed without all the back-and-forth of
PTRACE_SYSCALLing it through sigreturn.
This runs on a stock host, so theoretically (and hopefully), tt mode isn't
needed any more. We need to make sure that this is better in every way
than tt mode, though. I'm concerned about the speed of address space
updates and page fault handling, since they involve extra round-trips to
the child. We can amortize the round-trip cost for large address space
updates by writing all of the operations to the data page and having the
child execute them all at the same time. This will help fork and exec, but
not page faults, since they involve only one page.
I can't think of any way to help page faults, except to add something like
PTRACE_FAULTINFO to the host. There is PTRACE_SIGINFO, but UML doesn't use
siginfo for SIGSEGV (or anything else) because there isn't enough
information in the siginfo struct to handle page faults (the faulting
operation type is missing). Adding that would make PTRACE_SIGINFO a usable
equivalent to PTRACE_FAULTINFO.
As for the code itself:
- The system call stub is in arch/um/kernel/sys-$(SUBARCH)/stub.S. It is
put in its own section of the binary along with stub_segv_handler in
arch/um/kernel/skas/process.c. This is manipulated with run_syscall_stub
in arch/um/kernel/skas/mem_user.c. syscall_stub will execute any system
call at all, but it's only used for mmap, munmap, and mprotect.
- The x86_64 stub calls sigreturn by hand rather than allowing the normal
sigreturn to happen, because the normal sigreturn is a SA_RESTORER in
UML's address space provided by libc. Needless to say, this is not
available in the child's address space. Also, it does a couple of odd
pops before that which restore the stack to the state it was in at the
time the signal handler was called.
- There is a new field in the arch mmu_context, which is now a union.
This is the pid to be manipulated rather than the /proc/mm file
descriptor. Code which deals with this now checks proc_mm to see whether
it should use the usual skas code or the new code.
- userspace_tramp is now used to create a new host process for every UML
process, rather than one per UML processor. It checks proc_mm and
ptrace_faultinfo to decide whether to map in the pages above its stack.
- start_userspace now makes CLONE_VM conditional on proc_mm since we need
separate address spaces now.
- switch_mm_skas now just sets userspace_pid[0] to the new pid rather
than PTRACE_SWITCH_MM. There is an addition to userspace which updates
its idea of the pid being manipulated each time around the loop. This is
important on exec, when the pid will change underneath userspace().
- The stub page has a pte, but it can't be mapped in using tlb_flush
because it is part of tlb_flush. This is why it's required for it to be
mapped in by userspace_tramp.
Other random things:
- The stub section in uml.lds.S is page aligned. This page is written
out to the backing vm file in setup_physmem because it is mapped from
there into user processes.
- There's some confusion with TASK_SIZE now that there are a couple of
extra pages that the process can't use. TASK_SIZE is considered by the
elf code to be the usable process memory, which is reasonable, so it is
decreased by two pages. This confuses the definition of
USER_PGDS_IN_LAST_PML4, making it too small because of the rounding down
of the uneven division. So we round it to the nearest PGDIR_SIZE rather
than the lower one.
- I added a missing PT_SYSCALL_ARG6_OFFSET macro.
- um_mmu.h was made into a userspace-usable file.
- proc_mm and ptrace_faultinfo are globals which say whether the host
supports these features.
- There is a bad interaction between the mm.nr_ptes check at the end of
exit_mmap, stack randomization, and skas0. exit_mmap will stop freeing
pages at the PGDIR_SIZE boundary after the last vma. If the stack isn't
on the last page table page, the last pte page won't be freed, as it
should be since the stub ptes are there, and exit_mmap will BUG because
there is an unfreed page. To get around this, TASK_SIZE is set to the
next lowest PGDIR_SIZE boundary and mm->nr_ptes is decremented after the
calls to init_stub_pte. This ensures that we know the process stack (and
all other process mappings) will be below the top page table page, and
thus we know that mm->nr_ptes will be one too many, and can be
decremented.
Things that need fixing:
- We may need better assurrences that the stub code is PIC.
- The stub pte is set up in init_new_context_skas.
- alloc_pgdir is probably the right place.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-08 04:56:49 +04:00
# CONFIG_GPROF is not set
2005-04-17 02:20:36 +04:00
# CONFIG_GCOV is not set
2012-03-25 03:23:57 +04:00
CONFIG_EARLY_PRINTK=y