2017-04-02 15:18:32 -06:00
unshare system call
===================
2006-02-07 12:58:56 -08:00
2017-04-02 15:18:32 -06:00
This document describes the new system call, unshare(). The document
2006-02-07 12:58:56 -08:00
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.
2017-04-02 15:18:32 -06:00
Change Log
----------
2006-02-07 12:58:56 -08:00
version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006
2017-04-02 15:18:32 -06:00
Contents
--------
2006-02-07 12:58:56 -08:00
1) Overview
2) Benefits
3) Cost
4) Requirements
5) Functional Specification
6) High Level Design
7) Low Level Design
8) Test Specification
9) Future Work
1) Overview
-----------
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, does not make distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.
2017-04-02 15:18:32 -06:00
unshare() system call adds a primitive to the Linux thread model that
2006-02-07 12:58:56 -08:00
allows threads to selectively 'unshare' any resources that were being
2017-04-02 15:18:32 -06:00
shared at the time of their creation. unshare() was conceptualized by
2006-02-07 12:58:56 -08:00
Al Viro in the August of 2000, on the Linux-Kernel mailing list, as part
2017-04-02 15:18:32 -06:00
of the discussion on POSIX threads on Linux. unshare() augments the
2006-02-07 12:58:56 -08:00
usefulness of Linux threads for applications that would like to control
2017-04-02 15:18:32 -06:00
shared resources without creating a new process. unshare() is a natural
2006-02-07 12:58:56 -08:00
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.
2) Benefits
-----------
2017-04-02 15:18:32 -06:00
unshare() would be useful to large application frameworks such as PAM
2006-02-07 12:58:56 -08:00
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
2017-04-02 15:18:32 -06:00
when creating a new process using fork or clone, unshare() can benefit
2006-02-07 12:58:56 -08:00
even non-threaded applications if they have a need to disassociate
from default shared namespace. The following lists two use-cases
2017-04-02 15:18:32 -06:00
where unshare() can be used.
2006-02-07 12:58:56 -08:00
2.1 Per-security context namespaces
2017-04-02 15:18:32 -06:00
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unshare() can be used to implement polyinstantiated directories using
2006-02-07 12:58:56 -08:00
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
2017-04-02 15:18:32 -06:00
processes when working with these directories. Using unshare(), a PAM
2006-02-07 12:58:56 -08:00
module can easily setup a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile, however, with the availability
of shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
2.2 unsharing of virtual memory and/or open files
2017-04-02 15:18:32 -06:00
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2006-02-07 12:58:56 -08:00
Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
2017-04-02 15:18:32 -06:00
virtual memory and open files. Without unshare(), the server has to
2006-02-07 12:58:56 -08:00
decide what needs to be shared at the time of creating the process
2017-04-02 15:18:32 -06:00
which services the request. unshare() allows the server an ability to
2006-02-07 12:58:56 -08:00
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
2017-04-02 15:18:32 -06:00
ability to unshare() after the process was created can be very
2006-02-07 12:58:56 -08:00
useful.
3) Cost
-------
2017-04-02 15:18:32 -06:00
In order to not duplicate code and to handle the fact that unshare()
2006-02-07 12:58:56 -08:00
works on an active task (as opposed to clone/fork working on a newly
2017-04-02 15:18:32 -06:00
allocated inactive task) unshare() had to make minor reorganizational
2006-02-07 12:58:56 -08:00
changes to copy_* functions utilized by clone/fork system call.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
2017-04-02 15:18:32 -06:00
review of the changes and creation of an unshare() test for the LTP
2006-02-07 12:58:56 -08:00
the benefits of this new feature can exceed its cost.
4) Requirements
---------------
2017-04-02 15:18:32 -06:00
unshare() reverses sharing that was done using clone(2) system call,
so unshare() should have a similar interface as clone(2). That is,
2017-05-13 15:41:38 +02:00
since flags in clone(int flags, void \*stack) specifies what should
2006-02-07 12:58:56 -08:00
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in future without an ABI change.
2017-04-02 15:18:32 -06:00
unshare() interface should accommodate possible future addition of
2006-02-07 12:58:56 -08:00
new context flags without requiring a rebuild of old applications.
2017-04-02 15:18:32 -06:00
If and when new context flags are added, unshare() design should allow
2006-02-07 12:58:56 -08:00
incremental unsharing of those resources on an as needed basis.
5) Functional Specification
---------------------------
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
NAME
unshare - disassociate parts of the process execution context
SYNOPSIS
#include <sched.h>
int unshare(int flags);
DESCRIPTION
2017-04-02 15:18:32 -06:00
unshare() allows a process to disassociate parts of its execution
2006-02-07 12:58:56 -08:00
context that are currently being shared with other processes. Part
of execution context, such as the namespace, is shared by default
when a new process is created using fork(2), while other parts,
such as the virtual memory, open file descriptors, etc, may be
shared by explicit request to share them when creating a process
using clone(2).
2017-04-02 15:18:32 -06:00
The main use of unshare() is to allow a process to control its
2006-02-07 12:58:56 -08:00
shared execution context without creating a new process.
The flags argument specifies one or bitwise-or'ed of several of
the following constants.
CLONE_FS
If CLONE_FS is set, file system information of the caller
is disassociated from the shared file system information.
CLONE_FILES
If CLONE_FILES is set, the file descriptor table of the
caller is disassociated from the shared file descriptor
table.
CLONE_NEWNS
If CLONE_NEWNS is set, the namespace of the caller is
disassociated from the shared namespace.
CLONE_VM
If CLONE_VM is set, the virtual memory of the caller is
disassociated from the shared virtual memory.
RETURN VALUE
On success, zero returned. On failure, -1 is returned and errno is
ERRORS
EPERM CLONE_NEWNS was specified by a non-root process (process
without CAP_SYS_ADMIN).
ENOMEM Cannot allocate sufficient memory to copy parts of caller's
context that need to be unshared.
EINVAL Invalid flag was specified as an argument.
CONFORMING TO
The unshare() call is Linux-specific and should not be used
in programs intended to be portable.
SEE ALSO
clone(2), fork(2)
6) High Level Design
--------------------
2017-04-02 15:18:32 -06:00
Depending on the flags argument, the unshare() system call allocates
2006-02-07 12:58:56 -08:00
appropriate process context structures, populates it with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
2017-04-02 15:18:32 -06:00
directly by unshare() because of the following two reasons.
2006-02-07 12:58:56 -08:00
1) clone operates on a newly allocated not-yet-active task
2017-04-02 15:18:32 -06:00
structure, where as unshare() operates on the current active
task. Therefore unshare() has to take appropriate task_lock()
2006-02-07 12:58:56 -08:00
before associating newly duplicated context structures
2017-04-02 15:18:32 -06:00
2) unshare() has to allocate and duplicate all context structures
2006-02-07 12:58:56 -08:00
that are being unshared, before associating them with the
current task and releasing older shared structures. Failure
do so will create race conditions and/or oops when trying
to backout due to an error. Consider the case of unsharing
both virtual memory and namespace. After successfully unsharing
vm, if the system call encounters an error while allocating
new namespace structure, the error return code will have to
reverse the unsharing of vm. As part of the reversal the
system call will have to go back to older, shared, vm
structure, which may not exist anymore.
Therefore code from copy_* functions that allocated and duplicated
current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
2017-04-02 15:18:32 -06:00
task structure that is being constructed. unshare() system call on
2006-02-07 12:58:56 -08:00
the other hand performs the following:
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
1) Check flags to force missing, but implied, flags
2017-04-02 15:18:32 -06:00
2) For each context structure, call the corresponding unshare()
2006-02-07 12:58:56 -08:00
helper function to allocate and duplicate a new context
structure, if the appropriate bit is set in the flags argument.
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
3) If there is no error in allocation and duplication and there
are new context structures then lock the current task structure,
associate new context structures with the current task structure,
and release the lock on the current task structure.
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
4) Appropriately release older, shared, context structures.
7) Low Level Design
-------------------
2017-04-02 15:18:32 -06:00
Implementation of unshare() can be grouped in the following 4 different
2006-02-07 12:58:56 -08:00
items:
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
a) Reorganization of existing copy_* functions
2017-04-02 15:18:32 -06:00
b) unshare() system call service function
c) unshare() helper functions for each different process context
2006-02-07 12:58:56 -08:00
d) Registration of system call number for different architectures
2017-04-02 15:18:32 -06:00
7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2006-02-07 12:58:56 -08:00
* Check flags
Force implied flags. If CLONE_THREAD is set force CLONE_VM.
If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
set and signals are also being shared, force CLONE_THREAD. If
CLONE_NEWNS is set, force CLONE_FS.
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
* For each context flag, invoke the corresponding unshare_*
helper routine with flags passed into the system call and a
reference to pointer pointing the new unshared structure
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
* If any new structures are created by unshare_* helper
functions, take the task_lock() on the current task,
modify appropriate context pointers, and release the
task lock.
2017-04-02 15:18:32 -06:00
2006-02-07 12:58:56 -08:00
* For all newly unshared structures, release the corresponding
older, shared, structures.
2017-04-02 15:18:32 -06:00
7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2006-02-07 12:58:56 -08:00
2017-04-02 15:18:32 -06:00
For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
7.4) Finally
~~~~~~~~~~~~
Appropriately modify architecture specific code to register the
new system call.
2006-02-07 12:58:56 -08:00
8) Test Specification
---------------------
2017-04-02 15:18:32 -06:00
The test for unshare() should test the following:
2006-02-07 12:58:56 -08:00
1) Valid flags: Test to check that clone flags for signal and
2017-04-02 15:18:32 -06:00
signal handlers, for which unsharing is not implemented
yet, return -EINVAL.
2006-02-07 12:58:56 -08:00
2) Missing/implied flags: Test to make sure that if unsharing
2017-04-02 15:18:32 -06:00
namespace without specifying unsharing of filesystem, correctly
unshares both namespace and filesystem information.
2006-02-07 12:58:56 -08:00
3) For each of the four (namespace, filesystem, files and vm)
2017-04-02 15:18:32 -06:00
supported unsharing, verify that the system call correctly
unshares the appropriate structure. Verify that unsharing
them individually as well as in combination with each
other works as expected.
2006-02-07 12:58:56 -08:00
4) Concurrent execution: Use shared memory segments and futex on
2017-04-02 15:18:32 -06:00
an address in the shm segment to synchronize execution of
about 10 threads. Have a couple of threads execute execve,
a couple _exit and the rest unshare with different combination
of flags. Verify that unsharing is performed as expected and
that there are no oops or hangs.
2006-02-07 12:58:56 -08:00
9) Future Work
--------------
2017-04-02 15:18:32 -06:00
The current implementation of unshare() does not allow unsharing of
2006-02-07 12:58:56 -08:00
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
2017-04-02 15:18:32 -06:00
be incrementally added to unshare() without affecting legacy
applications using unshare().
2006-02-07 12:58:56 -08:00