Documentation/adding-syscalls.txt: convert it to ReST markup
Convert adding-syscalls.txt to ReST markup and add it to the development-process book: - add extra lines to make Sphinx to correctly parse paragraphs; - use quote blocks for examples; - use monotonic font for dirs, function calls, etc; - mark manpage pages using the right markup; - add cross-reference to SubmittingPatches. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
This commit is contained in:
parent
3a61baddce
commit
12983bcd82
@ -3,7 +3,7 @@ Adding a New System Call
|
||||
|
||||
This document describes what's involved in adding a new system call to the
|
||||
Linux kernel, over and above the normal submission advice in
|
||||
Documentation/SubmittingPatches.
|
||||
:ref:`Documentation/SubmittingPatches <submittingpatches>`.
|
||||
|
||||
|
||||
System Call Alternatives
|
||||
@ -19,30 +19,33 @@ interface.
|
||||
object, it may make more sense to create a new filesystem or device. This
|
||||
also makes it easier to encapsulate the new functionality in a kernel module
|
||||
rather than requiring it to be built into the main kernel.
|
||||
|
||||
- If the new functionality involves operations where the kernel notifies
|
||||
userspace that something has happened, then returning a new file
|
||||
descriptor for the relevant object allows userspace to use
|
||||
poll/select/epoll to receive that notification.
|
||||
- However, operations that don't map to read(2)/write(2)-like operations
|
||||
have to be implemented as ioctl(2) requests, which can lead to a
|
||||
somewhat opaque API.
|
||||
``poll``/``select``/``epoll`` to receive that notification.
|
||||
- However, operations that don't map to
|
||||
:manpage:`read(2)`/:manpage:`write(2)`-like operations
|
||||
have to be implemented as :manpage:`ioctl(2)` requests, which can lead
|
||||
to a somewhat opaque API.
|
||||
|
||||
- If you're just exposing runtime system information, a new node in sysfs
|
||||
(see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be
|
||||
more appropriate. However, access to these mechanisms requires that the
|
||||
(see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may
|
||||
be more appropriate. However, access to these mechanisms requires that the
|
||||
relevant filesystem is mounted, which might not always be the case (e.g.
|
||||
in a namespaced/sandboxed/chrooted environment). Avoid adding any API to
|
||||
debugfs, as this is not considered a 'production' interface to userspace.
|
||||
- If the operation is specific to a particular file or file descriptor, then
|
||||
an additional fcntl(2) command option may be more appropriate. However,
|
||||
fcntl(2) is a multiplexing system call that hides a lot of complexity, so
|
||||
an additional :manpage:`fcntl(2)` command option may be more appropriate. However,
|
||||
:manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
|
||||
this option is best for when the new function is closely analogous to
|
||||
existing fcntl(2) functionality, or the new functionality is very simple
|
||||
existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
|
||||
(for example, getting/setting a simple flag related to a file descriptor).
|
||||
- If the operation is specific to a particular task or process, then an
|
||||
additional prctl(2) command option may be more appropriate. As with
|
||||
fcntl(2), this system call is a complicated multiplexor so is best reserved
|
||||
for near-analogs of existing prctl() commands or getting/setting a simple
|
||||
flag related to a process.
|
||||
additional :manpage:`prctl(2)` command option may be more appropriate. As
|
||||
with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
|
||||
is best reserved for near-analogs of existing ``prctl()`` commands or
|
||||
getting/setting a simple flag related to a process.
|
||||
|
||||
|
||||
Designing the API: Planning for Extension
|
||||
@ -54,15 +57,16 @@ interface on the kernel mailing list, and it's important to plan for future
|
||||
extensions of the interface.
|
||||
|
||||
(The syscall table is littered with historical examples where this wasn't done,
|
||||
together with the corresponding follow-up system calls -- eventfd/eventfd2,
|
||||
dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so
|
||||
together with the corresponding follow-up system calls --
|
||||
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
|
||||
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
|
||||
learn from the history of the kernel and plan for extensions from the start.)
|
||||
|
||||
For simpler system calls that only take a couple of arguments, the preferred
|
||||
way to allow for future extensibility is to include a flags argument to the
|
||||
system call. To make sure that userspace programs can safely use flags
|
||||
between kernel versions, check whether the flags value holds any unknown
|
||||
flags, and reject the system call (with EINVAL) if it does:
|
||||
flags, and reject the system call (with ``EINVAL``) if it does::
|
||||
|
||||
if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
|
||||
return -EINVAL;
|
||||
@ -72,7 +76,7 @@ flags, and reject the system call (with EINVAL) if it does:
|
||||
For more sophisticated system calls that involve a larger number of arguments,
|
||||
it's preferred to encapsulate the majority of the arguments into a structure
|
||||
that is passed in by pointer. Such a structure can cope with future extension
|
||||
by including a size argument in the structure:
|
||||
by including a size argument in the structure::
|
||||
|
||||
struct xyzzy_params {
|
||||
u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
|
||||
@ -81,19 +85,19 @@ by including a size argument in the structure:
|
||||
u64 param_3;
|
||||
};
|
||||
|
||||
As long as any subsequently added field, say param_4, is designed so that a
|
||||
As long as any subsequently added field, say ``param_4``, is designed so that a
|
||||
zero value gives the previous behaviour, then this allows both directions of
|
||||
version mismatch:
|
||||
|
||||
- To cope with a later userspace program calling an older kernel, the kernel
|
||||
code should check that any memory beyond the size of the structure that it
|
||||
expects is zero (effectively checking that param_4 == 0).
|
||||
expects is zero (effectively checking that ``param_4 == 0``).
|
||||
- To cope with an older userspace program calling a newer kernel, the kernel
|
||||
code can zero-extend a smaller instance of the structure (effectively
|
||||
setting param_4 = 0).
|
||||
setting ``param_4 = 0``).
|
||||
|
||||
See perf_event_open(2) and the perf_copy_attr() function (in
|
||||
kernel/events/core.c) for an example of this approach.
|
||||
See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
|
||||
``kernel/events/core.c``) for an example of this approach.
|
||||
|
||||
|
||||
Designing the API: Other Considerations
|
||||
@ -104,57 +108,60 @@ should use a file descriptor as the handle for that object -- don't invent a
|
||||
new type of userspace object handle when the kernel already has mechanisms and
|
||||
well-defined semantics for using file descriptors.
|
||||
|
||||
If your new xyzzy(2) system call does return a new file descriptor, then the
|
||||
flags argument should include a value that is equivalent to setting O_CLOEXEC
|
||||
on the new FD. This makes it possible for userspace to close the timing
|
||||
window between xyzzy() and calling fcntl(fd, F_SETFD, FD_CLOEXEC), where an
|
||||
unexpected fork() and execve() in another thread could leak a descriptor to
|
||||
If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
|
||||
then the flags argument should include a value that is equivalent to setting
|
||||
``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close
|
||||
the timing window between ``xyzzy()`` and calling
|
||||
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
|
||||
``execve()`` in another thread could leak a descriptor to
|
||||
the exec'ed program. (However, resist the temptation to re-use the actual value
|
||||
of the O_CLOEXEC constant, as it is architecture-specific and is part of a
|
||||
numbering space of O_* flags that is fairly full.)
|
||||
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
|
||||
numbering space of ``O_*`` flags that is fairly full.)
|
||||
|
||||
If your system call returns a new file descriptor, you should also consider
|
||||
what it means to use the poll(2) family of system calls on that file
|
||||
what it means to use the :manpage:`poll(2)` family of system calls on that file
|
||||
descriptor. Making a file descriptor ready for reading or writing is the
|
||||
normal way for the kernel to indicate to userspace that an event has
|
||||
occurred on the corresponding kernel object.
|
||||
|
||||
If your new xyzzy(2) system call involves a filename argument:
|
||||
If your new :manpage:`xyzzy(2)` system call involves a filename argument::
|
||||
|
||||
int sys_xyzzy(const char __user *path, ..., unsigned int flags);
|
||||
|
||||
you should also consider whether an xyzzyat(2) version is more appropriate:
|
||||
you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::
|
||||
|
||||
int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);
|
||||
|
||||
This allows more flexibility for how userspace specifies the file in question;
|
||||
in particular it allows userspace to request the functionality for an
|
||||
already-opened file descriptor using the AT_EMPTY_PATH flag, effectively giving
|
||||
an fxyzzy(3) operation for free:
|
||||
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
|
||||
giving an :manpage:`fxyzzy(3)` operation for free::
|
||||
|
||||
- xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
|
||||
- xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)
|
||||
|
||||
(For more details on the rationale of the *at() calls, see the openat(2) man
|
||||
page; for an example of AT_EMPTY_PATH, see the fstatat(2) man page.)
|
||||
(For more details on the rationale of the \*at() calls, see the
|
||||
:manpage:`openat(2)` man page; for an example of AT_EMPTY_PATH, see the
|
||||
:manpage:`fstatat(2)` man page.)
|
||||
|
||||
If your new xyzzy(2) system call involves a parameter describing an offset
|
||||
within a file, make its type loff_t so that 64-bit offsets can be supported
|
||||
even on 32-bit architectures.
|
||||
If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
|
||||
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
|
||||
supported even on 32-bit architectures.
|
||||
|
||||
If your new xyzzy(2) system call involves privileged functionality, it needs
|
||||
to be governed by the appropriate Linux capability bit (checked with a call to
|
||||
capable()), as described in the capabilities(7) man page. Choose an existing
|
||||
capability bit that governs related functionality, but try to avoid combining
|
||||
lots of only vaguely related functions together under the same bit, as this
|
||||
goes against capabilities' purpose of splitting the power of root. In
|
||||
particular, avoid adding new uses of the already overly-general CAP_SYS_ADMIN
|
||||
capability.
|
||||
If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
|
||||
it needs to be governed by the appropriate Linux capability bit (checked with
|
||||
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
|
||||
page. Choose an existing capability bit that governs related functionality,
|
||||
but try to avoid combining lots of only vaguely related functions together
|
||||
under the same bit, as this goes against capabilities' purpose of splitting
|
||||
the power of root. In particular, avoid adding new uses of the already
|
||||
overly-general ``CAP_SYS_ADMIN`` capability.
|
||||
|
||||
If your new xyzzy(2) system call manipulates a process other than the calling
|
||||
process, it should be restricted (using a call to ptrace_may_access()) so that
|
||||
only a calling process with the same permissions as the target process, or
|
||||
with the necessary capabilities, can manipulate the target process.
|
||||
If your new :manpage:`xyzzy(2)` system call manipulates a process other than
|
||||
the calling process, it should be restricted (using a call to
|
||||
``ptrace_may_access()``) so that only a calling process with the same
|
||||
permissions as the target process, or with the necessary capabilities, can
|
||||
manipulate the target process.
|
||||
|
||||
Finally, be aware that some non-x86 architectures have an easier time if
|
||||
system call parameters that are explicitly 64-bit fall on odd-numbered
|
||||
@ -175,7 +182,7 @@ distinct commits (each of which is described further below):
|
||||
- Wiring up of the new system call for one particular architecture, usually
|
||||
x86 (including all of x86_64, x86_32 and x32).
|
||||
- A demonstration of the use of the new system call in userspace via a
|
||||
selftest in tools/testing/selftests/.
|
||||
selftest in ``tools/testing/selftests/``.
|
||||
- A draft man-page for the new system call, either as plain text in the
|
||||
cover letter, or as a patch to the (separate) man-pages repository.
|
||||
|
||||
@ -186,24 +193,24 @@ be cc'ed to linux-api@vger.kernel.org.
|
||||
Generic System Call Implementation
|
||||
----------------------------------
|
||||
|
||||
The main entry point for your new xyzzy(2) system call will be called
|
||||
sys_xyzzy(), but you add this entry point with the appropriate
|
||||
SYSCALL_DEFINEn() macro rather than explicitly. The 'n' indicates the number
|
||||
of arguments to the system call, and the macro takes the system call name
|
||||
The main entry point for your new :manpage:`xyzzy(2)` system call will be called
|
||||
``sys_xyzzy()``, but you add this entry point with the appropriate
|
||||
``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the
|
||||
number of arguments to the system call, and the macro takes the system call name
|
||||
followed by the (type, name) pairs for the parameters as arguments. Using
|
||||
this macro allows metadata about the new system call to be made available for
|
||||
other tools.
|
||||
|
||||
The new entry point also needs a corresponding function prototype, in
|
||||
include/linux/syscalls.h, marked as asmlinkage to match the way that system
|
||||
calls are invoked:
|
||||
``include/linux/syscalls.h``, marked as asmlinkage to match the way that system
|
||||
calls are invoked::
|
||||
|
||||
asmlinkage long sys_xyzzy(...);
|
||||
|
||||
Some architectures (e.g. x86) have their own architecture-specific syscall
|
||||
tables, but several other architectures share a generic syscall table. Add your
|
||||
new system call to the generic list by adding an entry to the list in
|
||||
include/uapi/asm-generic/unistd.h:
|
||||
``include/uapi/asm-generic/unistd.h``::
|
||||
|
||||
#define __NR_xyzzy 292
|
||||
__SYSCALL(__NR_xyzzy, sys_xyzzy)
|
||||
@ -212,30 +219,30 @@ Also update the __NR_syscalls count to reflect the additional system call, and
|
||||
note that if multiple new system calls are added in the same merge window,
|
||||
your new syscall number may get adjusted to resolve conflicts.
|
||||
|
||||
The file kernel/sys_ni.c provides a fallback stub implementation of each system
|
||||
call, returning -ENOSYS. Add your new system call here too:
|
||||
The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
|
||||
system call, returning ``-ENOSYS``. Add your new system call here too::
|
||||
|
||||
cond_syscall(sys_xyzzy);
|
||||
|
||||
Your new kernel functionality, and the system call that controls it, should
|
||||
normally be optional, so add a CONFIG option (typically to init/Kconfig) for
|
||||
it. As usual for new CONFIG options:
|
||||
normally be optional, so add a ``CONFIG`` option (typically to
|
||||
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:
|
||||
|
||||
- Include a description of the new functionality and system call controlled
|
||||
by the option.
|
||||
- Make the option depend on EXPERT if it should be hidden from normal users.
|
||||
- Make any new source files implementing the function dependent on the CONFIG
|
||||
option in the Makefile (e.g. "obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.c").
|
||||
option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.c``).
|
||||
- Double check that the kernel still builds with the new CONFIG option turned
|
||||
off.
|
||||
|
||||
To summarize, you need a commit that includes:
|
||||
|
||||
- CONFIG option for the new function, normally in init/Kconfig
|
||||
- SYSCALL_DEFINEn(xyzzy, ...) for the entry point
|
||||
- corresponding prototype in include/linux/syscalls.h
|
||||
- generic table entry in include/uapi/asm-generic/unistd.h
|
||||
- fallback stub in kernel/sys_ni.c
|
||||
- ``CONFIG`` option for the new function, normally in ``init/Kconfig``
|
||||
- ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
|
||||
- corresponding prototype in ``include/linux/syscalls.h``
|
||||
- generic table entry in ``include/uapi/asm-generic/unistd.h``
|
||||
- fallback stub in ``kernel/sys_ni.c``
|
||||
|
||||
|
||||
x86 System Call Implementation
|
||||
@ -244,11 +251,11 @@ x86 System Call Implementation
|
||||
To wire up your new system call for x86 platforms, you need to update the
|
||||
master syscall tables. Assuming your new system call isn't special in some
|
||||
way (see below), this involves a "common" entry (for x86_64 and x32) in
|
||||
arch/x86/entry/syscalls/syscall_64.tbl:
|
||||
arch/x86/entry/syscalls/syscall_64.tbl::
|
||||
|
||||
333 common xyzzy sys_xyzzy
|
||||
|
||||
and an "i386" entry in arch/x86/entry/syscalls/syscall_32.tbl:
|
||||
and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::
|
||||
|
||||
380 i386 xyzzy sys_xyzzy
|
||||
|
||||
@ -267,48 +274,49 @@ However, there are a couple of situations where a compatibility layer is
|
||||
needed to cope with size differences between 32-bit and 64-bit.
|
||||
|
||||
The first is if the 64-bit kernel also supports 32-bit userspace programs, and
|
||||
so needs to parse areas of (__user) memory that could hold either 32-bit or
|
||||
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
|
||||
64-bit values. In particular, this is needed whenever a system call argument
|
||||
is:
|
||||
|
||||
- a pointer to a pointer
|
||||
- a pointer to a struct containing a pointer (e.g. struct iovec __user *)
|
||||
- a pointer to a varying sized integral type (time_t, off_t, long, ...)
|
||||
- a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
|
||||
- a pointer to a varying sized integral type (``time_t``, ``off_t``,
|
||||
``long``, ...)
|
||||
- a pointer to a struct containing a varying sized integral type.
|
||||
|
||||
The second situation that requires a compatibility layer is if one of the
|
||||
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
|
||||
architecture, for example loff_t or __u64. In this case, a value that arrives
|
||||
at a 64-bit kernel from a 32-bit application will be split into two 32-bit
|
||||
values, which then need to be re-assembled in the compatibility layer.
|
||||
architecture, for example ``loff_t`` or ``__u64``. In this case, a value that
|
||||
arrives at a 64-bit kernel from a 32-bit application will be split into two
|
||||
32-bit values, which then need to be re-assembled in the compatibility layer.
|
||||
|
||||
(Note that a system call argument that's a pointer to an explicit 64-bit type
|
||||
does *not* need a compatibility layer; for example, splice(2)'s arguments of
|
||||
type loff_t __user * do not trigger the need for a compat_ system call.)
|
||||
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
|
||||
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)
|
||||
|
||||
The compatibility version of the system call is called compat_sys_xyzzy(), and
|
||||
is added with the COMPAT_SYSCALL_DEFINEn() macro, analogously to
|
||||
The compatibility version of the system call is called ``compat_sys_xyzzy()``,
|
||||
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
|
||||
SYSCALL_DEFINEn. This version of the implementation runs as part of a 64-bit
|
||||
kernel, but expects to receive 32-bit parameter values and does whatever is
|
||||
needed to deal with them. (Typically, the compat_sys_ version converts the
|
||||
values to 64-bit versions and either calls on to the sys_ version, or both of
|
||||
needed to deal with them. (Typically, the ``compat_sys_`` version converts the
|
||||
values to 64-bit versions and either calls on to the ``sys_`` version, or both of
|
||||
them call a common inner implementation function.)
|
||||
|
||||
The compat entry point also needs a corresponding function prototype, in
|
||||
include/linux/compat.h, marked as asmlinkage to match the way that system
|
||||
calls are invoked:
|
||||
``include/linux/compat.h``, marked as asmlinkage to match the way that system
|
||||
calls are invoked::
|
||||
|
||||
asmlinkage long compat_sys_xyzzy(...);
|
||||
|
||||
If the system call involves a structure that is laid out differently on 32-bit
|
||||
and 64-bit systems, say struct xyzzy_args, then the include/linux/compat.h
|
||||
header file should also include a compat version of the structure (struct
|
||||
compat_xyzzy_args) where each variable-size field has the appropriate compat_
|
||||
type that corresponds to the type in struct xyzzy_args. The
|
||||
compat_sys_xyzzy() routine can then use this compat_ structure to parse the
|
||||
arguments from a 32-bit invocation.
|
||||
and 64-bit systems, say ``struct xyzzy_args``, then the include/linux/compat.h
|
||||
header file should also include a compat version of the structure (``struct
|
||||
compat_xyzzy_args``) where each variable-size field has the appropriate
|
||||
``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The
|
||||
``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
|
||||
parse the arguments from a 32-bit invocation.
|
||||
|
||||
For example, if there are fields:
|
||||
For example, if there are fields::
|
||||
|
||||
struct xyzzy_args {
|
||||
const char __user *ptr;
|
||||
@ -317,7 +325,7 @@ For example, if there are fields:
|
||||
/* ... */
|
||||
};
|
||||
|
||||
in struct xyzzy_args, then struct compat_xyzzy_args would have:
|
||||
in struct xyzzy_args, then struct compat_xyzzy_args would have::
|
||||
|
||||
struct compat_xyzzy_args {
|
||||
compat_uptr_t ptr;
|
||||
@ -327,18 +335,19 @@ in struct xyzzy_args, then struct compat_xyzzy_args would have:
|
||||
};
|
||||
|
||||
The generic system call list also needs adjusting to allow for the compat
|
||||
version; the entry in include/uapi/asm-generic/unistd.h should use
|
||||
__SC_COMP rather than __SYSCALL:
|
||||
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
|
||||
``__SC_COMP`` rather than ``__SYSCALL``::
|
||||
|
||||
#define __NR_xyzzy 292
|
||||
__SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)
|
||||
|
||||
To summarize, you need:
|
||||
|
||||
- a COMPAT_SYSCALL_DEFINEn(xyzzy, ...) for the compat entry point
|
||||
- corresponding prototype in include/linux/compat.h
|
||||
- (if needed) 32-bit mapping struct in include/linux/compat.h
|
||||
- instance of __SC_COMP not __SYSCALL in include/uapi/asm-generic/unistd.h
|
||||
- a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
|
||||
- corresponding prototype in ``include/linux/compat.h``
|
||||
- (if needed) 32-bit mapping struct in ``include/linux/compat.h``
|
||||
- instance of ``__SC_COMP`` not ``__SYSCALL`` in
|
||||
``include/uapi/asm-generic/unistd.h``
|
||||
|
||||
|
||||
Compatibility System Calls (x86)
|
||||
@ -347,9 +356,9 @@ Compatibility System Calls (x86)
|
||||
To wire up the x86 architecture of a system call with a compatibility version,
|
||||
the entries in the syscall tables need to be adjusted.
|
||||
|
||||
First, the entry in arch/x86/entry/syscalls/syscall_32.tbl gets an extra
|
||||
First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
|
||||
column to indicate that a 32-bit userspace program running on a 64-bit kernel
|
||||
should hit the compat entry point:
|
||||
should hit the compat entry point::
|
||||
|
||||
380 i386 xyzzy sys_xyzzy compat_sys_xyzzy
|
||||
|
||||
@ -359,8 +368,8 @@ should either match the 64-bit version or the 32-bit version.
|
||||
|
||||
If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
|
||||
ILP32, so the layout should match the 32-bit version, and the entry in
|
||||
arch/x86/entry/syscalls/syscall_64.tbl is split so that x32 programs hit the
|
||||
compatibility wrapper:
|
||||
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
|
||||
the compatibility wrapper::
|
||||
|
||||
333 64 xyzzy sys_xyzzy
|
||||
...
|
||||
@ -384,8 +393,9 @@ stack the same and most of the registers the same as before the system call,
|
||||
and with the same virtual memory space.
|
||||
|
||||
However, a few system calls do things differently. They might return to a
|
||||
different location (rt_sigreturn) or change the memory space (fork/vfork/clone)
|
||||
or even architecture (execve/execveat) of the program.
|
||||
different location (``rt_sigreturn``) or change the memory space
|
||||
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
|
||||
of the program.
|
||||
|
||||
To allow for this, the kernel implementation of the system call may need to
|
||||
save and restore additional registers to the kernel stack, allowing complete
|
||||
@ -395,31 +405,31 @@ This is arch-specific, but typically involves defining assembly entry points
|
||||
that save/restore additional registers and invoke the real system call entry
|
||||
point.
|
||||
|
||||
For x86_64, this is implemented as a stub_xyzzy entry point in
|
||||
arch/x86/entry/entry_64.S, and the entry in the syscall table
|
||||
(arch/x86/entry/syscalls/syscall_64.tbl) is adjusted to match:
|
||||
For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
|
||||
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
|
||||
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::
|
||||
|
||||
333 common xyzzy stub_xyzzy
|
||||
|
||||
The equivalent for 32-bit programs running on a 64-bit kernel is normally
|
||||
called stub32_xyzzy and implemented in arch/x86/entry/entry_64_compat.S,
|
||||
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
|
||||
with the corresponding syscall table adjustment in
|
||||
arch/x86/entry/syscalls/syscall_32.tbl:
|
||||
``arch/x86/entry/syscalls/syscall_32.tbl``::
|
||||
|
||||
380 i386 xyzzy sys_xyzzy stub32_xyzzy
|
||||
|
||||
If the system call needs a compatibility layer (as in the previous section)
|
||||
then the stub32_ version needs to call on to the compat_sys_ version of the
|
||||
system call rather than the native 64-bit version. Also, if the x32 ABI
|
||||
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
|
||||
of the system call rather than the native 64-bit version. Also, if the x32 ABI
|
||||
implementation is not common with the x86_64 version, then its syscall
|
||||
table will also need to invoke a stub that calls on to the compat_sys_
|
||||
table will also need to invoke a stub that calls on to the ``compat_sys_``
|
||||
version.
|
||||
|
||||
For completeness, it's also nice to set up a mapping so that user-mode Linux
|
||||
still works -- its syscall table will reference stub_xyzzy, but the UML build
|
||||
doesn't include arch/x86/entry/entry_64.S implementation (because UML
|
||||
doesn't include ``arch/x86/entry/entry_64.S`` implementation (because UML
|
||||
simulates registers etc). Fixing this is as simple as adding a #define to
|
||||
arch/x86/um/sys_call_table_64.c:
|
||||
``arch/x86/um/sys_call_table_64.c``::
|
||||
|
||||
#define stub_xyzzy sys_xyzzy
|
||||
|
||||
@ -432,9 +442,9 @@ occasional exception that may need updating for your particular system call.
|
||||
|
||||
The audit subsystem is one such special case; it includes (arch-specific)
|
||||
functions that classify some special types of system call -- specifically
|
||||
file open (open/openat), program execution (execve/exeveat) or socket
|
||||
multiplexor (socketcall) operations. If your new system call is analogous to
|
||||
one of these, then the audit system should be updated.
|
||||
file open (``open``/``openat``), program execution (``execve``/``exeveat``) or
|
||||
socket multiplexor (``socketcall``) operations. If your new system call is
|
||||
analogous to one of these, then the audit system should be updated.
|
||||
|
||||
More generally, if there is an existing system call that is analogous to your
|
||||
new system call, it's worth doing a kernel-wide grep for the existing system
|
||||
@ -447,10 +457,10 @@ Testing
|
||||
A new system call should obviously be tested; it is also useful to provide
|
||||
reviewers with a demonstration of how user space programs will use the system
|
||||
call. A good way to combine these aims is to include a simple self-test
|
||||
program in a new directory under tools/testing/selftests/.
|
||||
program in a new directory under ``tools/testing/selftests/``.
|
||||
|
||||
For a new system call, there will obviously be no libc wrapper function and so
|
||||
the test will need to invoke it using syscall(); also, if the system call
|
||||
the test will need to invoke it using ``syscall()``; also, if the system call
|
||||
involves a new userspace-visible structure, the corresponding header will need
|
||||
to be installed to compile the test.
|
||||
|
||||
@ -461,6 +471,7 @@ and x32 (-mx32) ABI program.
|
||||
For more extensive and thorough testing of new functionality, you should also
|
||||
consider adding tests to the Linux Test Project, or to the xfstests project
|
||||
for filesystem-related changes.
|
||||
|
||||
- https://linux-test-project.github.io/
|
||||
- git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
|
||||
|
||||
@ -487,12 +498,14 @@ References and Sources
|
||||
arguments: https://lwn.net/Articles/311630/
|
||||
- Pair of LWN articles from David Drysdale that describe the system call
|
||||
implementation paths in detail for v3.14:
|
||||
|
||||
- https://lwn.net/Articles/604287/
|
||||
- https://lwn.net/Articles/604515/
|
||||
|
||||
- Architecture-specific requirements for system calls are discussed in the
|
||||
syscall(2) man-page:
|
||||
:manpage:`syscall(2)` man-page:
|
||||
http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
|
||||
- Collated emails from Linus Torvalds discussing the problems with ioctl():
|
||||
- Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
|
||||
http://yarchive.net/comp/linux/ioctl.html
|
||||
- "How to not invent kernel interfaces", Arnd Bergmann,
|
||||
http://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
|
||||
@ -507,17 +520,19 @@ References and Sources
|
||||
commit: https://lkml.org/lkml/2014/11/19/254
|
||||
- Suggestion from Greg Kroah-Hartman that it's good for new system calls to
|
||||
come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
|
||||
- Discussion from Michael Kerrisk of new system call vs. prctl(2) extension:
|
||||
- Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
|
||||
https://lkml.org/lkml/2014/6/3/411
|
||||
- Suggestion from Ingo Molnar that system calls that involve multiple
|
||||
arguments should encapsulate those arguments in a struct, which includes a
|
||||
size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
|
||||
- Numbering oddities arising from (re-)use of O_* numbering space flags:
|
||||
|
||||
- commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
|
||||
check")
|
||||
- commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
|
||||
conflict")
|
||||
- commit bb458c644a59 ("Safer ABI for O_TMPFILE")
|
||||
|
||||
- Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
|
||||
https://lkml.org/lkml/2008/12/12/187
|
||||
- Recommendation from Greg Kroah-Hartman that unknown flags should be
|
Loading…
Reference in New Issue
Block a user