License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                              # files
   ---------------------------------------------------|-------
   GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". Results of that were:
   SPDX license identifier                              # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                              # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                         270
   GPL-2.0+ WITH Linux-syscall-note                        169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
   LGPL-2.1+ WITH Linux-syscall-note                        15
   GPL-1.0+ WITH Linux-syscall-note                         14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
   LGPL-2.0+ WITH Linux-syscall-note                         4
   LGPL-2.1 WITH Linux-syscall-note                          3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
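The heuristics above can be read as a small decision procedure. The following is only a sketch of that reading, not the spreadsheet process itself: the function name conclude_license() and the convention that NULL or "" means "scanner found nothing" are assumptions made for illustration, and the extra rule that appends the Linux-syscall-note to GPL-family licenses detected in */uapi/* files is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical convention: NULL or "" means the scanner found no license. */
static int found(const char *id)
{
    return id != NULL && id[0] != '\0';
}

/*
 * Returns the concluded SPDX identifier, or NULL when the two scanner
 * results disagree and the file needs manual inspection.
 */
const char *conclude_license(const char *path, const char *scan_a,
                             const char *scan_b)
{
    int a = found(scan_a);
    int b = found(scan_b);

    assert(path != NULL);

    /* Neither scanner found anything: the top-level COPYING license applies. */
    if (!a && !b)
        return strstr(path, "/uapi/") ? "GPL-2.0 WITH Linux-syscall-note"
                                      : "GPL-2.0";

    /* Both scanners agree: that becomes the concluded license. */
    if (a && b && strcmp(scan_a, scan_b) == 0)
        return scan_a;

    /* One scanner found nothing, or they differ: manual review. */
    return NULL;
}
```

A NULL result corresponds to the "manual inspection" branch; everything past that point (lawyer confirmation, flagging for later) is human work that the sketch does not model.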
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
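The script itself is not part of this message. As a rough illustration of why header and source files need different handling, the sketch below formats the tag line with a C++-style comment for .c files (matching the first line of this file) and a C-style comment for .h files. The function name spdx_tag_line() and the suffix-only detection are assumptions; the real script reportedly recognised more file types.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if s ends with suffix, 0 otherwise. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s);
    size_t lx = strlen(suffix);

    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/*
 * Writes the SPDX tag line for path into out (at most n bytes).
 * Returns the snprintf() result, or -1 for file types this sketch
 * does not handle.
 */
int spdx_tag_line(const char *path, const char *id, char *out, size_t n)
{
    assert(out != NULL && n > 0);

    if (ends_with(path, ".h"))
        return snprintf(out, n, "/* SPDX-License-Identifier: %s */", id);
    if (ends_with(path, ".c"))
        return snprintf(out, n, "// SPDX-License-Identifier: %s", id);
    return -1;
}
```

Calling spdx_tag_line("kernel/sys_ni.c", "GPL-2.0", buf, sizeof buf) fills buf with the same line that heads this file.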
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
2005-04-16 15:20:36 -07:00
# include <linux/linkage.h>
# include <linux/errno.h>
# include <asm/unistd.h>
2018-04-05 11:53:03 +02:00
# ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
/* Architectures may override COND_SYSCALL and COND_SYSCALL_COMPAT */
# include <asm/syscall_wrapper.h>
# endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */
2007-10-16 23:29:25 -07:00
/* we can't #include <linux/syscalls.h> here,
   but tell gcc to not warn with -Wmissing-prototypes */
asmlinkage long sys_ni_syscall(void);
2005-04-16 15:20:36 -07:00
/*
 * Non-implemented system calls get redirected here.
 */
asmlinkage long sys_ni_syscall(void)
{
	return -ENOSYS;
}
2018-04-05 11:53:03 +02:00
# ifndef COND_SYSCALL
2018-03-04 19:06:35 +01:00
# define COND_SYSCALL(name) cond_syscall(sys_##name)
2018-04-05 11:53:03 +02:00
# endif /* COND_SYSCALL */
# ifndef COND_SYSCALL_COMPAT
2018-03-04 19:06:35 +01:00
# define COND_SYSCALL_COMPAT(name) cond_syscall(compat_sys_##name)
2018-04-05 11:53:03 +02:00
# endif /* COND_SYSCALL_COMPAT */
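cond_syscall() is not defined in this file. It is generally understood to emit a weak symbol that resolves to sys_ni_syscall() unless a real implementation is linked in. The following user-space sketch imitates that idea with the GCC/Clang weak-alias attribute on an ELF target; every demo_* name is hypothetical and the kernel's own macro may be written differently.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for sys_ni_syscall(): every unimplemented entry lands here. */
long demo_ni_syscall(void)
{
    return -ENOSYS;
}

/*
 * Hypothetical analogue of cond_syscall(): declare demo_sys_<name> as a
 * weak alias of the fallback. A strong definition in another object file
 * would override it at link time.
 */
#define DEMO_COND_SYSCALL(name) \
    long demo_sys_##name(void) __attribute__((weak, alias("demo_ni_syscall")))

DEMO_COND_SYSCALL(io_setup);
DEMO_COND_SYSCALL(bpf);

/* With no strong definitions linked in, both entries report -ENOSYS. */
void demo_check(void)
{
    assert(demo_sys_io_setup() == -ENOSYS);
    assert(demo_sys_bpf() == -ENOSYS);
}
```

The sketch only shows the fallback case: a strong definition has to live in a separate translation unit, because an alias declaration already counts as a definition within its own file.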
2018-03-04 19:06:35 +01:00
2018-03-06 19:53:01 +01:00
/*
 * This list is kept in the same order as include/uapi/asm-generic/unistd.h.
 * Architecture specific entries go below, followed by deprecated or obsolete
 * system calls.
 */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( io_setup ) ;
COND_SYSCALL_COMPAT ( io_setup ) ;
COND_SYSCALL ( io_destroy ) ;
COND_SYSCALL ( io_submit ) ;
COND_SYSCALL_COMPAT ( io_submit ) ;
COND_SYSCALL ( io_cancel ) ;
COND_SYSCALL ( io_getevents ) ;
aio: implement io_pgetevents
This is the io_getevents equivalent of ppoll/pselect. It allows signals
and aio completions to be mixed properly (especially with IOCB_CMD_POLL)
by atomically executing the following sequence:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure. It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.
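To see why atomicity matters, it may help to write the sequence above out in user space. The sketch below is an assumption-laden illustration, not the syscall: emulated_pwait() and the callback type are hypothetical, the callback stands in for io_getevents(), and sigprocmask() is used in place of pthread_sigmask() on the assumption of a single-threaded caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the blocking call (io_getevents in the text). */
typedef int (*wait_fn)(void *arg);

/* Non-atomic emulation of "swap mask, wait, restore mask". */
int emulated_pwait(const sigset_t *sigmask, wait_fn do_wait, void *arg)
{
    sigset_t origmask;
    int ret;

    assert(do_wait != NULL);

    if (sigprocmask(SIG_SETMASK, sigmask, &origmask) != 0)
        return -1;
    /*
     * Race window: a signal unblocked by sigmask can be delivered here,
     * before do_wait() starts blocking, so the wait may never be woken.
     */
    ret = do_wait(arg);
    if (sigprocmask(SIG_SETMASK, &origmask, NULL) != 0)
        return -1;
    return ret;
}

/* Example callback: reports whether SIGUSR1 is blocked during the "wait". */
int usr1_blocked(void *arg)
{
    sigset_t cur;

    (void)arg;
    sigprocmask(SIG_SETMASK, NULL, &cur);
    return sigismember(&cur, SIGUSR1);
}
```

The callback runs with the temporary mask installed, which is the behaviour the new syscall provides, but only the in-kernel version closes the window marked in the comment.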
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-02 19:51:00 +02:00
COND_SYSCALL ( io_pgetevents ) ;
2018-03-04 19:06:35 +01:00
COND_SYSCALL_COMPAT ( io_getevents ) ;
2018-05-02 19:51:00 +02:00
COND_SYSCALL_COMPAT ( io_pgetevents ) ;
2018-03-06 19:53:01 +01:00
/* fs/xattr.c */
/* fs/dcache.c */
/* fs/cookies.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( lookup_dcookie ) ;
COND_SYSCALL_COMPAT ( lookup_dcookie ) ;
2018-03-06 19:53:01 +01:00
/* fs/eventfd.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( eventfd2 ) ;
2018-03-06 19:53:01 +01:00
/* fs/eventpoll.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( epoll_create1 ) ;
COND_SYSCALL ( epoll_ctl ) ;
COND_SYSCALL ( epoll_pwait ) ;
COND_SYSCALL_COMPAT ( epoll_pwait ) ;
2018-03-06 19:53:01 +01:00
/* fs/fcntl.c */
/* fs/inotify_user.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( inotify_init1 ) ;
COND_SYSCALL ( inotify_add_watch ) ;
COND_SYSCALL ( inotify_rm_watch ) ;
2018-03-06 19:53:01 +01:00
/* fs/ioctl.c */
/* fs/ioprio.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( ioprio_set ) ;
COND_SYSCALL ( ioprio_get ) ;
2018-03-06 19:53:01 +01:00
/* fs/locks.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( flock ) ;
2018-03-06 19:53:01 +01:00
/* fs/namei.c */
/* fs/namespace.c */
/* fs/nfsctl.c */
/* fs/open.c */
/* fs/pipe.c */
/* fs/quota.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( quotactl ) ;
2018-03-06 19:53:01 +01:00
/* fs/readdir.c */
/* fs/read_write.c */
/* fs/sendfile.c */
/* fs/select.c */
/* fs/signalfd.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( signalfd4 ) ;
COND_SYSCALL_COMPAT ( signalfd4 ) ;
2018-03-06 19:53:01 +01:00
/* fs/splice.c */
/* fs/stat.c */
/* fs/sync.c */
/* fs/timerfd.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( timerfd_create ) ;
COND_SYSCALL ( timerfd_settime ) ;
COND_SYSCALL_COMPAT ( timerfd_settime ) ;
COND_SYSCALL ( timerfd_gettime ) ;
COND_SYSCALL_COMPAT ( timerfd_gettime ) ;
2018-03-06 19:53:01 +01:00
/* fs/utimes.c */
/* kernel/acct.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( acct ) ;
2018-03-06 19:53:01 +01:00
/* kernel/capability.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( capget ) ;
COND_SYSCALL ( capset ) ;
2018-03-06 19:53:01 +01:00
/* kernel/exec_domain.c */
/* kernel/exit.c */
/* kernel/fork.c */
/* kernel/futex.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( futex ) ;
COND_SYSCALL_COMPAT ( futex ) ;
COND_SYSCALL ( set_robust_list ) ;
COND_SYSCALL_COMPAT ( set_robust_list ) ;
COND_SYSCALL ( get_robust_list ) ;
COND_SYSCALL_COMPAT ( get_robust_list ) ;
2018-03-06 19:53:01 +01:00
/* kernel/hrtimer.c */
/* kernel/itimer.c */
/* kernel/kexec.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( kexec_load ) ;
COND_SYSCALL_COMPAT ( kexec_load ) ;
2018-03-06 19:53:01 +01:00
/* kernel/module.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( init_module ) ;
COND_SYSCALL ( delete_module ) ;
2018-03-06 19:53:01 +01:00
/* kernel/posix-timers.c */
/* kernel/printk.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( syslog ) ;
2018-03-06 19:53:01 +01:00
/* kernel/ptrace.c */
/* kernel/sched/core.c */
/* kernel/signal.c */
/* kernel/sys.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( setregid ) ;
COND_SYSCALL ( setgid ) ;
COND_SYSCALL ( setreuid ) ;
COND_SYSCALL ( setuid ) ;
COND_SYSCALL ( setresuid ) ;
COND_SYSCALL ( getresuid ) ;
COND_SYSCALL ( setresgid ) ;
COND_SYSCALL ( getresgid ) ;
COND_SYSCALL ( setfsuid ) ;
COND_SYSCALL ( setfsgid ) ;
COND_SYSCALL ( setgroups ) ;
COND_SYSCALL ( getgroups ) ;
2018-03-06 19:53:01 +01:00
/* kernel/time.c */
/* kernel/timer.c */
/* ipc/mqueue.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( mq_open ) ;
COND_SYSCALL_COMPAT ( mq_open ) ;
COND_SYSCALL ( mq_unlink ) ;
COND_SYSCALL ( mq_timedsend ) ;
COND_SYSCALL_COMPAT ( mq_timedsend ) ;
COND_SYSCALL ( mq_timedreceive ) ;
COND_SYSCALL_COMPAT ( mq_timedreceive ) ;
COND_SYSCALL ( mq_notify ) ;
COND_SYSCALL_COMPAT ( mq_notify ) ;
COND_SYSCALL ( mq_getsetattr ) ;
COND_SYSCALL_COMPAT ( mq_getsetattr ) ;
2018-03-06 19:53:01 +01:00
/* ipc/msg.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( msgget ) ;
COND_SYSCALL ( msgctl ) ;
COND_SYSCALL_COMPAT ( msgctl ) ;
COND_SYSCALL ( msgrcv ) ;
COND_SYSCALL_COMPAT ( msgrcv ) ;
COND_SYSCALL ( msgsnd ) ;
COND_SYSCALL_COMPAT ( msgsnd ) ;
2018-03-06 19:53:01 +01:00
/* ipc/sem.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( semget ) ;
COND_SYSCALL ( semctl ) ;
COND_SYSCALL_COMPAT ( semctl ) ;
COND_SYSCALL ( semtimedop ) ;
COND_SYSCALL_COMPAT ( semtimedop ) ;
COND_SYSCALL ( semop ) ;
2018-03-06 19:53:01 +01:00
/* ipc/shm.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( shmget ) ;
COND_SYSCALL ( shmctl ) ;
COND_SYSCALL_COMPAT ( shmctl ) ;
COND_SYSCALL ( shmat ) ;
COND_SYSCALL_COMPAT ( shmat ) ;
COND_SYSCALL ( shmdt ) ;
2018-03-06 19:53:01 +01:00
/* net/socket.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( socket ) ;
COND_SYSCALL ( socketpair ) ;
COND_SYSCALL ( bind ) ;
COND_SYSCALL ( listen ) ;
COND_SYSCALL ( accept ) ;
COND_SYSCALL ( connect ) ;
COND_SYSCALL ( getsockname ) ;
COND_SYSCALL ( getpeername ) ;
COND_SYSCALL ( setsockopt ) ;
COND_SYSCALL_COMPAT ( setsockopt ) ;
COND_SYSCALL ( getsockopt ) ;
COND_SYSCALL_COMPAT ( getsockopt ) ;
COND_SYSCALL ( sendto ) ;
COND_SYSCALL ( shutdown ) ;
COND_SYSCALL ( recvfrom ) ;
COND_SYSCALL_COMPAT ( recvfrom ) ;
COND_SYSCALL ( sendmsg ) ;
COND_SYSCALL_COMPAT ( sendmsg ) ;
COND_SYSCALL ( recvmsg ) ;
COND_SYSCALL_COMPAT ( recvmsg ) ;
2018-03-06 19:53:01 +01:00
/* mm/filemap.c */
/* mm/nommu.c, also with MMU */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( mremap ) ;
2018-03-06 19:53:01 +01:00
/* security/keys/keyctl.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( add_key ) ;
COND_SYSCALL ( request_key ) ;
COND_SYSCALL ( keyctl ) ;
COND_SYSCALL_COMPAT ( keyctl ) ;
2005-04-16 15:20:36 -07:00
2018-03-06 19:53:01 +01:00
/* arch/example/kernel/sys_example.c */
2006-04-10 22:53:06 -07:00
2018-03-06 19:53:01 +01:00
/* mm/fadvise.c */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( fadvise64_64 ) ;
2018-03-06 19:53:01 +01:00
/* mm/, CONFIG_MMU only */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( swapon ) ;
COND_SYSCALL ( swapoff ) ;
COND_SYSCALL ( mprotect ) ;
COND_SYSCALL ( msync ) ;
COND_SYSCALL ( mlock ) ;
COND_SYSCALL ( munlock ) ;
COND_SYSCALL ( mlockall ) ;
COND_SYSCALL ( munlockall ) ;
COND_SYSCALL ( mincore ) ;
COND_SYSCALL ( madvise ) ;
COND_SYSCALL ( remap_file_pages ) ;
COND_SYSCALL ( mbind ) ;
COND_SYSCALL_COMPAT ( mbind ) ;
COND_SYSCALL ( get_mempolicy ) ;
COND_SYSCALL_COMPAT ( get_mempolicy ) ;
COND_SYSCALL ( set_mempolicy ) ;
COND_SYSCALL_COMPAT ( set_mempolicy ) ;
COND_SYSCALL ( migrate_pages ) ;
COND_SYSCALL_COMPAT ( migrate_pages ) ;
COND_SYSCALL ( move_pages ) ;
COND_SYSCALL_COMPAT ( move_pages ) ;
COND_SYSCALL ( perf_event_open ) ;
COND_SYSCALL ( accept4 ) ;
COND_SYSCALL ( recvmmsg ) ;
COND_SYSCALL_COMPAT ( recvmmsg ) ;
2018-03-06 19:53:01 +01:00
/*
 * Architecture specific syscalls: see further below
 */
2009-12-17 21:24:25 -05:00
2018-03-06 19:53:01 +01:00
/* fanotify */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( fanotify_init ) ;
COND_SYSCALL ( fanotify_mark ) ;
2011-01-29 18:43:26 +05:30
/* open by handle */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( name_to_handle_at ) ;
COND_SYSCALL ( open_by_handle_at ) ;
COND_SYSCALL_COMPAT ( open_by_handle_at ) ;
2012-05-31 16:26:44 -07:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( sendmmsg ) ;
COND_SYSCALL_COMPAT ( sendmmsg ) ;
COND_SYSCALL ( process_vm_readv ) ;
COND_SYSCALL_COMPAT ( process_vm_readv ) ;
COND_SYSCALL ( process_vm_writev ) ;
COND_SYSCALL_COMPAT ( process_vm_writev ) ;
2018-03-06 19:53:01 +01:00
2012-05-31 16:26:44 -07:00
/* compare kernel pointers */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( kcmp ) ;
2014-06-25 16:08:24 -07:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( finit_module ) ;
2018-03-06 19:53:01 +01:00
2014-06-25 16:08:24 -07:00
/* operate on Secure Computing state */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( seccomp ) ;
2014-09-26 00:16:58 -07:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( memfd_create ) ;
2018-03-06 19:53:01 +01:00
2014-09-26 00:16:58 -07:00
/* access BPF programs and maps */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( bpf ) ;
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
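The AT_EMPTY_PATH case described above can be exercised from user space through the raw syscall interface. This is a minimal sketch under a few assumptions: the C library exposes SYS_execveat and AT_EMPTY_PATH, the helper name run_via_fd() is made up for illustration, and the target is a binary rather than a script.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Executes the file at path through an open descriptor instead of a
 * pathname. Returns the child's exit status, 127 if the exec itself
 * failed, or -1 on setup errors.
 */
int run_via_fd(const char *path)
{
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };
    int status;
    pid_t pid;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Empty pathname plus AT_EMPTY_PATH: run the file fd refers to. */
        syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
        _exit(127); /* only reached if execveat() failed */
    }

    close(fd);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On a kernel that provides execveat(), run_via_fd("/bin/true") would be expected to return 0 without touching /proc. A script target would behave as described above, since the interpreter has to reopen the file by name.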
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
/* execveat */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( execveat ) ;
sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.
The existing applications of which I am aware that would be improved by
this system call are as follows:
* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)
Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.
To explain the benefit of this scheme, let's introduce two example threads:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())
In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pair:
        Thread A                    Thread B
        previous mem accesses       previous mem accesses
        smp_mb()                    smp_mb()
        following mem accesses      following mem accesses
After the change, these pairs become:
        Thread A                    Thread B
        prev mem accesses           prev mem accesses
        sys_membarrier()            barrier()
        follow mem accesses         follow mem accesses
As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).
1) Non-concurrent Thread A vs Thread B accesses:
        Thread A                    Thread B
        prev mem accesses
        sys_membarrier()
        follow mem accesses
                                    prev mem accesses
                                    barrier()
                                    follow mem accesses
In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses
        Thread A                    Thread B
        prev mem accesses           prev mem accesses
        sys_membarrier()            barrier()
        follow mem accesses         follow mem accesses
In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
* Benchmarks
On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)
1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
* User-space user of this system call: Userspace RCU library
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to each of the process's
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes
The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.
An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.
This patch adds the system call to x86 and to asm-generic.
[1] http://urcu.so
membarrier(2) man page:
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
NAME
membarrier - issue memory barriers on a set of threads
SYNOPSIS
#include <linux/membarrier.h>
int membarrier(int cmd, int flags);
DESCRIPTION
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.
MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all processes
running on the system. This command returns 0.
The flags argument needs to be 0. For future extensions.
All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():
The pair ordering is detailed as (O: ordered, X: not ordered):
                       barrier()   smp_mb()   sys_membarrier()
      barrier()           X           X              O
      smp_mb()            X           O              O
      sys_membarrier()    O           O              O
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.
ERRORS
ENOSYS System call is not implemented.
EINVAL Invalid arguments.
Linux 2015-04-15 MEMBARRIER(2)
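Combining the man page interface with the dynamic availability check mentioned in the liburcu results, a user-space caller might look roughly like the sketch below. It is not liburcu's code: the function names are hypothetical, the raw syscall is used on the assumption that the C library offers no wrapper, and __sync_synchronize() stands in for smp_mb().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Compiler-only barrier, the "barrier()" of the ordering table above. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int has_membarrier; /* set once by membarrier_init() */

static int membarrier_call(int cmd, int flags)
{
    assert(flags == 0); /* the man page requires flags to be 0 */
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/* Queries the supported commands; returns 1 if CMD_SHARED is usable. */
int membarrier_init(void)
{
    int mask = membarrier_call(MEMBARRIER_CMD_QUERY, 0);

    has_membarrier = mask > 0 && (mask & MEMBARRIER_CMD_SHARED);
    return has_membarrier;
}

/* Frequent path (Thread B): cheap when the syscall is available. */
void reader_barrier(void)
{
    if (has_membarrier)
        barrier();
    else
        __sync_synchronize(); /* fall back to a full memory barrier */
}

/* Infrequent path (Thread A): carries the cost for all readers. */
int writer_barrier(void)
{
    if (has_membarrier)
        return membarrier_call(MEMBARRIER_CMD_SHARED, 0);
    __sync_synchronize();
    return 0;
}
```

membarrier_init() has to run before any reader or writer uses the barriers, so that both sides agree on which scheme is in effect. The branch on has_membarrier in reader_barrier() is the read-side overhead that the "dyn. check" benchmark row reflects.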
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 13:07:39 -07:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( userfaultfd ) ;
2018-03-06 19:53:01 +01:00
sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.
The existing applications of which I am aware that would be improved by
this system call are as follows:
* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)
Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.
To explain the benefit of this scheme, let's introduce two example threads:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())
In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pair:
Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses
After the change, these pairs become:
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses
As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).
1) Non-concurrent Thread A vs Thread B accesses:
Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses
In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses
In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
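The pairing described above can be sketched in user-space C as follows. This is a minimal illustration, not code from the patch or from liburcu: every demo_* name is hypothetical, the raw syscall() wrapper is used on the assumption that the C library provides no membarrier() function, and a full fence is kept as the fallback when MEMBARRIER_CMD_SHARED is not reported as supported.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Compiler-only barrier: preserves program order, emits no fence. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical flag: nonzero once MEMBARRIER_CMD_SHARED is usable. */
static int demo_has_membarrier;

/* Thin wrapper around the raw system call (assumed not in libc). */
static int demo_membarrier(int cmd, int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/* Call once, before any other thread uses the helpers below. */
static void demo_init(void)
{
	int mask = demo_membarrier(MEMBARRIER_CMD_QUERY, 0);

	demo_has_membarrier = mask > 0 && (mask & MEMBARRIER_CMD_SHARED);
}

/* Frequent side (Thread B): a compiler barrier when the call exists. */
static inline void demo_read_side_mb(void)
{
	if (demo_has_membarrier)
		demo_barrier();
	else
		atomic_thread_fence(memory_order_seq_cst);
}

/* Infrequent side (Thread A): pays the barrier cost for the readers. */
static inline int demo_write_side_mb(void)
{
	if (demo_has_membarrier)
		return demo_membarrier(MEMBARRIER_CMD_SHARED, 0);
	atomic_thread_fence(memory_order_seq_cst);
	return 0;
}
```

The run-time test of demo_has_membarrier on the read side corresponds to a dynamic availability check; a build that can assume the system call exists could drop it.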
* Benchmarks
On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)
1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
* User-space user of this system call: Userspace RCU library
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes
The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.
An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.
This patch adds the system call to x86 and to asm-generic.
[1] http://urcu.so
membarrier(2) man page:
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
NAME
membarrier - issue memory barriers on a set of threads
SYNOPSIS
#include <linux/membarrier.h>
int membarrier(int cmd, int flags);
DESCRIPTION
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.
MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all processes
running on the system. This command returns 0.
The flags argument must be 0; it is reserved for future extensions.
All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():
The pair ordering is detailed as (O: ordered, X: not ordered):
                       barrier()   smp_mb()   sys_membarrier()
barrier()              X           X          O
smp_mb()               X           O          O
sys_membarrier()       O           O          O
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.
ERRORS
ENOSYS System call is not implemented.
EINVAL Invalid arguments.
Linux 2015-04-15 MEMBARRIER(2)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 13:07:39 -07:00
/* membarrier */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( membarrier ) ;
2016-09-12 13:38:42 -07:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( mlock2 ) ;
2018-03-06 19:53:01 +01:00
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( copy_file_range ) ;
2018-03-06 19:53:01 +01:00
2016-09-12 13:38:42 -07:00
/* memory protection keys */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( pkey_mprotect ) ;
COND_SYSCALL ( pkey_alloc ) ;
COND_SYSCALL ( pkey_free ) ;
2018-03-06 19:53:01 +01:00
/*
* Architecture specific weak syscall entries.
*/
/* pciconfig: alpha, arm, arm64, ia64, sparc */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( pciconfig_read ) ;
COND_SYSCALL ( pciconfig_write ) ;
COND_SYSCALL ( pciconfig_iobase ) ;
2018-03-06 19:53:01 +01:00
/* sys_socketcall: arm, mips, x86, ... */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( socketcall ) ;
COND_SYSCALL_COMPAT ( socketcall ) ;
2018-03-06 19:53:01 +01:00
/* compat syscalls for arm64, x86, ... */
2018-03-04 19:06:35 +01:00
COND_SYSCALL_COMPAT ( sysctl ) ;
COND_SYSCALL_COMPAT ( fanotify_mark ) ;
2018-03-06 19:53:01 +01:00
/* x86 */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( vm86old ) ;
COND_SYSCALL ( modify_ldt ) ;
COND_SYSCALL_COMPAT ( quotactl32 ) ;
COND_SYSCALL ( vm86 ) ;
COND_SYSCALL ( kexec_file_load ) ;
2018-03-06 19:53:01 +01:00
/* s390 */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( s390_pci_mmio_read ) ;
COND_SYSCALL ( s390_pci_mmio_write ) ;
COND_SYSCALL_COMPAT ( s390_ipc ) ;
2018-03-06 19:53:01 +01:00
/* powerpc */
2018-05-02 23:20:48 +10:00
COND_SYSCALL ( rtas ) ;
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( spu_run ) ;
COND_SYSCALL ( spu_create ) ;
COND_SYSCALL ( subpage_prot ) ;
2018-03-06 19:53:01 +01:00
/*
* Deprecated system calls which are still defined in
* include/uapi/asm-generic/unistd.h and wanted by >= 1 arch
*/
/* __ARCH_WANT_SYSCALL_NO_FLAGS */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( epoll_create ) ;
COND_SYSCALL ( inotify_init ) ;
COND_SYSCALL ( eventfd ) ;
COND_SYSCALL ( signalfd ) ;
COND_SYSCALL_COMPAT ( signalfd ) ;
2018-03-06 19:53:01 +01:00
/* __ARCH_WANT_SYSCALL_OFF_T */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( fadvise64 ) ;
2018-03-06 19:53:01 +01:00
/* __ARCH_WANT_SYSCALL_DEPRECATED */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( epoll_wait ) ;
COND_SYSCALL ( recv ) ;
COND_SYSCALL_COMPAT ( recv ) ;
COND_SYSCALL ( send ) ;
COND_SYSCALL ( bdflush ) ;
COND_SYSCALL ( uselib ) ;
2018-03-06 19:53:01 +01:00
/*
* The syscalls below are not found in include/uapi/asm-generic/unistd.h
*/
/* obsolete: SGETMASK_SYSCALL */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( sgetmask ) ;
COND_SYSCALL ( ssetmask ) ;
2018-03-06 19:53:01 +01:00
/* obsolete: SYSFS_SYSCALL */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( sysfs ) ;
2018-03-06 19:53:01 +01:00
/* obsolete: __ARCH_WANT_SYS_IPC */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( ipc ) ;
COND_SYSCALL_COMPAT ( ipc ) ;
2018-03-06 19:53:01 +01:00
/* obsolete: UID16 */
2018-03-04 19:06:35 +01:00
COND_SYSCALL ( chown16 ) ;
COND_SYSCALL ( fchown16 ) ;
COND_SYSCALL ( getegid16 ) ;
COND_SYSCALL ( geteuid16 ) ;
COND_SYSCALL ( getgid16 ) ;
COND_SYSCALL ( getgroups16 ) ;
COND_SYSCALL ( getresgid16 ) ;
COND_SYSCALL ( getresuid16 ) ;
COND_SYSCALL ( getuid16 ) ;
COND_SYSCALL ( lchown16 ) ;
COND_SYSCALL ( setfsgid16 ) ;
COND_SYSCALL ( setfsuid16 ) ;
COND_SYSCALL ( setgid16 ) ;
COND_SYSCALL ( setgroups16 ) ;
COND_SYSCALL ( setregid16 ) ;
COND_SYSCALL ( setresgid16 ) ;
COND_SYSCALL ( setresuid16 ) ;
COND_SYSCALL ( setreuid16 ) ;
COND_SYSCALL ( setuid16 ) ;
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
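For comparison, the heavier pattern that restartable sequences aim to replace (the "getcpu+atomic" column in the benchmarks below) can be sketched as follows. This is an illustrative baseline only, not code from the patch: the demo_* names and the DEMO_MAX_CPUS bound are hypothetical, and sched_getcpu() stands in for whichever getcpu mechanism a program uses. The atomic add is needed because the thread may migrate to another CPU between reading the CPU number and updating the counter.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical upper bound on CPU numbers for this sketch. */
#define DEMO_MAX_CPUS 1024

/* One counter per CPU, aligned to limit false sharing between CPUs. */
struct demo_percpu_counter {
	_Alignas(64) atomic_uint_fast64_t value;
};

static struct demo_percpu_counter demo_counters[DEMO_MAX_CPUS];

/* Increment the counter of the CPU the caller was last seen on. */
static void demo_counter_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0; /* getcpu unavailable: fall back to slot 0 */
	/* Atomic, since the thread may have migrated after sched_getcpu(). */
	atomic_fetch_add_explicit(&demo_counters[cpu % DEMO_MAX_CPUS].value,
				  1, memory_order_relaxed);
}

/* Sum all per-CPU slots to obtain the total count. */
static uint64_t demo_counter_sum(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < DEMO_MAX_CPUS; i++)
		sum += atomic_load_explicit(&demo_counters[i].value,
					    memory_order_relaxed);
	return sum;
}
```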
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
          getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                    344.0            31.4       11.0
x86-64:                    15.3             2.0        7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
          getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                   2502.0          2250.0        1.1
x86-64:                   117.4            98.0        1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
          getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                    751.0           128.5        5.8
x86-64:                    53.4            28.6        1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response time shows a 1-2% gain in average
latency, and the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
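A minimal user-space sketch of this mechanism follows, assuming kernel headers that ship <linux/rseq.h> and define __NR_rseq. The demo_* names and the signature value are hypothetical. Registration can be refused, for instance when the C library has already registered its own area for the thread, so the sketch falls back to sched_getcpu() in that case.

```c
#define _GNU_SOURCE
#include <linux/rseq.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical signature value chosen for this sketch. */
#define DEMO_RSEQ_SIG 0x53053053U
/* Size of the originally defined struct rseq area, in bytes. */
#define DEMO_RSEQ_LEN 32U

/* Per-thread area shared with the kernel; it must stay valid for as
 * long as the registration is in place. */
static __thread struct rseq demo_rseq_area;
static __thread int demo_rseq_registered;

/* Returns 0 on success, -1 (errno set) if the kernel refuses. */
static int demo_rseq_register(void)
{
	if (syscall(__NR_rseq, &demo_rseq_area, DEMO_RSEQ_LEN, 0,
		    DEMO_RSEQ_SIG) != 0)
		return -1;
	demo_rseq_registered = 1;
	return 0;
}

/* Read the CPU number from memory when registered, else ask libc. */
static int demo_current_cpu(void)
{
	if (demo_rseq_registered)
		return (int)*(volatile __u32 *)&demo_rseq_area.cpu_id;
	return sched_getcpu();
}
```

Once registration succeeds, demo_current_cpu() is a plain memory load with no system call or vdso call. A thread that registers its own area this way should unregister it before the memory goes away; that step is left out here for brevity.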
Keeping the current cpu id in a memory area shared between kernel and
user-space improves on the mechanisms currently available to read the
current CPU number. It has the following benefits over alternative
approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 08:43:54 -04:00
/* restartable sequence */
COND_SYSCALL ( rseq ) ;
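The COND_SYSCALL() entries above rely on weak symbols: when a system call is not built in, its symbol resolves to a stub that reports the call as not implemented. The user-space sketch below illustrates the idea with the GCC/Clang weak and alias attributes on an ELF target. The DEMO_COND_SYSCALL macro and the demo_* functions are hypothetical and are not the kernel's actual definitions.

```c
#include <errno.h>

/* Stand-in for the kernel's "not implemented" handler. */
long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/* Declare demo_sys_<name> as a weak alias of demo_ni_syscall().
 * A strong definition of demo_sys_<name> in another object file
 * would take precedence at link time. */
#define DEMO_COND_SYSCALL(name) \
	long demo_sys_##name(void) \
		__attribute__((weak, alias("demo_ni_syscall")))

DEMO_COND_SYSCALL(rseq);
DEMO_COND_SYSCALL(membarrier);
```

With no strong definition linked in, calling demo_sys_rseq() or demo_sys_membarrier() returns -ENOSYS.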