2018-12-13 14:37:38 +05:30
# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
#
# system call numbers and entry vectors for mips
#
# The format is:
# <number> <abi> <name> <entry point>
#
# The <abi> is always "n64" for this file.
#
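As a rough illustration of the four-field format described in the header above, the following C sketch parses a single row. The struct and function names are hypothetical. The 5000 offset for user-visible n64 syscall numbers is an assumption taken from the MIPS uapi headers, not something this file states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One row of the table: "<number> <abi> <name> <entry point>". */
struct syscall_row {
    int  number;
    char abi[8];
    char name[64];
    char entry[64];
};

/*
 * Parse one line. Returns 1 if the line is an n64 table row, 0 if it is a
 * comment, blank, or otherwise not in the four-field format.
 */
static int parse_syscall_row(const char *line, struct syscall_row *row)
{
    assert(line != NULL && row != NULL);

    /* Skip leading whitespace, then ignore comments and empty lines. */
    line += strspn(line, " \t");
    if (*line == '#' || *line == '\0' || *line == '\n')
        return 0;

    /* Field widths are one less than the buffer sizes above. */
    if (sscanf(line, "%d %7s %63s %63s",
               &row->number, row->abi, row->name, row->entry) != 4)
        return 0;

    /* The header says the <abi> field is always "n64" in this file. */
    return strcmp(row->abi, "n64") == 0;
}

/* Assumption (not from this file): n64 user-visible numbers start at 5000. */
static int n64_user_number(const struct syscall_row *row)
{
    return 5000 + row->number;
}
```

Under that assumption, the row `437 n64 openat2 sys_openat2` would correspond to user-space syscall number 5437.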
0 n64 read sys_read
1 n64 write sys_write
2 n64 open sys_open
3 n64 close sys_close
4 n64 stat sys_newstat
5 n64 fstat sys_newfstat
6 n64 lstat sys_newlstat
7 n64 poll sys_poll
8 n64 lseek sys_lseek
9 n64 mmap sys_mips_mmap
10 n64 mprotect sys_mprotect
11 n64 munmap sys_munmap
12 n64 brk sys_brk
13 n64 rt_sigaction sys_rt_sigaction
14 n64 rt_sigprocmask sys_rt_sigprocmask
15 n64 ioctl sys_ioctl
16 n64 pread64 sys_pread64
17 n64 pwrite64 sys_pwrite64
18 n64 readv sys_readv
19 n64 writev sys_writev
20 n64 access sys_access
21 n64 pipe sysm_pipe
22 n64 _newselect sys_select
23 n64 sched_yield sys_sched_yield
24 n64 mremap sys_mremap
25 n64 msync sys_msync
26 n64 mincore sys_mincore
27 n64 madvise sys_madvise
28 n64 shmget sys_shmget
29 n64 shmat sys_shmat
ipc: rename old-style shmctl/semctl/msgctl syscalls
The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.
The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.
As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.
A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-12-31 22:22:40 +01:00
30 n64 shmctl sys_old_shmctl
2018-12-13 14:37:38 +05:30
31 n64 dup sys_dup
32 n64 dup2 sys_dup2
33 n64 pause sys_pause
34 n64 nanosleep sys_nanosleep
35 n64 getitimer sys_getitimer
36 n64 setitimer sys_setitimer
37 n64 alarm sys_alarm
38 n64 getpid sys_getpid
39 n64 sendfile sys_sendfile64
40 n64 socket sys_socket
41 n64 connect sys_connect
42 n64 accept sys_accept
43 n64 sendto sys_sendto
44 n64 recvfrom sys_recvfrom
45 n64 sendmsg sys_sendmsg
46 n64 recvmsg sys_recvmsg
47 n64 shutdown sys_shutdown
48 n64 bind sys_bind
49 n64 listen sys_listen
50 n64 getsockname sys_getsockname
51 n64 getpeername sys_getpeername
52 n64 socketpair sys_socketpair
53 n64 setsockopt sys_setsockopt
54 n64 getsockopt sys_getsockopt
55 n64 clone __sys_clone
56 n64 fork __sys_fork
57 n64 execve sys_execve
58 n64 exit sys_exit
59 n64 wait4 sys_wait4
60 n64 kill sys_kill
61 n64 uname sys_newuname
62 n64 semget sys_semget
63 n64 semop sys_semop
2018-12-31 22:22:40 +01:00
64 n64 semctl sys_old_semctl
2018-12-13 14:37:38 +05:30
65 n64 shmdt sys_shmdt
66 n64 msgget sys_msgget
67 n64 msgsnd sys_msgsnd
68 n64 msgrcv sys_msgrcv
2018-12-31 22:22:40 +01:00
69 n64 msgctl sys_old_msgctl
2018-12-13 14:37:38 +05:30
70 n64 fcntl sys_fcntl
71 n64 flock sys_flock
72 n64 fsync sys_fsync
73 n64 fdatasync sys_fdatasync
74 n64 truncate sys_truncate
75 n64 ftruncate sys_ftruncate
76 n64 getdents sys_getdents
77 n64 getcwd sys_getcwd
78 n64 chdir sys_chdir
79 n64 fchdir sys_fchdir
80 n64 rename sys_rename
81 n64 mkdir sys_mkdir
82 n64 rmdir sys_rmdir
83 n64 creat sys_creat
84 n64 link sys_link
85 n64 unlink sys_unlink
86 n64 symlink sys_symlink
87 n64 readlink sys_readlink
88 n64 chmod sys_chmod
89 n64 fchmod sys_fchmod
90 n64 chown sys_chown
91 n64 fchown sys_fchown
92 n64 lchown sys_lchown
93 n64 umask sys_umask
94 n64 gettimeofday sys_gettimeofday
95 n64 getrlimit sys_getrlimit
96 n64 getrusage sys_getrusage
97 n64 sysinfo sys_sysinfo
98 n64 times sys_times
99 n64 ptrace sys_ptrace
100 n64 getuid sys_getuid
101 n64 syslog sys_syslog
102 n64 getgid sys_getgid
103 n64 setuid sys_setuid
104 n64 setgid sys_setgid
105 n64 geteuid sys_geteuid
106 n64 getegid sys_getegid
107 n64 setpgid sys_setpgid
108 n64 getppid sys_getppid
109 n64 getpgrp sys_getpgrp
110 n64 setsid sys_setsid
111 n64 setreuid sys_setreuid
112 n64 setregid sys_setregid
113 n64 getgroups sys_getgroups
114 n64 setgroups sys_setgroups
115 n64 setresuid sys_setresuid
116 n64 getresuid sys_getresuid
117 n64 setresgid sys_setresgid
118 n64 getresgid sys_getresgid
119 n64 getpgid sys_getpgid
120 n64 setfsuid sys_setfsuid
121 n64 setfsgid sys_setfsgid
122 n64 getsid sys_getsid
123 n64 capget sys_capget
124 n64 capset sys_capset
125 n64 rt_sigpending sys_rt_sigpending
126 n64 rt_sigtimedwait sys_rt_sigtimedwait
127 n64 rt_sigqueueinfo sys_rt_sigqueueinfo
128 n64 rt_sigsuspend sys_rt_sigsuspend
129 n64 sigaltstack sys_sigaltstack
130 n64 utime sys_utime
131 n64 mknod sys_mknod
132 n64 personality sys_personality
133 n64 ustat sys_ustat
134 n64 statfs sys_statfs
135 n64 fstatfs sys_fstatfs
136 n64 sysfs sys_sysfs
137 n64 getpriority sys_getpriority
138 n64 setpriority sys_setpriority
139 n64 sched_setparam sys_sched_setparam
140 n64 sched_getparam sys_sched_getparam
141 n64 sched_setscheduler sys_sched_setscheduler
142 n64 sched_getscheduler sys_sched_getscheduler
143 n64 sched_get_priority_max sys_sched_get_priority_max
144 n64 sched_get_priority_min sys_sched_get_priority_min
145 n64 sched_rr_get_interval sys_sched_rr_get_interval
146 n64 mlock sys_mlock
147 n64 munlock sys_munlock
148 n64 mlockall sys_mlockall
149 n64 munlockall sys_munlockall
150 n64 vhangup sys_vhangup
151 n64 pivot_root sys_pivot_root
2020-08-14 17:31:07 -07:00
152 n64 _sysctl sys_ni_syscall
2018-12-13 14:37:38 +05:30
153 n64 prctl sys_prctl
154 n64 adjtimex sys_adjtimex
155 n64 setrlimit sys_setrlimit
156 n64 chroot sys_chroot
157 n64 sync sys_sync
158 n64 acct sys_acct
159 n64 settimeofday sys_settimeofday
160 n64 mount sys_mount
161 n64 umount2 sys_umount
162 n64 swapon sys_swapon
163 n64 swapoff sys_swapoff
164 n64 reboot sys_reboot
165 n64 sethostname sys_sethostname
166 n64 setdomainname sys_setdomainname
167 n64 create_module sys_ni_syscall
168 n64 init_module sys_init_module
169 n64 delete_module sys_delete_module
170 n64 get_kernel_syms sys_ni_syscall
171 n64 query_module sys_ni_syscall
172 n64 quotactl sys_quotactl
173 n64 nfsservctl sys_ni_syscall
174 n64 getpmsg sys_ni_syscall
175 n64 putpmsg sys_ni_syscall
176 n64 afs_syscall sys_ni_syscall
# 177 reserved for security
177 n64 reserved177 sys_ni_syscall
178 n64 gettid sys_gettid
179 n64 readahead sys_readahead
180 n64 setxattr sys_setxattr
181 n64 lsetxattr sys_lsetxattr
182 n64 fsetxattr sys_fsetxattr
183 n64 getxattr sys_getxattr
184 n64 lgetxattr sys_lgetxattr
185 n64 fgetxattr sys_fgetxattr
186 n64 listxattr sys_listxattr
187 n64 llistxattr sys_llistxattr
188 n64 flistxattr sys_flistxattr
189 n64 removexattr sys_removexattr
190 n64 lremovexattr sys_lremovexattr
191 n64 fremovexattr sys_fremovexattr
192 n64 tkill sys_tkill
193 n64 reserved193 sys_ni_syscall
194 n64 futex sys_futex
195 n64 sched_setaffinity sys_sched_setaffinity
196 n64 sched_getaffinity sys_sched_getaffinity
197 n64 cacheflush sys_cacheflush
198 n64 cachectl sys_cachectl
199 n64 sysmips __sys_sysmips
200 n64 io_setup sys_io_setup
201 n64 io_destroy sys_io_destroy
202 n64 io_getevents sys_io_getevents
203 n64 io_submit sys_io_submit
204 n64 io_cancel sys_io_cancel
205 n64 exit_group sys_exit_group
206 n64 lookup_dcookie sys_lookup_dcookie
207 n64 epoll_create sys_epoll_create
208 n64 epoll_ctl sys_epoll_ctl
209 n64 epoll_wait sys_epoll_wait
210 n64 remap_file_pages sys_remap_file_pages
211 n64 rt_sigreturn sys_rt_sigreturn
212 n64 set_tid_address sys_set_tid_address
213 n64 restart_syscall sys_restart_syscall
214 n64 semtimedop sys_semtimedop
215 n64 fadvise64 sys_fadvise64_64
216 n64 timer_create sys_timer_create
217 n64 timer_settime sys_timer_settime
218 n64 timer_gettime sys_timer_gettime
219 n64 timer_getoverrun sys_timer_getoverrun
220 n64 timer_delete sys_timer_delete
221 n64 clock_settime sys_clock_settime
222 n64 clock_gettime sys_clock_gettime
223 n64 clock_getres sys_clock_getres
224 n64 clock_nanosleep sys_clock_nanosleep
225 n64 tgkill sys_tgkill
226 n64 utimes sys_utimes
227 n64 mbind sys_mbind
228 n64 get_mempolicy sys_get_mempolicy
229 n64 set_mempolicy sys_set_mempolicy
230 n64 mq_open sys_mq_open
231 n64 mq_unlink sys_mq_unlink
232 n64 mq_timedsend sys_mq_timedsend
233 n64 mq_timedreceive sys_mq_timedreceive
234 n64 mq_notify sys_mq_notify
235 n64 mq_getsetattr sys_mq_getsetattr
236 n64 vserver sys_ni_syscall
237 n64 waitid sys_waitid
# 238 was sys_setaltroot
239 n64 add_key sys_add_key
240 n64 request_key sys_request_key
241 n64 keyctl sys_keyctl
242 n64 set_thread_area sys_set_thread_area
243 n64 inotify_init sys_inotify_init
244 n64 inotify_add_watch sys_inotify_add_watch
245 n64 inotify_rm_watch sys_inotify_rm_watch
246 n64 migrate_pages sys_migrate_pages
247 n64 openat sys_openat
248 n64 mkdirat sys_mkdirat
249 n64 mknodat sys_mknodat
250 n64 fchownat sys_fchownat
251 n64 futimesat sys_futimesat
252 n64 newfstatat sys_newfstatat
253 n64 unlinkat sys_unlinkat
254 n64 renameat sys_renameat
255 n64 linkat sys_linkat
256 n64 symlinkat sys_symlinkat
257 n64 readlinkat sys_readlinkat
258 n64 fchmodat sys_fchmodat
259 n64 faccessat sys_faccessat
260 n64 pselect6 sys_pselect6
261 n64 ppoll sys_ppoll
262 n64 unshare sys_unshare
263 n64 splice sys_splice
264 n64 sync_file_range sys_sync_file_range
265 n64 tee sys_tee
266 n64 vmsplice sys_vmsplice
267 n64 move_pages sys_move_pages
268 n64 set_robust_list sys_set_robust_list
269 n64 get_robust_list sys_get_robust_list
270 n64 kexec_load sys_kexec_load
271 n64 getcpu sys_getcpu
272 n64 epoll_pwait sys_epoll_pwait
273 n64 ioprio_set sys_ioprio_set
274 n64 ioprio_get sys_ioprio_get
275 n64 utimensat sys_utimensat
276 n64 signalfd sys_signalfd
277 n64 timerfd sys_ni_syscall
278 n64 eventfd sys_eventfd
279 n64 fallocate sys_fallocate
280 n64 timerfd_create sys_timerfd_create
281 n64 timerfd_gettime sys_timerfd_gettime
282 n64 timerfd_settime sys_timerfd_settime
283 n64 signalfd4 sys_signalfd4
284 n64 eventfd2 sys_eventfd2
285 n64 epoll_create1 sys_epoll_create1
286 n64 dup3 sys_dup3
287 n64 pipe2 sys_pipe2
288 n64 inotify_init1 sys_inotify_init1
289 n64 preadv sys_preadv
290 n64 pwritev sys_pwritev
291 n64 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo
292 n64 perf_event_open sys_perf_event_open
293 n64 accept4 sys_accept4
294 n64 recvmmsg sys_recvmmsg
295 n64 fanotify_init sys_fanotify_init
296 n64 fanotify_mark sys_fanotify_mark
297 n64 prlimit64 sys_prlimit64
298 n64 name_to_handle_at sys_name_to_handle_at
299 n64 open_by_handle_at sys_open_by_handle_at
300 n64 clock_adjtime sys_clock_adjtime
301 n64 syncfs sys_syncfs
302 n64 sendmmsg sys_sendmmsg
303 n64 setns sys_setns
304 n64 process_vm_readv sys_process_vm_readv
305 n64 process_vm_writev sys_process_vm_writev
306 n64 kcmp sys_kcmp
307 n64 finit_module sys_finit_module
308 n64 getdents64 sys_getdents64
309 n64 sched_setattr sys_sched_setattr
310 n64 sched_getattr sys_sched_getattr
311 n64 renameat2 sys_renameat2
312 n64 seccomp sys_seccomp
313 n64 getrandom sys_getrandom
314 n64 memfd_create sys_memfd_create
315 n64 bpf sys_bpf
316 n64 execveat sys_execveat
317 n64 userfaultfd sys_userfaultfd
318 n64 membarrier sys_membarrier
319 n64 mlock2 sys_mlock2
320 n64 copy_file_range sys_copy_file_range
321 n64 preadv2 sys_preadv2
322 n64 pwritev2 sys_pwritev2
323 n64 pkey_mprotect sys_pkey_mprotect
324 n64 pkey_alloc sys_pkey_alloc
325 n64 pkey_free sys_pkey_free
326 n64 statx sys_statx
327 n64 rseq sys_rseq
328 n64 io_pgetevents sys_io_pgetevents
2019-01-10 12:45:11 +01:00
# 329 through 423 are reserved to sync up with other architectures
2019-02-28 13:59:19 +01:00
424 n64 pidfd_send_signal sys_pidfd_send_signal
425 n64 io_uring_setup sys_io_uring_setup
426 n64 io_uring_enter sys_io_uring_enter
427 n64 io_uring_register sys_io_uring_register
2019-05-16 12:52:34 +01:00
428 n64 open_tree sys_open_tree
429 n64 move_mount sys_move_mount
430 n64 fsopen sys_fsopen
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
2019-05-24 12:44:59 +02:00
434 n64 pidfd_open sys_pidfd_open
2019-10-02 18:59:49 +00:00
435 n64 clone3 __sys_clone3
2019-05-24 11:31:44 +02:00
436 n64 close_range sys_close_range
open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64 bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve
Restrict path resolution (in contrast to O_* flags, these affect all
path components). The current set of flags is as follows (at the
moment, each RESOLVE_ flag is implemented by simply passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
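A minimal sketch of calling the syscall described above from C, assuming no libc wrapper is available. The struct is a local mirror of the three u64 fields listed in the description, `open_beneath` is a hypothetical helper name, and the numeric value 0x08 for RESOLVE_BENEATH is an assumption taken from the uapi header rather than from this text.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the initial open_how layout: three u64 fields. */
struct open_how_sketch {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Assumption: value as defined in the uapi header, not stated in the text. */
#define SKETCH_RESOLVE_BENEATH 0x08

/*
 * Open path read-only, refusing any resolution that escapes dirfd.
 * Returns a file descriptor, or -1 with errno set (ENOSYS if the build
 * headers do not know the syscall number).
 */
static int open_beneath(int dirfd, const char *path)
{
    assert(path != NULL);
#ifdef SYS_openat2
    struct open_how_sketch how;

    /* Zeroed fields act as no-op defaults; mode stays 0 without O_CREAT. */
    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | O_CLOEXEC;
    how.resolve = SKETCH_RESOLVE_BENEATH;

    /* The size argument tells the kernel which struct version we built. */
    return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
#else
    (void)dirfd;
    errno = ENOSYS;
    return -1;
#endif
}
```

With this helper, a call such as `open_beneath(AT_FDCWD, "..")` is expected to fail, since the path leaves the starting directory.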
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 23:07:59 +11:00
437 n64 openat2 sys_openat2
2020-01-07 09:59:26 -08:00
438 n64 pidfd_getfd sys_pidfd_getfd
2020-05-14 16:44:25 +02:00
439 n64 faccessat2 sys_faccessat2
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case in which System Management Software (SMS) wants to give
a memory hint such as MADV_[COLD|PAGEOUT] to other processes; in the
case of Android, that software is the ActivityManagerService.
The information required to make the reclaim decision is not known to the
app. Instead, it is known to the centralized userspace
daemon (ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall,
process_madvise(2). It uses the pidfd of an external process to give the
hint. It also supports a vector of address ranges, because an Android app
has thousands of vmas due to zygote, so it is a waste of CPU and power to
call the syscall once for each vma. (Testing 2000 per-vma syscalls
against one vector syscall showed a 15% performance improvement. I
think it would be bigger in real practice because the test ran in a very
cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost
of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In the
future, we could find more use cases for other advice values, so let's make
vector support part of the API now, since we are introducing a new syscall
anyway. With that, existing madvise(2) users could replace it with
process_madvise(2) on their own pid if they want batched address range
support.
Since it could affect another process's address range, only a privileged
process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
to ptrace the target (e.g., by having the same UID), can use it successfully.
The flag argument is reserved for future use if we need to extend the API.
I think supporting every hint that madvise supports now or in the future in
process_madvise is rather risky, because we are not sure all hints make
sense from an external process, and the implementation of a hint may rely on
the caller being in the current context, so it could be error-prone. Thus,
I have limited the hints to MADV_[COLD|PAGEOUT] in this patch.
If someone wants to add other hints, we can hear the use case and review
each hint individually. That is safer for maintenance than introducing a
buggy syscall that is hard to fix later.
So finally, the API is as follows,
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges of an external process as well as
the local process. It applies the advice to the address ranges of the
process described by iovec and vlen. The goal of such advice is to improve
system or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
The iovec describes address ranges beginning at address iov_base
and extending for iov_len bytes.
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to an external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
process_madvise supports every advice that madvise(2) has if the target
process is in the same thread group as the calling process, so users can
use process_madvise(2) to extend existing madvise(2) usage to
vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
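The partial-advice check described above can be sketched in C as follows. This assumes no libc wrapper and uses the raw syscall; `iov_total_bytes` and `advise_cold` are hypothetical helper names, and the fallback value 20 for MADV_COLD is an assumption based on the generic uapi header.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* assumption: generic uapi value, used if libc lacks it */
#endif

/* Sum of iov_len over the vector: the byte count a full success would report. */
static size_t iov_total_bytes(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/*
 * Hint MADV_COLD for every range in iov within the process behind pidfd.
 * Returns 1 if all bytes were advised, 0 on partial advice, and -1 on
 * error with errno set (ENOSYS if the build headers lack the syscall).
 */
static int advise_cold(int pidfd, const struct iovec *iov, size_t vlen)
{
    assert(iov != NULL);
#ifdef SYS_process_madvise
    /* The final argument is the reserved flags value and must be 0. */
    long done = syscall(SYS_process_madvise, pidfd, iov, vlen, MADV_COLD, 0U);
    if (done < 0)
        return -1;
    return (size_t)done == iov_total_bytes(iov, vlen);
#else
    (void)pidfd;
    (void)vlen;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would typically obtain pidfd from pidfd_open(2) and treat a return of 0 as a signal to inspect which ranges were skipped.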
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How do we handle the race (i.e., object validation) between an
external process giving a hint and the target process receiving
it?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the
target process can run between the time the calling process
inspects the target process's address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. The
same applies to other APIs such as move_pages and process_vm_writev.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization through another API or
through userspace design. It shouldn't be part of the API itself. If someone
needs more fine-grained synchronization than process level,
two ideas were suggested - cookie[2] and anon-fd[3]. Both are
applicable via the last, reserved argument of the API, but I don't
think they are necessary right now, since we already have ways to prevent
the race and I don't want to add complexity with a more
fine-grained synchronization model.
To keep the API extensible, the last argument is reserved
so we can support this in the future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act on the hint in the caller's context, not
the callee's, because the callee is usually limited by cpuset/cgroups or
even in a frozen state, so it can't act by itself quickly enough, which
causes more thrashing/kills. It also doesn't work if the target process is
already ptraced (e.g., by strace, a debugger, or minidump) because a process
can have at most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-17 16:14:59 -07:00
440 n64 process_madvise sys_process_madvise