License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
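The header/source distinction mentioned above can be sketched as follows. This is a hypothetical illustration in C, not the script Thomas and Greg used (that script is not shown here); the helper names `has_suffix` and `spdx_line` and the extension checks are assumptions. It follows the convention visible in this file, where a .c file carries a `//` comment, and assumes headers carry a `/* */` comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if path ends with the given suffix. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Hypothetical helper: format the SPDX tag line for a file into buf.
 * Returns the number of characters written, or -1 if the file type is
 * not handled by this sketch or the buffer is too small.
 */
static int spdx_line(const char *path, const char *license,
		     char *buf, size_t size)
{
	int n;

	assert(path && license && buf);

	if (has_suffix(path, ".c"))
		n = snprintf(buf, size, "// SPDX-License-Identifier: %s", license);
	else if (has_suffix(path, ".h"))
		n = snprintf(buf, size, "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Other file types (Makefiles, assembly, scripts) would need their own comment syntax; the sketch simply rejects them.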
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
2005-04-16 15:20:36 -07:00
/*
 *  linux/fs/namei.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 */
/*
 * Some corrections by tytso.
 */
/* [Feb 1997 T. Schoebel-Theuer] Complete rewrite of the pathname
 * lookup logic.
 */
/* [Feb-Apr 2000, AV] Rewrite to the new namespace architecture.
 */
#include <linux/init.h>
2011-11-16 23:57:37 -05:00
#include <linux/export.h>
2012-05-23 20:12:50 -07:00
#include <linux/kernel.h>
2005-04-16 15:20:36 -07:00
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/namei.h>
#include <linux/pagemap.h>
2022-02-22 09:43:12 -05:00
#include <linux/sched/mm.h>
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
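A minimal user-space sketch of the interface described above, assuming the inotify system calls as they exist in current Linux (`inotify_init()`, `inotify_add_watch()`); the function name `watch_one_create` and the scratch directory name are made up for illustration. It shows the single fd being waited on with select() and then read for an event.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo: create a scratch directory under base, watch it,
 * create one file in it and return the mask of the first event read.
 * Returns 0 on timeout and -1 on error.
 */
static long watch_one_create(const char *base)
{
	char dir[256], file[300], buf[4096];
	struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
	struct inotify_event ev;
	long ret = -1;
	fd_set rfds;
	int ifd, fd;

	assert(base);
	snprintf(dir, sizeof(dir), "%s/inotify-sketch-%ld", base, (long)getpid());
	snprintf(file, sizeof(file), "%s/hello", dir);
	if (mkdir(dir, 0700) < 0)
		return -1;

	ifd = inotify_init();			/* one fd for all watches */
	if (ifd < 0)
		goto out_rmdir;
	if (inotify_add_watch(ifd, dir, IN_CREATE) < 0)
		goto out_close;

	fd = open(file, O_CREAT | O_WRONLY, 0600);	/* triggers IN_CREATE */
	if (fd < 0)
		goto out_close;
	close(fd);

	/* The inotify fd can be waited on like any other descriptor. */
	FD_ZERO(&rfds);
	FD_SET(ifd, &rfds);
	ret = select(ifd + 1, &rfds, NULL, NULL, &tv);
	if (ret > 0) {
		ssize_t n = read(ifd, buf, sizeof(buf));

		if (n >= (ssize_t)sizeof(ev)) {
			memcpy(&ev, buf, sizeof(ev));
			ret = (long)ev.mask;
		} else {
			ret = -1;
		}
	}
	unlink(file);
out_close:
	close(ifd);
out_rmdir:
	rmdir(dir);
	return ret;
}
```

Calling `watch_one_create("/tmp")` on a system with inotify available should report an event mask containing IN_CREATE.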
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
#include <linux/fsnotify.h>
2005-04-16 15:20:36 -07:00
#include <linux/personality.h>
#include <linux/security.h>
2009-02-04 09:06:57 -05:00
#include <linux/ima.h>
2005-04-16 15:20:36 -07:00
#include <linux/syscalls.h>
#include <linux/mount.h>
#include <linux/audit.h>
2006-01-11 12:17:46 -08:00
#include <linux/capability.h>
2005-10-18 14:20:16 -07:00
#include <linux/file.h>
2006-01-18 17:43:53 -08:00
#include <linux/fcntl.h>
2008-04-29 01:00:10 -07:00
#include <linux/device_cgroup.h>
2009-03-29 19:50:06 -04:00
#include <linux/fs_struct.h>
2011-07-22 19:30:19 -07:00
#include <linux/posix_acl.h>
vfs: fix bad hashing of dentries
Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:
"The test case is essentially
for (i = 0; i < 1000000; i++)
mkdir("a$i");
On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
dir/sec with 3.10. This is because we spend waaaaay more time in
__d_lookup on 3.10 than in 3.2.
The new hashing function for strings is suboptimal for <
sizeof(unsigned long) string names (and hell even > sizeof(unsigned
long) string names that I've tested). I broke out the old hashing
function and the new one into a userspace helper to get real numbers
and this is what I'm getting:
Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
My test does the hash, and then does the d_hash into a integer pointer
array the same size as the dentry hash table on my system, and then
just increments the value at the address we got to see how many
entries we overlap with.
As you can see the old hash function ended up with all 1 million
entries in their own bucket, whereas the new one they are only
distributed among ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup".
The reason for this hash regression is two-fold:
- On 64-bit architectures the down-mixing of the original 64-bit
word-at-a-time hash into the final 32-bit hash value is very
simplistic and suboptimal, and just adds the two 32-bit parts
together.
In particular, because there is no bit shuffling and the mixing
boundary is also a byte boundary, similar character patterns in the
low and high word easily end up just canceling each other out.
- the old byte-at-a-time hash mixed each byte into the final hash as it
hashed the path component name, resulting in the low bits of the hash
generally being a good source of hash data. That is not true for the
word-at-a-time case, and the hash data is distributed among all the
bits.
The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible. We already have the
"hash_32|64()" functions to do that.
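The difference between the additive down-mix and a multiplicative fold can be sketched in user space as below. This is an illustration only: `fold_add` and `fold_mul` are hypothetical names, and the multiplier is the golden-ratio constant assumed from later kernels' hash_64(), not necessarily the one in use when this patch was written.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: multiplier taken from later kernels' hash_64(). */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* The simplistic down-mix described above: add the two 32-bit halves. */
static uint32_t fold_add(uint64_t hash)
{
	return (uint32_t)hash + (uint32_t)(hash >> 32);
}

/* Multiplicative fold: multiply, then keep the top 'bits' bits. */
static uint32_t fold_mul(uint64_t hash, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)((hash * GOLDEN_RATIO_64) >> (64 - bits));
}
```

Because addition is commutative, any two 64-bit values whose halves are swapped collide under `fold_add` (for example 0x0000000100000002 and 0x0000000200000001 both fold to 3), whereas the multiply spreads every input bit into the high bits that are kept.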
Reported-by: Josef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-13 11:30:10 -07:00
#include <linux/hash.h>
fs/namei.c: Improve dcache hash function
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
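The round function and its reversibility can be tried out in user space with the sketch below. The rotate constants 12 and 45 are an assumption (the message above only says they were found by experiment), and `hash_mix`, `hash_unmix` and `struct hash_state` are illustrative names. The inverse assumes the input word is known; it undoes the multiply by 9 with the multiplicative inverse of 9 modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed rotate constants for the 64-bit case. */
#define K1 12
#define K2 45
/* 9 * INV9 == 1 (mod 2^64) */
#define INV9 0x8E38E38E38E38E39ull

struct hash_state { uint64_t x, y; };

static uint64_t rol64(uint64_t w, unsigned int s)
{
	assert(s > 0 && s < 64);
	return (w << s) | (w >> (64 - s));
}

static uint64_t ror64(uint64_t w, unsigned int s)
{
	assert(s > 0 && s < 64);
	return (w >> s) | (w << (64 - s));
}

/* One round of the mixing function quoted above. */
static void hash_mix(struct hash_state *st, uint64_t input)
{
	st->x ^= input;
	st->y ^= st->x;
	st->x = rol64(st->x, K1);
	st->x += st->y;
	st->y = rol64(st->y, K2);
	st->y *= 9;
}

/* Undo one round, given the input word that was mixed in. */
static void hash_unmix(struct hash_state *st, uint64_t input)
{
	st->y *= INV9;
	st->y = ror64(st->y, K2);
	st->x -= st->y;
	st->x = ror64(st->x, K1);
	st->y ^= st->x;
	st->x ^= input;
}
```

Mixing an all-zero word into an all-zero state leaves the state all-zero, which is the property the message says hash_name() relies on.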
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 07:43:58 -04:00
#include <linux/bitops.h>
2016-07-23 11:20:44 -05:00
#include <linux/init_task.h>
2016-12-24 11:46:01 -08:00
#include <linux/uaccess.h>
2005-04-16 15:20:36 -07:00
2009-12-04 15:47:36 -05:00
#include "internal.h"
2011-11-24 18:22:03 -05:00
#include "mount.h"
2009-12-04 15:47:36 -05:00
2005-04-16 15:20:36 -07:00
/* [Feb-1997 T. Schoebel-Theuer]
 * Fundamental changes in the pathname lookup mechanisms (namei)
 * were necessary because of omirr.  The reason is that omirr needs
 * to know the _real_ pathname, not the user-supplied one, in case
 * of symlinks (and also when transname replacements occur).
 *
 * The new code replaces the old recursive symlink resolution with
 * an iterative one (in case of non-nested symlink chains).  It does
 * this with calls to <fs>_follow_link().
 * As a side effect, dir_namei(), _namei() and follow_link() are now
 * replaced with a single function lookup_dentry() that can handle all
 * the special cases of the former code.
 *
 * With the new dcache, the pathname is stored at each inode, at least as
 * long as the refcount of the inode is positive.  As a side effect, the
 * size of the dcache depends on the inode cache and thus is dynamic.
 *
 * [29-Apr-1998 C. Scott Ananian] Updated above description of symlink
 * resolution to correspond with current state of the code.
 *
 * Note that the symlink resolution is not *completely* iterative.
 * There is still a significant amount of tail- and mid- recursion in
 * the algorithm.  Also, note that <fs>_readlink() is not used in
 * lookup_dentry(): lookup_dentry() on the result of <fs>_readlink()
 * may return different results than <fs>_follow_link().  Many virtual
 * filesystems (including /proc) exhibit this behavior.
 */
/* [24-Feb-97 T. Schoebel-Theuer] Side effects caused by new implementation:
 * New symlink semantics: when open() is called with flags O_CREAT | O_EXCL
 * and the name already exists in form of a symlink, try to create the new
 * name indicated by the symlink. The old code always complained that the
 * name already exists, due to not following the symlink even if its target
 * is nonexistent.  The new semantics affects also mknod() and link() when
2011-03-30 22:57:33 -03:00
 * the name is a symlink pointing to a non-existent name.
2005-04-16 15:20:36 -07:00
 *
 * I don't know which semantics is the right one, since I have no access
 * to standards. But I found by trial that HP-UX 9.0 has the full "new"
 * semantics implemented, while SunOS 4.1.1 and Solaris (SunOS 5.4) have the
 * "old" one. Personally, I think the new semantics is much more logical.
 * Note that "ln old new" where "new" is a symlink pointing to a non-existing
 * file does succeed in both HP-UX and SunOs, but not in Solaris
 * and in the old Linux semantics.
 */
/* [16-Dec-97 Kevin Buhr] For security reasons, we change some symlink
 * semantics.  See the comments in "open_namei" and "do_link" below.
 *
 * [10-Sep-98 Alan Modra] Another symlink change.
 */
/* [Feb-Apr 2000 AV] Complete rewrite. Rules for symlinks:
 *	inside the path - always follow.
 *	in the last component in creation/removal/renaming - never follow.
 *	if LOOKUP_FOLLOW passed - follow.
 *	if the pathname has trailing slashes - follow.
 *	otherwise - don't follow.
 * (applied in that order).
 *
 * [Jun 2000 AV] Inconsistent behaviour of open() in case if flags==O_CREAT
 * restored for 2.4. This is the last surviving part of old 4.2BSD bug.
 * During the 2.4 we need to fix the userland stuff depending on it -
 * hopefully we will be able to get rid of that wart in 2.5. So far only
 * XEmacs seems to be relying on it...
 */
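/*
 * The ordered symlink rules in the comment above can be read as a simple
 * decision function. The sketch below is a user-space illustration only;
 * struct symlink_ctx and should_follow() are hypothetical names and do not
 * exist in the kernel.
 */

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of where a symlink was met during a lookup. */
struct symlink_ctx {
	bool last_component;		/* false: symlink is inside the path */
	bool create_remove_rename;	/* creation/removal/renaming operation */
	bool lookup_follow;		/* caller passed LOOKUP_FOLLOW */
	bool trailing_slashes;		/* pathname ends in slashes */
};

/* Apply the rules in the order given in the comment; first match wins. */
static bool should_follow(const struct symlink_ctx *c)
{
	assert(c);

	if (!c->last_component)
		return true;		/* inside the path: always */
	if (c->create_remove_rename)
		return false;		/* last component of create/remove/rename */
	if (c->lookup_follow)
		return true;
	if (c->trailing_slashes)
		return true;
	return false;			/* otherwise: don't follow */
}
```

Because the rules are applied in order, a creation on the last component is never followed even if LOOKUP_FOLLOW was passed.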
/*
 * [Sep 2001 AV] Single-semaphore locking scheme (kudos to David Holland)
2006-03-23 03:00:33 -08:00
 * implemented.  Let's see if raised priority of ->s_vfs_rename_mutex gives
2005-04-16 15:20:36 -07:00
 * any extra contention...
 */
/* In order to reduce some races, while at the same time doing additional
 * checking and hopefully speeding things up, we copy filenames to the
 * kernel data space before using them..
 *
 * POSIX.1 2.4: an empty pathname is invalid (ENOENT).
 * PATH_MAX includes the nul terminator --RR.
 */
2012-10-10 15:25:28 -04:00
2015-02-22 20:07:13 -05:00
#define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
2012-10-10 16:43:13 -04:00
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
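The naming rule in the preceding paragraph can be illustrated with a small user-space helper. This is a sketch of the two forms described above only; `execveat_display_name` is a hypothetical name, and cases the paragraph does not cover (such as absolute filenames) are not handled.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the name a program started through a
 * directory/file descriptor would see, per the description above.
 * Returns the length written, or -1 if buf is too small.
 */
static int execveat_display_name(int fd, const char *filename,
				 char *buf, size_t size)
{
	int n;

	assert(fd >= 0 && filename && buf);

	if (filename[0] == '\0')
		n = snprintf(buf, size, "/dev/fd/%d", fd);
	else
		n = snprintf(buf, size, "/dev/fd/%d/%s", fd, filename);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For fd 3 and an empty filename this yields "/dev/fd/3", which is why a script interpreter needs /proc (or an equivalent /dev/fd) to reopen the script.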
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
struct filename *
2012-10-10 15:25:28 -04:00
getname_flags(const char __user *filename, int flags, int *empty)
{
2015-02-22 19:38:03 -05:00
	struct filename *result;
2012-10-10 16:43:13 -04:00
	char *kname;
2015-02-22 19:38:03 -05:00
	int len;
2012-01-03 14:23:08 -05:00
2012-10-10 15:25:28 -04:00
	result = audit_reusename(filename);
	if (result)
		return result;
2012-10-10 16:43:13 -04:00
	result = __getname();
2012-04-28 14:38:32 -07:00
	if (unlikely(!result))
2012-01-03 14:23:08 -05:00
		return ERR_PTR(-ENOMEM);
2012-10-10 16:43:13 -04:00
	/*
	 * First, try to embed the struct filename inside the names_cache
	 * allocation
	 */
2015-02-22 20:07:13 -05:00
	kname = (char *)result->iname;
2012-10-10 15:25:28 -04:00
	result->name = kname;
2012-10-10 16:43:13 -04:00
2015-02-22 19:38:03 -05:00
	len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
2012-10-10 15:25:28 -04:00
	if (unlikely(len < 0)) {
2015-02-22 19:38:03 -05:00
		__putname(result);
		return ERR_PTR(len);
2012-10-10 15:25:28 -04:00
	}
2012-04-28 14:38:32 -07:00
2012-10-10 16:43:13 -04:00
	/*
	 * Uh-oh. We have a name that's approaching PATH_MAX. Allocate a
	 * separate struct filename so we can dedicate the entire
	 * names_cache allocation for the pathname, and re-do the copy from
	 * userland.
	 */
2015-02-22 19:38:03 -05:00
	if (unlikely(len == EMBEDDED_NAME_MAX)) {
2015-02-22 20:07:13 -05:00
		const size_t size = offsetof(struct filename, iname[1]);
2012-10-10 16:43:13 -04:00
		kname = (char *)result;
2015-02-22 20:07:13 -05:00
		/*
		 * size is chosen that way to guarantee that
		 * result->iname[0] is within the same object and that
		 * kname can't be equal to result->iname, no matter what.
		 */
		result = kzalloc(size, GFP_KERNEL);
2015-02-22 19:38:03 -05:00
		if (unlikely(!result)) {
			__putname(kname);
			return ERR_PTR(-ENOMEM);
2012-10-10 16:43:13 -04:00
		}
		result->name = kname;
2015-02-22 19:38:03 -05:00
		len = strncpy_from_user(kname, filename, PATH_MAX);
		if (unlikely(len < 0)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(len);
		}
		if (unlikely(len == PATH_MAX)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(-ENAMETOOLONG);
		}
2012-10-10 16:43:13 -04:00
	}
2015-02-22 19:38:03 -05:00
	result->refcnt = 1;
2012-04-28 14:38:32 -07:00
	/* The empty path is special. */
	if (unlikely(!len)) {
		if (empty)
2012-01-03 14:23:08 -05:00
			*empty = 1;
2015-02-22 19:38:03 -05:00
		if (!(flags & LOOKUP_EMPTY)) {
			putname(result);
			return ERR_PTR(-ENOENT);
		}
2005-04-16 15:20:36 -07:00
	}
2012-04-28 14:38:32 -07:00
2012-10-10 16:43:13 -04:00
	result->uptr = filename;
2014-02-05 12:54:53 -08:00
	result->aname = NULL;
2012-10-10 16:43:13 -04:00
	audit_getname(result);
	return result;
2005-04-16 15:20:36 -07:00
}
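/*
 * The storage decision made by getname_flags() above can be modelled in user
 * space as below. This is a simplified sketch: struct filename_model is a
 * hypothetical stand-in with fewer fields than the kernel's struct filename,
 * so the embedded limit it computes differs from the real EMBEDDED_NAME_MAX.
 */

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#ifndef PATH_MAX
#define PATH_MAX 4096	/* assumption: Linux value, if limits.h lacks it */
#endif

/* Simplified stand-in for the kernel's struct filename. */
struct filename_model {
	const char *name;	/* points at iname or at a separate buffer */
	int refcnt;
	const char iname[];	/* embedded name storage */
};

#define EMBEDDED_NAME_MAX_MODEL \
	(PATH_MAX - offsetof(struct filename_model, iname))

enum name_storage { NAME_EMBEDDED, NAME_SEPARATE, NAME_TOO_LONG };

/* len is the string length, excluding the terminating NUL. */
static enum name_storage classify_name(size_t len)
{
	assert(EMBEDDED_NAME_MAX_MODEL < PATH_MAX);

	if (len < EMBEDDED_NAME_MAX_MODEL)
		return NAME_EMBEDDED;	/* fits next to the struct header */
	if (len < PATH_MAX)
		return NAME_SEPARATE;	/* header moves to its own allocation */
	return NAME_TOO_LONG;		/* -ENAMETOOLONG in the kernel */
}
```

The boundaries mirror the two `strncpy_from_user()` checks: a copy that fills EMBEDDED_NAME_MAX bytes triggers the retry with the whole buffer, and one that fills PATH_MAX bytes is rejected.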
2021-07-08 13:34:42 +07:00
struct filename *
getname_uflags(const char __user *filename, int uflags)
{
	int flags = (uflags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
	return getname_flags(filename, flags, NULL);
}
2012-10-10 15:25:28 -04:00
struct filename *
getname(const char __user *filename)
2011-03-14 18:56:51 -04:00
{
2012-03-22 16:10:40 -07:00
	return getname_flags(filename, 0, NULL);
2011-03-14 18:56:51 -04:00
}
2014-02-05 12:54:53 -08:00
struct filename *
getname_kernel(const char *filename)
{
	struct filename *result;
2015-01-21 23:59:56 -05:00
	int len = strlen(filename) + 1;
2014-02-05 12:54:53 -08:00
	result = __getname();
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);
2015-01-21 23:59:56 -05:00
	if (len <= EMBEDDED_NAME_MAX) {
2015-02-22 20:07:13 -05:00
		result->name = (char *)result->iname;
2015-01-21 23:59:56 -05:00
	} else if (len <= PATH_MAX) {
2018-04-08 11:57:10 -04:00
		const size_t size = offsetof(struct filename, iname[1]);
2015-01-21 23:59:56 -05:00
		struct filename *tmp;
2018-04-08 11:57:10 -04:00
		tmp = kmalloc(size, GFP_KERNEL);
2015-01-21 23:59:56 -05:00
		if (unlikely(!tmp)) {
			__putname(result);
			return ERR_PTR(-ENOMEM);
		}
		tmp->name = (char *)result;
		result = tmp;
	} else {
		__putname(result);
		return ERR_PTR(-ENAMETOOLONG);
	}
	memcpy((char *)result->name, filename, len);
2014-02-05 12:54:53 -08:00
	result->uptr = NULL;
	result->aname = NULL;
2015-01-22 00:00:23 -05:00
	result->refcnt = 1;
2015-01-22 00:00:10 -05:00
	audit_getname(result);
2014-02-05 12:54:53 -08:00
	return result;
}
2012-10-10 15:25:28 -04:00
void putname(struct filename *name)
2005-04-16 15:20:36 -07:00
{
2021-09-07 16:14:05 -04:00
	if (IS_ERR(name))
2021-07-08 13:34:37 +07:00
		return;
2015-01-22 00:00:23 -05:00
	BUG_ON(name->refcnt <= 0);
	if (--name->refcnt > 0)
		return;
2015-02-22 20:07:13 -05:00
	if (name->name != name->iname) {
2015-01-22 00:00:23 -05:00
		__putname(name->name);
		kfree(name);
	} else
		__putname(name);
2005-04-16 15:20:36 -07:00
}
2021-01-21 14:19:24 +01:00
/**
 * check_acl - perform ACL permission checking
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * This function performs the ACL permission checking. Since this function
 * retrieves POSIX acls it needs to know whether it is called from a blocking or
 * non-blocking context and thus cares about the MAY_NOT_BLOCK bit.
 *
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
 */
static int check_acl(struct user_namespace *mnt_userns,
		     struct inode *inode, int mask)
2011-07-22 19:30:19 -07:00
{
2011-07-25 22:47:03 -07:00
#ifdef CONFIG_FS_POSIX_ACL
2011-07-22 19:30:19 -07:00
	struct posix_acl *acl;
	if (mask & MAY_NOT_BLOCK) {
2011-08-02 21:32:13 -04:00
		acl = get_cached_acl_rcu(inode, ACL_TYPE_ACCESS);
		if (!acl)
2011-07-22 19:30:19 -07:00
			return -EAGAIN;
2011-08-02 21:32:13 -04:00
		/* no ->get_acl() calls in RCU mode... */
2016-03-24 14:38:37 +01:00
		if (is_uncached_acl(acl))
2011-08-02 21:32:13 -04:00
			return -ECHILD;
2021-01-21 14:19:24 +01:00
		return posix_acl_permission(mnt_userns, inode, acl, mask);
2011-07-22 19:30:19 -07:00
	}
2013-12-20 05:16:38 -08:00
	acl = get_acl(inode, ACL_TYPE_ACCESS);
	if (IS_ERR(acl))
		return PTR_ERR(acl);
2011-07-22 19:30:19 -07:00
	if (acl) {
2021-01-21 14:19:24 +01:00
		int error = posix_acl_permission(mnt_userns, inode, acl, mask);
2011-07-22 19:30:19 -07:00
		posix_acl_release(acl);
		return error;
	}
2011-07-25 22:47:03 -07:00
#endif
2011-07-22 19:30:19 -07:00
	return -EAGAIN;
}
2021-01-21 14:19:24 +01:00
/**
 * acl_permission_check - perform basic UNIX permission checking
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * This function performs the basic UNIX permission checking. Since this
 * function may retrieve POSIX acls it needs to know whether it is called from a
 * blocking or non-blocking context and thus cares about the MAY_NOT_BLOCK bit.
vfs: do not do group lookup when not necessary
Rasmus Villemoes points out that the 'in_group_p()' tests can be a
noticeable expense, and often completely unnecessary. A common
situation is that the 'group' bits are the same as the 'other' bits
wrt the permissions we want to test.
So rewrite 'acl_permission_check()' to not bother checking for group
ownership when the permission check doesn't care.
For example, if we're asking for read permissions, and both 'group' and
'other' allow reading, there's really no reason to check if we're part
of the group or not: either way, we'll allow it.
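The test for "the permission check doesn't care" can be sketched as a bit comparison between the group and other fields of the mode. This is a user-space illustration under the assumption that the owner case has already been handled; `group_lookup_needed` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * mode: file permission bits (e.g. 0644)
 * mask: requested permissions as rwx bits (read=4, write=2, exec=1)
 *
 * Returns true only if the group and other bits differ for at least one
 * requested permission, i.e. group membership could change the outcome.
 */
static bool group_lookup_needed(unsigned int mode, unsigned int mask)
{
	assert((mask & ~7u) == 0);

	/* low three bits of (mode ^ (mode >> 3)) are group XOR other */
	return (mask & (mode ^ (mode >> 3))) != 0;
}
```

For 0644 or 0755 files, read and execute requests never need the group lookup; a 0640 file does, because group may read while other may not.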
Rasmus says:
"On a bog-standard Ubuntu 20.04 install, a workload consisting of
compiling lots of userspace programs (i.e., calling lots of
short-lived programs that all need to get their shared libs mapped in,
and the compilers poking around looking for system headers - lots of
/usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around
0.1% according to perf top.
System-installed files are almost always 0755 (directories and
binaries) or 0644, so in most cases, we can avoid the binary search
and the cost of pulling the cred->groups array and in_group_p() .text
into the cpu cache"
Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 13:40:45 -07:00
*
2021-01-21 14:19:24 +01:00
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
2005-04-16 15:20:36 -07:00
*/
2021-01-21 14:19:24 +01:00
static int acl_permission_check(struct user_namespace *mnt_userns,
				struct inode *inode, int mask)
2005-04-16 15:20:36 -07:00
{
2011-05-13 11:51:01 -07:00
	unsigned int mode = inode->i_mode;
2021-01-21 14:19:24 +01:00
	kuid_t i_uid;
2005-04-16 15:20:36 -07:00
2020-06-05 13:40:45 -07:00
	/* Are we the owner? If so, ACL's don't matter */
2021-01-21 14:19:24 +01:00
	i_uid = i_uid_into_mnt(mnt_userns, inode);
	if (likely(uid_eq(current_fsuid(), i_uid))) {
2020-06-05 13:40:45 -07:00
		mask &= 7;
2005-04-16 15:20:36 -07:00
		mode >>= 6;
2020-06-05 13:40:45 -07:00
		return (mask & ~mode) ? -EACCES : 0;
	}
2005-04-16 15:20:36 -07:00
2020-06-05 13:40:45 -07:00
	/* Do we have ACL's? */
	if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
2021-01-21 14:19:24 +01:00
		int error = check_acl(mnt_userns, inode, mask);
2020-06-05 13:40:45 -07:00
		if (error != -EAGAIN)
			return error;
2005-04-16 15:20:36 -07:00
	}
2020-06-05 13:40:45 -07:00
	/* Only RWX matters for group/other mode bits */
	mask &= 7;
2005-04-16 15:20:36 -07:00
/*
2020-06-05 13:40:45 -07:00
	 * Are the group permissions different from
	 * the other permissions in the bits we care
	 * about? Need to check group ownership if so.
2005-04-16 15:20:36 -07:00
*/
2020-06-05 13:40:45 -07:00
	if (mask & (mode ^ (mode >> 3))) {
2021-01-21 14:19:24 +01:00
		kgid_t kgid = i_gid_into_mnt(mnt_userns, inode);
		if (in_group_p(kgid))
2020-06-05 13:40:45 -07:00
			mode >>= 3;
	}
	/* Bits in 'mode' clear that we require? */
	return (mask & ~mode) ? -EACCES : 0;
2009-08-28 11:51:25 -07:00
}
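The group-lookup shortcut in acl_permission_check() can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel implementation: the ACL step and the idmapping are left out, and the names `needs_group_lookup()` and `mode_bits_check()`, the `is_owner` and `in_group` parameters and the local `MAY_*` enum are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's MAY_* bits (hypothetical, userspace only). */
enum { MAY_EXEC = 1, MAY_WRITE = 2, MAY_READ = 4 };

/*
 * True when the group and other permission bits differ in the bits
 * that 'mask' asks about, so group membership can change the answer.
 */
static bool needs_group_lookup(unsigned int mode, int mask)
{
	mask &= 7;
	return (mask & (mode ^ (mode >> 3))) != 0;
}

/*
 * Plain mode-bit check: owner bits if is_owner, else group bits only
 * when they matter and the caller is in the group, else other bits.
 */
static int mode_bits_check(unsigned int mode, int mask,
			   bool is_owner, bool in_group)
{
	mask &= 7;
	if (is_owner) {
		mode >>= 6;
		return (mask & ~mode) ? -EACCES : 0;
	}
	if (needs_group_lookup(mode, mask) && in_group)
		mode >>= 3;
	return (mask & ~mode) ? -EACCES : 0;
}
```

For a 0644 file and a read request, `needs_group_lookup()` is false, so the membership test (in_group_p() in the kernel) would be skipped. For 0640 it is true, and the result depends on `in_group`.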
/**
2011-01-07 17:49:58 +11:00
 * generic_permission - check for access rights on a Posix-like filesystem
2021-01-21 14:19:24 +01:00
 * @mnt_userns:	user namespace of the mount the inode was found from
2009-08-28 11:51:25 -07:00
 * @inode:	inode to check access rights for
2020-06-05 13:40:45 -07:00
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC,
 *		%MAY_NOT_BLOCK ...)
2009-08-28 11:51:25 -07:00
 *
 * Used to check for read/write/execute permissions on a file.
 * We use "fsuid" for this, letting us set arbitrary permissions
 * for filesystem access without changing the "normal" uids which
2011-01-07 17:49:58 +11:00
 * are used for other things.
 *
 * generic_permission is rcu-walk aware. It returns -ECHILD in case an rcu-walk
 * request cannot be satisfied (eg. requires blocking or too much complexity).
 * It would then be called again in ref-walk mode.
2021-01-21 14:19:24 +01:00
 *
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
2009-08-28 11:51:25 -07:00
*/
2021-01-21 14:19:24 +01:00
int generic_permission(struct user_namespace *mnt_userns, struct inode *inode,
		       int mask)
2009-08-28 11:51:25 -07:00
{
	int ret;
	/*
2011-10-23 23:13:33 +05:30
	 * Do the basic permission checks.
2009-08-28 11:51:25 -07:00
	 */
2021-01-21 14:19:24 +01:00
	ret = acl_permission_check(mnt_userns, inode, mask);
2009-08-28 11:51:25 -07:00
	if (ret != -EACCES)
		return ret;
2005-04-16 15:20:36 -07:00
2011-06-20 19:55:42 -04:00
	if (S_ISDIR(inode->i_mode)) {
		/* DACs are overridable for directories */
		if (!(mask & MAY_WRITE))
2021-01-21 14:19:24 +01:00
			if (capable_wrt_inode_uidgid(mnt_userns, inode,
2014-06-10 12:45:42 -07:00
						     CAP_DAC_READ_SEARCH))
2011-06-20 19:55:42 -04:00
				return 0;
2021-01-21 14:19:24 +01:00
		if (capable_wrt_inode_uidgid(mnt_userns, inode,
2021-01-21 14:19:23 +01:00
					     CAP_DAC_OVERRIDE))
2005-04-16 15:20:36 -07:00
			return 0;
2017-03-10 12:14:18 -05:00
		return -EACCES;
	}
2005-04-16 15:20:36 -07:00
	/*
	 * Searching includes executable on directories, else just read.
	 */
2009-12-29 14:50:19 -06:00
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
2011-06-20 19:55:42 -04:00
	if (mask == MAY_READ)
2021-01-21 14:19:24 +01:00
		if (capable_wrt_inode_uidgid(mnt_userns, inode,
2021-01-21 14:19:23 +01:00
					     CAP_DAC_READ_SEARCH))
2005-04-16 15:20:36 -07:00
			return 0;
2017-03-10 12:14:18 -05:00
	/*
	 * Read/write DACs are always overridable.
	 * Executable DACs are overridable when there is
	 * at least one exec bit set.
	 */
	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
2021-01-21 14:19:24 +01:00
		if (capable_wrt_inode_uidgid(mnt_userns, inode,
2021-01-21 14:19:23 +01:00
					     CAP_DAC_OVERRIDE))
2017-03-10 12:14:18 -05:00
			return 0;
2005-04-16 15:20:36 -07:00
	return -EACCES;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(generic_permission);
2005-04-16 15:20:36 -07:00
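The capability fallback in generic_permission() is easier to follow as a pure decision function. The sketch below is a userspace model under stated assumptions, not kernel code: the two booleans stand in for the results of capable_wrt_inode_uidgid() with CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE, and the function name `dac_override_check()` and the local `MAY_*` enum are hypothetical. It models only the path taken after the basic check has returned -EACCES.

```c
#include <errno.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's MAY_* bits (hypothetical, userspace only). */
enum { MAY_EXEC = 1, MAY_WRITE = 2, MAY_READ = 4 };

/*
 * Decide whether a capability overrides a failed mode-bit check.
 * cap_read_search / cap_override model the two capability lookups.
 */
static int dac_override_check(bool is_dir, int mask, unsigned int mode,
			      bool cap_read_search, bool cap_override)
{
	if (is_dir) {
		/* Directories: read/search without write needs only READ_SEARCH. */
		if (!(mask & MAY_WRITE) && cap_read_search)
			return 0;
		if (cap_override)
			return 0;
		return -EACCES;
	}

	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	if (mask == MAY_READ && cap_read_search)
		return 0;

	/* Exec is overridable only if at least one exec bit (0111) is set. */
	if ((!(mask & MAY_EXEC) || (mode & 0111)) && cap_override)
		return 0;
	return -EACCES;
}
```

The model makes one property visible: even with the override capability, executing a regular file whose mode has no exec bit at all is still refused.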
2021-01-21 14:19:24 +01:00
/**
 * do_inode_permission - UNIX permission checking
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
2011-08-06 22:45:50 -07:00
 * We _really_ want to just do "generic_permission()" without
 * even looking at the inode->i_op values. So we keep a cache
 * flag in inode->i_opflags, that says "this has no special
 * permission function, use the fast case".
 */
2021-01-21 14:19:24 +01:00
static inline int do_inode_permission(struct user_namespace *mnt_userns,
				      struct inode *inode, int mask)
2011-08-06 22:45:50 -07:00
{
	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
		if (likely(inode->i_op->permission))
2021-01-21 14:19:43 +01:00
			return inode->i_op->permission(mnt_userns, inode, mask);
2011-08-06 22:45:50 -07:00
		/* This gets set once for the inode lifetime */
		spin_lock(&inode->i_lock);
		inode->i_opflags |= IOP_FASTPERM;
		spin_unlock(&inode->i_lock);
	}
2021-01-21 14:19:24 +01:00
	return generic_permission(mnt_userns, inode, mask);
2011-08-06 22:45:50 -07:00
}
2012-06-25 12:55:46 +01:00
/**
 * sb_permission - Check superblock-level permissions
 * @sb: Superblock of inode to check permission on
2012-08-18 17:39:25 -07:00
 * @inode: Inode to check permission on
2012-06-25 12:55:46 +01:00
 * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
 *
 * Separate out file-system wide checks from inode-specific permission checks.
 */
static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
{
	if (unlikely(mask & MAY_WRITE)) {
		umode_t mode = inode->i_mode;
		/* Nobody gets write access to a read-only fs. */
2017-07-17 08:45:34 +01:00
		if (sb_rdonly(sb) && (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
2012-06-25 12:55:46 +01:00
			return -EROFS;
	}
	return 0;
}
/**
 * inode_permission - Check for access rights to a given inode
2021-01-21 14:19:24 +01:00
 * @mnt_userns:	User namespace of the mount the inode was found from
 * @inode:	Inode to check permission on
 * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
2012-06-25 12:55:46 +01:00
 *
 * Check for read/write/execute permissions on an inode. We use fs[ug]id for
 * this, letting us set arbitrary permissions for filesystem access without
 * changing the "normal" UIDs which are used for other things.
 *
 * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
 */
2021-01-21 14:19:24 +01:00
int inode_permission(struct user_namespace *mnt_userns,
		     struct inode *inode, int mask)
2012-06-25 12:55:46 +01:00
{
	int retval;
	retval = sb_permission(inode->i_sb, inode, mask);
	if (retval)
		return retval;
2018-01-16 21:44:24 -08:00
	if (unlikely(mask & MAY_WRITE)) {
		/*
		 * Nobody gets write access to an immutable file.
		 */
		if (IS_IMMUTABLE(inode))
			return -EPERM;
		/*
		 * Updating mtime will likely cause i_uid and i_gid to be
		 * written back improperly if their true value is unknown
		 * to the vfs.
		 */
2021-01-21 14:19:31 +01:00
		if (HAS_UNMAPPED_ID(mnt_userns, inode))
2018-01-16 21:44:24 -08:00
			return -EACCES;
	}
2021-01-21 14:19:24 +01:00
	retval = do_inode_permission(mnt_userns, inode, mask);
2018-01-16 21:44:24 -08:00
	if (retval)
		return retval;
	retval = devcgroup_inode_permission(inode, mask);
	if (retval)
		return retval;
	return security_inode_permission(inode, mask);
2012-06-25 12:55:46 +01:00
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(inode_permission);
2012-06-25 12:55:46 +01:00
2008-02-14 19:34:38 -08:00
/**
 * path_get - get a reference to a path
 * @path: path to get the reference to
 *
 * Given a path increment the reference count to the dentry and the vfsmount.
 */
2013-03-01 23:51:07 -05:00
void path_get(const struct path *path)
2008-02-14 19:34:38 -08:00
{
	mntget(path->mnt);
	dget(path->dentry);
}
EXPORT_SYMBOL(path_get);
2008-02-14 19:34:35 -08:00
/**
 * path_put - put a reference to a path
 * @path: path to put the reference to
 *
 * Given a path decrement the reference count to the dentry and the vfsmount.
 */
2013-03-01 23:51:07 -05:00
void path_put(const struct path *path)
2005-04-16 15:20:36 -07:00
{
2008-02-14 19:34:35 -08:00
	dput(path->dentry);
	mntput(path->mnt);
2005-04-16 15:20:36 -07:00
}
2008-02-14 19:34:35 -08:00
EXPORT_SYMBOL(path_put);
2005-04-16 15:20:36 -07:00
2015-05-02 07:16:16 -04:00
#define EMBEDDED_LEVELS 2
2014-11-01 19:30:41 -04:00
struct nameidata {
	struct path	path;
2015-05-06 16:01:56 -04:00
	struct qstr	last;
2014-11-01 19:30:41 -04:00
	struct path	root;
	struct inode	*inode; /* path.dentry.d_inode */
2021-04-01 22:03:41 -04:00
	unsigned int	flags, state;
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-04 18:12:39 -04:00
	unsigned	seq, next_seq, m_seq, r_seq;
2014-11-01 19:30:41 -04:00
	int		last_type;
	unsigned	depth;
2015-03-23 13:37:38 +11:00
	int		total_link_count;
2015-05-02 19:38:35 -04:00
	struct saved {
		struct path link;
2015-12-29 15:58:39 -05:00
		struct delayed_call done;
2015-05-02 19:38:35 -04:00
		const char *name;
2015-05-08 13:23:53 -04:00
		unsigned seq;
2015-05-02 07:16:16 -04:00
	} *stack, internal[EMBEDDED_LEVELS];
2015-05-13 07:28:08 -04:00
	struct filename	*name;
	struct nameidata *saved;
	unsigned	root_seq;
	int		dfd;
2020-03-05 11:34:48 -05:00
	kuid_t		dir_uid;
	umode_t		dir_mode;
2016-10-28 01:22:25 -07:00
} __randomize_layout;
2014-11-01 19:30:41 -04:00
2021-04-01 22:03:41 -04:00
#define ND_ROOT_PRESET 1
#define ND_ROOT_GRABBED 2
#define ND_JUMPED 4
2021-04-01 22:28:03 -04:00
static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
2015-05-02 07:16:16 -04:00
{
2015-03-23 13:37:38 +11:00
	struct nameidata *old = current->nameidata;
	p->stack = p->internal;
2021-04-03 16:49:44 -04:00
	p->depth = 0;
2015-05-12 18:43:07 -04:00
	p->dfd = dfd;
	p->name = name;
2021-04-06 12:33:07 -04:00
	p->path.mnt = NULL;
	p->path.dentry = NULL;
2015-03-23 13:37:38 +11:00
	p->total_link_count = old ? old->total_link_count : 0;
2015-05-13 07:28:08 -04:00
	p->saved = old;
2015-03-23 13:37:38 +11:00
	current->nameidata = p;
2015-05-02 07:16:16 -04:00
}
2021-04-01 22:28:03 -04:00
static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
			  const struct path *root)
{
	__set_nameidata(p, dfd, name);
	p->state = 0;
	if (unlikely(root)) {
		p->state = ND_ROOT_PRESET;
		p->root = *root;
	}
}
2015-05-13 07:28:08 -04:00
static void restore_nameidata(void)
2015-05-02 07:16:16 -04:00
{
2015-05-13 07:28:08 -04:00
	struct nameidata *now = current->nameidata, *old = now->saved;
2015-03-23 13:37:38 +11:00
	current->nameidata = old;
	if (old)
		old->total_link_count = now->total_link_count;
2015-12-05 21:06:33 -05:00
	if (now->stack != now->internal)
2015-03-23 13:37:38 +11:00
		kfree(now->stack);
2015-05-02 07:16:16 -04:00
}
2020-03-03 11:43:55 -05:00
static bool nd_alloc_stack(struct nameidata *nd)
2015-05-02 07:16:16 -04:00
{
2015-05-09 13:04:24 -04:00
	struct saved *p;
2020-03-03 11:43:55 -05:00
	p = kmalloc_array(MAXSYMLINKS, sizeof(struct saved),
			 nd->flags & LOOKUP_RCU ? GFP_ATOMIC : GFP_KERNEL);
	if (unlikely(!p))
		return false;
2015-05-02 07:16:16 -04:00
	memcpy(p, nd->internal, sizeof(nd->internal));
	nd->stack = p;
2020-03-03 11:43:55 -05:00
	return true;
2015-05-02 07:16:16 -04:00
}
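The `stack`/`internal` pair in struct nameidata is a small embedded buffer with a heap fallback: nd_alloc_stack() copies the embedded entries into a larger allocation, and restore_nameidata() frees the stack only when it no longer points at `internal`. Below is a rough userspace sketch of that pattern. It uses calloc()/free() in place of kmalloc_array()/kfree(), and `struct slot`, `struct walker`, `walker_init()`, `walker_grow()`, `walker_release()`, `EMBEDDED_SLOTS` and `MAX_SLOTS` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define EMBEDDED_SLOTS 2	/* plays the role of EMBEDDED_LEVELS */
#define MAX_SLOTS 40		/* stands in for MAXSYMLINKS (assumed value) */

struct slot { int id; };	/* placeholder for struct saved */

struct walker {
	struct slot *stack;			/* points at internal[] or heap */
	struct slot internal[EMBEDDED_SLOTS];	/* no allocation for shallow use */
};

static void walker_init(struct walker *w)
{
	memset(w->internal, 0, sizeof(w->internal));
	w->stack = w->internal;
}

/* Move from the embedded array to a heap array, keeping existing entries. */
static bool walker_grow(struct walker *w)
{
	struct slot *p;

	if (w->stack != w->internal)
		return true;	/* already on the heap */
	p = calloc(MAX_SLOTS, sizeof(*p));
	if (!p)
		return false;
	memcpy(p, w->internal, sizeof(w->internal));
	w->stack = p;
	return true;
}

/* Free only if we left the embedded array, then point back at it. */
static void walker_release(struct walker *w)
{
	if (w->stack != w->internal)
		free(w->stack);
	w->stack = w->internal;
}
```

The pointer comparison against `internal` is what lets the common shallow case avoid any allocation or free.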
2015-08-15 20:27:13 -05:00
/**
2020-02-24 15:53:19 -05:00
 * path_connected - Verify that a dentry is below mnt.mnt_root
2015-08-15 20:27:13 -05:00
 *
 * Rename can sometimes move a file or directory outside of a bind
 * mount, path_connected allows those cases to be detected.
 */
2020-02-24 15:53:19 -05:00
static bool path_connected(struct vfsmount *mnt, struct dentry *dentry)
2015-08-15 20:27:13 -05:00
{
2018-03-14 18:20:29 -05:00
	struct super_block *sb = mnt->mnt_sb;
2015-08-15 20:27:13 -05:00
2020-09-24 08:51:28 +02:00
	/* Bind mounts can have disconnected paths */
	if (mnt->mnt_root == sb->s_root)
2015-08-15 20:27:13 -05:00
		return true;
2020-02-24 15:53:19 -05:00
	return is_subdir(dentry, mnt->mnt_root);
2015-08-15 20:27:13 -05:00
}
2015-05-09 12:55:43 -04:00
static void drop_links(struct nameidata *nd)
{
	int i = nd->depth;
	while (i--) {
		struct saved *last = nd->stack + i;
2015-12-29 15:58:39 -05:00
		do_delayed_call(&last->done);
		clear_delayed_call(&last->done);
2015-05-09 12:55:43 -04:00
	}
}
2022-07-06 12:40:31 -04:00
static void leave_rcu(struct nameidata *nd)
{
	nd->flags &= ~LOOKUP_RCU;
2022-07-04 18:12:39 -04:00
	nd->seq = nd->next_seq = 0;
2022-07-06 12:40:31 -04:00
	rcu_read_unlock();
}
2015-05-09 12:55:43 -04:00
static void terminate_walk(struct nameidata *nd)
{
	drop_links(nd);
	if (!(nd->flags & LOOKUP_RCU)) {
		int i;
		path_put(&nd->path);
		for (i = 0; i < nd->depth; i++)
			path_put(&nd->stack[i].link);
2021-04-01 22:03:41 -04:00
		if (nd->state & ND_ROOT_GRABBED) {
2015-05-12 17:35:52 -04:00
			path_put(&nd->root);
2021-04-01 22:03:41 -04:00
			nd->state &= ~ND_ROOT_GRABBED;
2015-05-12 17:35:52 -04:00
		}
2015-05-09 12:55:43 -04:00
	} else {
2022-07-06 12:40:31 -04:00
		leave_rcu(nd);
2015-05-09 12:55:43 -04:00
	}
	nd->depth = 0;
2021-04-06 12:33:07 -04:00
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
2015-05-09 12:55:43 -04:00
}
/* path_put is needed afterwards regardless of success or failure */
2020-02-26 19:19:05 -05:00
static bool __legitimize_path(struct path *path, unsigned seq, unsigned mseq)
2015-05-09 12:55:43 -04:00
{
2020-02-26 19:19:05 -05:00
	int res = __legitimize_mnt(path->mnt, mseq);
2015-05-09 12:55:43 -04:00
	if (unlikely(res)) {
		if (res > 0)
			path->mnt = NULL;
		path->dentry = NULL;
		return false;
	}
	if (unlikely(!lockref_get_not_dead(&path->dentry->d_lockref))) {
		path->dentry = NULL;
		return false;
	}
	return !read_seqcount_retry(&path->dentry->d_seq, seq);
}
2020-02-26 19:19:05 -05:00
static inline bool legitimize_path(struct nameidata *nd,
			    struct path *path, unsigned seq)
{
2020-04-05 21:59:55 -04:00
	return __legitimize_path(path, seq, nd->m_seq);
2020-02-26 19:19:05 -05:00
}
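The dentry half of __legitimize_path() above (take a reference unless the object is already dead, then recheck the sampled sequence number) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct obj, get_not_dead() and legitimize() are hypothetical names, a negative count is assumed to mean "dead", and the mount handling done by __legitimize_mnt() is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a reference count and a sequence counter. */
struct obj {
	atomic_int count;	/* assumption: a negative value marks a dead object */
	atomic_uint seq;
};

/* Take a reference only if the object has not been marked dead. */
static bool get_not_dead(struct obj *o)
{
	int old = atomic_load(&o->count);

	while (old >= 0) {
		/* On failure, old is reloaded and the dead check is repeated. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Returns true only if a reference was taken AND the sequence number still
 * matches. *holds_ref reports whether a reference is held even on failure,
 * which is why the caller has to drop it in both cases.
 */
static bool legitimize(struct obj *o, unsigned sampled_seq, bool *holds_ref)
{
	assert(holds_ref != NULL);
	*holds_ref = get_not_dead(o);
	if (!*holds_ref)
		return false;
	return atomic_load(&o->seq) == sampled_seq;
}
```

The holds_ref flag mirrors the comment before __legitimize_path(): a failed sequence check still leaves a reference behind, so cleanup is needed whether or not the call succeeded.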
2015-05-09 12:55:43 -04:00
static bool legitimize_links(struct nameidata *nd)
{
	int i;
2021-02-15 12:03:23 -05:00
	if (unlikely(nd->flags & LOOKUP_CACHED)) {
		drop_links(nd);
		nd->depth = 0;
		return false;
	}
2015-05-09 12:55:43 -04:00
	for (i = 0; i < nd->depth; i++) {
		struct saved *last = nd->stack + i;
		if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
			drop_links(nd);
			nd->depth = i + 1;
			return false;
		}
	}
	return true;
}
2019-07-16 21:05:36 -04:00
static bool legitimize_root(struct nameidata *nd)
{
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.
This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).
/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.
With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.
/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:33 +11:00
/* Nothing to do if nd->root is zero or is managed by the VFS user. */
2021-04-01 22:03:41 -04:00
	if (!nd->root.mnt || (nd->state & ND_ROOT_PRESET))
2019-07-16 21:05:36 -04:00
		return true;
2021-04-01 22:03:41 -04:00
	nd->state |= ND_ROOT_GRABBED;
2019-07-16 21:05:36 -04:00
	return legitimize_path(nd, &nd->root, nd->root_seq);
}
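The LOOKUP_BENEATH commit message above says the flag is exposed to userspace through openat2(2). A minimal sketch of such a call follows, assuming a Linux system whose kernel headers provide <linux/openat2.h>; the userspace flag is spelled RESOLVE_BENEATH there. open_beneath() is a hypothetical helper name chosen for this example, and the raw syscall is used on the assumption that the C library has no openat2() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 __NR_openat2
#endif

/*
 * Hypothetical helper: open path read-only, refusing any resolution that
 * would leave the directory referred to by dirfd.
 * Returns a file descriptor on success or a negative errno value on failure.
 */
static int open_beneath(int dirfd, const char *path)
{
	struct open_how how;
	long fd;

	assert(path != NULL);

	/* The structure must be zeroed: unknown non-zero fields are rejected. */
	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_BENEATH;

	fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
	return fd < 0 ? -errno : (int)fd;
}
```

With this helper, an absolute path such as "/" is expected to fail with -EXDEV on kernels that implement openat2(2). Older kernels return -ENOSYS instead, so callers may want a fallback.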
2011-03-25 10:32:48 -04:00
/*
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
 * Path walking has 2 modes, rcu-walk and ref-walk (see
2011-03-25 10:32:48 -04:00
 * Documentation/filesystems/path-lookup.txt). In situations when we can't
 * continue in RCU mode, we attempt to drop out of rcu-walk mode and grab
2015-11-30 11:11:59 -05:00
 * normal reference counts on dentries and vfsmounts to transition to ref-walk
2011-03-25 10:32:48 -04:00
 * mode. Refcounts are grabbed at the last known good point before rcu-walk
 * got stuck, so ref-walk may continue from there. If this is not successful
 * (eg. a seqcount has changed), then failure is returned and it's up to caller
 * to restart the path walk from the beginning in ref-walk mode.
*/
/**
2020-12-17 09:19:08 -07:00
 * try_to_unlazy - try to switch to ref-walk mode.
2011-03-25 10:32:48 -04:00
 * @nd: nameidata pathwalk data
2020-12-17 09:19:08 -07:00
 * Returns: true on success, false on failure
*
2020-12-17 09:19:08 -07:00
 * try_to_unlazy attempts to legitimize the current nd->path and nd->root
2017-01-09 22:29:15 -05:00
 * for ref-walk mode.
 * Must be called from rcu-walk context.
2020-12-17 09:19:08 -07:00
 * Nothing should touch nameidata between try_to_unlazy() failure and
2015-05-09 12:55:43 -04:00
 * terminate_walk().
*/
2020-12-17 09:19:08 -07:00
static bool try_to_unlazy(struct nameidata *nd)
{
	struct dentry *parent = nd->path.dentry;
	BUG_ON(!(nd->flags & LOOKUP_RCU));
vfs: fix dentry RCU to refcounting possibly sleeping dput()
This is the fix that the last two commits indirectly led up to - making
sure that we don't call dput() in a bad context on the dentries we've
looked up in RCU mode after the sequence count validation fails.
This basically expands d_rcu_to_refcount() into the callers, and then
fixes the callers to delay the dput() in the failure case until _after_
we've dropped all locks and are no longer in an RCU-locked region.
The case of 'complete_walk()' was trivial, since its failure case did
the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
and as such that is just a pure expansion of the function with a trivial
movement of the resulting dput() to after 'unlock_rcu_walk()'.
In contrast, the unlazy_walk() case was much more complicated, because
not only does it convert two different dentries from RCU to be reference
counted, but it used to not call unlock_rcu_walk() at all, and instead
just returned an error and let the caller clean everything up in
"terminate_walk()".
Happily, one of the dentries in question (called "parent" inside
unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
a refcount to anyway for the non-RCU case.
So what the new and improved unlazy_walk() does is to first turn that
dentry into a refcounted one, and once that is set up, the error cases
can continue to use the terminate_walk() helper for cleanup, but for the
non-RCU case. Which makes it possible to drop out of RCU mode if we
actually hit the sequence number failure case.
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-08 18:13:49 -07:00
2017-01-09 22:29:15 -05:00
	if (unlikely(!legitimize_links(nd)))
		goto out1;
2019-07-16 21:20:17 -04:00
	if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
		goto out;
2019-07-16 21:05:36 -04:00
	if (unlikely(!legitimize_root(nd)))
		goto out;
2022-07-06 12:40:31 -04:00
	leave_rcu(nd);
2017-01-09 22:29:15 -05:00
	BUG_ON(nd->inode != parent->d_inode);
2020-12-17 09:19:08 -07:00
	return true;
2017-01-09 22:29:15 -05:00
2019-07-16 21:20:17 -04:00
out1:
2017-01-09 22:29:15 -05:00
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
out:
2022-07-06 12:40:31 -04:00
	leave_rcu(nd);
2020-12-17 09:19:08 -07:00
	return false;
2017-01-09 22:29:15 -05:00
}
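The comment before try_to_unlazy() leaves it to the caller to restart the walk from the beginning in ref-walk mode when legitimization fails, and the rcu-walk commit message names -ECHILD as the signal for that. The control flow can be sketched as below. Everything here is a hypothetical model for illustration: WALK_RCU, struct walk_stats, walk_fn, lookup_with_fallback() and demo_walk() are invented names, and the sketch omits all other error handling that a real lookup needs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define WALK_RCU 0x1u	/* hypothetical flag: attempt the lockless walk */

/* Hypothetical bookkeeping so the demo walker can record what it was asked. */
struct walk_stats {
	int rcu_attempts;
	int ref_attempts;
};

typedef int (*walk_fn)(const char *path, unsigned flags, struct walk_stats *st);

/*
 * Try the lockless walk first. If it gives up with -ECHILD, redo the whole
 * lookup in reference-counted mode. Any other result is returned unchanged.
 */
static int lookup_with_fallback(walk_fn walk, const char *path,
				struct walk_stats *st)
{
	int err;

	assert(walk != NULL);
	err = walk(path, WALK_RCU, st);
	if (err == -ECHILD)
		err = walk(path, 0, st);
	return err;
}

/* Demo walker: always bails out of the lockless mode, succeeds otherwise. */
static int demo_walk(const char *path, unsigned flags, struct walk_stats *st)
{
	(void)path;
	if (flags & WALK_RCU) {
		st->rcu_attempts++;
		return -ECHILD;
	}
	st->ref_attempts++;
	return 0;
}
```

Calling lookup_with_fallback(demo_walk, "a/b", &st) with a zeroed st makes one attempt in each mode and returns the result of the second.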
/**
2021-01-04 00:08:41 -05:00
 * try_to_unlazy_next - try to switch to ref-walk mode.
2017-01-09 22:29:15 -05:00
 * @nd: nameidata pathwalk data
2021-01-04 00:08:41 -05:00
 * @dentry: next dentry to step into
 * Returns: true on success, false on failure
2017-01-09 22:29:15 -05:00
 *
2022-01-25 05:13:40 -08:00
 * Similar to try_to_unlazy(), but here we have the next dentry already
2021-01-04 00:08:41 -05:00
 * picked by rcu-walk and want to legitimize that in addition to the current
 * nd->path and nd->root for ref-walk mode. Must be called from rcu-walk context.
 * Nothing should touch nameidata between try_to_unlazy_next() failure and
2017-01-09 22:29:15 -05:00
 * terminate_walk().
*/
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-04 18:12:39 -04:00
static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
2017-01-09 22:29:15 -05:00
{
2022-07-05 12:22:46 -04:00
	int res;
2017-01-09 22:29:15 -05:00
	BUG_ON(!(nd->flags & LOOKUP_RCU));
2015-05-09 12:55:43 -04:00
	if (unlikely(!legitimize_links(nd)))
		goto out2;
2022-07-05 12:22:46 -04:00
	res = __legitimize_mnt(nd->path.mnt, nd->m_seq);
	if (unlikely(res)) {
		if (res > 0)
			goto out2;
		goto out1;
	}
2017-01-09 22:29:15 -05:00
	if (unlikely(!lockref_get_not_dead(&nd->path.dentry->d_lockref)))
2015-05-09 12:55:43 -04:00
		goto out1;
2013-09-29 22:06:07 -04:00
2013-09-02 11:38:06 -07:00
	/*
2017-01-09 22:29:15 -05:00
	 * We need to move both the parent and the dentry from the RCU domain
	 * to be properly refcounted. And the sequence number in the dentry
	 * validates *both* dentry counters, since we checked the sequence
	 * number of the parent after we got the child sequence number. So we
	 * know the parent must still be valid if the child sequence number is
	 * still valid.
2013-09-02 11:38:06 -07:00
	 */
2017-01-09 22:29:15 -05:00
	if (unlikely(!lockref_get_not_dead(&dentry->d_lockref)))
		goto out;
	if (read_seqcount_retry(&dentry->d_seq, nd->next_seq))
2019-07-16 21:20:17 -04:00
		goto out_dput;
vfs: fix dentry RCU to refcounting possibly sleeping dput()
This is the fix that the last two commits indirectly led up to - making
sure that we don't call dput() in a bad context on the dentries we've
looked up in RCU mode after the sequence count validation fails.
This basically expands d_rcu_to_refcount() into the callers, and then
fixes the callers to delay the dput() in the failure case until _after_
we've dropped all locks and are no longer in an RCU-locked region.
The case of 'complete_walk()' was trivial, since its failure case did
the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
and as such that is just a pure expansion of the function with a trivial
movement of the resulting dput() to after 'unlock_rcu_walk()'.
In contrast, the unlazy_walk() case was much more complicated, because
not only does convert two different dentries from RCU to be reference
counted, but it used to not call unlock_rcu_walk() at all, and instead
just returned an error and let the caller clean everything up in
"terminate_walk()".
Happily, one of the dentries in question (called "parent" inside
unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
a refcount to anyway for the non-RCU case.
So what the new and improved unlazy_walk() does is to first turn that
dentry into a refcounted one, and once that is set up, the error cases
can continue to use the terminate_walk() helper for cleanup, but for the
non-RCU case. Which makes it possible to drop out of RCU mode if we
actually hit the sequence number failure case.
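The ordering rule this commit enforces (take the reference, leave the RCU-locked region, and only then drop the reference on failure) can be illustrated with a toy model. Everything here is hypothetical: `struct walk_sketch`, `sk_dput()` and `sk_legitimize()` only simulate the bookkeeping and do not correspond to kernel functions.

```c
#include <stdbool.h>

/* Toy model of the walk state; all names are hypothetical. */
struct walk_sketch {
	bool in_rcu;		/* inside the RCU-locked region? */
	int refs;		/* references held on the dentry */
	int dput_in_rcu;	/* counts the bug this commit fixes */
};

/* Stand-in for dput(): may sleep, so calling it under RCU is a bug. */
static void sk_dput(struct walk_sketch *w)
{
	if (w->in_rcu)
		w->dput_in_rcu++;
	w->refs--;
}

/*
 * Convert an RCU-mode dentry to a refcounted one. On sequence failure the
 * reference is dropped only after the RCU-locked region has been left.
 */
static bool sk_legitimize(struct walk_sketch *w, bool seq_ok)
{
	w->refs++;		/* grab the reference first */
	w->in_rcu = false;	/* leave RCU on both paths */
	if (seq_ok)
		return true;
	sk_dput(w);		/* safe: no longer under RCU */
	return false;
}
```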
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-08 18:13:49 -07:00
	/*
	 * Sequence counts matched. Now make sure that the root is
	 * still valid and get it if required.
	 */
2019-07-16 21:20:17 -04:00
	if (unlikely(!legitimize_root(nd)))
		goto out_dput;
2022-07-06 12:40:31 -04:00
	leave_rcu(nd);
2021-01-04 00:08:41 -05:00
	return true;
2011-03-25 10:32:48 -04:00
2015-05-09 12:55:43 -04:00
out2:
	nd->path.mnt = NULL;
out1:
	nd->path.dentry = NULL;
2013-09-08 18:13:49 -07:00
out:
2022-07-06 12:40:31 -04:00
	leave_rcu(nd);
2021-01-04 00:08:41 -05:00
	return false;
2019-07-16 21:20:17 -04:00
out_dput:
2022-07-06 12:40:31 -04:00
	leave_rcu(nd);
2019-07-16 21:20:17 -04:00
	dput(dentry);
2021-01-04 00:08:41 -05:00
	return false;
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
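The -ECHILD protocol described above amounts to "try the lockless walk, and redo the whole lookup in ref-walk mode if it gives up". A minimal sketch of that retry, assuming a hypothetical walker callback type (`walk_fn`), a hypothetical flag value (`SK_LOOKUP_RCU`) and a toy walker that always refuses in rcu-walk mode:

```c
#include <errno.h>

#define SK_LOOKUP_RCU 0x1	/* hypothetical flag value for this sketch */

/* Hypothetical walker: returns 0 on success or a negative errno. */
typedef int (*walk_fn)(const char *path, unsigned flags, void *ctx);

/* Try rcu-walk first; on -ECHILD redo the entire lookup as a ref-walk. */
static int lookup_with_fallback(walk_fn walk, const char *path, void *ctx)
{
	int err = walk(path, SK_LOOKUP_RCU, ctx);

	if (err == -ECHILD)
		err = walk(path, 0, ctx);
	return err;
}

/* Toy walker: always bails out of rcu-walk and counts its invocations. */
static int toy_walk(const char *path, unsigned flags, void *ctx)
{
	int *calls = ctx;

	(void)path;
	(*calls)++;
	return (flags & SK_LOOKUP_RCU) ? -ECHILD : 0;
}
```

With `toy_walk`, a call to `lookup_with_fallback()` succeeds on the second pass, which is the "double lookup" overhead the commit message wants to keep rare.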
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
}
2012-06-10 16:10:59 -04:00
static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
2011-01-07 17:49:57 +11:00
{
2017-01-09 22:25:28 -05:00
	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
		return dentry->d_op->d_revalidate(dentry, flags);
	else
		return 1;
2011-01-07 17:49:57 +11:00
}
2011-03-25 11:00:12 -04:00
/**
 * complete_walk - successful completion of path walk
 * @nd: pointer nameidata
2009-12-07 12:01:50 -05:00
 *
2011-03-25 11:00:12 -04:00
 * If we had been in RCU mode, drop out of it and legitimize nd->path.
 * Revalidate the final result, unless we'd already done that during
 * the path walk or the filesystem doesn't ask for it. Return 0 on
 * success, -error on failure. In case of failure caller does not
 * need to drop nd->path.
2009-12-07 12:01:50 -05:00
 */
2011-03-25 11:00:12 -04:00
static int complete_walk(struct nameidata *nd)
2009-12-07 12:01:50 -05:00
{
2011-02-22 15:50:10 -05:00
	struct dentry *dentry = nd->path.dentry;
2009-12-07 12:01:50 -05:00
	int status;
2011-03-25 11:00:12 -04:00
	if (nd->flags & LOOKUP_RCU) {
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.
This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).
/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.
With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.
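The purely lexical part of these semantics (absolute paths and ".." components are refused with -EXDEV) can be sketched as a userspace pre-check. `beneath_precheck()` is a hypothetical helper, not a kernel or libc function, and it deliberately ignores symlinks and magic-links, which only the kernel can police during the actual walk.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical lexical pre-check mirroring the rules stated above:
 * absolute paths and ".." components yield -EXDEV. Symlinks and
 * magic-links are out of scope for a string-only check.
 */
static int beneath_precheck(const char *path)
{
	const char *p = path;

	if (*p == '/')
		return -EXDEV;

	while (*p) {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return -EXDEV;
		p += len;
		while (*p == '/')		/* skip separator(s) */
			p++;
	}
	return 0;
}
```

A check like this is no substitute for the flag itself: a component that passes lexically can still be a symlink pointing outside the starting directory.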
/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:33 +11:00
		/*
		 * We don't want to zero nd->root for scoped-lookups or
		 * externally-managed nd->root.
		 */
2021-04-01 22:03:41 -04:00
		if (!(nd->state & ND_ROOT_PRESET))
			if (!(nd->flags & LOOKUP_IS_SCOPED))
				nd->root.mnt = NULL;
2020-12-17 09:19:09 -07:00
		nd->flags &= ~LOOKUP_CACHED;
2020-12-17 09:19:08 -07:00
		if (!try_to_unlazy(nd))
2011-03-25 11:00:12 -04:00
			return -ECHILD;
	}
2019-12-07 01:13:33 +11:00
	if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
		/*
		 * While the guarantee of LOOKUP_IS_SCOPED is (roughly) "don't
		 * ever step outside the root during lookup" and should already
		 * be guaranteed by the rest of namei, we want to avoid a namei
		 * BUG resulting in userspace being given a path that was not
		 * scoped within the root at some point during the lookup.
		 *
		 * So, do a final sanity-check to make sure that in the
		 * worst-case scenario (a complete bypass of LOOKUP_IS_SCOPED)
		 * we won't silently return an fd completely outside of the
		 * requested root to userspace.
		 *
		 * Userspace could move the path outside the root after this
		 * check, but as discussed elsewhere this is not a concern (the
		 * resolved file was inside the root at some point).
		 */
		if (!path_is_under(&nd->path, &nd->root))
			return -EXDEV;
	}
2021-04-01 22:03:41 -04:00
	if (likely(!(nd->state & ND_JUMPED)))
2011-02-22 15:50:10 -05:00
		return 0;
2013-02-20 11:19:05 -05:00
	if (likely(!(dentry->d_flags & DCACHE_OP_WEAK_REVALIDATE)))
2009-12-07 12:01:50 -05:00
		return 0;
2013-02-20 11:19:05 -05:00
	status = dentry->d_op->d_weak_revalidate(dentry, nd->flags);
2009-12-07 12:01:50 -05:00
	if (status > 0)
		return 0;
2011-02-22 15:50:10 -05:00
	if (!status)
2009-12-07 12:01:50 -05:00
		status = -ESTALE;
2011-02-22 15:50:10 -05:00
2009-12-07 12:01:50 -05:00
	return status;
}
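The tail of complete_walk() turns the tri-state result of ->d_weak_revalidate() into a plain errno-style return. Pulled out on its own, the mapping looks roughly like this; `map_weak_revalidate()` is a hypothetical name used only for illustration.

```c
#include <errno.h>

/*
 * Hypothetical helper isolating the status mapping at the end of
 * complete_walk(): positive means "still valid", zero means "stale",
 * and a negative value is an error passed through unchanged.
 */
static int map_weak_revalidate(int status)
{
	if (status > 0)
		return 0;
	if (!status)
		return -ESTALE;
	return status;
}
```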
2019-12-07 01:13:29 +11:00
static int set_root(struct nameidata *nd)
2011-01-07 17:49:52 +11:00
{
2014-09-13 21:55:46 -04:00
	struct fs_struct *fs = current->fs;
2011-01-07 17:49:53 +11:00
2019-12-07 01:13:33 +11:00
	/*
	 * Jumping to the real root in a scoped-lookup is a BUG in namei, but we
	 * still have to ensure it doesn't happen because it will cause a breakout
	 * from the dirfd.
	 */
	if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED))
		return -ENOTRECOVERABLE;
2015-12-05 20:07:21 -05:00
	if (nd->flags & LOOKUP_RCU) {
		unsigned seq;
		do {
			seq = read_seqcount_begin(&fs->seq);
			nd->root = fs->root;
			nd->root_seq = __read_seqcount_begin(&nd->root.dentry->d_seq);
		} while (read_seqcount_retry(&fs->seq, seq));
	} else {
		get_fs_root(fs, &nd->root);
2021-04-01 22:03:41 -04:00
		nd->state |= ND_ROOT_GRABBED;
2015-12-05 20:07:21 -05:00
	}
2019-12-07 01:13:29 +11:00
	return 0;
2011-01-07 17:49:52 +11:00
}
2015-12-05 20:51:58 -05:00
static int nd_jump_root(struct nameidata *nd)
{
2019-12-07 01:13:33 +11:00
	if (unlikely(nd->flags & LOOKUP_BENEATH))
		return -EXDEV;
2019-12-07 01:13:32 +11:00
	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
		/* Absolute path arguments to path_init() are allowed. */
		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
			return -EXDEV;
	}
2019-12-07 01:13:29 +11:00
	if (!nd->root.mnt) {
		int error = set_root(nd);
		if (error)
			return error;
	}
2015-12-05 20:51:58 -05:00
	if (nd->flags & LOOKUP_RCU) {
		struct dentry *d;
		nd->path = nd->root;
		d = nd->path.dentry;
		nd->inode = d->d_inode;
		nd->seq = nd->root_seq;
2022-07-05 11:23:58 -04:00
		if (read_seqcount_retry(&d->d_seq, nd->seq))
2015-12-05 20:51:58 -05:00
			return -ECHILD;
	} else {
		path_put(&nd->path);
		nd->path = nd->root;
		path_get(&nd->path);
		nd->inode = nd->path.dentry->d_inode;
	}
2021-04-01 22:03:41 -04:00
	nd->state |= ND_JUMPED;
2015-12-05 20:51:58 -05:00
	return 0;
}
2012-06-18 10:47:04 -04:00
/*
2015-11-17 10:20:54 -05:00
 * Helper to directly jump to a known parsed path from ->get_link,
2012-06-18 10:47:04 -04:00
 * caller must have taken a reference to path beforehand.
 */
2019-12-07 01:13:28 +11:00
int nd_jump_link(struct path *path)
2012-06-18 10:47:04 -04:00
{
2019-12-07 01:13:31 +11:00
	int error = -ELOOP;
2015-05-02 13:37:52 -04:00
	struct nameidata *nd = current->nameidata;
2012-06-18 10:47:04 -04:00
2019-12-07 01:13:31 +11:00
	if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
		goto err;
2019-12-07 01:13:32 +11:00
	error = -EXDEV;
	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
		if (nd->path.mnt != path->mnt)
			goto err;
	}
2019-12-07 01:13:33 +11:00
/* Not currently safe for scoped-lookups. */
	if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
		goto err;
2019-12-07 01:13:32 +11:00
2019-12-07 01:13:31 +11:00
	path_put(&nd->path);
2012-06-18 10:47:04 -04:00
	nd->path = *path;
	nd->inode = nd->path.dentry->d_inode;
2021-04-01 22:03:41 -04:00
	nd->state |= ND_JUMPED;
2019-12-07 01:13:28 +11:00
	return 0;
2019-12-07 01:13:31 +11:00
err:
	path_put(path);
	return error;
2012-06-18 10:47:04 -04:00
}
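nd_jump_link() checks its restrictions in a fixed order, so the errno a caller sees depends on which flag trips first. A small sketch of that precedence, with hypothetical flag values (`SK_*`) and opaque pointers standing in for the two vfsmounts; the reference handling of the real function is left out.

```c
#include <errno.h>

/* Hypothetical flag values; the kernel's LOOKUP_* constants differ. */
#define SK_NO_MAGICLINKS 0x1
#define SK_NO_XDEV       0x2
#define SK_IS_SCOPED     0x4

/*
 * Mirror the order of checks in nd_jump_link(): magic-link ban first
 * (-ELOOP), then mount crossing (-EXDEV), then scoped lookups (-EXDEV).
 */
static int jump_link_check(unsigned flags, const void *cur_mnt,
			   const void *target_mnt)
{
	if (flags & SK_NO_MAGICLINKS)
		return -ELOOP;
	if ((flags & SK_NO_XDEV) && cur_mnt != target_mnt)
		return -EXDEV;
	if (flags & SK_IS_SCOPED)
		return -EXDEV;
	return 0;
}
```

When both the magic-link ban and the mount-crossing ban would fire, the caller sees -ELOOP, because that check runs first.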
2015-05-02 20:19:23 -04:00
static inline void put_link(struct nameidata *nd)
2011-03-14 22:20:34 -04:00
{
2015-05-03 21:06:24 -04:00
	struct saved *last = nd->stack + --nd->depth;
2015-12-29 15:58:39 -05:00
	do_delayed_call(&last->done);
2015-05-07 20:32:22 -04:00
	if (!(nd->flags & LOOKUP_RCU))
		path_put(&last->link);
2011-03-14 22:20:34 -04:00
}
2022-01-21 22:13:13 -08:00
static int sysctl_protected_symlinks __read_mostly;
static int sysctl_protected_hardlinks __read_mostly;
static int sysctl_protected_fifos __read_mostly;
static int sysctl_protected_regular __read_mostly;
#ifdef CONFIG_SYSCTL
static struct ctl_table namei_sysctls[] = {
	{
		.procname	= "protected_symlinks",
		.data		= &sysctl_protected_symlinks,
		.maxlen		= sizeof(int),
2022-05-13 16:58:15 -07:00
		.mode		= 0644,
2022-01-21 22:13:13 -08:00
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "protected_hardlinks",
		.data		= &sysctl_protected_hardlinks,
		.maxlen		= sizeof(int),
2022-05-13 16:58:15 -07:00
		.mode		= 0644,
2022-01-21 22:13:13 -08:00
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "protected_fifos",
		.data		= &sysctl_protected_fifos,
		.maxlen		= sizeof(int),
2022-05-13 16:58:15 -07:00
		.mode		= 0644,
2022-01-21 22:13:13 -08:00
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_TWO,
	},
	{
		.procname	= "protected_regular",
		.data		= &sysctl_protected_regular,
		.maxlen		= sizeof(int),
2022-05-13 16:58:15 -07:00
		.mode		= 0644,
2022-01-21 22:13:13 -08:00
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_TWO,
	},
	{ }
};
static int __init init_fs_namei_sysctls(void)
{
	register_sysctl_init("fs", namei_sysctls);
	return 0;
}
fs_initcall(init_fs_namei_sysctls);
#endif /* CONFIG_SYSCTL */
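The table above bounds each knob through its extra1/extra2 fields: protected_symlinks and protected_hardlinks accept 0 or 1, while protected_fifos and protected_regular accept 0 through 2. Because the table is registered under "fs", the knobs appear as /proc/sys/fs/<procname>. As a rough illustration only, the userspace sketch below mirrors those ranges. The names `knob_range`, `namei_knobs` and `knob_value_ok` are hypothetical and do not exist in the kernel, where the real bounds check is performed by proc_dointvec_minmax.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the sysctl table: name plus accepted range. */
struct knob_range {
	const char *name;
	int min;	/* corresponds to .extra1 */
	int max;	/* corresponds to .extra2 */
};

static const struct knob_range namei_knobs[] = {
	{ "protected_symlinks",  0, 1 },
	{ "protected_hardlinks", 0, 1 },
	{ "protected_fifos",     0, 2 },
	{ "protected_regular",   0, 2 },
};

/*
 * Returns 1 if value lies inside the range listed for name,
 * 0 if it lies outside, and -1 if name is not in the table.
 */
static int knob_value_ok(const char *name, int value)
{
	size_t i;

	for (i = 0; i < sizeof(namei_knobs) / sizeof(namei_knobs[0]); i++) {
		if (strcmp(namei_knobs[i].name, name) == 0)
			return value >= namei_knobs[i].min &&
			       value <= namei_knobs[i].max;
	}
	return -1;
}
```

Under this model, knob_value_ok("protected_fifos", 2) is accepted while knob_value_ok("protected_symlinks", 2) is not, which matches the SYSCTL_TWO and SYSCTL_ONE upper bounds in the table.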
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREAT|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
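For reference, the application-side approach named in the objections above (mkstemp(), or open() with both O_CREAT and O_EXCL) might look like the following minimal sketch. The helper names `create_private_temp` and `create_exclusive` are hypothetical wrappers introduced only for illustration. Both calls refuse to reuse an existing path, so a symlink planted in advance makes the call fail instead of being followed.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a private temporary file from a template ending in "XXXXXX".
 * mkstemp() rewrites the template in place with the chosen name and
 * returns an open descriptor, or -1 on failure.
 */
static int create_private_temp(char *path_template)
{
	return mkstemp(path_template);
}

/*
 * Create a file at a fixed path, failing if anything (including a
 * symlink) already exists there. Returns the descriptor, or -1.
 */
static int create_exclusive(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
}
```

A caller would pass a writable buffer such as `char tmpl[] = "/tmp/demo_XXXXXX";` to create_private_temp(), then close() the descriptor and unlink() the resulting name when done.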
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
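The hardlink rule as stated can be sketched as a small predicate. This is a simplified model of the description above and makes no claim about the in-kernel check, which may apply further conditions. `struct link_request` and `hardlink_allowed` are hypothetical names introduced for illustration.

```c
#include <sys/types.h>

/* Hypothetical description of a link() attempt; not a kernel structure. */
struct link_request {
	uid_t caller_uid;	/* uid of the process creating the hardlink */
	uid_t file_uid;		/* owner of the existing file */
	int can_read;		/* nonzero if the caller may read the file */
	int can_write;		/* nonzero if the caller may write the file */
};

/* Returns 1 if the hardlink would be permitted under the stated rule, else 0. */
static int hardlink_allowed(const struct link_request *req)
{
	/* The owner of the existing file may always link to it. */
	if (req->caller_uid == req->file_uid)
		return 1;

	/* Otherwise the caller needs both read and write access already. */
	return req->can_read && req->can_write;
}
```

In this model a user who can read but not write a root-owned setuid binary cannot hardlink it, which is the "pinning" case described above.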
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-25 17:29:07 -07:00
/**
 * may_follow_link - Check symlink following for unsafe situations
2012-08-18 17:39:25 -07:00
 * @nd: nameidata pathwalk data
2012-07-25 17:29:07 -07:00
 *
 * In the case of the sysctl_protected_symlinks sysctl being enabled,
 * CAP_DAC_OVERRIDE needs to be specifically ignored if the symlink is
 * in a sticky world-writable directory. This is to protect privileged
 * processes from failing races against path names that may change out
 * from under them by way of other users creating malicious symlinks.
 * It will permit symlinks to be followed only when outside a sticky
 * world-writable directory, or when the uid of the symlink and follower
 * match, or when the directory owner matches the symlink's owner.
 *
 * Returns 0 if following the symlink is allowed, -ve on error.
 */
2020-01-14 14:41:39 -05:00
static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
2012-07-25 17:29:07 -07:00
{
2021-01-21 14:19:31 +01:00
	struct user_namespace *mnt_userns;
	kuid_t i_uid;
2012-07-25 17:29:07 -07:00
	if (!sysctl_protected_symlinks)
		return 0;
2021-01-21 14:19:31 +01:00
	mnt_userns = mnt_user_ns(nd->path.mnt);
	i_uid = i_uid_into_mnt(mnt_userns, inode);
2012-07-25 17:29:07 -07:00
	/* Allowed if owner and follower match. */
2021-01-21 14:19:31 +01:00
	if (uid_eq(current_cred()->fsuid, i_uid))
2012-07-25 17:29:07 -07:00
		return 0;
	/* Allowed if parent directory not sticky and world-writable. */
2020-03-05 11:34:48 -05:00
	if ((nd->dir_mode & (S_ISVTX | S_IWOTH)) != (S_ISVTX | S_IWOTH))
2012-07-25 17:29:07 -07:00
		return 0;
	/* Allowed if parent directory and link owner match. */
2021-01-21 14:19:31 +01:00
	if (uid_valid(nd->dir_uid) && uid_eq(nd->dir_uid, i_uid))
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREATE|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
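The third objection above names mkstemp() and exclusive creation as the application-side fix. A minimal user-space sketch of both follows. The helper names and the "/tmp/example-XXXXXX" template are hypothetical and chosen only for illustration; note that POSIX spells the flag O_CREAT.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a uniquely named temporary file with
 * mkstemp(). On success the chosen name is left in path_buf and an
 * open descriptor is returned; on failure -1 is returned.
 */
static int create_private_tmpfile(char *path_buf, size_t len)
{
	static const char tmpl[] = "/tmp/example-XXXXXX";

	if (len < sizeof(tmpl))
		return -1;
	memcpy(path_buf, tmpl, sizeof(tmpl));
	return mkstemp(path_buf);
}

/*
 * Hypothetical helper: create a file only if nothing exists at path.
 * With O_CREAT | O_EXCL, open() fails with EEXIST when the name is
 * already taken, including when it is taken by a symlink.
 */
static int create_exclusive(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

Either helper refuses to reuse a name an attacker has planted in advance, which is the behaviour the kernel-side restriction enforces for applications that do not take this care.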
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-25 17:29:07 -07:00
		return 0;
2015-05-07 20:37:40 -04:00
	if (nd->flags & LOOKUP_RCU)
		return -ECHILD;
2018-03-21 04:42:21 -04:00
	audit_inode(nd->name, nd->stack[0].link.dentry, 0);
2019-10-02 16:41:58 -07:00
	audit_log_path_denied(AUDIT_ANOM_LINK, "follow_link");
2012-07-25 17:29:07 -07:00
	return -EACCES;
}
/**
 * safe_hardlink_source - Check for safe hardlink conditions
2021-01-21 14:19:31 +01:00
 * @mnt_userns: user namespace of the mount the inode was found from
2012-07-25 17:29:07 -07:00
 * @inode: the source inode to hardlink from
 *
 * Return false if at least one of the following conditions:
 *    - inode is not a regular file
 *    - inode is setuid
 *    - inode is setgid and group-exec
 *    - access failure for read and write
 *
 * Otherwise returns true.
 */
2021-01-21 14:19:31 +01:00
static bool safe_hardlink_source(struct user_namespace *mnt_userns,
				 struct inode *inode)
2012-07-25 17:29:07 -07:00
{
	umode_t mode = inode->i_mode;
	/* Special files should not get pinned to the filesystem. */
	if (!S_ISREG(mode))
		return false;
	/* Setuid files should not get pinned to the filesystem. */
	if (mode & S_ISUID)
		return false;
	/* Executable setgid files should not get pinned to the filesystem. */
	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
		return false;
	/* Hardlinking to unreadable or unwritable sources is dangerous. */
2021-01-21 14:19:31 +01:00
	if (inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE))
2012-07-25 17:29:07 -07:00
		return false;
	return true;
}
/**
 * may_linkat - Check permissions for creating a hardlink
2021-01-21 14:19:31 +01:00
 * @mnt_userns: user namespace of the mount the inode was found from
2012-07-25 17:29:07 -07:00
 * @link: the source to hardlink from
 *
 * Block hardlink when all of:
 *  - sysctl_protected_hardlinks enabled
 *  - fsuid does not match inode
 *  - hardlink source is unsafe (see safe_hardlink_source() above)
2015-10-20 16:09:19 +02:00
 *  - not CAP_FOWNER in a namespace with the inode owner uid mapped
2012-07-25 17:29:07 -07:00
 *
2021-01-21 14:19:31 +01:00
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
 *
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREAT|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-25 17:29:07 -07:00
 * Returns 0 if successful, -ve on error.
 */
2021-01-21 14:19:31 +01:00
int may_linkat(struct user_namespace *mnt_userns, struct path *link)
2012-07-25 17:29:07 -07:00
{
2017-09-14 12:07:32 -05:00
	struct inode *inode = link->dentry->d_inode;
	/* Inode writeback is not safe when the uid or gid are invalid. */
2021-01-21 14:19:31 +01:00
	if (!uid_valid(i_uid_into_mnt(mnt_userns, inode)) ||
	    !gid_valid(i_gid_into_mnt(mnt_userns, inode)))
2017-09-14 12:07:32 -05:00
		return -EOVERFLOW;
2012-07-25 17:29:07 -07:00
	if (!sysctl_protected_hardlinks)
		return 0;
	/* Source inode owner (or CAP_FOWNER) can hardlink all they like,
	 * otherwise, it must be a safe source.
	 */
2021-01-21 14:19:31 +01:00
	if (safe_hardlink_source(mnt_userns, inode) ||
	    inode_owner_or_capable(mnt_userns, inode))
2012-07-25 17:29:07 -07:00
		return 0;
2019-10-02 16:41:58 -07:00
	audit_log_path_denied(AUDIT_ANOM_LINK, "linkat");
2012-07-25 17:29:07 -07:00
	return -EPERM;
}
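The order of checks in may_linkat() can be modelled in userspace as follows. This is a rough sketch, not the kernel code: `struct hardlink_req` and `model_may_linkat()` are hypothetical names, and the `safe_source` flag is a simplification that stands for "the caller already has read/write access to the existing file", as the commit message describes it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs; each flag stands in for one kernel-side check. */
struct hardlink_req {
	bool protected_hardlinks; /* stands in for sysctl_protected_hardlinks */
	bool ids_valid;           /* uid and gid are both valid in the mount */
	bool caller_is_owner;     /* owner of the source inode, or CAP_FOWNER */
	bool safe_source;         /* simplified: caller has read/write access */
};

/* Returns 0 when the link may be created, a negative errno otherwise. */
int model_may_linkat(const struct hardlink_req *r)
{
	assert(r != NULL);

	if (!r->ids_valid)
		return -EOVERFLOW;      /* checked before the sysctl */
	if (!r->protected_hardlinks)
		return 0;               /* restriction disabled */
	if (r->safe_source || r->caller_is_owner)
		return 0;
	return -EPERM;
}
```

Note that in this ordering the invalid-id check applies even when the restriction is switched off.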
2018-08-23 17:00:35 -07:00
/**
 * may_create_in_sticky - Check whether an O_CREAT open in a sticky directory
 *			  should be allowed, or not, on files that already
 *			  exist.
2021-01-21 14:19:31 +01:00
 * @mnt_userns: user namespace of the mount the inode was found from
2021-02-15 20:29:28 -08:00
 * @nd: nameidata pathwalk data
2018-08-23 17:00:35 -07:00
 * @inode: the inode of the file to open
 *
 * Block an O_CREAT open of a FIFO (or a regular file) when:
 *   - sysctl_protected_fifos (or sysctl_protected_regular) is enabled
 *   - the file already exists
 *   - we are in a sticky directory
 *   - we don't own the file
 *   - the owner of the directory doesn't own the file
 *   - the directory is world writable
 * If the sysctl_protected_fifos (or sysctl_protected_regular) is set to 2
 * the directory doesn't have to be world writable: being group writable will
 * be enough.
 *
2021-01-21 14:19:31 +01:00
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
 *
2018-08-23 17:00:35 -07:00
 * Returns 0 if the open is allowed, -ve on error.
 */
2021-01-21 14:19:31 +01:00
static int may_create_in_sticky(struct user_namespace *mnt_userns,
				struct nameidata *nd, struct inode *const inode)
2018-08-23 17:00:35 -07:00
{
2021-01-21 14:19:31 +01:00
	umode_t dir_mode = nd->dir_mode;
	kuid_t dir_uid = nd->dir_uid;
2018-08-23 17:00:35 -07:00
	if ((!sysctl_protected_fifos && S_ISFIFO(inode->i_mode)) ||
	    (!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
2020-01-26 09:29:34 -05:00
	    likely(!(dir_mode & S_ISVTX)) ||
2021-01-21 14:19:31 +01:00
	    uid_eq(i_uid_into_mnt(mnt_userns, inode), dir_uid) ||
	    uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)))
2018-08-23 17:00:35 -07:00
		return 0;
2020-01-26 09:29:34 -05:00
	if (likely(dir_mode & 0002) ||
	    (dir_mode & 0020 &&
2018-08-23 17:00:35 -07:00
	     ((sysctl_protected_fifos >= 2 && S_ISFIFO(inode->i_mode)) ||
	      (sysctl_protected_regular >= 2 && S_ISREG(inode->i_mode))))) {
2019-10-02 16:41:58 -07:00
		const char *operation = S_ISFIFO(inode->i_mode) ?
					"sticky_create_fifo" :
					"sticky_create_regular";
		audit_log_path_denied(AUDIT_ANOM_CREAT, operation);
2018-08-23 17:00:35 -07:00
		return -EACCES;
	}
	return 0;
}
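To make the two sysctl levels easier to follow, the decision in may_create_in_sticky() can be replayed in userspace. The sketch below is an approximation under stated assumptions: `struct sticky_req` and `model_create_in_sticky()` are hypothetical, the file is assumed to be either a FIFO or a regular file, and plain integers replace the kernel's kuid_t and idmapping helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the values the kernel function looks at. */
struct sticky_req {
	unsigned dir_mode;     /* mode bits of the parent directory */
	unsigned dir_uid;      /* owner of the parent directory */
	unsigned file_uid;     /* owner of the existing file */
	unsigned fsuid;        /* fsuid of the opening process */
	bool is_fifo;          /* true: FIFO, false: regular file */
	int protected_fifos;   /* stands in for sysctl_protected_fifos */
	int protected_regular; /* stands in for sysctl_protected_regular */
};

/* Returns 0 when the O_CREAT open is allowed, -EACCES when it is blocked. */
int model_create_in_sticky(const struct sticky_req *r)
{
	assert(r != NULL);

	/* Pick the sysctl that applies to this kind of file. */
	int level = r->is_fifo ? r->protected_fifos : r->protected_regular;

	if (level == 0 ||
	    !(r->dir_mode & 01000) ||        /* directory is not sticky */
	    r->file_uid == r->dir_uid ||     /* directory owner owns the file */
	    r->fsuid == r->file_uid)         /* we own the file */
		return 0;

	if ((r->dir_mode & 0002) ||          /* world writable */
	    ((r->dir_mode & 0020) && level >= 2)) /* group writable, level 2 */
		return -EACCES;
	return 0;
}
```

With a 01770 directory, for example, the open is only blocked once the relevant sysctl is raised to 2.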
2012-06-25 12:55:28 +01:00
/*
 * follow_up - Find the mountpoint of path's vfsmount
 *
 * Given a path, find the mountpoint of its source file system.
 * Replace @path with the path of the mountpoint in the parent mount.
 * Up is towards /.
 *
 * Return 1 if we went up a level and 0 if we were already at the
 * root.
 */
2009-04-18 03:26:48 -04:00
int follow_up(struct path *path)
2005-04-16 15:20:36 -07:00
{
2011-11-24 22:19:58 -05:00
	struct mount *mnt = real_mount(path->mnt);
	struct mount *parent;
2005-04-16 15:20:36 -07:00
	struct dentry *mountpoint;
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.
The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability will not be
much improved in common cases yet, due to other locks (i.e. dcache_lock) getting
in the way. However, path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
patch and 7.1s afterwards, about 5% slower.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 04:37:39 +10:00
2013-09-29 22:06:07 -04:00
	read_seqlock_excl(&mount_lock);
2011-11-24 22:19:58 -05:00
	parent = mnt->mnt_parent;
2012-07-18 17:32:50 +04:00
	if (parent == mnt) {
2013-09-29 22:06:07 -04:00
		read_sequnlock_excl(&mount_lock);
2005-04-16 15:20:36 -07:00
		return 0;
	}
2011-11-24 22:19:58 -05:00
	mntget(&parent->mnt);
2011-11-24 22:25:07 -05:00
	mountpoint = dget(mnt->mnt_mountpoint);
2013-09-29 22:06:07 -04:00
	read_sequnlock_excl(&mount_lock);
2009-04-18 03:26:48 -04:00
	dput(path->dentry);
	path->dentry = mountpoint;
	mntput(path->mnt);
2011-11-24 22:19:58 -05:00
	path->mnt = &parent->mnt;
2005-04-16 15:20:36 -07:00
	return 1;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(follow_up);
2005-04-16 15:20:36 -07:00
2020-02-26 17:50:13 -05:00
static bool choose_mountpoint_rcu(struct mount *m, const struct path *root,
				  struct path *path, unsigned *seqp)
{
	while (mnt_has_parent(m)) {
		struct dentry *mountpoint = m->mnt_mountpoint;
		m = m->mnt_parent;
		if (unlikely(root->dentry == mountpoint &&
			     root->mnt == &m->mnt))
			break;
		if (mountpoint != m->mnt.mnt_root) {
			path->mnt = &m->mnt;
			path->dentry = mountpoint;
			*seqp = read_seqcount_begin(&mountpoint->d_seq);
			return true;
		}
	}
	return false;
}
2020-02-26 19:19:05 -05:00
static bool choose_mountpoint(struct mount *m, const struct path *root,
			      struct path *path)
{
	bool found;
	rcu_read_lock();
	while (1) {
		unsigned seq, mseq = read_seqbegin(&mount_lock);
		found = choose_mountpoint_rcu(m, root, path, &seq);
		if (unlikely(!found)) {
			if (!read_seqretry(&mount_lock, mseq))
				break;
		} else {
			if (likely(__legitimize_path(path, seq, mseq)))
				break;
			rcu_read_unlock();
			path_put(path);
			rcu_read_lock();
		}
	}
	rcu_read_unlock();
	return found;
}
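The loop in choose_mountpoint() follows the usual sequence-counter reader pattern: sample the counter, read optimistically, and retry if a writer got in between. A minimal userspace sketch of that pattern using C11 atomics is shown below. It is not the kernel's seqlock implementation: `struct seq_pair` and the `seq_*` functions are hypothetical, a single writer is assumed, and default sequentially consistent atomics are used to keep the ordering argument simple.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Two values that must be observed together, guarded by a counter. */
struct seq_pair {
	atomic_uint seq; /* odd while a write is in progress */
	atomic_int a;
	atomic_int b;
};

void seq_init(struct seq_pair *p, int a, int b)
{
	assert(p != NULL);
	atomic_init(&p->seq, 0u);
	atomic_init(&p->a, a);
	atomic_init(&p->b, b);
}

/* Reader side: wait for an even counter and remember it. */
unsigned seq_read_begin(struct seq_pair *p)
{
	unsigned s;

	while ((s = atomic_load(&p->seq)) & 1u)
		; /* a writer is active, spin */
	return s;
}

/* Reader side: true when the data may have changed since 'start'. */
bool seq_read_retry(struct seq_pair *p, unsigned start)
{
	return atomic_load(&p->seq) != start;
}

/* Writer side (single writer assumed): bump, update, bump again. */
void seq_write(struct seq_pair *p, int a, int b)
{
	atomic_fetch_add(&p->seq, 1u);
	atomic_store(&p->a, a);
	atomic_store(&p->b, b);
	atomic_fetch_add(&p->seq, 1u);
}

/* Optimistic read of both values, retried until it is consistent. */
int seq_read_sum(struct seq_pair *p)
{
	unsigned s;
	int a, b;

	do {
		s = seq_read_begin(p);
		a = atomic_load(&p->a);
		b = atomic_load(&p->b);
	} while (seq_read_retry(p, s));
	return a + b;
}
```

The kernel loop differs in that a failed retry check is only consulted when nothing was found, while a found mountpoint is validated separately through __legitimize_path().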
2011-01-07 17:49:38 +11:00
/*
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
 * Perform an automount
 * - return -EISDIR to tell follow_managed() to stop and return the path we
 *   were called with.
2005-04-16 15:20:36 -07:00
 */
2020-01-16 22:05:18 -05:00
static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
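The full-restart case (rcu-walk returns -ECHILD and the whole lookup is redone with ref-walk) can be sketched as below. This is an illustrative userspace model only: `walk_mode`, `lookup_with_fallback()` and `toy_walk()` are hypothetical names, and the '?' marker for an uncached element is an invented convention. The preferred in-place drop to ref-walk is not modelled.

```c
#include <errno.h>
#include <string.h>

enum walk_mode { WALK_RCU, WALK_REF };

typedef int (*walk_fn)(const char *path, enum walk_mode mode);

/* Try the lockless walk first; redo the whole lookup with references on -ECHILD. */
static int lookup_with_fallback(const char *path, walk_fn walk)
{
	int err = walk(path, WALK_RCU);

	if (err == -ECHILD)
		err = walk(path, WALK_REF);
	return err;
}

/* Counts how many walks the toy walker has performed. */
static int toy_walk_calls;

/*
 * Toy walker: a '?' in the path stands for an element that rcu-walk
 * cannot handle (for instance an uncached dentry), so the RCU attempt
 * bails out with -ECHILD while the ref-walk attempt succeeds.
 */
static int toy_walk(const char *path, enum walk_mode mode)
{
	toy_walk_calls++;
	if (mode == WALK_RCU && strchr(path, '?'))
		return -ECHILD;
	return 0;
}
```

With this walker, a path without '?' costs one walk, while a path containing '?' costs two, which is the double-lookup overhead the text wants to keep rare.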
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
{
2020-01-11 11:27:46 -05:00
	struct dentry *dentry = path->dentry;
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
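The return contract above (new mount, NULL on collision, -EISDIR to use the directory, any other error) can be sketched as a caller-side classification. The sketch assumes errors are encoded in the pointer, in the style of the kernel's ERR_PTR() convention; the `toy_*` helpers and the `automount_outcome` enum are hypothetical userspace stand-ins, not kernel API.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitation of pointer-encoded error values. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct toy_vfsmount { int id; };

enum automount_outcome {
	AUTOMOUNT_MOUNTED,       /* a new mount was created and returned */
	AUTOMOUNT_COLLISION,     /* NULL: another automount attempt won the race */
	AUTOMOUNT_USE_DIRECTORY, /* -EISDIR: use the directory itself */
	AUTOMOUNT_FAILED         /* any other error, reported through *err */
};

/* Classify what a d_automount-style callback returned. */
static enum automount_outcome toy_classify(struct toy_vfsmount *ret, long *err)
{
	*err = 0;
	if (!ret)
		return AUTOMOUNT_COLLISION;
	if (toy_is_err(ret)) {
		*err = toy_ptr_err(ret);
		return *err == -EISDIR ? AUTOMOUNT_USE_DIRECTORY : AUTOMOUNT_FAILED;
	}
	return AUTOMOUNT_MOUNTED;
}
```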
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
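The return codes listed above can be laid out as a small decision sequence. The following is a hedged sketch, not the fs/namei.c implementation: `toy_point`, `toy_follow_automount()`, `toy_mount_ok()` and `TOY_MAX_LINKS` are hypothetical, and the order of the checks is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LINKS 40	/* hypothetical limit on symlinks/mountpoints crossed */

struct toy_point {
	int (*automount)(struct toy_point *p);	/* hypothetical op, NULL if not supplied */
	bool mounted;
};

/*
 * Returns -EREMOTE if the point is flagged for automount but has no op,
 * -EISDIR if the walk point should be used without mounting,
 * -ELOOP if too many symlinks or mountpoints have been crossed,
 * and otherwise whatever the op returns (0 on success).
 */
static int toy_follow_automount(struct toy_point *p, bool wants_mount, int *count)
{
	if (!p->automount)
		return -EREMOTE;
	if (!wants_mount)
		return -EISDIR;
	if ((*count)++ >= TOY_MAX_LINKS)
		return -ELOOP;
	return p->automount(p);
}

/* Toy op that always succeeds and records that something was mounted. */
static int toy_mount_ok(struct toy_point *p)
{
	p->mounted = true;
	return 0;
}
```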
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
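The mount/don't-mount decision can be sketched as a predicate over the lookup flags, loosely modelled on the flag test that appears in the code later in this file. The `TOY_LOOKUP_*` values are hypothetical and do not match the kernel's LOOKUP_* constants; `toy_should_automount()` is an illustrative name.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's LOOKUP_* values differ. */
enum {
	TOY_LOOKUP_PARENT    = 1 << 0,	/* walking through to a later component */
	TOY_LOOKUP_DIRECTORY = 1 << 1,	/* trailing '/' or directory required */
	TOY_LOOKUP_OPEN      = 1 << 2,	/* open() */
	TOY_LOOKUP_CREATE    = 1 << 3,	/* creat() or O_CREAT */
	TOY_LOOKUP_AUTOMOUNT = 1 << 4	/* caller explicitly asked for the automount */
};

/*
 * A plain stat()-like lookup of an existing (positive) dentry does not
 * trigger the mount; any of the "forcing" flags does. Negative dentries
 * always go ahead, since they need instantiating before they can be used.
 */
static bool toy_should_automount(unsigned lookup_flags, bool dentry_is_positive)
{
	const unsigned forcing = TOY_LOOKUP_PARENT | TOY_LOOKUP_DIRECTORY |
				 TOY_LOOKUP_OPEN | TOY_LOOKUP_CREATE |
				 TOY_LOOKUP_AUTOMOUNT;

	if (!(lookup_flags & forcing) && dentry_is_positive)
		return false;	/* the caller would see -EISDIR here */
	return true;
}
```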
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
2011-09-05 18:06:26 +02:00
/* We don't want to mount if someone's just doing a stat -
 * unless they're stat'ing a directory and appended a '/' to
 * the name.
 *
 * We do, however, want to mount if someone wants to open or
 * create a file of any type under the mountpoint, wants to
 * traverse through the mountpoint or wants to open the
 * mounted directory.  Also, autofs may mark negative dentries
 * as being automount points.  These will need the attentions
 * of the daemon to instantiate them before they can be used.
*/
2020-01-16 22:05:18 -05:00
	if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
2017-11-29 16:11:26 -08:00
			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
2020-01-11 11:27:46 -05:00
	    dentry->d_inode)
2017-11-29 16:11:26 -08:00
		return -EISDIR;
2011-09-05 18:06:26 +02:00
2020-01-16 22:05:18 -05:00
	if (count && (*count)++ >= MAXSYMLINKS)
		return -ELOOP;
2020-01-11 11:27:46 -05:00
	return finish_automount(dentry->d_op->d_automount(path), path);
2005-06-06 13:36:05 -07:00
}
/*
2020-01-17 08:45:08 -05:00
 * mount traversal - out-of-line part.  One note on ->d_flags accesses -
 * dentries are pinned but not locked here, so negative dentry can go
 * positive right under us.  Use of smp_load_acquire() provides a barrier
 * sufficient for ->d_inode and ->d_flags consistency.
*/
2020-01-17 08:45:08 -05:00
static int __traverse_mounts(struct path *path, unsigned flags, bool *jumped,
			     int *count, unsigned lookup_flags)
2005-04-16 15:20:36 -07:00
{
2020-01-17 08:45:08 -05:00
	struct vfsmount *mnt = path->mnt;
	bool need_mntput = false;
VFS: Fix vfsmount overput on simultaneous automount
[Kudos to dhowells for tracking that crap down]
If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.
The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.
The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.
The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt. That assumption is almost correct - it breaks in case of
racing automounts and in even harder to hit race between following a mountpoint
and a couple of mount --move. The thing is, we don't need to make that
assumption at all - after the end of loop in follow_managed() we can check
if path->mnt has ended up unchanged and do mntput() if needed.
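The end-of-loop check described above can be sketched with toy reference counts. This is only an illustration under stated assumptions: `toy_mnt`, `toy_path`, `toy_mntget()`, `toy_mntput()` and `toy_finish_walk()` are hypothetical userspace stand-ins, and the traversal loop that sets `need_mntput` is not modelled.

```c
#include <stdbool.h>

struct toy_mnt { int refs; };
struct toy_path { struct toy_mnt *mnt; };

static void toy_mntget(struct toy_mnt *m) { m->refs++; }
static void toy_mntput(struct toy_mnt *m) { m->refs--; }

/*
 * Called after the traversal loop. 'start' is the mount the walk began
 * on and 'need_mntput' records that a reference was taken during the
 * loop. Callers expect to own a reference on path->mnt only if it
 * changed, so a reference on an unchanged mount is surplus and is
 * dropped here instead of being assumed never to exist.
 */
static void toy_finish_walk(struct toy_path *path, struct toy_mnt *start,
			    bool need_mntput)
{
	if (need_mntput && path->mnt == start)
		toy_mntput(path->mnt);
}
```

The design point is that the final state is inspected directly, so racing automounts or mount moves cannot leave the count off by one in either direction.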
The BUG can be reproduced with the following test program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0) wait(&ws);
		return 0;
	}
and the following procedure:
(1) Mount an NFS volume that on the server has something else mounted on a
subdirectory. For instance, I can mount / from my server:
mount warthog:/ /mnt -t nfs4 -r
On the server /data has another filesystem mounted on it, so NFS will see
a change in FSID as it walks down the path, and will mark /mnt/data as
being a mountpoint. This will cause the automount code to be triggered.
!!! Do not look inside the mounted fs at this point !!!
(2) Run the above program on a file within the submount to generate two
simultaneous automount requests:
/tmp/forkstat /mnt/data/testfile
(3) Unmount the automounted submount:
umount /mnt/data
(4) Unmount the original mount:
umount /mnt
At this point the kernel should throw a BUG with something like the
following:
BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]
Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:
[<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
[<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
[<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
[<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
[<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
[<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
[<ffffffff811629ff>] deactivate_super+0x6f/0x7b
[<ffffffff81186261>] mntput_no_expire+0x18d/0x199
[<ffffffff811862a8>] mntput+0x3b/0x44
[<ffffffff81186d87>] release_mounts+0xa2/0xbf
[<ffffffff811876af>] sys_umount+0x47a/0x4ba
[<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
[<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b
as do_umount() is inlined. However, you can see release_mounts() in there.
Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16 15:10:06 +01:00
	int ret = 0;
2020-01-17 08:45:08 -05:00
	while (flags & DCACHE_MANAGED_DENTRY) {
Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
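The three outcomes of ->d_manage() listed above map onto a simple caller-side interpretation. A minimal sketch follows; the `transit_action` enum and `toy_interpret_manage()` are hypothetical names used only for illustration.

```c
#include <errno.h>

enum transit_action {
	TRANSIT_CONTINUE,	/* 0: let the process continue on its way */
	TRANSIT_STAY,		/* -EISDIR: use this directory, no skipping or automounting */
	TRANSIT_ERROR		/* anything else: hand the error back to the user */
};

/* Interpret the integer returned by a d_manage-style callback. */
static enum transit_action toy_interpret_manage(int ret)
{
	if (ret == 0)
		return TRANSIT_CONTINUE;
	if (ret == -EISDIR)
		return TRANSIT_STAY;
	return TRANSIT_ERROR;
}
```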
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
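The ordering described above can be modelled in userspace as follows. This is only a sketch of one pass through the loop: the MOCK_* flags and mock_iteration() are hypothetical stand-ins for the DCACHE_* flags and follow_managed(), and the error handling after d_manage() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DCACHE_* management flags. */
#define MOCK_MANAGE_TRANSIT	0x1u
#define MOCK_MOUNTED		0x2u
#define MOCK_NEED_AUTOMOUNT	0x4u

enum mock_step { STEP_MANAGE, STEP_CROSS_MOUNT, STEP_AUTOMOUNT };

/*
 * Record the steps taken for one pass over a dentry with the given flags.
 * mounted_in_our_ns models a mount lookup finding something in our namespace.
 * Returns the number of steps written to out (at most 2).
 */
size_t mock_iteration(unsigned int flags, bool mounted_in_our_ns,
		      enum mock_step out[2])
{
	size_t n = 0;

	assert(out != NULL);

	/* The filesystem is consulted first on every managed directory. */
	if (flags & MOCK_MANAGE_TRANSIT)
		out[n++] = STEP_MANAGE;

	/* Then anything mounted here is crossed and the loop starts over. */
	if ((flags & MOCK_MOUNTED) && mounted_in_our_ns) {
		out[n++] = STEP_CROSS_MOUNT;
		return n;
	}

	/* Only an uncovered automount point reaches the automount step. */
	if (flags & MOCK_NEED_AUTOMOUNT)
		out[n++] = STEP_AUTOMOUNT;
	return n;
}
```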
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:26 +00:00
/* Allow the filesystem to manage the transit without i_mutex
 * being held. */
2019-11-04 22:30:52 -05:00
if (flags & DCACHE_MANAGE_TRANSIT) {
2016-11-24 08:03:41 +11:00
ret = path->dentry->d_op->d_manage(path, false);
2020-01-14 22:09:57 -05:00
flags = smp_load_acquire(&path->dentry->d_flags);
Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful, to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, so that this directory is used instead; or some other error code
to be returned to the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:26 +00:00
if (ret < 0)
VFS: Fix vfsmount overput on simultaneous automount
[Kudos to dhowells for tracking that crap down]
If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.
The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.
The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.
The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt. That assumption is almost correct - it breaks in the case of
racing automounts and in an even harder-to-hit race between following a
mountpoint and a couple of mount --move operations. The thing is, we don't
need to make that assumption at all - after the end of the loop in
follow_managed() we can check if path->mnt has ended up unchanged and do
mntput() if needed.
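The end-of-loop check can be sketched in userspace like this. The mock_* names are hypothetical and the reference count is a plain int, so this only illustrates the bookkeeping, not the kernel's actual mntput().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: a mount with a plain reference count. */
struct mock_vfsmount {
	int refs;
};

struct mock_path {
	struct mock_vfsmount *mnt;
};

void mock_mntput(struct mock_vfsmount *mnt)
{
	assert(mnt->refs > 0);
	mnt->refs--;
}

/*
 * Run after the transit loop. orig is path->mnt as it was on entry and
 * need_mntput records whether an extra reference was taken along the way.
 * If the walk ended up on the same mount, the caller expects no extra
 * reference, so it is dropped here.
 */
void mock_finish_walk(struct mock_path *path,
		      const struct mock_vfsmount *orig, bool need_mntput)
{
	if (need_mntput && path->mnt == orig)
		mock_mntput(path->mnt);
}
```

The point of checking at the end is that the loop body no longer has to predict which iterations will leave path->mnt changed.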
The BUG can be reproduced with the following test program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0)
			wait(&ws);
		return 0;
	}
and the following procedure:
 (1) Mount an NFS volume that on the server has something else mounted on a
     subdirectory. For instance, I can mount / from my server:
	mount warthog:/ /mnt -t nfs4 -r
     On the server /data has another filesystem mounted on it, so NFS will see
     a change in FSID as it walks down the path, and will mark /mnt/data as
     being a mountpoint. This will cause the automount code to be triggered.
     !!! Do not look inside the mounted fs at this point !!!
 (2) Run the above program on a file within the submount to generate two
     simultaneous automount requests:
	/tmp/forkstat /mnt/data/testfile
 (3) Unmount the automounted submount:
	umount /mnt/data
 (4) Unmount the original mount:
	umount /mnt
At this point the kernel should throw a BUG with something like the
following:
BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]
Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:
[<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
[<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
[<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
[<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
[<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
[<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
[<ffffffff811629ff>] deactivate_super+0x6f/0x7b
[<ffffffff81186261>] mntput_no_expire+0x18d/0x199
[<ffffffff811862a8>] mntput+0x3b/0x44
[<ffffffff81186d87>] release_mounts+0xa2/0xbf
[<ffffffff811876af>] sys_umount+0x47a/0x4ba
[<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
[<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b
as do_umount() is inlined. However, you can see release_mounts() in there.
Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16 15:10:06 +01:00
break;
Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful, to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, so that this directory is used instead; or some other error code
to be returned to the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:26 +00:00
}
2020-01-17 08:45:08 -05:00
if (flags & DCACHE_MOUNTED) {	// something's mounted on it..
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
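The four outcomes listed above could be told apart by a caller roughly as follows. This is a userspace sketch: the kernel encodes errors inside the returned pointer, whereas the hypothetical mock_automount_result struct here carries the pointer and the error separately to keep the example portable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the vfsmount the op creates. */
struct mock_vfsmount {
	int id;
};

/* Hypothetical: pointer and error carried side by side. */
struct mock_automount_result {
	struct mock_vfsmount *mnt;	/* new mount, or NULL */
	int err;			/* 0 or a negative errno */
};

enum mock_outcome {
	OUT_MOUNTED,		/* op returned the vfsmount it created */
	OUT_COLLISION,		/* NULL: another automount attempt won */
	OUT_USE_DIRECTORY,	/* -EISDIR: use the directory as it is */
	OUT_ERROR		/* any other error code */
};

enum mock_outcome mock_classify(struct mock_automount_result r)
{
	/* An error and a mount at the same time would be a contract bug. */
	assert(r.err >= 0 || r.mnt == NULL);

	if (r.err == -EISDIR)
		return OUT_USE_DIRECTORY;
	if (r.err < 0)
		return OUT_ERROR;
	if (r.mnt == NULL)
		return OUT_COLLISION;
	return OUT_MOUNTED;
}
```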
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
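A userspace sketch of those return codes is given below. The order of the checks, the MOCK_MAX_TRANSITS limit and the mock_walk structure are assumptions made for illustration only and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_TRANSITS 40	/* hypothetical limit on symlinks/mountpoints */

/* Hypothetical model of the state the subroutine looks at. */
struct mock_walk {
	bool has_automount_op;	/* did the filesystem supply the op? */
	unsigned int transits;	/* symlinks and mountpoints crossed so far */
	int op_result;		/* 0, or the negative errno the op would give */
	bool path_updated;	/* set when the path moves to the new mount */
};

int mock_follow_automount(struct mock_walk *w)
{
	assert(w != NULL);

	if (!w->has_automount_op)
		return -EREMOTE;	/* flag set but no op supplied */
	if (w->transits++ >= MOCK_MAX_TRANSITS)
		return -ELOOP;		/* too many symlinks or mountpoints */
	if (w->op_result < 0)
		return w->op_result;	/* includes -EISDIR: use as is */

	w->path_updated = true;		/* successful automount */
	return 0;
}
```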
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
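The mount/don't-mount decision described above might be modelled like this. The MOCK_LOOKUP_* bits are hypothetical names for the caller's intent and are not the kernel's actual lookup flags; the sketch only mirrors the rules stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical intent bits, one per case named in the text. */
#define MOCK_LOOKUP_FOLLOW	0x01u	/* terminal follow, e.g. stat() without a no-follow request */
#define MOCK_LOOKUP_CONTINUE	0x02u	/* more components follow: traversal */
#define MOCK_LOOKUP_DIRECTORY	0x04u	/* chdir()/chroot() or a trailing '/' */
#define MOCK_LOOKUP_OPEN	0x08u	/* open() */
#define MOCK_LOOKUP_CREATE	0x10u	/* creat() */
#define MOCK_LOOKUP_ALL		0x1fu

bool mock_should_automount(unsigned int flags)
{
	const unsigned int always = MOCK_LOOKUP_CONTINUE | MOCK_LOOKUP_DIRECTORY |
				    MOCK_LOOKUP_OPEN | MOCK_LOOKUP_CREATE;

	assert((flags & ~MOCK_LOOKUP_ALL) == 0);	/* no unknown bits */

	/* These operations make the mount go ahead regardless. */
	if (flags & always)
		return true;

	/* Otherwise (the stat() case) only a following lookup triggers it. */
	return (flags & MOCK_LOOKUP_FOLLOW) != 0;
}
```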
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
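The flag propagation can be pictured with the following userspace sketch. The mock_* structures and MOCK_* bit values are hypothetical; only the idea of copying the inode mark to the dentry at instantiation time comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's real constants differ. */
#define MOCK_S_AUTOMOUNT		0x1u
#define MOCK_DCACHE_NEED_AUTOMOUNT	0x1u

struct mock_inode {
	unsigned int i_flags;
};

struct mock_dentry {
	unsigned int d_flags;
	struct mock_inode *d_inode;
};

/* Attach an inode to a dentry, carrying the automount mark across. */
void mock_d_instantiate(struct mock_dentry *dentry, struct mock_inode *inode)
{
	assert(dentry != NULL);

	dentry->d_inode = inode;
	/* A negative dentry (inode == NULL) has nothing to propagate. */
	if (inode != NULL && (inode->i_flags & MOCK_S_AUTOMOUNT))
		dentry->d_flags |= MOCK_DCACHE_NEED_AUTOMOUNT;
}
```

Doing this at instantiation time works because that is the point where the dentry and the inode are both to hand.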
[AV: fixed breakage in the case where __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in the RCU case of do_lookup(); we need to fall through to the non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
struct vfsmount *mounted = lookup_mnt(path);
2020-01-17 08:45:08 -05:00
if (mounted) {		// ... in our namespace
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in the case where __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in the RCU case of do_lookup(); we need to fall through to the non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
dput(path->dentry);
if (need_mntput)
	mntput(path->mnt);
path->mnt = mounted;
path->dentry = dget(mounted->mnt_root);
2020-01-17 08:45:08 -05:00
// here we know it's positive
flags = path->dentry->d_flags;
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in the case where __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in the RCU case of do_lookup(); we need to fall through to the non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
need_mntput = true;
continue;
}
}
2020-01-17 08:45:08 -05:00
if (!(flags & DCACHE_NEED_AUTOMOUNT))
	break;
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in the case where __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in the RCU case of do_lookup(); we need to fall through to the non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
2020-01-17 08:45:08 -05:00
// uncovered automount point
ret = follow_automount(path, count, lookup_flags);
flags = smp_load_acquire(&path->dentry->d_flags);
if (ret < 0)
	break;
2005-04-16 15:20:36 -07:00
}
VFS: Fix vfsmount overput on simultaneous automount
[Kudos to dhowells for tracking that crap down]
If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.
The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.
The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.
The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt. That assumption is almost correct - it breaks in case of
racing automounts and in even harder to hit race between following a mountpoint
and a couple of mount --move. The thing is, we don't need to make that
assumption at all - after the end of loop in follow_managed() we can check
if path->mnt has ended up unchanged and do mntput() if needed.
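The reference rule described above can be sketched in userspace as follows. This is only a toy model under stated assumptions: struct toy_mount, struct toy_path and toy_follow_managed() are hypothetical stand-ins, not the kernel's types or code, and the list of "hops" simply simulates what successive mount lookups might return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the reference count matters here. */
struct toy_mount {
	int refs;
};

struct toy_path {
	struct toy_mount *mnt;
};

static void toy_mntget(struct toy_mount *m)
{
	m->refs++;
}

static void toy_mntput(struct toy_mount *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/*
 * Walk through 'nhops' simulated mount transitions.  Returns true if the
 * caller now owns a new reference on path->mnt, which must be the case
 * if and only if path->mnt differs from where the walk started.
 */
static bool toy_follow_managed(struct toy_path *path,
			       struct toy_mount **hops, size_t nhops)
{
	struct toy_mount *start = path->mnt;	/* held by caller, left alone */
	bool need_mntput = false;
	size_t i;

	for (i = 0; i < nhops; i++) {
		toy_mntget(hops[i]);		/* a lookup hands back a referenced mount */
		if (need_mntput)
			toy_mntput(path->mnt);	/* drop the ref taken for the previous hop */
		path->mnt = hops[i];
		need_mntput = true;
	}

	/* Ended up back on the starting mount with an extra ref: release it. */
	if (need_mntput && path->mnt == start) {
		toy_mntput(path->mnt);
		need_mntput = false;
	}
	return need_mntput;
}
```

In this model a walk that hops A -> B leaves the caller owning one extra reference on B, whilst a racy walk that hops A -> B -> A ends with the counts exactly as they started, because the final check gives the surplus reference back.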
The BUG can be reproduced with the following test program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0) wait(&ws);
		return 0;
	}
and the following procedure:
 (1) Mount an NFS volume that on the server has something else mounted on a
     subdirectory. For instance, I can mount / from my server:
	mount warthog:/ /mnt -t nfs4 -r
     On the server /data has another filesystem mounted on it, so NFS will see
     a change in FSID as it walks down the path, and will mark /mnt/data as
     being a mountpoint. This will cause the automount code to be triggered.
     !!! Do not look inside the mounted fs at this point !!!
 (2) Run the above program on a file within the submount to generate two
     simultaneous automount requests:
	/tmp/forkstat /mnt/data/testfile
 (3) Unmount the automounted submount:
	umount /mnt/data
 (4) Unmount the original mount:
	umount /mnt
At this point the kernel should throw a BUG with something like the
following:
BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]
Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:
[<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
[<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
[<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
[<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
[<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
[<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
[<ffffffff811629ff>] deactivate_super+0x6f/0x7b
[<ffffffff81186261>] mntput_no_expire+0x18d/0x199
[<ffffffff811862a8>] mntput+0x3b/0x44
[<ffffffff81186d87>] release_mounts+0xa2/0xbf
[<ffffffff811876af>] sys_umount+0x47a/0x4ba
[<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
[<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b
as do_umount() is inlined. However, you can see release_mounts() in there.
Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16 15:10:06 +01:00
2020-01-17 08:45:08 -05:00
	if (ret == -EISDIR)
		ret = 0;
	// possible if you race with several mount --move
	if (need_mntput && path->mnt == mnt)
		mntput(path->mnt);
	if (!ret && unlikely(d_flags_negative(flags)))
2019-11-04 22:30:52 -05:00
		ret = -ENOENT;
2020-01-17 08:45:08 -05:00
	*jumped = need_mntput;
2015-04-22 10:30:08 -04:00
	return ret;
2005-04-16 15:20:36 -07:00
}
2020-01-17 08:45:08 -05:00
static inline int traverse_mounts(struct path *path, bool *jumped,
				  int *count, unsigned lookup_flags)
{
	unsigned flags = smp_load_acquire(&path->dentry->d_flags);
	/* fastpath */
	if (likely(!(flags & DCACHE_MANAGED_DENTRY))) {
		*jumped = false;
		if (unlikely(d_flags_negative(flags)))
			return -ENOENT;
		return 0;
	}
	return __traverse_mounts(path, flags, jumped, count, lookup_flags);
}
Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful, to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
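As an illustration of the three kinds of return value, here is a minimal userspace sketch. Everything in it is an assumption made for the example: struct toy_path and its fields are hypothetical stand-ins for whatever state a real filesystem would consult, and a real ->d_manage() would typically sleep rather than return immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-directory state; not the kernel's struct path. */
struct toy_path {
	bool caller_is_daemon;	/* e.g. the filesystem's own userspace daemon */
	bool use_this_dir;	/* callers should stay on this directory */
	bool forbidden;		/* callers may not transit at all */
};

/*
 * Toy d_manage(): 0 lets the caller continue on its way, -EISDIR tells
 * it to use this directory without skipping to overmounts or
 * automounting, anything else is an error handed back to the user.
 */
static int toy_d_manage(const struct toy_path *path, bool mounting_here)
{
	assert(path != NULL);

	/* This toy never waits, so mounting_here changes nothing here. */
	(void)mounting_here;

	if (path->caller_is_daemon)
		return 0;
	if (path->use_this_dir)
		return -EISDIR;
	if (path->forbidden)
		return -EACCES;
	return 0;
}
```

The choice of -EACCES for the "some other error" case is arbitrary; the point is only that 0 and -EISDIR are treated specially and every other value is passed back.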
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:26 +00:00
int follow_down_one(struct path *path)
2005-04-16 15:20:36 -07:00
{
	struct vfsmount *mounted;
2009-04-18 14:06:57 -04:00
	mounted = lookup_mnt(path);
2005-04-16 15:20:36 -07:00
	if (mounted) {
2009-04-18 13:58:15 -04:00
		dput(path->dentry);
		mntput(path->mnt);
		path->mnt = mounted;
		path->dentry = dget(mounted->mnt_root);
2005-04-16 15:20:36 -07:00
		return 1;
	}
	return 0;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(follow_down_one);
2005-04-16 15:20:36 -07:00
2020-01-17 08:45:08 -05:00
/*
 * Follow down to the covering mount currently visible to userspace.  At each
 * point, the filesystem owning that dentry may be queried as to whether the
 * caller is permitted to proceed or not.
 */
int follow_down(struct path *path)
{
	struct vfsmount *mnt = path->mnt;
	bool jumped;
	int ret = traverse_mounts(path, &jumped, NULL, 0);
	if (path->mnt != mnt)
		mntput(mnt);
	return ret;
}
EXPORT_SYMBOL(follow_down);
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
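A caller has to tell these four outcomes apart from a single pointer return. The sketch below shows one way to do that in userspace, assuming error codes are encoded in the pointer value in the style of the kernel's ERR_PTR()/IS_ERR() helpers; the toy_* helpers and the enum are hypothetical names written for this example, not kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

struct toy_vfsmount {
	int id;
};

/* Local imitations of pointer-encoded errors (assumption, see above). */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

enum toy_outcome {
	TOY_MOUNTED,		/* a new mount was created and returned */
	TOY_COLLIDED,		/* NULL: someone else got there first */
	TOY_USE_DIRECTORY,	/* -EISDIR: use the directory as it is */
	TOY_FAILED,		/* any other error */
};

static enum toy_outcome toy_classify(const struct toy_vfsmount *mnt, long *err)
{
	assert(err != NULL);
	*err = 0;

	if (mnt == NULL)
		return TOY_COLLIDED;
	if (toy_is_err(mnt)) {
		*err = toy_ptr_err(mnt);
		return *err == -EISDIR ? TOY_USE_DIRECTORY : TOY_FAILED;
	}
	return TOY_MOUNTED;
}
```

Note that NULL is not an error in this scheme: it only says that nothing new needs to be added because another automount attempt won the race.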
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
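The mount/don't-mount rule in the paragraph above might be modelled like this. The flag names are hypothetical and only mirror the cases listed in the text; they are not the kernel's LOOKUP_* values, and the function is a sketch of the decision, not the code in fs/namei.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags describing how the walk arrived at the automount point. */
enum {
	TOY_CONTINUE  = 1 << 0,	/* more pathname components follow */
	TOY_DIRECTORY = 1 << 1,	/* chdir/chroot/etc., or a trailing '/' */
	TOY_OPEN      = 1 << 2,	/* open() */
	TOY_CREATE    = 1 << 3,	/* creat() */
	TOY_FOLLOW    = 1 << 4,	/* caller did not ask for no-follow */
	TOY_ALL_FLAGS = (1 << 5) - 1,
};

/* Returns true if the automount should be triggered. */
static bool toy_should_automount(unsigned int flags)
{
	assert((flags & ~(unsigned int)TOY_ALL_FLAGS) == 0);

	/* Opening, creating, traversing or entering always mounts. */
	if (flags & (TOY_CONTINUE | TOY_DIRECTORY | TOY_OPEN | TOY_CREATE))
		return true;

	/* Otherwise (e.g. a stat()) mount only if following was requested. */
	return (flags & TOY_FOLLOW) != 0;
}
```

So a plain stat-like lookup with following enabled mounts, the same lookup with no-follow semantics does not, and any of the "go ahead anyway" cases overrides the no-follow request.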
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
/*
2011-05-27 06:50:06 -04:00
 * Try to skip to top of mountpoint pile in rcuwalk mode.  Fail if
 * we meet a managed dentry that would need blocking.
2011-01-14 18:45:21 +00:00
*/
2022-07-03 22:35:56 -04:00
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
2011-01-14 18:45:21 +00:00
{
2020-01-16 09:52:04 -05:00
	struct dentry *dentry = path->dentry;
	unsigned int flags = dentry->d_flags;
	if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
		return true;
	if (unlikely(nd->flags & LOOKUP_NO_XDEV))
		return false;
2011-03-25 01:51:02 +08:00
	for (;;) {
		/*
		 * Don't forget we might have a non-mountpoint managed dentry
		 * that wants to block transit.
		 */
2020-01-16 09:52:04 -05:00
		if (unlikely(flags & DCACHE_MANAGE_TRANSIT)) {
			int res = dentry->d_op->d_manage(path, true);
			if (res)
				return res == -EISDIR;
			flags = dentry->d_flags;
2014-08-04 17:06:29 +10:00
		}
2011-03-25 01:51:02 +08:00
2020-01-16 09:52:04 -05:00
		if (flags & DCACHE_MOUNTED) {
			struct mount *mounted = __lookup_mnt(path->mnt, dentry);
			if (mounted) {
				path->mnt = &mounted->mnt;
				dentry = path->dentry = mounted->mnt.mnt_root;
2021-04-01 22:03:41 -04:00
				nd->state |= ND_JUMPED;
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
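The "sample one counter, verify the other" pattern for the parent-to-child transition can be sketched as a single-threaded toy. The names below are hypothetical, and the model deliberately leaves out the memory barriers and spinning that a real seqcount needs; it only shows why a disruption between the two samples makes the step fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sequence counter: even means stable, odd means a write is in progress. */
struct toy_seq {
	unsigned int seq;
};

struct toy_dentry {
	struct toy_seq d_seq;
};

/* Sample with the low bit cleared, so a sample taken mid-write never verifies. */
static unsigned int toy_sample(const struct toy_seq *s)
{
	return s->seq & ~1u;
}

static bool toy_retry(const struct toy_seq *s, unsigned int start)
{
	return s->seq != start;
}

static void toy_write_begin(struct toy_seq *s)
{
	s->seq++;
}

static void toy_write_end(struct toy_seq *s)
{
	assert(s->seq & 1u);	/* must pair with toy_write_begin() */
	s->seq++;
}

/* Anything that disrupts the parent-child relationship bumps both counters. */
static void toy_disrupt(struct toy_dentry *parent, struct toy_dentry *child)
{
	toy_write_begin(&parent->d_seq);
	toy_write_begin(&child->d_seq);
	toy_write_end(&child->d_seq);
	toy_write_end(&parent->d_seq);
}

/*
 * Parent-to-child step: sample the child's counter, then verify that the
 * parent's counter still matches the value sampled earlier.  On success
 * the child's sample is stored in *next_seq for the next step.
 */
static bool toy_step_into_child(const struct toy_dentry *parent,
				unsigned int parent_seq,
				const struct toy_dentry *child,
				unsigned int *next_seq)
{
	unsigned int seq = toy_sample(&child->d_seq);

	if (toy_retry(&parent->d_seq, parent_seq))
		return false;	/* a real walk would fall back to non-RCU mode */
	*next_seq = seq;
	return true;
}
```

A step taken with a stale parent sample fails, and succeeds again once the parent has been resampled, which is the behaviour the invariant above relies upon.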
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-04 18:12:39 -04:00
				nd->next_seq = read_seqcount_begin(&dentry->d_seq);
2020-01-16 09:52:04 -05:00
				flags = dentry->d_flags;
2022-07-04 18:12:39 -04:00
				// makes sure that non-RCU pathwalk could reach
				// this state.
2022-07-04 17:26:29 -04:00
				if (read_seqretry(&mount_lock, nd->m_seq))
					return false;
2020-01-16 09:52:04 -05:00
				continue;
			}
			if (read_seqretry(&mount_lock, nd->m_seq))
				return false;
		}
		return !(flags & DCACHE_NEED_AUTOMOUNT);
2011-01-14 18:45:21 +00:00
	}
2011-05-27 06:50:06 -04:00
}
2020-01-09 14:41:00 -05:00
static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
2022-07-03 22:35:56 -04:00
			  struct path *path)
2020-01-08 20:37:23 -05:00
{
2020-01-17 08:45:08 -05:00
	bool jumped;
2020-01-09 14:41:00 -05:00
	int ret;
2020-01-08 20:37:23 -05:00
2020-01-09 14:41:00 -05:00
	path->mnt = nd->path.mnt;
	path->dentry = dentry;
2020-01-09 14:50:18 -05:00
	if (nd->flags & LOOKUP_RCU) {
2022-07-04 18:12:39 -04:00
unsigned int seq = nd->next_seq;
2022-07-03 22:35:56 -04:00
if (likely(__follow_mount_rcu(nd, path)))
2020-01-17 08:45:08 -05:00
return 0;
2022-07-04 18:12:39 -04:00
// *path and nd->next_seq might've been clobbered
2020-01-09 14:50:18 -05:00
path->mnt = nd->path.mnt;
path->dentry = dentry;
2022-07-04 18:12:39 -04:00
nd->next_seq = seq;
if (!try_to_unlazy_next(nd, dentry))
return -ECHILD;
2020-01-09 14:50:18 -05:00
}
2020-01-17 08:45:08 -05:00
ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
if (jumped) {
if (unlikely(nd->flags & LOOKUP_NO_XDEV))
ret = -EXDEV;
else
2021-04-01 22:03:41 -04:00
nd->state |= ND_JUMPED;
2020-01-17 08:45:08 -05:00
}
if (unlikely(ret)) {
dput(path->dentry);
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
2020-01-08 20:37:23 -05:00
}
return ret;
}
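The sequence-number rule that handle_mounts() and the rest of the RCU pathwalk depend on (sample the child's counter, then verify that the parent's counter has not moved) can be illustrated outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not kernel code: struct seq_obj, seq_sample(), seq_changed(), seq_write() and step_to_child() are hypothetical names standing in for a dentry's ->d_seq and the parent-to-child transition. It assumes C11 atomics and, unlike the kernel's seqcount helpers, it never spins while a writer is active.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry and its ->d_seq (not the kernel's seqcount_t). */
struct seq_obj {
	atomic_uint seq;	/* even: stable, odd: a writer is in progress */
	int value;		/* the data the counter guards */
};

/* Reader side: sample the counter before looking at the object. */
static unsigned int seq_sample(struct seq_obj *o)
{
	return atomic_load_explicit(&o->seq, memory_order_acquire);
}

/* Reader side: true if the object may have changed since seq_sample(). */
static bool seq_changed(struct seq_obj *o, unsigned int sampled)
{
	atomic_thread_fence(memory_order_acquire);
	return (sampled & 1) ||
	       atomic_load_explicit(&o->seq, memory_order_relaxed) != sampled;
}

/* Writer side: bump to odd, modify, bump back to even. */
static void seq_write(struct seq_obj *o, int v)
{
	atomic_fetch_add_explicit(&o->seq, 1, memory_order_acq_rel);
	o->value = v;
	atomic_fetch_add_explicit(&o->seq, 1, memory_order_release);
}

/*
 * Parent-to-child step: sample the child's counter, then check that the
 * parent was not touched since parent_seq was sampled. Returns 0 on
 * success and -1 when the caller must fall back to the locked walk.
 */
static int step_to_child(struct seq_obj *parent, unsigned int parent_seq,
			 struct seq_obj *child, unsigned int *child_seq)
{
	*child_seq = seq_sample(child);
	if (seq_changed(parent, parent_seq))
		return -1;
	return 0;
}
```

In this sketch a failed step_to_child() plays the role of the -ECHILD path: the lockless attempt is abandoned rather than retried in place.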
2010-08-18 04:37:31 +10:00
/*
2016-07-07 22:04:04 -04:00
* This looks up the name in dcache and possibly revalidates the found dentry.
* NULL is returned if the dentry does not exist in the cache.
2010-08-18 04:37:31 +10:00
*/
2016-03-06 14:03:27 -05:00
static struct dentry *lookup_dcache(const struct qstr *name,
struct dentry *dir,
2016-03-05 20:09:32 -05:00
unsigned int flags)
2010-08-18 04:37:31 +10:00
{
2017-01-09 22:25:28 -05:00
struct dentry *dentry = d_lookup(dir, name);
2012-03-26 12:54:24 +02:00
if (dentry) {
2017-01-09 22:25:28 -05:00
int error = d_revalidate(dentry, flags);
if (unlikely(error <= 0)) {
if (!error)
d_invalidate(dentry);
dput(dentry);
return ERR_PTR(error);
2012-03-26 12:54:24 +02:00
}
}
2010-08-18 04:37:31 +10:00
return dentry ;
}
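lookup_dcache() packs three outcomes into one pointer: a usable dentry, NULL for a cache miss, and an ERR_PTR()-encoded errno. Because ERR_PTR(0) is NULL, a dentry that fails revalidation with error == 0 is reported as a miss. The following is a userspace sketch of that convention. The ERR_PTR(), PTR_ERR() and IS_ERR() definitions are re-implementations written for illustration (the kernel provides its own in its headers), and enum lookup_result and classify() are hypothetical helpers that do not exist in fs/namei.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value that is encoded in a pointer (illustrative constant). */
#define MAX_ERRNO 4095

/* Userspace re-implementations, for illustration only. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical names for the three outcomes a caller has to distinguish. */
enum lookup_result { LOOKUP_HIT, LOOKUP_MISS, LOOKUP_ERROR };

static enum lookup_result classify(const void *dentry)
{
	if (IS_ERR(dentry))
		return LOOKUP_ERROR;	/* e.g. ERR_PTR(-ECHILD) */
	if (!dentry)
		return LOOKUP_MISS;	/* not cached, or invalidated */
	return LOOKUP_HIT;
}
```

A caller would test IS_ERR() first, then NULL, and only then treat the pointer as a dentry, which is the order classify() uses.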
2011-05-31 11:58:49 -04:00
/*
2018-03-08 11:00:45 -05:00
* Parent directory has inode locked exclusive. This is one
* and only case when ->lookup() gets called on non in-lookup
* dentries - as the matter of fact, this only gets called
* when directory is guaranteed to have no in-lookup children
* at all.
2011-05-31 11:58:49 -04:00
*/
2016-03-06 14:03:27 -05:00
static struct dentry *__lookup_hash(const struct qstr *name,
2012-06-10 17:17:17 -04:00
struct dentry *base, unsigned int flags)
2012-03-30 14:41:51 -04:00
{
2016-03-05 20:09:32 -05:00
struct dentry *dentry = lookup_dcache(name, base, flags);
2018-03-08 11:00:45 -05:00
struct dentry *old;
struct inode *dir = base->d_inode;
2012-03-30 14:41:51 -04:00
2016-03-05 20:09:32 -05:00
if (dentry)
2012-03-26 12:54:24 +02:00
return dentry;
2012-03-30 14:41:51 -04:00
2018-03-08 11:00:45 -05:00
/* Don't create child dentry for a dead directory. */
if (unlikely(IS_DEADDIR(dir)))
return ERR_PTR(-ENOENT);
2016-03-05 20:09:32 -05:00
dentry = d_alloc(base, name);
if (unlikely(!dentry))
return ERR_PTR(-ENOMEM);
2018-03-08 11:00:45 -05:00
old = dir->i_op->lookup(dir, dentry, flags);
if (unlikely(old)) {
dput(dentry);
dentry = old;
}
return dentry;
2012-03-30 14:41:51 -04:00
}
2022-07-03 22:20:20 -04:00
static struct dentry *lookup_fast(struct nameidata *nd)
2005-04-16 15:20:36 -07:00
{
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
struct dentry *dentry, *parent = nd->path.dentry;
untangle do_lookup()
That thing has devolved into rats nest of gotos; sane use of unlikely()
gets rid of that horror and gives much more readable structure:
* make a fast attempt to find a dentry; false negatives are OK.
In RCU mode if everything went fine, we are done, otherwise just drop
out of RCU. If we'd done (RCU) ->d_revalidate() and it had not refused
outright (i.e. didn't give us -ECHILD), remember its result.
* now we are not in RCU mode and hopefully have a dentry. If we
do not, lock parent, do full d_lookup() and if that has not found anything,
allocate and call ->lookup(). If we'd done that ->lookup(), remember that
dentry is good and we don't need to revalidate it.
* now we have a dentry. If it has ->d_revalidate() and we can't
skip it, call it.
* hopefully dentry is good; if not, either fail (in case of error)
or try to invalidate it. If d_invalidate() has succeeded, drop it and
retry everything as if original attempt had not found a dentry.
* now we can finish it up - deal with mountpoint crossing and
automount.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-11 04:44:53 -05:00
int status = 1;
2011-01-14 18:45:21 +00:00
fs: remove extra lookup in __lookup_hash
Optimize lookup for create operations, where no dentry should often be
common-case. In cases where it is not, such as unlink, the added overhead
is much smaller than the removed.
Also, move comments about __d_lookup racyness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
- We are interested in how the RCU lookup works here, particularly with
renames. Make that explicit, and point to the document where it is explained
in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
code used to work, more about the interesting parts of how it works now. So
comments about lazy LRU may be interesting, but would better be done in the
LRU or refcount management code.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 04:37:34 +10:00
/*
* Rename seqlock is not required here because in the off chance
2016-03-05 21:32:53 -05:00
* of a false negative due to a concurrent rename, the caller is
* going to fall back to non-racy lookup.
2010-08-18 04:37:34 +10:00
*/
2011-01-07 17:49:52 +11:00
if (nd->flags & LOOKUP_RCU) {
2022-07-04 18:12:39 -04:00
dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
2016-03-05 21:32:53 -05:00
if (unlikely(!dentry)) {
2020-12-17 09:19:08 -07:00
if (!try_to_unlazy(nd))
lookup_fast(): take mount traversal into callers
Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-09 14:58:31 -05:00
return ERR_PTR(-ECHILD);
return NULL;
2016-03-05 21:32:53 -05:00
}
2011-03-11 04:44:53 -05:00
vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces
The calling conventions for __d_lookup_rcu() and dentry_cmp() are
annoying in different ways, and there is actually one single underlying
reason for both of the annoyances.
The fundamental reason is that we do the returned dentry sequence number
check inside __d_lookup_rcu() instead of doing it in the caller. This
results in two annoyances:
- __d_lookup_rcu() now not only needs to return the dentry and the
sequence number that goes along with the lookup, it also needs to
return the inode pointer that was validated by that sequence number
check.
- and because we did the sequence number check early (to validate the
name pointer and length) we also couldn't just pass the dentry itself
to dentry_cmp(), we had to pass the counted string that contained the
name.
So that sequence number decision caused two separate ugly calling
conventions.
Both of these problems would be solved if we just did the sequence
number check in the caller instead. There's only one caller, and that
caller already has to do the sequence number check for the parent
anyway, so just do that.
That allows us to stop returning the dentry->d_inode in that in-out
argument (pointer-to-pointer-to-inode), so we can make the inode
argument just a regular input inode pointer. The caller can just load
the inode from dentry->d_inode, and then do the sequence number check
after that to make sure that it's synchronized with the name we looked
up.
And it allows us to just pass in the dentry to dentry_cmp(), which is
what all the callers really wanted. Sure, dentry_cmp() has to be a bit
careful about the dentry (which is not stable during RCU lookup), but
that's actually very simple.
And now that dentry_cmp() can clearly see that the first string argument
is a dentry, we can use the direct word access for that, instead of the
careful unaligned zero-padding. The dentry name is always properly
aligned, since it is a single path component that is either embedded
into the dentry itself, or was allocated with kmalloc() (see __d_alloc).
Finally, this also uninlines the nasty slow-case for dentry comparisons:
that one *does* need to do a sequence number check, since it will call
in to the low-level filesystems, and we want to give those a stable
inode pointer and path component length/start arguments. Doing an extra
sequence check for that slow case is not a problem, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-04 14:59:14 -07:00
/*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
*/
2022-07-03 22:20:20 -04:00
if (read_seqcount_retry(&parent->d_seq, nd->seq))
2020-01-09 14:58:31 -05:00
return ERR_PTR(-ECHILD);
2011-03-11 04:44:53 -05:00
2017-01-09 22:25:28 -05:00
status = d_revalidate(dentry, nd->flags);
2020-01-09 14:50:18 -05:00
if (likely(status > 0))
2020-01-09 14:58:31 -05:00
return dentry;
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
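The parent-to-child rule above (sample the child's ->d_seq, then verify that the parent's ->d_seq is unchanged) can be illustrated with a small userspace sketch. It assumes C11 atomics and a toy sequence counter; struct seqcount, struct node and every function name here are hypothetical stand-ins, not the kernel's seqcount_t API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy sequence counter: writers bump it before and after each update. */
struct seqcount { atomic_uint sequence; };

/* Hypothetical stand-in for a dentry carrying a ->d_seq. */
struct node { struct seqcount d_seq; };

/* Sample the counter; the low bit is masked so an in-flight write is
 * seen as a mismatch on the later check rather than accepted. */
static unsigned seq_sample(struct seqcount *s)
{
	return atomic_load_explicit(&s->sequence, memory_order_acquire) & ~1u;
}

/* True if anything was written since the given sample was taken. */
static bool seq_changed(struct seqcount *s, unsigned seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != seq;
}

static void seq_write_begin(struct seqcount *s)
{
	atomic_fetch_add(&s->sequence, 1);
}

static void seq_write_end(struct seqcount *s)
{
	atomic_fetch_add(&s->sequence, 1);
}

/* Parent-to-child step: sample the child first, then verify the parent.
 * A false return corresponds to bailing out of the lockless walk. */
static bool step_to_child(struct node *parent, unsigned parent_seq,
			  struct node *child, unsigned *next_seq)
{
	*next_seq = seq_sample(&child->d_seq);
	if (seq_changed(&parent->d_seq, parent_seq))
		return false;
	return true;
}
```

The order matters: because the child is sampled before the parent is re-checked, a successful check shows that the sampled child value was taken while the parent-child relationship was still intact.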
2022-07-04 18:12:39 -04:00
		if (!try_to_unlazy_next(nd, dentry))
2020-01-09 14:58:31 -05:00
			return ERR_PTR(-ECHILD);
fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
Running my yearly branch profiling code, it detected a 100% wrong branch
condition in namei.c for lookup_fast(). The code in question has:
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
return dentry;
if (unlazy_child(nd, dentry, seq))
return ERR_PTR(-ECHILD);
if (unlikely(status == -ECHILD))
/* we'd been told to redo it in non-rcu mode */
status = d_revalidate(dentry, nd->flags);
If the status of the d_revalidate() is greater than zero, then the function
finishes. Otherwise, if it is an "unlazy_child" it returns with -ECHILD.
After the above two checks, the status is compared to -ECHILD, as that is
what is returned if the original d_revalidate() needed to be done in a
non-rcu mode.
Especially this path is called in a condition of:
if (nd->flags & LOOKUP_RCU) {
And most of the d_revalidate() functions have:
if (flags & LOOKUP_RCU)
return -ECHILD;
It appears that that is the only case that this if statement is triggered
on two of my machines, running in production.
As it is dependent on what filesystem mix is configured in the running
kernel, simply remove the unlikely() from the if statement.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
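The pattern discussed in this commit message can be sketched as below: a revalidation hook that refuses to run in RCU mode, and a caller that retries once RCU mode has been left. This is an illustrative userspace sketch only. likely()/unlikely() are defined locally via __builtin_expect (a GCC/Clang builtin), and DEMO_LOOKUP_RCU, demo_revalidate() and revalidate_with_fallback() are hypothetical names with a placeholder flag value.

```c
#include <assert.h>
#include <errno.h>

/* Branch hints in the style of the kernel macros (GCC/Clang only). */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Placeholder flag bit; the real LOOKUP_RCU value is not assumed here. */
#define DEMO_LOOKUP_RCU 0x1u

/* Hypothetical ->d_revalidate(): refuses to do its work in RCU mode. */
static int demo_revalidate(unsigned int flags)
{
	if (flags & DEMO_LOOKUP_RCU)
		return -ECHILD;	/* "redo it in non-RCU mode" */
	return 1;		/* dentry is valid */
}

/* Try in RCU mode first; on -ECHILD, leave RCU mode and try again. */
static int revalidate_with_fallback(unsigned int flags, int *retries)
{
	int status = demo_revalidate(flags);

	if (likely(status > 0))
		return status;
	flags &= ~DEMO_LOOKUP_RCU;	/* stands in for unlazying the walk */
	/* No unlikely() here: for filesystems whose hook always refuses
	 * RCU mode, this branch is taken every time. */
	if (status == -ECHILD) {
		(*retries)++;
		status = demo_revalidate(flags);
	}
	return status;
}
```

A hint that is wrong nearly every time is worse than no hint, so when the outcome depends on which filesystems are in use, leaving the branch unannotated is the neutral choice.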
2020-12-09 17:09:28 -05:00
		if (status == -ECHILD)
2017-01-09 01:35:39 -05:00
			/* we'd been told to redo it in non-rcu mode */
			status = d_revalidate(dentry, nd->flags);
2011-03-11 04:44:53 -05:00
	} else {
2013-01-24 18:16:00 -05:00
		dentry = __d_lookup(parent, &nd->last);
2016-03-05 21:32:53 -05:00
		if (unlikely(!dentry))
2020-01-09 14:58:31 -05:00
			return NULL;
2017-01-09 22:25:28 -05:00
		status = d_revalidate(dentry, nd->flags);
Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:21 +00:00
	}
2011-03-11 04:44:53 -05:00
	if (unlikely(status <= 0)) {
2016-03-05 22:04:59 -05:00
		if (!status)
2016-03-05 21:32:53 -05:00
			d_invalidate(dentry);
2014-02-13 09:46:25 -08:00
		dput(dentry);
2020-01-09 14:58:31 -05:00
		return ERR_PTR(status);
2011-02-15 01:26:22 -05:00
	}
2020-01-09 14:58:31 -05:00
	return dentry;
2012-05-21 17:30:05 +02:00
}
/* Fast lookup failed, do it the slow way */
2018-04-06 16:43:47 -04:00
static struct dentry *__lookup_slow(const struct qstr *name,
				    struct dentry *dir,
				    unsigned int flags)
2012-05-21 17:30:05 +02:00
{
2018-04-06 16:43:47 -04:00
	struct dentry *dentry, *old;
2016-04-14 19:33:34 -04:00
	struct inode *inode = dir->d_inode;
2016-04-15 03:33:13 -04:00
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
2016-04-14 19:33:34 -04:00
	/* Don't go there if it's already dead */
2016-04-15 02:42:04 -04:00
	if (unlikely(IS_DEADDIR(inode)))
2018-04-06 16:43:47 -04:00
		return ERR_PTR(-ENOENT);
2016-04-15 02:42:04 -04:00
again:
2016-04-15 03:33:13 -04:00
	dentry = d_alloc_parallel(dir, name, &wq);
2016-04-15 02:42:04 -04:00
	if (IS_ERR(dentry))
2018-04-06 16:43:47 -04:00
		return dentry;
2016-04-15 02:42:04 -04:00
	if (unlikely(!d_in_lookup(dentry))) {
2020-01-10 17:17:19 -05:00
		int error = d_revalidate(dentry, flags);
		if (unlikely(error <= 0)) {
			if (!error) {
				d_invalidate(dentry);
2016-03-06 14:20:52 -05:00
				dput(dentry);
2020-01-10 17:17:19 -05:00
				goto again;
2016-03-06 14:20:52 -05:00
			}
2020-01-10 17:17:19 -05:00
			dput(dentry);
			dentry = ERR_PTR(error);
2016-03-06 14:20:52 -05:00
		}
2016-04-15 02:42:04 -04:00
	} else {
		old = inode->i_op->lookup(inode, dentry, flags);
		d_lookup_done(dentry);
		if (unlikely(old)) {
			dput(dentry);
			dentry = old;
2016-03-06 14:20:52 -05:00
		}
	}
2016-03-06 14:03:27 -05:00
	return dentry;
2005-04-16 15:20:36 -07:00
}
2018-04-06 16:43:47 -04:00
static struct dentry *lookup_slow(const struct qstr *name,
				  struct dentry *dir,
				  unsigned int flags)
{
	struct inode *inode = dir->d_inode;
	struct dentry *res;
	inode_lock_shared(inode);
	res = __lookup_slow(name, dir, flags);
	inode_unlock_shared(inode);
	return res;
}
2021-01-21 14:19:31 +01:00
static inline int may_lookup(struct user_namespace *mnt_userns,
			     struct nameidata *nd)
2011-02-21 21:34:47 -05:00
{
	if (nd->flags & LOOKUP_RCU) {
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystems that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently change ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown(2), changing ownership of large sets of files is
instantaneous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is common with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD: To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor referring to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystems are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that a port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example of a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
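The effect shown in the listing above, where files owned by uid 1000 appear as owned by uid 1001 through the idmapped mount, amounts to translating ids through a range mapping. The sketch below is a userspace illustration under assumptions: reading the 1000:1001:1 part of the example as (from, to, range length) is inferred from the listing, and struct id_extent, ID_UNMAPPED and both functions are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker for "this id has no mapping". */
#define ID_UNMAPPED ((uint32_t)-1)

/* One contiguous mapping range: ids [from, from + count) on the
 * filesystem side appear as [to, to + count) through the mount. */
struct id_extent {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Filesystem id to the id seen through the mount. */
static uint32_t map_id_to_mount(const struct id_extent *e, uint32_t fs_id)
{
	assert(e->count > 0);
	if (fs_id >= e->from && fs_id - e->from < e->count)
		return e->to + (fs_id - e->from);
	return ID_UNMAPPED;
}

/* Id seen through the mount back to the filesystem id, used when a
 * caller on the mount creates or chowns a file. */
static uint32_t map_id_from_mount(const struct id_extent *e, uint32_t mnt_id)
{
	assert(e->count > 0);
	if (mnt_id >= e->to && mnt_id - e->to < e->count)
		return e->from + (mnt_id - e->to);
	return ID_UNMAPPED;
}
```

With the extent { 1000, 1001, 1 }, a file stored as uid 1000 is reported as 1001 through the mount, and a file created by uid 1001 through the mount is stored as 1000, which matches the /mnt/my-file and /home/ubuntu/my-file listings in the example.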
2021-02-23 13:39:45 -08:00
		int err = inode_permission(mnt_userns, nd->inode, MAY_EXEC | MAY_NOT_BLOCK);
2020-12-17 09:19:08 -07:00
		if (err != -ECHILD || !try_to_unlazy(nd))
2011-02-21 21:34:47 -05:00
			return err;
	}
2021-01-21 14:19:31 +01:00
	return inode_permission(mnt_userns, nd->inode, MAY_EXEC);
2011-02-21 21:34:47 -05:00
}
2022-07-04 18:12:39 -04:00
static int reserve_stack(struct nameidata *nd, struct path *link)
2020-03-03 11:22:34 -05:00
{
	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS))
		return -ELOOP;
2020-03-03 11:25:31 -05:00
	if (likely(nd->depth != EMBEDDED_LEVELS))
		return 0;
	if (likely(nd->stack != nd->internal))
		return 0;
2020-03-03 11:43:55 -05:00
	if (likely(nd_alloc_stack(nd)))
2020-03-03 11:22:34 -05:00
		return 0;
2020-03-03 11:43:55 -05:00
	if (nd->flags & LOOKUP_RCU) {
		// we need to grab link before we do unlazy. And we can't skip
		// unlazy even if we fail to grab the link - cleanup needs it
2022-07-04 18:12:39 -04:00
		bool grabbed_link = legitimize_path(nd, link, nd->next_seq);
2020-03-03 11:43:55 -05:00
2022-01-07 12:20:37 -05:00
		if (!try_to_unlazy(nd) || !grabbed_link)
2020-03-03 11:43:55 -05:00
			return -ECHILD;
		if (nd_alloc_stack(nd))
			return 0;
2020-03-03 11:22:34 -05:00
	}
2020-03-03 11:43:55 -05:00
	return -ENOMEM;
2020-03-03 11:22:34 -05:00
}
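The storage strategy that reserve_stack() relies on, a small embedded array that is swapped for a heap array only once it fills up, can be sketched in userspace as follows. This is a simplified illustration and leaves out the RCU handling entirely. struct walk, struct saved, walk_init(), walk_release() and reserve_slot() are hypothetical names, malloc() stands in for the kernel allocator, and the two limits are illustrative constants.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define EMBEDDED_LEVELS	2	/* slots available without allocating */
#define MAX_LINKS	40	/* overall cap on links followed */

struct saved { int id; };	/* hypothetical per-link state */

struct walk {
	struct saved *stack;			/* current storage */
	struct saved internal[EMBEDDED_LEVELS];	/* embedded storage */
	unsigned depth;
	unsigned total_link_count;
};

static void walk_init(struct walk *w)
{
	memset(w, 0, sizeof(*w));
	w->stack = w->internal;
}

static void walk_release(struct walk *w)
{
	if (w->stack != w->internal)
		free(w->stack);
	w->stack = w->internal;
}

/* Make sure there is room for one more entry at w->stack[w->depth]. */
static int reserve_slot(struct walk *w)
{
	struct saved *p;

	if (w->total_link_count++ >= MAX_LINKS)
		return -ELOOP;		/* too many links overall */
	if (w->depth != EMBEDDED_LEVELS)
		return 0;		/* embedded array not full yet */
	if (w->stack != w->internal)
		return 0;		/* already on the heap, sized for the cap */
	p = malloc(MAX_LINKS * sizeof(*p));
	if (!p)
		return -ENOMEM;
	memcpy(p, w->internal, sizeof(w->internal));
	w->stack = p;			/* switch storage, entries preserved */
	return 0;
}
```

The common case of shallow nesting never allocates, and because the heap array is sized for the overall cap up front, at most one allocation happens per walk.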
2020-01-19 12:48:44 -05:00
enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
2020-01-14 14:26:57 -05:00
static const char *pick_link(struct nameidata *nd, struct path *link,
2022-07-04 18:12:39 -04:00
struct inode * inode , int flags )
2015-05-04 18:13:23 -04:00
{
2015-05-06 16:01:56 -04:00
struct saved * last ;
2020-01-14 14:41:39 -05:00
const char * res ;
2022-07-04 18:12:39 -04:00
int error = reserve_stack ( nd , link ) ;
2020-01-14 14:41:39 -05:00
2015-05-04 18:26:59 -04:00
if ( unlikely ( error ) ) {
2020-03-03 11:22:34 -05:00
if ( ! ( nd - > flags & LOOKUP_RCU ) )
2015-05-09 13:04:24 -04:00
path_put ( link ) ;
2020-03-03 11:22:34 -05:00
return ERR_PTR ( error ) ;
2015-05-04 18:26:59 -04:00
}
2015-05-10 11:50:01 -04:00
last = nd - > stack + nd - > depth + + ;
2015-05-06 16:01:56 -04:00
last - > link = * link ;
2015-12-29 15:58:39 -05:00
clear_delayed_call ( & last - > done ) ;
2022-07-04 18:12:39 -04:00
last - > seq = nd - > next_seq ;
2020-01-14 14:41:39 -05:00
2020-01-19 12:48:44 -05:00
if ( flags & WALK_TRAILING ) {
2020-01-14 14:41:39 -05:00
error = may_follow_link ( nd , inode ) ;
if ( unlikely ( error ) )
return ERR_PTR ( error ) ;
}
2020-08-27 11:09:46 -06:00
if ( unlikely ( nd - > flags & LOOKUP_NO_SYMLINKS ) | |
unlikely ( link - > mnt - > mnt_flags & MNT_NOSYMFOLLOW ) )
2020-01-14 14:41:39 -05:00
return ERR_PTR ( - ELOOP ) ;
if ( ! ( nd - > flags & LOOKUP_RCU ) ) {
touch_atime ( & last - > link ) ;
cond_resched ( ) ;
} else if ( atime_needs_update ( & last - > link , inode ) ) {
2020-12-17 09:19:08 -07:00
if ( ! try_to_unlazy ( nd ) )
2020-01-14 14:41:39 -05:00
return ERR_PTR ( - ECHILD ) ;
touch_atime ( & last - > link ) ;
}
error = security_inode_follow_link ( link - > dentry , inode ,
nd - > flags & LOOKUP_RCU ) ;
if ( unlikely ( error ) )
return ERR_PTR ( error ) ;
res = READ_ONCE ( inode - > i_link ) ;
if ( ! res ) {
const char * ( * get ) ( struct dentry * , struct inode * ,
struct delayed_call * ) ;
get = inode - > i_op - > get_link ;
if ( nd - > flags & LOOKUP_RCU ) {
res = get ( NULL , inode , & last - > done ) ;
2020-12-17 09:19:08 -07:00
if ( res = = ERR_PTR ( - ECHILD ) & & try_to_unlazy ( nd ) )
2020-01-14 14:41:39 -05:00
res = get ( link - > dentry , inode , & last - > done ) ;
} else {
res = get ( link - > dentry , inode , & last - > done ) ;
}
if ( ! res )
goto all_done ;
if ( IS_ERR ( res ) )
return res ;
}
if ( * res = = ' / ' ) {
error = nd_jump_root ( nd ) ;
if ( unlikely ( error ) )
return ERR_PTR ( error ) ;
while ( unlikely ( * + + res = = ' / ' ) )
;
}
if ( * res )
return res ;
all_done : // pure jump
put_link ( nd ) ;
return NULL ;
2015-05-04 18:13:23 -04:00
}
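The tail of pick_link() above decides what is left of a symlink body once it has been fetched: an absolute body triggers a jump to the root, every leading slash is skipped, and an empty remainder means there is nothing more to walk. A minimal userspace sketch of just that string handling follows. The names link_body_is_absolute() and remaining_link_body() are hypothetical helpers invented for illustration, not kernel functions, and the jump to the root itself is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the kernel): reports whether the body is
 * absolute, i.e. whether pick_link() would call nd_jump_root() for it.
 */
static int link_body_is_absolute(const char *res)
{
	assert(res != NULL);
	return *res == '/';
}

/*
 * Hypothetical helper (not in the kernel): returns the part of the
 * symlink body that still has to be walked, or NULL when nothing is
 * left, which is the case where pick_link() goes to all_done.
 */
static const char *remaining_link_body(const char *res)
{
	assert(res != NULL);
	if (*res == '/') {
		/* skip every leading slash, like the while loop above */
		while (*++res == '/')
			;
	}
	return *res ? res : NULL;
}
```

Under these assumptions a body of "//usr/lib" leaves "usr/lib" to be walked from the root, while a body consisting only of slashes leaves nothing and the link is simply dropped.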
2011-08-06 22:45:50 -07:00
/*
* Do we need to follow links? We _really_ want to be able
* to do this check without having to look at inode->i_op,
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
2022-07-04 18:12:39 -04:00
*
* NOTE: dentry must be what nd->next_seq had been sampled from.
2011-08-06 22:45:50 -07:00
*/
2020-01-14 13:34:20 -05:00
static const char * step_into ( struct nameidata * nd , int flags ,
2022-07-03 22:07:32 -04:00
struct dentry * dentry )
2011-08-06 22:45:50 -07:00
{
2020-01-12 13:40:02 -05:00
struct path path ;
2022-07-03 22:07:32 -04:00
struct inode * inode ;
2022-07-03 22:35:56 -04:00
int err = handle_mounts ( nd , dentry , & path ) ;
2020-01-12 13:40:02 -05:00
if ( err < 0 )
2020-01-14 13:34:20 -05:00
return ERR_PTR ( err ) ;
2022-07-03 22:35:56 -04:00
inode = path . dentry - > d_inode ;
2020-01-12 13:40:02 -05:00
if ( likely ( ! d_is_symlink ( path . dentry ) ) | |
2020-01-19 12:44:18 -05:00
( ( flags & WALK_TRAILING ) & & ! ( nd - > flags & LOOKUP_FOLLOW ) ) | |
2020-01-09 15:17:57 -05:00
( flags & WALK_NOFOLLOW ) ) {
2016-11-14 01:50:26 -05:00
/* not a symlink or should not follow */
2022-07-03 22:35:56 -04:00
if ( nd - > flags & LOOKUP_RCU ) {
if ( read_seqcount_retry ( & path . dentry - > d_seq , nd - > next_seq ) )
return ERR_PTR ( - ECHILD ) ;
if ( unlikely ( ! inode ) )
return ERR_PTR ( - ENOENT ) ;
} else {
2020-03-03 10:56:17 -05:00
dput ( nd - > path . dentry ) ;
if ( nd - > path . mnt ! = path . mnt )
mntput ( nd - > path . mnt ) ;
}
nd - > path = path ;
2016-11-14 01:50:26 -05:00
nd - > inode = inode ;
2022-07-04 18:12:39 -04:00
nd - > seq = nd - > next_seq ;
2020-01-14 13:34:20 -05:00
return NULL ;
2016-11-14 01:50:26 -05:00
}
2016-02-27 19:31:01 -05:00
if ( nd - > flags & LOOKUP_RCU ) {
2020-03-03 10:14:30 -05:00
/* make sure that d_is_symlink above matches inode */
2022-07-04 18:12:39 -04:00
if ( read_seqcount_retry ( & path . dentry - > d_seq , nd - > next_seq ) )
2020-01-14 13:34:20 -05:00
return ERR_PTR ( - ECHILD ) ;
2020-03-03 10:14:30 -05:00
} else {
if ( path . mnt = = nd - > path . mnt )
mntget ( path . mnt ) ;
2016-02-27 19:31:01 -05:00
}
2022-07-04 18:12:39 -04:00
return pick_link ( nd , & path , inode , flags ) ;
2011-08-06 22:45:50 -07:00
}
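step_into() relies on a sample-then-verify rule: the sequence number of the next dentry is sampled first, and only then is the dentry we are leaving checked for changes. The sketch below models the parent-to-child case in single-threaded userspace C. Everything prefixed with toy_ is a hypothetical stand-in invented for illustration: there are no memory barriers, and toy_read_begin() masks off the low bit instead of spinning, which is only loosely modelled on the kernel's seqcount readers.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for seqcount_t: single-threaded, no barriers. */
struct toy_seqcount { unsigned sequence; };
struct toy_dentry { struct toy_seqcount d_seq; };

/* Writer side: the counter is odd while an update is in progress. */
static void toy_write_begin(struct toy_seqcount *s) { s->sequence++; }
static void toy_write_end(struct toy_seqcount *s)
{
	assert(s->sequence & 1);	/* must pair with toy_write_begin() */
	s->sequence++;
}

/* Reader side: sample now, compare later. */
static unsigned toy_read_begin(const struct toy_seqcount *s)
{
	return s->sequence & ~1u;	/* a sample taken mid-write never validates */
}
static int toy_read_retry(const struct toy_seqcount *s, unsigned start)
{
	return s->sequence != start;
}

/* Anything that disrupts the parent-child relationship bumps both counters. */
static void toy_rename(struct toy_dentry *parent, struct toy_dentry *child)
{
	toy_write_begin(&parent->d_seq);
	toy_write_begin(&child->d_seq);
	toy_write_end(&child->d_seq);
	toy_write_end(&parent->d_seq);
}

/* Parent -> child: sample the child, then verify the parent is unchanged. */
static int toy_step_to_child(const struct toy_dentry *parent, unsigned parent_seq,
			     const struct toy_dentry *child, unsigned *next_seq)
{
	*next_seq = toy_read_begin(&child->d_seq);
	if (toy_read_retry(&parent->d_seq, parent_seq))
		return -ECHILD;	/* caller falls back to the non-RCU walk */
	return 0;
}

/* Scenario 1: nothing changes between the two samples. */
static int toy_scenario_quiet(void)
{
	struct toy_dentry parent = { { 0 } }, child = { { 0 } };
	unsigned parent_seq = toy_read_begin(&parent.d_seq), next_seq;

	return toy_step_to_child(&parent, parent_seq, &child, &next_seq);
}

/* Scenario 2: a rename lands after the parent was sampled. */
static int toy_scenario_racing_rename(void)
{
	struct toy_dentry parent = { { 0 } }, child = { { 0 } };
	unsigned parent_seq = toy_read_begin(&parent.d_seq), next_seq;

	toy_rename(&parent, &child);
	return toy_step_to_child(&parent, parent_seq, &child, &next_seq);
}
```

The order matters in this model: because the child is sampled before the parent is rechecked, a successful check means the child's sample was taken while the parent-child relationship still held.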
2022-07-03 22:18:11 -04:00
static struct dentry * follow_dotdot_rcu ( struct nameidata * nd )
2020-02-26 01:40:04 -05:00
{
2020-02-26 14:59:56 -05:00
struct dentry * parent , * old ;
2020-02-26 01:40:04 -05:00
2020-02-26 14:59:56 -05:00
if ( path_equal ( & nd - > path , & nd - > root ) )
goto in_root ;
if ( unlikely ( nd - > path . dentry = = nd - > path . mnt - > mnt_root ) ) {
2020-02-26 17:50:13 -05:00
struct path path ;
2020-02-28 10:06:37 -05:00
unsigned seq ;
2020-02-26 17:50:13 -05:00
if ( ! choose_mountpoint_rcu ( real_mount ( nd - > path . mnt ) ,
& nd - > root , & path , & seq ) )
goto in_root ;
2020-02-28 10:06:37 -05:00
if ( unlikely ( nd - > flags & LOOKUP_NO_XDEV ) )
return ERR_PTR ( - ECHILD ) ;
nd - > path = path ;
nd - > inode = path . dentry - > d_inode ;
nd - > seq = seq ;
2022-07-04 18:12:39 -04:00
// makes sure that non-RCU pathwalk could reach this state
2022-07-05 11:23:58 -04:00
if ( read_seqretry ( & mount_lock , nd - > m_seq ) )
2020-02-28 10:06:37 -05:00
return ERR_PTR ( - ECHILD ) ;
/* we know that mountpoint was pinned */
2020-02-26 01:40:04 -05:00
}
2020-02-26 14:59:56 -05:00
old = nd - > path . dentry ;
parent = old - > d_parent ;
2022-07-04 18:12:39 -04:00
nd - > next_seq = read_seqcount_begin ( & parent - > d_seq ) ;
// makes sure that non-RCU pathwalk could reach this state
2022-07-05 11:23:58 -04:00
if ( read_seqcount_retry ( & old - > d_seq , nd - > seq ) )
2020-02-26 14:59:56 -05:00
return ERR_PTR ( - ECHILD ) ;
if ( unlikely ( ! path_connected ( nd - > path . mnt , parent ) ) )
return ERR_PTR ( - ECHILD ) ;
return parent ;
in_root :
2022-07-05 11:23:58 -04:00
if ( read_seqretry ( & mount_lock , nd - > m_seq ) )
2020-02-28 10:06:37 -05:00
return ERR_PTR ( - ECHILD ) ;
2020-02-26 14:33:30 -05:00
if ( unlikely ( nd - > flags & LOOKUP_BENEATH ) )
return ERR_PTR ( - ECHILD ) ;
2022-07-04 18:12:39 -04:00
nd - > next_seq = nd - > seq ;
2022-07-04 11:20:51 -04:00
return nd - > path . dentry ;
2020-02-26 01:40:04 -05:00
}
2022-07-03 22:18:11 -04:00
static struct dentry * follow_dotdot ( struct nameidata * nd )
2020-02-26 01:40:04 -05:00
{
2020-02-26 14:59:56 -05:00
struct dentry * parent ;
if ( path_equal ( & nd - > path , & nd - > root ) )
goto in_root ;
if ( unlikely ( nd - > path . dentry = = nd - > path . mnt - > mnt_root ) ) {
2020-02-26 19:19:05 -05:00
struct path path ;
if ( ! choose_mountpoint ( real_mount ( nd - > path . mnt ) ,
& nd - > root , & path ) )
goto in_root ;
2020-02-28 10:17:52 -05:00
path_put ( & nd - > path ) ;
nd - > path = path ;
2020-02-26 19:19:05 -05:00
nd - > inode = path . dentry - > d_inode ;
2020-02-28 10:17:52 -05:00
if ( unlikely ( nd - > flags & LOOKUP_NO_XDEV ) )
return ERR_PTR ( - EXDEV ) ;
2020-02-26 01:40:04 -05:00
}
2020-02-26 14:59:56 -05:00
/* rare case of legitimate dget_parent()... */
parent = dget_parent ( nd - > path . dentry ) ;
if ( unlikely ( ! path_connected ( nd - > path . mnt , parent ) ) ) {
dput ( parent ) ;
return ERR_PTR ( - ENOENT ) ;
}
return parent ;
in_root :
2020-02-26 14:33:30 -05:00
if ( unlikely ( nd - > flags & LOOKUP_BENEATH ) )
return ERR_PTR ( - EXDEV ) ;
2022-07-04 11:20:51 -04:00
return dget ( nd - > path . dentry ) ;
2020-02-26 01:40:04 -05:00
}
2020-02-26 12:22:58 -05:00
static const char * handle_dots ( struct nameidata * nd , int type )
2020-02-26 01:40:04 -05:00
{
if ( type = = LAST_DOTDOT ) {
2020-02-26 12:22:58 -05:00
const char * error = NULL ;
2020-02-26 14:33:30 -05:00
struct dentry * parent ;
2020-02-26 01:40:04 -05:00
if ( ! nd - > root . mnt ) {
2020-02-26 12:22:58 -05:00
error = ERR_PTR ( set_root ( nd ) ) ;
2020-02-26 01:40:04 -05:00
if ( error )
return error ;
}
if ( nd - > flags & LOOKUP_RCU )
2022-07-03 22:18:11 -04:00
parent = follow_dotdot_rcu ( nd ) ;
2020-02-26 01:40:04 -05:00
else
2022-07-03 22:18:11 -04:00
parent = follow_dotdot ( nd ) ;
2020-02-26 14:33:30 -05:00
if ( IS_ERR ( parent ) )
return ERR_CAST ( parent ) ;
2022-07-03 22:07:32 -04:00
error = step_into ( nd , WALK_NOFOLLOW , parent ) ;
2020-02-26 14:33:30 -05:00
if ( unlikely ( error ) )
2020-02-26 01:40:04 -05:00
return error ;
if ( unlikely ( nd - > flags & LOOKUP_IS_SCOPED ) ) {
/*
* If there was a racing rename or mount along our
* path, then we can't be sure that ".." hasn't jumped
* above nd->root (and so userspace should retry or use
* some fallback).
*/
smp_rmb ( ) ;
2022-07-05 11:23:58 -04:00
if ( __read_seqcount_retry ( & mount_lock . seqcount , nd - > m_seq ) )
2020-02-26 12:22:58 -05:00
return ERR_PTR ( - EAGAIN ) ;
2022-07-05 11:23:58 -04:00
if ( __read_seqcount_retry ( & rename_lock . seqcount , nd - > r_seq ) )
2020-02-26 12:22:58 -05:00
return ERR_PTR ( - EAGAIN ) ;
2020-02-26 01:40:04 -05:00
}
}
2020-02-26 12:22:58 -05:00
return NULL ;
2020-02-26 01:40:04 -05:00
}
2020-01-14 13:24:17 -05:00
static const char * walk_component ( struct nameidata * nd , int flags )
2011-03-13 19:58:58 -04:00
{
2020-01-09 14:41:00 -05:00
struct dentry * dentry ;
2011-03-13 19:58:58 -04:00
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
* parent relationships.
*/
2015-05-04 17:47:11 -04:00
if ( unlikely ( nd - > last_type ! = LAST_NORM ) ) {
2016-11-14 01:39:36 -05:00
if ( ! ( flags & WALK_MORE ) & & nd - > depth )
2015-05-04 17:47:11 -04:00
put_link ( nd ) ;
2020-02-26 12:22:58 -05:00
return handle_dots ( nd , nd - > last_type ) ;
2015-05-04 17:47:11 -04:00
}
2022-07-03 22:20:20 -04:00
dentry = lookup_fast ( nd ) ;
lookup_fast(): take mount traversal into callers
Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-09 14:58:31 -05:00
if ( IS_ERR ( dentry ) )
2020-01-14 13:24:17 -05:00
return ERR_CAST ( dentry ) ;
2020-01-09 14:58:31 -05:00
	if (unlikely(!dentry)) {
2020-01-09 14:41:00 -05:00
		dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
		if (IS_ERR(dentry))
2020-01-14 13:24:17 -05:00
			return ERR_CAST(dentry);
2011-03-13 19:58:58 -04:00
	}
2020-03-10 21:54:54 -04:00
	if (!(flags & WALK_MORE) && nd->depth)
		put_link(nd);
2022-07-03 22:07:32 -04:00
	return step_into(nd, flags, dentry);
2011-03-13 19:58:58 -04:00
}
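The three-way return convention described in the lookup_fast() commit message (ERR_PTR(-E...) on error, NULL on cache miss, a dentry on success) can be sketched in userspace roughly as follows. This is only an illustration: the ERR_PTR/PTR_ERR/IS_ERR helpers are simplified stand-ins for the kernel's <linux/err.h>, and every `toy_*` name and the toy `struct dentry` are hypothetical and do not exist in the kernel.

```c
#include <assert.h>	/* handy when experimenting with the helpers below */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for <linux/err.h> (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct dentry { const char *name; };	/* hypothetical toy type */

enum toy_outcome { TOY_HIT, TOY_MISS, TOY_ERROR };

static struct dentry toy_cached = { "cached" };
static struct dentry toy_slow = { "slow" };

/* Toy fast path: dentry on a hit, NULL on a miss, ERR_PTR() on error. */
static struct dentry *toy_lookup_fast(enum toy_outcome o)
{
	switch (o) {
	case TOY_HIT:
		return &toy_cached;
	case TOY_MISS:
		return NULL;
	default:
		return ERR_PTR(-ECHILD);
	}
}

/* Toy slow path: always succeeds in this sketch. */
static struct dentry *toy_lookup_slow(void)
{
	return &toy_slow;
}

/* Caller shaped like the code above: error, then miss, then success. */
static struct dentry *toy_walk(enum toy_outcome o)
{
	struct dentry *dentry = toy_lookup_fast(o);

	if (IS_ERR(dentry))
		return dentry;		/* propagate the encoded errno */
	if (!dentry) {
		dentry = toy_lookup_slow();
		if (IS_ERR(dentry))
			return dentry;
	}
	return dentry;			/* hit or slow-path result */
}
```

With this shape the caller handles a hit and a slow-path result identically once it holds a valid dentry, which is the unification the commit message describes.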
2012-03-06 11:16:17 -08:00
/*
 * We can do the critical dentry name comparison and hashing
 * operations one word at a time, but we are limited to:
 *
 * - Architectures with fast unaligned word accesses. We could
 *   do a "get_unaligned()" if this helps and is sufficiently
 *   fast.
 *
 * - non-CONFIG_DEBUG_PAGEALLOC configurations (so that we
 *   do not trap on the (extremely unlikely) case of a page
 *   crossing operation).
 *
 * - Furthermore, we need an efficient 64-bit compile for the
 *   64-bit case in order to generate the "number of bytes in
 *   the final mask". Again, that could be replaced with an
 *   efficient population count instruction or similar.
 */
#ifdef CONFIG_DCACHE_WORD_ACCESS
2012-04-06 13:54:56 -07:00
#include <asm/word-at-a-time.h>
2012-03-06 11:16:17 -08:00
2016-05-26 22:11:51 -04:00
#ifdef HASH_MIX
2012-03-06 11:16:17 -08:00
2016-05-26 22:11:51 -04:00
/* Architecture provides HASH_MIX and fold_hash() in <asm/hash.h> */
2012-03-06 11:16:17 -08:00
2016-05-26 22:11:51 -04:00
#elif defined(CONFIG_64BIT)
2016-05-02 06:31:01 -04:00
/*
fs/namei.c: Improve dcache hash function
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 07:43:58 -04:00
 * Register pressure in the mixing function is an issue, particularly
 * on 32-bit x86, but almost any function requires one state value and
 * one temporary.  Instead, use a function designed for two state values
 * and no temporaries.
 *
 * This function cannot create a collision in only two iterations, so
 * we have two iterations to achieve avalanche.  In those two iterations,
 * we have six layers of mixing, which is enough to spread one bit's
 * influence out to 2^6 = 64 state bits.
 *
 * Rotate constants are scored by considering either 64 one-bit input
 * deltas or 64*63/2 = 2016 two-bit input deltas, and finding the
 * probability of that delta causing a change to each of the 128 output
 * bits, using a sample of random initial states.
 *
 * The Shannon entropy of the computed probabilities is then summed
 * to produce a score.  Ideally, any input change has a 50% chance of
 * toggling any given output bit.
 *
 * Mixing scores (in bits) for (12,45):
 * Input delta: 1-bit      2-bit
 * 1 round:     713.3    42542.6
 * 2 rounds:   2753.7   140389.8
 * 3 rounds:   5954.1   233458.2
 * 4 rounds:   7862.6   256672.2
 * Perfect:    8192     258048
 *            (64*128) (64*63/2 * 128)
2016-05-02 06:31:01 -04:00
*/
2016-05-23 07:43:58 -04:00
#define HASH_MIX(x, y, a)	\
	(	x ^= (a),	\
	y ^= x,	x = rol64(x,12),\
	x += y,	y = rol64(y,45),\
	y *= 9			)
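The reversibility and all-zero properties claimed in the commit message can be checked with a small userspace sketch of the 64-bit round. This is an illustration, not kernel code: `rol64`/`ror64` are re-implemented locally, and `INV9`, `mix_round`, `unmix_round`, `roundtrip_ok` and `zero_stays_zero` are hypothetical names introduced only for this example.

```c
#include <assert.h>	/* handy when experimenting with the helpers below */
#include <stdint.h>

/* Local rotate helpers (userspace re-implementations, assumption). */
static inline uint64_t rol64(uint64_t w, unsigned int s)
{
	return (w << (s & 63)) | (w >> ((-s) & 63));
}

static inline uint64_t ror64(uint64_t w, unsigned int s)
{
	return (w >> (s & 63)) | (w << ((-s) & 63));
}

/* Same round as the 64-bit HASH_MIX above. */
#define HASH_MIX(x, y, a)		\
	(x ^= (a),			\
	 y ^= x, x = rol64(x, 12),	\
	 x += y, y = rol64(y, 45),	\
	 y *= 9)

/* Multiplicative inverse of 9 modulo 2^64: 9 * INV9 wraps to 1. */
#define INV9 0x8E38E38E38E38E39ULL

/* Apply one mixing round to the state (*x, *y) with input word a. */
static void mix_round(uint64_t *x, uint64_t *y, uint64_t a)
{
	uint64_t sx = *x, sy = *y;

	HASH_MIX(sx, sy, a);
	*x = sx;
	*y = sy;
}

/* Undo one round, given the same input word a. */
static void unmix_round(uint64_t *x, uint64_t *y, uint64_t a)
{
	uint64_t sy = ror64(*y * INV9, 45);	/* undo y *= 9 and the rotate */
	uint64_t sx = ror64(*x - sy, 12);	/* undo x += y and the rotate */

	sy ^= sx;				/* undo y ^= x */
	sx ^= a;				/* undo x ^= a */
	*x = sx;
	*y = sy;
}

/* Returns 1 if mixing and then unmixing restores the initial state. */
static int roundtrip_ok(uint64_t x, uint64_t y, uint64_t a)
{
	uint64_t sx = x, sy = y;

	mix_round(&sx, &sy, a);
	unmix_round(&sx, &sy, a);
	return sx == x && sy == y;
}

/* Returns 1 if an all-zero state mixed with a zero word stays zero. */
static int zero_stays_zero(void)
{
	uint64_t x = 0, y = 0;

	mix_round(&x, &y, 0);
	return x == 0 && y == 0;
}
```

Each step of the round is an XOR, an add, a rotate or a multiply by an odd constant, all of which are invertible modulo 2^64, which is why `unmix_round` can simply run the steps backwards.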
2012-03-06 11:16:17 -08:00
2016-05-02 06:31:01 -04:00
/*
2016-05-23 07:43:58 -04:00
 * Fold two longs into one 32-bit hash value.  This must be fast, but
 * latency isn't quite as critical, as there is a fair bit of additional
 * work done before the hash value is used.
2016-05-02 06:31:01 -04:00
*/
2016-05-23 07:43:58 -04:00
static inline unsigned int fold_hash(unsigned long x, unsigned long y)
2016-05-02 06:31:01 -04:00
{
2016-05-23 07:43:58 -04:00
	y ^= x * GOLDEN_RATIO_64;
	y *= GOLDEN_RATIO_64;
	return y >> 32;
2016-05-02 06:31:01 -04:00
}
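To see how the mixing round and the fold fit together, here is a minimal userspace sketch that hashes an array of 64-bit words. It is a simplification and not hash_name(): it starts from an all-zero state, takes whole words only, and does no word-at-a-time byte masking. `fold_hash64` and `toy_hash_words` are hypothetical names; `GOLDEN_RATIO_64` is given the value the kernel defines in <linux/hash.h>.

```c
#include <assert.h>	/* handy when experimenting with the helpers below */
#include <stddef.h>
#include <stdint.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull

static inline uint64_t rol64(uint64_t w, unsigned int s)
{
	return (w << (s & 63)) | (w >> ((-s) & 63));
}

/* Same round as the 64-bit HASH_MIX above. */
#define HASH_MIX(x, y, a)		\
	(x ^= (a),			\
	 y ^= x, x = rol64(x, 12),	\
	 x += y, y = rol64(y, 45),	\
	 y *= 9)

/* Fold the two 64-bit state words into one 32-bit hash value. */
static inline unsigned int fold_hash64(uint64_t x, uint64_t y)
{
	y ^= x * GOLDEN_RATIO_64;
	y *= GOLDEN_RATIO_64;
	return y >> 32;		/* keep the better-mixed high half */
}

/* Toy driver: mix n whole words from a zero state, then fold. */
static unsigned int toy_hash_words(const uint64_t *words, size_t n)
{
	uint64_t x = 0, y = 0;
	size_t i;

	for (i = 0; i < n; i++)
		HASH_MIX(x, y, words[i]);
	return fold_hash64(x, y);
}
```

The fold returns the high 32 bits because a multiply propagates information towards the high bits of the product, so the low bits of `y` are the least mixed.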
2012-03-06 11:16:17 -08:00
#else	/* 32-bit case */
2016-05-23 07:43:58 -04:00
/*
 * Mixing scores (in bits) for (7,20):
 * Input delta: 1-bit      2-bit
 * 1 round:     330.3     9201.6
 * 2 rounds:   1246.4    25475.4
 * 3 rounds:   1907.1    31295.1
 * 4 rounds:   2042.3    31718.6
 * Perfect:    2048      31744
 *            (32*64)   (32*31/2 * 64)
 */
#define HASH_MIX(x, y, a)	\
	(	x ^= (a),	\
	y ^= x,	x = rol32(x, 7),\
	x += y,	y = rol32(y,20),\
	y *= 9			)
2012-03-06 11:16:17 -08:00
2016-05-23 07:43:58 -04:00
static inline unsigned int fold_hash(unsigned long x, unsigned long y)
2016-05-02 06:31:01 -04:00
{
fs/namei.c: Improve dcache hash function
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
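The round above and its two-round reversibility can be sketched in user-space C as follows. This is a minimal illustration, not the kernel's code: the names mix(), recover_inputs() and struct hstate are hypothetical, and the rotate amounts 12 and 45 are assumed values chosen only so the sketch compiles (the text does not state K1 and K2). INV9 is the multiplicative inverse of 9 modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed rotate constants, for illustration only. */
#define K1 12
#define K2 45
/* 9 * INV9 == 1 (mod 2^64), so multiplying by INV9 undoes "y *= 9". */
#define INV9 0x8E38E38E38E38E39ULL

struct hstate { uint64_t x, y; };   /* hypothetical two-word hash state */

static uint64_t rol64(uint64_t v, unsigned s)
{
	assert(s > 0 && s < 64);
	return (v << s) | (v >> (64 - s));
}

static uint64_t ror64(uint64_t v, unsigned s)
{
	assert(s > 0 && s < 64);
	return (v >> s) | (v << (64 - s));
}

/* One round of the mixing function, absorbing input word a. */
static struct hstate mix(struct hstate s, uint64_t a)
{
	s.x ^= a;
	s.y ^= s.x;  s.x = rol64(s.x, K1);
	s.x += s.y;  s.y = rol64(s.y, K2);
	s.y *= 9;
	return s;
}

/*
 * Given the state before round 1 and the state after round 2, but not
 * the intermediate state, recover both input words.
 */
static void recover_inputs(struct hstate init, struct hstate fin,
			   uint64_t *a1, uint64_t *a2)
{
	/* Undo round 2 as far as possible without knowing a2. */
	uint64_t y4 = ror64(fin.y * INV9, K2);  /* y after "y ^= x", round 2 */
	uint64_t x4 = ror64(fin.x - y4, K1);    /* x after "x ^= a2" */
	uint64_t y3 = y4 ^ x4;                  /* y of the intermediate state */

	/* The intermediate y depends only on round 1, so undo that too. */
	uint64_t y1 = ror64(y3 * INV9, K2);     /* y after "y ^= x", round 1 */
	uint64_t x1 = y1 ^ init.y;              /* x after "x ^= a1" */
	*a1 = x1 ^ init.x;

	/* Recompute the intermediate x going forward; a2 is the difference. */
	uint64_t x3 = rol64(x1, K1) + y1;
	*a2 = x3 ^ x4;
}
```

Because two rounds determine both input words uniquely, two different two-word inputs cannot lead to the same state, which is the basis for the "at least 3 words" claim. The sketch also shows why an all-zero state absorbing a zero word stays all-zero: every step maps zero to zero.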
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 07:43:58 -04:00
/* Use arch-optimized multiply if one exists */
return __hash_32 ( y ^ __hash_32 ( x ) ) ;
2016-05-02 06:31:01 -04:00
}
2012-03-06 11:16:17 -08:00
# endif
2016-05-23 07:43:58 -04:00
/*
 * Return the hash of a string of known length. This is carefully
 * designed to match hash_name(), which is the more critical function.
 * In particular, we must end by hashing a final word containing 0..7
 * payload bytes, to match the way that hash_name() iterates until it
 * finds the delimiter after the name.
*/
2016-06-10 07:51:30 -07:00
unsigned int full_name_hash ( const void * salt , const char * name , unsigned int len )
2012-03-06 11:16:17 -08:00
{
2016-06-10 07:51:30 -07:00
unsigned long a , x = 0 , y = ( unsigned long ) salt ;
2012-03-06 11:16:17 -08:00
for ( ; ; ) {
2016-05-20 08:41:37 -04:00
if ( ! len )
goto done ;
2012-05-03 10:16:43 -07:00
a = load_unaligned_zeropad ( name ) ;
2012-03-06 11:16:17 -08:00
if ( len < sizeof ( unsigned long ) )
break ;
2016-05-23 07:43:58 -04:00
HASH_MIX ( x , y , a ) ;
2012-03-06 11:16:17 -08:00
name += sizeof ( unsigned long ) ;
len -= sizeof ( unsigned long ) ;
}
2016-05-23 07:43:58 -04:00
x ^= a & bytemask_from_count ( len ) ;
2012-03-06 11:16:17 -08:00
done :
2016-05-23 07:43:58 -04:00
return fold_hash ( x , y ) ;
2012-03-06 11:16:17 -08:00
}
EXPORT_SYMBOL ( full_name_hash ) ;
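The control flow of full_name_hash() (whole words go through the mixing round, the final 0..7 payload bytes are XORed in without another round, then the state is folded) can be sketched portably as below. This is a hedged user-space approximation, not the kernel function: name_hash_sketch(), load_le() and fold() are hypothetical names, the rotate amounts and the fold multiplier are assumed values, and load_le() reads only the valid bytes instead of doing an unaligned word load followed by a byte mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants, for illustration only. */
#define K1 12
#define K2 45
#define FOLD_MULT 0x61C8864680B583EBULL   /* assumed odd 64-bit multiplier */

static uint64_t rol64(uint64_t v, unsigned s)
{
	assert(s > 0 && s < 64);
	return (v << s) | (v >> (64 - s));
}

/*
 * Assemble up to 8 bytes into a word, lowest address in the lowest byte.
 * Bytes at index n and above contribute zero, which has the same effect
 * as loading a full word and masking off the bytes past the name.
 */
static uint64_t load_le(const unsigned char *p, size_t n)
{
	uint64_t w = 0;

	for (size_t i = 0; i < n && i < 8; i++)
		w |= (uint64_t)p[i] << (8 * i);
	return w;
}

/* Stand-in multiply-based fold of the two-word state to 32 bits. */
static uint32_t fold(uint64_t x, uint64_t y)
{
	y ^= x * FOLD_MULT;
	y *= FOLD_MULT;
	return (uint32_t)(y >> 32);
}

static uint32_t name_hash_sketch(uint64_t salt, const char *name, size_t len)
{
	const unsigned char *p = (const unsigned char *)name;
	uint64_t x = 0, y = salt;

	/* Full words: one mixing round each. */
	while (len >= 8) {
		x ^= load_le(p, 8);
		y ^= x;  x = rol64(x, K1);
		x += y;  y = rol64(y, K2);
		y *= 9;
		p += 8;
		len -= 8;
	}
	/* Final word with 0..7 payload bytes: XOR only, no extra round. */
	if (len)
		x ^= load_le(p, len);
	return fold(x, y);
}
```

Skipping the XOR when no bytes remain is equivalent to XORing a zero word, so a name whose length is a multiple of the word size still ends with a "final word" of zero payload bytes, as the comment above describes.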
2016-05-20 08:41:37 -04:00
/* Return the "hash_len" (hash and length) of a null-terminated string */
2016-06-10 07:51:30 -07:00
u64 hashlen_string ( const void * salt , const char * name )
2016-05-20 08:41:37 -04:00
{
2016-06-10 07:51:30 -07:00
unsigned long a = 0 , x = 0 , y = ( unsigned long ) salt ;
unsigned long adata , mask , len ;
2016-05-20 08:41:37 -04:00
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS ;
2016-06-10 07:51:30 -07:00
len = 0 ;
goto inside ;
2016-05-20 08:41:37 -04:00
do {
2016-05-23 07:43:58 -04:00
HASH_MIX ( x , y , a ) ;
2016-05-20 08:41:37 -04:00
len += sizeof ( unsigned long ) ;
2016-06-10 07:51:30 -07:00
inside :
2016-05-20 08:41:37 -04:00
a = load_unaligned_zeropad ( name + len ) ;
} while ( ! has_zero ( a , & adata , & constants ) ) ;
adata = prep_zero_mask ( a , adata , & constants ) ;
mask = create_zero_mask ( adata ) ;
2016-05-23 07:43:58 -04:00
x ^= a & zero_bytemask ( mask ) ;
2016-05-20 08:41:37 -04:00
fs/namei.c: Improve dcache hash function
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
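One way to write down the score described above, as I read it (the message does not give the exact estimator, so the notation here is an assumption): let p_{ij} be the fraction of sampled initial states for which flipping input bit i changes output bit j two rounds later.

```latex
S(K_1, K_2) \;=\; \sum_{i=1}^{64} \sum_{j=1}^{128} H(p_{ij}),
\qquad
H(p) \;=\; -p \log_2 p \;-\; (1 - p) \log_2 (1 - p)
```

Each term is at most 1 bit, reached when p_{ij} = 1/2, so a perfectly mixing pair of rotate constants would score 64 * 128 = 8192 under this reading.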
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 07:43:58 -04:00
return hashlen_create ( fold_hash ( x , y ) , len + find_zero ( mask ) ) ;
2016-05-20 08:41:37 -04:00
}
EXPORT_SYMBOL ( hashlen_string ) ;
2012-03-06 11:16:17 -08:00
/*
* Calculate the length and hash of the path component, and
vfs: simplify and shrink stack frame of link_path_walk()
Commit 9226b5b440f2 ("vfs: avoid non-forwarding large load after small
store in path lookup") made link_path_walk() always access the
"hash_len" field as a single 64-bit entity, in order to avoid mixed size
accesses to the members.
However, what I didn't notice was that that effectively means that the
whole "struct qstr this" is now basically redundant. We already
explicitly track the "const char *name", and if we just use "u64
hash_len" instead of "long len", there is nothing else left of the
"struct qstr".
We do end up wanting the "struct qstr" if we have a filesystem with a
"d_hash()" function, but that's a rare case, and we might as well then
just squirrel away the name and hash_len at that point.
End result: fewer live variables in the loop, a smaller stack frame, and
better code generation. And we don't need to pass in pointer variables
to helper functions any more, because the return value contains all the
relevant information. So this removes more lines than it adds, and the
source code is clearer too.
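A small sketch of what "hash_len as a single 64-bit entity" can look like. The helper names are hypothetical, and the layout (length in the upper 32 bits, hash in the lower 32) is an assumption modelled on the hashlen_create() calls in this file rather than something this message states.

```c
#include <stdint.h>

/* Pack a 32-bit hash and a 32-bit length into one 64-bit value.
 * Assumed layout: length in the high half, hash in the low half. */
static inline uint64_t hashlen_pack(uint32_t hash, uint32_t len)
{
	return (uint64_t)len << 32 | hash;
}

/* Extract the two halves again. */
static inline uint32_t hashlen_get_hash(uint64_t hashlen)
{
	return (uint32_t)hashlen;
}

static inline uint32_t hashlen_get_len(uint64_t hashlen)
{
	return (uint32_t)(hashlen >> 32);
}
```

Returning the packed value from a helper means the caller gets both results in one register on 64-bit machines, with no pointer arguments needed.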
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-15 10:51:07 -07:00
* return the "hash_len" as the result.
2012-03-06 11:16:17 -08:00
*/
2016-06-10 07:51:30 -07:00
static inline u64 hash_name ( const void * salt , const char * name )
2012-03-06 11:16:17 -08:00
{
2016-06-10 07:51:30 -07:00
unsigned long a = 0 , b , x = 0 , y = ( unsigned long ) salt ;
unsigned long adata , bdata , mask , len ;
word-at-a-time: make the interfaces truly generic
This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
complicated, but a lot more generic.
In particular, it allows us to really do the operations efficiently on
both little-endian and big-endian machines, pretty much regardless of
machine details. For example, if you can rely on a fast population
count instruction on your architecture, this will allow you to make your
optimized <asm/word-at-a-time.h> file with that.
NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
not truly generic, it actually only works on big-endian. Why? Because
on little-endian the generic algorithms are wasteful, since you can
inevitably do better. The x86 implementation is an example of that.
(The only truly non-generic part of the asm-generic implementation is
the "find_zero()" function, and you could make a little-endian version
of it. And if the Kbuild infrastructure allowed us to pick a particular
header file, that would be lovely)
The <asm/word-at-a-time.h> functions are as follows:
- WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
uses.
- has_zero(): take a word, and determine if it has a zero byte in it.
It gets the word, the pointer to the constant pool, and a pointer to
an intermediate "data" field it can set.
This is the "quick-and-dirty" zero tester: it's what is run inside
the hot loops.
- "prep_zero_mask()": take the word, the data that has_zero() produced,
and the constant pool, and generate an *exact* mask of which byte had
the first zero. This is run directly *outside* the loop, and allows
the "has_zero()" function to answer the "is there a zero byte"
question without necessarily getting exactly *which* byte is the
first one to contain a zero.
If you do multiple byte lookups concurrently (eg "hash_name()", which
looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
phase, the result of those can be or'ed together to get the "either
or" case.
- The result from "prep_zero_mask()" can then be fed into "find_zero()"
(to find the byte offset of the first byte that was zero) or into
"zero_bytemask()" (to find the bytemask of the bytes preceding the
zero byte).
The existence of zero_bytemask() is optional, and is not necessary
for the normal string routines. But dentry name hashing needs it, so
if you enable DENTRY_WORD_AT_A_TIME you need to expose it.
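The steps listed above can be sketched in portable C for a 64-bit word whose first string byte sits in the low-order position (a little-endian view). This is a simplified illustration, not the real interface: the actual functions also take the constant pool and an intermediate "data" pointer, and every name ending in _sketch, plus pack_le() and component_len_in_word(), is hypothetical.

```c
#include <stdint.h>

#define ONE_BITS  UINT64_C(0x0101010101010101)
#define HIGH_BITS UINT64_C(0x8080808080808080)
/* Replicate one byte value into all eight byte lanes. */
#define REPEAT(c) (ONE_BITS * (uint8_t)(c))

/* Quick test: nonzero iff some byte of 'a' is zero.
 * Only the lowest set bit of the result is exact. */
static inline uint64_t has_zero_sketch(uint64_t a)
{
	return (a - ONE_BITS) & ~a & HIGH_BITS;
}

/* Isolate the lowest set bit and turn it into a bytemask that
 * covers the bytes preceding the first zero byte. */
static inline uint64_t zero_mask_sketch(uint64_t bits)
{
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

/* Byte offset of the first zero byte: count the 0xff bytes in the mask. */
static inline unsigned find_zero_sketch(uint64_t mask)
{
	unsigned n = 0;

	while (mask & 0xff) {
		mask >>= 8;
		n++;
	}
	return n;
}

/* Pack exactly 8 bytes with s[0] in the low-order byte. */
static inline uint64_t pack_le(const char *s)
{
	uint64_t w = 0;

	for (unsigned i = 0; i < 8; i++)
		w |= (uint64_t)(uint8_t)s[i] << (8 * i);
	return w;
}

/* Number of bytes before the first NUL or '/' in the word; 8 if neither. */
static inline unsigned component_len_in_word(uint64_t a)
{
	/* The two searches are or'ed together to get the "either or" case. */
	uint64_t bits = has_zero_sketch(a) | has_zero_sketch(a ^ REPEAT('/'));

	if (!bits)
		return 8;
	return find_zero_sketch(zero_mask_sketch(bits));
}
```

XORing with the repeated '/' byte turns every '/' into a zero byte, so the same zero detector serves both searches. A big-endian word would need a different way of locating the first zero byte, which is the non-generic part the message mentions.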
This changes the generic strncpy_from_user() function and the dentry
hashing functions to use these modified word-at-a-time interfaces. This
gets us back to the optimized state of the x86 strncpy that we lost in
the previous commit when moving over to the generic version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-26 10:43:17 -07:00
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS ;
2012-03-06 11:16:17 -08:00
2016-06-10 07:51:30 -07:00
len = 0 ;
goto inside ;
2012-03-06 11:16:17 -08:00
do {
2016-05-23 07:43:58 -04:00
HASH_MIX ( x , y , a ) ;
2012-03-06 11:16:17 -08:00
len += sizeof ( unsigned long ) ;
2016-06-10 07:51:30 -07:00
inside :
2012-05-03 10:16:43 -07:00
a = load_unaligned_zeropad ( name + len ) ;
2012-05-26 10:43:17 -07:00
b = a ^ REPEAT_BYTE ( '/' ) ;
} while ( ! ( has_zero ( a , & adata , & constants ) | has_zero ( b , & bdata , & constants ) ) ) ;
adata = prep_zero_mask ( a , adata , & constants ) ;
bdata = prep_zero_mask ( b , bdata , & constants ) ;
mask = create_zero_mask ( adata | bdata ) ;
2016-05-23 07:43:58 -04:00
x ^= a & zero_bytemask ( mask ) ;
2012-05-26 10:43:17 -07:00
2016-05-23 07:43:58 -04:00
return hashlen_create ( fold_hash ( x , y ) , len + find_zero ( mask ) ) ;
2012-03-06 11:16:17 -08:00
}
2016-05-23 07:43:58 -04:00
# else /* !CONFIG_DCACHE_WORD_ACCESS: Slow, byte-at-a-time version */
2012-03-06 11:16:17 -08:00
2016-05-20 08:41:37 -04:00
/* Return the hash of a string of known length */
2016-06-10 07:51:30 -07:00
unsigned int full_name_hash ( const void * salt , const char * name , unsigned int len )
2012-03-02 14:32:59 -08:00
{
2016-06-10 07:51:30 -07:00
unsigned long hash = init_name_hash ( salt ) ;
2012-03-02 14:32:59 -08:00
while ( len -- )
2016-05-20 08:41:37 -04:00
hash = partial_name_hash ( ( unsigned char ) * name ++ , hash ) ;
2012-03-02 14:32:59 -08:00
return end_name_hash ( hash ) ;
}
2012-03-02 19:40:57 -08:00
EXPORT_SYMBOL ( full_name_hash ) ;
2012-03-02 14:32:59 -08:00
2016-05-20 08:41:37 -04:00
/* Return the "hash_len" (hash and length) of a null-terminated string */
2016-06-10 07:51:30 -07:00
u64 hashlen_string ( const void * salt , const char * name )
2016-05-20 08:41:37 -04:00
{
2016-06-10 07:51:30 -07:00
unsigned long hash = init_name_hash ( salt ) ;
2016-05-20 08:41:37 -04:00
unsigned long len = 0 , c ;
c = ( unsigned char ) * name ;
2016-05-29 08:05:56 -04:00
while ( c ) {
2016-05-20 08:41:37 -04:00
len ++ ;
hash = partial_name_hash ( c , hash ) ;
c = ( unsigned char ) name [ len ] ;
2016-05-29 08:05:56 -04:00
}
2016-05-20 08:41:37 -04:00
return hashlen_create ( end_name_hash ( hash ) , len ) ;
}
2016-05-29 01:26:41 -04:00
EXPORT_SYMBOL ( hashlen_string ) ;
2016-05-20 08:41:37 -04:00
2012-03-02 14:49:24 -08:00
/*
* We know there's a real path component here of at least
* one character.
*/
2016-06-10 07:51:30 -07:00
static inline u64 hash_name ( const void * salt , const char * name )
2012-03-02 14:49:24 -08:00
{
2016-06-10 07:51:30 -07:00
unsigned long hash = init_name_hash ( salt ) ;
2012-03-02 14:49:24 -08:00
unsigned long len = 0 , c ;
c = ( unsigned char ) * name ;
do {
len ++ ;
hash = partial_name_hash ( c , hash ) ;
c = ( unsigned char ) name [ len ] ;
} while ( c && c != '/' ) ;
2014-09-15 10:51:07 -07:00
return hashlen_create ( end_name_hash ( hash ) , len ) ;
2012-03-02 14:49:24 -08:00
}
2012-03-06 11:16:17 -08:00
# endif
2005-04-16 15:20:36 -07:00
/*
* Name resolution.
2005-04-29 16:00:17 +01:00
* This is the basic name resolution function, turning a pathname into
* the final dentry. We expect 'base' to be positive and a directory.
2005-04-16 15:20:36 -07:00
*
2005-04-29 16:00:17 +01:00
* Returns 0 and nd will have valid dentry and mnt on success.
* Returns error and drops reference to input namei data on failure.
2005-04-16 15:20:36 -07:00
*/
2009-08-09 01:41:57 +04:00
static int link_path_walk ( const char * name , struct nameidata * nd )
2005-04-16 15:20:36 -07:00
{
link_path_walk(): simplify stack handling
We use nd->stack to store two things: pinning down the symlinks
we are resolving and resuming the name traversal when a nested
symlink is finished.
Currently, nd->depth is used to keep track of both. It's 0 when
we call link_path_walk() for the first time (for the pathname
itself) and 1 on all subsequent calls (for trailing symlinks,
if any). That's fine, as far as pinning symlinks goes - when
handling a trailing symlink, the string we are interpreting
is the body of symlink pinned down in nd->stack[0]. It's
rather inconvenient with respect to handling nested symlinks,
though - when we run out of a string we are currently interpreting,
we need to decide whether it's a nested symlink (in which case
we need to pick the string saved back when we started to interpret
that nested symlink and resume its traversal) or not (in which
case we are done with link_path_walk()).
The current solution is a bit of a kludge - in handling of trailing symlink
(in lookup_last() and open_last_lookups()) we clear nd->stack[0].name.
That allows link_path_walk() to use the following rules when
running out of a string to interpret:
* if nd->depth is zero, we are at the end of pathname itself.
* if nd->depth is positive, check the saved string; for
nested symlink it will be non-NULL, for trailing symlink - NULL.
It works, but it's rather non-obvious. Note that we have two sets:
the set of symlinks currently being traversed and the set of postponed
pathname tails. The former is stored in nd->stack[0..nd->depth-1].link
and it's valid throughout the pathname resolution; the latter is valid only
during an individual call of link_path_walk() and it occupies
nd->stack[0..nd->depth-1].name for the first call of link_path_walk() and
nd->stack[1..nd->depth-1].name for subsequent ones. The kludge is basically
a way to recognize the second set becoming empty.
The things get simpler if we keep track of the second set's size
explicitly and always store it in nd->stack[0..depth-1].name.
We access the second set only inside link_path_walk(), so its
size can live in a local variable; that way the check becomes
trivial without the need of that kludge.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
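The idea of keeping the set of postponed pathname tails in a stack whose size is a local variable can be sketched with a toy walker. Everything here is hypothetical: toy_links, toy_readlink() and toy_walk() are illustration-only names, and the sketch is simplified in that it pops a saved tail after the nested symlink body is exhausted, whereas the kernel code pops before stepping into the last component.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8

struct toy_link { const char *name; const char *body; };

/* Hypothetical symlink table: component name -> link body. */
static const struct toy_link toy_links[] = {
	{ "l1", "a/l2/b" },
	{ "l2", "c" },
};

static const char *toy_readlink(const char *comp, size_t len)
{
	for (size_t i = 0; i < sizeof(toy_links) / sizeof(toy_links[0]); i++)
		if (strlen(toy_links[i].name) == len &&
		    !strncmp(toy_links[i].name, comp, len))
			return toy_links[i].body;
	return NULL;
}

/* Walk 'name', appending each non-symlink component plus '/' to 'out'.
 * Returns 0 on success, -1 if nesting is too deep or 'out' is too small. */
static int toy_walk(const char *name, char *out, size_t outsz)
{
	const char *stack[MAX_DEPTH];	/* postponed pathname tails */
	int depth = 0;			/* size of that set, kept locally */
	size_t used = 0;

	if (!outsz)
		return -1;
	out[0] = '\0';
	for (;;) {
		while (*name == '/')
			name++;
		if (!*name) {
			if (!depth)
				return 0;	/* end of the pathname itself */
			name = stack[--depth];	/* resume a postponed tail */
			continue;
		}
		size_t len = strcspn(name, "/");
		const char *link = toy_readlink(name, len);
		name += len;
		if (link) {
			if (depth == MAX_DEPTH)
				return -1;
			stack[depth++] = name;	/* save the tail, follow the link */
			name = link;
			continue;
		}
		if (used + len + 2 > outsz)
			return -1;
		memcpy(out + used, name - len, len);
		used += len;
		out[used++] = '/';
		out[used] = '\0';
	}
}
```

The check for "are we done" is just a test of the local depth; no sentinel value has to be stored in the stack to tell a trailing symlink from a nested one.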
2020-02-23 22:04:15 -05:00
int depth = 0; // depth <= nd->depth
2005-04-16 15:20:36 -07:00
int err;
2015-04-18 20:30:49 -04:00
sanitize handling of nd->last_type, kill LAST_BIND
->last_type values are set in 3 places: path_init() (sets to LAST_ROOT),
link_path_walk (LAST_NORM/DOT/DOTDOT) and pick_link (LAST_BIND).
They are checked in walk_component(), lookup_last() and do_last().
They also get copied to the caller by filename_parentat(). In the last
3 cases the value is what we had at the return from link_path_walk().
In case of walk_component() it's either directly downstream from
assignment in link_path_walk() or, when called by lookup_last(), the
value we have at the return from link_path_walk().
The value at the entry into link_path_walk() can survive to return only
if the pathname contains nothing but slashes. Note that pick_link()
never returns such - pure jumps are handled directly. So for the calls
of link_path_walk() for trailing symlinks it does not matter what value
had been there at the entry; the value at the return won't depend upon it.
There are 3 call chains that might have pick_link() storing LAST_BIND:
1) pick_link() from step_into() from walk_component() from
link_path_walk(). In that case we will either be parsing the next
component immediately after return into link_path_walk(), which will
overwrite the ->last_type before anyone has a chance to look at it,
or we'll fail, in which case nobody will be looking at ->last_type at all.
2) pick_link() from step_into() from walk_component() from lookup_last().
The value is never looked at due to the above; it won't affect the value
seen at return from any link_path_walk().
3) pick_link() from step_into() from do_last(). Ditto.
In other words, assignment in pick_link() is pointless, and so is
LAST_BIND itself; nothing ever looks at that value. Kill it off.
And make link_path_walk() _always_ assign ->last_type - in the only
case when the value at the entry might survive to the return that value
is always LAST_ROOT, inherited from path_init(). Move that assignment
from path_init() into the beginning of link_path_walk(), to consolidate
the things.
Historical note: LAST_BIND used to be used for the kludge with trailing
pure jump symlinks (extra iteration through the top-level loop).
No point keeping it anymore...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
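The rule that the entry value of ->last_type survives only for a pathname made of nothing but slashes can be illustrated with a small user-space sketch. The enum values and the helpers classify_component() and last_type_of() are assumptions for illustration; they are not the kernel's definitions.

```c
#include <stddef.h>

/* Illustrative values only; the numeric order is an assumption. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT };

/* Classify one component of length len (len >= 1). */
static enum last_type classify_component(const char *name, size_t len)
{
	if (name[0] == '.') {
		if (len == 1)
			return LAST_DOT;
		if (len == 2 && name[1] == '.')
			return LAST_DOTDOT;
	}
	return LAST_NORM;
}

/* Return the type of the final component of path. The initial
 * LAST_ROOT survives only if the path has no components at all. */
static enum last_type last_type_of(const char *path)
{
	enum last_type type = LAST_ROOT;

	for (;;) {
		size_t len = 0;

		while (*path == '/')
			path++;
		if (!*path)
			return type;
		while (path[len] && path[len] != '/')
			len++;
		type = classify_component(path, len);	/* always overwritten */
		path += len;
	}
}
```

Every parsed component overwrites the type, so any value stored between components by another function would never be observed at return.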
2020-01-19 11:44:51 -05:00
nd->last_type = LAST_ROOT;
2020-03-05 15:48:44 -05:00
nd->flags |= LOOKUP_PARENT;
2018-07-09 16:33:23 -04:00
if (IS_ERR(name))
return PTR_ERR(name);
2005-04-16 15:20:36 -07:00
while (*name == '/')
name++;
2020-09-19 17:55:58 +01:00
if (!*name) {
nd->dir_mode = 0; // short-circuit the 'hardening' idiocy
2015-04-18 20:44:34 -04:00
return 0;
2020-09-19 17:55:58 +01:00
}
2005-04-16 15:20:36 -07:00
/* At this point we know we have a real path component. */
for (;;) {
2021-01-21 14:19:43 +01:00
struct user_namespace *mnt_userns;
2020-01-14 13:24:17 -05:00
const char *link;
2014-09-15 10:51:07 -07:00
u64 hash_len;
2011-02-22 15:10:03 -05:00
int type;
2005-04-16 15:20:36 -07:00
2021-01-21 14:19:43 +01:00
mnt_userns = mnt_user_ns(nd->path.mnt);
err = may_lookup(mnt_userns, nd);
fs/namei.c: Improve dcache hash function
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
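The round function and its reversibility can be checked with a short user-space sketch. The rotate constants are not given in this message, so K1 = 12 and K2 = 45 below are assumptions chosen for illustration; struct hash_state, mix_round() and unmix_round() are hypothetical names. Undoing "y *= 9" uses the multiplicative inverse of 9 modulo 2^64.

```c
#include <stdint.h>

/* Assumed rotate constants; the message does not state K1 and K2. */
#define K1 12
#define K2 45

/* 9 * INV9 == 1 (mod 2^64), so multiplying by INV9 undoes "y *= 9". */
#define INV9 0x8E38E38E38E38E39ULL

static inline uint64_t rol64(uint64_t v, unsigned s)
{
	return (v << s) | (v >> (64 - s));	/* valid for 0 < s < 64 */
}

static inline uint64_t ror64(uint64_t v, unsigned s)
{
	return (v >> s) | (v << (64 - s));	/* valid for 0 < s < 64 */
}

struct hash_state { uint64_t x, y; };

/* One mixing round, following the four lines quoted above. */
static void mix_round(struct hash_state *st, uint64_t input)
{
	st->x ^= input;
	st->y ^= st->x;
	st->x = rol64(st->x, K1);
	st->x += st->y;
	st->y = rol64(st->y, K2);
	st->y *= 9;
}

/* Inverse of mix_round() for a known input word: same steps, reversed. */
static void unmix_round(struct hash_state *st, uint64_t input)
{
	st->y *= INV9;
	st->y = ror64(st->y, K2);
	st->x -= st->y;
	st->x = ror64(st->x, K1);
	st->y ^= st->x;
	st->x ^= input;
}
```

Because every step is a bijection on the 128-bit state, a round loses no information. The all-zero property also follows directly: with x, y and the input all zero, every step leaves the state at zero.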
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 07:43:58 -04:00
if (err)
2015-05-09 16:54:45 -04:00
return err;
2005-04-16 15:20:36 -07:00
2016-06-10 07:51:30 -07:00
hash_len = hash_name(nd->path.dentry, name);
2005-04-16 15:20:36 -07:00
2011-02-22 15:10:03 -05:00
type = LAST_NORM;
2014-09-15 10:51:07 -07:00
if (name[0] == '.') switch (hashlen_len(hash_len)) {
2011-02-22 15:10:03 -05:00
case 2:
2012-03-02 14:49:24 -08:00
if (name[1] == '.') {
2011-02-22 15:10:03 -05:00
type = LAST_DOTDOT;
2021-04-01 22:03:41 -04:00
nd->state |= ND_JUMPED;
2011-02-22 15:50:10 -05:00
}
2011-02-22 15:10:03 -05:00
break;
case 1:
type = LAST_DOT;
}
2011-03-08 14:17:44 -05:00
if (likely(type == LAST_NORM)) {
struct dentry *parent = nd->path.dentry;
2021-04-01 22:03:41 -04:00
nd->state &= ~ND_JUMPED;
2011-03-08 14:17:44 -05:00
if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
2014-09-16 13:07:35 +01:00
struct qstr this = { { .hash_len = hash_len }, .name = name };
2013-05-21 15:22:44 -07:00
err = parent->d_op->d_hash(parent, &this);
2011-03-08 14:17:44 -05:00
if (err < 0)
2015-05-09 16:54:45 -04:00
return err;
2014-09-15 10:51:07 -07:00
hash_len = this.hash_len;
name = this.name;
2011-03-08 14:17:44 -05:00
}
}
2011-02-22 15:10:03 -05:00
2014-09-15 10:51:07 -07:00
nd->last.hash_len = hash_len;
nd->last.name = name;
2013-01-24 18:04:22 -05:00
nd->last_type = type;
2014-09-15 10:51:07 -07:00
name += hashlen_len(hash_len);
if (!*name)
2015-04-18 20:21:40 -04:00
goto OK;
2012-03-02 14:49:24 -08:00
/*
* If it wasn't NUL, we know it was '/'. Skip that
* slash, and continue until no more slashes.
*/
do {
2014-09-15 10:51:07 -07:00
name++;
} while (unlikely(*name == '/'));
2015-05-04 08:58:35 -04:00
if (unlikely(!*name)) {
OK:
2020-02-23 22:04:15 -05:00
/* pathname or trailing symlink, done */
2020-03-05 15:48:44 -05:00
if (!depth) {
2021-01-21 14:19:43 +01:00
nd->dir_uid = i_uid_into_mnt(mnt_userns, nd->inode);
2020-03-05 11:34:48 -05:00
nd->dir_mode = nd->inode->i_mode;
2020-03-05 15:48:44 -05:00
nd->flags &= ~LOOKUP_PARENT;
2015-05-04 08:58:35 -04:00
return 0;
2020-03-05 15:48:44 -05:00
}
2015-05-04 08:58:35 -04:00
/* last component of nested symlink */
2020-02-23 22:04:15 -05:00
name = nd->stack[--depth].name;
2020-01-19 12:44:18 -05:00
link = walk_component(nd, 0);
2016-11-14 01:39:36 -05:00
} else {
/* not the last component */
2020-01-19 12:44:18 -05:00
link = walk_component(nd, WALK_MORE);
2015-05-04 08:58:35 -04:00
}
2020-01-14 13:24:17 -05:00
if (unlikely(link)) {
if (IS_ERR(link))
return PTR_ERR(link);
/* a symlink to follow */
2020-02-23 22:04:15 -05:00
nd->stack[depth++].name = name;
2020-01-14 13:24:17 -05:00
name = link;
continue;
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
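The -ECHILD convention described in this message amounts to "try the lockless walk first, redo the whole lookup in ref-walk mode if it bails out". A minimal user-space sketch of that control flow follows. The walk_fn type, lookup_with_fallback() and toy_walker() are hypothetical, and the LOOKUP_RCU value is a placeholder, not the kernel's constant.

```c
#include <errno.h>

#define LOOKUP_RCU 0x0040	/* placeholder value for this sketch */

/* Hypothetical walker: returns 0 on success or a negative errno. */
typedef int (*walk_fn)(const char *path, unsigned flags, void *ctx);

/* Try rcu-walk first; on -ECHILD redo the entire lookup as ref-walk. */
static int lookup_with_fallback(walk_fn walk, const char *path,
				unsigned flags, void *ctx)
{
	int err = walk(path, flags | LOOKUP_RCU, ctx);

	if (err == -ECHILD)
		err = walk(path, flags, ctx);
	return err;
}

/* Toy walker that always refuses rcu-walk and counts its invocations. */
static int toy_walker(const char *path, unsigned flags, void *ctx)
{
	int *calls = ctx;

	(void)path;
	(*calls)++;
	return (flags & LOOKUP_RCU) ? -ECHILD : 0;
}
```

Only -ECHILD triggers the second pass; any other error from the first attempt is returned to the caller unchanged.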
2011-01-07 17:49:52 +11:00
}
2015-08-01 19:59:28 -04:00
if (unlikely(!d_can_lookup(nd->path.dentry))) {
if (nd->flags & LOOKUP_RCU) {
2020-12-17 09:19:08 -07:00
if (!try_to_unlazy(nd))
2015-08-01 19:59:28 -04:00
return -ECHILD;
}
2015-05-09 16:54:45 -04:00
return -ENOTDIR;
2015-08-01 19:59:28 -04:00
}
2005-04-16 15:20:36 -07:00
}
}
2018-07-09 16:27:23 -04:00
/* must be paired with terminate_walk() */
2015-05-12 18:43:07 -04:00
static const char *path_init(struct nameidata *nd, unsigned flags)
2011-01-07 17:49:52 +11:00
{
2019-12-07 01:13:29 +11:00
int error;
2015-05-12 18:43:07 -04:00
const char *s = nd->name->name;
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (i.e. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
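The per-dentry seqlock idea in the commit message above (load a tuple, then check whether any member changed) can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel's seqcount code: the names demo_entry, demo_init, demo_write, demo_read_begin, demo_read_retry and demo_load are hypothetical, a single writer is assumed (the kernel serialises writers with a lock), and C11 sequentially consistent atomics stand in for the kernel's barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the (parent, inode) part of a dentry. */
struct demo_entry {
	atomic_uint seq;      /* even: stable, odd: update in progress */
	atomic_int parent_id;
	atomic_int inode_id;
};

struct demo_snapshot {
	int parent_id;
	int inode_id;
};

static void demo_init(struct demo_entry *e, int parent_id, int inode_id)
{
	atomic_init(&e->seq, 0);
	atomic_init(&e->parent_id, parent_id);
	atomic_init(&e->inode_id, inode_id);
}

/* Writer (assumed to be the only one): odd while updating, even when done. */
static void demo_write(struct demo_entry *e, int parent_id, int inode_id)
{
	atomic_fetch_add(&e->seq, 1);
	atomic_store(&e->parent_id, parent_id);
	atomic_store(&e->inode_id, inode_id);
	atomic_fetch_add(&e->seq, 1);
}

/* Reader: wait for an even count and remember it. */
static unsigned demo_read_begin(struct demo_entry *e)
{
	unsigned seq;

	while ((seq = atomic_load(&e->seq)) & 1u)
		; /* a writer is mid-update, sample again */
	return seq;
}

/* Reader: true if the entry changed since demo_read_begin(). */
static bool demo_read_retry(struct demo_entry *e, unsigned seq)
{
	return atomic_load(&e->seq) != seq;
}

/* Load the whole tuple consistently, without taking any lock. */
static struct demo_snapshot demo_load(struct demo_entry *e)
{
	struct demo_snapshot s;
	unsigned seq;

	do {
		seq = demo_read_begin(e);
		s.parent_id = atomic_load(&e->parent_id);
		s.inode_id = atomic_load(&e->inode_id);
	} while (demo_read_retry(e, seq));
	return s;
}
```

A reader never stores to the shared entry, which is the property the commit message relies on to avoid cacheline bouncing on common path elements.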
2020-12-17 09:19:09 -07:00
/* LOOKUP_CACHED requires RCU, ask caller to retry */
if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
return ERR_PTR(-EAGAIN);
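The mask-and-compare test above is true only when LOOKUP_CACHED is set and LOOKUP_RCU is clear. A small sketch of the same idiom follows; the DEMO_ names and their bit values are placeholders chosen for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only, not the kernel's. */
#define DEMO_LOOKUP_RCU    0x1u
#define DEMO_LOOKUP_CACHED 0x2u
#define DEMO_LOOKUP_OTHER  0x4u

/* True when CACHED is requested but RCU mode is not active. */
static bool demo_cached_without_rcu(unsigned flags)
{
	return (flags & (DEMO_LOOKUP_RCU | DEMO_LOOKUP_CACHED))
		== DEMO_LOOKUP_CACHED;
}
```

Masking first means unrelated flag bits do not affect the result.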
2017-04-02 17:10:08 -07:00
if (!*s)
flags &= ~LOOKUP_RCU;
2018-07-09 16:27:23 -04:00
if (flags & LOOKUP_RCU)
rcu_read_lock();
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-04 18:12:39 -04:00
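The parent-to-child rule in the commit message above (sample the child's ->d_seq, then verify that the parent's ->d_seq is unchanged) can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: demo_node, demo_walk, demo_sample_child and demo_step_into are hypothetical names, plain integers replace real seqcounts, and a concurrent rename is simulated by bumping a counter by hand.

```c
#include <errno.h>

/* Hypothetical model of a dentry: only its sequence count matters here. */
struct demo_node {
	unsigned seq;
};

/* Hypothetical model of the walk state. */
struct demo_walk {
	struct demo_node *cur; /* where the walk currently is */
	unsigned seq;          /* count observed for cur (models nd->seq) */
	unsigned next_seq;     /* count sampled for the candidate child
	                          (models nd->next_seq) */
};

/* Step 1: sample the child's count before trusting anything about it. */
static void demo_sample_child(struct demo_walk *w, const struct demo_node *child)
{
	w->next_seq = child->seq;
}

/*
 * Step 2: move into the child only if the parent is unchanged since we
 * observed it. Otherwise report -ECHILD so the caller can fall back to
 * a reference-counted walk.
 */
static int demo_step_into(struct demo_walk *w, struct demo_node *child)
{
	if (w->cur->seq != w->seq)
		return -ECHILD;
	w->cur = child;
	w->seq = w->next_seq;
	return 0;
}
```

Keeping the sampled value in the walk state, rather than passing it as an argument, mirrors the stated motivation for nd->next_seq.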
else
nd->seq = nd->next_seq = 0;
2017-04-02 17:10:08 -07:00
2021-04-01 22:03:41 -04:00
nd->flags = flags;
nd->state |= ND_JUMPED;
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(in the case of LOOKUP_BENEATH the resolution will still fail if ".."
resolution would resolve a path outside of the root -- while
LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
still disallowed entirely[*].
As Jann explains[1,2], the need for this patch (and the original no-".."
restriction) is explained by observing there is a fairly easy-to-exploit
race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.
  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.
The primary reason for deferring to userspace with -EAGAIN is that an
in-kernel retry loop (or doing a path_is_under() check after re-taking
the relevant seqlocks) can become unreasonably expensive on machines
with lots of VFS activity (nfsd can cause lots of rename_lock updates).
Thus it should be up to userspace how many times they wish to retry the
lookup -- the selftests for this attack indicate that there is a ~35%
chance of the lookup succeeding on the first try even with an attacker
thrashing rename_lock.
A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.
[*] It may be acceptable in the future to do a path_is_under() check for
magic-links after they are resolved. However this seems unlikely to
be a feature that people *really* need -- it can be added later if
it turns out a lot of people want it.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:35 +11:00
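Since the commit message above leaves the retry policy to userspace, a caller might wrap the lookup in a bounded retry loop. The sketch below is one possible shape, not a recommended policy: demo_lookup_fn, demo_retry_on_eagain and demo_flaky_lookup are hypothetical names, and the operation is assumed to return a non-negative value on success or a negative errno value on failure (a real openat2(2) wrapper would have to translate the -1/errno convention itself).

```c
#include <errno.h>

/* Hypothetical callback: >= 0 on success, negative errno on failure. */
typedef int (*demo_lookup_fn)(void *ctx);

/*
 * Call op until it returns something other than -EAGAIN, at most
 * max_tries times. Returns the last result, or -EAGAIN if the budget
 * is exhausted (or if max_tries is not positive).
 */
static int demo_retry_on_eagain(demo_lookup_fn op, void *ctx, int max_tries)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
		ret = op(ctx);
	return ret;
}

/*
 * Stand-in for a racing lookup: fails with -EAGAIN while the counter
 * behind ctx is positive, then returns a pretend file descriptor.
 */
static int demo_flaky_lookup(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -EAGAIN;
	}
	return 3;
}
```

Bounding the number of attempts keeps an attacker who thrashes rename_lock from stalling the caller indefinitely; what to do when the budget runs out is left to the application.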
nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
smp_rmb();
2021-04-01 22:03:41 -04:00
if (nd->state & ND_ROOT_PRESET) {
2013-09-12 19:22:53 +01:00
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
2017-04-15 17:29:14 -04:00
if (*s && unlikely(!d_can_lookup(root)))
return ERR_PTR(-ENOTDIR);
2011-03-09 23:04:47 -05:00
nd->path = nd->root;
nd->inode = inode;
if (flags & LOOKUP_RCU) {
2019-12-07 01:13:35 +11:00
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
2015-05-09 19:02:01 -04:00
nd->root_seq = nd->seq;
2011-03-09 23:04:47 -05:00
} else {
path_get(&nd->path);
}
2015-05-08 17:19:59 -04:00
return s;
2011-03-09 23:04:47 -05:00
}
2011-01-07 17:49:52 +11:00
nd->root.mnt = NULL;
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.
Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.
Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root).
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).
More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.
/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_IN_ROOT applies to all components of the path.
With LOOKUP_IN_ROOT, any path component which attempts to cross the
starting point of the pathname lookup (the dirfd passed to openat) will
remain at the starting point. Thus, all absolute paths and symlinks will
be scoped within the starting point.
There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).
As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
will be lifted in a future patch, but requires more work to ensure that
permitting ".." is done safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.
/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:34 +11:00
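The chroot-style scoping described in the commit message above can be pictured with a purely lexical model. This sketch is illustrative only: demo_scope_path is a hypothetical helper, it ignores symlinks, mounts and magic-links entirely, and it clamps ".." at the root as the stated goal describes, whereas the patch itself blocks ".." outright until a later change. The kernel works on dentries, not strings.

```c
#include <stddef.h>
#include <string.h>

/*
 * Normalise path as if the starting directory were "/". A leading '/'
 * restarts at that same root, and ".." at the root stays at the root.
 * Writes the result to out; returns 0, or -1 if out is too small.
 */
static int demo_scope_path(const char *path, char *out, size_t outsz)
{
	size_t len = 0; /* length of out, excluding the terminator */

	if (outsz < 2)
		return -1;
	out[0] = '\0';

	while (*path) {
		size_t n;

		while (*path == '/') /* separators never leave the root */
			path++;
		n = strcspn(path, "/");
		if (n == 0)
			break;

		if (n == 1 && path[0] == '.') {
			/* "." keeps the current position */
		} else if (n == 2 && path[0] == '.' && path[1] == '.') {
			/* ".." drops the last component, if there is one */
			while (len > 0 && out[len - 1] != '/')
				len--;
			if (len > 0)
				len--; /* drop the separator too */
			out[len] = '\0';
		} else {
			if (len + 1 + n + 1 > outsz)
				return -1;
			out[len++] = '/';
			memcpy(out + len, path, n);
			len += n;
			out[len] = '\0';
		}
		path += n;
	}

	if (len == 0) { /* still at the root */
		out[0] = '/';
		out[1] = '\0';
	}
	return 0;
}
```

The result is always a path relative to the starting directory, so "../../x" and "/x" both come out as "/x" under that directory. This is exactly the kind of userspace path handling the commit message calls prone to foot-guns, which is why the real scoping is done in the kernel.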
/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
2019-12-07 01:13:29 +11:00
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
return s;
2019-12-07 01:13:34 +11:00
}
/* Relative pathname -- get the starting-point it is relative to. */
if (nd->dfd == AT_FDCWD) {
2011-02-22 14:02:58 -05:00
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
2011-01-07 17:49:52 +11:00
2011-02-22 14:02:58 -05:00
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
2015-12-05 20:25:06 -05:00
nd->inode = nd->path.dentry->d_inode;
2011-02-22 14:02:58 -05:00
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
2015-12-05 20:25:06 -05:00
nd->inode = nd->path.dentry->d_inode;
2011-02-22 14:02:58 -05:00
}
2011-01-07 17:49:52 +11:00
} else {
2012-12-11 08:56:16 -05:00
/* Caller must check execute permissions on the starting path component */
2015-05-12 18:43:07 -04:00
struct fd f = fdget_raw(nd->dfd);
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
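The -ECHILD convention above amounts to "try the optimistic mode first, redo everything in ref-walk if it bails out". A minimal sketch of that control flow, assuming hypothetical names (TOY_LOOKUP_RCU, struct toy_walk, toy_path_lookup(), toy_lookup()) and a boolean that models a failed recheck when leaving rcu-walk:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_LOOKUP_RCU 0x1u	/* hypothetical flag value for this sketch */

struct toy_walk {
	bool recheck_fails;	/* models a sequence/refcount mismatch */
	int rcu_attempts;
	int ref_attempts;
};

/* One lookup attempt in the mode selected by 'flags'. */
static int toy_path_lookup(struct toy_walk *w, unsigned flags)
{
	if (flags & TOY_LOOKUP_RCU) {
		w->rcu_attempts++;
		/* cannot leave rcu-walk gracefully: ask for a full retry */
		return w->recheck_fails ? -ECHILD : 0;
	}
	w->ref_attempts++;	/* ref-walk holds references, so no recheck */
	return 0;
}

/* Caller side: optimistic attempt first, full ref-walk only on -ECHILD. */
static int toy_lookup(struct toy_walk *w, unsigned flags)
{
	int err = toy_path_lookup(w, flags | TOY_LOOKUP_RCU);

	if (err == -ECHILD)
		err = toy_path_lookup(w, flags);
	return err;
}
```

The sketch only covers the full-restart case. The cheaper case described above, where the walk takes a reference on the last good dentry and continues in ref-walk mid-path, is not modelled.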
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
struct dentry *dentry;
2012-08-28 12:52:22 -04:00
if (!f.file)
2015-05-08 17:19:59 -04:00
return ERR_PTR(-EBADF);
2011-01-07 17:49:52 +11:00
2012-08-28 12:52:22 -04:00
dentry = f.file->f_path.dentry;
2011-01-07 17:49:52 +11:00
2018-07-09 16:27:23 -04:00
if (*s && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return ERR_PTR(-ENOTDIR);
2011-03-14 18:56:51 -04:00
}
2011-01-07 17:49:52 +11:00
2012-08-28 12:52:22 -04:00
nd->path = f.file->f_path;
2011-02-22 14:02:58 -05:00
if (flags & LOOKUP_RCU) {
2015-05-11 08:05:05 -04:00
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
2011-02-22 14:02:58 -05:00
} else {
2012-08-28 12:52:22 -04:00
path_get(&nd->path);
2015-05-11 08:05:05 -04:00
nd->inode = nd->path.dentry->d_inode;
2011-02-22 14:02:58 -05:00
}
2015-05-11 08:05:05 -04:00
fdput(f);
2011-01-07 17:49:52 +11:00
}
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.
Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.
Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root).
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).
More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.
/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_IN_ROOT applies to all components of the path.
With LOOKUP_IN_ROOT, any path component which attempts to cross the
starting point of the pathname lookup (the dirfd passed to openat) will
remain at the starting point. Thus, all absolute paths and symlinks will
be scoped within the starting point.
There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).
As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
will be lifted in a future patch, but requires more work to ensure that
permitting ".." is done safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.
/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:34 +11:00
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.
This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).
/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.
With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.
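The difference between the two scoped modes, as described in this message and in the LOOKUP_IN_ROOT one, can be illustrated with a toy pathname check. This is a sketch under stated assumptions, not the kernel's resolver: enum toy_scope and toy_scoped_check() are hypothetical, symlinks and magic-links are not modelled, and returning -EXDEV for ".." in the in-root mode is an assumption (the message only says such components are blocked).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum toy_scope { TOY_BENEATH, TOY_IN_ROOT };	/* hypothetical modes */

/*
 * Returns 0 if 'path' stays inside the starting directory under the
 * given mode, or -EXDEV if a component is refused.
 */
static int toy_scoped_check(const char *path, enum toy_scope scope)
{
	const char *p = path;

	/* BENEATH refuses absolute paths; IN_ROOT treats "/" as the dirfd. */
	if (*p == '/' && scope == TOY_BENEATH)
		return -EXDEV;

	while (*p) {
		size_t len;

		while (*p == '/')	/* skip separators */
			p++;
		len = strcspn(p, "/");	/* length of this component */
		if (len == 2 && p[0] == '.' && p[1] == '.')
			return -EXDEV;	/* ".." is blocked in both modes */
		p += len;
	}
	return 0;
}
```

With this model "etc/passwd" is accepted in both modes, "/etc/passwd" is refused only in the beneath mode, and "a/../b" is refused in both.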
/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-07 01:13:33 +11:00
/* For scoped-lookups we need to set the root to the dirfd as well. */
if (flags & LOOKUP_IS_SCOPED) {
nd->root = nd->path;
if (flags & LOOKUP_RCU) {
nd->root_seq = nd->seq;
} else {
path_get(&nd->root);
2021-04-01 22:03:41 -04:00
nd->state |= ND_ROOT_GRABBED;
2019-12-07 01:13:33 +11:00
}
}
return s;
2009-04-07 11:44:16 -04:00
}
2020-01-14 10:13:40 -05:00
static inline const char *lookup_last(struct nameidata *nd)
2011-03-14 19:54:59 -04:00
{
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
2020-03-05 15:48:44 -05:00
return walk_component(nd, WALK_TRAILING);
2011-03-14 19:54:59 -04:00
}
2017-04-15 17:31:22 -04:00
static int handle_lookup_down(struct nameidata *nd)
{
2020-01-09 14:50:18 -05:00
if (!(nd->flags & LOOKUP_RCU))
2020-01-09 14:41:00 -05:00
dget(nd->path.dentry);
namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq. Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered. Used instead of an arseload
of local variables, arguments, etc.
step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.
There are two requirements for RCU pathwalk:
1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.
The use of seq numbers is the way we achieve that. Invariant we want
to maintain is:
if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq
For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged. Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed. Same reasoning as
for the previous case applies.
For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk. That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk. The same reasoning as in the
previous case applies.
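The parent-to-child rule above (sample the child's ->d_seq, then verify that the parent's ->d_seq is unchanged) can be sketched as a single-threaded model. The names struct toy_node, struct toy_nd, toy_step_to_child() and toy_disrupt() are hypothetical; plain counters stand in for seqcounts, and returning -ECHILD on a mismatch is an assumption of this sketch rather than a description of every kernel code path.

```c
#include <assert.h>
#include <errno.h>

struct toy_node {
	unsigned d_seq;		/* stand-in for dentry->d_seq */
};

struct toy_nd {
	struct toy_node *cur;	/* stand-in for nd->path.dentry */
	unsigned seq;		/* sampled d_seq of 'cur' */
	unsigned next_seq;	/* sampled d_seq of the next dentry */
};

/* Move from nd->cur to 'child', keeping the sampled counters coherent. */
static int toy_step_to_child(struct toy_nd *nd, struct toy_node *child)
{
	nd->next_seq = child->d_seq;	/* 1) sample the child */
	if (nd->cur->d_seq != nd->seq)	/* 2) has the parent changed? */
		return -ECHILD;		/*    yes: rcu-walk cannot go on */
	nd->cur = child;		/* 3) commit the transition */
	nd->seq = nd->next_seq;
	return 0;
}

/* Anything that breaks the parent/child relation bumps both counters. */
static void toy_disrupt(struct toy_node *parent, struct toy_node *child)
{
	parent->d_seq += 2;
	child->d_seq += 2;
}
```

Because a disruption bumps both counters, a walker that sampled the parent earlier sees the mismatch at step 2 and never commits to a child whose relationship to that parent may have changed.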
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-04 18:12:39 -04:00
nd->next_seq = nd->seq;
2022-07-03 22:07:32 -04:00
return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry));
2017-04-15 17:31:22 -04:00
}
2009-04-07 11:44:16 -04:00
/* Returns 0 and nd will be valid on success; returns an error otherwise. */
2015-05-12 18:43:07 -04:00
static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path)
2009-04-07 11:44:16 -04:00
{
2015-05-12 18:43:07 -04:00
const char *s = path_init(nd, flags);
2011-03-14 19:54:59 -04:00
int err;
2011-01-07 17:49:52 +11:00
2018-07-09 16:33:23 -04:00
if (unlikely(flags & LOOKUP_DOWN) && !IS_ERR(s)) {
2017-04-15 17:31:22 -04:00
err = handle_lookup_down(nd);
2018-07-09 16:38:06 -04:00
if (unlikely(err < 0))
s = ERR_PTR(err);
2017-04-15 17:31:22 -04:00
}
2020-01-14 10:13:40 -05:00
while (!(err = link_path_walk(s, nd)) &&
(s = lookup_last(nd)) != NULL)
;
2021-04-06 19:46:51 -04:00
if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
err = handle_lookup_down(nd);
2021-04-01 22:03:41 -04:00
nd->state &= ~ND_JUMPED; // no d_weak_revalidate(), please...
2021-04-06 19:46:51 -04:00
}
2011-03-25 11:00:12 -04:00
if (!err)
err = complete_walk(nd);
2011-03-14 19:54:59 -04:00
2015-05-08 18:05:21 -04:00
if (!err && nd->flags & LOOKUP_DIRECTORY)
if (!d_can_lookup(nd->path.dentry))
2011-03-23 09:56:30 -04:00
err = -ENOTDIR;
2015-05-12 16:36:12 -04:00
if (!err) {
*path = nd->path;
nd->path.mnt = NULL;
nd->path.dentry = NULL;
}
terminate_walk(nd);
2011-03-14 19:54:59 -04:00
return err;
2011-02-21 23:38:09 -05:00
}
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
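The per-dentry sequence check described in the list above can be illustrated with a small user-space sketch. This is not the kernel's seqcount implementation: the names `seq_obj`, `seq_write_begin`, `seq_write_end` and `seq_read_once` are hypothetical, the tuple is a stand-in for (name, parent, inode), and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the (name, parent, inode) tuple of a dentry. */
struct tuple {
	int parent;
	int inode;
};

struct seq_obj {
	atomic_uint seq;	/* even: stable, odd: a writer is in progress */
	struct tuple data;
};

/* Writer side (single writer assumed): make the counter odd before
 * touching the data, and even again once the update is complete. */
static void seq_write_begin(struct seq_obj *o)
{
	atomic_fetch_add(&o->seq, 1);
}

static void seq_write_end(struct seq_obj *o)
{
	atomic_fetch_add(&o->seq, 1);
}

/* Reader side: one lockless attempt. Returns true only if the copy in
 * *out was taken while no writer was active and the counter did not move.
 * A production version would also need race-tolerant data accesses. */
static bool seq_read_once(struct seq_obj *o, struct tuple *out)
{
	unsigned int start = atomic_load(&o->seq);

	if (start & 1)
		return false;		/* writer active, caller must retry */
	*out = o->data;
	return atomic_load(&o->seq) == start;
}
```

A reader that gets `false` either retries or, as in the design above, gives up on the lockless walk and falls back to the reference-counted one.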
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:49:52 +11:00
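The retry behaviour this message describes (attempt rcu-walk first, redo the lookup with ref-walk on -ECHILD) can be sketched in user space as below. The further -ESTALE retry mirrors the callers shown elsewhere in this file rather than this commit message. The flag values, the `walk_fn` type and `demo_walk` are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the kernel's LOOKUP_* constants differ. */
#define SK_LOOKUP_RCU   0x1u
#define SK_LOOKUP_REVAL 0x2u

/* A walker returns 0 on success or a negative errno. */
typedef int (*walk_fn)(unsigned int flags, void *ctx);

/* Try the cheap lockless walk first; escalate only when it asks for it. */
static int lookup_with_fallback(walk_fn walk, unsigned int flags, void *ctx)
{
	int ret = walk(flags | SK_LOOKUP_RCU, ctx);

	if (ret == -ECHILD)		/* lockless walk could not finish */
		ret = walk(flags, ctx);
	if (ret == -ESTALE)		/* cached state was stale, revalidate */
		ret = walk(flags | SK_LOOKUP_REVAL, ctx);
	return ret;
}

/* Demo walker: always refuses the lockless mode and counts its calls. */
struct demo_ctx {
	int calls;
};

static int demo_walk(unsigned int flags, void *ctx)
{
	struct demo_ctx *d = ctx;

	d->calls++;
	return (flags & SK_LOOKUP_RCU) ? -ECHILD : 0;
}
```

With `demo_walk`, the first attempt fails with -ECHILD and the second succeeds, so the whole lookup costs two walks. That is the "double lookup" overhead the message wants to keep rare.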
2021-09-01 10:51:42 -07:00
int filename_lookup ( int dfd , struct filename * name , unsigned flags ,
vfs: Add configuration parser helpers
Because the new API passes in key,value parameters, match_token() cannot be
used with it. Instead, provide three new helpers to aid with parsing:
(1) fs_parse(). This takes a parameter and a simple static description of
all the parameters and maps the key name to an ID. It returns 1 on a
match, 0 on no match if unknowns should be ignored, and a negative
error code on a parse error.
The parameter description includes a list of key names to IDs, desired
parameter types and a list of enumeration name -> ID mappings.
[!] Note that for the moment I've required that the key->ID mapping
array be sorted and unterminated. The size of the
array is noted in the fsconfig_parser struct. This allows me to use
bsearch(), but I'm not sure any performance gain is worth the hassle
of requiring people to keep the array sorted.
The parameter type array is sized according to the number of parameter
IDs and is indexed directly. The optional enum mapping array is an
unterminated, unsorted list and the size goes into the fsconfig_parser
struct.
The function can do some additional things:
(a) If it's not ambiguous and no value is given, the prefix "no" on
a key name is permitted to indicate that the parameter should
be considered negatory.
(b) If the desired type is a single simple integer, it will perform
an appropriate conversion and store the result in a union in
the parse result.
(c) If the desired type is an enumeration, {key ID, name} will be
looked up in the enumeration list and the matching value will
be stored in the parse result union.
(d) Optionally generate an error if the key is unrecognised.
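As an illustration of point (a), a minimal user-space sketch of "no"-prefix handling might look like the following. It is not the fs_parse() implementation: `find_key` and `match_key` are hypothetical names, and a linear scan is used for brevity where the message describes a sorted table searched with bsearch(). An exact match is tried first, which is one way to keep a real key such as "nodev" unambiguous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return the index of name in keys[], or -1 if it is not present. */
static int find_key(const char *const *keys, size_t nr, const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (strcmp(keys[i], name) == 0)
			return (int)i;
	return -1;
}

/* Map a key name to its ID. If there is no exact match and no value was
 * supplied, accept "no<key>" and report it through *negated. */
static int match_key(const char *const *keys, size_t nr, const char *name,
		     bool has_value, bool *negated)
{
	int id = find_key(keys, nr, name);

	*negated = false;
	if (id >= 0 || has_value)
		return id;
	if (strncmp(name, "no", 2) != 0 || name[2] == '\0')
		return -1;
	id = find_key(keys, nr, name + 2);
	if (id >= 0)
		*negated = true;
	return id;
}
```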
This is called something like:
enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mpbs,
nr__rdt_params
};
const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
[Opt_cdp] = { fs_param_is_bool },
[Opt_cdpl2] = { fs_param_is_bool },
[Opt_mba_mpbs] = { fs_param_is_bool },
};
const char *const rdt_param_keys[nr__rdt_params] = {
[Opt_cdp] = "cdp",
[Opt_cdpl2] = "cdpl2",
[Opt_mba_mpbs] = "mba_mbps",
};
const struct fs_parameter_description rdt_parser = {
.name = "rdt",
.nr_params = nr__rdt_params,
.keys = rdt_param_keys,
.specs = rdt_param_specs,
.no_source = true,
};
int rdt_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct rdt_fs_context *ctx = rdt_fc2context(fc);
int ret;
ret = fs_parse(fc, &rdt_parser, param, &parse);
if (ret < 0)
return ret;
switch (parse.key) {
case Opt_cdp:
ctx->enable_cdpl3 = true;
return 0;
case Opt_cdpl2:
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mpbs:
ctx->enable_mba_mbps = true;
return 0;
}
return -EINVAL;
}
(2) fs_lookup_param(). This takes a { dirfd, path, LOOKUP_EMPTY? } or
string value and performs an appropriate path lookup to convert it
into a path object, which it will then return.
If the desired type was a blockdev, the type of the looked up inode
will be checked to make sure it is one.
This can be used like:
enum foo_param {
Opt_source,
nr__foo_params
};
const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
[Opt_source] = { fs_param_is_blockdev },
};
const char *const foo_param_keys[nr__foo_params] = {
[Opt_source] = "source",
};
const struct constant_table foo_param_alt_keys[] = {
{ "device", Opt_source },
};
const struct fs_parameter_description foo_parser = {
.name = "foo",
.nr_params = nr__foo_params,
.nr_alt_keys = ARRAY_SIZE(foo_param_alt_keys),
.keys = foo_param_keys,
.alt_keys = foo_param_alt_keys,
.specs = foo_param_specs,
};
int foo_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct foo_fs_context *ctx = foo_fc2context(fc);
int ret;
ret = fs_parse(fc, &foo_parser, param, &parse);
if (ret < 0)
return ret;
switch (parse.key) {
case Opt_source:
return fs_lookup_param(fc, &foo_parser, param,
&parse, &ctx->source);
default:
return -EINVAL;
}
}
(3) lookup_constant(). This takes a table of named constants and looks up
the given name within it. The table is expected to be sorted such
that bsearch() can be used upon it.
Possibly I should require the table be terminated and just use a
for-loop to scan it instead of using bsearch() to reduce hassle.
Tables look something like:
static const struct constant_table bool_names[] = {
{ "0", false },
{ "1", true },
{ "false", false },
{ "no", false },
{ "true", true },
{ "yes", true },
};
and a lookup is done with something like:
b = lookup_constant(bool_names, param->string, -1);
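A user-space sketch of a sorted-table lookup of this kind, using the C library's bsearch(), is shown below. The struct and function names (`constant_entry`, `lookup_constant_sketch`, `bool_names_sketch`) are made up for the example, and the explicit element count is an assumption; the helper described above may size its table differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct constant_entry {
	const char *name;
	int value;
};

/* bsearch() comparator: the key is the string being looked up. */
static int cmp_entry(const void *key, const void *elem)
{
	const struct constant_entry *e = elem;

	return strcmp(key, e->name);
}

/* Look name up in a table sorted by strcmp() order; return not_found
 * when there is no entry with that name. */
static int lookup_constant_sketch(const struct constant_entry *tbl, size_t nr,
				  const char *name, int not_found)
{
	const struct constant_entry *e =
		bsearch(name, tbl, nr, sizeof(*tbl), cmp_entry);

	return e ? e->value : not_found;
}

/* Must stay sorted, otherwise bsearch() may miss entries. */
static const struct constant_entry bool_names_sketch[] = {
	{ "0", 0 },
	{ "1", 1 },
	{ "false", 0 },
	{ "no", 0 },
	{ "true", 1 },
	{ "yes", 1 },
};
```

The sorting requirement is the cost the message worries about: an entry added out of order is not rejected, it simply becomes unreliable to find.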
Additionally, optional validation routines for the parameter description
are provided that can be enabled at compile time. A later patch will
invoke these when a filesystem is registered.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:24 +00:00
struct path * path , struct path * root )
2011-02-21 23:38:09 -05:00
{
2015-05-02 07:16:16 -04:00
int retval ;
2015-05-13 07:28:08 -04:00
struct nameidata nd ;
2015-05-12 16:53:42 -04:00
if ( IS_ERR ( name ) )
return PTR_ERR ( name ) ;
2021-04-01 22:28:03 -04:00
set_nameidata ( & nd , dfd , name , root ) ;
2015-05-12 18:43:07 -04:00
retval = path_lookupat ( & nd , flags | LOOKUP_RCU , path ) ;
2011-02-21 23:38:09 -05:00
if ( unlikely ( retval == -ECHILD ) )
2015-05-12 18:43:07 -04:00
retval = path_lookupat ( & nd , flags , path ) ;
2011-02-21 23:38:09 -05:00
if ( unlikely ( retval == -ESTALE ) )
2015-05-12 18:43:07 -04:00
retval = path_lookupat ( & nd , flags | LOOKUP_REVAL , path ) ;
2012-10-10 15:25:20 -04:00
if ( likely ( ! retval ) )
2020-01-11 22:52:26 -05:00
audit_inode ( name , path->dentry ,
flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0 ) ;
2015-05-13 07:28:08 -04:00
restore_nameidata ( ) ;
2021-07-08 13:34:43 +07:00
return retval ;
}
2015-05-08 16:59:20 -04:00
/* Returns 0 and nd will be valid on success; returns an error otherwise. */
2015-05-12 18:43:07 -04:00
static int path_parentat ( struct nameidata * nd , unsigned flags ,
2015-05-09 11:19:16 -04:00
struct path * parent )
2015-05-08 16:59:20 -04:00
{
2015-05-12 18:43:07 -04:00
const char * s = path_init ( nd , flags ) ;
2018-07-09 16:33:23 -04:00
int err = link_path_walk ( s , nd ) ;
2015-05-08 16:59:20 -04:00
if ( ! err )
err = complete_walk ( nd ) ;
2015-05-09 11:19:16 -04:00
if ( ! err ) {
* parent = nd->path ;
nd->path.mnt = NULL ;
nd->path.dentry = NULL ;
}
terminate_walk ( nd ) ;
2015-05-08 16:59:20 -04:00
return err ;
}
2021-09-01 10:51:41 -07:00
/* Note: this does not consume "name" */
2021-09-07 15:57:42 -04:00
static int filename_parentat ( int dfd , struct filename * name ,
2021-09-01 10:51:41 -07:00
unsigned int flags , struct path * parent ,
struct qstr * last , int * type )
2015-05-08 16:59:20 -04:00
{
int retval ;
2015-05-13 07:28:08 -04:00
struct nameidata nd ;
2015-05-08 16:59:20 -04:00
2015-05-12 17:32:54 -04:00
if ( IS_ERR ( name ) )
2021-07-08 13:34:38 +07:00
return PTR_ERR ( name ) ;
2021-04-01 22:28:03 -04:00
set_nameidata ( & nd , dfd , name , NULL ) ;
2015-05-12 18:43:07 -04:00
retval = path_parentat ( & nd , flags | LOOKUP_RCU , parent ) ;
2015-05-08 16:59:20 -04:00
if ( unlikely ( retval == -ECHILD ) )
2015-05-12 18:43:07 -04:00
retval = path_parentat ( & nd , flags , parent ) ;
2015-05-08 16:59:20 -04:00
if ( unlikely ( retval == -ESTALE ) )
2015-05-12 18:43:07 -04:00
retval = path_parentat ( & nd , flags | LOOKUP_REVAL , parent ) ;
2015-05-09 11:19:16 -04:00
if ( likely ( ! retval ) ) {
* last = nd . last ;
* type = nd . last_type ;
2019-07-14 13:22:27 -04:00
audit_inode ( name , parent->dentry , AUDIT_INODE_PARENT ) ;
2015-05-09 11:19:16 -04:00
}
2015-05-13 07:28:08 -04:00
restore_nameidata ( ) ;
2021-07-08 13:34:38 +07:00
return retval ;
}
2012-06-15 03:01:42 +04:00
/* does lookup, returns the object with parent locked */
2021-09-01 10:51:41 -07:00
static struct dentry * __kern_path_locked ( struct filename * name , struct path * path )
2006-01-18 17:43:53 -08:00
{
2015-05-12 17:32:54 -04:00
struct dentry * d ;
2015-05-09 11:19:16 -04:00
struct qstr last ;
2021-07-08 13:34:38 +07:00
int type , error ;
2015-01-22 00:00:03 -05:00
2021-09-07 15:57:42 -04:00
error = filename_parentat ( AT_FDCWD , name , 0 , path , & last , & type ) ;
2021-07-08 13:34:38 +07:00
if ( error )
return ERR_PTR ( error ) ;
2015-05-12 17:32:54 -04:00
if ( unlikely ( type != LAST_NORM ) ) {
2015-05-09 11:19:16 -04:00
path_put ( path ) ;
2015-05-12 17:32:54 -04:00
return ERR_PTR ( - EINVAL ) ;
2012-06-15 03:01:42 +04:00
}
2016-01-22 15:40:57 -05:00
inode_lock_nested ( path->dentry->d_inode , I_MUTEX_PARENT ) ;
2015-05-09 11:19:16 -04:00
d = __lookup_hash ( & last , path->dentry , 0 ) ;
2012-06-15 03:01:42 +04:00
if ( IS_ERR ( d ) ) {
2016-01-22 15:40:57 -05:00
inode_unlock ( path->dentry->d_inode ) ;
2015-05-09 11:19:16 -04:00
path_put ( path ) ;
2012-06-15 03:01:42 +04:00
}
return d ;
2006-01-18 17:43:53 -08:00
}
2021-09-01 10:51:41 -07:00
struct dentry * kern_path_locked ( const char * name , struct path * path )
{
struct filename * filename = getname_kernel ( name ) ;
struct dentry * res = __kern_path_locked ( filename , path ) ;
putname ( filename ) ;
return res ;
}
2008-08-02 00:49:18 -04:00
int kern_path ( const char * name , unsigned int flags , struct path * path )
{
2021-09-01 10:51:42 -07:00
struct filename * filename = getname_kernel ( name ) ;
int ret = filename_lookup ( AT_FDCWD , filename , flags , path , NULL ) ;
putname ( filename ) ;
return ret ;
2008-08-02 00:49:18 -04:00
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( kern_path ) ;
2008-08-02 00:49:18 -04:00
fs: introduce vfs_path_lookup
Stackable file systems, among others, frequently need to lookup paths or
path components starting from an arbitrary point in the namespace
(identified by a dentry and a vfsmount). Currently, such file systems use
lookup_one_len, which is frowned upon [1] as it does not pass the lookup
intent along; not passing a lookup intent, for example, can trigger BUG_ON's
when stacking on top of NFSv4.
The first patch introduces a new lookup function to allow lookup starting
from an arbitrary point in the namespace. This approach has been suggested
by Christoph Hellwig [2].
The second patch changes sunrpc to use vfs_path_lookup.
The third patch changes nfsctl.c to use vfs_path_lookup.
The fourth patch marks link_path_walk static.
The fifth, and last patch, unexports path_walk because it is no longer
necessary to call it directly, and using the new vfs_path_lookup is
cleaner.
For example, the following snippet of code looks up "some/path/component"
in a directory pointed to by parent_{dentry,vfsmnt}:
err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
"some/path/component", 0, &nd);
if (!err) {
/* exists */
...
/* once done, release the references */
path_release(&nd);
} else if (err == -ENOENT) {
/* doesn't exist */
} else {
/* other error */
}
VFS functions such as lookup_create can be used on the nameidata structure
to pass the create intent to the file system.
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 01:48:18 -07:00
/**
* vfs_path_lookup - lookup a file path relative to a dentry-vfsmount pair
* @dentry: pointer to dentry of the base directory
* @mnt: pointer to vfs mount of the base directory
* @name: pointer to file name
* @flags: lookup flags
2011-06-27 17:00:37 -04:00
* @path: pointer to struct path to fill
*/
int vfs_path_lookup ( struct dentry * dentry , struct vfsmount * mnt ,
const char * name , unsigned int flags ,
2011-06-27 17:00:37 -04:00
struct path * path )
{
2021-09-01 10:51:42 -07:00
struct filename * filename ;
2015-05-12 16:44:39 -04:00
struct path root = { . mnt = mnt , . dentry = dentry } ;
2021-09-01 10:51:42 -07:00
int ret ;
filename = getname_kernel ( name ) ;
2015-05-12 16:44:39 -04:00
/* the first argument of filename_lookup() is ignored with root */
2021-09-01 10:51:42 -07:00
ret = filename_lookup ( AT_FDCWD , filename , flags , path , & root ) ;
putname ( filename ) ;
return ret ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_path_lookup ) ;
2021-07-27 12:48:40 +02:00
static int lookup_one_common ( struct user_namespace * mnt_userns ,
const char * name , struct dentry * base , int len ,
struct qstr * this )
2007-04-26 00:12:05 -07:00
{
2018-04-06 16:32:38 -04:00
this->name = name ;
this->len = len ;
this->hash = full_name_hash ( base , name , len ) ;
2011-03-07 23:49:20 -05:00
if ( ! len )
2018-04-06 16:32:38 -04:00
return -EACCES ;
2011-03-07 23:49:20 -05:00
2012-11-29 22:17:21 -05:00
if ( unlikely ( name [ 0 ] == '.' ) ) {
if ( len < 2 || ( len == 2 && name [ 1 ] == '.' ) )
2018-04-06 16:32:38 -04:00
return -EACCES ;
2012-11-29 22:17:21 -05:00
}
2011-03-07 23:49:20 -05:00
while ( len -- ) {
2018-04-06 16:32:38 -04:00
unsigned int c = * ( const unsigned char * ) name ++ ;
2011-03-07 23:49:20 -05:00
if ( c == '/' || c == '\0' )
2018-04-06 16:32:38 -04:00
return -EACCES ;
2011-03-07 23:49:20 -05:00
}
2011-03-08 14:17:44 -05:00
/*
* See if the low-level filesystem might want
* to use its own hash..
*/
if ( base->d_flags & DCACHE_OP_HASH ) {
2018-04-06 16:32:38 -04:00
int err = base->d_op->d_hash ( base , this ) ;
2011-03-08 14:17:44 -05:00
if ( err < 0 )
2018-04-06 16:32:38 -04:00
return err ;
2011-03-08 14:17:44 -05:00
}
2007-10-16 23:25:38 -07:00
2021-07-27 12:48:40 +02:00
return inode_permission ( mnt_userns , base->d_inode , MAY_EXEC ) ;
2018-04-06 16:32:38 -04:00
}
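The name checks performed by lookup_one_common() above can be exercised in isolation with a user-space sketch. `check_component` is a hypothetical name; it reproduces only the validation rules (empty names, "." and "..", and any '/' or NUL byte within the given length are refused) and leaves out the hashing and permission steps.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Return 0 if name[0..len) is usable as a single path component,
 * or -EACCES if it is not. */
static int check_component(const char *name, size_t len)
{
	if (len == 0)
		return -EACCES;

	/* "." and ".." are refused; other dot-names are fine */
	if (name[0] == '.') {
		if (len < 2 || (len == 2 && name[1] == '.'))
			return -EACCES;
	}

	/* a component may contain neither a separator nor a NUL byte */
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)name[i];

		if (c == '/' || c == '\0')
			return -EACCES;
	}
	return 0;
}
```

Because the length is passed explicitly, a NUL byte inside the counted range is treated as an error instead of silently shortening the name.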
2018-06-15 15:19:22 +01:00
/**
* try_lookup_one_len - filesystem helper to lookup single pathname component
* @name: pathname component to lookup
* @base: base directory to lookup from
* @len: maximum length @len should be interpreted to
*
* Look up a dentry by name in the dcache, returning NULL if it does not
* currently exist. The function does not try to create a dentry.
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code.
*
* The caller must hold base->i_mutex.
*/
struct dentry * try_lookup_one_len ( const char * name , struct dentry * base , int len )
{
struct qstr this ;
int err ;
WARN_ON_ONCE ( ! inode_is_locked ( base->d_inode ) ) ;
2021-07-27 12:48:40 +02:00
err = lookup_one_common ( & init_user_ns , name , base , len , & this ) ;
2018-06-15 15:19:22 +01:00
if ( err )
return ERR_PTR ( err ) ;
return lookup_dcache ( & this , base , 0 ) ;
}
EXPORT_SYMBOL ( try_lookup_one_len ) ;
2018-04-06 16:32:38 -04:00
/**
* lookup_one_len - filesystem helper to lookup single pathname component
* @name: pathname component to lookup
* @base: base directory to lookup from
* @len: maximum length @len should be interpreted to
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code.
*
* The caller must hold base->i_mutex.
*/
struct dentry * lookup_one_len ( const char * name , struct dentry * base , int len )
{
2018-04-06 16:45:33 -04:00
struct dentry * dentry ;
2018-04-06 16:32:38 -04:00
struct qstr this ;
int err ;
WARN_ON_ONCE ( ! inode_is_locked ( base->d_inode ) ) ;
2021-07-27 12:48:40 +02:00
err = lookup_one_common ( & init_user_ns , name , base , len , & this ) ;
2012-03-26 12:54:21 +02:00
if ( err )
return ERR_PTR ( err ) ;
2018-04-06 16:45:33 -04:00
dentry = lookup_dcache ( & this , base , 0 ) ;
return dentry ? dentry : __lookup_slow ( & this , base , 0 ) ;
2007-04-26 00:12:05 -07:00
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( lookup_one_len ) ;
2007-04-26 00:12:05 -07:00
2021-07-27 12:48:40 +02:00
/**
* lookup_one - filesystem helper to lookup single pathname component
* @mnt_userns: user namespace of the mount the lookup is performed from
* @name: pathname component to lookup
* @base: base directory to lookup from
* @len: maximum length @len should be interpreted to
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code.
*
* The caller must hold base->i_mutex.
*/
struct dentry * lookup_one ( struct user_namespace * mnt_userns , const char * name ,
struct dentry * base , int len )
{
struct dentry * dentry ;
struct qstr this ;
int err ;
WARN_ON_ONCE ( ! inode_is_locked ( base->d_inode ) ) ;
err = lookup_one_common ( mnt_userns , name , base , len , & this ) ;
if ( err )
return ERR_PTR ( err ) ;
dentry = lookup_dcache ( & this , base , 0 ) ;
return dentry ? dentry : __lookup_slow ( & this , base , 0 ) ;
}
EXPORT_SYMBOL ( lookup_one ) ;
2016-01-07 16:08:20 -05:00
/**
2022-04-04 12:51:40 +02:00
* lookup_one_unlocked - filesystem helper to lookup single pathname component
* @mnt_userns: idmapping of the mount the lookup is performed from
2016-01-07 16:08:20 -05:00
* @name: pathname component to lookup
* @base: base directory to lookup from
* @len: maximum length @len should be interpreted to
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code.
*
* Unlike lookup_one_len, it should be called without the parent
* i_mutex held, and will take the i_mutex itself if necessary.
*/
2022-04-04 12:51:40 +02:00
struct dentry * lookup_one_unlocked ( struct user_namespace * mnt_userns ,
const char * name , struct dentry * base ,
int len )
2016-01-07 16:08:20 -05:00
{
struct qstr this ;
int err ;
2016-07-29 12:17:52 -07:00
struct dentry * ret ;
2016-01-07 16:08:20 -05:00
2022-04-04 12:51:40 +02:00
err = lookup_one_common ( mnt_userns , name , base , len , & this ) ;
2016-01-07 16:08:20 -05:00
if ( err )
return ERR_PTR ( err ) ;
2016-07-29 12:17:52 -07:00
ret = lookup_dcache ( & this , base , 0 ) ;
if ( ! ret )
ret = lookup_slow ( & this , base , 0 ) ;
return ret ;
2016-01-07 16:08:20 -05:00
}
2022-04-04 12:51:40 +02:00
EXPORT_SYMBOL ( lookup_one_unlocked ) ;
/**
* lookup_one_positive_unlocked - filesystem helper to lookup single
* pathname component
* @mnt_userns: idmapping of the mount the lookup is performed from
* @name: pathname component to lookup
* @base: base directory to lookup from
* @len: maximum length @len should be interpreted to
*
* This helper will yield ERR_PTR(-ENOENT) on negatives. The helper returns
* known positive or ERR_PTR(). This is what most of the users want.
*
* Note that pinned negative with unlocked parent _can_ become positive at any
* time, so callers of lookup_one_unlocked() need to be very careful; pinned
* positives have ->d_inode stable, so this one avoids such problems.
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code.
*
* The helper should be called without i_mutex held.
*/
struct dentry * lookup_one_positive_unlocked ( struct user_namespace * mnt_userns ,
const char * name ,
struct dentry * base , int len )
{
struct dentry * ret = lookup_one_unlocked ( mnt_userns , name , base , len ) ;
if ( ! IS_ERR ( ret ) && d_flags_negative ( smp_load_acquire ( & ret->d_flags ) ) ) {
dput ( ret ) ;
ret = ERR_PTR ( -ENOENT ) ;
}
return ret ;
}
EXPORT_SYMBOL ( lookup_one_positive_unlocked ) ;
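The "known positive or ERR_PTR()" convention used by the helper above relies on encoding a small negative errno in a pointer value. The sketch below imitates that idea in user space. The `sk_` helpers and `struct obj` are hypothetical stand-ins, the 4095 limit is an assumption of the sketch, and the reference drop that the real helper performs with dput() is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095	/* assumed size of the reserved error range */

/* Encode a negative errno as a pointer in the top page of the address
 * space, where no valid object is expected to live. */
static inline void *sk_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sk_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool sk_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SK_MAX_ERRNO;
}

/* Stand-in for a dentry: "negative" means the name has no inode. */
struct obj {
	bool negative;
};

/* Pass errors and positive objects through; turn negatives into -ENOENT. */
static struct obj *positive_or_enoent(struct obj *o)
{
	if (!sk_is_err(o) && o->negative)
		return sk_err_ptr(-ENOENT);
	return o;
}
```

Callers then need a single check, `sk_is_err()`, instead of separately testing for an error and for a negative result.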
/**
* lookup_one_len_unlocked - filesystem helper to lookup single pathname component
* @ name : pathname component to lookup
* @ base : base directory to lookup from
* @ len : maximum length @ len should be interpreted to
*
* Note that this routine is purely a helper for filesystem usage and should
* not be called by generic code .
*
* Unlike lookup_one_len , it should be called without the parent
* i_mutex held , and will take the i_mutex itself if necessary .
*/
struct dentry * lookup_one_len_unlocked ( const char * name ,
struct dentry * base , int len )
{
return lookup_one_unlocked ( & init_user_ns , name , base , len ) ;
}
2016-01-07 16:08:20 -05:00
EXPORT_SYMBOL ( lookup_one_len_unlocked ) ;
2019-10-31 01:21:58 -04:00
/*
* Like lookup_one_len_unlocked ( ) , except that it yields ERR_PTR ( - ENOENT )
* on negatives . Returns known positive or ERR_PTR ( ) ; that ' s what
* most of the users want . Note that pinned negative with unlocked parent
* _can_ become positive at any time , so callers of lookup_one_len_unlocked ( )
* need to be very careful ; pinned positives have - > d_inode stable , so
* this one avoids such problems .
*/
struct dentry * lookup_positive_unlocked ( const char * name ,
struct dentry * base , int len )
{
2022-04-04 12:51:40 +02:00
return lookup_one_positive_unlocked ( & init_user_ns , name , base , len ) ;
2019-10-31 01:21:58 -04:00
}
EXPORT_SYMBOL ( lookup_positive_unlocked ) ;
devpts: Make each mount of devpts an independent filesystem.
The /dev/ptmx device node is changed to lookup the directory entry "pts"
in the same directory as the /dev/ptmx device node was opened in. If
there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
uses that filesystem. Otherwise the open of /dev/ptmx fails.
The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
userspace can now safely depend on each mount of devpts creating a new
instance of the filesystem.
Each mount of devpts is now a separate and equal filesystem.
Reserved ttys are now available to all instances of devpts where the
mounter is in the initial mount namespace.
A new vfs helper path_pts is introduced that finds a directory entry
named "pts" in the directory of the passed in path, and changes the
passed in path to point to it. The helper path_pts uses a function
path_parent_directory that was factored out of follow_dotdot.
In the implementation of devpts:
- devpts_mnt is killed as it is no longer meaningful if all mounts of
devpts are equal.
- pts_sb_from_inode is replaced by just inode->i_sb as all cached
inodes in the tty layer are now from the devpts filesystem.
- devpts_add_ref is rolled into the new function devpts_ptmx. And the
unnecessary inode hold is removed.
- devpts_del_ref is renamed devpts_release and reduced to just a
deactivate_super.
- The newinstance mount option continues to be accepted but is now
ignored.
In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
they are never used.
Documentation/filesystems/devices.txt is updated to describe the current
situation.
This has been verified to work properly on openwrt-15.05, centos5,
centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
ubuntu-15.10, fedora23, mageia-5, mint-17.3, opensuse-42.1,
slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the
caveat that on centos6 and on slackware-14.1 there wind up being
two instances of the devpts filesystem mounted on /dev/pts; the lower
copy does not end up getting used.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg KH <greg@kroah.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-02 10:29:47 -05:00
# ifdef CONFIG_UNIX98_PTYS
int path_pts ( struct path * path )
{
/* Find something mounted on "pts" in the same directory as
* the input path .
*/
2020-03-11 13:05:03 -04:00
struct dentry * parent = dget_parent ( path - > dentry ) ;
struct dentry * child ;
2020-02-26 20:09:37 -05:00
struct qstr this = QSTR_INIT ( " pts " , 3 ) ;
2020-03-11 13:05:03 -04:00
if ( unlikely ( ! path_connected ( path - > mnt , parent ) ) ) {
dput ( parent ) ;
2020-02-24 16:01:19 -05:00
return - ENOENT ;
2020-03-11 13:05:03 -04:00
}
2020-02-24 16:01:19 -05:00
dput ( path - > dentry ) ;
path - > dentry = parent ;
child = d_hash_and_lookup ( parent , & this ) ;
if ( IS_ERR_OR_NULL ( child ) )
return - ENOENT ;
path - > dentry = child ;
dput ( parent ) ;
2020-02-26 20:09:37 -05:00
follow_down ( path ) ;
return 0 ;
}
# endif
2011-11-02 09:44:39 +01:00
int user_path_at_empty ( int dfd , const char __user * name , unsigned flags ,
struct path * path , int * empty )
2005-04-16 15:20:36 -07:00
{
2021-09-01 10:51:42 -07:00
struct filename * filename = getname_flags ( name , flags , empty ) ;
int ret = filename_lookup ( dfd , filename , flags , path , NULL ) ;
putname ( filename ) ;
return ret ;
2005-04-16 15:20:36 -07:00
}
2015-05-13 09:12:02 -04:00
EXPORT_SYMBOL ( user_path_at_empty ) ;
2011-11-02 09:44:39 +01:00
2021-01-21 14:19:31 +01:00
int __check_sticky ( struct user_namespace * mnt_userns , struct inode * dir ,
struct inode * inode )
2005-04-16 15:20:36 -07:00
{
2012-03-03 21:17:15 -08:00
kuid_t fsuid = current_fsuid ( ) ;
2008-11-14 10:39:05 +11:00
2021-01-21 14:19:31 +01:00
if ( uid_eq ( i_uid_into_mnt ( mnt_userns , inode ) , fsuid ) )
2005-04-16 15:20:36 -07:00
return 0 ;
2021-01-21 14:19:31 +01:00
if ( uid_eq ( i_uid_into_mnt ( mnt_userns , dir ) , fsuid ) )
2005-04-16 15:20:36 -07:00
return 0 ;
2021-01-21 14:19:31 +01:00
return ! capable_wrt_inode_uidgid ( mnt_userns , inode , CAP_FOWNER ) ;
2005-04-16 15:20:36 -07:00
}
2014-10-24 00:14:36 +02:00
EXPORT_SYMBOL ( __check_sticky ) ;
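The ownership test in __check_sticky ( ) reduces to three questions asked in order. A minimal userspace sketch follows, assuming plain integer uids in place of idmapped kuid_t values and a boolean in place of the CAP_FOWNER check; all `sk_` names are hypothetical. The outer wrapper that first tests the directory's sticky bit is not shown in this section, so its shape here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical caller credentials: only what the sketch needs. */
struct sk_cred {
	unsigned int fsuid;	/* models current_fsuid() */
	bool cap_fowner;	/* models the CAP_FOWNER capability check */
};

/*
 * Returns 0 when removal is allowed and non-zero when the sticky bit
 * forbids it, mirroring the order of the checks above.
 */
static int sk_check_sticky_owner(const struct sk_cred *c,
				 unsigned int dir_uid,
				 unsigned int inode_uid)
{
	if (inode_uid == c->fsuid)	/* caller owns the victim */
		return 0;
	if (dir_uid == c->fsuid)	/* caller owns the directory */
		return 0;
	return !c->cap_fowner;		/* otherwise only privilege helps */
}

/* Assumed wrapper: the ownership test only matters on sticky directories. */
static int sk_check_sticky(const struct sk_cred *c, bool dir_sticky,
			   unsigned int dir_uid, unsigned int inode_uid)
{
	if (!dir_sticky)
		return 0;
	return sk_check_sticky_owner(c, dir_uid, inode_uid);
}
```

In the sketch, as in the function above, a non-zero result means "deny"; may_delete ( ) turns that into - EPERM.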
2005-04-16 15:20:36 -07:00
/*
* Check whether we can remove a link victim from directory dir , check
* whether the type of victim is right .
* 1. We can ' t do it if dir is read - only ( done in permission ( ) )
* 2. We should have write and exec permissions on dir
* 3. We can ' t remove anything from append - only dir
* 4. We can ' t do anything with immutable dir ( done in permission ( ) )
* 5. If the sticky bit on dir is set we should either
* a . be owner of dir , or
* b . be owner of victim , or
* c . have CAP_FOWNER capability
* 6. If the victim is append - only or immutable we can ' t do anything with
* links pointing to it .
2016-06-29 14:54:46 -05:00
* 7. If the victim has an unknown uid or gid we can ' t change the inode .
* 8. If we were asked to remove a directory and victim isn ' t one - ENOTDIR .
* 9. If we were asked to remove a non - directory and victim is a directory - EISDIR .
* 10. We can ' t remove a root or mountpoint .
* 11. We don ' t allow removal of NFS sillyrenamed files ; it ' s handled by
2005-04-16 15:20:36 -07:00
* nfs_async_unlink ( ) .
*/
2021-01-21 14:19:31 +01:00
static int may_delete ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * victim , bool isdir )
2005-04-16 15:20:36 -07:00
{
2015-05-06 15:59:00 +01:00
struct inode * inode = d_backing_inode ( victim ) ;
2005-04-16 15:20:36 -07:00
int error ;
2013-09-12 19:22:53 +01:00
if ( d_is_negative ( victim ) )
2005-04-16 15:20:36 -07:00
return - ENOENT ;
2013-09-12 19:22:53 +01:00
BUG_ON ( ! inode ) ;
2005-04-16 15:20:36 -07:00
BUG_ON ( victim - > d_parent - > d_inode ! = dir ) ;
2017-09-14 12:07:32 -05:00
/* Inode writeback is not safe when the uid or gid are invalid. */
2021-01-21 14:19:31 +01:00
if ( ! uid_valid ( i_uid_into_mnt ( mnt_userns , inode ) ) | |
! gid_valid ( i_gid_into_mnt ( mnt_userns , inode ) ) )
2017-09-14 12:07:32 -05:00
return - EOVERFLOW ;
2012-10-10 15:25:25 -04:00
audit_inode_child ( dir , victim , AUDIT_TYPE_CHILD_DELETE ) ;
2005-04-16 15:20:36 -07:00
2021-01-21 14:19:31 +01:00
error = inode_permission ( mnt_userns , dir , MAY_WRITE | MAY_EXEC ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
if ( IS_APPEND ( dir ) )
return - EPERM ;
2013-09-12 19:22:53 +01:00
2021-01-21 14:19:31 +01:00
if ( check_sticky ( mnt_userns , dir , inode ) | | IS_APPEND ( inode ) | |
IS_IMMUTABLE ( inode ) | | IS_SWAPFILE ( inode ) | |
HAS_UNMAPPED_ID ( mnt_userns , inode ) )
2005-04-16 15:20:36 -07:00
return - EPERM ;
if ( isdir ) {
2014-04-01 17:08:41 +02:00
if ( ! d_is_dir ( victim ) )
2005-04-16 15:20:36 -07:00
return - ENOTDIR ;
if ( IS_ROOT ( victim ) )
return - EBUSY ;
2014-04-01 17:08:41 +02:00
} else if ( d_is_dir ( victim ) )
2005-04-16 15:20:36 -07:00
return - EISDIR ;
if ( IS_DEADDIR ( dir ) )
return - ENOENT ;
if ( victim - > d_flags & DCACHE_NFSFS_RENAMED )
return - EBUSY ;
return 0 ;
}
/* Check whether we can create an object with dentry child in directory
* dir .
* 1. We can ' t do it if child already exists ( open has special treatment for
* this case , but since we are inlined it ' s OK )
* 2. We can ' t do it if dir is read - only ( done in permission ( ) )
2016-07-01 12:52:06 -05:00
* 3. We can ' t do it if the fs can ' t represent the fsuid or fsgid .
* 4. We should have write and exec permissions on dir
* 5. We can ' t do it if dir is immutable ( done in permission ( ) )
2005-04-16 15:20:36 -07:00
*/
2021-01-21 14:19:31 +01:00
static inline int may_create ( struct user_namespace * mnt_userns ,
struct inode * dir , struct dentry * child )
2005-04-16 15:20:36 -07:00
{
2013-05-08 10:25:58 -04:00
audit_inode_child ( dir , child , AUDIT_TYPE_CHILD_CREATE ) ;
2005-04-16 15:20:36 -07:00
if ( child - > d_inode )
return - EEXIST ;
if ( IS_DEADDIR ( dir ) )
return - ENOENT ;
2021-03-20 13:26:23 +01:00
if ( ! fsuidgid_has_mapping ( dir - > i_sb , mnt_userns ) )
2016-07-01 12:52:06 -05:00
return - EOVERFLOW ;
2021-03-20 13:26:23 +01:00
2021-01-21 14:19:31 +01:00
return inode_permission ( mnt_userns , dir , MAY_WRITE | MAY_EXEC ) ;
2005-04-16 15:20:36 -07:00
}
/*
* p1 and p2 should be directories on the same fs .
*/
struct dentry * lock_rename ( struct dentry * p1 , struct dentry * p2 )
{
struct dentry * p ;
if ( p1 = = p2 ) {
2016-01-22 15:40:57 -05:00
inode_lock_nested ( p1 - > d_inode , I_MUTEX_PARENT ) ;
2005-04-16 15:20:36 -07:00
return NULL ;
}
2016-04-10 01:33:30 -04:00
mutex_lock ( & p1 - > d_sb - > s_vfs_rename_mutex ) ;
2005-04-16 15:20:36 -07:00
2008-10-16 07:50:28 +09:00
p = d_ancestor ( p2 , p1 ) ;
if ( p ) {
2016-01-22 15:40:57 -05:00
inode_lock_nested ( p2 - > d_inode , I_MUTEX_PARENT ) ;
inode_lock_nested ( p1 - > d_inode , I_MUTEX_CHILD ) ;
2008-10-16 07:50:28 +09:00
return p ;
2005-04-16 15:20:36 -07:00
}
2008-10-16 07:50:28 +09:00
p = d_ancestor ( p1 , p2 ) ;
if ( p ) {
2016-01-22 15:40:57 -05:00
inode_lock_nested ( p1 - > d_inode , I_MUTEX_PARENT ) ;
inode_lock_nested ( p2 - > d_inode , I_MUTEX_CHILD ) ;
2008-10-16 07:50:28 +09:00
return p ;
2005-04-16 15:20:36 -07:00
}
2016-01-22 15:40:57 -05:00
inode_lock_nested ( p1 - > d_inode , I_MUTEX_PARENT ) ;
inode_lock_nested ( p2 - > d_inode , I_MUTEX_PARENT2 ) ;
2005-04-16 15:20:36 -07:00
return NULL ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( lock_rename ) ;
2005-04-16 15:20:36 -07:00
void unlock_rename ( struct dentry * p1 , struct dentry * p2 )
{
2016-01-22 15:40:57 -05:00
inode_unlock ( p1 - > d_inode ) ;
2005-04-16 15:20:36 -07:00
if ( p1 ! = p2 ) {
2016-01-22 15:40:57 -05:00
inode_unlock ( p2 - > d_inode ) ;
2016-04-10 01:33:30 -04:00
mutex_unlock ( & p1 - > d_sb - > s_vfs_rename_mutex ) ;
2005-04-16 15:20:36 -07:00
}
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( unlock_rename ) ;
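The lock ordering chosen by lock_rename ( ) can be shown on a toy directory tree. This is a userspace sketch under stated assumptions: `struct sk_dir`, `sk_ancestor ( )` and `sk_lock_rename_order ( )` are hypothetical names, the root is modelled as a node that is its own parent, and no real locks are taken; the function only reports which directory would be locked first and which dentry would be returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical directory node; the root points to itself. */
struct sk_dir {
	struct sk_dir *parent;
	const char *name;
};

/*
 * If p1 is a proper ancestor of p2, return the child of p1 that lies on
 * the path down to p2; otherwise return NULL.
 */
static struct sk_dir *sk_ancestor(struct sk_dir *p1, struct sk_dir *p2)
{
	struct sk_dir *p;

	for (p = p2; p->parent != p; p = p->parent)
		if (p->parent == p1)
			return p;
	return NULL;
}

/* Which directory is locked first, which second, and what is returned. */
struct sk_lock_order {
	struct sk_dir *first;
	struct sk_dir *second;	/* NULL when both arguments are the same */
	struct sk_dir *trap;	/* models the return value of lock_rename() */
};

static struct sk_lock_order sk_lock_rename_order(struct sk_dir *p1,
						 struct sk_dir *p2)
{
	struct sk_dir *p;

	if (p1 == p2)
		return (struct sk_lock_order){ p1, NULL, NULL };

	p = sk_ancestor(p2, p1);	/* p2 above p1: lock p2 first */
	if (p)
		return (struct sk_lock_order){ p2, p1, p };

	p = sk_ancestor(p1, p2);	/* p1 above p2: lock p1 first */
	if (p)
		return (struct sk_lock_order){ p1, p2, p };

	return (struct sk_lock_order){ p1, p2, NULL };	/* unrelated */
}
```

The point of the ordering is that an ancestor is always locked before its descendant, so two concurrent renames on related directories cannot take the two locks in opposite orders.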
2005-04-16 15:20:36 -07:00
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then umask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created, and all relevant
filesystems have to call posix_acl_create() before inode_init_owner(),
where we currently take care of S_ISGID handling, S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner(), meaning
that there are additional ordering issues that influence S_ISGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly, it's unsustainably messy, especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_ISGID stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This way S_ISGID handling is
unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs that have callchains like:
sys_mknod()
-> do_mknodat(mode)
-> .mknod = ovl_mknod(mode)
-> ovl_create(mode)
-> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is a VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therefore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolume relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
IOW, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allow S_ISGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directory's group
unconditionally. In these cases none of the filesystems call
inode_init_owner() and thus never stripped the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-14 14:11:27 +08:00
/**
* mode_strip_umask - handle vfs umask stripping
* @ dir : parent directory of the new inode
* @ mode : mode of the new inode to be created in @ dir
*
* Umask stripping depends on whether or not the filesystem supports POSIX
* ACLs . If the filesystem doesn ' t support it umask stripping is done directly
* in here . If the filesystem does support POSIX ACLs umask stripping is
* deferred until the filesystem calls posix_acl_create ( ) .
*
* Returns : mode
*/
static inline umode_t mode_strip_umask ( const struct inode * dir , umode_t mode )
{
if ( ! IS_POSIXACL ( dir ) )
mode & = ~ current_umask ( ) ;
return mode ;
}
/**
* vfs_prepare_mode - prepare the mode to be used for a new inode
* @ mnt_userns : user namespace of the mount the inode was found from
* @ dir : parent directory of the new inode
* @ mode : mode of the new inode
* @ mask_perms : allowed permission by the vfs
* @ type : type of file to be created
*
* This helper consolidates and enforces vfs restrictions on the @ mode of a new
* object to be created .
*
* Umask stripping depends on whether the filesystem supports POSIX ACLs ( see
* the kernel documentation for mode_strip_umask ( ) ) . Moving umask stripping
* after setgid stripping allows the same ordering for both non - POSIX ACL and
* POSIX ACL supporting filesystems .
*
* Note that it ' s currently valid for @ type to be 0 if a directory is created .
* Filesystems raise that flag individually and we need to check whether each
* filesystem can deal with receiving S_IFDIR from the vfs before we enforce a
* non - zero type .
*
* Returns : mode to be passed to the filesystem
*/
static inline umode_t vfs_prepare_mode ( struct user_namespace * mnt_userns ,
const struct inode * dir , umode_t mode ,
umode_t mask_perms , umode_t type )
{
mode = mode_strip_sgid ( mnt_userns , dir , mode ) ;
mode = mode_strip_umask ( dir , mode ) ;
/*
* Apply the vfs mandated allowed permission mask and set the type of
* file to be created before we call into the filesystem .
*/
mode & = ( mask_perms & ~ S_IFMT ) ;
mode | = ( type & S_IFMT ) ;
return mode ;
}
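The effect of doing setgid stripping before umask stripping can be checked with a small userspace model of vfs_prepare_mode ( ). This is a sketch under stated assumptions, not the kernel implementation: all `sk_` names are hypothetical, the mode bits are local copies of the usual Linux octal values, and the body of `sk_strip_sgid ( )` is an assumed reading of the rule in the commit message (strip S_ISGID from a non-directory with S_IXGRP created in a setgid directory unless the caller is in the directory's group or holds CAP_FSETID), since mode_strip_sgid ( ) itself is not shown in this section.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sk_mode_t;

/* Local copies of the usual Linux mode bits (assumed values). */
#define SK_IFMT   0170000u
#define SK_IFDIR  0040000u
#define SK_IFREG  0100000u
#define SK_ISGID  0002000u
#define SK_IXGRP  0000010u

/* Hypothetical stand-in for the directory and caller state. */
struct sk_create_ctx {
	sk_mode_t dir_mode;	/* mode of the parent directory */
	bool dir_has_acl;	/* models IS_POSIXACL(dir) */
	bool caller_in_group;	/* caller is in the directory's group */
	bool caller_fsetid;	/* caller holds CAP_FSETID over dir */
	sk_mode_t umask;	/* models current_umask() */
};

/* Assumed setgid rule, as described in the commit message. */
static sk_mode_t sk_strip_sgid(const struct sk_create_ctx *c, sk_mode_t mode)
{
	if ((mode & (SK_ISGID | SK_IXGRP)) != (SK_ISGID | SK_IXGRP))
		return mode;
	if ((mode & SK_IFMT) == SK_IFDIR || !(c->dir_mode & SK_ISGID))
		return mode;
	if (c->caller_in_group || c->caller_fsetid)
		return mode;
	return mode & ~SK_ISGID;
}

/* Umask is applied here only when the filesystem has no POSIX ACLs. */
static sk_mode_t sk_strip_umask(const struct sk_create_ctx *c, sk_mode_t mode)
{
	if (!c->dir_has_acl)
		mode &= ~c->umask;
	return mode;
}

/* Same order of steps as vfs_prepare_mode() above. */
static sk_mode_t sk_prepare_mode(const struct sk_create_ctx *c, sk_mode_t mode,
				 sk_mode_t mask_perms, sk_mode_t type)
{
	mode = sk_strip_sgid(c, mode);
	mode = sk_strip_umask(c, mode);
	mode &= (mask_perms & ~SK_IFMT);
	mode |= (type & SK_IFMT);
	return mode;
}
```

With a umask of 0010 and an unprivileged caller, the sketch drops S_ISGID from a requested 02775 before the umask clears S_IXGRP. If the two steps ran in the opposite order, the setgid test would no longer see S_IXGRP and the bit would survive, which is the order dependence the commit message describes.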
2021-01-21 14:19:33 +01:00
/**
* vfs_create - create new file
* @ mnt_userns : user namespace of the mount the inode was found from
* @ dir : inode of @ dentry
* @ dentry : pointer to dentry of the base directory
* @ mode : mode of the new file
* @ want_excl : whether the file must not yet exist
*
* Create a new file .
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @ mnt_userns . This function will then take
* care to map the inode according to @ mnt_userns before checking permissions .
* On non - idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns .
*/
int vfs_create ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * dentry , umode_t mode , bool want_excl )
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
int error = may_create ( mnt_userns , dir , dentry ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > create )
2005-04-16 15:20:36 -07:00
return - EACCES ; /* shouldn't it be ENOSYS? */
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If either of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then umask stripping may be
done in the filesystem itself when calling posix_acl_create().
Umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created, and all relevant
filesystems have to call posix_acl_create() before inode_init_owner(),
where we currently take care of S_ISGID handling. S_ISGID handling is
therefore order dependent. IOW, whether or not you get a setgid bit
depends on POSIX ACLs and umask and in what order they are applied.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner(), meaning
that there are additional ordering issues that influence S_ISGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly, it's unsustainably messy, especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy, and while we won't be able to make
it completely clean, as posix_acl_create() is still a filesystem specific
call, we can improve the S_ISGID stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This way S_ISGID handling is
unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. The call needs to live there to make
sure that filesystems like overlayfs that have callchains like:
sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is a VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit, with the exception of
xfs when irix_sgid_inherit mode is turned on, S_ISGID stripping doesn't
apply to them. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therefore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolume relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allow S_ISGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directory's group
unconditionally. In these cases none of the filesystems call
inode_init_owner() and thus never stripped the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-14 14:11:27 +08:00
mode = vfs_prepare_mode ( mnt_userns , dir , mode , S_IALLUGO , S_IFREG ) ;
2005-04-16 15:20:36 -07:00
error = security_inode_create ( dir , dentry , mode ) ;
if ( error )
return error ;
2021-01-21 14:19:43 +01:00
error = dir - > i_op - > create ( mnt_userns , dir , dentry , mode , want_excl ) ;
2005-09-09 13:01:44 -07:00
if ( ! error )
2005-11-03 15:57:06 +00:00
fsnotify_create ( dir , dentry ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_create ) ;
2005-04-16 15:20:36 -07:00
2017-12-01 17:12:45 -05:00
int vfs_mkobj ( struct dentry * dentry , umode_t mode ,
int ( * f ) ( struct dentry * , umode_t , void * ) ,
void * arg )
{
struct inode * dir = dentry - > d_parent - > d_inode ;
2021-01-21 14:19:31 +01:00
int error = may_create ( & init_user_ns , dir , dentry ) ;
2017-12-01 17:12:45 -05:00
if ( error )
return error ;
mode & = S_IALLUGO ;
mode | = S_IFREG ;
error = security_inode_create ( dir , dentry , mode ) ;
if ( error )
return error ;
error = f ( dentry , mode , arg ) ;
if ( ! error )
fsnotify_create ( dir , dentry ) ;
return error ;
}
EXPORT_SYMBOL ( vfs_mkobj ) ;
2016-06-09 15:34:02 -05:00
bool may_open_dev ( const struct path * path )
{
return ! ( path - > mnt - > mnt_flags & MNT_NODEV ) & &
! ( path - > mnt - > mnt_sb - > s_iflags & SB_I_NODEV ) ;
}
2021-01-21 14:19:31 +01:00
static int may_open ( struct user_namespace * mnt_userns , const struct path * path ,
int acc_mode , int flag )
2005-04-16 15:20:36 -07:00
{
2008-10-24 09:58:10 +02:00
struct dentry * dentry = path - > dentry ;
2005-04-16 15:20:36 -07:00
struct inode * inode = dentry - > d_inode ;
int error ;
if ( ! inode )
return - ENOENT ;
2009-01-05 19:27:23 +01:00
switch ( inode - > i_mode & S_IFMT ) {
case S_IFLNK :
2005-04-16 15:20:36 -07:00
return - ELOOP ;
2009-01-05 19:27:23 +01:00
case S_IFDIR :
2020-08-14 17:30:14 -07:00
if ( acc_mode & MAY_WRITE )
2009-01-05 19:27:23 +01:00
return - EISDIR ;
2020-08-14 17:30:14 -07:00
if ( acc_mode & MAY_EXEC )
return - EACCES ;
2009-01-05 19:27:23 +01:00
break ;
case S_IFBLK :
case S_IFCHR :
2016-06-09 15:34:02 -05:00
if ( ! may_open_dev ( path ) )
2005-04-16 15:20:36 -07:00
return - EACCES ;
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
                    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:26 -07:00
fallthrough ;
2009-01-05 19:27:23 +01:00
case S_IFIFO :
case S_IFSOCK :
2020-08-11 18:36:26 -07:00
if ( acc_mode & MAY_EXEC )
return - EACCES ;
2005-04-16 15:20:36 -07:00
flag & = ~ O_TRUNC ;
2009-01-05 19:27:23 +01:00
break ;
exec: move path_noexec() check earlier
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
                    /* new location of MAY_EXEC vs path_noexec() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        security_file_open(f)
                        open()
    /* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:30 -07:00
case S_IFREG :
if ( ( acc_mode & MAY_EXEC ) & & path_noexec ( path ) )
return - EACCES ;
break ;
2008-02-15 14:37:48 -08:00
}
2007-10-16 23:31:14 -07:00
2021-01-21 14:19:31 +01:00
error = inode_permission ( mnt_userns , inode , MAY_OPEN | acc_mode ) ;
2007-10-16 23:31:14 -07:00
if ( error )
return error ;
2009-02-04 09:06:57 -05:00
2005-04-16 15:20:36 -07:00
/*
* An append - only file must be opened in append mode for writing .
*/
if ( IS_APPEND ( inode ) ) {
2009-12-24 06:47:55 -05:00
if ( ( flag & O_ACCMODE ) ! = O_RDONLY & & ! ( flag & O_APPEND ) )
2009-12-16 03:54:00 -05:00
return - EPERM ;
2005-04-16 15:20:36 -07:00
if ( flag & O_TRUNC )
2009-12-16 03:54:00 -05:00
return - EPERM ;
2005-04-16 15:20:36 -07:00
}
/* O_NOATIME can only be set by the owner or superuser */
2021-01-21 14:19:31 +01:00
if ( flag & O_NOATIME & & ! inode_owner_or_capable ( mnt_userns , inode ) )
2009-12-16 03:54:00 -05:00
return - EPERM ;
2005-04-16 15:20:36 -07:00
2011-09-21 10:58:13 -04:00
return 0 ;
2009-12-16 03:54:00 -05:00
}
2005-04-16 15:20:36 -07:00
2021-01-21 14:19:43 +01:00
static int handle_truncate ( struct user_namespace * mnt_userns , struct file * filp )
2009-12-16 03:54:00 -05:00
{
2016-11-20 20:27:12 -05:00
const struct path * path = & filp - > f_path ;
2009-12-16 03:54:00 -05:00
struct inode * inode = path - > dentry - > d_inode ;
int error = get_write_access ( inode ) ;
if ( error )
return error ;
2021-10-26 11:56:45 -04:00
2021-08-19 14:56:38 -04:00
error = security_path_truncate ( path ) ;
2009-12-16 03:54:00 -05:00
if ( ! error ) {
2021-01-21 14:19:43 +01:00
error = do_truncate ( mnt_userns , path - > dentry , 0 ,
2009-12-16 03:54:00 -05:00
ATTR_MTIME | ATTR_CTIME | ATTR_OPEN ,
2010-12-07 16:19:50 -05:00
filp ) ;
2009-12-16 03:54:00 -05:00
}
put_write_access ( inode ) ;
2009-09-04 13:08:46 -04:00
return error ;
2005-04-16 15:20:36 -07:00
}
2008-02-15 14:37:27 -08:00
static inline int open_to_namei_flags ( int flag )
{
2011-06-25 19:15:54 -04:00
if ( ( flag & O_ACCMODE ) = = 3 )
flag - - ;
2008-02-15 14:37:27 -08:00
return flag ;
}
2021-01-21 14:19:31 +01:00
static int may_o_create ( struct user_namespace * mnt_userns ,
const struct path * dir , struct dentry * dentry ,
umode_t mode )
2012-06-05 15:10:17 +02:00
{
int error = security_path_mknod ( dir , dentry , mode , 0 ) ;
if ( error )
return error ;
2021-03-20 13:26:23 +01:00
if ( ! fsuidgid_has_mapping ( dir - > dentry - > d_sb , mnt_userns ) )
2017-01-26 14:33:46 -06:00
return - EOVERFLOW ;
2021-01-21 14:19:31 +01:00
error = inode_permission ( mnt_userns , dir - > dentry - > d_inode ,
2021-01-21 14:19:24 +01:00
MAY_WRITE | MAY_EXEC ) ;
2012-06-05 15:10:17 +02:00
if ( error )
return error ;
return security_inode_create ( dir - > dentry - > d_inode , dentry , mode ) ;
}
2012-06-14 16:13:46 +01:00
/*
* Attempt to atomically look up , create and open a file from a negative
* dentry .
*
* Returns 0 if successful . The file will have been created and attached to
* @ file by the filesystem calling finish_open ( ) .
*
2018-07-09 19:30:20 -04:00
* If the file was looked up only or didn ' t need creating , FMODE_OPENED won ' t
* be set . The caller will need to perform the open themselves . @ path will
* have been updated to point to the new dentry . This may be negative .
2012-06-14 16:13:46 +01:00
*
* Returns an error code otherwise .
*/
2020-01-09 14:12:40 -05:00
static struct dentry * atomic_open ( struct nameidata * nd , struct dentry * dentry ,
struct file * file ,
int open_flag , umode_t mode )
2012-06-05 15:10:17 +02:00
{
2016-04-28 02:03:55 -04:00
struct dentry * const DENTRY_NOT_SET = ( void * ) - 1UL ;
2012-06-05 15:10:17 +02:00
struct inode * dir = nd - > path . dentry - > d_inode ;
int error ;
if ( nd - > flags & LOOKUP_DIRECTORY )
open_flag | = O_DIRECTORY ;
2012-06-22 12:40:19 +04:00
file - > f_path . dentry = DENTRY_NOT_SET ;
file - > f_path . mnt = nd - > path . mnt ;
2016-04-27 14:13:10 -04:00
error = dir - > i_op - > atomic_open ( dir , dentry , file ,
2018-06-08 13:32:02 -04:00
open_to_namei_flags ( open_flag ) , mode ) ;
2016-04-28 11:50:59 -04:00
d_lookup_done ( dentry ) ;
2016-04-28 02:03:55 -04:00
if ( ! error ) {
2018-07-09 19:17:52 -04:00
if ( file - > f_mode & FMODE_OPENED ) {
2020-01-26 09:53:19 -05:00
if ( unlikely ( dentry ! = file - > f_path . dentry ) ) {
dput ( dentry ) ;
dentry = dget ( file - > f_path . dentry ) ;
}
2018-07-09 19:17:52 -04:00
} else if ( WARN_ON ( file - > f_path . dentry = = DENTRY_NOT_SET ) ) {
2012-06-22 12:41:10 +04:00
error = - EIO ;
2013-09-16 19:22:33 -04:00
} else {
2016-04-28 02:03:55 -04:00
if ( file - > f_path . dentry ) {
dput ( dentry ) ;
dentry = file - > f_path . dentry ;
2013-09-16 19:22:33 -04:00
}
2020-01-09 14:12:40 -05:00
if ( unlikely ( d_is_negative ( dentry ) ) )
2016-06-07 21:53:51 -04:00
error = - ENOENT ;
2012-08-15 13:30:12 -07:00
}
2012-06-05 15:10:17 +02:00
}
2020-01-09 14:12:40 -05:00
if ( error ) {
dput ( dentry ) ;
dentry = ERR_PTR ( error ) ;
}
return dentry ;
2012-06-05 15:10:17 +02:00
}
2012-06-05 15:10:15 +02:00
/*
2012-06-14 16:13:46 +01:00
* Look up and maybe create and open the last component .
2012-06-05 15:10:15 +02:00
*
2018-07-09 19:30:20 -04:00
* Must be called with parent locked ( exclusive in O_CREAT case ) .
2012-06-14 16:13:46 +01:00
*
2018-07-09 19:30:20 -04:00
* Returns 0 on success , that is , if
* the file was successfully atomically created ( if necessary ) and opened , or
* the file was not completely opened at this time , though lookups and
* creations were performed .
* These cases are distinguished by the presence of FMODE_OPENED on file - > f_mode .
* In the latter case dentry returned in @ path might be negative if O_CREAT
* hadn ' t been specified .
2012-06-14 16:13:46 +01:00
*
2018-07-09 19:30:20 -04:00
* An error code is returned on failure .
2012-06-05 15:10:15 +02:00
*/
2020-01-09 14:25:14 -05:00
static struct dentry * lookup_open ( struct nameidata * nd , struct file * file ,
const struct open_flags * op ,
bool got_write )
2012-06-05 15:10:15 +02:00
{
2021-01-21 14:19:43 +01:00
struct user_namespace * mnt_userns ;
2012-06-05 15:10:15 +02:00
struct dentry * dir = nd - > path . dentry ;
2012-06-05 15:10:16 +02:00
struct inode * dir_inode = dir - > d_inode ;
2016-04-27 19:14:10 -04:00
int open_flag = op - > open_flag ;
2012-06-05 15:10:15 +02:00
struct dentry * dentry ;
2016-04-27 19:14:10 -04:00
int error , create_error = 0 ;
umode_t mode = op - > mode ;
2016-04-28 11:50:59 -04:00
DECLARE_WAIT_QUEUE_HEAD_ONSTACK ( wq ) ;
2012-06-05 15:10:15 +02:00
2016-04-26 14:17:56 -04:00
if ( unlikely ( IS_DEADDIR ( dir_inode ) ) )
2020-01-09 14:25:14 -05:00
return ERR_PTR ( - ENOENT ) ;
2012-06-05 15:10:15 +02:00
2018-06-08 13:22:02 -04:00
file - > f_mode & = ~ FMODE_CREATED ;
2016-04-28 11:50:59 -04:00
dentry = d_lookup ( dir , & nd - > last ) ;
for ( ; ; ) {
if ( ! dentry ) {
dentry = d_alloc_parallel ( dir , & nd - > last , & wq ) ;
if ( IS_ERR ( dentry ) )
2020-01-09 14:25:14 -05:00
return dentry ;
2016-04-28 11:50:59 -04:00
}
if ( d_in_lookup ( dentry ) )
break ;
2012-06-05 15:10:15 +02:00
2016-04-28 11:50:59 -04:00
error = d_revalidate ( dentry , nd - > flags ) ;
if ( likely ( error > 0 ) )
break ;
if ( error )
goto out_dput ;
d_invalidate ( dentry ) ;
dput ( dentry ) ;
dentry = NULL ;
}
if ( dentry - > d_inode ) {
2016-03-05 20:09:32 -05:00
/* Cached positive dentry: will open in f_op->open */
2020-01-09 14:25:14 -05:00
return dentry ;
2016-03-05 20:09:32 -05:00
}
2012-06-05 15:10:17 +02:00
2016-04-27 19:14:10 -04:00
/*
* Checking write permission is tricky , because we don ' t know if we are
* going to actually need it : O_CREAT opens should work as long as the
* file exists . But checking existence breaks atomicity . The trick is
* to check access and if not granted clear O_CREAT from the flags .
*
* Another problem is returning the " right " error value ( e . g . for an
* O_EXCL open we want to return EEXIST not EROFS ) .
*/
2020-03-12 14:07:27 -04:00
if ( unlikely ( ! got_write ) )
open_flag & = ~ O_TRUNC ;
2021-01-21 14:19:43 +01:00
mnt_userns = mnt_user_ns ( nd - > path . mnt ) ;
2016-04-27 19:14:10 -04:00
if ( open_flag & O_CREAT ) {
2020-03-12 14:07:27 -04:00
if ( open_flag & O_EXCL )
open_flag & = ~ O_TRUNC ;
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then unmask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created and all relevant
filesystems have to call posix_acl_create() before inode_init_owner()
where we currently take care of S_ISGID handling S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_SIGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:
sys_mknod()
-> do_mknodat(mode)
-> .mknod = ovl_mknod(mode)
-> ovl_create(mode)
-> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolum relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allo S_SIGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
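The stripping rule that this change centralizes can be illustrated in userspace. What follows is a minimal sketch under stated assumptions, not the kernel's mode_strip_sgid(): the name model_strip_sgid() and its two boolean parameters are hypothetical stand-ins for the kernel's group-membership and CAP_FSETID checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Userspace model of the setgid-stripping decision. The two booleans
 * replace the kernel's group-membership and CAP_FSETID checks; they are
 * assumptions of this sketch, not kernel interfaces.
 */
static mode_t model_strip_sgid(mode_t dir_mode, mode_t mode,
			       bool caller_in_dir_group,
			       bool caller_has_fsetid)
{
	/* Only requests for both S_ISGID and S_IXGRP are affected. */
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;
	/* Directories keep S_ISGID; a non-setgid parent changes nothing. */
	if (S_ISDIR(mode) || !(dir_mode & S_ISGID))
		return mode;
	/* Sufficiently privileged callers may keep the bit. */
	if (caller_in_dir_group || caller_has_fsetid)
		return mode;
	return mode & (mode_t)~S_ISGID;
}
```

In this model, a caller outside the directory's group and without CAP_FSETID who asks for S_ISGID | 0755 inside a setgid directory gets 0755 back, while the same request from a group member is returned unchanged.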
2022-07-14 14:11:27 +08:00
		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode, mode, mode);
2020-03-12 14:07:27 -04:00
		if (likely(got_write))
2021-01-21 14:19:43 +01:00
			create_error = may_o_create(mnt_userns, &nd->path,
2021-01-21 14:19:31 +01:00
						    dentry, mode);
2020-03-12 14:07:27 -04:00
		else
			create_error = -EROFS;
2012-06-05 15:10:17 +02:00
	}
2020-03-12 14:07:27 -04:00
	if (create_error)
		open_flag &= ~O_CREAT;
2016-04-26 00:02:50 -04:00
	if (dir_inode->i_op->atomic_open) {
2020-03-11 08:07:53 -04:00
		dentry = atomic_open(nd, dentry, file, open_flag, mode);
2020-01-09 14:25:14 -05:00
		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
			dentry = ERR_PTR(create_error);
		return dentry;
2012-06-05 15:10:17 +02:00
	}
2012-06-05 15:10:16 +02:00
2016-04-28 11:50:59 -04:00
	if (d_in_lookup(dentry)) {
2016-04-28 11:19:43 -04:00
		struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry,
							     nd->flags);
2016-04-28 11:50:59 -04:00
		d_lookup_done(dentry);
2016-04-28 11:19:43 -04:00
		if (unlikely(res)) {
			if (IS_ERR(res)) {
				error = PTR_ERR(res);
				goto out_dput;
			}
			dput(dentry);
			dentry = res;
		}
2012-06-05 15:10:16 +02:00
	}
2012-06-05 15:10:15 +02:00
	/* Negative dentry, just create the file */
2016-04-27 19:14:10 -04:00
	if (!dentry->d_inode && (open_flag & O_CREAT)) {
2018-06-08 13:22:02 -04:00
		file->f_mode |= FMODE_CREATED;
2016-04-26 14:17:56 -04:00
		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
		if (!dir_inode->i_op->create) {
			error = -EACCES;
2012-06-05 15:10:15 +02:00
			goto out_dput;
2016-04-26 14:17:56 -04:00
		}
2021-01-21 14:19:43 +01:00
		error = dir_inode->i_op->create(mnt_userns, dir_inode, dentry,
						mode, open_flag & O_EXCL);
2012-06-05 15:10:15 +02:00
		if (error)
			goto out_dput;
	}
2016-04-27 19:14:10 -04:00
	if (unlikely(create_error) && !dentry->d_inode) {
		error = create_error;
		goto out_dput;
2012-06-05 15:10:15 +02:00
	}
2020-01-09 14:25:14 -05:00
	return dentry;
2012-06-05 15:10:15 +02:00
out_dput:
	dput(dentry);
2020-01-09 14:25:14 -05:00
	return ERR_PTR(error);
2012-06-05 15:10:15 +02:00
}
2020-01-26 11:06:21 -05:00
static const char *open_last_lookups(struct nameidata *nd,
2018-06-08 13:43:47 -04:00
		   struct file *file, const struct open_flags *op)
2009-12-24 01:58:28 -05:00
{
2009-12-24 02:12:06 -05:00
	struct dentry *dir = nd->path.dentry;
2011-03-09 00:36:45 -05:00
	int open_flag = op->open_flag;
2012-07-31 00:53:35 +04:00
	bool got_write = false;
2020-01-09 14:25:14 -05:00
	struct dentry *dentry;
2020-01-14 13:34:20 -05:00
	const char *res;
2009-12-26 10:56:19 -05:00
2011-02-23 13:39:45 -05:00
	nd->flags |= op->intent;
2013-06-06 09:12:33 -04:00
	if (nd->last_type != LAST_NORM) {
2020-03-10 21:54:54 -04:00
		if (nd->depth)
			put_link(nd);
2020-03-10 10:19:24 -04:00
		return handle_dots(nd, nd->last_type);
2009-12-26 10:56:19 -05:00
	}
2009-12-26 07:01:01 -05:00
2011-03-09 00:36:45 -05:00
	if (!(open_flag & O_CREAT)) {
2011-03-05 22:58:25 -05:00
		if (nd->last.name[nd->last.len])
			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
		/* we _can_ be in RCU mode here */
2022-07-03 22:20:20 -04:00
		dentry = lookup_fast(nd);
lookup_fast(): take mount traversal into callers
Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
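The three-way return convention described above (error pointer, NULL on a cache miss, valid pointer on a hit) can be sketched in plain C. This is an illustrative userspace model only: the model_* helpers and the fast_outcome enum are hypothetical names that imitate the idea behind the kernel's ERR_PTR(), PTR_ERR() and IS_ERR() macros, and the 4095 limit is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value the model encodes in a pointer (sketch assumption). */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno in the top, never-mapped range of addresses. */
static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

enum fast_outcome { FAST_HIT, FAST_MISS, FAST_ERROR };

/* How a caller would tell the three possible results apart. */
static enum fast_outcome classify_lookup(const void *dentry)
{
	if (model_is_err(dentry))
		return FAST_ERROR;	/* ERR_PTR(-E...) */
	if (!dentry)
		return FAST_MISS;	/* cache miss, take the slow path */
	return FAST_HIT;		/* usable object */
}
```

A caller checks for the error case first, because an encoded error is also a non-NULL pointer; only then does NULL distinguish a miss from a hit.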
2020-01-09 14:58:31 -05:00
		if (IS_ERR(dentry))
2020-01-14 10:13:40 -05:00
			return ERR_CAST(dentry);
2020-01-09 14:58:31 -05:00
		if (likely(dentry))
2012-06-05 15:10:14 +02:00
			goto finish_lookup;
2016-03-05 18:14:03 -05:00
		BUG_ON(nd->flags & LOOKUP_RCU);
2012-06-05 15:10:13 +02:00
	} else {
		/* create side of things */
2020-03-10 10:09:26 -04:00
		if (nd->flags & LOOKUP_RCU) {
2020-12-17 09:19:08 -07:00
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
2020-03-10 10:09:26 -04:00
		}
2019-07-14 13:22:27 -04:00
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
2012-06-05 15:10:13 +02:00
		/* trailing slashes? */
2015-05-08 18:05:21 -04:00
		if (unlikely(nd->last.name[nd->last.len]))
2020-01-14 10:13:40 -05:00
			return ERR_PTR(-EISDIR);
2012-06-05 15:10:13 +02:00
	}
2009-12-24 03:39:50 -05:00
2016-04-28 19:35:16 -04:00
	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
2020-12-17 09:19:08 -07:00
		got_write = !mnt_want_write(nd->path.mnt);
2012-07-31 00:53:35 +04:00
		/*
		 * do _not_ fail yet - we might not need that or fail with
		 * a different error; let lookup_open() decide; we'll be
		 * dropping this one anyway.
		 */
	}
2016-04-28 19:35:16 -04:00
	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);
	else
		inode_lock_shared(dir->d_inode);
2020-01-09 14:25:14 -05:00
	dentry = lookup_open(nd, file, op, got_write);
2020-03-05 13:25:20 -05:00
	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
		fsnotify_create(dir->d_inode, dentry);
2016-04-28 19:35:16 -04:00
	if (open_flag & O_CREAT)
		inode_unlock(dir->d_inode);
	else
		inode_unlock_shared(dir->d_inode);
2009-12-24 02:12:06 -05:00
2020-01-26 11:06:21 -05:00
	if (got_write)
2020-01-26 10:22:24 -05:00
		mnt_drop_write(nd->path.mnt);
2012-06-05 15:10:17 +02:00
2020-01-26 10:22:24 -05:00
	if (IS_ERR(dentry))
		return ERR_CAST(dentry);
2020-01-26 10:48:16 -05:00
	if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
2020-01-09 14:30:08 -05:00
		dput(nd->path.dentry);
		nd->path.dentry = dentry;
2020-01-26 11:06:21 -05:00
		return NULL;
2009-12-24 01:58:28 -05:00
	}
2020-01-09 14:58:31 -05:00
finish_lookup:
2020-03-10 21:54:54 -04:00
	if (nd->depth)
		put_link(nd);
2022-07-03 22:07:32 -04:00
	res = step_into(nd, WALK_TRAILING, dentry);
2020-03-10 10:19:24 -04:00
	if (unlikely(res))
2020-01-14 13:34:20 -05:00
		nd->flags &= ~(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL);
2020-03-10 10:19:24 -04:00
	return res;
2020-01-26 11:06:21 -05:00
}
/*
 * Handle the last step of open()
 */
2020-03-05 11:41:29 -05:00
static int do_open(struct nameidata *nd,
2020-01-26 11:06:21 -05:00
		   struct file *file, const struct open_flags *op)
{
2021-01-21 14:19:43 +01:00
	struct user_namespace *mnt_userns;
2020-01-26 11:06:21 -05:00
	int open_flag = op->open_flag;
	bool do_truncate;
	int acc_mode;
	int error;
2020-03-10 10:19:24 -04:00
	if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
		error = complete_walk(nd);
		if (error)
			return error;
	}
2020-01-26 10:48:16 -05:00
	if (!(file->f_mode & FMODE_CREATED))
		audit_inode(nd->name, nd->path.dentry, 0);
2021-01-21 14:19:43 +01:00
	mnt_userns = mnt_user_ns(nd->path.mnt);
2018-08-23 17:00:35 -07:00
	if (open_flag & O_CREAT) {
open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
Currently path_openat() has "EEXIST on O_EXCL|O_CREAT" checks done on one
of the ways out of open_last_lookups(). There are 4 cases:
1) the last component is . or ..; check is not done.
2) we had FMODE_OPENED or FMODE_CREATED set while in lookup_open();
check is not done.
3) symlink to be traversed is found; check is not done (nor
should it be)
4) everything else: check done (before complete_walk(), even).
In case (1) O_EXCL|O_CREAT ends up failing with -EISDIR - that's
open("/tmp/.", O_CREAT|O_EXCL, 0600)
Note that in the same conditions
open("/tmp", O_CREAT|O_EXCL, 0600)
would have yielded EEXIST. Either error is allowed, switching to -EEXIST
in these cases would've been more consistent.
Case (2) is more subtle; first of all, if we have FMODE_CREATED set, the
object hadn't existed prior to the call. The check should not be done in
such a case. The rest is problematic, though - we have
FMODE_OPENED set (i.e. it went through ->atomic_open() and got
successfully opened there)
FMODE_CREATED is *NOT* set
O_CREAT and O_EXCL are both set.
Any such case is a bug - either we failed to set FMODE_CREATED when we
had, in fact, created an object (no such instances in the tree) or
we have opened a pre-existing file despite having had both O_CREAT and
O_EXCL passed. One of those was, in fact caught (and fixed) while
sorting out this mess (gfs2 on cold dcache). And in such situations
we should fail with EEXIST.
Note that for (1) and (4) FMODE_CREATED is not set - for (1) there's nothing
in handle_dots() to set it, for (4) we'd explicitly checked that.
And (1), (2) and (4) are exactly the cases when we leave the loop in
the caller, with do_open() called immediately after that loop. IOW, we
can move the check over there, and make it
If we have O_CREAT|O_EXCL and after successful pathname resolution
FMODE_CREATED is *not* set, we must have run into a preexisting file and
should fail with EEXIST.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
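The rule stated above (O_CREAT|O_EXCL on a pre-existing file must fail with EEXIST) is observable from userspace. The following is a small sketch that exercises it with standard POSIX calls; the helper names and the scratch-file template under /tmp are choices of this example, and it assumes /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 0 if open(path, O_CREAT | O_EXCL) created the file, otherwise
 * the errno it failed with. A file that got created is closed again.
 */
static int excl_create_errno(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}

/*
 * Creates a scratch file with mkstemp(), retries an exclusive create on
 * the same name, removes the file, and returns what the retry reported.
 */
static int excl_create_on_existing(void)
{
	char path[] = "/tmp/excl-demo-XXXXXX";
	int fd = mkstemp(path);
	int err;

	if (fd < 0)
		return -1;	/* could not set up the scratch file */
	close(fd);
	err = excl_create_errno(path);
	unlink(path);
	return err;
}
```

The "/tmp/." case from the message is deliberately left out of the sketch, since the errno it yields (EISDIR or EEXIST) depends on the kernel version.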
2020-03-10 10:13:53 -04:00
		if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
			return -EEXIST;
2018-08-23 17:00:35 -07:00
		if (d_is_dir(nd->path.dentry))
2020-03-05 11:41:29 -05:00
			return -EISDIR;
2021-01-21 14:19:43 +01:00
		error = may_create_in_sticky(mnt_userns, nd,
2018-08-23 17:00:35 -07:00
					     d_backing_inode(nd->path.dentry));
		if (unlikely(error))
2020-03-05 11:41:29 -05:00
			return error;
2018-08-23 17:00:35 -07:00
	}
2014-04-01 17:08:41 +02:00
	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
2020-03-05 11:41:29 -05:00
		return -ENOTDIR;
2011-03-09 00:59:59 -05:00
2020-01-26 10:38:17 -05:00
	do_truncate = false;
	acc_mode = op->acc_mode;
2020-01-26 10:32:22 -05:00
	if (file->f_mode & FMODE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		acc_mode = 0;
2020-01-26 10:38:17 -05:00
	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
2011-03-09 00:13:14 -05:00
		error = mnt_want_write(nd->path.mnt);
		if (error)
2020-03-05 11:41:29 -05:00
			return error;
2020-01-26 10:38:17 -05:00
		do_truncate = true;
2011-03-09 00:13:14 -05:00
	}
2021-01-21 14:19:43 +01:00
	error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);
2020-01-26 10:38:17 -05:00
	if (!error && !(file->f_mode & FMODE_OPENED))
2020-01-26 10:06:13 -05:00
		error = vfs_open(&nd->path, file);
2020-01-26 10:38:17 -05:00
	if (!error)
		error = ima_file_check(file, op->acc_mode);
	if (!error && do_truncate)
2021-01-21 14:19:43 +01:00
		error = handle_truncate(mnt_userns, file);
2016-02-27 19:17:33 -05:00
	if (unlikely(error > 0)) {
		WARN_ON(1);
		error = -EINVAL;
	}
2020-01-26 10:38:17 -05:00
	if (do_truncate)
2011-03-09 00:13:14 -05:00
		mnt_drop_write(nd->path.mnt);
2020-03-05 11:41:29 -05:00
	return error;
2009-12-24 01:58:28 -05:00
}
2021-01-21 14:19:33 +01:00
/**
 * vfs_tmpfile - create tmpfile
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @dentry:	pointer to dentry of the base directory
 * @mode:	mode of the new tmpfile
2021-02-15 20:29:28 -08:00
 * @open_flag:	flags
2021-01-21 14:19:33 +01:00
 *
 * Create a temporary file.
 *
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
 */
struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
			   struct dentry *dentry, umode_t mode, int open_flag)
2017-01-17 06:34:52 +02:00
{
	struct dentry *child = NULL;
	struct inode *dir = dentry->d_inode;
	struct inode *inode;
	int error;
	/* we want directory to be writable */
2021-01-21 14:19:33 +01:00
	error = inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC);
2017-01-17 06:34:52 +02:00
	if (error)
		goto out_err;
	error = -EOPNOTSUPP;
	if (!dir->i_op->tmpfile)
		goto out_err;
	error = -ENOMEM;
2017-07-04 17:25:22 +01:00
	child = d_alloc(dentry, &slash_name);
2017-01-17 06:34:52 +02:00
	if (unlikely(!child))
		goto out_err;
2022-07-14 14:11:27 +08:00
	mode = vfs_prepare_mode(mnt_userns, dir, mode, mode, mode);
2021-01-21 14:19:43 +01:00
	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
2017-01-17 06:34:52 +02:00
	if (error)
		goto out_err;
	error = -ENOENT;
	inode = child->d_inode;
	if (unlikely(!inode))
		goto out_err;
	if (!(open_flag & O_EXCL)) {
		spin_lock(&inode->i_lock);
		inode->i_state |= I_LINKABLE;
		spin_unlock(&inode->i_lock);
	}
2021-01-21 14:19:45 +01:00
	ima_post_create_tmpfile(mnt_userns, inode);
2017-01-17 06:34:52 +02:00
	return child;
out_err:
	dput(child);
	return ERR_PTR(error);
}
EXPORT_SYMBOL(vfs_tmpfile);
2015-05-12 18:43:07 -04:00
static int do_tmpfile(struct nameidata *nd, unsigned flags,
2013-06-07 01:20:27 -04:00
		const struct open_flags *op,
2018-06-08 13:43:47 -04:00
		struct file *file)
2013-06-07 01:20:27 -04:00
{
2021-01-21 14:19:33 +01:00
	struct user_namespace *mnt_userns;
2015-05-12 16:36:12 -04:00
	struct dentry *child;
	struct path path;
2015-05-12 18:43:07 -04:00
	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
2013-06-07 01:20:27 -04:00
	if (unlikely(error))
		return error;
2015-05-12 16:36:12 -04:00
	error = mnt_want_write(path.mnt);
2013-06-07 01:20:27 -04:00
	if (unlikely(error))
		goto out;
2021-01-21 14:19:33 +01:00
	mnt_userns = mnt_user_ns(path.mnt);
	child = vfs_tmpfile(mnt_userns, path.dentry, op->mode, op->open_flag);
2017-01-17 06:34:52 +02:00
	error = PTR_ERR(child);
2017-09-26 03:21:26 +09:00
	if (IS_ERR(child))
2013-06-07 01:20:27 -04:00
		goto out2;
2015-05-12 16:36:12 -04:00
	dput(path.dentry);
	path.dentry = child;
2015-05-12 18:43:07 -04:00
	audit_inode(nd->name, child, 0);
fs: allow open(dir, O_TMPFILE|..., 0) with mode 0
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:
Note that this mode applies only to future accesses of the newly
created file; the open() call that creates a read-only file
may well return a read/write file descriptor.
The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.
O_TMPFILE, however, behaves differently:
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
assert(fd == -1);
assert(errno == EACCES);
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
assert(fd > 0);
For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
if (*opened & FILE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = MAY_OPEN;
path_to_nameidata(path, nd);
goto finish_open_created;
}
But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().
This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().
A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523
The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.
On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.
But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.
Signed-off-by: Eric Rannaud <e@nanocritical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
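Why the requested access mask matters for a mode-0 file can be shown with a toy model. The sketch below is not the kernel's may_open(): model_may_open() and the MODEL_MAY_* constants are hypothetical, and the check only looks at the owner permission bits. It assumes, as the message describes, that a freshly created inode is opened without a read/write permission check.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for access-request bits (values are this sketch's own). */
#define MODEL_MAY_WRITE 0x2
#define MODEL_MAY_READ  0x4

/*
 * Toy owner-only permission check: every requested read/write bit must be
 * granted by the owner bits of file_mode. Returns 0 or -EACCES.
 */
static int model_may_open(unsigned int file_mode, int acc_mode)
{
	unsigned int owner_bits = (file_mode >> 6) & 7;

	if ((acc_mode & ~owner_bits) & (MODEL_MAY_READ | MODEL_MAY_WRITE))
		return -EACCES;
	return 0;
}
```

With file_mode 0 and a read/write request the model returns -EACCES, which mirrors the pre-patch O_TMPFILE behaviour; passing an empty access mask for the just-created file lets the same mode-0 open succeed, matching the O_CREAT path.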
2014-10-30 01:51:01 -07:00
	/* Don't check for other permissions, the inode was just created */
2021-01-21 14:19:43 +01:00
	error = may_open(mnt_userns, &path, 0, op->open_flag);
2020-03-11 17:22:19 -04:00
	if (!error)
		error = vfs_open(&path, file);
2013-06-07 01:20:27 -04:00
out2:
2015-05-12 16:36:12 -04:00
	mnt_drop_write(path.mnt);
2013-06-07 01:20:27 -04:00
out:
2015-05-12 16:36:12 -04:00
	path_put(&path);
2013-06-07 01:20:27 -04:00
	return error;
}
2016-04-26 00:02:50 -04:00
static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
{
	struct path path;
	int error = path_lookupat(nd, flags, &path);
	if (!error) {
		audit_inode(nd->name, path.dentry, 0);
2018-07-10 13:22:28 -04:00
		error = vfs_open(&path, file);
2016-04-26 00:02:50 -04:00
		path_put(&path);
	}
	return error;
}
2015-05-12 18:43:07 -04:00
static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
2005-04-16 15:20:36 -07:00
{
2012-06-22 12:40:19 +04:00
	struct file *file;
2011-02-23 17:54:08 -05:00
	int error;
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (i.e. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
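The per-dentry sequence check described in the design list above (load the tuple, then verify that nothing changed) follows the general sequence-counter pattern. Below is a minimal single-writer sketch of that pattern using C11 atomics; struct model_entry and the model_* functions are hypothetical names for illustration and are not the kernel's seqcount API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature of a sequence-protected record. */
struct model_entry {
	atomic_uint seq;	/* odd while the writer is mid-update */
	atomic_int parent;
	atomic_int inode;
};

struct model_snapshot {
	int parent;
	int inode;
};

/* Shared instance used by the usage example; zero-initialized. */
static struct model_entry model_demo_entry;

/* Single writer: bump to odd, update the fields, bump to even. */
static void model_write(struct model_entry *e, int parent, int inode)
{
	unsigned int s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->parent, parent, memory_order_relaxed);
	atomic_store_explicit(&e->inode, inode, memory_order_relaxed);
	atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

static unsigned int model_read_begin(struct model_entry *e)
{
	return atomic_load_explicit(&e->seq, memory_order_acquire);
}

/* True if the read raced with a writer and must not be trusted. */
static bool model_read_retry(struct model_entry *e, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return (start & 1) ||
	       atomic_load_explicit(&e->seq, memory_order_relaxed) != start;
}

/* Lock-free read of a consistent (parent, inode) pair. */
static struct model_snapshot model_read(struct model_entry *e)
{
	struct model_snapshot snap;
	unsigned int start;

	do {
		start = model_read_begin(e);
		snap.parent = atomic_load_explicit(&e->parent,
						   memory_order_relaxed);
		snap.inode = atomic_load_explicit(&e->inode,
						  memory_order_relaxed);
	} while (model_read_retry(e, start));
	return snap;
}
```

This sketch simply loops until it gets a stable snapshot. The commit message describes a different reaction to a failed check: rcu-walk is abandoned and the lookup is redone as a ref-walk, signalled by -ECHILD.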
2011-01-07 17:49:52 +11:00
2018-07-11 15:00:04 -04:00
	file = alloc_empty_file(op->open_flag, current_cred());
2013-02-14 20:41:04 -05:00
	if (IS_ERR(file))
		return file;
2011-01-07 17:49:52 +11:00
2013-07-13 13:26:37 +04:00
	if (unlikely(file->f_flags & __O_TMPFILE)) {
2018-06-08 13:43:47 -04:00
		error = do_tmpfile(nd, flags, op, file);
2018-07-09 16:38:06 -04:00
	} else if (unlikely(file->f_flags & O_PATH)) {
2016-04-26 00:02:50 -04:00
		error = do_o_path(nd, flags, file);
2018-07-09 16:38:06 -04:00
	} else {
		const char *s = path_init(nd, flags);
		while (!(error = link_path_walk(s, nd)) &&
2020-03-05 11:41:29 -05:00
		       (s = open_last_lookups(nd, file, op)) != NULL)
2020-01-14 10:13:40 -05:00
			;
2020-03-05 11:41:29 -05:00
		if (!error)
			error = do_open(nd, file, op);
2018-07-09 16:38:06 -04:00
		terminate_walk(nd);
2009-12-26 07:16:40 -05:00
	}
2018-06-08 12:56:55 -04:00
	if (likely(!error)) {
2018-06-08 12:58:04 -04:00
		if (likely(file->f_mode & FMODE_OPENED))
2018-06-08 12:56:55 -04:00
			return file;
		WARN_ON(1);
		error = -EINVAL;
2012-05-21 17:30:19 +02:00
	}
2018-06-08 12:56:55 -04:00
	fput(file);
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
2012-06-22 12:41:10 +04:00
	}
2018-06-08 12:56:55 -04:00
	return ERR_PTR(error);
2005-04-16 15:20:36 -07:00
}
2012-10-10 16:43:10 -04:00
struct file *do_filp_open(int dfd, struct filename *pathname,
2013-06-11 08:23:01 +04:00
		const struct open_flags *op)
2011-02-23 17:54:08 -05:00
{
2015-05-13 07:28:08 -04:00
	struct nameidata nd;
2013-06-11 08:23:01 +04:00
	int flags = op->lookup_flags;
2011-02-23 17:54:08 -05:00
	struct file *filp;
2021-04-01 22:28:03 -04:00
	set_nameidata(&nd, dfd, pathname, NULL);
2015-05-12 18:43:07 -04:00
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
2011-02-23 17:54:08 -05:00
	if (unlikely(filp == ERR_PTR(-ECHILD)))
2015-05-12 18:43:07 -04:00
		filp = path_openat(&nd, op, flags);
2011-02-23 17:54:08 -05:00
	if (unlikely(filp == ERR_PTR(-ESTALE)))
2015-05-12 18:43:07 -04:00
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
2015-05-13 07:28:08 -04:00
	restore_nameidata();
2011-02-23 17:54:08 -05:00
	return filp;
}
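The retry ladder in do_filp_open() can be tried out in isolation. What follows is a minimal user-space sketch, not kernel code: sk_path_open() stands in for path_openat(), and the sk_* names and flag values are hypothetical. It only models the control flow: a lockless rcu-walk attempt first, a ref-walk retry on -ECHILD, and a revalidating retry on -ESTALE.

```c
#include <errno.h>

/* Hypothetical flag values; the kernel's LOOKUP_* constants differ. */
enum { SK_LOOKUP_RCU = 1 << 0, SK_LOOKUP_REVAL = 1 << 1 };

/* Result each walk mode should report, plus a count of attempts made. */
struct sk_walk {
	int rcu_result;		/* 0, or -ECHILD to force a ref-walk retry */
	int ref_result;		/* 0, or -ESTALE to force a revalidating retry */
	int reval_result;	/* final result of the revalidating walk */
	int attempts;
};

/* Stand-in for path_openat(): reports the configured result for the mode. */
static int sk_path_open(struct sk_walk *w, unsigned int flags)
{
	w->attempts++;
	if (flags & SK_LOOKUP_RCU)
		return w->rcu_result;
	if (flags & SK_LOOKUP_REVAL)
		return w->reval_result;
	return w->ref_result;
}

/* Same shape as do_filp_open(): rcu-walk, then ref-walk, then reval. */
static int sk_open(struct sk_walk *w, unsigned int flags)
{
	int err = sk_path_open(w, flags | SK_LOOKUP_RCU);

	if (err == -ECHILD)
		err = sk_path_open(w, flags);
	if (err == -ESTALE)
		err = sk_path_open(w, flags | SK_LOOKUP_REVAL);
	return err;
}
```

In the common case only the first call runs, so the slower modes cost nothing unless the lockless walk has to give up.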
2021-04-01 19:00:57 -04:00
struct file *do_file_open_root(const struct path *root,
2013-06-11 08:23:01 +04:00
		const char *name, const struct open_flags *op)
2011-03-11 12:08:24 -05:00
{
2015-05-13 07:28:08 -04:00
	struct nameidata nd;
2011-03-11 12:08:24 -05:00
	struct file *file;
2015-01-22 00:00:03 -05:00
	struct filename *filename;
2021-04-01 22:03:41 -04:00
	int flags = op->lookup_flags;
2011-03-11 12:08:24 -05:00
2021-04-01 19:00:57 -04:00
	if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN)
2011-03-11 12:08:24 -05:00
		return ERR_PTR(-ELOOP);
2015-01-22 00:00:03 -05:00
	filename = getname_kernel(name);
2015-08-12 15:59:44 +05:30
	if (IS_ERR(filename))
2015-01-22 00:00:03 -05:00
		return ERR_CAST(filename);
2021-04-01 22:28:03 -04:00
	set_nameidata(&nd, -1, filename, root);
2015-05-12 18:43:07 -04:00
	file = path_openat(&nd, op, flags | LOOKUP_RCU);
2011-03-11 12:08:24 -05:00
	if (unlikely(file == ERR_PTR(-ECHILD)))
2015-05-12 18:43:07 -04:00
		file = path_openat(&nd, op, flags);
2011-03-11 12:08:24 -05:00
	if (unlikely(file == ERR_PTR(-ESTALE)))
2015-05-12 18:43:07 -04:00
		file = path_openat(&nd, op, flags | LOOKUP_REVAL);
2015-05-13 07:28:08 -04:00
	restore_nameidata();
2015-01-22 00:00:03 -05:00
	putname(filename);
2011-03-11 12:08:24 -05:00
	return file;
}
2021-09-01 10:51:43 -07:00
static struct dentry *filename_create(int dfd, struct filename *name,
				      struct path *path, unsigned int lookup_flags)
2005-04-16 15:20:36 -07:00
{
2005-06-23 00:09:49 -07:00
	struct dentry *dentry = ERR_PTR(-EEXIST);
2015-05-09 11:19:16 -04:00
	struct qstr last;
VFS: filename_create(): fix incorrect intent.
When asked to create a path ending '/', but which is not to be a
directory (LOOKUP_DIRECTORY not set), filename_create() will never try
to create the file. If it doesn't exist, -ENOENT is reported.
However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystem's
->lookup() function, even though there is no intent to create. This is
misleading and can cause incorrect behaviour.
If you try
ln -s foo /path/dir/
where 'dir' is a directory on an NFS filesystem which is not currently
known in the dcache, this will fail with ENOENT.
But as the name is not in the dcache, nfs_lookup gets called with
LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
lookup, with the expectation that a subsequent call to create the target
will be made, and the lookup can be combined with the creation. In the
case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
made. Instead filename_create() sees that the dentry is not (yet)
positive and returns -ENOENT - even though the directory actually
exists.
So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
create, and use the absence of these flags to decide if -ENOENT should
be returned.
Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
split that out and store it in 'reval_flag'. __lookup_hash() then gets
reval_flag combined with whatever create flags were determined to be
needed.
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-14 13:57:35 +10:00
	bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
2015-05-09 11:19:16 -04:00
	int type;
2012-06-12 16:20:30 +02:00
	int err2;
2012-12-11 12:10:06 -05:00
	int error;
2022-04-14 13:57:35 +10:00
	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
2021-07-08 13:34:38 +07:00
	if (error)
		return ERR_PTR(error);
2005-04-16 15:20:36 -07:00
2005-06-23 00:09:49 -07:00
	/*
	 * Yucky last component or no last component at all?
	 * (foo/., foo/.., /////)
	 */
2015-05-12 17:32:54 -04:00
	if (unlikely(type != LAST_NORM))
2011-06-27 16:53:43 -04:00
		goto out;
2005-06-23 00:09:49 -07:00
2012-06-12 16:20:30 +02:00
	/* don't fail immediately if it's r/o, at least try to report other errors */
2015-05-09 11:19:16 -04:00
	err2 = mnt_want_write(path->mnt);
2005-06-23 00:09:49 -07:00
	/*
2022-04-14 13:57:35 +10:00
	 * Do the final lookup.  Suppress 'create' if there is a trailing
	 * '/', and a directory wasn't requested.
2005-06-23 00:09:49 -07:00
	 */
2022-04-14 13:57:35 +10:00
	if (last.name[last.len] && !want_dir)
		create_flags = 0;
2016-01-22 15:40:57 -05:00
	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
2022-04-14 13:57:35 +10:00
	dentry = __lookup_hash(&last, path->dentry, reval_flag | create_flags);
2005-04-16 15:20:36 -07:00
	if (IS_ERR(dentry))
2012-07-20 02:25:00 +04:00
		goto unlock;
2005-06-23 00:09:49 -07:00
2012-07-20 02:25:00 +04:00
	error = -EEXIST;
2013-09-12 19:22:53 +01:00
	if (d_is_positive(dentry))
2012-07-20 02:25:00 +04:00
		goto fail;
2013-09-12 19:22:53 +01:00
2005-06-23 00:09:49 -07:00
	/*
	 * Special case - lookup gave negative, but... we had foo/bar/
	 * From the vfs_mknod() POV we just have a negative dentry -
	 * all is fine. Let's be bastards - you had / on the end, you've
	 * been asking for (non-existent) directory. -ENOENT for you.
	 */
2022-04-14 13:57:35 +10:00
	if (unlikely(!create_flags)) {
2012-07-20 02:25:00 +04:00
		error = -ENOENT;
2011-06-27 16:53:43 -04:00
		goto fail;
2008-05-15 04:49:12 -04:00
	}
2012-06-12 16:20:30 +02:00
	if (unlikely(err2)) {
		error = err2;
2012-07-20 02:25:00 +04:00
		goto fail;
2012-06-12 16:20:30 +02:00
	}
2005-04-16 15:20:36 -07:00
	return dentry;
fail:
2012-07-20 02:25:00 +04:00
	dput(dentry);
	dentry = ERR_PTR(error);
unlock:
2016-01-22 15:40:57 -05:00
	inode_unlock(path->dentry->d_inode);
2012-06-12 16:20:30 +02:00
	if (!err2)
2015-05-09 11:19:16 -04:00
		mnt_drop_write(path->mnt);
2011-06-27 16:53:43 -04:00
out:
2015-05-09 11:19:16 -04:00
	path_put(path);
2005-04-16 15:20:36 -07:00
	return dentry;
}
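The trailing-slash test in filename_create() relies on the last component being a (pointer, length) pair that points into the original path string, so the byte just past the component is either NUL or a '/'. Below is a minimal user-space sketch of that decision. The sk_* names, the flag values and the component parser are hypothetical stand-ins for struct qstr, LOOKUP_CREATE|LOOKUP_EXCL and filename_parentat(); only the final check mirrors the code above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for illustration only. */
enum { SK_CREATE = 1 << 0, SK_EXCL = 1 << 1 };

/* Last path component as a (pointer, length) pair, like struct qstr. */
struct sk_last {
	const char *name;
	size_t len;
};

/* Locate the final component of @path, ignoring trailing slashes. */
static struct sk_last sk_last_component(const char *path)
{
	size_t end = strlen(path);
	size_t start;
	struct sk_last last;

	while (end > 0 && path[end - 1] == '/')
		end--;
	start = end;
	while (start > 0 && path[start - 1] != '/')
		start--;
	last.name = path + start;
	last.len = end - start;
	return last;
}

/* Drop the create intent for "foo/" unless a directory was requested. */
static unsigned int sk_create_flags(const char *path, bool want_dir)
{
	struct sk_last last = sk_last_component(path);

	if (last.name[last.len] && !want_dir)
		return 0;
	return SK_CREATE | SK_EXCL;
}
```

With the flags cleared, a filesystem such as NFS has no reason to skip the real lookup, so an existing directory is reported as existing rather than as -ENOENT.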
2015-01-22 02:16:49 -05:00
2021-09-01 10:51:43 -07:00
struct dentry *kern_path_create(int dfd, const char *pathname,
2021-07-08 13:34:39 +07:00
				struct path *path, unsigned int lookup_flags)
{
2021-09-01 10:51:43 -07:00
	struct filename *filename = getname_kernel(pathname);
	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
2021-07-08 13:34:39 +07:00
2021-09-01 10:51:43 -07:00
	putname(filename);
2021-07-08 13:34:39 +07:00
	return res;
}
2011-06-26 11:50:15 -04:00
EXPORT_SYMBOL(kern_path_create);
2012-07-20 01:15:31 +04:00
void done_path_create(struct path *path, struct dentry *dentry)
{
	dput(dentry);
2016-01-22 15:40:57 -05:00
	inode_unlock(path->dentry->d_inode);
2012-07-20 02:25:00 +04:00
	mnt_drop_write(path->mnt);
2012-07-20 01:15:31 +04:00
	path_put(path);
}
EXPORT_SYMBOL(done_path_create);
2015-05-13 07:00:28 -04:00
inline struct dentry *user_path_create(int dfd, const char __user *pathname,
2012-12-11 12:10:06 -05:00
				struct path *path, unsigned int lookup_flags)
2011-06-26 11:50:15 -04:00
{
2021-09-01 10:51:43 -07:00
	struct filename *filename = getname(pathname);
	struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
	putname(filename);
	return res;
2011-06-26 11:50:15 -04:00
}
EXPORT_SYMBOL(user_path_create);
2021-01-21 14:19:33 +01:00
/**
 * vfs_mknod - create device node or file
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @dir:	inode of the parent directory
 * @dentry:	dentry of the device node or file to create
 * @mode:	mode of the new device node or file
 * @dev:	device number of device to create
 *
 * Create a device node or file.
 *
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass init_user_ns.
 */
int vfs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
	      struct dentry *dentry, umode_t mode, dev_t dev)
2005-04-16 15:20:36 -07:00
{
2020-05-14 16:44:23 +02:00
	bool is_whiteout = S_ISCHR(mode) && dev == WHITEOUT_DEV;
2021-01-21 14:19:33 +01:00
	int error = may_create(mnt_userns, dir, dentry);
2005-04-16 15:20:36 -07:00
	if (error)
		return error;
2020-05-14 16:44:23 +02:00
	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
	    !capable(CAP_MKNOD))
2005-04-16 15:20:36 -07:00
		return -EPERM;
2008-12-04 10:06:33 -05:00
	if (!dir->i_op->mknod)
2005-04-16 15:20:36 -07:00
		return -EPERM;
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If either of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then umask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created, and all relevant
filesystems have to call posix_acl_create() before inode_init_owner(),
where we currently take care of S_ISGID handling, S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner(), meaning
that there are additional ordering issues that influence S_ISGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly, it's unsustainably messy, especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_ISGID stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This means S_ISGID handling is
unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs that have callchains like:
sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is a VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therefore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolume relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allow S_ISGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directory's group
unconditionally. In these cases none of the filesystems call
inode_init_owner() and thus never stripped the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-14 14:11:27 +08:00
	mode = vfs_prepare_mode(mnt_userns, dir, mode, mode, mode);
2008-04-29 01:00:10 -07:00
	error = devcgroup_inode_mknod(mode, dev);
	if (error)
		return error;
2005-04-16 15:20:36 -07:00
	error = security_inode_mknod(dir, dentry, mode, dev);
	if (error)
		return error;
2021-01-21 14:19:43 +01:00
	error = dir->i_op->mknod(mnt_userns, dir, dentry, mode, dev);
2005-09-09 13:01:44 -07:00
	if (!error)
2005-11-03 15:57:06 +00:00
		fsnotify_create(dir, dentry);
2005-04-16 15:20:36 -07:00
	return error;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(vfs_mknod);
2005-04-16 15:20:36 -07:00
2011-07-26 04:31:40 -04:00
static int may_mknod(umode_t mode)
2008-02-15 14:37:57 -08:00
{
	switch (mode & S_IFMT) {
	case S_IFREG:
	case S_IFCHR:
	case S_IFBLK:
	case S_IFIFO:
	case S_IFSOCK:
	case 0: /* zero mode translates to S_IFREG */
		return 0;
	case S_IFDIR:
		return -EPERM;
	default:
		return -EINVAL;
	}
}
2021-07-08 13:34:44 +07:00
static int do_mknodat(int dfd, struct filename *name, umode_t mode,
2018-03-11 11:34:50 +01:00
		unsigned int dev)
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
	struct user_namespace *mnt_userns;
2008-07-21 09:32:51 -04:00
	struct dentry *dentry;
2011-06-26 11:50:15 -04:00
	struct path path;
	int error;
2012-12-20 16:00:10 -05:00
	unsigned int lookup_flags = 0;
2005-04-16 15:20:36 -07:00
2012-07-20 01:17:26 +04:00
	error = may_mknod(mode);
	if (error)
2021-07-08 13:34:40 +07:00
		goto out1;
2012-12-20 16:00:10 -05:00
retry:
2021-09-01 10:51:43 -07:00
	dentry = filename_create(dfd, name, &path, lookup_flags);
2021-07-08 13:34:40 +07:00
	error = PTR_ERR(dentry);
2011-06-26 11:50:15 -04:00
	if (IS_ERR(dentry))
2021-07-08 13:34:40 +07:00
		goto out1;
2008-07-21 09:32:51 -04:00
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If either of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
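The rule above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel's implementation: struct create_ctx and sketch_strip_sgid() are hypothetical names, and the group membership and CAP_FSETID checks are reduced to plain booleans.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical stand-in for what the kernel knows about caller and parent. */
struct create_ctx {
	mode_t dir_mode;       /* mode of the parent directory */
	bool caller_in_group;  /* caller is in the group the new inode gets */
	bool has_cap_fsetid;   /* caller has CAP_FSETID over the parent dir */
};

/* Return the requested mode with S_ISGID dropped if the caller may not keep it. */
static mode_t sketch_strip_sgid(const struct create_ctx *ctx, mode_t mode)
{
	/* Only the combination S_ISGID + S_IXGRP is security relevant. */
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;
	/* Directories inherit setgid; a non-setgid parent needs no handling. */
	if (S_ISDIR(mode) || !(ctx->dir_mode & S_ISGID))
		return mode;
	/* Either condition is enough to keep the bit. */
	if (ctx->caller_in_group || ctx->has_cap_fsetid)
		return mode;
	return mode & ~S_ISGID;
}
```

With a setgid parent and a caller that is neither in the group nor holds CAP_FSETID, a request for 02770 comes back as 0770 in this model; in every other case the mode is returned unchanged.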
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then umask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created, and all relevant
filesystems have to call posix_acl_create() before inode_init_owner(),
where we currently take care of S_ISGID handling, S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner(), meaning
that there are additional ordering issues that influence S_ISGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly, it's unsustainably messy, especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and, while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call, we can improve the S_ISGID stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This means S_ISGID handling is
unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
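Why the ordering matters can be shown with a minimal userspace sketch. All function names below are hypothetical, and the model assumes a non-directory being created in a setgid parent, with the privilege checks collapsed into one boolean. It is not the kernel's vfs_prepare_mode(), only an illustration of the two orderings.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical model: drop S_ISGID for an unprivileged caller asking for
 * S_ISGID together with S_IXGRP (parent assumed to be setgid). */
static mode_t sketch_sgid_strip(mode_t mode, bool privileged)
{
	if (!privileged && (mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
		return mode & ~S_ISGID;
	return mode;
}

/* Plain umask application. */
static mode_t sketch_umask_strip(mode_t mode, mode_t mask)
{
	return mode & ~mask;
}

/* Ordering described for the new helper: setgid handling first, then umask. */
static mode_t sketch_sgid_then_umask(mode_t mode, mode_t mask, bool privileged)
{
	mode = sketch_sgid_strip(mode, privileged);
	return sketch_umask_strip(mode, mask);
}

/* Order-dependent variant: umask first, setgid check afterwards. */
static mode_t sketch_umask_then_sgid(mode_t mode, mode_t mask, bool privileged)
{
	mode = sketch_umask_strip(mode, mask);
	return sketch_sgid_strip(mode, privileged);
}
```

For an unprivileged request of 02770 with a umask of 0010, the first ordering yields 0760, while the second yields 02760: the umask has already removed S_IXGRP, so the setgid check no longer triggers and the bit survives.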
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move the calls there to make
sure that filesystems like overlayfs that have callchains like:
sys_mknod()
-> do_mknodat(mode)
-> .mknod = ovl_mknod(mode)
-> ovl_create(mode)
-> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is a VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit, with the exception of
xfs when irix_sgid_inherit mode is turned on, S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therefore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolume relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
IOW, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allow S_ISGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directory's group
unconditionally. In these cases none of the filesystems call
inode_init_owner() and thus never stripped the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-14 14:11:27 +08:00
error = security_path_mknod ( & path , dentry ,
mode_strip_umask ( path . dentry - > d_inode , mode ) , dev ) ;
2008-12-17 13:24:15 +09:00
if ( error )
2021-07-08 13:34:40 +07:00
goto out2 ;
2021-01-21 14:19:33 +01:00
mnt_userns = mnt_user_ns ( path . mnt ) ;
2008-02-15 14:37:57 -08:00
switch ( mode & S_IFMT ) {
2005-04-16 15:20:36 -07:00
case 0 : case S_IFREG :
2021-01-21 14:19:33 +01:00
error = vfs_create ( mnt_userns , path . dentry - > d_inode ,
dentry , mode , true ) ;
2016-02-29 19:52:05 -05:00
if ( ! error )
2021-01-21 14:19:45 +01:00
ima_post_path_mknod ( mnt_userns , dentry ) ;
2005-04-16 15:20:36 -07:00
break ;
case S_IFCHR : case S_IFBLK :
2021-01-21 14:19:33 +01:00
error = vfs_mknod ( mnt_userns , path . dentry - > d_inode ,
dentry , mode , new_decode_dev ( dev ) ) ;
2005-04-16 15:20:36 -07:00
break ;
case S_IFIFO : case S_IFSOCK :
2021-01-21 14:19:33 +01:00
error = vfs_mknod ( mnt_userns , path . dentry - > d_inode ,
dentry , mode , 0 ) ;
2005-04-16 15:20:36 -07:00
break ;
}
2021-07-08 13:34:40 +07:00
out2 :
2012-07-20 01:15:31 +04:00
done_path_create ( & path , dentry ) ;
2012-12-20 16:00:10 -05:00
if ( retry_estale ( error , lookup_flags ) ) {
lookup_flags | = LOOKUP_REVAL ;
goto retry ;
}
2021-07-08 13:34:40 +07:00
out1 :
putname ( name ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2018-03-11 11:34:50 +01:00
SYSCALL_DEFINE4 ( mknodat , int , dfd , const char __user * , filename , umode_t , mode ,
unsigned int , dev )
{
2021-07-08 13:34:40 +07:00
return do_mknodat ( dfd , getname ( filename ) , mode , dev ) ;
2018-03-11 11:34:50 +01:00
}
2011-07-25 17:32:17 -04:00
SYSCALL_DEFINE3 ( mknod , const char __user * , filename , umode_t , mode , unsigned , dev )
2006-01-18 17:43:53 -08:00
{
2021-07-08 13:34:40 +07:00
return do_mknodat ( AT_FDCWD , getname ( filename ) , mode , dev ) ;
2006-01-18 17:43:53 -08:00
}
2021-01-21 14:19:33 +01:00
/**
* vfs_mkdir - create directory
* @ mnt_userns : user namespace of the mount the inode was found from
* @ dir : inode of @ dentry
* @ dentry : pointer to dentry of the base directory
* @ mode : mode of the new directory
*
* Create a directory .
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @ mnt_userns . This function will then take
* care to map the inode according to @ mnt_userns before checking permissions .
* On non - idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns .
*/
int vfs_mkdir ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * dentry , umode_t mode )
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
int error = may_create ( mnt_userns , dir , dentry ) ;
2012-02-06 12:45:27 -05:00
unsigned max_links = dir - > i_sb - > s_max_links ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > mkdir )
2005-04-16 15:20:36 -07:00
return - EPERM ;
2022-07-14 14:11:27 +08:00
mode = vfs_prepare_mode ( mnt_userns , dir , mode , S_IRWXUGO | S_ISVTX , 0 ) ;
2005-04-16 15:20:36 -07:00
error = security_inode_mkdir ( dir , dentry , mode ) ;
if ( error )
return error ;
2012-02-06 12:45:27 -05:00
if ( max_links & & dir - > i_nlink > = max_links )
return - EMLINK ;
2021-01-21 14:19:43 +01:00
error = dir - > i_op - > mkdir ( mnt_userns , dir , dentry , mode ) ;
2005-09-09 13:01:44 -07:00
if ( ! error )
2005-11-03 15:57:06 +00:00
fsnotify_mkdir ( dir , dentry ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_mkdir ) ;
2005-04-16 15:20:36 -07:00
2021-07-08 13:34:44 +07:00
int do_mkdirat ( int dfd , struct filename * name , umode_t mode )
2005-04-16 15:20:36 -07:00
{
2006-09-30 23:29:01 -07:00
struct dentry * dentry ;
2011-06-26 11:50:15 -04:00
struct path path ;
int error ;
2012-12-20 16:04:09 -05:00
unsigned int lookup_flags = LOOKUP_DIRECTORY ;
2005-04-16 15:20:36 -07:00
2012-12-20 16:04:09 -05:00
retry :
2021-09-01 10:51:43 -07:00
dentry = filename_create ( dfd , name , & path , lookup_flags ) ;
2021-07-08 13:34:39 +07:00
error = PTR_ERR ( dentry ) ;
2006-09-30 23:29:01 -07:00
if ( IS_ERR ( dentry ) )
2021-07-08 13:34:39 +07:00
goto out_putname ;
2005-04-16 15:20:36 -07:00
2022-07-14 14:11:27 +08:00
error = security_path_mkdir ( & path , dentry ,
mode_strip_umask ( path . dentry - > d_inode , mode ) ) ;
2021-01-21 14:19:33 +01:00
if ( ! error ) {
struct user_namespace * mnt_userns ;
mnt_userns = mnt_user_ns ( path . mnt ) ;
2021-01-21 14:19:43 +01:00
error = vfs_mkdir ( mnt_userns , path . dentry - > d_inode , dentry ,
mode ) ;
2021-01-21 14:19:33 +01:00
}
2012-07-20 01:15:31 +04:00
done_path_create ( & path , dentry ) ;
2012-12-20 16:04:09 -05:00
if ( retry_estale ( error , lookup_flags ) ) {
lookup_flags | = LOOKUP_REVAL ;
goto retry ;
}
2021-07-08 13:34:39 +07:00
out_putname :
putname ( name ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2018-03-11 11:34:49 +01:00
SYSCALL_DEFINE3 ( mkdirat , int , dfd , const char __user * , pathname , umode_t , mode )
{
2021-07-08 13:34:39 +07:00
return do_mkdirat ( dfd , getname ( pathname ) , mode ) ;
2018-03-11 11:34:49 +01:00
}
2011-11-21 14:59:34 -05:00
SYSCALL_DEFINE2 ( mkdir , const char __user * , pathname , umode_t , mode )
2006-01-18 17:43:53 -08:00
{
2021-07-08 13:34:39 +07:00
return do_mkdirat ( AT_FDCWD , getname ( pathname ) , mode ) ;
2006-01-18 17:43:53 -08:00
}
2021-01-21 14:19:33 +01:00
/**
* vfs_rmdir - remove directory
* @ mnt_userns : user namespace of the mount the inode was found from
* @ dir : inode of @ dentry
* @ dentry : pointer to dentry of the base directory
*
* Remove a directory .
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @ mnt_userns . This function will then take
* care to map the inode according to @ mnt_userns before checking permissions .
* On non - idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns .
*/
int vfs_rmdir ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * dentry )
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
int error = may_delete ( mnt_userns , dir , dentry , 1 ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > rmdir )
2005-04-16 15:20:36 -07:00
return - EPERM ;
2011-09-14 18:55:41 +01:00
dget ( dentry ) ;
2016-01-22 15:40:57 -05:00
inode_lock ( dentry - > d_inode ) ;
2011-05-24 13:06:11 -07:00
error = - EBUSY ;
2021-11-18 08:58:08 +00:00
if ( is_local_mountpoint ( dentry ) | |
( dentry - > d_inode - > i_flags & S_KERNEL_FILE ) )
2011-05-24 13:06:11 -07:00
goto out ;
error = security_inode_rmdir ( dir , dentry ) ;
if ( error )
goto out ;
error = dir - > i_op - > rmdir ( dir , dentry ) ;
if ( error )
goto out ;
rmdir(),rename(): do shrink_dcache_parent() only on success
Once upon a time ->rmdir() instances used to check if victim inode
had more than one (in-core) reference and failed with -EBUSY if it
had. The reason was race avoidance - emptiness check is worthless
if somebody could just go and create new objects in the victim
directory afterwards.
With introduction of dcache the checks had been replaced with
checking the refcount of dentry. However, since a cached negative
lookup leaves a negative child dentry, such a check had led to false
positives - with empty foo/ doing stat foo/bar before rmdir foo
ended up with -EBUSY unless the negative dentry of foo/bar happened
to be evicted by the time of rmdir(2). That had been fixed by
doing shrink_dcache_parent() just before the refcount check.
At the same time, ext2_rmdir() has grown a private solution that
eliminated those -EBUSY - it did something (setting ->i_size to 0)
which made any subsequent ext2_add_entry() fail.
Unfortunately, even with shrink_dcache_parent() the check had been
racy - after all, the victim itself could be found by dcache lookup
just after we'd checked its refcount. That got fixed by a new
helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
the sucker if its refcount ended up equal to 1. That got called before
->rmdir(), turning the checks in ->rmdir() instances into "if not
unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but
had an unpleasant side effect - now shrink_dcache_parent() had been
done before the emptiness checks, leading to easily triggerable calls
of shrink_dcache_parent() on arbitrarily large subtrees, quite possibly
nested into each other.
Several years later the ext2-private trick had been generalized -
(in-core) inodes of dead directories are flagged and calls of
lookup, readdir and all directory-modifying methods were prevented
in so marked directories. Remaining boilerplate in ->rmdir() instances
became redundant and some instances got rid of it.
In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
and then killed off in all of them. That has led to another problem,
though - in case of successful rmdir we *want* any (negative) child
dentries dropped and the victim itself made negative. There's no point
keeping cached negative lookups in foo when we can get the negative
lookup of foo itself cached. So shrink_dcache_parent() call had been
restored; unfortunately, it went into the place where dentry_unhash()
used to be, i.e. before the ->rmdir() call. Note that we don't unhash
anymore, so any "is it busy" checks would be racy; fortunately, all of
them are gone.
We should've done that call right *after* successful ->rmdir(). That
reduces contention caused by tree-walking in shrink_dcache_parent()
and, especially, contention caused by evictions in two nested subtrees
going on in parallel. The same goes for directory-overwriting rename() -
the story there had been parallel to that of rmdir().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-27 16:23:51 -04:00
shrink_dcache_parent ( dentry ) ;
2011-05-24 13:06:11 -07:00
dentry - > d_inode - > i_flags | = S_DEAD ;
dont_mount ( dentry ) ;
vfs: Lazily remove mounts on unlinked files and directories.
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DoS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestor
directories up to / owned by root and modifiable only by root
(e.g. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removal of global rename, unlink, and rmdir protection really adds
nothing new to consider, only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentries that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
Removed unused afs label
v4: Simplified the changes to check_submounts_and_drop
Do not rename check_submounts_and_drop shrink_submounts_and_drop
Document why we need atomicity in check_submounts_and_drop
Rely on the parent inode mutex to make d_revalidate and d_invalidate
an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
renames.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-01 18:33:48 -07:00
detach_mounts ( dentry ) ;
2011-05-24 13:06:11 -07:00
out :
2016-01-22 15:40:57 -05:00
inode_unlock ( dentry - > d_inode ) ;
2011-09-14 18:55:41 +01:00
dput ( dentry ) ;
2011-05-24 13:06:11 -07:00
if ( ! error )
2022-01-20 23:53:04 +02:00
d_delete_notify ( dir , dentry ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_rmdir ) ;
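
The scenario from the shrink_dcache_parent() commit message above (a lookup of foo/bar in an empty foo/, followed by rmdir foo) can be tried from userspace. The following is a minimal sketch, not kernel code: the function name and the /tmp scratch path are assumptions made for illustration, and it only shows that a failed lookup inside an empty directory is not expected to make rmdir(2) fail.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if an empty directory can still be removed
 * after a failed lookup inside it, or a negative errno value otherwise.
 */
int rmdir_after_negative_lookup(void)
{
	char dir[] = "/tmp/rmdir-demo-XXXXXX";	/* arbitrary scratch location */
	char child[sizeof(dir) + 8];
	struct stat st;
	int n;

	if (!mkdtemp(dir))
		return -errno;

	n = snprintf(child, sizeof(child), "%s/bar", dir);
	assert(n > 0 && (size_t)n < sizeof(child));

	/* Look up a name that does not exist; this may cache a negative dentry. */
	if (stat(child, &st) == 0 || errno != ENOENT) {
		rmdir(dir);
		return -EIO;
	}

	/* The directory is still empty, so rmdir() is expected to succeed. */
	if (rmdir(dir) != 0)
		return -errno;

	/* After a successful rmdir the directory itself should be gone. */
	if (stat(dir, &st) == 0)
		return -EEXIST;
	return errno == ENOENT ? 0 : -errno;
}
```

Calling rmdir_after_negative_lookup() from a small main() and checking for a return value of 0 is enough to run the check.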
2005-04-16 15:20:36 -07:00
2021-07-08 13:34:44 +07:00
int do_rmdir ( int dfd , struct filename * name )
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
struct user_namespace * mnt_userns ;
2021-07-08 13:34:38 +07:00
int error ;
2005-04-16 15:20:36 -07:00
struct dentry * dentry ;
2015-04-30 16:09:11 -04:00
struct path path ;
struct qstr last ;
int type ;
2012-12-20 16:28:33 -05:00
unsigned int lookup_flags = 0 ;
retry :
2021-09-07 15:57:42 -04:00
error = filename_parentat ( dfd , name , lookup_flags , & path , & last , & type ) ;
2021-07-08 13:34:38 +07:00
if ( error )
goto exit1 ;
2005-04-16 15:20:36 -07:00
2015-04-30 16:09:11 -04:00
switch ( type ) {
2008-10-16 07:50:29 +09:00
case LAST_DOTDOT :
error = - ENOTEMPTY ;
2021-07-08 13:34:38 +07:00
goto exit2 ;
2008-10-16 07:50:29 +09:00
case LAST_DOT :
error = - EINVAL ;
2021-07-08 13:34:38 +07:00
goto exit2 ;
2008-10-16 07:50:29 +09:00
case LAST_ROOT :
error = - EBUSY ;
2021-07-08 13:34:38 +07:00
goto exit2 ;
2005-04-16 15:20:36 -07:00
}
2008-10-16 07:50:29 +09:00
2015-04-30 16:09:11 -04:00
error = mnt_want_write ( path . mnt ) ;
2012-06-12 16:20:30 +02:00
if ( error )
2021-07-08 13:34:38 +07:00
goto exit2 ;
2008-10-16 07:50:29 +09:00
2016-01-22 15:40:57 -05:00
inode_lock_nested ( path . dentry - > d_inode , I_MUTEX_PARENT ) ;
2015-04-30 16:09:11 -04:00
dentry = __lookup_hash ( & last , path . dentry , lookup_flags ) ;
2005-04-16 15:20:36 -07:00
error = PTR_ERR ( dentry ) ;
2006-09-30 23:29:01 -07:00
if ( IS_ERR ( dentry ) )
2021-07-08 13:34:38 +07:00
goto exit3 ;
2011-06-06 19:19:40 -04:00
if ( ! dentry - > d_inode ) {
error = - ENOENT ;
2021-07-08 13:34:38 +07:00
goto exit4 ;
2011-06-06 19:19:40 -04:00
}
2015-04-30 16:09:11 -04:00
error = security_path_rmdir ( & path , dentry ) ;
2008-12-17 13:24:15 +09:00
if ( error )
2021-07-08 13:34:38 +07:00
goto exit4 ;
2021-01-21 14:19:33 +01:00
mnt_userns = mnt_user_ns ( path . mnt ) ;
error = vfs_rmdir ( mnt_userns , path . dentry - > d_inode , dentry ) ;
2021-07-08 13:34:38 +07:00
exit4 :
2006-09-30 23:29:01 -07:00
dput ( dentry ) ;
2021-07-08 13:34:38 +07:00
exit3 :
2016-01-22 15:40:57 -05:00
inode_unlock ( path . dentry - > d_inode ) ;
2015-04-30 16:09:11 -04:00
mnt_drop_write ( path . mnt ) ;
2021-07-08 13:34:38 +07:00
exit2 :
2015-04-30 16:09:11 -04:00
path_put ( & path ) ;
2012-12-20 16:28:33 -05:00
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
}
2021-07-08 13:34:38 +07:00
exit1 :
2020-08-12 05:15:18 +01:00
putname ( name ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2009-01-14 14:14:22 +01:00
SYSCALL_DEFINE1 ( rmdir , const char __user * , pathname )
2006-01-18 17:43:53 -08:00
{
2020-07-21 10:48:15 +02:00
return do_rmdir ( AT_FDCWD , getname ( pathname ) ) ;
2006-01-18 17:43:53 -08:00
}
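
The switch on the last path component in do_rmdir() above is visible from userspace as distinct errno values. The wrapper below is a small sketch for observing them; the name rmdir_errno is hypothetical and not part of any API.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Call rmdir(2) on path and return 0 on success or the resulting errno. */
int rmdir_errno(const char *path)
{
	assert(path != NULL);
	errno = 0;
	if (rmdir(path) == 0)
		return 0;
	return errno;
}
```

On a kernel matching this listing, rmdir_errno(".") is expected to return EINVAL, rmdir_errno("..") ENOTEMPTY, and rmdir_errno("/") EBUSY, since all three are rejected before the victim is looked up. Other systems may report different values for ".." and "/".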
2011-09-20 09:14:34 -04:00
/**
* vfs_unlink - unlink a filesystem object
2021-01-21 14:19:33 +01:00
* @mnt_userns: user namespace of the mount the inode was found from
2011-09-20 09:14:34 -04:00
* @dir: parent directory
* @dentry: victim
* @delegated_inode: returns victim inode, if the inode is delegated.
*
* The caller must hold dir->i_mutex.
*
* If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
* return a reference to the inode in delegated_inode. The caller
* should then break the delegation on that inode and retry. Because
* breaking a delegation may take a long time, the caller should drop
* dir->i_mutex before doing so.
*
* Alternatively, a caller may pass NULL for delegated_inode. This may
* be appropriate for callers that expect the underlying filesystem not
* to be NFS exported.
2021-01-21 14:19:33 +01:00
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @mnt_userns. This function will then take
* care to map the inode according to @mnt_userns before checking permissions.
* On non-idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns.
2011-09-20 09:14:34 -04:00
*/
2021-01-21 14:19:33 +01:00
int vfs_unlink ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * dentry , struct inode * * delegated_inode )
2005-04-16 15:20:36 -07:00
{
2012-08-28 07:03:24 -04:00
struct inode * target = dentry - > d_inode ;
2021-01-21 14:19:33 +01:00
int error = may_delete ( mnt_userns , dir , dentry , 0 ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > unlink )
2005-04-16 15:20:36 -07:00
return - EPERM ;
2016-01-22 15:40:57 -05:00
inode_lock ( target ) ;
2021-09-02 14:53:57 -07:00
if ( IS_SWAPFILE ( target ) )
error = - EPERM ;
else if ( is_local_mountpoint ( dentry ) )
2005-04-16 15:20:36 -07:00
error = - EBUSY ;
else {
error = security_inode_unlink ( dir , dentry ) ;
2010-03-03 14:12:08 -05:00
if ( ! error ) {
2012-08-28 07:50:40 -07:00
error = try_break_deleg ( target , delegated_inode ) ;
if ( error )
2011-09-20 09:14:34 -04:00
goto out ;
2005-04-16 15:20:36 -07:00
error = dir - > i_op - > unlink ( dir , dentry ) ;
2013-10-01 18:33:48 -07:00
if ( ! error ) {
2010-04-30 17:17:09 -04:00
dont_mount ( dentry ) ;
2013-10-01 18:33:48 -07:00
detach_mounts ( dentry ) ;
}
2010-03-03 14:12:08 -05:00
}
2005-04-16 15:20:36 -07:00
}
2011-09-20 09:14:34 -04:00
out :
2016-01-22 15:40:57 -05:00
inode_unlock ( target ) ;
2005-04-16 15:20:36 -07:00
/* We don't d_delete() NFS sillyrenamed files--they still exist. */
2022-01-20 23:53:04 +02:00
if (!error && dentry->d_flags & DCACHE_NFSFS_RENAMED) {
fsnotify_unlink(dir, dentry);
} else if (!error) {
2012-08-28 07:03:24 -04:00
fsnotify_link_count ( target ) ;
2022-01-20 23:53:04 +02:00
d_delete_notify ( dir , dentry ) ;
2005-04-16 15:20:36 -07:00
}
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_unlink ) ;
2005-04-16 15:20:36 -07:00
/*
* Make sure that the actual truncation of the file will occur outside its
2006-01-09 15:59:24 -08:00
* directory's i_mutex. Truncate can take a long time if there is a lot of
2005-04-16 15:20:36 -07:00
* writeout happening, and we don't want to prevent access to the directory
* while waiting on the I/O.
*/
2021-07-08 13:34:44 +07:00
int do_unlinkat ( int dfd , struct filename * name )
2005-04-16 15:20:36 -07:00
{
2008-07-21 09:32:51 -04:00
int error ;
2005-04-16 15:20:36 -07:00
struct dentry * dentry ;
2015-04-30 16:09:11 -04:00
struct path path ;
struct qstr last ;
int type ;
2005-04-16 15:20:36 -07:00
struct inode * inode = NULL ;
2011-09-20 09:14:34 -04:00
struct inode * delegated_inode = NULL ;
2012-12-20 16:38:04 -05:00
unsigned int lookup_flags = 0 ;
retry :
2021-09-07 15:57:42 -04:00
error = filename_parentat ( dfd , name , lookup_flags , & path , & last , & type ) ;
2021-07-08 13:34:38 +07:00
if ( error )
goto exit1 ;
2008-07-21 09:32:51 -04:00
2005-04-16 15:20:36 -07:00
error = - EISDIR ;
2015-04-30 16:09:11 -04:00
if ( type ! = LAST_NORM )
2021-07-08 13:34:38 +07:00
goto exit2 ;
2008-10-16 07:50:29 +09:00
2015-04-30 16:09:11 -04:00
error = mnt_want_write ( path . mnt ) ;
2012-06-12 16:20:30 +02:00
if ( error )
2021-07-08 13:34:38 +07:00
goto exit2 ;
2011-09-20 09:14:34 -04:00
retry_deleg :
2016-01-22 15:40:57 -05:00
inode_lock_nested ( path . dentry - > d_inode , I_MUTEX_PARENT ) ;
2015-04-30 16:09:11 -04:00
dentry = __lookup_hash ( & last , path . dentry , lookup_flags ) ;
2005-04-16 15:20:36 -07:00
error = PTR_ERR ( dentry ) ;
if ( ! IS_ERR ( dentry ) ) {
2021-01-21 14:19:33 +01:00
struct user_namespace * mnt_userns ;
2005-04-16 15:20:36 -07:00
/* Why not before? Because we want correct error value */
2015-04-30 16:09:11 -04:00
if ( last . name [ last . len ] )
2011-06-16 00:06:14 +03:00
goto slashes ;
2005-04-16 15:20:36 -07:00
inode = dentry - > d_inode ;
2013-09-12 19:22:53 +01:00
if ( d_is_negative ( dentry ) )
2011-06-06 19:19:40 -04:00
goto slashes ;
ihold ( inode ) ;
2015-04-30 16:09:11 -04:00
error = security_path_unlink ( & path , dentry ) ;
2008-12-17 13:24:15 +09:00
if ( error )
2021-07-08 13:34:38 +07:00
goto exit3 ;
2021-01-21 14:19:33 +01:00
mnt_userns = mnt_user_ns ( path . mnt ) ;
2021-01-21 14:19:43 +01:00
error = vfs_unlink ( mnt_userns , path . dentry - > d_inode , dentry ,
& delegated_inode ) ;
2021-07-08 13:34:38 +07:00
exit3 :
2005-04-16 15:20:36 -07:00
dput ( dentry ) ;
}
2016-01-22 15:40:57 -05:00
inode_unlock ( path . dentry - > d_inode ) ;
2005-04-16 15:20:36 -07:00
if ( inode )
iput ( inode ) ; /* truncate the inode here */
2011-09-20 09:14:34 -04:00
inode = NULL ;
if ( delegated_inode ) {
2012-08-28 07:50:40 -07:00
error = break_deleg_wait ( & delegated_inode ) ;
2011-09-20 09:14:34 -04:00
if ( ! error )
goto retry_deleg ;
}
2015-04-30 16:09:11 -04:00
mnt_drop_write ( path . mnt ) ;
2021-07-08 13:34:38 +07:00
exit2 :
2015-04-30 16:09:11 -04:00
path_put ( & path ) ;
2012-12-20 16:38:04 -05:00
if ( retry_estale ( error , lookup_flags ) ) {
lookup_flags | = LOOKUP_REVAL ;
inode = NULL ;
goto retry ;
}
2021-07-08 13:34:38 +07:00
exit1 :
2017-11-04 13:44:45 +03:00
putname ( name ) ;
2005-04-16 15:20:36 -07:00
return error ;
slashes :
2013-09-12 19:22:53 +01:00
if ( d_is_negative ( dentry ) )
error = - ENOENT ;
2014-04-01 17:08:41 +02:00
else if ( d_is_dir ( dentry ) )
2013-09-12 19:22:53 +01:00
error = - EISDIR ;
else
error = - ENOTDIR ;
2021-07-08 13:34:38 +07:00
goto exit3 ;
2005-04-16 15:20:36 -07:00
}
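
The delegation handling in vfs_unlink() and the retry_deleg loop in do_unlinkat() follow one pattern: try the operation without blocking while the parent is locked, and if a delegation is found, drop the lock, wait for the delegation to be broken, then retry. The toy model below sketches only that control flow. Every type and function in it (toy_inode, toy_dir, toy_vfs_unlink, toy_break_deleg_wait, toy_do_unlink) is a hypothetical stand-in and does not exist in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for illustration only. */
struct toy_inode {
	int delegated;	/* non-zero while a delegation is outstanding */
	int nlink;
};

struct toy_dir {
	int locked;	/* non-zero while the parent "lock" is held */
	int attempts;	/* how many times the unlink was attempted */
};

/* Models the non-blocking check: report the inode instead of waiting. */
static int toy_vfs_unlink(struct toy_inode *target,
			  struct toy_inode **delegated_inode)
{
	if (target->delegated) {
		if (delegated_inode)
			*delegated_inode = target;
		return -EWOULDBLOCK;
	}
	target->nlink--;
	return 0;
}

/* Models the blocking wait; it must only run with the parent unlocked. */
static int toy_break_deleg_wait(const struct toy_dir *dir,
				struct toy_inode **delegated_inode)
{
	assert(!dir->locked);
	(*delegated_inode)->delegated = 0;
	*delegated_inode = NULL;
	return 0;
}

int toy_do_unlink(struct toy_dir *dir, struct toy_inode *target)
{
	struct toy_inode *delegated_inode = NULL;
	int error;

retry_deleg:
	dir->locked = 1;
	dir->attempts++;
	error = toy_vfs_unlink(target, &delegated_inode);
	dir->locked = 0;	/* drop the parent lock before any long wait */

	if (delegated_inode) {
		error = toy_break_deleg_wait(dir, &delegated_inode);
		if (!error)
			goto retry_deleg;
	}
	return error;
}
```

With a target that starts out delegated, toy_do_unlink() makes two attempts: the first reports the delegation, the second succeeds after the wait. The real code additionally takes and drops inode references, which this sketch leaves out.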
2009-01-14 14:14:31 +01:00
SYSCALL_DEFINE3 ( unlinkat , int , dfd , const char __user * , pathname , int , flag )
2006-01-18 17:43:53 -08:00
{
if ((flag & ~AT_REMOVEDIR) != 0)
return -EINVAL;
if (flag & AT_REMOVEDIR)
2020-07-21 10:48:15 +02:00
return do_rmdir ( dfd , getname ( pathname ) ) ;
2017-11-04 13:44:45 +03:00
return do_unlinkat ( dfd , getname ( pathname ) ) ;
2006-01-18 17:43:53 -08:00
}
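
The flag handling in the unlinkat() entry point above can be exercised from userspace: an unknown flag bit is rejected with EINVAL, flag 0 takes the unlink path, and AT_REMOVEDIR takes the rmdir path. The sketch below assumes a writable /tmp and uses AT_SYMLINK_NOFOLLOW purely as an example of a bit that unlinkat() does not accept; the function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 0 if every step behaved as expected, otherwise the number of the
 * first step that did not.
 */
int unlinkat_flag_demo(void)
{
	char dir[] = "/tmp/unlinkat-demo-XXXXXX";	/* arbitrary scratch location */
	int dfd, fd, step = 0;

	/* The "invalid" flag used below must carry a bit outside AT_REMOVEDIR. */
	assert((AT_SYMLINK_NOFOLLOW & ~AT_REMOVEDIR) != 0);

	if (!mkdtemp(dir))
		return 1;
	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd < 0) {
		rmdir(dir);
		return 2;
	}
	fd = openat(dfd, "file", O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0) {
		step = 3;
		goto cleanup;
	}
	close(fd);

	/* Any bit other than AT_REMOVEDIR is expected to fail with EINVAL. */
	if (unlinkat(dfd, "file", AT_SYMLINK_NOFOLLOW) != -1 || errno != EINVAL)
		step = 4;

	/* flag == 0 selects the unlink path and removes the regular file. */
	if (unlinkat(dfd, "file", 0) != 0 && step == 0)
		step = 5;

cleanup:
	close(dfd);
	/* AT_REMOVEDIR selects the rmdir path and removes the directory. */
	if (unlinkat(AT_FDCWD, dir, AT_REMOVEDIR) != 0 && step == 0)
		step = 6;
	return step;
}
```

The EINVAL result for the extra flag bit reflects the Linux check shown in this listing; other systems may treat unknown flags differently.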
2009-01-14 14:14:16 +01:00
SYSCALL_DEFINE1 ( unlink , const char __user * , pathname )
2006-01-18 17:43:53 -08:00
{
2017-11-04 13:44:45 +03:00
return do_unlinkat ( AT_FDCWD , getname ( pathname ) ) ;
2006-01-18 17:43:53 -08:00
}
2021-01-21 14:19:33 +01:00
/**
* vfs_symlink - create symlink
* @mnt_userns: user namespace of the mount the inode was found from
* @dir: inode of the parent directory
* @dentry: dentry of the symlink to create
* @oldname: name of the file to link to
*
* Create a symlink.
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @mnt_userns. This function will then take
* care to map the inode according to @mnt_userns before checking permissions.
* On non-idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns.
*/
int vfs_symlink ( struct user_namespace * mnt_userns , struct inode * dir ,
struct dentry * dentry , const char * oldname )
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
int error = may_create ( mnt_userns , dir , dentry ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > symlink )
2005-04-16 15:20:36 -07:00
return - EPERM ;
error = security_inode_symlink ( dir , dentry , oldname ) ;
if ( error )
return error ;
2021-01-21 14:19:43 +01:00
error = dir - > i_op - > symlink ( mnt_userns , dir , dentry , oldname ) ;
2005-09-09 13:01:44 -07:00
if ( ! error )
2005-11-03 15:57:06 +00:00
fsnotify_create ( dir , dentry ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_symlink ) ;
2005-04-16 15:20:36 -07:00
2021-07-08 13:34:46 +07:00
int do_symlinkat ( struct filename * from , int newdfd , struct filename * to )
2005-04-16 15:20:36 -07:00
{
2008-07-21 09:32:51 -04:00
int error ;
2006-09-30 23:29:01 -07:00
struct dentry * dentry ;
2011-06-26 11:50:15 -04:00
struct path path ;
2012-12-11 12:10:08 -05:00
unsigned int lookup_flags = 0 ;
2005-04-16 15:20:36 -07:00
2021-07-08 13:34:41 +07:00
if ( IS_ERR ( from ) ) {
error = PTR_ERR ( from ) ;
goto out_putnames ;
}
2012-12-11 12:10:08 -05:00
retry :
2021-09-01 10:51:43 -07:00
dentry = filename_create ( newdfd , to , & path , lookup_flags ) ;
2006-09-30 23:29:01 -07:00
error = PTR_ERR ( dentry ) ;
if ( IS_ERR ( dentry ) )
2021-07-08 13:34:41 +07:00
goto out_putnames ;
2006-09-30 23:29:01 -07:00
2012-10-10 15:25:28 -04:00
error = security_path_symlink ( & path , dentry , from - > name ) ;
2021-01-21 14:19:33 +01:00
if ( ! error ) {
struct user_namespace * mnt_userns ;
mnt_userns = mnt_user_ns ( path . mnt ) ;
error = vfs_symlink ( mnt_userns , path . dentry - > d_inode , dentry ,
from - > name ) ;
}
2012-07-20 01:15:31 +04:00
done_path_create ( & path , dentry ) ;
2012-12-11 12:10:08 -05:00
if ( retry_estale ( error , lookup_flags ) ) {
lookup_flags | = LOOKUP_REVAL ;
goto retry ;
}
2021-07-08 13:34:41 +07:00
out_putnames :
putname ( to ) ;
2005-04-16 15:20:36 -07:00
putname ( from ) ;
return error ;
}
2018-03-11 11:34:49 +01:00
SYSCALL_DEFINE3 ( symlinkat , const char __user * , oldname ,
int , newdfd , const char __user * , newname )
{
2021-07-08 13:34:41 +07:00
return do_symlinkat ( getname ( oldname ) , newdfd , getname ( newname ) ) ;
2018-03-11 11:34:49 +01:00
}
2009-01-14 14:14:16 +01:00
SYSCALL_DEFINE2 ( symlink , const char __user * , oldname , const char __user * , newname )
2006-01-18 17:43:53 -08:00
{
2021-07-08 13:34:41 +07:00
return do_symlinkat ( getname ( oldname ) , AT_FDCWD , getname ( newname ) ) ;
2006-01-18 17:43:53 -08:00
}
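
Note that do_symlinkat() above only resolves the new name; the old name is handed to vfs_symlink() as a plain string and is never looked up. One visible consequence is that a symlink may point at a target that does not exist. The sketch below shows this from userspace; the function name, the /tmp scratch path and the target string are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Returns 0 if a symlink to a nonexistent target can be created and reads
 * back unchanged, or a negative errno value otherwise.
 */
int dangling_symlink_demo(void)
{
	char dir[] = "/tmp/symlink-demo-XXXXXX";	/* arbitrary scratch location */
	const char *target = "no/such/target";		/* deliberately nonexistent */
	char linkpath[64], buf[64];
	ssize_t n;
	int len, ok;

	if (!mkdtemp(dir))
		return -errno;
	len = snprintf(linkpath, sizeof(linkpath), "%s/link", dir);
	assert(len > 0 && (size_t)len < sizeof(linkpath));

	/* The target string is stored as given; it is not resolved here. */
	if (symlink(target, linkpath) != 0) {
		int saved = errno;

		rmdir(dir);
		return -saved;
	}

	/* readlink() is expected to return exactly the stored string. */
	n = readlink(linkpath, buf, sizeof(buf) - 1);
	ok = n == (ssize_t)strlen(target) && memcmp(buf, target, (size_t)n) == 0;

	unlink(linkpath);
	rmdir(dir);
	return ok ? 0 : -EIO;
}
```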
2011-09-20 17:14:31 -04:00
/**
* vfs_link - create a new link
* @old_dentry: object to be linked
2021-01-21 14:19:33 +01:00
* @mnt_userns: the user namespace of the mount
2011-09-20 17:14:31 -04:00
* @dir: new parent
* @new_dentry: where to create the new link
* @delegated_inode: returns inode needing a delegation break
*
* The caller must hold dir->i_mutex.
*
* If vfs_link discovers a delegation on the to-be-linked file in need
* of breaking, it will return -EWOULDBLOCK and return a reference to the
* inode in delegated_inode. The caller should then break the delegation
* and retry. Because breaking a delegation may take a long time, the
* caller should drop the i_mutex before doing so.
*
* Alternatively, a caller may pass NULL for delegated_inode. This may
* be appropriate for callers that expect the underlying filesystem not
* to be NFS exported.
2021-01-21 14:19:33 +01:00
*
* If the inode has been found through an idmapped mount the user namespace of
* the vfsmount must be passed through @mnt_userns. This function will then take
* care to map the inode according to @mnt_userns before checking permissions.
* On non-idmapped mounts or if permission checking is to be performed on the
* raw inode simply pass init_user_ns.
2011-09-20 17:14:31 -04:00
*/
2021-01-21 14:19:33 +01:00
int vfs_link ( struct dentry * old_dentry , struct user_namespace * mnt_userns ,
struct inode * dir , struct dentry * new_dentry ,
struct inode * * delegated_inode )
2005-04-16 15:20:36 -07:00
{
struct inode * inode = old_dentry - > d_inode ;
2012-02-06 12:45:27 -05:00
unsigned max_links = dir - > i_sb - > s_max_links ;
2005-04-16 15:20:36 -07:00
int error ;
if ( ! inode )
return - ENOENT ;
2021-01-21 14:19:33 +01:00
error = may_create ( mnt_userns , dir , new_dentry ) ;
2005-04-16 15:20:36 -07:00
if ( error )
return error ;
if (dir->i_sb != inode->i_sb)
return -EXDEV;
/*
* A link to an append-only or immutable file cannot be created.
*/
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return -EPERM;
/*
* Updating the link count will likely cause i_uid and i_gid to
* be written back improperly if their true value is unknown to
* the vfs.
*/
2021-01-21 14:19:33 +01:00
if ( HAS_UNMAPPED_ID ( mnt_userns , inode ) )
2016-06-29 14:54:46 -05:00
return - EPERM ;
2008-12-04 10:06:33 -05:00
if ( ! dir - > i_op - > link )
2005-04-16 15:20:36 -07:00
return - EPERM ;
2008-06-24 16:50:15 +02:00
if ( S_ISDIR ( inode - > i_mode ) )
2005-04-16 15:20:36 -07:00
return - EPERM ;
error = security_inode_link ( old_dentry , dir , new_dentry ) ;
if ( error )
return error ;
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2011-01-29 18:43:27 +05:30
/* Make sure we don't allow creating hardlink to an unlinked file */
2013-06-11 08:34:36 +04:00
if ( inode - > i_nlink = = 0 & & ! ( inode - > i_state & I_LINKABLE ) )
2011-01-29 18:43:27 +05:30
error = - ENOENT ;
2012-02-06 12:45:27 -05:00
else if ( max_links & & inode - > i_nlink > = max_links )
error = - EMLINK ;
2011-09-20 17:14:31 -04:00
else {
error = try_break_deleg ( inode , delegated_inode ) ;
if ( ! error )
error = dir - > i_op - > link ( old_dentry , dir , new_dentry ) ;
}
2013-06-11 08:34:36 +04:00
if (!error && (inode->i_state & I_LINKABLE)) {
spin_lock(&inode->i_lock);
inode->i_state &= ~I_LINKABLE;
spin_unlock(&inode->i_lock);
}
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2005-09-09 13:01:45 -07:00
if ( ! error )
2008-06-24 16:50:15 +02:00
fsnotify_link ( dir , inode , new_dentry ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( vfs_link ) ;
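
Two of the checks in vfs_link() above are easy to observe from userspace: linking a regular file raises its link count, while linking a directory is refused with EPERM. The sketch below assumes a writable /tmp on a filesystem that supports hard links; the function name and file names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if every step behaved as expected, otherwise the number of the
 * first step that did not.
 */
int hardlink_demo(void)
{
	char dir[] = "/tmp/link-demo-XXXXXX";	/* arbitrary scratch location */
	char a[64], b[64], d2[64];
	struct stat st;
	int fd, n, step = 0;

	if (!mkdtemp(dir))
		return 1;
	n = snprintf(a, sizeof(a), "%s/a", dir);
	assert(n > 0 && (size_t)n < sizeof(a));
	snprintf(b, sizeof(b), "%s/b", dir);
	snprintf(d2, sizeof(d2), "%s-alias", dir);

	fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0) {
		step = 2;
		goto cleanup;
	}
	close(fd);

	/* Linking a regular file is expected to raise its link count to 2. */
	if (link(a, b) != 0 || stat(a, &st) != 0 || st.st_nlink != 2)
		step = 3;
	/* Linking a directory is expected to be refused with EPERM. */
	else if (link(dir, d2) != -1 || errno != EPERM)
		step = 4;

cleanup:
	unlink(d2);	/* only exists if the directory link unexpectedly worked */
	unlink(b);
	unlink(a);
	rmdir(dir);
	return step;
}
```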
2005-04-16 15:20:36 -07:00
/*
* Hardlinks are often used in delicate situations. We avoid
* security-related surprises by not following symlinks on the
* newname. --KAB
*
* We don't follow them on the oldname either to be compatible
* with linux 2.0, and to avoid hard-linking to directories
* and other special files. --ADM
*/
2021-07-08 13:34:47 +07:00
int do_linkat(int olddfd, struct filename *old, int newdfd,
2021-07-08 13:34:43 +07:00
	      struct filename *new, int flags)
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:33 +01:00
	struct user_namespace *mnt_userns;
2005-04-16 15:20:36 -07:00
	struct dentry *new_dentry;
2011-06-26 11:50:15 -04:00
	struct path old_path, new_path;
2011-09-20 17:14:31 -04:00
	struct inode *delegated_inode = NULL;
2011-01-29 18:43:42 +05:30
	int how = 0;
2005-04-16 15:20:36 -07:00
	int error;
2021-07-08 13:34:43 +07:00
	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
		error = -EINVAL;
		goto out_putnames;
	}
2011-01-29 18:43:42 +05:30
	/*
2013-08-28 09:18:05 -07:00
	 * To use null names we require CAP_DAC_READ_SEARCH.
	 * This ensures that not everyone will be able to create
	 * a hardlink using the passed file descriptor.
2011-01-29 18:43:42 +05:30
	 */
2021-07-08 13:34:43 +07:00
	if (flags & AT_EMPTY_PATH && !capable(CAP_DAC_READ_SEARCH)) {
		error = -ENOENT;
		goto out_putnames;
2013-08-28 09:18:05 -07:00
	}
2011-01-29 18:43:42 +05:30
	if (flags & AT_SYMLINK_FOLLOW)
		how |= LOOKUP_FOLLOW;
2012-12-20 16:15:38 -05:00
retry:
2021-09-01 10:51:42 -07:00
	error = filename_lookup(olddfd, old, how, &old_path, NULL);
2005-04-16 15:20:36 -07:00
	if (error)
2021-07-08 13:34:43 +07:00
		goto out_putnames;
2008-07-21 09:32:51 -04:00
2021-09-01 10:51:43 -07:00
	new_dentry = filename_create(newdfd, new, &new_path,
2012-12-20 16:15:38 -05:00
					(how & LOOKUP_REVAL));
2005-04-16 15:20:36 -07:00
	error = PTR_ERR(new_dentry);
2006-09-30 23:29:01 -07:00
	if (IS_ERR(new_dentry))
2021-07-08 13:34:43 +07:00
		goto out_putpath;
2011-06-26 11:50:15 -04:00
	error = -EXDEV;
	if (old_path.mnt != new_path.mnt)
		goto out_dput;
2021-01-21 14:19:43 +01:00
	mnt_userns = mnt_user_ns(new_path.mnt);
	error = may_linkat(mnt_userns, &old_path);
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
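The rule above can be read as a small predicate. What follows is a minimal user-space sketch of that reading, not the kernel's implementation: `struct follow_ctx` and `may_follow_symlink()` are hypothetical names introduced here, and the caller is assumed to have already collected the relevant uids and the directory mode. The octal constants stand for the sticky bit and the "writable by others" bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical inputs for the check; the names are illustrative only. */
struct follow_ctx {
	uid_t follower_uid;	/* uid of the process following the link */
	uid_t link_uid;		/* owner of the symlink itself */
	uid_t dir_uid;		/* owner of the directory holding the link */
	mode_t dir_mode;	/* permission bits of that directory */
};

/* Returns true when the policy described above would allow the follow. */
static bool may_follow_symlink(const struct follow_ctx *c)
{
	/* 01000 is the sticky bit, 0002 is "writable by others". */
	const mode_t sticky_ww = 01000 | 0002;

	assert(c != NULL);

	/* Outside a sticky world-writable directory: always allowed. */
	if ((c->dir_mode & sticky_ww) != sticky_ww)
		return true;
	/* The follower owns the symlink. */
	if (c->follower_uid == c->link_uid)
		return true;
	/* The directory owner owns the symlink. */
	return c->dir_uid == c->link_uid;
}
```

With a mode of 01777 (as on a typical /tmp), a root process following a link owned by uid 1000 is refused under this sketch unless the directory itself belongs to uid 1000.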
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREAT|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
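As a rough illustration of the hardlink rule just stated, here is a minimal user-space sketch. It is not the kernel's check: `struct hardlink_ctx` and `may_create_hardlink()` are hypothetical names, and the read/write access results are assumed to have been computed elsewhere by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical inputs; how access is determined is left to the caller. */
struct hardlink_ctx {
	uid_t caller_uid;	/* uid of the process calling link() */
	uid_t file_uid;		/* owner of the existing file */
	bool can_read;		/* caller already has read access */
	bool can_write;		/* caller already has write access */
};

/* Returns true when the policy described above would allow the link. */
static bool may_create_hardlink(const struct hardlink_ctx *c)
{
	assert(c != NULL);

	/* The owner of the existing file may always link to it. */
	if (c->caller_uid == c->file_uid)
		return true;
	/* Otherwise both read and write access are required. */
	return c->can_read && c->can_write;
}
```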
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-25 17:29:07 -07:00
	if (unlikely(error))
		goto out_dput;
2011-06-26 11:50:15 -04:00
	error = security_path_link(old_path.dentry, &new_path, new_dentry);
2008-12-17 13:24:15 +09:00
	if (error)
2012-07-20 02:25:00 +04:00
		goto out_dput;
2021-01-21 14:19:33 +01:00
	error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
			 new_dentry, &delegated_inode);
2008-02-15 14:37:45 -08:00
out_dput:
2012-07-20 01:15:31 +04:00
	done_path_create(&new_path, new_dentry);
2011-09-20 17:14:31 -04:00
	if (delegated_inode) {
		error = break_deleg_wait(&delegated_inode);
2014-01-31 15:41:58 -05:00
		if (!error) {
			path_put(&old_path);
2011-09-20 17:14:31 -04:00
			goto retry;
2014-01-31 15:41:58 -05:00
		}
2011-09-20 17:14:31 -04:00
	}
2012-12-20 16:15:38 -05:00
	if (retry_estale(error, how)) {
2014-01-31 15:41:58 -05:00
		path_put(&old_path);
2012-12-20 16:15:38 -05:00
		how |= LOOKUP_REVAL;
		goto retry;
	}
2021-07-08 13:34:43 +07:00
out_putpath:
2008-07-22 09:59:21 -04:00
	path_put(&old_path);
2021-07-08 13:34:43 +07:00
out_putnames:
	putname(old);
	putname(new);
2005-04-16 15:20:36 -07:00
	return error;
}
2018-03-11 11:34:53 +01:00
SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
		int, newdfd, const char __user *, newname, int, flags)
{
2021-07-08 13:34:43 +07:00
	return do_linkat(olddfd, getname_uflags(oldname, flags),
		newdfd, getname(newname), flags);
2018-03-11 11:34:53 +01:00
}
2009-01-14 14:14:16 +01:00
SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname)
2006-01-18 17:43:53 -08:00
{
2021-07-08 13:34:43 +07:00
	return do_linkat(AT_FDCWD, getname(oldname), AT_FDCWD, getname(newname), 0);
2006-01-18 17:43:53 -08:00
}
2014-04-01 17:08:42 +02:00
/**
 * vfs_rename - rename a filesystem object
2021-02-15 20:29:28 -08:00
 * @rd:		pointer to &struct renamedata info
2014-04-01 17:08:42 +02:00
 *
 * The caller must hold multiple mutexes--see lock_rename().
 *
 * If vfs_rename discovers a delegation in need of breaking at either
 * the source or destination, it will return -EWOULDBLOCK and return a
 * reference to the inode in delegated_inode.  The caller should then
 * break the delegation and retry.  Because breaking a delegation may
 * take a long time, the caller should drop all locks before doing
 * so.
 *
 * Alternatively, a caller may pass NULL for delegated_inode.  This may
 * be appropriate for callers that expect the underlying filesystem not
 * to be NFS exported.
 *
2005-04-16 15:20:36 -07:00
 * The worst of all namespace operations - renaming directory. "Perverted"
 * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
 * Problems:
2017-05-12 07:45:42 -03:00
 *
2014-02-17 16:52:33 -05:00
 *	a) we can get into loop creation.
2005-04-16 15:20:36 -07:00
 *	b) race potential - two innocent renames can create a loop together.
 *	   That's where 4.4 screws up. Current fix: serialization on
2006-03-23 03:00:33 -08:00
 *	   sb->s_vfs_rename_mutex. We might be more accurate, but that's another
2005-04-16 15:20:36 -07:00
 *	   story.
2012-03-05 11:40:41 -05:00
 *	c) we have to lock _four_ objects - parents and victim (if it exists),
 *	   and source (if it is not a directory).
2006-01-09 15:59:24 -08:00
 *	   And that - after we got ->i_mutex on parents (until then we don't know
2005-04-16 15:20:36 -07:00
 *	   whether the target exists).  Solution: try to be smart with locking
 *	   order for inodes.  We rely on the fact that tree topology may change
2006-03-23 03:00:33 -08:00
 *	   only under ->s_vfs_rename_mutex _and_ that parent of the object we
2005-04-16 15:20:36 -07:00
 *	   move will be locked.  Thus we can rank directories by the tree
 *	   (ancestors first) and rank all non-directories after them.
 *	   That works since everybody except rename does "lock parent, lookup,
2006-03-23 03:00:33 -08:00
 *	   lock child" and rename is under ->s_vfs_rename_mutex.
2005-04-16 15:20:36 -07:00
 *	   HOWEVER, it relies on the assumption that any object with ->lookup()
 *	   has no more than 1 dentry.  If "hybrid" objects will ever appear,
 *	   we'd better make sure that there's no link(2) for them.
2011-05-24 13:06:07 -07:00
 *	d) conversion from fhandle to dentry may come in the wrong moment - when
2006-01-09 15:59:24 -08:00
 *	   we are removing the target. Solution: we will have to grab ->i_mutex
2005-04-16 15:20:36 -07:00
 *	   in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on
2009-12-11 16:35:39 -05:00
 *	   ->i_mutex on parents, which works but leads to some truly excessive
2005-04-16 15:20:36 -07:00
 *	   locking].
 */
2021-01-21 14:19:32 +01:00
int vfs_rename(struct renamedata *rd)
2005-04-16 15:20:36 -07:00
{
2014-04-01 17:08:42 +02:00
	int error;
2021-01-21 14:19:32 +01:00
	struct inode *old_dir = rd->old_dir, *new_dir = rd->new_dir;
	struct dentry *old_dentry = rd->old_dentry;
	struct dentry *new_dentry = rd->new_dentry;
	struct inode **delegated_inode = rd->delegated_inode;
	unsigned int flags = rd->flags;
2014-04-01 17:08:42 +02:00
	bool is_dir = d_is_dir(old_dentry);
	struct inode *source = old_dentry->d_inode;
2011-05-24 13:06:12 -07:00
	struct inode *target = new_dentry->d_inode;
2014-04-01 17:08:43 +02:00
	bool new_is_dir = false;
	unsigned max_links = new_dir->i_sb->s_max_links;
2017-07-07 14:51:19 -04:00
	struct name_snapshot old_name;
2014-04-01 17:08:42 +02:00
2016-12-16 11:02:54 +01:00
	if (source == target)
2014-04-01 17:08:42 +02:00
		return 0;
2021-01-21 14:19:33 +01:00
	error = may_delete(rd->old_mnt_userns, old_dir, old_dentry, is_dir);
2014-04-01 17:08:42 +02:00
	if (error)
		return error;
2014-04-01 17:08:43 +02:00
	if (!target) {
2021-01-21 14:19:33 +01:00
		error = may_create(rd->new_mnt_userns, new_dir, new_dentry);
2014-04-01 17:08:43 +02:00
	} else {
		new_is_dir = d_is_dir(new_dentry);
		if (!(flags & RENAME_EXCHANGE))
2021-01-21 14:19:33 +01:00
			error = may_delete(rd->new_mnt_userns, new_dir,
					   new_dentry, is_dir);
2014-04-01 17:08:43 +02:00
		else
2021-01-21 14:19:33 +01:00
			error = may_delete(rd->new_mnt_userns, new_dir,
					   new_dentry, new_is_dir);
2014-04-01 17:08:43 +02:00
	}
2014-04-01 17:08:42 +02:00
	if (error)
		return error;
2016-09-27 11:03:58 +02:00
	if (!old_dir->i_op->rename)
2014-04-01 17:08:42 +02:00
		return -EPERM;
2005-04-16 15:20:36 -07:00
	/*
	 * If we are going to change the parent - check write permissions,
	 * we'll need to flip '..'.
	 */
2014-04-01 17:08:43 +02:00
	if (new_dir != old_dir) {
		if (is_dir) {
2021-01-21 14:19:33 +01:00
			error = inode_permission(rd->old_mnt_userns, source,
2021-01-21 14:19:24 +01:00
						 MAY_WRITE);
2014-04-01 17:08:43 +02:00
			if (error)
				return error;
		}
		if ((flags & RENAME_EXCHANGE) && new_is_dir) {
2021-01-21 14:19:33 +01:00
			error = inode_permission(rd->new_mnt_userns, target,
2021-01-21 14:19:24 +01:00
						 MAY_WRITE);
2014-04-01 17:08:43 +02:00
			if (error)
				return error;
		}
2005-04-16 15:20:36 -07:00
	}
2014-04-01 17:08:43 +02:00
	error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry,
				      flags);
2005-04-16 15:20:36 -07:00
	if (error)
		return error;
2017-07-07 14:51:19 -04:00
	take_dentry_name_snapshot(&old_name, old_dentry);
2011-09-14 18:55:41 +01:00
	dget(new_dentry);
2014-04-01 17:08:43 +02:00
	if (!is_dir || (flags & RENAME_EXCHANGE))
2014-04-01 17:08:42 +02:00
		lock_two_nondirectories(source, target);
	else if (target)
2016-01-22 15:40:57 -05:00
		inode_lock(target);
2011-05-24 13:06:12 -07:00
2021-09-02 14:53:57 -07:00
	error = -EPERM;
	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
		goto out;
2011-05-24 13:06:12 -07:00
	error = -EBUSY;
2013-10-04 19:15:13 -07:00
	if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
2011-05-24 13:06:12 -07:00
		goto out;
2014-04-01 17:08:43 +02:00
	if (max_links && new_dir != old_dir) {
2014-04-01 17:08:42 +02:00
		error = -EMLINK;
2014-04-01 17:08:43 +02:00
		if (is_dir && !new_is_dir && new_dir->i_nlink >= max_links)
2014-04-01 17:08:42 +02:00
			goto out;
2014-04-01 17:08:43 +02:00
		if ((flags & RENAME_EXCHANGE) && !is_dir && new_is_dir &&
		    old_dir->i_nlink >= max_links)
			goto out;
	}
	if (!is_dir) {
2014-04-01 17:08:42 +02:00
		error = try_break_deleg(source, delegated_inode);
2011-09-20 16:59:58 -04:00
		if (error)
			goto out;
2014-04-01 17:08:43 +02:00
	}
	if (target && !new_is_dir) {
		error = try_break_deleg(target, delegated_inode);
		if (error)
			goto out;
2011-09-20 16:59:58 -04:00
	}
2021-01-21 14:19:43 +01:00
	error = old_dir->i_op->rename(rd->new_mnt_userns, old_dir, old_dentry,
				      new_dir, new_dentry, flags);
2011-05-24 13:06:13 -07:00
	if (error)
		goto out;
2014-04-01 17:08:43 +02:00
	if (!(flags & RENAME_EXCHANGE) && target) {
rmdir(),rename(): do shrink_dcache_parent() only on success
Once upon a time ->rmdir() instances used to check if victim inode
had more than one (in-core) reference and failed with -EBUSY if it
had. The reason was race avoidance - emptiness check is worthless
if somebody could just go and create new objects in the victim
directory afterwards.
With introduction of dcache the checks had been replaced with
checking the refcount of dentry. However, since a cached negative
lookup leaves a negative child dentry, such a check had led to false
positives - with empty foo/ doing stat foo/bar before rmdir foo
ended up with -EBUSY unless the negative dentry of foo/bar happened
to be evicted by the time of rmdir(2). That had been fixed by
doing shrink_dcache_parent() just before the refcount check.
At the same time, ext2_rmdir() has grown a private solution that
eliminated those -EBUSY - it did something (setting ->i_size to 0)
which made any subsequent ext2_add_entry() fail.
Unfortunately, even with shrink_dcache_parent() the check had been
racy - after all, the victim itself could be found by dcache lookup
just after we'd checked its refcount. That got fixed by a new
helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
the sucker if its refcount ended up equal to 1. That got called before
->rmdir(), turning the checks in ->rmdir() instances into "if not
unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but
had an unpleasant side effect - now shrink_dcache_parent() had been
done before the emptiness checks, leading to easily triggerable calls
of shrink_dcache_parent() on arbitrarily large subtrees, quite possibly
nested into each other.
Several years later the ext2-private trick had been generalized -
(in-core) inodes of dead directories are flagged and calls of
lookup, readdir and all directory-modifying methods were prevented
in so marked directories. Remaining boilerplate in ->rmdir() instances
became redundant and some instances got rid of it.
In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
and then killed off in all of them. That has led to another problem,
though - in case of successful rmdir we *want* any (negative) child
dentries dropped and the victim itself made negative. There's no point
keeping cached negative lookups in foo when we can get the negative
lookup of foo itself cached. So shrink_dcache_parent() call had been
restored; unfortunately, it went into the place where dentry_unhash()
used to be, i.e. before the ->rmdir() call. Note that we don't unhash
anymore, so any "is it busy" checks would be racy; fortunately, all of
them are gone.
We should've done that call right *after* successful ->rmdir(). That
reduces contention caused by tree-walking in shrink_dcache_parent()
and, especially, contention caused by evictions in two nested subtrees
going on in parallel. The same goes for directory-overwriting rename() -
the story there had been parallel to that of rmdir().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-27 16:23:51 -04:00
		if (is_dir) {
			shrink_dcache_parent(new_dentry);
2014-04-01 17:08:42 +02:00
			target->i_flags |= S_DEAD;
2018-05-27 16:23:51 -04:00
		}
2011-05-24 13:06:13 -07:00
		dont_mount(new_dentry);
vfs: Lazily remove mounts on unlinked files and directories.
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DoS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestor
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removal of the global rename, unlink, and rmdir protection really adds
nothing new to consider, only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentries that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
    Removed unused afs label
v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop to shrink_submounts_and_drop
    Document why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
    renames.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-01 18:33:48 -07:00
		detach_mounts(new_dentry);
2014-04-01 17:08:42 +02:00
	}
2014-04-01 17:08:43 +02:00
	if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE)) {
		if (!(flags & RENAME_EXCHANGE))
			d_move(old_dentry, new_dentry);
		else
			d_exchange(old_dentry, new_dentry);
	}
2011-05-24 13:06:13 -07:00
out:
2014-04-01 17:08:43 +02:00
	if (!is_dir || (flags & RENAME_EXCHANGE))
2014-04-01 17:08:42 +02:00
		unlock_two_nondirectories(source, target);
	else if (target)
2016-01-22 15:40:57 -05:00
		inode_unlock(target);
2005-04-16 15:20:36 -07:00
	dput(new_dentry);
2014-04-01 17:08:43 +02:00
	if (!error) {
2019-04-26 13:21:24 -04:00
		fsnotify_move(old_dir, new_dir, &old_name.name, is_dir,
2014-04-01 17:08:43 +02:00
			      !(flags & RENAME_EXCHANGE) ? target : NULL, old_dentry);
		if (flags & RENAME_EXCHANGE) {
2019-04-26 13:21:24 -04:00
			fsnotify_move(new_dir, old_dir, &old_dentry->d_name,
2014-04-01 17:08:43 +02:00
				      new_is_dir, NULL, new_dentry);
		}
	}
2017-07-07 14:51:19 -04:00
	release_dentry_name_snapshot(&old_name);
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
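To make the interface points above concrete, here is a minimal user-space sketch, assuming a Linux system with the `<sys/inotify.h>` API. The helper `first_create_event()` and its parameters are hypothetical names introduced for illustration. It watches a directory, creates a file in it, waits on the inotify fd with select(), and returns the mask of the first event read.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Watch `dir` for new entries, create `dir/name`, and return the mask of
 * the first event delivered on the inotify fd (0 on failure or timeout).
 * Illustrative helper only; it fails if `dir/name` already exists.
 */
static uint32_t first_create_event(const char *dir, const char *name)
{
	/* Room for one event header plus a file name. */
	char buf[sizeof(struct inotify_event) + 256]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	char path[4096];
	struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
	uint32_t mask = 0;
	fd_set rfds;
	int fd, file;

	fd = inotify_init();		/* a single fd carries all watches */
	if (fd < 0)
		return 0;
	if (inotify_add_watch(fd, dir, IN_CREATE) < 0)
		goto out;

	snprintf(path, sizeof(path), "%s/%s", dir, name);
	file = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (file < 0)
		goto out;
	close(file);

	/* The inotify fd can be waited on like any other descriptor. */
	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0 &&
	    read(fd, buf, sizeof(buf)) >= (ssize_t)sizeof(struct inotify_event))
		mask = ((const struct inotify_event *)buf)->mask;

	unlink(path);
out:
	close(fd);
	return mask;
}
```

A caller would test the returned mask against `IN_CREATE`; a real consumer would loop over every event in the buffer rather than inspect only the first one.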
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 17:06:03 -04:00
2005-04-16 15:20:36 -07:00
	return error;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL(vfs_rename);
2005-04-16 15:20:36 -07:00
2020-09-26 17:20:17 -06:00
int do_renameat2(int olddfd, struct filename *from, int newdfd,
		 struct filename *to, unsigned int flags)
2005-04-16 15:20:36 -07:00
{
2021-01-21 14:19:32 +01:00
	struct renamedata rd;
2008-07-21 09:32:51 -04:00
	struct dentry *old_dentry, *new_dentry;
	struct dentry *trap;
2015-04-30 16:09:11 -04:00
	struct path old_path, new_path;
	struct qstr old_last, new_last;
	int old_type, new_type;
2011-09-20 16:59:58 -04:00
	struct inode *delegated_inode = NULL;
2015-04-30 16:09:11 -04:00
	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
2012-12-11 12:10:10 -05:00
	bool should_retry = false;
2020-09-26 17:20:17 -06:00
	int error = -EINVAL;
2014-04-01 17:08:42 +02:00
2014-10-24 00:14:37 +02:00
	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
2021-07-08 13:34:38 +07:00
		goto put_names;
2014-04-01 17:08:43 +02:00
2014-10-24 00:14:37 +02:00
	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
	    (flags & RENAME_EXCHANGE))
2021-07-08 13:34:38 +07:00
		goto put_names;
2014-04-01 17:08:42 +02:00
2015-04-30 16:09:11 -04:00
	if (flags & RENAME_EXCHANGE)
		target_flags = 0;
2012-12-11 12:10:10 -05:00
retry:
2021-09-07 15:57:42 -04:00
	error = filename_parentat(olddfd, from, lookup_flags, &old_path,
				  &old_last, &old_type);
2021-07-08 13:34:38 +07:00
	if (error)
		goto put_names;
2005-04-16 15:20:36 -07:00
2021-09-07 15:57:42 -04:00
	error = filename_parentat(newdfd, to, lookup_flags, &new_path, &new_last,
				  &new_type);
2021-07-08 13:34:38 +07:00
	if (error)
2005-04-16 15:20:36 -07:00
		goto exit1;
	error = -EXDEV;
2015-04-30 16:09:11 -04:00
	if (old_path.mnt != new_path.mnt)
2005-04-16 15:20:36 -07:00
		goto exit2;
	error = -EBUSY;
2015-04-30 16:09:11 -04:00
	if (old_type != LAST_NORM)
2005-04-16 15:20:36 -07:00
		goto exit2;
2014-04-01 17:08:43 +02:00
	if (flags & RENAME_NOREPLACE)
		error = -EEXIST;
2015-04-30 16:09:11 -04:00
	if (new_type != LAST_NORM)
2005-04-16 15:20:36 -07:00
		goto exit2;
2015-04-30 16:09:11 -04:00
	error = mnt_want_write(old_path.mnt);
2012-06-12 16:20:30 +02:00
	if (error)
		goto exit2;
2011-09-20 16:59:58 -04:00
retry_deleg:
2015-04-30 16:09:11 -04:00
	trap = lock_rename(new_path.dentry, old_path.dentry);
2005-04-16 15:20:36 -07:00
2015-04-30 16:09:11 -04:00
	old_dentry = __lookup_hash(&old_last, old_path.dentry, lookup_flags);
2005-04-16 15:20:36 -07:00
	error = PTR_ERR(old_dentry);
	if (IS_ERR(old_dentry))
		goto exit3;
	/* source must exist */
	error = -ENOENT;
2013-09-12 19:22:53 +01:00
	if (d_is_negative(old_dentry))
2005-04-16 15:20:36 -07:00
		goto exit4;
2015-04-30 16:09:11 -04:00
	new_dentry = __lookup_hash(&new_last, new_path.dentry, lookup_flags | target_flags);
2014-04-01 17:08:43 +02:00
	error = PTR_ERR(new_dentry);
	if (IS_ERR(new_dentry))
		goto exit4;
	error = -EEXIST;
	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
		goto exit5;
2014-04-01 17:08:43 +02:00
	if (flags & RENAME_EXCHANGE) {
		error = -ENOENT;
		if (d_is_negative(new_dentry))
			goto exit5;
		if (!d_is_dir(new_dentry)) {
			error = -ENOTDIR;
2015-04-30 16:09:11 -04:00
			if (new_last.name[new_last.len])
2014-04-01 17:08:43 +02:00
				goto exit5;
		}
	}
2005-04-16 15:20:36 -07:00
	/* unless the source is a directory, trailing slashes give -ENOTDIR */
2014-04-01 17:08:41 +02:00
	if (!d_is_dir(old_dentry)) {
2005-04-16 15:20:36 -07:00
		error = -ENOTDIR;
2015-04-30 16:09:11 -04:00
		if (old_last.name[old_last.len])
2014-04-01 17:08:43 +02:00
			goto exit5;
2015-04-30 16:09:11 -04:00
		if (!(flags & RENAME_EXCHANGE) && new_last.name[new_last.len])
2014-04-01 17:08:43 +02:00
			goto exit5;
2005-04-16 15:20:36 -07:00
	}
	/* source should not be ancestor of target */
	error = -EINVAL;
	if (old_dentry == trap)
2014-04-01 17:08:43 +02:00
		goto exit5;
2005-04-16 15:20:36 -07:00
	/* target should not be an ancestor of source */
2014-04-01 17:08:43 +02:00
	if (!(flags & RENAME_EXCHANGE))
		error = -ENOTEMPTY;
2005-04-16 15:20:36 -07:00
	if (new_dentry == trap)
		goto exit5;
2015-04-30 16:09:11 -04:00
	error = security_path_rename(&old_path, old_dentry,
				     &new_path, new_dentry, flags);
2008-12-17 13:24:15 +09:00
	if (error)
2012-06-12 16:20:30 +02:00
		goto exit5;
2021-01-21 14:19:32 +01:00
rd . old_dir = old_path . dentry - > d_inode ;
rd . old_dentry = old_dentry ;
2021-01-21 14:19:33 +01:00
rd . old_mnt_userns = mnt_user_ns ( old_path . mnt ) ;
2021-01-21 14:19:32 +01:00
rd . new_dir = new_path . dentry - > d_inode ;
rd . new_dentry = new_dentry ;
2021-01-21 14:19:33 +01:00
rd . new_mnt_userns = mnt_user_ns ( new_path . mnt ) ;
2021-01-21 14:19:32 +01:00
rd . delegated_inode = & delegated_inode ;
rd . flags = flags ;
error = vfs_rename ( & rd ) ;
2005-04-16 15:20:36 -07:00
exit5 :
dput ( new_dentry ) ;
exit4 :
dput ( old_dentry ) ;
exit3 :
2015-04-30 16:09:11 -04:00
unlock_rename ( new_path . dentry , old_path . dentry ) ;
2011-09-20 16:59:58 -04:00
if ( delegated_inode ) {
error = break_deleg_wait ( & delegated_inode ) ;
if ( ! error )
goto retry_deleg ;
}
2015-04-30 16:09:11 -04:00
mnt_drop_write ( old_path . mnt ) ;
2005-04-16 15:20:36 -07:00
exit2 :
2012-12-11 12:10:10 -05:00
if ( retry_estale ( error , lookup_flags ) )
should_retry = true ;
2015-04-30 16:09:11 -04:00
path_put ( & new_path ) ;
2005-04-16 15:20:36 -07:00
exit1 :
2015-04-30 16:09:11 -04:00
path_put ( & old_path ) ;
2012-12-11 12:10:10 -05:00
if ( should_retry ) {
should_retry = false ;
lookup_flags |= LOOKUP_REVAL ;
goto retry ;
}
2021-07-08 13:34:38 +07:00
put_names :
2021-07-08 13:34:37 +07:00
putname ( from ) ;
putname ( to ) ;
2005-04-16 15:20:36 -07:00
return error ;
}
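The trailing-slash and RENAME_EXCHANGE checks near the end of do_renameat2() are easy to misread because `error` is set before each test and only takes effect on the `goto`. The sketch below restates just those checks as a pure userspace function. All `model_` and `MODEL_` names and the `rename_case` struct are hypothetical, introduced only for illustration. The earlier lookups, the ancestor (`trap`) checks, and the security hook are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit as RENAME_EXCHANGE in <linux/fs.h>; redefined so the sketch builds anywhere. */
#define MODEL_RENAME_EXCHANGE (1u << 1)

/* Hypothetical summary of what the two lookups found. */
struct rename_case {
	bool old_is_dir;
	bool new_exists;          /* false models a negative new_dentry */
	bool new_is_dir;
	bool old_trailing_slash;  /* old_last.name[old_last.len] != 0 */
	bool new_trailing_slash;  /* new_last.name[new_last.len] != 0 */
};

/* Returns 0 if the modelled checks pass, or a negative errno. */
static int model_rename_precheck(const struct rename_case *c, unsigned int flags)
{
	bool exchange = (flags & MODEL_RENAME_EXCHANGE) != 0;

	assert(c != NULL);

	if (exchange) {
		/* An exchange needs an existing target. */
		if (!c->new_exists)
			return -ENOENT;
		/* A non-directory target may not carry trailing slashes. */
		if (!c->new_is_dir && c->new_trailing_slash)
			return -ENOTDIR;
	}

	/* Unless the source is a directory, trailing slashes give -ENOTDIR. */
	if (!c->old_is_dir) {
		if (c->old_trailing_slash)
			return -ENOTDIR;
		if (!exchange && c->new_trailing_slash)
			return -ENOTDIR;
	}
	return 0;
}
```

Under this model, renaming a regular file onto a name with a trailing slash fails with -ENOTDIR, while the same call with a directory as the source passes these particular checks.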
2018-03-11 11:34:28 +01:00
SYSCALL_DEFINE5 ( renameat2 , int , olddfd , const char __user * , oldname ,
int , newdfd , const char __user * , newname , unsigned int , flags )
{
2020-09-26 17:20:17 -06:00
return do_renameat2 ( olddfd , getname ( oldname ) , newdfd , getname ( newname ) ,
flags ) ;
2018-03-11 11:34:28 +01:00
}
2014-04-01 17:08:42 +02:00
SYSCALL_DEFINE4 ( renameat , int , olddfd , const char __user * , oldname ,
int , newdfd , const char __user * , newname )
{
2020-09-26 17:20:17 -06:00
return do_renameat2 ( olddfd , getname ( oldname ) , newdfd , getname ( newname ) ,
0 ) ;
2014-04-01 17:08:42 +02:00
}
2009-01-14 14:14:17 +01:00
SYSCALL_DEFINE2 ( rename , const char __user * , oldname , const char __user * , newname )
2006-01-18 17:43:53 -08:00
{
2020-09-26 17:20:17 -06:00
return do_renameat2 ( AT_FDCWD , getname ( oldname ) , AT_FDCWD ,
getname ( newname ) , 0 ) ;
2006-01-18 17:43:53 -08:00
}
2014-03-14 13:42:45 -04:00
int readlink_copy ( char __user * buffer , int buflen , const char * link )
2005-04-16 15:20:36 -07:00
{
2014-03-14 13:42:45 -04:00
int len = PTR_ERR ( link ) ;
2005-04-16 15:20:36 -07:00
if ( IS_ERR ( link ) )
goto out ;
len = strlen ( link ) ;
if ( len > ( unsigned ) buflen )
len = buflen ;
if ( copy_to_user ( buffer , link , len ) )
len = - EFAULT ;
out :
return len ;
}
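As a rough illustration of the truncation behaviour of readlink_copy(), here is a userspace model. It assumes memcpy() as a stand-in for copy_to_user(), so the -EFAULT path is not modelled, and the name `model_readlink_copy` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of readlink_copy(): copies at most buflen bytes of the
 * link body and returns the number of bytes copied.
 */
static int model_readlink_copy(char *buffer, int buflen, const char *link)
{
	int len;

	assert(buffer != NULL && link != NULL && buflen >= 0);

	len = (int)strlen(link);
	/* Same clamp as the kernel code: compare as unsigned, then truncate. */
	if ((unsigned)len > (unsigned)buflen)
		len = buflen;
	memcpy(buffer, link, (size_t)len);
	return len; /* no NUL terminator is written */
}
```

The point to notice is that the result is silently truncated and never NUL-terminated, so a caller has to use the returned length rather than treat the buffer as a C string.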
2016-12-09 16:45:04 +01:00
/**
* vfs_readlink - copy symlink body into userspace buffer
* @dentry: dentry on which to get symbolic link
* @buffer: user memory pointer
* @buflen: size of buffer
*
* Does not touch atime. That's up to the caller if necessary.
*
* Does not call security hook.
*/
int vfs_readlink ( struct dentry * dentry , char __user * buffer , int buflen )
{
struct inode * inode = d_inode ( dentry ) ;
2018-07-19 17:35:51 -04:00
DEFINE_DELAYED_CALL ( done ) ;
const char * link ;
int res ;
2016-12-09 16:45:04 +01:00
2016-12-09 16:45:04 +01:00
if ( unlikely ( ! ( inode -> i_opflags & IOP_DEFAULT_READLINK ) ) ) {
if ( unlikely ( inode -> i_op -> readlink ) )
return inode -> i_op -> readlink ( dentry , buffer , buflen ) ;
if ( ! d_is_symlink ( dentry ) )
return - EINVAL ;
spin_lock ( & inode -> i_lock ) ;
inode -> i_opflags |= IOP_DEFAULT_READLINK ;
spin_unlock ( & inode -> i_lock ) ;
}
2016-12-09 16:45:04 +01:00
2019-04-10 13:21:14 -07:00
link = READ_ONCE ( inode -> i_link ) ;
2018-07-19 17:35:51 -04:00
if ( ! link ) {
link = inode -> i_op -> get_link ( dentry , inode , & done ) ;
if ( IS_ERR ( link ) )
return PTR_ERR ( link ) ;
}
res = readlink_copy ( buffer , buflen , link ) ;
do_delayed_call ( & done ) ;
return res ;
2016-12-09 16:45:04 +01:00
}
EXPORT_SYMBOL ( vfs_readlink ) ;
2005-04-16 15:20:36 -07:00
2016-10-04 14:40:45 +02:00
/**
* vfs_get_link - get symlink body
* @dentry: dentry on which to get symbolic link
* @done: caller needs to free returned data with this
*
* Calls security hook and i_op->get_link() on the supplied inode.
*
* It does not touch atime. That's up to the caller if necessary.
*
* Does not work on "special" symlinks like /proc/$$/fd/N
*/
const char * vfs_get_link ( struct dentry * dentry , struct delayed_call * done )
{
const char * res = ERR_PTR ( - EINVAL ) ;
struct inode * inode = d_inode ( dentry ) ;
if ( d_is_symlink ( dentry ) ) {
res = ERR_PTR ( security_inode_readlink ( dentry ) ) ;
if ( ! res )
res = inode -> i_op -> get_link ( dentry , inode , done ) ;
}
return res ;
}
EXPORT_SYMBOL ( vfs_get_link ) ;
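vfs_get_link() wraps the integer result of security_inode_readlink() in ERR_PTR() and then tests `! res`. That works because an error code of 0 encodes to a NULL pointer. The following is a minimal userspace sketch of that encoding, modelled on the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers. The `model_` names are hypothetical, and the 4095 limit is assumed to match the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode 0 or a negative errno as a pointer value. */
static inline void *model_err_ptr(long error)
{
	assert(error <= 0 && error >= -MODEL_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno from an encoded pointer. */
static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MODEL_MAX_ERRNO addresses. */
static inline int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```

With this encoding, `model_err_ptr(0)` is NULL and is not reported as an error, which mirrors how the `! res` test lets vfs_get_link() continue to i_op->get_link() only when the security hook returned 0.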
2005-04-16 15:20:36 -07:00
/* get the link contents into pagecache */
2015-11-17 10:20:54 -05:00
const char * page_get_link ( struct dentry * dentry , struct inode * inode ,
2015-12-29 15:58:39 -05:00
struct delayed_call * callback )
2005-04-16 15:20:36 -07:00
{
2008-12-19 20:47:12 +00:00
char * kaddr ;
struct page * page ;
2015-11-17 10:20:54 -05:00
struct address_space * mapping = inode -> i_mapping ;
2015-11-17 10:41:04 -05:00
if ( ! dentry ) {
page = find_get_page ( mapping , 0 ) ;
if ( ! page )
return ERR_PTR ( - ECHILD ) ;
if ( ! PageUptodate ( page ) ) {
put_page ( page ) ;
return ERR_PTR ( - ECHILD ) ;
}
} else {
page = read_mapping_page ( mapping , 0 , NULL ) ;
if ( IS_ERR ( page ) )
return ( char * ) page ;
}
2015-12-29 15:58:39 -05:00
set_delayed_call ( callback , page_put_link , page ) ;
2015-11-17 01:07:57 -05:00
BUG_ON ( mapping_gfp_mask ( mapping ) & __GFP_HIGHMEM ) ;
kaddr = page_address ( page ) ;
2015-11-17 10:20:54 -05:00
nd_terminate_link ( kaddr , inode -> i_size , PAGE_SIZE - 1 ) ;
2008-12-19 20:47:12 +00:00
return kaddr ;
2005-04-16 15:20:36 -07:00
}
2015-11-17 10:20:54 -05:00
EXPORT_SYMBOL ( page_get_link ) ;
2005-04-16 15:20:36 -07:00
2015-12-29 15:58:39 -05:00
void page_put_link ( void * arg )
2005-04-16 15:20:36 -07:00
{
2015-12-29 15:58:39 -05:00
put_page ( arg ) ;
2005-04-16 15:20:36 -07:00
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( page_put_link ) ;
2005-04-16 15:20:36 -07:00
2015-11-16 18:26:34 -05:00
int page_readlink ( struct dentry * dentry , char __user * buffer , int buflen )
{
2015-12-29 15:58:39 -05:00
DEFINE_DELAYED_CALL ( done ) ;
2015-11-17 10:20:54 -05:00
int res = readlink_copy ( buffer , buflen ,
page_get_link ( dentry , d_inode ( dentry ) ,
2015-12-29 15:58:39 -05:00
& done ) ) ;
do_delayed_call ( & done ) ;
2015-11-16 18:26:34 -05:00
return res ;
}
EXPORT_SYMBOL ( page_readlink ) ;
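page_readlink() relies on a small ownership handshake: page_get_link() registers page_put_link() through set_delayed_call(), and the caller releases the page later with do_delayed_call(). The sketch below models that pattern in userspace. The struct layout and every `model_` name are assumptions made for illustration, with a plain counter standing in for the page reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's delayed-call record. */
struct model_delayed_call {
	void (*fn)(void *);
	void *arg;
};

/* Register a cleanup to run later. */
static void model_set_delayed_call(struct model_delayed_call *call,
				   void (*fn)(void *), void *arg)
{
	assert(call != NULL);
	call->fn = fn;
	call->arg = arg;
}

/* Run the cleanup if one was registered; a blank record is a no-op. */
static void model_do_delayed_call(struct model_delayed_call *call)
{
	assert(call != NULL);
	if (call->fn)
		call->fn(call->arg);
}

/* Cleanup in the role of page_put_link(): drop one reference. */
static void model_put_ref(void *arg)
{
	int *refcount = arg;

	(*refcount)--;
}

/* Provider in the role of page_get_link(): take a reference, hand back data. */
static const char *model_get_link(int *refcount, struct model_delayed_call *done)
{
	assert(refcount != NULL);
	(*refcount)++;
	model_set_delayed_call(done, model_put_ref, refcount);
	return "target";
}
```

A caller would initialise the record to `{ NULL, NULL }`, call `model_get_link()`, use the returned string, and only then call `model_do_delayed_call()`, which is the same ordering page_readlink() follows around readlink_copy().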
2022-02-22 09:40:54 -05:00
int page_symlink ( struct inode * inode , const char * symname , int len )
2005-04-16 15:20:36 -07:00
{
struct address_space * mapping = inode -> i_mapping ;
2022-03-03 13:35:20 -05:00
const struct address_space_operations * aops = mapping -> a_ops ;
2022-02-22 09:40:54 -05:00
bool nofs = ! mapping_gfp_constraint ( mapping , __GFP_FS ) ;
2006-03-11 03:27:13 -08:00
struct page * page ;
2007-10-16 01:25:01 -07:00
void * fsdata ;
2007-02-16 01:27:18 -08:00
int err ;
2022-02-22 09:43:12 -05:00
unsigned int flags ;
2005-04-16 15:20:36 -07:00
2006-03-25 03:07:57 -08:00
retry :
2022-02-22 09:43:12 -05:00
if ( nofs )
flags = memalloc_nofs_save ( ) ;
2022-03-03 13:35:20 -05:00
err = aops -> write_begin ( NULL , mapping , 0 , len - 1 , & page , & fsdata ) ;
2022-02-22 09:43:12 -05:00
if ( nofs )
memalloc_nofs_restore ( flags ) ;
2005-04-16 15:20:36 -07:00
if ( err )
2007-10-16 01:25:01 -07:00
goto fail ;
2015-11-17 01:07:57 -05:00
memcpy ( page_address ( page ) , symname , len - 1 ) ;
2007-10-16 01:25:01 -07:00
2022-03-03 13:35:20 -05:00
err = aops -> write_end ( NULL , mapping , 0 , len - 1 , len - 1 ,
2007-10-16 01:25:01 -07:00
page , fsdata ) ;
2005-04-16 15:20:36 -07:00
if ( err < 0 )
goto fail ;
2007-10-16 01:25:01 -07:00
if ( err < len - 1 )
goto retry ;
2005-04-16 15:20:36 -07:00
mark_inode_dirty ( inode ) ;
return 0 ;
fail :
return err ;
}
2014-03-14 12:20:17 -04:00
EXPORT_SYMBOL ( page_symlink ) ;
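page_symlink() restarts the whole write_begin/copy/write_end sequence whenever write_end reports fewer than `len - 1` bytes. The `len - 1` suggests that `len` counts a terminating NUL which is not stored. The sketch below shows that retry shape in userspace against a hypothetical backend (`model_store`, `model_write_end`) that commits short a configurable number of times. It is an illustration of the control flow only, not of the page cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend that reports a short commit on its first few calls. */
struct model_store {
	char data[64];
	int short_attempts; /* remaining calls that will come up short */
	int calls;          /* total calls seen */
};

/* Returns the number of bytes committed, which may be less than len. */
static int model_write_end(struct model_store *s, const char *src, int len)
{
	s->calls++;
	if (s->short_attempts > 0) {
		s->short_attempts--;
		return len / 2; /* pretend only part of the data was committed */
	}
	memcpy(s->data, src, (size_t)len);
	return len;
}

/* Same shape as page_symlink(): retry from the top after a short commit. */
static int model_store_symlink(struct model_store *s, const char *symname, int len)
{
	int err;

	assert(s != NULL && symname != NULL);
	assert(len >= 1 && (size_t)(len - 1) <= sizeof(s->data));
retry:
	err = model_write_end(s, symname, len - 1);
	if (err < 0)
		return err;
	if (err < len - 1)
		goto retry;
	return 0;
}
```

Note that the loop has no retry limit, as in the original, so it terminates only if the backend eventually commits the full length or returns an error.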
2006-03-11 03:27:13 -08:00
2007-02-12 00:55:39 -08:00
const struct inode_operations page_symlink_inode_operations = {
2015-11-17 10:20:54 -05:00
. get_link = page_get_link ,
2005-04-16 15:20:36 -07:00
} ;
EXPORT_SYMBOL ( page_symlink_inode_operations ) ;