IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from only
one place, in 'o2hb_region_dev_write'. Return value is checked and
'mlog_errno' is called to log a message if it is not 0.
So there is no need to call 'mlog_errno' directly within these functions.
This would result on logging the message twice.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When storage network is unstable, it may trigger the BUG in
__ocfs2_journal_access because of buffer not uptodate. We can retry the
write in this case or return error instead of BUG.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Zhangguanghui <zhang.guanghui@h3c.com>
Tested-by: Zhangguanghui <zhang.guanghui@h3c.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) Take rw EX lock in case of append dio.
2) Explicitly treat the error code -EIOCBQUEUED as normal.
3) Set di_bh to NULL after brelse if it may be used again later.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During direct io the inode will be added to orphan first and then
deleted from orphan. There is a race window that the orphan entry will
be deleted twice and thus trigger the BUG when validating
OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.
ocfs2_direct_IO_write
...
ocfs2_add_inode_to_orphan
>>>>>>>> race window.
1) another node may rm the file and then down, this node
take care of orphan recovery and clear flag
OCFS2_DIO_ORPHANED_FL.
2) since rw lock is unlocked, it may race with another
orphan recovery and append dio.
ocfs2_del_inode_from_orphan
So take inode mutex lock when recovering orphans and make rw unlock at the
end of aio write in case of append dio.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iput() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fsnotify_destroy_mark_locked() is subtle to use because it temporarily
releases group->mark_mutex. To avoid future problems with this
function, split it into two.
fsnotify_detach_mark() is the part that needs group->mark_mutex and
fsnotify_free_mark() is the part that must be called outside of
group->mark_mutex. This way it's much clearer what's going on and we
also avoid some pointless acquisitions of group->mark_mutex.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free list is used when all marks on given inode / mount should be
destroyed when inode / mount is going away. However we can free all of
the marks without using a special list with some care.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A check in inotify_fdinfo() checking whether mark is valid was always
true due to a bug. Luckily we can never get to invalidated marks since
we hold mark_mutex and invalidated marks get removed from the group list
when they are invalidated under that mutex.
Anyway fix the check to make code more future proof.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have a _tiny_ microbenchmark that sits in a loop and writes single
bytes to a file. Writing one byte to a tmpfs file is around 2x slower
than reading one byte from a file, which is a _bit_ more than I expecte.
This is a dumb benchmark, but I think it's hard to deny that write() is
a hot path and we should avoid unnecessary overhead there.
I did a 'perf record' of 30-second samples of read and write. The top
item in a diffprofile is srcu_read_lock() from fsnotify(). There are
active inotify fd's from systemd, but nothing is actually listening to
the file or its part of the filesystem.
I *think* we can avoid taking the srcu_read_lock() for the common case
where there are no actual marks on the file. This means that there will
both be nothing to notify for *and* implies that there is no need for
clearing the ignore mask.
This patch gave a 13.1% speedup in writes/second on my test, which is an
improvement from the 10.8% that I saw with the last version.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Credit where credit is due: this idea comes from Christoph Lameter with
a lot of valuable input from Serge Hallyn. This patch is heavily based
on Christoph's patch.
===== The status quo =====
On Linux, there are a number of capabilities defined by the kernel. To
perform various privileged tasks, processes can wield capabilities that
they hold.
Each task has four capability masks: effective (pE), permitted (pP),
inheritable (pI), and a bounding set (X). When the kernel checks for a
capability, it checks pE. The other capability masks serve to modify
what capabilities can be in pE.
Any task can remove capabilities from pE, pP, or pI at any time. If a
task has a capability in pP, it can add that capability to pE and/or pI.
If a task has CAP_SETPCAP, then it can add any capability to pI, and it
can remove capabilities from X.
Tasks are not the only things that can have capabilities; files can also
have capabilities. A file can have no capabilty information at all [1].
If a file has capability information, then it has a permitted mask (fP)
and an inheritable mask (fI) as well as a single effective bit (fE) [2].
File capabilities modify the capabilities of tasks that execve(2) them.
A task that successfully calls execve has its capabilities modified for
the file ultimately being excecuted (i.e. the binary itself if that
binary is ELF or for the interpreter if the binary is a script.) [3] In
the capability evolution rules, for each mask Z, pZ represents the old
value and pZ' represents the new value. The rules are:
pP' = (X & fP) | (pI & fI)
pI' = pI
pE' = (fE ? pP' : 0)
X is unchanged
For setuid binaries, fP, fI, and fE are modified by a moderately
complicated set of rules that emulate POSIX behavior. Similarly, if
euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
(primary, fP and fI usually end up being the full set). For nonroot
users executing binaries with neither setuid nor file caps, fI and fP
are empty and fE is false.
As an extra complication, if you execute a process as nonroot and fE is
set, then the "secure exec" rules are in effect: AT_SECURE gets set,
LD_PRELOAD doesn't work, etc.
This is rather messy. We've learned that making any changes is
dangerous, though: if a new kernel version allows an unprivileged
program to change its security state in a way that persists cross
execution of a setuid program or a program with file caps, this
persistent state is surprisingly likely to allow setuid or file-capped
programs to be exploited for privilege escalation.
===== The problem =====
Capability inheritance is basically useless.
If you aren't root and you execute an ordinary binary, fI is zero, so
your capabilities have no effect whatsoever on pP'. This means that you
can't usefully execute a helper process or a shell command with elevated
capabilities if you aren't root.
On current kernels, you can sort of work around this by setting fI to
the full set for most or all non-setuid executable files. This causes
pP' = pI for nonroot, and inheritance works. No one does this because
it's a PITA and it isn't even supported on most filesystems.
If you try this, you'll discover that every nonroot program ends up with
secure exec rules, breaking many things.
This is a problem that has bitten many people who have tried to use
capabilities for anything useful.
===== The proposed change =====
This patch adds a fifth capability mask called the ambient mask (pA).
pA does what most people expect pI to do.
pA obeys the invariant that no bit can ever be set in pA if it is not
set in both pP and pI. Dropping a bit from pP or pI drops that bit from
pA. This ensures that existing programs that try to drop capabilities
still do so, with a complication. Because capability inheritance is so
broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
then calling execve effectively drops capabilities. Therefore,
setresuid from root to nonroot conditionally clears pA unless
SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
re-add bits to pA afterwards.
The capability evolution rules are changed:
pA' = (file caps or setuid or setgid ? 0 : pA)
pP' = (X & fP) | (pI & fI) | pA'
pI' = pI
pE' = (fE ? pP' : pA')
X is unchanged
If you are nonroot but you have a capability, you can add it to pA. If
you do so, your children get that capability in pA, pP, and pE. For
example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
automatically bind low-numbered ports. Hallelujah!
Unprivileged users can create user namespaces, map themselves to a
nonzero uid, and create both privileged (relative to their namespace)
and unprivileged process trees. This is currently more or less
impossible. Hallelujah!
You cannot use pA to try to subvert a setuid, setgid, or file-capped
program: if you execute any such program, pA gets cleared and the
resulting evolution rules are unchanged by this patch.
Users with nonzero pA are unlikely to unintentionally leak that
capability. If they run programs that try to drop privileges, dropping
privileges will still work.
It's worth noting that the degree of paranoia in this patch could
possibly be reduced without causing serious problems. Specifically, if
we allowed pA to persist across executing non-pA-aware setuid binaries
and across setresuid, then, naively, the only capabilities that could
leak as a result would be the capabilities in pA, and any attacker
*already* has those capabilities. This would make me nervous, though --
setuid binaries that tried to privilege-separate might fail to do so,
and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
unexpected side effects. (Whether these unexpected side effects would
be exploitable is an open question.) I've therefore taken the more
paranoid route. We can revisit this later.
An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
ambient capabilities. I think that this would be annoying and would
make granting otherwise unprivileged users minor ambient capabilities
(CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
it is with this patch.
===== Footnotes =====
[1] Files that are missing the "security.capability" xattr or that have
unrecognized values for that xattr end up with has_cap set to false.
The code that does that appears to be complicated for no good reason.
[2] The libcap capability mask parsers and formatters are dangerously
misleading and the documentation is flat-out wrong. fE is *not* a mask;
it's a single bit. This has probably confused every single person who
has tried to use file capabilities.
[3] Linux very confusingly processes both the script and the interpreter
if applicable, for reasons that elude me. The results from thinking
about a script's file capabilities and/or setuid bits are mostly
discarded.
Preliminary userspace code is here, but it needs updating:
https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2
Here is a test program that can be used to verify the functionality
(from Christoph):
/*
* Test program for the ambient capabilities. This program spawns a shell
* that allows running processes with a defined set of capabilities.
*
* (C) 2015 Christoph Lameter <cl@linux.com>
* Released under: GPL v3 or later.
*
*
* Compile using:
*
* gcc -o ambient_test ambient_test.o -lcap-ng
*
* This program must have the following capabilities to run properly:
* Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
*
* A command to equip the binary with the right caps is:
*
* setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
*
*
* To get a shell with additional caps that can be inherited by other processes:
*
* ./ambient_test /bin/bash
*
*
* Verifying that it works:
*
* From the bash spawed by ambient_test run
*
* cat /proc/$$/status
*
* and have a look at the capabilities.
*/
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <cap-ng.h>
#include <sys/prctl.h>
#include <linux/capability.h>
/*
* Definitions from the kernel header files. These are going to be removed
* when the /usr/include files have these defined.
*/
#define PR_CAP_AMBIENT 47
#define PR_CAP_AMBIENT_IS_SET 1
#define PR_CAP_AMBIENT_RAISE 2
#define PR_CAP_AMBIENT_LOWER 3
#define PR_CAP_AMBIENT_CLEAR_ALL 4
static void set_ambient_cap(int cap)
{
int rc;
capng_get_caps_process();
rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
if (rc) {
printf("Cannot add inheritable cap\n");
exit(2);
}
capng_apply(CAPNG_SELECT_CAPS);
/* Note the two 0s at the end. Kernel checks for these */
if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
perror("Cannot set cap");
exit(1);
}
}
int main(int argc, char **argv)
{
int rc;
set_ambient_cap(CAP_NET_RAW);
set_ambient_cap(CAP_NET_ADMIN);
set_ambient_cap(CAP_SYS_NICE);
printf("Ambient_test forking shell\n");
if (execv(argv[1], argv + 1))
perror("Cannot exec");
return 0;
}
Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Aaron Jones <aaronmdjones@gmail.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: Markku Savela <msa@moth.iki.fi>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_file_write_iter() is usng the wrong return value ('written'). This
will cause ocfs2_rw_unlock() be called both in write_iter & end_io,
triggering a BUG_ON.
This issue was introduced by commit 7da839c47589 ("ocfs2: use
__generic_file_write_iter()").
Orabug: 21612107
Fixes: 7da839c47589 ("ocfs2: use __generic_file_write_iter()")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit f895c53f8ace3c3e49ebf9def90e63fc6d46d2bf.
This commit causes a NFSv4 regression in that close()+unlink() can end
up failing. The reason is that we no longer have a guarantee that the
CLOSE has completed on the server, meaning that the subsequent call to
REMOVE may fail with NFS4ERR_FILE_OPEN if the server implements Windows
unlink() semantics.
Reported-by: <Olga Kornievskaia <aglo@umich.edu>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there is no cached data, then there is no need to track the file
change attribute on close.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull f2fs updates from Jaegeuk Kim:
"The major work includes fixing and enhancing the existing extent_cache
feature, which has been well settling down so far and now it becomes a
default mount option accordingly.
Also, this version newly registers a f2fs memory shrinker to reclaim
several objects consumed by a couple of data structures in order to
avoid memory pressures.
Another new feature is to add ioctl(F2FS_GARBAGE_COLLECT) which
triggers a cleaning job explicitly by users.
Most of the other patches are to fix bugs occurred in the corner cases
across the whole code area"
* tag 'for-f2fs-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (85 commits)
f2fs: upset segment_info repair
f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extent
f2fs: update extent tree in batches
f2fs: fix to release inode correctly
f2fs: handle f2fs_truncate error correctly
f2fs: avoid unneeded initializing when converting inline dentry
f2fs: atomically set inode->i_flags
f2fs: fix wrong pointer access during try_to_free_nids
f2fs: use __GFP_NOFAIL to avoid infinite loop
f2fs: lookup neighbor extent nodes for merging later
f2fs: split __insert_extent_tree_ret for readability
f2fs: kill dead code in __insert_extent_tree
f2fs: adjust showing of extent cache stat
f2fs: add largest/cached stat in extent cache
f2fs: fix incorrect mapping for bmap
f2fs: add annotation for space utilization of regular/inline dentry
f2fs: fix to update cached_en of extent tree properly
f2fs: fix typo
f2fs: check the node block address of newly allocated nid
f2fs: go out for insert_inode_locked failure
...
This set mainly includes a change to the way the
dlm uses the SCTP API in the kernel, removing the
direct dependency on the sctp module. Other odd
SCTP-related fixes are also included. The other
notable fix is for a long standing regression in
the behavior of lock value blocks for user space
locks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV5HwZAAoJEDgbc8f8gGmqoaQP/iz5zgKSjX0mOC3fz8BqXISk
85cKLPfsf0avDmGx6nkKp5wsmVDYkfrObkocvf7bOcemAuycuOmr9y22ZscNaAWM
vKLhTJQ0koAlZqhJmJx45w318BFY03RdDQmVKUnQHza9Ed7Uoa0CyR6jyuwBTuMP
gA9O6i6CezodtB8CLPySJa2znlt50CptLaJKj1V9/xCpBh7orwpihv4pBz8oH1lR
JXRj9hNEFy2+vk8Pce14fKmHgUROg5+y1V7jZeetpCbTxAAFOeFOL6EH28eWssbQ
YoWofcPugmOs9BDbnVZHf6+Y5xIaoiIylb2Q4/me4rjQfSmaiDbTZyqB4TtFrldF
BngaAJipmLQu8ELqQmwEMhZTAc/GsB60x1EcjrPVTKbW7pwsfVp2fPVV92a7koQe
prmz5rh8HCenrWuy3d4/EP7K+E4+W98ZXsDuym4pBNaoYwCPyvtWLa8kSqAdx47J
MNk/ak9ktP2NxsCs+EjCmP2hn2r+RTio6R2uCtKB2pdclfqOupIsYZkVdZERK5Ch
5+ALeVjHfxswFVRxGjbPQRs9x8ZclBydceAHgYbLQ2xDGRvTpQhnIyNLRXsZnkrD
t4mTokZG/GGgmWOscZ5nXOOGZt8SpX+UkICWWWbuy3dxuOK6al3lVeBcC0KW5Pki
KNHzcKrlGJJnCVr0nWTU
=iYRu
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set mainly includes a change to the way the dlm uses the SCTP API
in the kernel, removing the direct dependency on the sctp module.
Other odd SCTP-related fixes are also included.
The other notable fix is for a long standing regression in the
behavior of lock value blocks for user space locks"
* tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: print error from kernel_sendpage
dlm: fix lvb copy for user locks
dlm: sctp_accept_from_sock() can be static
dlm: fix reconnecting but not sending data
dlm: replace BUG_ON with a less severe handling
dlm: use sctp 1-to-1 API
dlm: fix not reconnecting on connecting error handling
dlm: fix race while closing connections
dlm: fix connection stealing if using SCTP
features and other churn going into 4.2.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJV55TlAAoJEPL5WVaVDYGjyzYH/1WtZpIzRjp7o+3H4/vFqONg
R1Fsw785C1w8WX2QuIK/m31u4XO+VeCV4jWA9DuqnSzWm9w9C/4kTqITd4Hwp416
/9gJvYoZHHaDikxpWWADptDi8IoLohTlcFVCHIvvf53cGehVEEsc2WOijUZo7Cgv
O454Nm3tK0CQ3yrCIlf5SyvkUZSMTiawLLJJzd4GCyvU13C1SnABNQj8UxKisBA5
cP8q4O2nPg/S9rkYxnFAifQyZppd3jMvorUaq9eHiWMjl95o6e/6+wYGnHhoFUvr
/P1dNjJYbzk+TUzlsDkq2zANK2UsB3iNNi8YwLFOpfFcuYopmUAYRIWOgIZWYUQ=
=ijuI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Pretty much all bug fixes and clean ups for 4.3, after a lot of
features and other churn going into 4.2"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Revert "ext4: remove block_device_ejected"
ext4: ratelimit the file system mounted message
ext4: silence a format string false positive
ext4: simplify some code in read_mmp_block()
ext4: don't manipulate recovery flag when freezing no-journal fs
jbd2: limit number of reserved credits
ext4 crypto: remove duplicate header file
ext4: update c/mtime on truncate up
jbd2: avoid infinite loop when destroying aborted journal
ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
ext4 crypto: fix spelling typo in comment
ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
ext4: reject journal options for ext2 mounts
ext4: implement cgroup writeback support
ext4: replace ext4_io_submit->io_op with ->io_wbc
ext4 crypto: check for too-short encrypted file names
ext4 crypto: use a jbd2 transaction when adding a crypto policy
jbd2: speedup jbd2_journal_dirty_metadata()
Pull ext3 removal, quota & udf fixes from Jan Kara:
"The biggest change in the pull is the removal of ext3 filesystem
driver (~28k lines removed). Ext4 driver is a full featured
replacement these days and both RH and SUSE use it for several years
without issues. Also there are some workarounds in VM & block layer
mainly for ext3 which we could eventually get rid of.
Other larger change is addition of proper error handling for
dquot_initialize(). The rest is small fixes and cleanups"
[ I wasn't convinced about the ext3 removal and worried about things
falling through the cracks for legacy users, but ext4 maintainers
piped up and were all unanimously in favor of removal, and maintaining
all legacy ext3 support inside ext4. - Linus ]
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Don't modify filesystem for read-only mounts
quota: remove an unneeded condition
ext4: memory leak on error in ext4_symlink()
mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
ext4: Improve ext4 Kconfig test
block: Remove forced page bouncing under IO
fs: Remove ext3 filesystem driver
doc: Update doc about journalling layer
jfs: Handle error from dquot_initialize()
reiserfs: Handle error from dquot_initialize()
ocfs2: Handle error from dquot_initialize()
ext4: Handle error from dquot_initialize()
ext2: Handle error from dquot_initalize()
quota: Propagate error from ->acquire_dquot()
Merge hpfs upddate from Mikulas Patocka.
* emailed patches from Mikulas Patocka <mikulas@twibright.com>:
hpfs: update ctime and mtime on directory modification
hpfs: support hotfixes
When the OS/2 driver hits a disk write error, it writes the sector to
another location and adds the sector mapping to the hotfix map.
This patch makes the hpfs driver understand the hotfix map and remap
accesses accoring to it.
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems cleaner to avoid the temporary value here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
None of these statistics can meaningfully be negative, and the
numerator for do_div() must have the type u64. The generic
implementation of do_div() used on some 32-bit architectures asserts
that, resulting in a compiler error in gfs2_rgrp_congested().
Fixes: 0166b197c2ed ("GFS2: Average in only non-zero round-trip times ...")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch changes the glock hash table from a normal hash table to
a resizable hash table, which scales better. This also simplifies
a lot of code.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
What uniquely identifies a glock in the glock hash table is not
gl_name, but gl_name and its superblock pointer. This patch makes
the gl_name field correspond to a unique glock identifier. That will
allow us to simplify hashing with a future patch, since the hash
algorithm can then take the gl_name and hash its components in one
operation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Don't use struct gfs2_glock_iter as the helper data structure for iterating
through "sbstats"; we are not iterating through glocks here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When I/O cannot complete due to a fatal error on the DS, ensure that we
invalidate the corresponding layout segment and return it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
We have observed the server sending recalls for delegation stateids
that have already been successfully returned. Change
nfsd4_cb_recall_done() to return success if the client has returned
the delegation. While this does not completely eliminate the sending
of recalls for delegations that have already been returned, this
does prevent unnecessarily declaring the callback path to be down.
Reported-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull user namespace updates from Eric Biederman:
"This finishes up the changes to ensure proc and sysfs do not start
implementing executable files, as the there are application today that
are only secure because such files do not exist.
It akso fixes a long standing misfeature of /proc/<pid>/mountinfo that
did not show the proper source for files bind mounted from
/proc/<pid>/ns/*.
It also straightens out the handling of clone flags related to user
namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER)
when files such as /proc/<pid>/environ are read while <pid> is calling
unshare. This winds up fixing a minor bug in unshare flag handling
that dates back to the first version of unshare in the kernel.
Finally, this fixes a minor regression caused by the introduction of
sysfs_create_mount_point, which broke someone's in house application,
by restoring the size of /sys/fs/cgroup to 0 bytes. Apparently that
application uses the directory size to determine if a tmpfs is mounted
on /sys/fs/cgroup.
The bind mount escape fixes are present in Al Viros for-next branch.
and I expect them to come from there. The bind mount escape is the
last of the user namespace related security bugs that I am aware of"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fs: Set the size of empty dirs to 0.
userns,pidns: Force thread group sharing, not signal handler sharing.
unshare: Unsharing a thread does not require unsharing a vm
nsfs: Add a show_path method to fix mountinfo
mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC
vfs: Commit to never having exectuables on proc and sysfs.
The return value from scnprintf always less than the buffer length.
So, result >= len always false. This patch removes those checking.
int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
{
int i;
i = vsnprintf(buf, size, fmt, args);
if (likely(i < size))
return i;
if (size != 0)
return size - 1;
return 0;
}
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The length of "Linux NFSv4.0 " is 14, not 10.
Without this patch, I get a truncated client owner id as,
"Linux NFSv4.0 ::1/::1"
With this patch,
"Linux NFSv4.0 ::1/::1 tcp"
Fixes: a319268891 ("nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a read-write layout has an invalid mirror, then we should
mark it as invalid, and return it.
If a read-only layout has an invalid mirror, then mark it as invalid
and check if there is still at least one valid mirror before we return
it.
Note: Also fix incorrect use of pnfs_generic_mark_devid_invalid().
We really want nfs4_mark_deviceid_unavailable().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Unlike read layouts, the writeable layout cannot fall back to using only
one of the mirrors. It need to write to all of them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Unlike the files layout, flexfiles does not test for the NFS_DEVICEID_INVALID
flag. Instead it relies on NFS_DEVICEID_UNAVAILABLE.
Fix is to replace with nfs4_mark_deviceid_unavailable().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Mirrors are now shared objects, so we should not be freeing them directly
inside ff_layout_free_lseg(). We should already be doing the right thing
in _ff_layout_free_lseg(), so just let it handle things.
Also ensure that ff_layout_free_mirror() frees the RPC credential if it
is set.
Fixes: 28a0d72c6867 ("Add refcounting to struct nfs4_ff_layout_mirror")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Somebody with a Solaris client was hitting this case. We haven't
figured out why yet, and don't have a reproducer. Meanwhile Frank
noticed that RFC 7530 actually recommends CLID_INUSE for this case.
Unlikely to help the original reporter, but may as well fix it.
Reported-by: Frank Filz <ffilzlnx@mindspring.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It's possible that a DELEGRETURN could race with (e.g.) client expiry,
in which case we could end up putting the delegation hash reference more
than once.
Have unhash_delegation_locked return a bool that indicates whether it
was already unhashed. In the case of destroy_delegation we only
conditionally put the hash reference if that returns true.
The other callers of unhash_delegation_locked call it while walking
list_heads that shouldn't yet be detached. If we find that it doesn't
return true in those cases, then throw a WARN_ON as that indicates that
we have a partially hashed delegation, and that something is likely very
wrong.
Tested-by: Andrew W Elble <aweits@rit.edu>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When an open or lock stateid is hashed, we take an extra reference to
it. When we unhash it, we drop that reference. The code however does
not properly account for the case where we have two callers concurrently
trying to unhash the stateid. This can lead to list corruption and the
hash reference being put more than once.
Fix this by having unhash_ol_stateid use list_del_init on the st_perfile
list_head, and then testing to see if that list_head is empty before
releasing the hash reference. This means that some of the unhashing
wrappers now become bool return functions so we can test to see whether
the stateid was unhashed before we put the reference.
Reported-by: Andrew W Elble <aweits@rit.edu>
Tested-by: Andrew W Elble <aweits@rit.edu>
Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can potentially have several nfs4_laundromat jobs running if there
are multiple namespaces running nfsd on the box. Those are effectively
separated from one another though, so I don't see any reason to
serialize them.
Also, create_singlethread_workqueue automatically adds the
WQ_MEM_RECLAIM flag. Since we run this job on a timer, it's not really
involved in any reclaim paths. I see no need for a rescuer thread.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
These messages, combined with the backtrace they trigger, makes it seem
like a serious problem, though a quick search shows distros marking
it as a "won't fix" non-issue when the problem is reported by users.
The backtrace is overkill, and only really manages to show that if
you follow the code path, you can't really avoid it with bootargs
or configuration settings in the container.
Given that, lets tone it down a bit and get rid of the WARN severity,
and the associated backtrace, so people aren't needlessly alarmed.
Also, lets drop the split printk line, since they are grep unfriendly.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix kernel-doc warnings in fs/locks.c:
Warning(..//fs/locks.c:1577): No description found for parameter 'flags'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Security label can be set in OPEN/CREATE request, nfsd should set
the bitmask in word2 if setting success.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to rfc5661 18.16.4,
"If EXCLUSIVE4_1 was used, the client determines the attributes
used for the verifier by comparing attrset with cva_attrs.attrmask;"
So, EXCLUSIVE4_1 also needs those bitmask used to store the verifier.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The encode order should be as the bitmask defined order.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>