IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.
After 5 runs of the following test, I got a smaller number of btree
node splits overall:
sysbench --test=fileio --file-num=512 --file-total-size=5G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=8192 --max-requests=100000 --file-io-mode=sync
Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)
After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)
An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
After an ordered extent completes, don't blindly reset the
inode's ordered tree last accessed ordered extent pointer.
While running the xfstests I noticed that about 29% of the
time the ordered extent to which tree->last pointed was not
the same as our just completed ordered extent. After that I
ran the following sysbench test (after a prepare phase) and
noticed that about 68% of the time tree->last pointed to
a different ordered extent too.
sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=rndwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --max-requests=0 run
Therefore reset tree->last on ordered extent removal only if
it pointed to the ordered extent we're removing from the tree.
Results from 4 runs of the following test before and after
applying this patch:
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync run
Before this path:
run 1 - 64.049Mb/sec
run 2 - 63.455Mb/sec
run 3 - 64.656Mb/sec
run 4 - 63.833Mb/sec
After this patch:
run 1 - 66.149Mb/sec
run 2 - 68.459Mb/sec
run 3 - 66.338Mb/sec
run 4 - 66.176Mb/sec
With random writes (--file-test-mode=rndwr) I had huge fluctuations
on the results (+- 35% easily).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe noticed that we were leaking the features attribute group
after umount. His fix of just calling sysfs_remove_group() wasn't enough
since that removes just the supported features and not the unsupported
features.
This patch changes the unknown feature handling to add them individually
so we can skip the kmalloc and uses the same iteration to tear them down
later.
We also fix the error handling during mount so that we catch the
failing creation of the per-super kobject, and handle proper teardown
of a half-setup sysfs context.
Tested properly with kmemleak enabled this time.
Reported-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch fixes the following warnings:
fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).
A quick way to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m25.457s
user 0m0.000s
sys 0m0.092s
$ cd ..
$ time umount /mnt/btrfs
real 1m38.234s
user 0m0.000s
sys 1m25.760s
The same test on ext4 runs much faster:
$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4
real 0m3.626s
user 0m0.004s
sys 0m3.012s
After this patch, the unmount (inode evictions) is much faster:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m26.774s
user 0m0.000s
sys 0m0.084s
$ cd ..
$ time umount /mnt/btrfs
real 0m1.811s
user 0m0.000s
sys 0m1.564s
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We hit a forever loop when doing balance relocation,the reason
is that we firstly reserve 4M(node size is 16k).and within transaction
we will try to add extra reservation for snapshot roots,this will
return -EAGAIN if there has been a thread flushing space to reserve
space.We will do this again and again with filesystem becoming nearly
full.
If the above '-EAGAIN' case happens, we try to refill reservation more
outsize of transaction, and this will return eariler in enospc case,however,
this dosen't really hurt because it makes no sense doing balance relocation
with the filesystem nearly full.
Miao Xie helped a lot to track this issue, thanks.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents);
fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents);
fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes);
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync prepare
# cd -
# umount <mnt>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync run
# cd -
# umount <mnt>
The reason of this problem is:
Task0 Task1
btrfs_direct_IO
unlock(&inode->i_mutex)
lock(&inode->i_mutex)
reserve_space()
prepare_pages()
lock_extent()
clear_extent()
unlock_extent()
lock_extent()
test_extent(uptodate)
return false
copy_data()
set_delalloc_extent()
extent need compress
go back to buffered write
clear_extent(DELALLOC | DIRTY)
unlock_extent()
Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.
This patch fixes the above bug by unlocking the extent after the delalloc.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
- the caller has gotten the inode object, needn't pass the file object.
And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
we also needn't pass the root object into prepare_pages().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.
Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.
Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.
This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This adds a writeable attribute which describes the label.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.
This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.
Since the the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.
If neither set or clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.
This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds per-super attributes to sysfs.
It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.
The files are module-wide and export which features the kernel supports.
The content, for now, is just "0\n".
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.
We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.
The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data refs related to
this disk offset but we may have millions refs.
And in the case of delayed data refs, we don't usually have too much
data refs to merge.
So it's safe to shut it down for data refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.
The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.
When we have a great number of delayed refs pending to process,
this'll cost time a lot.
Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression. Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced. This fixes check_int_data freaking out when running
btrfs/004. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull namespace fixes from Eric Biederman:
"This is a set of 3 regression fixes.
This fixes /proc/mounts when using "ip netns add <netns>" to display
the actual mount point.
This fixes a regression in clone that broke lxc-attach.
This fixes a regression in the permission checks for mounting /proc
that made proc unmountable if binfmt_misc was in use. Oops.
My apologies for sending this pull request so late. Al Viro gave
interesting review comments about the d_path fix that I wanted to
address in detail before I sent this pull request. Unfortunately a
bad round of colds kept from addressing that in detail until today.
The executive summary of the review was:
Al: Is patching d_path really sufficient?
The prepend_path, d_path, d_absolute_path, and __d_path family of
functions is a really mess.
Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
No it is not appropriate to rewrite all of d_path for a regression
that has existed for entirely too long already, when a two line
change will do"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Fix a regression in mounting proc
fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
vfs: In d_path don't call d_dname on a mount point
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJS080OAAoJECvKgwp+S8JaIdUQAJKNZTzXKylUjUZty42t57Jh
1qRrQeJ6ha+JVSpYX4jJz/mSzUdJdjoFg7J3O54OnVFj/CnlcY7GRZj3VMel9ijf
uhlf8DcU6JsThcFK4Q6mqXtdAHDPkQ1jkQHLNCe7bow9AjCzHymAZWJix4YvEsXF
zeJJURMqSaJeo/44MynnXyn/h5RRhg+5HWErhoFiVUzDzHR3RoQqmt3lPVVJkdj1
iokHLMzGui2vs52vUJj2yx7m9kaoDx/6bJpqR61qHfk5S4wjLkUI+1ID8dsTNVF2
4O3THb0nUDWx4wuJIxrAKoPiYjiemX1KmQXlUVr3IsfhDiiBbLyviVyn4aRaFIxV
IRCVXCj1CWw+cFLeCA5E+/WvpxjLfKs4WNBxIqjes5YRPM4PLpU3MDiabssaUzHI
0VPbU8TQ05hqH0wbs0hIgXyvED6yNn9d3sPHS2Lb5i2tp3E0FzVEoh2EH2jn8lmQ
1DAdi+ezk9EiJs8AFiN6MSIBpAZosX3Nq+RTmYGKqLZMGnxlJ30YspNlipiBPFpC
4xokkMZAZ0+wzpVabOMie36Rc/AaOAqiOjS1C6UIoOSrBTgtwWL7Ft2Da3SKb0KX
XQhNWCHNYcgOn9/DDDmGxzwt6HsEzOIYinMwrG37LSass5KEvopssmiLCXn8wry+
QXUoiFFFAPpg8iXaqj4X
=AHdo
-----END PGP SIGNATURE-----
Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback fix from Wu Fengguang:
"Fix data corruption on NFS writeback.
It has been in linux-next for one month"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix data corruption on NFS
There is a bug in the function nilfs_segctor_collect, which results in
active data being written to a segment, that is marked as clean. It is
possible, that this segment is selected for a later segment
construction, whereby the old data is overwritten.
The problem shows itself with the following kernel log message:
nilfs_sufile_do_cancel_free: segment 6533 must be clean
Usually a few hours later the file system gets corrupted:
NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
The issue can be reproduced with a file system that is nearly full and
with the cleaner running, while some IO intensive task is running.
Although it is quite hard to reproduce.
This is what happens:
1. The cleaner starts the segment construction
2. nilfs_segctor_collect is called
3. sc_stage is on NILFS_ST_SUFILE and segments are freed
4. sc_stage is on NILFS_ST_DAT current segment is full
5. nilfs_segctor_extend_segments is called, which
allocates a new segment
6. The new segment is one of the segments freed in step 3
7. nilfs_sufile_cancel_freev is called and produces an error message
8. Loop around and the collection starts again
9. sc_stage is on NILFS_ST_SUFILE and segments are freed
including the newly allocated segment, which will contain active
data and can be allocated at a later time
10. A few hours later another segment construction allocates the
segment and causes file system corruption
This can be prevented by simply reordering the statements. If
nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
the freed segments are marked as dirty and cannot be allocated any more.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- fix off-by-one in xfs_attr3_rmt_verify
- fix missing destroy_work_on_stack() in xfs_bmapi_allocate
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJS0ECWAAoJENaLyazVq6ZOgn0QAKSC/pkP4km+QbmL0R7SqSJH
ZSSj16gIjR5lHlwI3PQzv5BgyEC9BcRDKWXN6dy+GHHuMtP4qYK8cLWFcyl7EysH
HAyDBnaJVphXt23C5iIzk+iseNfRYXA2LOpYSH6qfhZ5bxEeYzQS42zL4YhxZrXq
kzLHojcTLUx0IzJ+4oHn5AXSgPt+PXxNz3s+TU9virFnfSMlw2qYukxQtG49nbQr
kQjNHgeTIBKzeHdlnxmv5Rd2bD//397w5aWXxmaUh8fk6Z7VJi40ALAG4Pks81HF
+TEgMtF9/xTXdlwrYJDoHp++vUs6HANCX+wSAb4MdrBQvjh/USytK2WFwOeMyyR6
L/iogfPXHHizTkoYSzPwPdEmCCFhzidvBEqNX68+ojlJnDtoart7IgkOcm9LvaQI
j//u76CPRcd8tFh+1fDNaXn1ykJ6/CepSY13/yOnbpc7JoDbtqK2R8HFxdSlkDDg
UooLF2AfQ6lX280cUWwV0flqGO6iTIM3Fw1mIq3z8X4usNn+bMnlOu/DUnCbF5bB
YJCV4uT7f04w7oJqin9a7LHaHKRD56tWQun/OCEd7ZV/hJ1YRYlhhLfSdWdX7+SX
oIawXJy7NvCPQLaTwycD3h2gDlaxw17GAc9rA3AcCknxBsgNosv1ETQnEPC4iIAq
QsVal7p6oMLZ/qx6mvX7
=Xpq3
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
"Here we have a bugfix for an off-by-one in the remote attribute
verifier that results in a forced shutdown which you can hit with v5
superblock by creating a 64k xattr, and a fix for a missing
destroy_work_on_stack() in the allocation worker.
It's a bit late, but they are both fairly straightforward"
* tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs:
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 6f96b3063c)
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 85dd0707f0)
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize). It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.
The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Merge patches from Andrew Morton:
"Ten fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test()
mm/memory-failure.c: transfer page count from head page to tail page after split thp
MAINTAINERS: set up proper record for Xilinx Zynq
mm: remove bogus warning in copy_huge_pmd()
memcg: fix memcg_size() calculation
mm: fix use-after-free in sys_remap_file_pages
mm: munlock: fix deadlock in __munlock_pagevec()
mm: munlock: fix a bug where THP tail page is encountered
The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
commmit 67347fe4e6 ("epoll: do not take global 'epmutex' for simple
topologies").
The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.
The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon. The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock. Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s_umount which is copied in from the core vfs, two patches
relate to a hard to hit "use after free" and memory leak.
Two patches related to using DIO and buffered I/O on the same
file to ensure correct operation in relation to glock state
changes. The final patch adds an RCU read lock to ensure
correct locking on an error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJSxVxyAAoJEMrg3m4a/8jSP1kQAKkyW6DgevgZ+IHlm5+mhTeZ
Bpdy3l6DdxZIiqoG0VqJo6DoeR4td+1q7TfyjpvFvgxjU/m/nLhKFcNd1A6TN3OK
G9Y6q0k0aWsAUPUjg3Y6gFRAlXHQaGXQ3nMDmoTCdYSYqid8gB+oqPbfwf5uHAgU
GVPgKxqSsJmzxPYTjpjx8mdpgiwCHa+iB+reoqxNSdxJnAk93GrBA7efonNoxKB1
r8VJlgkJubMjxGMu6xQYLMyt1Xed85sbiASOdE+Thw700tBA/ZAtKuB8xZ4+X1Fd
M5osKYnqodde+A3aSi6P7b+M6N+WyA/7bHhckbaQy8cwpC9xhgEqsEsIEFm0eJjB
wbdGe2tsCTUvLy37++D5e88cF9O2F6Ku0MJJtb7KsTLZPFD9XXs/6/xx4vSSNKQt
FC7BF5dkQiLDJvy1xvcHK43+PbOaS7/8WM1NuoNAS/L/3RYFrrHby3LqBo+kcUbV
L9HoL8aJd60bsX7PceXA9UzaH8yk/yTgeyOtd2+VCiRVldvNtx32ylTJLUqqxeRi
AL/tZWgxwPKb54AJMptPZ0fGP5A+pUhQgTm7fJCwrUdXQXWUW0YYK2sV3H9BZ8px
Ga0PuJtjxj8OkGFwnugEtuQNGQ9M5uCX4UiELqP3rVRNpq4e8UkOZRqtHsU7urSB
ezufwdI+b+uHUucva31D
=KSDi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull GFS2 fixes from Steven Whitehouse:
"Here is a set of small fixes for GFS2. There is a fix to drop
s_umount which is copied in from the core vfs, two patches relate to a
hard to hit "use after free" and memory leak. Two patches related to
using DIO and buffered I/O on the same file to ensure correct
operation in relation to glock state changes. The final patch adds an
RCU read lock to ensure correct locking on an error path"
* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Fix unsafe dereference in dump_holder()
GFS2: Wait for async DIO in glock state changes
GFS2: Fix incorrect invalidation for DIO/buffered I/O
GFS2: Fix slab memory leak in gfs2_bufdata
GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
GFS2: don't hold s_umount over blkdev_put
GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that
RCU read lock is held when using task_struct returned from pid_task().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Set FILE_CREATED on O_CREAT|O_EXCL.
cifs code didn't change during commit 116cc02253
Kernel bugzilla 66251
Signed-off-by: Shirish Pargaonkar <spargaonkar@suse.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
When we obtain tcon from cifs_sb, we use cifs_sb_tlink() to first obtain
tlink which also grabs a reference to it. We do not drop this reference
to tlink once we are done with the call.
The patch fixes this issue by instead passing tcon as a parameter and
avoids having to obtain a reference to the tlink. A lookup for the tcon
is already made in the calling functions and this way we avoid having to
re-run the lookup. This is also consistent with the argument list for
other similar calls for M-F symlinks.
We should also return an ENOSYS when we do not find a protocol specific
function to lookup the MF Symlink data.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
and a patch so that instead of BUG'ing we use the ext4_error()
framework to mark the file system is corrupted.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABCAAGBQJSuSZnAAoJENNvdpvBGATwp/8P/R/VjV1O4IhDmRxbc7mNAkoB
Mfh2Utnk9daGdMpSMhzWW6m3oyohA0ICledBusQ3ax6Ymg8jIcmGwm3rJ8gAXvR1
4g0rQ1nw3JEGROId58FnKB3fsEmOPlt4T/LKL4boY6BfER4yu1htH0zSKBuKqykt
feH9dMiaR1KMQ613eWY6GEonYaP8+nI1GxEfvrymInxznDPVuaLgR4oBMAmR8R76
9vfJfFHYjbk1wQ5UEv94tic8Hi055PGCRfsLc79QwxMr5KyKz+NydDUIjgKjP9pu
9sz8iuV79M5/hUguZY7HH9Xd0byZ+jPuNrpkrDqSNZYuArfIcsXKZM/dm7HOgFGQ
dQzf9S/kBzJvcSHuUchhS2cm6kxCsHaqo16Fxs5kP3TmB3TrVr7EV6uBS4cm53PJ
x6IdAORhbURfuJCRQOi/TDNUrb+ZHvIx7Gc1ujizczC3An7QurfYo7XY/rWfdj41
eIVy0+1gqvWJsbXGInni1hKbXMU3yTJ0MqQm05A7MW/G2G6eIgEVpz8MElm33jEE
VvC6KyZxpviRYPUmxxTbSa1vl0UG1rZdZXslgmlSyY1yItVmyTCIAG217JOTyhTX
Ae1aZEgzLYh6dQAwweme79WF4WsBPP28TOmW2xoOH7t04gMG0t+9b/RbUDtTPItc
HXNmIlFP9CULIQ1c2Cvh
=KPNa
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A collection of bug fixes destined for stable and some printk cleanups
and a patch so that instead of BUG'ing we use the ext4_error()
framework to mark the file system is corrupted"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add explicit casts when masking cluster sizes
ext4: fix deadlock when writing in ENOSPC conditions
jbd2: rename obsoleted msg JBD->JBD2
jbd2: revise KERN_EMERG error messages
jbd2: don't BUG but return ENOSPC if a handle runs out of space
ext4: Do not reserve clusters when fs doesn't support extents
ext4: fix del_timer() misuse for ->s_err_report
ext4: check for overlapping extents in ext4_valid_extent_entries()
ext4: fix use-after-free in ext4_mb_new_blocks
ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
Pull ext2 fix from Jan Kara:
"One simple fix of oops in ext2 which was recently hit by Christoph"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
Pull AIO leak fixes from Ben LaHaise:
"I've put these two patches plus Linus's change through a round of
tests, and it passes millions of iterations of the aio numa
migratepage test, as well as a number of repetitions of a few simple
read and write tests.
The first patch fixes the memory leak Kent introduced, while the
second patch makes aio_migratepage() much more paranoid and robust"
* git://git.kvack.org/~bcrl/aio-next:
aio/migratepages: make aio migrate pages sane
aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
Since commit 36bc08cc01 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.
However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them. And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.
Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).
So clean it all up, making it much more straightforward. Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7da ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.
Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.
While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
e34ecee2ae reworked the percpu reference
counting to correct a bug trinity found. Unfortunately, the change lead
to kioctxes being leaked because there was no final reference count to
put. Add that reference count back in to fix things.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
- fix memory leak in xfs_dir2_node_removename
- fix quota assertion in xfs_setattr_size
- fix quota assertions in xfs_qm_vop_create_dqattach
- fix for hang when disabling group and project quotas before
disabling user quotas
- fix Dave Chinner's email address in MAINTAINERS
- fix for file allocation alignment
- fix for assertion in xfs_buf_stale by removing xfsbdstrat
- fix for alignment with swalloc mount option
- fix for "retry forever" semantics on IO errors
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJSs0PZAAoJENaLyazVq6ZOELgP/Rcx5JdjfCdvZZ7HFfzabLU6
TOpyEpc0TJso8C92+UNZJUZWNdToEn/v1VRh6dQ+cCz3RxQfOeOKVKXU1XkCBRQO
JxW7Pucb+SRoVf+uv6qZCCJUO1oY6JByZ8+9GuBGWK5Ul2ByxTPI50Et0Qy4wM3z
cDvQVyjtA5+63ToUS0sR8yBSKK+8c9SkjVkdLqa+AoFJHYC+meNrZ0J1PRV2ILWu
bFJtKFe/tO4jj/UJ1uj6ZjvVQ0jm9JH1ZE4m3tbjPcDCTHyxHu5vSBVSlPO4WbAb
Tfaj4eB7rQy05yno2/mAjn2koaqTSg1cP5V14TMP1GzBQUpwQDAWsNGkorXPfRIn
Xsrznxk33fTCTqVSkSnVsXKZhizzPydyVCcvf00YJssYh9IEjVdWVpxedLFVJDmO
jatsMaEAe7Z8avtah6u5vDGTQCEPQjhHPEqhW/EUfCNG1uK6DjyMG4dDsCMufJ7N
Ze646oXD6zd45hSPQxMV1r8ZvlQoubUgctOBNqs/nDhOblRQ7MRqkRHhPRvvzsBG
ffVB145l5v1cud0IcpIbfWPtosnPAvoqYS+qglkXkmXmU7rk0APePDYP7XLh4+qy
8ROkJQ0rsgmC2cyC/fmwtwWQCMCRUrI9YB2X1zRiBS6TwwATP2uIomtT7GwAfK4+
AmCwxwy6XPMhUd3xn3Vx
=32uU
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
"This contains fixes for some asserts
related to project quotas, a memory leak, a hang when disabling group or
project quotas before disabling user quotas, Dave's email address, several
fixes for the alignment of file allocation to stripe unit/width geometry, a
fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
metadata writeback in the face of IO errors.
Details:
- fix memory leak in xfs_dir2_node_removename
- fix quota assertion in xfs_setattr_size
- fix quota assertions in xfs_qm_vop_create_dqattach
- fix for hang when disabling group and project quotas before
disabling user quotas
- fix Dave Chinner's email address in MAINTAINERS
- fix for file allocation alignment
- fix for assertion in xfs_buf_stale by removing xfsbdstrat
- fix for alignment with swalloc mount option
- fix for "retry forever" semantics on IO errors"
* tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
xfs: abort metadata writeback on permanent errors
xfs: swalloc doesn't align allocations properly
xfs: remove xfsbdstrat error
xfs: align initial file allocations correctly
MAINTAINERS: fix incorrect mail address of XFS maintainer
xfs: fix infinite loop by detaching the group/project hints from user dquot
xfs: fix assertion failure at xfs_setattr_nonsize
xfs: fix false assertion at xfs_qm_vop_create_dqattach
xfs: fix memory leak in xfs_dir2_node_removename
Some pstore backing devices use on board flash as persistent
storage. These have limited numbers of write cycles so it
is a poor idea to use them from high frequency operations.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The missing casts can cause the high 64-bits of the physical blocks to
be lost. Set up new macros which allows us to make sure the right
thing happen, even if at some point we end up supporting larger
logical block numbers.
Thanks to the Emese Revfy and the PaX security team for reporting this
issue.
Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
We need to wait for any outstanding DIO to complete in a couple
of situations. Firstly, in case we are changing out of deferred
mode (in inode_go_sync) where GLF_DIRTY will not be set. That
call could be prefixed with a test for gl_state == LM_ST_DEFERRED
but it doesn't seem worth it bearing in mind that the test for
outstanding DIO is very quick anyway, in the usual case that there
is none.
The second case is in inode_go_lock which will catch the cases
where we have a cached EX lock, but where we grant deferred locks
against it so that there is no glock state transistion. We only
need to wait if the state is not deferred, since DIO is valid
anyway in that state.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In patch 209806aba9 we allowed
local deferred locks to be granted against a cached exclusive
lock. That opened up a corner case which this patch now
fixes.
The solution to the problem is to check whether we have cached
pages each time we do direct I/O and if so to unmap, flush
and invalidate those pages. Since the glock state machine
normally does that for us, mostly the code will be a no-op.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue in
sysfs that has been reported.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlKx3T4ACgkQMUfUDdst+ykALQCfchxid3U9xydNKQMx51PwLiKX
c08AnA1ZhDtQ6rfUi+49YJxhWLgUyYHD
=glqF
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue
in sysfs that has been reported"
* tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
sysfs: give different locking key to regular and bin files
Akira-san has been reporting rare deadlocks of his machine when running
xfstests test 269 on ext4 filesystem. The problem turned out to be in
ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
ext4_should_retry_alloc() while holding i_data_sem. Since
ext4_should_retry_alloc() can force a transaction commit, this is a
lock ordering violation and leads to deadlocks.
Fix the problem by just removing the retry loops. These functions should
just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
function must take care of retrying after dropping all necessary locks.
Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull two Ceph fixes from Sage Weil:
"One of these is fixing a regression from the d_flags file type patch
that went into -rc1 that broke instantiation of inodes and dentries
(we were doing dentries first). The other is just an off-by-one
corner case"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
ceph: initialize inode before instantiating dentry
If we are doing aysnc writeback of metadata, we can get write errors
but have nobody to report them to. At the moment, we simply attempt
to reissue the write from io completion in the hope that it's a
transient error.
When it's not a transient error, the buffer is stuck forever in
this loop, and we cannot break out of it. Eventually, unmount will
hang because the AIL cannot be emptied and everything goes downhill
from them.
To solve this problem, only retry the write IO once before aborting
it. We don't throw the buffer away because some transient errors can
last minutes (e.g. FC path failover) or even hours (thin
provisioned devices that have run out of backing space) before they
go away. Hence we really want to keep trying until we can't try any
more.
Because the buffer was not cleaned, however, it does not get removed
from the AIL and hence the next pass across the AIL will start IO on
it again. As such, we still get the "retry forever" semantics that
we currently have, but we allow other access to the buffer in the
mean time. Meanwhile the filesystem can continue to modify the
buffer and relog it, so the IO errors won't hang the log or the
filesystem.
Now when we are pushing the AIL, we can see all these "permanent IO
error" buffers and we can issue a warning about failures before we
retry the IO. We can also catch these buffers when unmounting an
issue a corruption warning, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
When swalloc is specified as a mount option, allocations are
supposed to be aligned to the stripe width rather than the stripe
unit of the underlying filesystem. However, it does not do this.
What the implementation does is round up the allocation size to a
stripe width, hence ensuring that all allocations span a full stripe
width. It does not, however, ensure that that allocation is aligned
to a stripe width, and hence the allocations can span multiple
underlying stripes and so still see RMW cycles for things like
direct IO on MD RAID.
So, if the swalloc mount option is set, change the allocation
alignment in xfs_bmap_btalloc() to use the stripe width rather than
the stripe unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
handles the case of a shut down filesystem. Most of the users have private,
uncached buffers that can just be freed in this case, but the complex error
handling in xfs_bioerror_relse messes up the case when it's called without
a locked buffer.
Remove xfsbdstrat and opencode the error handling in the callers. All but
one can simply return an error and don't need to deal with buffer state,
and the one caller that cares about the buffer state could do with a major
cleanup as well, but we'll defer that to later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.
Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.
Fix it by considering allocation into empty files as requiring
aligned allocation again.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9b395a8ef)
xfs_quota(8) will hang up if trying to turn group/project quota off
before the user quota is off, this could be 100% reproduced by:
# mount -ouquota,gquota /dev/sda7 /xfs
# mkdir /xfs/test
# xfs_quota -xc 'off -g' /xfs <-- hangs up
# echo w > /proc/sysrq-trigger
# dmesg
SysRq : Show Blocked State
task PC stack pid father
xfs_quota D 0000000000000000 0 27574 2551 0x00000000
[snip]
Call Trace:
[<ffffffff81aaa21d>] schedule+0xad/0xc0
[<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
[<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
[<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
[<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
[<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
[<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
[<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
[<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
[<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
[<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
[<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
[<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
[<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
[<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
[<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
[<ffffffff812fba4b>] ? iput+0x5b/0x90
[<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
[<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b
It's fine if we turn user quota off at first, then turn off other
kind of quotas if they are enabled since the group/project dquot
refcount is decreased to zero once the user quota if off. Otherwise,
those dquots refcount is non-zero due to the user dquot might refer
to them as hint(s). Hence, above operation cause an infinite loop
at xfs_qm_dquot_walk() while trying to purge dquot cache.
This problem has been around since Linux 3.4, it was introduced by:
[ b84a3a9675 xfs: remove the per-filesystem list of dquots ]
Originally we will release the group dquot pointers because the user
dquots maybe carrying around as a hint via xfs_qm_detach_gdquots().
However, with above change, there is no such work to be done before
purging group/project dquot cache.
In order to solve this problem, this patch introduces a special routine
xfs_qm_dqpurge_hints(), and it would release the group/project dquot
pointers the user dquots maybe carrying around as a hint, and then it
will proceed to purge the user dquot cache if requested.
Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit df8052e7da)
For CRC enabled v5 super block, change a file's ownership can simply
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e,
[ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[ 305.383939] Call Trace:
[ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350
[ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110
[ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]
This fix adjust the assertion to check if the super block support both
quota inodes or not.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 5a01dd54f4)
After the previous fix, there still has another ASSERT failure if turning
off any type of quota while fsstress is running at the same time.
Backtrace in this case:
[ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[ 50.867924] ------------[ cut here ]------------
... <snip>
[ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[ 50.867999] invalid opcode: 0000 [#1] SMP
[ 50.869407] Call Trace:
[ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[ 50.870050] RSP <ffff88002941fd60>
[ 50.879251] ---[ end trace c93a2b342341c65b ]---
We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
however the assertion itself is not right IMHO. While performing quota off, we
firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee
that the desired quota is still active.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 37eb9706eb)
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit ef701600fd)
This patch fixes a slab memory leak that sometimes can occur
for files with a very short lifespan. The problem occurs when
a dinode is deleted before it has gotten to the journal properly.
In the leak scenario, the bd object is pinned for journal
committment (queued to the metadata buffers queue: sd_log_le_buf)
but is subsequently unpinned and dequeued before it finds its way
to the ail or the revoke queue. In this rare circumstance, the bd
object needs to be freed from slab memory, or it is forgotten.
We have to be very careful how we do it, though, because
multiple processes can call gfs2_remove_from_journal. In order to
avoid double-frees, only the process that does the unpinning is
allowed to free the bd.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function gfs2_remove_from_ail drops the reference on the bh via
brelse. This patch fixes a race condition whereby bh is deferenced
after the brelse when setting bd->bd_blkno = bh->b_blocknr;
Under certain rare circumstances, bh might be gone or reused,
and bd->bd_blkno is set to whatever that memory happens to be,
which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
the test "bd->bd_blkno >= blkno" which causes it to never be freed.
The end result is that the bd is never freed from the bufdata cache,
which results in this error:
slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is a GFS2 version of Tejun's patch:
4f331f01b9
vfs: don't hold s_umount over close_bdev_exclusive() call
In this case its blkdev_put itself that is the issue and this
patch uses the same solution of dropping and retaking s_umount.
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Commit 4f8ad655db "writeback: Refactor writeback_single_inode()" added
a condition to skip clean inode. However this is wrong in WB_SYNC_ALL
mode because there we also want to wait for outstanding writeback on
possibly clean inode. This was causing occasional data corruption issues
on NFS because it uses sync_inode() to make sure all outstanding writes
are flushed to the server before truncating the inode and with
sync_inode() returning prematurely file was sometimes extended back
by an outstanding write after it was truncated.
So modify the test to also check for pages under writeback in
WB_SYNC_ALL mode.
CC: stable@vger.kernel.org # >= 3.5
Fixes: 4f8ad655db
Reported-and-tested-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.
Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Merge patches from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: memcg: do not allow task about to OOM kill to bypass the limit
mm: memcg: fix race condition between memcg teardown and swapin
thp: move preallocated PTE page table on move_huge_pmd()
mfd/rtc: s5m: fix register updating by adding regmap for RTC
rtc: s5m: enable IRQ wake during suspend
rtc: s5m: limit endless loop waiting for register update
rtc: s5m: fix unsuccesful IRQ request during probe
drivers/rtc/rtc-s5m.c: fix info->rtc assignment
include/linux/kernel.h: make might_fault() a nop for !MMU
drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap
procfs: also fix proc_reg_get_unmapped_area() for !MMU case
mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
include/linux/hugetlb.h: make isolate_huge_page() an inline
Commit fad1a86e25 ("procfs: call default get_unmapped_area on
MMU-present architectures"), as its title says, took care of only the
MMU case, leaving the !MMU side still in the regressed state (returning
-EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).
From the fad1a86e25 changelog:
"Commit c4fe244857 ("sparc: fix PCI device proc file mmap(2)") added
proc_reg_get_unmapped_area in proc_reg_file_ops and
proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
get_unmapped_area method is not defined for the target procfs file, which
causes regression of mmap on /proc/vmcore.
To address this issue, like get_unmapped_area(), call default
current->mm->get_unmapped_area on MMU-present architectures if
pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation
in the procfs file, is not defined"
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> [3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes. It was rebased this morning, but
I was just fixing signed-off-by tags with the wrong email"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix access_ok() check in btrfs_ioctl_send()
Btrfs: make sure we cleanup all reloc roots if error happens
Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
Btrfs: fix an oops when doing balance relocation
Btrfs: don't miss skinny extent items on delayed ref head contention
btrfs: call mnt_drop_write after interrupted subvol deletion
Btrfs: don't clear the default compression type
Pull nfsd reply cache bugfix from Bruce Fields:
"One bugfix for nfsd crashes"
* 'for-3.13' of git://linux-nfs.org/~bfields/linux:
nfsd: when reusing an existing repcache entry, unhash it first
When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.
On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).
This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- fix for buffer overrun in agfl with growfs on v4 superblock
- return EINVAL if requested discard length is less than a block
- fix possible memory corruption in xfs_attrlist_by_handle()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJSp2+vAAoJENaLyazVq6ZOvm4QAJs3iPQwrqiBV+9KiHKDrHy3
6/G0XeW1gmWZR7DI2CUtx0Dgjv++y3zaWSR+AaoHtOVDYb2W5rq1dG19bxmBq55I
ZpV2kXsJuz29KEuHIvZqdoSHL+rdKwPejMezABFtUFINjpjZisjlk/FL7czfuhGS
d6OGl8/q4tX06lgZXED/ebHB5Zqboha8DLaL5HrvT0RINSoZwgu0Z0gSz5mh50UM
eKOQJVwZ3DLoO1rhSE8tGjEqJN3GQb4hM6cusrDKQ3fdTHVRQwDvN39z6h/OD0jr
8jxpbbtxzGiPQPnvtf2s26zrWdEZmDZmBOoh0A9acVxB8ZHkaPq462LJDGLWNK5u
SqXCTcWebT7xHoBXb0PhzjPMLpJDD2F90xLFC6PNYd/S2y0hWG+qSPcNxNjRQlN7
MjxL0xe0l0zIsl6WV/zZgWmxkxc7EiRuDPnO3kxkIlxyaH1cXLkkhWxw+3yYFRqm
HjUfpYRTRWr4JodYom/0vRZc5gqXgY+snIFex5a7d5Ukm2m22QeAbqdByuMtxEp+
dLN70Qn/57CAzYYScjd4mP1imFwhL/19Uzaiv/J13cW833jJ1n0EHm89fxKOzDHt
CYHVObph719AmEcoaMFNolyI4EozsxuIzq06v7gwgGSVbOmEqGY5g76hLoyKoxxc
uv5Yo+hoUQHyC6a7oD7r
=SbPd
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
- fix for buffer overrun in agfl with growfs on v4 superblock
- return EINVAL if requested discard length is less than a block
- fix possible memory corruption in xfs_attrlist_by_handle()
* tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs:
xfs: growfs overruns AGFL buffer on V4 filesystems
xfs: don't perform discard if the given range length is less than block size
xfs: underflow bug in xfs_attrlist_by_handle()
The closing parenthesis is in the wrong place. We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
I hit an oops when merging reloc roots fails, the reason is that
new reloc roots may be added and we should make sure we cleanup
all reloc roots.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Quota tree and UUID Tree is only cowed, they can not be snapshoted.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I hit an oops when inserting reloc root into @reloc_root_tree(it can be
easily triggered when forcing cow for relocation root)
[ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]
This is because reloc root inserted into @reloc_root_tree is not within one
transaction,reloc root may be cowed and root block bytenr will be reused then
oops happens.We should update reloc root in @reloc_root_tree when cow reloc
root node, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:
* We do an extent tree lookup and fail to find a skinny extent item;
* As a result, we attempt to see if a non-skinny extent item exists,
either by looking at previous item in the leaf or by doing another
full extent tree search;
* We have a transaction and then we check for a matching delayed ref
head in the transaction's delayed refs rbtree;
* We find such delayed ref head and then we try to lock it with a
call to mutex_trylock();
* The lock was contended so we jump to the label "again", which repeats
the extent tree search but for a non-skinny extent item, because we set
previously metadata variable to 0 and the search key to look for a
non-skinny extent-item;
* After the jump (and after releasing the transaction's delayed refs
lock), a skinny extent item might have been added to the extent tree
but we will miss it because metadata is set to 0 and the search key
is set for a non-skinny extent-item.
The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").
This issue was found while investigating the issue reported at Bugzilla 64961.
David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:
[PATCH] Btrfs: optimize btrfs_lookup_extent_info()
(commit 74be951087)
But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We met a oops caused by the wrong compression type:
[ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95
[ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nodatacow <dev> <mnt>
# touch <mnt>/<file>
# chattr =c <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=1M count=10
It is because we cleared the default compression type when setting the
nodatacow. In fact, we needn't do it because we have used COMPRESS flag to
indicate if we need compressed the file data or not, needn't use the
variant -- compress_type -- in btrfs_info to do the same thing, and just
use it to hold the default compression type. Or we would get a wrong compress
type for a file whose own compress flag is set but the compress flag of its
filesystem is not set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.
The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it end up designating an entry for reuse
and then subsequently freeing the same entry after it finds it. This
leads it to a later use-after-free situation and usually some list
corruption warnings or an oops.
Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.
Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:
for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.
This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.
Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit b7d961b35b)
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed
This issue can be triggered via xfstests/generic/288.
Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9fd013561)
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.
This can only be triggered with CAP_SYS_ADMIN.
Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 071c529eb6)