IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip(). Prevent I/Os to files
tagged with the DAX flag from falling back to buffered I/O.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 9bd0f45b7037fcfa8b575c7e27d0431d6e6dc3bb.
Linus rightly pointed out that I failed to initialize the counters
when adding them, so they don't work as expected. Just revert this
patch for now.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Includes of pnfs.h in export.c and fcntl.c also bring in xdr4.h, which
won't build without CONFIG_NFSD_V3, breaking non-V3 builds. Ifdef-out
most of pnfs.h in that case.
Reported-by: Bas Peters <baspeters93@gmail.com>
Reported-by: Jim Davis <jim.epost@gmail.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 9cf514ccfac "nfsd: implement pNFS operations"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Recall all outstanding pNFS layouts and truncates, writes and similar extent
list modifying operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add operations to export pNFS block layouts from an XFS filesystem. See
the previous commit adding the operations for an explanation of them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Really tiny set of patches for this kernel. Nothing major, all
described in the shortlog and have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlTgtIAACgkQMUfUDdst+ymjSwCfWspNT71lmsVwasCTPQopgXov
TqAAoKR4I5ZebMks/nW6ClxUFYwVSL02
=leVc
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg KH:
"Really tiny set of patches for this kernel. Nothing major, all
described in the shortlog and have been in linux-next for a while"
* tag 'driver-core-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
sysfs: fix warning when creating a sysfs group without attributes
firmware_loader: handle timeout via wait_for_completion_interruptible_timeout()
firmware_loader: abort request if wait_for_completion is interrupted
firmware: Correct function name in comment
device: Change dev_<level> logging functions to return void
device: Fix dev_dbg_once macro
Pull UBI and UBIFS updates from Richard Weinberger:
- cleanups and bug fixes all over UBI and UBIFS
- block-mq support for UBI Block
- UBI volumes can now be renamed while they are in use
- security.* XATTR support for UBIFS
- a maintainer update
* 'for-linus-v3.20' of git://git.infradead.org/linux-ubifs:
UBI: block: Fix checking for NULL instead of IS_ERR()
UBI: block: Continue creating ubiblocks after an initialization error
UBIFS: return -EINVAL if log head is empty
UBI: Block: Explain usage of blk_rq_map_sg()
UBI: fix soft lockup in ubi_check_volume()
UBI: Fastmap: Care about the protection queue
UBIFS: add a couple of extra asserts
UBI: do propagate positive error codes up
UBI: clean-up printing helpers
UBI: extend UBI layer debug/messaging capabilities - cosmetics
UBIFS: add ubifs_err() to print error reason
UBIFS: Add security.* XATTR support for the UBIFS
UBIFS: Add xattr support for symlinks
UBI: Block: Add blk-mq support
UBI: Add initial support for scatter gather
UBI: rename_volumes: Use UBI_METAONLY
UBI: Implement UBI_METAONLY
Add myself as UBI co-maintainer
Commit 4f579ae7de56 (ext4: fix punch hole on files with indirect
mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect
mapping. However, there are bugs in several corner cases. This fixes 5
distinct bugs:
1. When there is at least one entire level of indirection between the
start and end of the punch range and the end of the punch range is the
first block of its level, we can't return early; we have to free the
intervening levels.
2. When the end is at a higher level of indirection than the start and
ext4_find_shared returns a top branch for the end, we still need to free
the rest of the shared branch it returns; we can't decrement partial2.
3. When a punch happens within one level of indirection, we need to
converge on an indirect block that contains the start and end. However,
because the branches returned from ext4_find_shared do not necessarily
start at the same level (e.g., the partial2 chain will be shallower if
the last block occurs at the beginning of an indirect group), the walk
of the two chains can end up "missing" each other and freeing a bunch of
extra blocks in the process. This mismatch can be handled by first
making sure that the chains are at the same level, then walking them
together until they converge.
4. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the start, we must free it,
but only if the end does not occur within that branch.
5. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the end, then we shouldn't
free the block referenced by the end of the returned chain (this mirrors
the different levels case).
Signed-off-by: Omar Sandoval <osandov@osandov.com>
If we are recording in the tree log that an inode has new names (new hard
links were added), we would drop items, belonging to the inode, that we
shouldn't:
1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
flags, we ended up dropping all the extent and xattr items that were
previously logged. This was done only in memory, since logging a new
name doesn't imply syncing the log;
2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
flags, we ended up dropping all the xattr items that were previously
logged. Like the case before, this was done only in memory because
logging a new name doesn't imply syncing the log.
This led to some surprises in scenarios such as the following:
1) write some extents to an inode;
2) fsync the inode;
3) truncate the inode or delete/modify some of its xattrs
4) add a new hard link for that inode
5) fsync some other file, to force the log tree to be durably persisted
6) power failure happens
The next time the fs is mounted, the fsync log replay code is executed,
and the resulting file doesn't have the content it had when the last fsync
against it was performed, instead if has a content matching what it had
when the last transaction commit happened.
So change the behaviour such that when a new name is logged, only the inode
item and reference items are processed.
This is easy to reproduce with the test I just made for xfstests, whose
main body is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with some data.
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Make sure the file is durably persisted.
sync
# Append some data to our file, to increase its size.
$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Fsync the file, so from this point on if a crash/power failure happens, our
# new data is guaranteed to be there next time the fs is mounted.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Now shrink our file to 5000 bytes.
$XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo
# Now do an expanding truncate to a size larger than what we had when we last
# fsync'ed our file. This is just to verify that after power failure and
# replaying the fsync log, our file matches what it was when we last fsync'ed
# it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
# data had a value of 0xcc.
$XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo
# Add one hard link to our file. This made btrfs drop all of our file's
# metadata from the fsync log, including the metadata relative to the
# extent we just wrote and fsync'ed. This change was made only to the fsync
# log in memory, so adding the hard link alone doesn't change the persisted
# fsync log. This happened because the previous truncates set the runtime
# flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now make sure the in memory fsync log is durably persisted.
# Creating and fsync'ing another file will do it.
# After this our persisted fsync log will no longer have metadata for our file
# foo that points to the extent we wrote and fsync'ed before.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# As expected, before the crash/power failure, we should be able to see a file
# with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
# the remaining bytes with value 0x00.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After mounting the fs again, the fsync log was replayed.
# The expected result is to see a file with a size of 12Kb, with its first 8Kb
# of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
# The btrfs bug used to leave the file as it used te be as of the last
# transaction commit - that is, with a size of 8Kb with all bytes having a
# value of 0xaa.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
The test case for xfstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).
This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.
This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create one file with data and fsync it.
# This made the btrfs fsync log persist the data and the inode metadata with
# a correct inode->i_size (4096 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now add one hard link to our file. This made the btrfs code update the fsync
# log, in memory only, with an inode metadata having a size of 0.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now force persistence of the fsync log to disk, for example, by fsyncing some
# other file.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# Before a power loss or crash, we could read the 4Kb of data from our file as
# expected.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After the fsync log replay, because the fsync log had a value of 0 for our
# inode's i_size, we couldn't read anymore the 4Kb of data that we previously
# wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with some data.
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Make sure the file is durably persisted.
sync
# Append some data to our file, to increase its size.
$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Fsync the file, so from this point on if a crash/power failure happens, our
# new data is guaranteed to be there next time the fs is mounted.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Add one hard link to our file. This made btrfs write into the in memory fsync
# log a special inode with generation 0 and an i_size of 0 too. Note that this
# didn't update the inode in the fsync log on disk.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now make sure the in memory fsync log is durably persisted.
# Creating and fsync'ing another file will do it.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# As expected, before the crash/power failure, we should be able to read the
# 12Kb of file data.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After mounting the fs again, the fsync log was replayed.
# The btrfs fsync log replay code didn't update the i_size of the persisted
# inode because the inode item in the log had a special generation with a
# value of 0 (and it couldn't know the correct i_size, since that inode item
# had a 0 i_size too). This made the last 4Kb of file data inaccessible and
# effectively lost.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:
Btrfs: Add a write ahead tree log to optimize synchronous operations
(commit e02119d5a7b4396c5a872582fddc8bd6d305a70a)
Test cases for xfstests follow soon.
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We do this to get the space accounting, but this is just needless churn on the
io_tree, so just drop setting/clearing delalloc and just drop the reserved data
space when we have a successfull allocation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have this weird dance where we always inc outstanding_extents when we do a
O_DIRECT write, even if we allocate the entire range. To get around this we
also drop the metadata space if we successfully write. This is an unnecessary
dance, we only need to jack up outstanding_extents if we don't satisfy the
entire range request in get_blocks_direct, otherwise we are good using our
original reservation. So drop the unconditional inc and the drop of the
metadata space that we have for the unconditional inc. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.
Steps to reproduce:
1: Create a single-dev btrfs fs with default option
2: Write a file into it to take up most fs space
3: Delete above file
4: Wait about 100s to let chunk removed
5: goto 2
Script is like following:
#!/bin/bash
# Recommend 1.2G space, too large disk will make test slow
DEV="/dev/sda16"
MNT="/mnt/tmp"
dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
echo "mkfs $DEV"
mkfs.btrfs -f "$DEV" >/dev/null || exit 2
echo "mount $DEV $MNT"
mount "$DEV" "$MNT" || exit 2
for ((loop_i = 0; loop_i < 20; loop_i++)); do
echo
echo "loop $loop_i"
echo "dd file..."
cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
"${cmd[@]}" 2>/dev/null || {
# NO_SPACE error triggered
echo "dd failed: ${cmd[*]}"
exit 1
}
echo "rm file..."
rm -f "$MNT"/file0 || exit 2
for ((i = 0; i < 10; i++)); do
df "$MNT" | tail -1
sleep 10
done
done
Reason:
It is triggered by commit: 47ab2a6c689913db23ccae38349714edf8365e0a
which is used to remove empty block groups automatically, but the
reason is not in that patch. Code before works well because btrfs
don't need to create and delete chunks so many times with high
complexity.
Above bug is caused by many reason, any of them can trigger it.
Reason1:
When we remove some continuous chunks but leave other chunks after,
these disk space should be used by chunk-recreating, but in current
code, only first create will successed.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason2:
contains_pending_extent() return wrong value in calculation.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason3:
btrfs_check_data_free_space() try to commit transaction and retry
allocating chunk when the first allocating failed, but space_info->full
is set in first allocating, and prevent second allocating in retry.
Fixed in this patch by clear space_info->full in commit transaction.
Tested for severial times by above script.
Changelog v3->v4:
use light weight int instead of atomic_t to record have_remove_bgs in
transaction, suggested by:
Josef Bacik <jbacik@fb.com>
Changelog v2->v3:
v2 fixed the bug by adding more commit-transaction, but we
only need to reclaim space when we are really have no space for
new chunk, noticed by:
Filipe David Manana <fdmanana@gmail.com>
Actually, our code already have this type of commit-and-retry,
we only need to make it working with removed-bgs.
v3 fixed the bug with above way.
Changelog v1->v2:
v1 will introduce a new bug when delete and create chunk in same disk
space in same transaction, noticed by:
Filipe David Manana <fdmanana@gmail.com>
V2 fix this bug by commit transaction after remove block grops.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
My previous patch "Btrfs: fix scrub race leading to use-after-free"
introduced the possibility to sleep in an atomic context, which happens
when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
is called - this function can be called under an atomic context.
Chris ran into this in a debug kernel which gave the following trace:
[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G W 3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207] ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
[ 1929.037267] ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
[ 1929.052315] 0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344] <IRQ> [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
[ 1929.083968] [<ffffffff810856c4>] ___might_sleep+0x174/0x230
[ 1929.095352] [<ffffffff810857d2>] __might_sleep+0x52/0x90
[ 1929.106223] [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951] [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708] [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370] [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191] [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.167382] [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161] [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318] [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.204496] [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
[ 1929.216569] [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
[ 1929.229176] [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
[ 1929.240740] [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
[ 1929.252654] [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
[ 1929.264725] [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
[ 1929.276635] [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
[ 1929.288014] [<ffffffff8105d92e>] __do_softirq+0xde/0x600
[ 1929.298885] [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
(...)
Fix this by using a reference count on the scrub context structure
instead of locking the scrub_lock mutex.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We need to manually unpoison rounded up allocation size for dname to avoid
kasan's reports in dentry_string_cmp(). When CONFIG_DCACHE_WORD_ACCESS=y
dentry_string_cmp may access few bytes beyound requested in kmalloc()
size.
dentry_string_cmp() relates on that fact that dentry allocated using
kmalloc and kmalloc internally round up allocation size. So this is not a
bug, but this makes kasan to complain about such accesses. To avoid such
reports we mark rounded up allocation size in shadow as accessible.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After waking up a task waiting for an event, we explicitly mark it as
TASK_RUNNING (which is necessary as we do the checks for wakeups as
TASK_INTERRUPTIBLE). Once running and dealing with actually delivering
the events, we're obviously not planning on calling schedule, thus we can
relax the implied barrier and simply update the state with
__set_current_state().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that all bitmap formatting usages have been converted to
'%*pb[l]', the separate formatting functions are unnecessary. The
following functions are removed.
* bitmap_scn[list]printf()
* cpumask_scnprintf(), cpulist_scnprintf()
* [__]nodemask_scnprintf(), [__]nodelist_scnprintf()
* seq_bitmap[_list](), seq_cpumask[_list](), seq_nodemask[_list]()
* seq_buf_bitmask()
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VFS frequently performs duplication of strings located in read-only memory
section. Replacing kstrdup by kstrdup_const allows to avoid such
operations.
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
making a separate copy of its name. It's currently only used for sysfs
attributes whose filenames are required to stay accessible and unchanged.
There are rare exceptions where these names are allocated and formatted
dynamically but for the vast majority of cases they're consts in the
rodata section.
Now that kernfs is converted to use kstrdup_const() and kfree_const(),
there's little point in keeping KERNFS_STATIC_NAME around. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysfs frequently performs duplication of strings located in read-only
memory section. Replacing kstrdup by kstrdup_const allows to avoid such
operations.
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 411a99adffb4f (nfs: clear_request_commit while holding i_lock)
assumes that the nfs_commit_info always points to the inode->i_lock.
For historical reasons, that is not the case for O_DIRECT writes.
Cc: Weston Andros Adamson <dros@primarydata.com>
Fixes: 411a99adffb4f ("nfs: clear_request_commit while holding i_lock")
Cc: stable@vger.kernel.org # 3.17.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
As of v3.18, ext4 started rejecting a remount which changes the
journal_checksum option.
Prior to that, it was simply ignored; the problem here is that
if someone has this in their fstab for the root fs, now the box
fails to boot properly, because remount of root with the new options
will fail, and the box proceeds with a readonly root.
I think it is a little nicer behavior to accept the option, but
warn that it's being ignored, rather than failing the mount,
but that might be a subjective matter...
Reported-by: Cónräd <conradsand.arma@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
rejection of, changing journal_checksum during remount. One suffices.
While we're at it, remove old comment about the "check" option
which has been deprecated for some time now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit 90a8020 and d6320cb, Jan Kara has fixed this issue partially.
This mmap data corruption still exists in nodelalloc mode, fix this.
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Add a rocompat feature, "readonly" to mark a FS image as read-only.
The feature prevents the kernel and e2fsprogs from changing the image;
the flag can be toggled by tune2fs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull f2fs updates from Jaegeuk Kim:
"Major changes are to:
- add f2fs_io_tracer and F2FS_IOC_GETVERSION
- fix wrong acl assignment from parent
- fix accessing wrong data blocks
- fix wrong condition check for f2fs_sync_fs
- align start block address for direct_io
- add and refactor the readahead flows of FS metadata
- refactor atomic and volatile write policies
But most of patches are for clean-ups and minor bug fixes. Some of
them refactor old code too"
* tag 'for-f2fs-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (64 commits)
f2fs: use spinlock for segmap_lock instead of rwlock
f2fs: fix accessing wrong indexed data blocks
f2fs: avoid variable length array
f2fs: fix sparse warnings
f2fs: allocate data blocks in advance for f2fs_direct_IO
f2fs: introduce macros to convert bytes and blocks in f2fs
f2fs: call set_buffer_new for get_block
f2fs: check node page contents all the time
f2fs: avoid data offset overflow when lseeking huge file
f2fs: fix to use highmem for pages of newly created directory
f2fs: introduce a batched trim
f2fs: merge {invalidate,release}page for meta/node/data pages
f2fs: show the number of writeback pages in stat
f2fs: keep PagePrivate during releasepage
f2fs: should fail mount when trying to recover data on read-only dev
f2fs: split UMOUNT and FASTBOOT flags
f2fs: avoid write_checkpoint if f2fs is mounted readonly
f2fs: support norecovery mount option
f2fs: fix not to drop mount options when retrying fill_super
f2fs: merge flags in struct f2fs_sb_info
...
Merge third set of updates from Andrew Morton:
- the rest of MM
[ This includes getting rid of the numa hinting bits, in favor of
just generic protnone logic. Yay. - Linus ]
- core kernel
- procfs
- some of lib/ (lots of lib/ material this time)
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits)
lib/lcm.c: replace include
lib/percpu_ida.c: remove redundant includes
lib/strncpy_from_user.c: replace module.h include
lib/stmp_device.c: replace module.h include
lib/sort.c: move include inside #if 0
lib/show_mem.c: remove redundant include
lib/radix-tree.c: change to simpler include
lib/plist.c: remove redundant include
lib/nlattr.c: remove redundant include
lib/kobject_uevent.c: remove redundant include
lib/llist.c: remove redundant include
lib/md5.c: simplify include
lib/list_sort.c: rearrange includes
lib/genalloc.c: remove redundant include
lib/idr.c: remove redundant include
lib/halfmd4.c: simplify includes
lib/dynamic_queue_limits.c: simplify includes
lib/sort.c: use simpler includes
lib/interval_tree.c: simplify includes
hexdump: make it return number of bytes placed in buffer
...
If an attacker can cause a controlled kernel stack overflow, overwriting
the restart block is a very juicy exploit target. This is because the
restart_block is held in the same memory allocation as the kernel stack.
Moving the restart block to struct task_struct prevents this exploit by
making the restart_block harder to locate.
Note that there are other fields in thread_info that are also easy
targets, at least on some architectures.
It's also a decent simplification, since the restart code is more or less
identical on all architectures.
[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of custom approach let's use string_escape_str() to escape a given
string (task_name in this case).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The output of /proc/$pid/numa_maps is in terms of number of pages like
anon=22 or dirty=54. Here's some output:
7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
7fff8d425000 default stack anon=50 dirty=50 N0=50
Looks like we have a stack and a couple of anonymous hugetlbfs
areas page which both use the same amount of memory. They don't.
The 'bigfile' uses 1GB pages and takes up ~50GB of space. The
anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
uses normal 4k pages. You can go over to smaps to figure out what the
page size _really_ is with KernelPageSize or MMUPageSize. But, I think
this is a pretty nasty and counterintuitive interface as it stands.
This patch introduces 'kernelpagesize_kB' line element to
/proc/<pid>/numa_maps report file in order to help identifying the size of
pages that are backing memory areas mapped by a given task. This is
specially useful to help differentiating between HUGE and GIGANTIC page
backed VMAs.
This patch is based on Dave Hansen's proposal and reviewer's follow-ups
taken from the following dicussion threads:
* https://lkml.org/lkml/2011/9/21/454
* https://lkml.org/lkml/2014/12/20/66
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the PDE() helper to get proc_dir_entry instead of coding it directly.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs. The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.
[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak <petrcermak@chromium.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Primiano Tucci <primiano@chromium.org>
Cc: Petr Cermak <petrcermak@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the isolate callback passed to the list_lru_walk family of
functions is supposed to just delete an item from the list upon returning
LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
__list_lru_walk_one after the callback returns. Since the callback is
allowed to drop the lock after removing an item (it has to return
LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
of elements on the list even if we check them under the lock. This makes
it difficult to move items from one list_lru_one to another, which is
required for per-memcg list_lru reparenting - we can't just splice the
lists, we have to move entries one by one.
This patch therefore introduces helpers that must be used by callback
functions to isolate items instead of raw list_del/list_move. These are
list_lru_isolate and list_lru_isolate_move. They not only remove the
entry from the list, but also fix the nr_items counter, making sure
nr_items always reflects the actual number of elements on the list if
checked under the appropriate lock.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In super_cache_scan() we divide the number of objects of particular type
by the total number of objects in order to distribute pressure among As a
result, in some corner cases we can get nr_to_scan=0 even if there are
some objects to reclaim, e.g. dentries=1, inodes=1, fs_objects=1,
nr_to_scan=1/3=0.
This is unacceptable for per memcg kmem accounting, because this means
that some objects may never get reclaimed after memcg death, preventing it
from being freed.
This patch therefore assures that super_cache_scan() will scan at least
one object of each type if any.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, to make any list_lru-based shrinker memcg aware we should only
initialize its list_lru as memcg aware. Let's do it for the general FS
shrinker (super_block::s_shrink).
There are other FS-specific shrinkers that use list_lru for storing
objects, such as XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that are shared among different cgroups, there is no point
making them memcg aware. It's a big question whether we should account
them to memcg at all.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make list_lru memcg aware, we need all list_lrus to be kept on a list
protected by a mutex, so that we could sleep while walking over the
list.
Therefore after this change list_lru_destroy may sleep. Fortunately,
there is only one user that calls it from an atomic context - it's
put_super - and we can easily fix it by calling list_lru_destroy before
put_super in destroy_locked_super - anyway we don't longer need lrus by
that time.
Another point that should be noted is that list_lru_destroy is allowed
to be called on an uninitialized zeroed-out object, in which case it is
a no-op. Before this patch this was guaranteed by kfree, but now we
need an explicit check there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
set, it will be called per memory cgroup. The memory cgroup to scan
objects from is passed in shrink_control->memcg. If the memory cgroup
is NULL, a memcg aware shrinker is supposed to scan objects from the
global list. Unaware shrinkers are only called on global pressure with
memcg=NULL.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to make FS shrinkers memcg-aware. To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support. That means when we hit the limit we will get ENOMEM w/o any
chance to recover. What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup. This is what
this patch set is intended to do.
Basically, it does two things. First, it introduces the notion of
per-memcg slab shrinker. A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
passed the memory cgroup to scan from in shrink_control->memcg. For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.
Secondly, this patch set makes the list_lru structure per-memcg. It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru. Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to. This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.
As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit. Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.
This patch (of 9):
NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:
count = list_lru_count_node(lru, sc->nid);
freed = list_lru_walk_node(lru, sc->nid, isolate_func,
isolate_arg, &sc->nr_to_scan);
where sc is an instance of the shrink_control structure passed to it
from vmscan.
To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.
This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJU3NR8AAoJEDaohF61QIxkKsMP/Rv/2+WasvgZDH8I/c0YoWmJ
XHdMaUGOtFCQP5TezJPUrbqpXnEJrTU9RBMre7J7Z7FPMBqg4mPS++yPwYnkY1zx
LjO57Mj1y4eEPBNcfCcBpY+RlMlr8+Ts6SRxqDRlHjzIpXNnEHej03LcbvJg31gp
ukFVJwWH9jsvxloBwcrgEgEbzBS3cFhCPliGJa5VhvFHK8DcIpHNzA1mwhYzr8NX
5aO/Enbt8TFdoV3Oai9PiLqFJpF2C8NqhxfHOk6M3sosN0QGMb62NZMUySsKpCl6
Q+0QZOgKVEQkzdwob6kA2iMa5ufrsDpu2xe9w7vSgPieWThQFqWsSfX5/9h5hJyL
Y1qS9NhYPPxHOtSORdBIW9txwg+mYg6dL+xy8oPeNzAMi0f4PYY+7zASMY+f59YA
aICuCobCIL2gb0pJf6/83oSb6wbIvPeN3kdAIMBUYNcRaGvI36fhFJaRF4FPs0i4
DwVhu/GDlL7/sSAqQUWnDKzR3lrHBu3YrwcwQMjA0zWhsYhtH+NUnAqm+nZS0mLI
aYtoZhry6ykfObRMfFQcOAsAUcVYKOpMnX2nt9aAG66xXhab9Fo4cDlEZjHxM5Lr
0rHOiWOmLZkbfKQ9ZVpNjSV2skIzAeGA/RG5RPR1JWDAnkcqr67/MdC0YbkIyNaG
pmCIOjq/3ivGAhJdTXgp
=QbrB
-----END PGP SIGNATURE-----
Merge tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A couple cleanups for jfs"
* tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy:
jfs: Deletion of an unnecessary check before the function call "unload_nls"
jfs: get rid of homegrown endianness helpers