IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
We want to drop all I/O path locks when recalling layouts, and that includes
i_mutex for the write path. Without this we get stuck processe when recalls
take too long.
[dchinner: fix build with !CONFIG_PNFS]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_attr3_leaf_remove() removes an attribute from an attr leaf block. If
the attribute nameval data happens to be at the start of the nameval
region, a new start offset (firstused) for the region is calculated
(since the region grows from the tail of the block to the start). Once
the new firstused is calculated, it is checked for zero in an apparent
overflow check.
Now that the in-core firstused is 32-bit, overflow is not possible and
this check can be removed. Since the purpose for this check is not
documented and appears to exist since the port to Linux, be conservative
and replace it with an assert.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The on-disk xfs_attr3_leaf_hdr structure firstused field is 16-bit and
subject to overflow when fs block size is 64k. The field is typically
initialized to block size when an attr leaf block is initialized. This
problem is demonstrated by assert failures when running xfstests
generic/117 on an fs with 64k blocks.
To support the existing attr leaf block algorithms for insertion,
rebalance and entry movement, increase the size of the in-core firstused
field to 32-bit and handle the potential overflow on conversion to/from
the on-disk structure. If the overflow condition occurs, set a special
value in the firstused field that is translated back on header read. The
special value is only required in the case of an empty 64k attr block. A
value of zero is used because firstused is initialized to the block size
and grows backwards from there. Furthermore, the attribute block header
occupies the first bytes of the block. Thus, a value of zero has no
other legitimate meaning for this structure. Two new conversion helpers
are created to manage the conversion of firstused to and from disk.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The firstused field of the xfs_attr3_leaf_hdr structure is subject to an
overflow when fs blocksize is 64k. In preparation to handle this
overflow in the header conversion functions, pass the attribute geometry
to the functions that convert the in-core structure to and from the
on-disk structure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There's a bit of a loophole in norecovery mount handling right
now: an initial mount must be readonly, but nothing prevents
a mount -o remount,rw from producing a writable, unrecovered
xfs filesystem.
It might be possible to try to perform a log recovery when this
is requested, but I'm not sure it's worth the effort. For now,
simply disallow this sort of transition.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for XFS.
1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
block number is not the starting block of the extent, split the extent
such that the block number is the starting block of the extent.
4) Shift all the extents which are lying bewteen [offset, last allocated extent]
towards right by len bytes. This step will make a hole of len bytes
at offset.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
added a positive error return value.
This value filters up through the return layers and should be
negative as the other return values are in the same function.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_mru_cache_insert() can be called from within transaction context
during block allocation like so:
write(2)
....
xfs_get_blocks
xfs_iomap_write_direct
start transaction
xfs_bmapi_write
xfs_bmapi_allocate
xfs_bmap_btalloc
xfs_bmap_btalloc_filestreams
xfs_filestream_new_ag
xfs_filestream_pick_ag
xfs_mru_cache_insert
radix_tree_preload(GFP_KERNEL)
In this case, GFP_KERNEL is incorrect and can potentially lead to
deadlocks in memory reclaim. It should use GFP_NOFS allocations to
avoid lock recursion problems.
[dchinner: rewrote commit message]
Signed-off-by: Byoungyoung Lee <blee@gatech.edu>
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use %pS for actual addresses, otherwise you'll get bad output
on arches like ppc64 where %pF expects a function descriptor.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use icnodehdr for struct xfs_da3_icnode_hdr instead of nodehdr
(already declared above).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
new_parent and src_is_directory are only used in 0/1 context.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If xfs_filestream_get_parent() fails, we have a null pip,
goto out, and attempt to IRELE(NULL). This causes a null
pointer dereference and BUG().
Fix this by directly returning NULLAGNUMBER in this case.
Reported-by: Adrien Nader <adrien@notk.org>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This code is redundant now that we have verifiers that sanity check
the buffers as they are read from disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Whiteouts are used by overlayfs - it has a crazy convention that a
whiteout is a character device inode with a major:minor of 0:0.
Because it's not documented anywhere, here's an example of what
RENAME_WHITEOUT does on ext4:
# echo foo > /mnt/scratch/foo
# echo bar > /mnt/scratch/bar
# ls -l /mnt/scratch
total 24
-rw-r--r-- 1 root root 4 Feb 11 20:22 bar
-rw-r--r-- 1 root root 4 Feb 11 20:22 foo
drwx------ 2 root root 16384 Feb 11 20:18 lost+found
# src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar
# ls -l /mnt/scratch
total 20
-rw-r--r-- 1 root root 4 Feb 11 20:22 bar
c--------- 1 root root 0, 0 Feb 11 20:23 foo
drwx------ 2 root root 16384 Feb 11 20:18 lost+found
# cat /mnt/scratch/bar
foo
#
In XFS rename terms, the operation that has been done is that source
(foo) has been moved to the target (bar), which is like a nomal
rename operation, but rather than the source being removed, it have
been replaced with a whiteout.
We can't allocate whiteout inodes within the rename transaction due
to allocation being a multi-commit transaction: rename needs to
be a single, atomic commit. Hence we have several options here, form
most efficient to least efficient:
- use DT_WHT in the target dirent and do no whiteout inode
allocation. The main issue with this approach is that we need
hooks in lookup to create a virtual chardev inode to present
to userspace and in places where we might need to modify the
dirent e.g. unlink. Overlayfs also needs to be taught about
DT_WHT. Most invasive change, lowest overhead.
- create a special whiteout inode in the root directory (e.g. a
".wino" dirent) and then hardlink every new whiteout to it.
This means we only need to create a single whiteout inode, and
rename simply creates a hardlink to it. We can use DT_WHT for
these, though using DT_CHR means we won't have to modify
overlayfs, nor anything in userspace. Downside is we have to
look up the whiteout inode on every operation and create it if
it doesn't exist.
- copy ext4: create a special whiteout chardev inode for every
whiteout. This is more complex than the above options because
of the lack of atomicity between inode creation and the rename
operation, requiring us to create a tmpfile inode and then
linking it into the directory structure during the rename. At
least with a tmpfile inode crashes between the create and
rename doesn't leave unreferenced inodes or directory
pollution around.
By far the simplest thing to do in the short term is to copy ext4.
While it is the most inefficient way of supporting whiteouts, but as
an initial implementation we can simply reuse existing functions and
add a small amount of extra code the the rename operation.
When we get full whiteout support in the VFS (via the dentry cache)
we can then look to supporting DT_WHT method outlined as the first
method of supporting whiteouts. But until then, we'll stick with
what overlayfs expects us to be: dumb and stupid.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Now that xfs_finish_rename() exists, there is no reason for
xfs_cross_rename() to return to xfs_rename() to finish off the
rename transaction. Drive the completion code into
xfs_cross_rename() and handle all errors there so as to simplify
the xfs_rename() code.
Further, push the rename exchange target_ip check to early in the
rename code so as to make the error handling easy and obviously
correct.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Rather than use a jump label for the final transaction commit in
the rename, factor it into a simple helper function and call it
appropriately. This slightly reduces the spaghetti nature of
xfs_rename.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The jump labels are ambiguous and unclear and some of the error
paths are used inconsistently. Rules for error jumps are:
- use out_trans_cancel for unmodified transaction context
- use out_bmap_cancel on ENOSPC errors
- use out_trans_abort when transaction is likely to be dirty.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When doing RENAME_WHITEOUT, we now have to lock 5 inodes into the
rename transaction. This means we need to update
xfs_sort_for_rename() and xfs_lock_inodes() to handle up to 5
inodes. Because of the vagaries of rename, this means we could have
anywhere between 3 and 5 inodes locked into the transaction....
While xfs_lock_inodes() does not need anything other than an assert
telling us we are passing more inodes that we ever thought we should
see, it could do with a logic rework to remove all the indenting.
This is not a functional change - it just makes the code a lot
easier to read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add support to XFS so that time limits can be set through Q_SETINFO
quotactl.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Convert xfs to use ->get_state callback instead of ->get_xstate and
->get_xstatev.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We recently removed deprecated sysctls; may as well
remove deprecated mount options as well, we've stated
that they'd be gone by now in the docs.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Test generic/224 is failing with a corruption being detected on one
of Michael's test boxes. Debug that Michael added is indicating
that the minleft trimming is resulting in an underflow:
.....
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
before goto out_nominleft: rlen 1 args->len 0
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
after fixup: rlen 1 args->len 1
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
after fixup: rlen 4294967295 args->len 4294967295
XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424
The "goto out_nominleft:" indicates that we are getting close to
ENOSPC in the AG, and a couple of allocations later we underflow
and the corruption check fires in xfs_alloc_ag_vextent_size().
The issue is that the extent length fixups comaprisons are done
with variables of xfs_extlen_t types. These are unsigned so an
underflow looks like a really big value and hence is not detected
as being smaller than the minimum length allowed for the extent.
Hence the corruption check fires as it is noticing that the returned
length is longer than the original extent length passed in.
This can be easily fixed by ensuring we do the underflow test on
signed values, the same way xfs_alloc_fix_len() prevents underflow.
So we realise in future that these casts prevent underflows from
going undetected, add comments to the code indicating this.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Tested-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If xfs_trans_reserve fails we don't cancel the transaction,
and we'll leak the allocated transaction pointer.
Spotted by Coverity.
Signed-off-by: Eric Sandeen <ssandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The error messages document the reason for the checks better than the comment
and the comments about volume mounts date back to Irix and so aren't relevant
any more. So just remove the old and redundant comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Today, when the "failing async writes" get ratelimited, we see:
XFS:: 62836 callbacks suppressed
Aside from the extra ":" it's not entirely clear which message is being
suppressed, especially if other messages or ratelimits are happening
at the same time. Clarify this as i.e.:
XFS (dm-11): Failing async write on buffer block 0x140090. Retrying async write.
XFS: Failing async write: 62836 callbacks suppressed
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are times, when doing triage and forensics,
that we would like to know whether a filesystem was unmounted,
or if the plug was pulled without a clean unmount. Log
unmounts at the same level (NOTICE) as we log mounts.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We shouldn't get here with RENAME_EXCHANGE set and no
target_ip, but let's be defensive, because xfs_cross_rename()
will dereference it.
Spotted by Coverity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any
information about which filesystem hit it. Passing in the mp allows
us to print the filesystem (device) name, which is a pretty critical
piece of information.
Tested by running fsfuzzer 'til I hit some.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any
information about which filesystem hit it. Passing in the mp allows
us to print the filesystem (device) name, which is a pretty critical
piece of information.
Tested by running fsfuzzer 'til I hit some.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Al Viro noticed a generic set of issues to do with filehandle lookup
racing with dentry cache setup. They involve a filehandle lookup
occurring while an inode is being created and the filehandle lookup
racing with the dentry creation for the real file. This can lead to
multiple dentries for the one path being instantiated. There are a
host of other issues around this same set of paths.
The underlying cause is that file handle lookup only waits on inode
cache instantiation rather than full dentry cache instantiation. XFS
is mostly immune to the problems discovered due to it's own internal
inode cache, but there are a couple of corner cases where races can
happen.
We currently clear the XFS_INEW flag when the inode is fully set up
after insertion into the cache. Newly allocated inodes are inserted
locked and so aren't usable until the allocation transaction
commits. This, however, occurs before the dentry and security
information is fully initialised and hence the inode is unlocked and
available for lookups to find too early.
To solve the problem, only clear the XFS_INEW flag for newly created
inodes once the dentry is fully instantiated. This means lookups
will retry until the XFS_INEW flag is removed from the inode and
hence avoids the race conditions in questions.
THis also means that xfs_create(), xfs_create_tmpfile() and
xfs_symlink() need to finish the setup of the inode in their error
paths if we had allocated the inode but failed later in the creation
process. xfs_symlink(), in particular, needed a lot of help to make
it's error handling match that of xfs_create().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
A new fsync vs power fail test in xfstests indicated that XFS can
have unreliable data consistency when doing extending truncates that
require block zeroing. The blocks beyond EOF get zeroed in memory,
but we never force those changes to disk before we run the
transaction that extends the file size and exposes those blocks to
userspace. This can result in the blocks not being correctly zeroed
after a crash.
Because in-memory behaviour is correct, tools like fsx don't pick up
any coherency problems - it's not until the filesystem is shutdown
or the system crashes after writing the truncate transaction to the
journal but before the zeroed data in the page cache is flushed that
the issue is exposed.
Fix this by also flushing the dirty data in memory region between
the old size and new size when we've found blocks that need zeroing
in the truncate process.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
For filesystems without separate project quota inode field in the
superblock we just reuse project quota file for group quotas (and vice
versa) if project quota file is allocated and we need group quota file.
When we reuse the file, quota structures on disk suddenly have wrong
type stored in d_flags though. Nobody really cares about this (although
structure type reported to userspace was wrong as well) except
that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) assertion in
xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).
Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.
CC: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Extent swap operations are another extent manipulation operation
that we need to ensure does not race against mmap page faults. The
current code returns if the file is mapped prior to the swap being
done, but it could potentially race against new page faults while
the swap is in progress. Hence we should use the XFS_MMAPLOCK_EXCL
for this operation, too.
While there, fix the error path handling that can result in double
unlocks of the inodes when cancelling the swapext transaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>