IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Convert XFS to the new truncate sequence. We still can have errors after
updating the file size in xfs_setattr, but these are real I/O errors and lead
to a transaction abort and filesystem shutdown, so they are not an issue.
Errors from ->write_begin and write_end can now be handled correctly because
we can actually get rid of the delalloc extents while previous the buffer
state was stipped in block_invalidatepage.
There is still no error handling for ->direct_IO, because doing so will need
some major restructuring given that we only have the iolock shared and do not
hold i_mutex at all. Fortunately leaving the normally allocated blocks behind
there is not a major issue and this will get cleaned up by xfs_free_eofblock
later.
Note: the patch is against Al's vfs.git tree as that contains the nessecary
preparations. I'd prefer to get it applied there so that we can get some
testing in linux-next.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the call to vmtruncate to get rid of accessive blocks to the callers
in preparation of the new truncate sequence and rename the non-truncating
version to block_write_begin.
While we're at it also remove several unused arguments to block_write_begin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence. This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Our current handling of direct I/O completions is rather suboptimal,
because we defer it to a workqueue more often than needed, and we
perform a much to aggressive flush of the workqueue in case unwritten
extent conversions happen.
This patch changes the direct I/O reads to not even use a completion
handler, as we don't bother to use it at all, and to perform the unwritten
extent conversions in caller context for synchronous direct I/O.
For a small I/O size direct I/O workload on a consumer grade SSD, such as
the untar of a kernel tree inside qemu this patch gives speedups of
about 5%. Getting us much closer to the speed of a native block device,
or a fully allocated XFS file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
If we write into an unwritten extent using AIO we need to complete the AIO
request after the extent conversion has finished. Without that a read could
race to see see the extent still unwritten and return zeros. For synchronous
I/O we already take care of that by flushing the xfsconvertd workqueue (which
might be a bit of overkill).
To do that add iocb and result fields to struct xfs_ioend, so that we can
call aio_complete from xfs_end_io after the extent conversion has happened.
Note that we need a new result field as io_error is used for positive errno
values, while the AIO code can return negative error values and positive
transfer sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited. That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.
This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
The b_strat callback is used by xfs_buf_iostrategy to perform additional
checks before submitting a buffer. It is used in xfs_bwrite and when
writing out delayed buffers. In xfs_bwrite it we can de-virtualize the
call easily as b_strat is set a few lines above the call to
xfs_buf_iostrategy. For the delayed buffers the rationale is a bit
more complicated:
- there are three callers of xfs_buf_delwri_queue, which places buffers
on the delwri list:
(1) xfs_bdwrite - this sets up b_strat, so it's fine
(2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set:
- xlog_bdstrat is only used for log buffers, which are never delwri
- _xfs_buf_read explicitly clears the delwri flag
- xfs_buf_iodone_work retries log buffers only
- xfsbdstrat - only used for reads, superblock writes without the
delwri flag, log I/O and file zeroing with explicitly allocated
buffers.
- xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
not set
(3) xfs_buf_unlock
- only puts the buffer on the delwri list if the DELWRI flag is
already set. The DELWRI flag is only ever set in xfs_bwrite,
xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For
xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
an initialized buf item, which means b_strat was set to
xfs_bdstrat_cb in xfs_buf_item_init.
Conclusion: we can just get rid of the callback and replace it with
explicit calls to xfs_bdstrat_cb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op. Warn the users about this and remove
the mount flag for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined
to NULL - which breaks the code attempting to add a tracepoint
on this function.
Only define the tracepoint when the function exists.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Avoid a lockdep warning by preventing page cache allocation from
recursing back into the filesystem during memory reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_ireclaim has to get and put te pag structure because it is only
called with the inode to reclaim. The one caller of this function
already has a reference on the pag and a pointer to is, so move the
radix tree delete to the caller and remove xfs_ireclaim completely.
This avoids a xfs_perag_get/put on every inode being reclaimed.
The overhead was noticed in a bug report at:
https://bugzilla.kernel.org/show_bug.cgi?id=16348
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_buf_read() fails to detect dispatch errors before attempting to
wait on sychronous IO. If there was an error, it will get stuck
forever, waiting for an I/O that was never started. Make sure the
error is detected correctly.
Further, such a failure can leave locked pages in the page cache
which will cause a later operation to hang on the page. Ensure that
we correctly process pages in the buffers when we get a dispatch
error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
I missed Dave Chinner's second revision of this change, and pushed
his first version out to the repository instead.
commit a476c59ebb279d738718edc0e3fb76aab3687114
Author: Dave Chinner <dchinner@redhat.com>
This commit compensates for that by moving a block of code up a bit
further, with a result that matches the the effect of Dave's second
version.
Dave's first version was:
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Dave's second version was:
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
The open_exec file operation is only added by the external dmapi
patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
These days we always have buffers thanks to ->page_mkwrite. And we
already have an assert a few lines above tripping in case that was
not true due to a bug.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we don't remove the XFS mount from the shrinker list until
late in the unmount path. By this time, we have already torn down
the internals of the filesystem (e.g. the per-ag structures), and
hence if the shrinker is executed between the teardown and the
unregistering, the shrinker will get NULL per-ag structure pointers
and panic trying to dereference them.
Fix this by removing the xfs mount from the shrinker list before
tearing down it's internal structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Replace the xfs_itrace_entry catchall with specific trace points. For
most simple callers we now use the simple inode class, which used to
be the iget class, but add more details tracing for namespace events,
which now includes the name of the directory entries manipulated.
Remove the xfs_inactive trace point, which is a duplicate of the clear_inode
one, and the xfs_change_file_space trace point, which is immediately
followed by the more specific alloc/free space trace points.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We never get an i_mode of 0 or a locked VFS inode until we pass in the
XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to
xfs_iput for the only caller. In addition to that xfs_nfs_get_inode
does not even need to lock the inode given that the generation never changes
for a life inode, so just pass a 0 lock_flags to xfs_iget and release
the inode using IRELE in the error path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced.
Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining
of the xfs_iget_cache_hit/miss functions. Add a new xfs_iget_reclaim_fail
tracepoint for the case where we fail to re-initialize a VFS inode,
and add a second instance of the xfs_iget_skip tracepoint for the case
of a failed igrab() call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The tracing code can't print flags defined as enums. Most flags that
we want to print are defines as macros already, but move the few remaining
ones over to make the trace output more useful.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
On the final put of a superblock the VFS already calls sync_filesystem
for us to write out all data and wait for it. No need to start another
asynchronous writeback inside ->put_super.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the flags argument to __xfs_get_blocks as we can easily derive
it from the direct argument, and remove the unused BMAPI_MMAP flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We already rely on the fact that the sync code will cause a synchronous
log force later on (currently via xfs_fs_sync_fs -> xfs_quiesce_data ->
xfs_sync_data), so no need to do this here. This allows us to avoid
a lot of synchronous log forces during sync, which pays of especially
with delayed logging enabled. Some compilebench numbers that show
this:
xfs (delayed logging, 256k logbufs)
===================================
intial create 25.94 MB/s 25.75 MB/s 25.64 MB/s
create 8.54 MB/s 9.12 MB/s 9.15 MB/s
patch 2.47 MB/s 2.47 MB/s 3.17 MB/s
compile 29.65 MB/s 30.51 MB/s 27.33 MB/s
clean 90.92 MB/s 98.83 MB/s 128.87 MB/s
read tree 11.90 MB/s 11.84 MB/s 8.56 MB/s
read compiled 28.75 MB/s 29.96 MB/s 24.25 MB/s
delete tree 8.39 seconds 8.12 seconds 8.46 seconds
delete compiled 8.35 seconds 8.44 seconds 5.11 seconds
stat tree 6.03 seconds 5.59 seconds 5.19 seconds
stat compiled tree 9.00 seconds 9.52 seconds 8.49 seconds
xfs + write_inode log_force removal
===================================
intial create 25.87 MB/s 25.76 MB/s 25.87 MB/s
create 15.18 MB/s 14.80 MB/s 14.94 MB/s
patch 3.13 MB/s 3.14 MB/s 3.11 MB/s
compile 36.74 MB/s 37.17 MB/s 36.84 MB/s
clean 226.02 MB/s 222.58 MB/s 217.94 MB/s
read tree 15.14 MB/s 15.02 MB/s 15.14 MB/s
read compiled tree 29.30 MB/s 29.31 MB/s 29.32 MB/s
delete tree 6.22 seconds 6.14 seconds 6.15 seconds
delete compiled tree 5.75 seconds 5.92 seconds 5.81 seconds
stat tree 4.60 seconds 4.51 seconds 4.56 seconds
stat compiled tree 4.07 seconds 3.87 seconds 3.96 seconds
In addition to that also remove the delwri inode flush that is unessecary
now that bulkstat is always coherent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The writepage implementation in XFS still tries to deal with dirty but
unmapped buffers which used to caused by writes through shared mmaps. Since
the introduction of ->page_mkwrite these can't happen anymore, so remove the
code dealing with them.
Note that the all_bh variable which causes us to start I/O on all buffers on
the pages was controlled by the count of unmapped buffers, which also
included those not actually dirty. It's now unconditionally initialized to
0 but set to 1 for the case of small file size extensions. It probably can
be removed entirely, but that's left for another patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently the xfs releasepage implementation has code to deal with converting
delayed allocated and unwritten space. But we never get called for those as
we always convert delayed and unwritten space when cleaning a page, or drop
the state from the buffers in block_invalidatepage. We still keep a WARN_ON
on those cases for now, but remove all the case dealing with it, which allows
to fold xfs_page_state_convert into xfs_vm_writepage and remove the !startio
case from the whole writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfstests 194 first truncats a file back and then extends it again by
truncating it to a larger size. This causes discard_buffer to drop
the mapped, but not the uptodate bit and thus creates something that
xfs_page_state_convert takes for unmapped space created by mmap because
it doesn't check for the dirty bit, which also gets cleared by
discard_buffer and checked by other ->writepage implementations like
block_write_full_page. Handle this kind of buffers early, and unlike
Eric's first version of the patch simply ASSERT that the buffers is
dirty, given that the mmap write case can't happen anymore since the
introduction of ->page_mkwrite. The now dead code dealing with that
will be deleted in a follow on patch.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This code was introduced four years ago in commit
3e57ecf640428c01ba1ed8c8fc538447ada1715b without any review and has
been unused since. Remove it just as the rest of the code introduced
in that commit to reduce that stack usage and complexity in this central
piece of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
joining it to a transaction via xfs_trans_ijoin.
This patches instead makes xfs_trans_ijoin usable on it's own by doing
an implicity xfs_trans_ihold, which also allows us to drop the third
argument. For the case where we want to hold a reference on the inode
a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
the inode for needing an xfs_iput. In addition to the cleaner interface
to the caller this also simplifies the implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
them in their only callers, just like we did for the inode pinning a while
ago. Also remove duplicate trace points - the bufitem tracepoints cover
all the information that is present in a buffer tracepoint.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently we track log item descriptor belonging to a transaction using a
complex opencoded chunk allocator. This code has been there since day one
and seems to work around the lack of an efficient slab allocator.
This patch replaces it with dynamically allocated log item descriptors
from a dedicated slab pool, linked to the transaction by a linked list.
This allows to greatly simplify the log item descriptor tracking to the
point where it's just a couple hundred lines in xfs_trans.c instead of
a separate file. The external API has also been simplified while we're
at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
delete items from a transaction have been simplified to the bare minium,
and the xfs_trans_find_item function is replaced with a direct dereference
of the li_desc field. All debug code walking the list of log items in
a transaction is down to a simple list_for_each_entry.
Note that we could easily use a singly linked list here instead of the
double linked list from list.h as the fastpath only does deletion from
sequential traversal. But given that we don't have one available as
a library function yet I use the list.h functions for simplicity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.
This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
https://bugzilla.kernel.org/show_bug.cgi?id=16348
When the filesystem grows to a large number of allocation groups,
the summing of recalimable inodes gets expensive. In many cases,
most AGs won't have any reclaimable inodes and so we are wasting CPU
time aggregating over these AGs. This is particularly important for
the inode shrinker that gets called frequently under memory
pressure.
To avoid the overhead, track AGs with reclaimable inodes in the
per-ag radix tree so that we can find all the AGs with reclaimable
inodes via a simple gang tag lookup. This involves setting the tag
when the first reclaimable inode is tracked in the AG, and removing
the tag when the last reclaimable inode is removed from the tree.
Then the summation process becomes a loop walking the radix tree
summing AGs with the reclaim tag set.
This significantly reduces the overhead of scanning - a 6400 AG
filesystea now only uses about 25% of a cpu in kswapd while slab
reclaim progresses instead of being permanently stuck at 100% CPU
and making little progress. Clean filesystems filesystems will see
no overhead and the overhead only increases linearly with the number
of dirty AGs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes. This significantly reduces
scanning overhead when lots of filesystems are present.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes. This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allowing writeback from reclaim context causes massive problems with stack
overflows as we can call into the writeback code which tends to be a heavy
stack user both in the generic code and XFS from random contexts that
perform memory allocations.
Follow the example of btrfs (and in slightly different form ext4) and refuse
to write out data from reclaim context. This issue should really be handled
by the VM so that we can tune better for this case, but until we get it
sorted out there we have to hack around this in each filesystem with a
complex writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If a filesystem is mounted without the inode64 mount option we
should still be able to access inodes not fitting into 32 bits, just
not created new ones. For this to work we need to make sure the
inode cache radix tree is initialized for all allocation groups, not
just those we plan to allocate inodes from. This patch makes sure
we initialize the inode cache radix tree for all allocation groups,
and also cleans xfs_initialize_perag up a bit to separate the
inode32 logical from the general perag structure setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use DECLARE_EVENT_CLASS, and save ~15K:
text data bss dec hex filename
171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig
156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o
No change in functionality.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
The new xfsqa test 228 tries to preallocate more space than the
filesystem contains. it should fail, but instead triggers an assert
about lock flags. The failure is due to the size extension failing
in vmtruncate() due to rlimit being set. Check this before we start
the preallocation to avoid allocating space that will never be used.
Also the path through xfs_vn_allocate already holds the IO lock, so
it should not be present in the lock flags when the setattr fails.
Hence the assert needs to take this into account. This will prevent
other such callers from hitting this incorrect ASSERT.
(Fixed a reference to "newsize" to read "new_size". -Alex)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.
We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.
The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.
To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion. This includes adding a log ticket for tracking space
log required/used by the context checkpoint.
To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out. This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.
Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.
This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.
The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.
Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.
Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.
Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.
We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.
To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>