22823 Commits

Author SHA1 Message Date
Vivek Haldar
556b27abf7 ext4: do not normalize block requests from fallocate()
Currently, an fallocate request of size slightly larger than a power of
2 is turned into two block requests, each a power of 2, with the extra
blocks pre-allocated for future use. When an application calls
fallocate, it already has an idea of how large the file may grow, so
there is usually little benefit in reserving extra blocks on the
preallocation list. Skipping the normalization also reduces disk
fragmentation.
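
As a rough illustration of the call path affected, here is a minimal
userspace sketch (file name and sizes are illustrative) that issues the
kind of fallocate request discussed above, slightly larger than a power
of two:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_CREAT | O_RDWR, 0644);
	off_t len = (1LL << 30) + 4096;	/* 1 GiB + 4 KiB */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Preallocate; with this change ext4 no longer rounds the
	 * request up to the next power of two. */
	if (fallocate(fd, 0, 0, len) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}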

Tested: fsstress. Also verified manually that fallocated files are
contiguously laid out with this change (whereas without it they begin at
power-of-2 boundaries, leaving blocks in between). CPU usage of
fallocate is not appreciably higher.  In a tight fallocate loop, CPU
usage hovers between 5%-8% with this change, and 5%-7% without it.

Using a simulated file system aging program which fills the file system to
70%, the percentage of free extents larger than 8MB (as measured by
e2freefrag) increased from 38.8% without this change, to 69.4% with
this change.

Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 07:41:54 -04:00
Allison Henderson
a4bb6b64e3 ext4: enable "punch hole" functionality
This patch adds the new routines "ext4_punch_hole", "ext4_ext_punch_hole"
and "ext4_ext_check_cache".

fallocate has been modified to call ext4_punch_hole when the punch hole
flag is passed.  At the moment, we only support punching holes in
extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole
routine.

The ext4_ext_punch_hole routine first completes all outstanding writes
for the associated pages, and then releases them.  The non-block-aligned
data is zeroed, and all blocks in between are punched out.

The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
except that it accepts an ext4_ext_cache parameter instead of an
ext4_extent parameter.  This routine is used by ext4_ext_punch_hole to
check whether a block in the hole has been cached.  The ext4_ext_cache
parameter is necessary because the members of the ext4_extent structure
are not large enough to hold a 32 bit value.  The existing
ext4_ext_in_cache routine has become a wrapper to this new function.
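
For illustration, a minimal userspace sketch (hypothetical file, offset
and length) of punching a hole via fallocate(2), which is the entry
point modified above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Punch a 1 MiB hole at offset 4 MiB; KEEP_SIZE is required so
	 * that the file length is left unchanged. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4 * 1024 * 1024, 1024 * 1024) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}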

[ext4 punch hole patch series 5/5 v7] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:50 -04:00
Allison Henderson
e861304b8e ext4: add "punch hole" flag to ext4_map_blocks()
This patch adds a new flag to ext4_map_blocks() that specifies the
given range of blocks should be punched out.  Extents are first
converted to uninitialized extents before they are punched
out. Because punching a hole may require that the extent be split, it
is possible that the splitting may need more blocks than are
available.  To deal with this, use of reserved blocks is enabled to
allow the split to proceed.

The routine then returns the number of blocks successfully
punched out.

[ext4 punch hole patch series 4/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:46 -04:00
Allison Henderson
d583fb87a3 ext4: punch out extents
This patch modifies the truncate routines to support hole punching.
Below is a brief summary of the patch's changes:

- Added an end parameter to ext4_ext_rm_leaf
        This function has been modified to accept an end parameter
        which enables it to punch holes in leaves instead of just
        truncating them.

- Implemented the "remove head" case in the ext4_remove_blocks routine
        This routine is used by ext4_ext_rm_leaf to remove the tail
        of an extent during a truncate.  The new ext4_ext_rm_leaf
        routine will now also use it to remove the head of an extent
        when the hole covers a region of blocks at the beginning of
        an extent.

- Added "end" param to ext4_ext_remove_space routine
        This function has been modified to accept a stop parameter, which
        is passed through to ext4_ext_rm_leaf.

[ext4 punch hole patch series 3/5 v6] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 07:41:43 -04:00
Allison Henderson
308488518d ext4: add new function ext4_block_zero_page_range()
This patch takes the existing ext4_block_truncate_page() function,
which was used by the truncate code path to zero out block-unaligned
data, adds a new length parameter, and renames it to
ext4_block_zero_page_range().  This function can now be used to zero
out the head of a block, the tail of a block, or the middle of a block.

The ext4_block_truncate_page() function is now a wrapper to
ext4_block_zero_page_range().
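
A simplified sketch of the resulting wrapper relationship (condensed
from the ext4 code, so details such as the offset arithmetic may differ
slightly):

/* The old entry point just computes the length of the partial block
 * past 'from' and forwards to the new range-based helper. */
int ext4_block_truncate_page(handle_t *handle,
		struct address_space *mapping, loff_t from)
{
	unsigned offset = from & (PAGE_CACHE_SIZE - 1);
	unsigned blocksize = mapping->host->i_sb->s_blocksize;
	unsigned length = blocksize - (offset & (blocksize - 1));

	return ext4_block_zero_page_range(handle, mapping, from, length);
}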

[ext4 punch hole patch series 2/5 v7] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:32 -04:00
Allison Henderson
55f020db66 ext4: add flag to ext4_has_free_blocks
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks.  This will allow a
punch hole to proceed even if the disk is full.  Punching a hole may
require additional blocks to first split the extents.

Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:

ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:26 -04:00
Tristan Ye
dda54e76d7 Ocfs2/move_extents: Set several trivial constraints for threshold.
The threshold should be greater than clustersize and less than i_size.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:13 +08:00
Tristan Ye
4dfa66bd59 Ocfs2/move_extents: Let defrag handle partial extent moving.
We're going to support partial extent moving, which may split the
movement of an entire extent into pieces to cope with insufficient
allocations.  It eases the 'ENOSPC' pain and makes the whole move much
less likely to fail; the downside is that it may leave the fs even more
fragmented than before the move, so we just let userspace make the
trade-off here.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye
53069d4e76 Ocfs2/move_extents: move/defrag extents within a certain range.
The basic logic of moving extents for a file is much like the
punch-hole sequence: walk the extents within the user-specified range,
calculate an appropriate length to defrag/move, then let
ocfs2_defrag/move_extent() do the actual moving.

This func ends up reporting 'OCFS2_MOVE_EXT_FL_COMPLETE' back to
userspace if the operation completes successfully.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye
ee16cc037e Ocfs2/move_extents: helper to calculate the defraging length in one run.
The helper calculates the defrag length in one run according to a
threshold; it proceeds with defragmentation until the threshold is met,
skipping any LARGE extent it encounters.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye
e08477176d Ocfs2/move_extents: move entire/partial extent.
The ocfs2_move_extent() logic validates goal_offset_in_block, where
the extents are to be moved to.  What's more, it also compromises a bit
by probing an appropriate region around the given goal_offset when the
original goal cannot accommodate the movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye
8473aa8a2b Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
These helpers were actually borrowed from alloc.c, which may be publicized
later.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye
e6b5859ccc Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
Before moving extents, we'd better probe the alloc group starting
from 'goal_blk', searching for a contiguous region that fits the wanted
movement; we even make a best-effort try by compromising down to a
threshold around the given goal.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye
99e4c75041 Ocfs2/move_extents: helper to validate and adjust moving goal.
A first, best-effort attempt to validate and adjust the goal (physical
address in blocks).  It cannot guarantee that the later operation will
always succeed, since the global_bitmap may change a bit over time.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye
1c06b91261 Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
Given a block number, this function tries to locate the right alloc
group where the physical block resides.  It returns to the caller a
buffer_head of the victim group descriptor, along with the offset of
the block within that group.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye
202ee5facb Ocfs2/move_extents: defrag a range of extent.
It's a relatively complete function to accomplish defragmentation of
an entire or partial extent; one journal handle is kept during the
operation.  It logically does one more thing than ocfs2_move_extent()
actually does: it claims the new clusters itself ;-)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
8f603e567a Ocfs2/move_extents: move a range of extent.
The moving range of __ocfs2_move_extent() is always within one extent;
it consists of the following parts:

1. Duplicate the clusters in pages to new_blkoffset, where the extent is to be moved.

2. Split the original extent with the new extent, coalescing the nearby extents if possible.

3. Append the old clusters to the truncate log, or decrease_refcount if the extent was refcounted.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
de474ee8bb Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
ocfs2_lock_allocators_move_extents() is like the common
ocfs2_lock_allocators(): it locks the metadata and data allocators
during extent moving, reserves appropriate metadata blocks and data
clusters, and also performs a best-effort calculation of the journal
transaction credits for one run of movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
028ba5df63 Ocfs2/move_extents: Add basic framework and source files for extent moving.
Add the new files move_extents.[c|h], containing nothing but a
context structure for now.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
220ebc4334 Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
The patch also adds the structure that is used to drive this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
3e19a25e05 Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
The original goal of commonizing these funcs is to benefit the
defragging/extent-moving code in the future, based on the fact that
reflink and defragmentation share the same Copy-On-Write mechanism.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
d24a10b9f8 Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
This new code is a bit more complicated than the former ones; the goal
is to show the user all the statistics required to get a deep insight
into how the disk is being fragmented.

The goal is achieved by scanning the global bitmap from (cluster) group
to group to figure out the following factors of the filesystem:

        - How many free chunks there are of a fixed size, as requested by the user.
        - How many real free chunks there are, of all sizes.
        - Min/Max/Avg size (in clusters) of the free chunks.
        - How the free chunks are distributed (by size), in terms of a histogram
          like the following:
          ---------------------------------------------------------
          Extent Size Range :  Free extents  Free Clusters  Percent
             32K...   64K-  :             1             1    0.00%
              1M...    2M-  :             9           288    0.03%
              8M...   16M-  :             2           831    0.09%
             32M...   64M-  :             1          2047    0.23%
            128M...  256M-  :             1          8191    0.92%
            256M...  512M-  :             2         21706    2.43%
            512M... 1024M-  :            27        858623   96.29%
          ---------------------------------------------------------

A userspace ioctl() call eventually gets the above info returned by
passing a 'struct ocfs2_info_freefrag' with the chunk_size specified up
front.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:07 +08:00
Tristan Ye
3e5db17d4d Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
The new code is dedicated to calculating the number of free inodes in
all inode_allocs and returning the info to userspace as an array.

Notably, the flag 'OCFS2_INFO_FL_NON_COHERENT', controlled by
'--cluster-coherent' from userspace, is now involved.  Setting the flag
means no cluster coherency is considered; usually, userspace tools
choose the non-coherent strategy by default for the sake of performance.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:02 +08:00
Tristan Ye
8aa1fa360d Ocfs2: Using inline funcs to set/clear *FILLED* flags in info handler.
It just removes some macros for the sake of typechecking gains.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:17:18 +08:00
Aditya Kali
ae81230686 ext4: reserve inodes and feature code for 'quota' feature
I am working on a patch to add quota as a built-in feature for the
ext4 filesystem. The implementation is based on the design given at
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 19:00:39 -04:00
Johann Lombardi
c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the
MMP block was last updated are displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Eric W. Biederman
62ca24baf1 ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
Spotted-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-24 15:30:33 -07:00
Kazuya Mio
d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found an issue where the number of free blocks went negative:
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs can go negative when the filesystem has few
free blocks, because the number of dirty blocks can be bigger than the
number of free blocks, as in the following two cases:

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.
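
For reference, a small userspace sketch (the mount point is arbitrary)
that reads f_bfree via statfs(2), the value this patch keeps
non-negative:

#include <stdio.h>
#include <sys/vfs.h>

int main(void)
{
	struct statfs st;

	if (statfs("/mnt/mp1", &st) < 0) {
		perror("statfs");
		return 1;
	}
	/* With this patch, ext4 clamps f_bfree so it never goes below
	 * zero even if dirty blocks temporarily exceed free blocks. */
	printf("free blocks: %llu, available: %lld\n",
	       (unsigned long long)st.f_bfree, (long long)st.f_bavail);
	return 0;
}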

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Lukas Czerner
28739eea9c ext4: protect bb_first_free in ext4_trim_all_free() with group lock
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do the right thing is really simple, so
let's do that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:28:07 -04:00
Lukas Czerner
7894408666 ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
Currently we call ext4_mb_load_buddy() for every block group we go
through in ext4_trim_fs(), in many cases just to find out that there is
not enough space to be bothered with. As Amir Goldstein suggested, we
can use the bb_free information directly from ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and
instead gets the ext4_group_info via ext4_get_group_info() and uses the
bb_free information directly from it. This avoids an unnecessary call
to load the buddy in the case the group does not have enough free space
to trim.  Loading the buddy is now moved to ext4_trim_all_free().

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:16:27 -04:00
Linus Torvalds
dc522adbee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: Fix comment to match the code in journal_start()
  jbd/jbd2: remove obsolete summarise_journal_usage.
  jbd: Fix forever sleeping process in do_get_write_access()
  ext2: fix error msg when mounting fs with too-large blocksize
  jbd: fix fsync() tid wraparound bug
  ext3: Fix fs corruption when make_indexed_dir() fails
  ext3: Fix lock inversion in ext3_symlink()
2011-05-24 15:11:46 -07:00
Linus Torvalds
df3256f9ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: make plock operation killable
  dlm: remove shared message stub for recovery
  dlm: delayed reply message warning
  dlm: Remove superfluous call to recalc_sigpending()
2011-05-24 15:04:00 -07:00
Eryu Guan
c867516de5 jbd2: Fix comment to match the code in jbd2__journal_start()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.
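
The calling convention implied by the corrected comment looks roughly
like the following fragment (a hedged sketch using the plain
jbd2_journal_start() variant; 'journal' and 'nblocks' come from the
caller's context):

handle_t *handle;
int err;

handle = jbd2_journal_start(journal, nblocks);
if (IS_ERR(handle))		/* ERR_PTR(-errno), never NULL */
	return PTR_ERR(handle);
/* ... journaled updates ... */
err = jbd2_journal_stop(handle);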

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 17:09:58 -04:00
John W. Linville
31ec97d9ce Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem 2011-05-24 16:47:54 -04:00
Linus Torvalds
b0ca118dba Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits)
  TOMOYO: Fix wrong domainname validation.
  SELINUX: add /sys/fs/selinux mount point to put selinuxfs
  CRED: Fix load_flat_shared_library() to initialise bprm correctly
  SELinux: introduce path_has_perm
  flex_array: allow 0 length elements
  flex_arrays: allow zero length flex arrays
  flex_array: flex_array_prealloc takes a number of elements, not an end
  SELinux: pass last path component in may_create
  SELinux: put name based create rules in a hashtable
  SELinux: generic hashtab entry counter
  SELinux: calculate and print hashtab stats with a generic function
  SELinux: skip filename trans rules if ttype does not match parent dir
  SELinux: rename filename_compute_type argument to *type instead of *con
  SELinux: fix comment to state filename_compute_type takes an objname not a qstr
  SMACK: smack_file_lock can use the struct path
  LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH
  LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE
  SELINUX: Make selinux cache VFS RCU walks safe
  SECURITY: Move exec_permission RCU checks into security modules
  SELinux: security_read_policy should take a size_t not ssize_t
  ...
2011-05-24 13:38:19 -07:00
Sage Weil
db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we changed cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty caps to a separate cap_dirty_migrating
list when in the limbo export-but-no-import state, allowing us to go
back to the simple loop structure (which was reentrant).  This is
cleaner and more robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil
45e3d3eeb6 ceph: avoid inode lookup on nfs fh reconnect
If we get the inode from the MDS, we have a reference in req; don't do a
fresh lookup.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:06 -07:00
Sage Weil
3c454cf216 ceph: use LOOKUPINO to make unconnected nfs fh more reliable
If we are unable to locate an inode by ino, ask the MDS using the new
LOOKUPINO command.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:05 -07:00
Linus Torvalds
eb08d8ff47 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
  UBIFS: switch to dynamic printks
  UBIFS: fix kernel-doc comments
  UBIFS: fix extremely rare mount failure
  UBIFS: simplify LEB recovery function further
  UBIFS: always cleanup the recovered LEB
  UBIFS: clean up LEB recovery function
  UBIFS: fix-up free space on mount if flag is set
  UBIFS: add the fixup function
  UBIFS: add a superblock flag for free space fix-up
  UBIFS: share the next_log_lnum helper
  UBIFS: expect corruption only in last journal head LEBs
  UBIFS: synchronize write-buffer before switching to the next bud
  UBIFS: remove BUG statement
  UBIFS: change bud replay function conventions
  UBIFS: substitute the replay tree with a replay list
  UBIFS: simplify replay
  UBIFS: store free and dirty space in the bud replay entry
  UBIFS: remove unnecessary stack variable
  UBIFS: double check that buds are replied in order
  UBIFS: make 2 functions static
  ...
2011-05-24 11:51:07 -07:00
Christoph Hellwig
55a7bc5a30 xfs: do not discard alloc btree blocks
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused.  Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.

Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks.  In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard.  We could
fix this by adding another flag and keeping these block in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL.  Given that this would cause
significant overhead, I don't think it's worthwhile for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:22 -05:00
Christoph Hellwig
e84661aa84 xfs: add online discard support
Now that we have reliable tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard.  Note that we don't bother
supporting discard for the non-delaylog mode.
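
The block-layer primitive involved is blkdev_issue_discard(); a
simplified sketch of the discard step (the bdev/sector variables are
illustrative, not the exact ones used in the XFS code):

/* Once a busy extent can no longer be reused, ask the device to
 * discard the corresponding sector range. */
error = blkdev_issue_discard(bdev, start_sector, nr_sectors,
			     GFP_NOFS, 0);
if (error && error != -EOPNOTSUPP)
	printk(KERN_WARNING "xfs: discard failed, error %d\n", error);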

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:13 -05:00
Jan Kara
93628ffb9b ext4: fix waiting and sending of a barrier in ext4_sync_file()
jbd2_log_start_commit() returns 1 only when we really start a
transaction.  But we also need to wait for a transaction when the
commit is already running.  Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).

Also, we have to be more careful with sending a barrier, because when
a transaction is being committed in parallel with ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for the fsync (note that not every
data writeout needs to trigger metadata changes, thus a commit of some
metadata changes can be running while other data is still being written
out).  So use the jbd2_trans_will_send_data_barrier() helper to detect
the common cases when we can be sure the barrier will be issued by the
commit code, and issue the barrier ourselves in the remaining cases.
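
Roughly, the resulting decision in the fsync path looks like this (a
simplified sketch; the real code handles more cases):

bool needs_barrier = false;

if ((journal->j_flags & JBD2_BARRIER) &&
    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
	needs_barrier = true;

jbd2_log_start_commit(journal, commit_tid);
jbd2_log_wait_commit(journal, commit_tid);

if (needs_barrier)
	blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);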

Reported-by: Edward Goggin <egoggin@vmware.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 12:00:54 -04:00
Jan Kara
bbd2be3691 jbd2: Add function jbd2_trans_will_send_data_barrier()
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device.  The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:59:18 -04:00
Jan Kara
81be12c817 jbd2: fix sending of data flush on journal commit
In data=ordered mode, it's theoretically possible (however rare) that
an inode is added to a transaction's t_inode_list, a flusher thread
writes all the data, and the inode is reclaimed before the transaction
starts to commit.  In such a case, we could erroneously omit sending a
flush to the file system device when it is different from the journal
device (because the data can still be sitting only in the disk cache).

Fix the problem by setting a flag in the transaction when some inode
is added to it, and then sending a disk flush in the commit code when
the flag is set.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:52:40 -04:00
Yongqiang Yang
b221349fa8 ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
To get delayed-extent information, ext4_ext_fiemap_cb() looks up the
page cache, and thus collects information starting from a page's head
block.

If blocksize < pagesize, the beginning blocks of a page may lie before
the requested range, so ext4_ext_fiemap_cb() should proceed while
ignoring them, because they have been handled before.  If no mapped
buffer in the range is found in the 1st page, we need to look up the
2nd page; otherwise delayed extents after a hole will be ignored.

Without this patch, xfstests 225 will hang on ext4 with a 1K block size.

Reported-by: Amir Goldstein <amir73il@users.sourceforge.net>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:36:58 -04:00
James Morris
434d42cfd0 Merge branch 'next' into for-linus 2011-05-24 22:55:24 +10:00
Robin Dong
98ba073c60 ocfs2: change incorrect 'extern' keyword to 'static' in dlmfs
Change function param_set_dlmfs_capabilities from 'extern' to 'static' since
function param_get_dlmfs_capabilities is also 'static'.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:59:40 -07:00
Sunil Mushran
9f62e96084 ocfs2/dlm: dlm_is_lockres_migrateable() returns boolean
Patch cleans up the gunk added by commit 388c4bcb4e63e88fb1f312a2f5f9eb2623afcf5b.
dlm_is_lockres_migrateable() now returns 1 if the lock resource is
deemed migrateable and 0 if not.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:39 -07:00
Tao Ma
10fca35ff1 ocfs2: Add trace event for trim.
Add the corresponding trace event for trim.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:20 -07:00
Tao Ma
55e67872b6 ocfs2: Add FITRIM ioctl.
Add the corresponding ioctl function for FITRIM.
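
For illustration, a minimal userspace sketch (the mount point is
arbitrary) of invoking FITRIM on an ocfs2 filesystem:

#include <fcntl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct fstrim_range range = {
		.start  = 0,
		.len    = UINT64_MAX,	/* trim the whole filesystem */
		.minlen = 0,
	};
	int fd = open("/mnt/ocfs2", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FITRIM, &range) < 0)
		perror("FITRIM");
	else
		printf("trimmed %llu bytes\n",
		       (unsigned long long)range.len);
	close(fd);
	return 0;
}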

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:19 -07:00