45762 Commits

Author SHA1 Message Date
Darrick J. Wong
0a1b0b3855 xfs: add an extent to the rmap btree
Originally-From: Dave Chinner <dchinner@redhat.com>

Now all the btree, free space and transaction infrastructure is in
place, we can finally add the code to insert reverse mappings to the
rmap btree. Freeing will be done in a separate patch, so just the
addition operation can be focussed on here.

[darrick: handle owner offsets when adding rmaps]
[dchinner: remove remaining debug printk statements]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:44:21 +10:00
Darrick J. Wong
aa966d84aa xfs: add tracepoints for the rmap functions
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:43:24 +10:00
Darrick J. Wong
c543838a1e xfs: teach rmapbt to support interval queries
Now that the generic btree code supports querying all records within a
range of keys, use that functionality to allow us to ask for all the
extents mapped to a range of physical blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:42:39 +10:00
Darrick J. Wong
cfed56ae5f xfs: support overlapping intervals in the rmap btree
Now that the generic btree code supports overlapping intervals, plug
in the rmap btree to this functionality.  We will need it to find
potential left neighbors in xfs_rmap_{alloc,free} later in the patch
set.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:40:56 +10:00
Darrick J. Wong
4b8ed67794 xfs: add rmap btree operations
Originally-From: Dave Chinner <dchinner@redhat.com>

Implement the generic btree operations needed to manipulate rmap
btree blocks. This is very similar to the per-ag freespace btree
implementation, and uses the AGFL for allocation and freeing of
blocks.

Adapt the rmap btree to store owner offsets within each rmap record,
and to handle the primary key being redefined as the tuple
[agblk, owner, offset].  The expansion of the primary key is crucial
to allowing multiple owners per extent.
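
As a rough illustration of what ordering by the tuple [agblk, owner, offset] means, here is a minimal userspace sketch of a lexicographic key comparison. The struct and function names are hypothetical and chosen for this example; they are not the on-disk format or the kernel's helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory key; field names are illustrative only. */
struct rmap_key {
	uint32_t agblk;		/* first block of the extent within the AG */
	uint64_t owner;		/* inode number or special metadata owner */
	uint64_t offset;	/* logical offset within the owner */
};

/* Lexicographic order on [agblk, owner, offset]: returns <0, 0 or >0. */
static int rmap_key_cmp(const struct rmap_key *k1, const struct rmap_key *k2)
{
	assert(k1 && k2);

	if (k1->agblk != k2->agblk)
		return k1->agblk < k2->agblk ? -1 : 1;
	if (k1->owner != k2->owner)
		return k1->owner < k2->owner ? -1 : 1;
	if (k1->offset != k2->offset)
		return k1->offset < k2->offset ? -1 : 1;
	return 0;
}
```

Because owner and offset take part in the comparison, two records that start at the same agblk but belong to different owners remain distinct keys, which is what lets one extent carry several owners.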

[darrick: adapt the btree ops to deal with offsets]
[darrick: remove init_rec_from_key]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:39:05 +10:00
Darrick J. Wong
525488520a xfs: rmap btree requires more reserved free space
Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:38:24 +10:00
Darrick J. Wong
fa30f03cda xfs: rmap btree transaction reservations
The rmap btrees will use the AGFL as the block allocation source, so
we need to ensure that the transaction reservations reflect the fact
this tree is modified by allocation and freeing. Hence we need to
extend all the extent allocation/free reservations used in
transactions to handle this.

Note that this also gets rid of the unused XFS_ALLOCFREE_LOG_RES
macro, as we now do buffer reservations based on the number of
buffers logged via xfs_calc_buf_res(). Hence we only need the buffer
count calculation now.

[darrick: use rmap_maxlevels when calculating log block resv]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:37:10 +10:00
Darrick J. Wong
e70d829f8d xfs: add rmap btree growfs support
Originally-From: Dave Chinner <dchinner@redhat.com>

Now we can read and write rmap btree blocks, we can add support to
the growfs code to initialise new rmap btree blocks.

[darrick.wong@oracle.com: fill out the rmap offset fields]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:36:08 +10:00
Darrick J. Wong
035e00acb5 xfs: define the on-disk rmap btree format
Originally-From: Dave Chinner <dchinner@redhat.com>

Now we have all the surrounding call infrastructure in place, we can
start filling out the rmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for adding the
btree operations implementation.

[darrick: record owner and offset info in rmap btree]
[darrick: fork, bmbt and unwritten state in rmap btree]
[darrick: flags are a separate field in xfs_rmap_irec]
[darrick: calculate maxlevels separately]
[darrick: move the 'unwritten' bit into unused parts of rm_offset]
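
The idea of keeping a flag in otherwise unused high bits of an offset field can be sketched as follows. The mask width and bit position are assumptions made for this example, and the EX_* and ex_* names are hypothetical; consult the actual format header for the real layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: low 54 bits hold the offset, bit 61 the flag. */
#define EX_OFF_MASK		((UINT64_C(1) << 54) - 1)
#define EX_OFF_UNWRITTEN	(UINT64_C(1) << 61)

/* Combine an offset and the unwritten flag into one 64-bit field. */
static uint64_t ex_pack_offset(uint64_t offset, bool unwritten)
{
	assert((offset & ~EX_OFF_MASK) == 0);	/* offset must fit in the low bits */
	return offset | (unwritten ? EX_OFF_UNWRITTEN : 0);
}

/* Recover the plain offset. */
static uint64_t ex_unpack_offset(uint64_t packed)
{
	return packed & EX_OFF_MASK;
}

/* Test the flag without disturbing the offset. */
static bool ex_is_unwritten(uint64_t packed)
{
	return (packed & EX_OFF_UNWRITTEN) != 0;
}
```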

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:36:07 +10:00
Darrick J. Wong
673930c34a xfs: introduce rmap extent operation stubs
Originally-From: Dave Chinner <dchinner@redhat.com>

Add the stubs into the extent allocation and freeing paths that the
rmap btree implementation will hook into. While doing this, add the
trace points that will be used to track rmap btree extent
manipulations.

[darrick.wong@oracle.com: Extend the stubs to take full owner info.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:33:43 +10:00
Darrick J. Wong
340785cca1 xfs: add owner field to extent allocation and freeing
For the rmap btree to work, we have to feed the extent owner
information to the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner.  The index field will be used to support
reverse mappings when reflink is enabled.

When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update. Make
sure that the log replay code signals this correctly.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:33:42 +10:00
Darrick J. Wong
8018026ef2 xfs: rmap btree add more reserved blocks
Originally-From: Dave Chinner <dchinner@redhat.com>

XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmapbt).

Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
just to use the xfs_mount variable directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:31:47 +10:00
Darrick J. Wong
00f4e4f907 xfs: add rmap btree stats infrastructure
Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree will require the same stats as all the other generic
btrees, so add all the code for that now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:31:11 +10:00
Darrick J. Wong
b87049444a xfs: introduce rmap btree definitions
Originally-From: Dave Chinner <dchinner@redhat.com>

Add new per-ag rmap btree definitions to the per-ag structures. The
rmap btree will sit in the empty slots on disk after the free space
btrees, and hence form a part of the array of space management
btrees. This requires the definition of the btree to be contiguous
with the free space btrees.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:30:32 +10:00
Darrick J. Wong
df3954ff72 xfs: increase XFS_BTREE_MAXLEVELS to fit the rmapbt
By my calculations, a 1,073,741,824 block AG with a 1k block size
can attain a maximum height of 9.  Assuming a record size of 24
bytes, a key/ptr size of 44 bytes, and half-full btree nodes, we'd
need 53,687,092 blocks for the records and ~6 million blocks for the
keys.  That requires a btree of height 9 based on the following
derivation:

Block size = 1024b
sblock CRC header = 56b
== 1024-56 = 968 bytes for tree data

rmapbt record = 24b
== 40 records per leaf block

rmapbt ptr/key = 44b
== 22 ptr/keys per block

Worst case, each block is half full, so 20 records and 11 ptrs per block.

1073741824 rmap records / 20 records per block
== 53687092 leaf blocks

53687092 leaves / 11 ptrs per block
== 4880645 level 1 blocks
== 443695 level 2 blocks
== 40336 level 3 blocks
== 3667 level 4 blocks
== 334 level 5 blocks
== 31 level 6 blocks
== 3 level 7 blocks
== 1 level 8 block
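
The derivation above can be reproduced with a short userspace sketch. The function names are made up for this example; the inputs (20 records per leaf, 11 pointers per node) are the half-full figures from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling division for unsigned 64-bit values. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/*
 * Worst-case btree height for nrecs records when every leaf holds only
 * recs_per_leaf records and every node only ptrs_per_node pointers.
 * If node_blocks is non-NULL it receives the number of non-leaf blocks.
 */
static unsigned int btree_worst_height(uint64_t nrecs, uint64_t recs_per_leaf,
				       uint64_t ptrs_per_node,
				       uint64_t *node_blocks)
{
	uint64_t blocks = div_round_up(nrecs, recs_per_leaf);
	uint64_t nodes = 0;
	unsigned int height = 1;	/* the leaf level */

	assert(ptrs_per_node > 1);
	while (blocks > 1) {
		blocks = div_round_up(blocks, ptrs_per_node);
		nodes += blocks;
		height++;
	}
	if (node_blocks)
		*node_blocks = nodes;
	return height;
}
```

Calling btree_worst_height(1073741824, 20, 11, &nodes) walks the same levels as the list above (the leaf level plus levels 1 through 8) and returns 9; by this count the non-leaf levels add up to roughly 5.4 million blocks.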

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:29:42 +10:00
Darrick J. Wong
ba9e780246 xfs: add tracepoints and error injection for deferred extent freeing
Add a couple of tracepoints for the deferred extent free operation and
a site for injecting errors while finishing the operation.  This makes
it easier to debug deferred ops and test log redo.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:26:33 +10:00
Darrick J. Wong
dc42375d5f xfs: refactor redo intent item processing
Refactor the EFI intent item recovery (and cancellation) functions
into a general function that scans the AIL and an intent item type
specific handler.  Move the function that recovers a single EFI item
into the extent free item code.  We'll want the generalized function
when we start wiring up more redo item types.

Furthermore, ensure that log recovery only replays the redo items
that were in the AIL prior to recovery by checking the item LSN
against the largest LSN seen during log scanning.  As written this
should never happen, but we can be defensive anyway.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:23:49 +10:00
Darrick J. Wong
2c3234d1ef xfs: rename flist/free_list to dfops
Mechanical change of flist/free_list to dfops, since they're now
deferred ops, not just a freeing list.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:19:29 +10:00
Darrick J. Wong
310a75a3c6 xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*
Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code.  No new code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:18:10 +10:00
Darrick J. Wong
3ab78df2a5 xfs: rework xfs_bmap_free callers to use xfs_defer_ops
Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead.  For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:15:38 +10:00
Darrick J. Wong
9749fee83f xfs: enable the xfs_defer mechanism to process extents to free
Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred extent freeing.  We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:14:35 +10:00
Darrick J. Wong
bba61cbf30 xfs: clean up typedef usage in the EFI/EFD handling code
Replace structure typedefs with struct xfs_foo_* in the EFI/EFD
handling code in preparation to move it over to deferred ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:13:47 +10:00
Darrick J. Wong
3cd48abcc1 xfs: add tracepoints for the deferred ops mechanism
Add tracepoints for the internals of the deferred ops mechanism
and tracepoint classes for clients of the dops, to make debugging
easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:13:02 +10:00
Darrick J. Wong
4e0cc29b91 xfs: move deferred operations into a separate file
All the code around struct xfs_bmap_free basically implements a
deferred operation framework through which we can roll transactions
(to unlock buffers and avoid violating lock order rules) while
managing all the necessary log redo items.  Previously we only used
this code to free extents after some sort of mapping operation, but
with the advent of rmap and reflink, we suddenly need to do more than
that.

With that in mind, xfs_bmap_free really becomes a deferred ops control
structure.  Rename the structure and move the deferred ops into their
own file to avoid further bloating of the bmap code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:12:25 +10:00
Darrick J. Wong
28a89567b8 xfs: refactor btree owner change into a separate visit-blocks function
Refactor the btree_change_owner function into a more generic apparatus
which visits all blocks in a btree.  We'll use this in a subsequent
patch for counting btree blocks for AG reservations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:10:55 +10:00
Darrick J. Wong
105f7d83db xfs: introduce interval queries on btrees
Create a function to enable querying of btree records mapping to a
range of keys.  This will be used in subsequent patches to allow
querying the reverse mapping btree to find the extents mapped to a
range of physical blocks, though the generic code can be used for
any range query.

The overlapped query range function needs to use the btree get_block
helper because the root block could be an inode, in which case
bc_bufs[nlevels-1] will be NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:10:21 +10:00
Darrick J. Wong
2c813ad66a xfs: support btrees with overlapping intervals for keys
On a filesystem with both reflink and reverse mapping enabled, it's
possible to have multiple rmap records referring to the same blocks on
disk.  When overlapping intervals are possible, querying a classic
btree to find all records intersecting a given interval is inefficient
because we cannot use the left side of the search interval to filter
out non-matching records the same way that we can use the existing
btree key to filter out records coming after the right side of the
search interval.  This will become important once we want to use the
rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

(For the non-overlapping case, we can perform such queries trivially
by starting at the left side of the interval and walking the tree
until we pass the right side.)
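
For the non-overlapping case, the walk can be sketched on a sorted array standing in for the tree. This is a simplified model with hypothetical names, not the btree code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t start;	/* first block */
	uint64_t len;	/* number of blocks, > 0 */
};

/*
 * recs[] is sorted by start and no two records overlap.  Count the
 * records intersecting the inclusive range [lo, hi].
 */
static size_t count_in_range(const struct extent *recs, size_t n,
			     uint64_t lo, uint64_t hi)
{
	size_t left = 0, right = n, count = 0;

	assert(lo <= hi);
	/* Binary search for the first record that ends at or after lo. */
	while (left < right) {
		size_t mid = left + (right - left) / 2;

		if (recs[mid].start + recs[mid].len - 1 < lo)
			left = mid + 1;
		else
			right = mid;
	}
	/* Walk right until a record starts past the right side. */
	for (; left < n && recs[left].start <= hi; left++)
		count++;
	return count;
}
```

The binary search is only valid because the records do not overlap, so their end blocks are sorted along with their start blocks. Once overlaps are allowed that property is lost, which is the problem the rest of this message addresses.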

Therefore, extend the btree code to come closer to supporting
intervals as a first-class record attribute.  This involves widening
the btree node's key space to store both the lowest key reachable via
the node pointer (as the btree does now) and the highest key reachable
via the same pointer and teaching the btree modifying functions to
keep the highest-key records up to date.
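
A two-level toy model may help show why the high key matters. In this sketch (all names hypothetical) the parent's key pair is recomputed from each leaf on the fly; a real btree would store it in the node and keep it up to date instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rec { uint64_t start, len; };	/* len > 0; records may overlap */

struct leaf {
	const struct rec *recs;		/* sorted by start */
	size_t nrecs;
};

struct node_key {
	uint64_t low;	/* lowest start reachable via the pointer */
	uint64_t high;	/* highest end block reachable via the same pointer */
};

/* Derive the parent's key pair for one leaf block. */
static struct node_key leaf_keys(const struct leaf *lf)
{
	struct node_key k;
	size_t i;

	assert(lf->nrecs > 0);
	k.low = lf->recs[0].start;
	k.high = 0;
	for (i = 0; i < lf->nrecs; i++) {
		uint64_t end = lf->recs[i].start + lf->recs[i].len - 1;

		if (end > k.high)
			k.high = end;
	}
	return k;
}

/* Count records intersecting [lo, hi]; *visited reports leaves scanned. */
static size_t query(const struct leaf *leaves, size_t nleaves,
		    uint64_t lo, uint64_t hi, size_t *visited)
{
	size_t i, j, hits = 0;

	*visited = 0;
	for (i = 0; i < nleaves; i++) {
		struct node_key k = leaf_keys(&leaves[i]);

		if (k.high < lo)
			continue;	/* left-side filter: needs the high key */
		if (k.low > hi)
			break;		/* right-side filter: low key suffices */
		(*visited)++;
		for (j = 0; j < leaves[i].nrecs; j++) {
			const struct rec *r = &leaves[i].recs[j];

			if (r->start <= hi && r->start + r->len - 1 >= lo)
				hits++;
		}
	}
	return hits;
}
```

Without the high key, every leaf to the left of the interval would have to be scanned, because a long record starting early might still reach into the query range.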

This behavior can be turned on via a new btree ops flag so that btrees
that cannot store overlapping intervals don't pay the overhead costs
in terms of extra code and disk format changes.

When we're deleting a record in a btree that supports overlapped
interval records and the deletion results in two btree blocks being
joined, we defer updating the high/low keys until after all possible
joins (at higher levels in the tree) have finished.  At this point,
the btree pointers at all levels have been updated to remove the empty
blocks and we can update the low and high keys.

When we're doing this, we must be careful to update the keys of all
node pointers up to the root instead of stopping at the first set of
keys that don't need updating.  This is because it's possible for a
single deletion to cause joining of multiple levels of tree, and so
we need to update everything going back to the root.

The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
equal to, or greater than key2, respectively.  This is consistent
with the rest of the kernel and the C library.
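
One practical consequence of this convention is that such a comparison has the same shape as a C library comparator. The sketch below, with hypothetical names and plain 64-bit keys, passes one to qsort().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort()-style comparator: <0, 0, >0 for k1 <, ==, > k2. */
static int diff_two_u64(const void *p1, const void *p2)
{
	uint64_t k1 = *(const uint64_t *)p1;
	uint64_t k2 = *(const uint64_t *)p2;

	/* Not "k1 - k2": truncating a 64-bit difference to int can lose the sign. */
	return (k1 > k2) - (k1 < k2);
}

/* Sort keys in ascending order using the comparator. */
static void sort_keys(uint64_t *keys, size_t n)
{
	assert(keys);
	qsort(keys, n, sizeof(*keys), diff_two_u64);
}
```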

In btree_updkeys(), we need to evaluate the force_all parameter before
running the key diff to avoid reading uninitialized memory when we're
forcing a key update.  This happens when we've allocated an empty slot
at level N + 1 to point to a new block at level N and we're in the
process of filling out the new keys.
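
The ordering concern can be illustrated with C's short-circuit evaluation. This is a minimal sketch with hypothetical names; the counter exists only to make the skipped read observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static unsigned int nr_key_reads;	/* counts reads of the stored key */

/* Compare the stored parent key against the freshly computed one. */
static int diff_keys(const uint64_t *stored, uint64_t fresh)
{
	assert(stored);
	nr_key_reads++;
	return (*stored > fresh) - (*stored < fresh);
}

/*
 * Decide whether the parent key must be rewritten.  force_all is tested
 * first, so "stored" (possibly a freshly allocated, unfilled slot) is
 * never read when an update is forced.
 */
static bool need_key_update(bool force_all, const uint64_t *stored,
			    uint64_t fresh)
{
	return force_all || diff_keys(stored, fresh) != 0;
}
```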

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:08:36 +10:00
Linus Torvalds
d52bd54db8 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of ocfs2

 - various hotfixes, mainly MM

 - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

 - printk updates

 - firmware

 - checkpatch

 - nilfs2

 - more kexec stuff than usual

 - rapidio updates

 - w1 things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  ipc: delete "nr_ipc_ns"
  kcov: allow more fine-grained coverage instrumentation
  init/Kconfig: add clarification for out-of-tree modules
  config: add android config fragments
  init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
  relay: add global mode support for buffer-only channels
  init: allow blacklisting of module_init functions
  w1:omap_hdq: fix regression
  w1: add helper macro module_w1_family
  w1: remove need for ida and use PLATFORM_DEVID_AUTO
  rapidio/switches: add driver for IDT gen3 switches
  powerpc/fsl_rio: apply changes for RIO spec rev 3
  rapidio: modify for rev.3 specification changes
  rapidio: change inbound window size type to u64
  rapidio/idt_gen2: fix locking warning
  rapidio: fix error handling in mbox request/release functions
  rapidio/tsi721_dma: advance queue processing from transfer submit call
  rapidio/tsi721: add messaging mbox selector parameter
  rapidio/tsi721: add PCIe MRRS override parameter
  rapidio/tsi721_dma: add channel mask and queue size parameters
  ...
2016-08-02 21:08:07 -04:00
Darrick J. Wong
70b2265935 xfs: add function pointers for get/update keys to the btree
Add some function pointers to bc_ops to get the btree keys for
leaf and node blocks, and to update parent keys of a block.
Convert the _btree_updkey calls to use our new pointer, and
modify the tree shape changing code to call the appropriate
get_*_keys pointer instead of _btree_copy_keys because the
overlapping btree has to calculate high key values.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:03:38 +10:00
Darrick J. Wong
e5821e57af xfs: during btree split, save new block key & ptr for future insertion
When a btree block has to be split, we pass the new block's ptr from
xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
however, we pass the block's key through the cursor's record.  It is a
little weird to "initialize" a record from a key since the non-key
attributes will have garbage values.

When we go to add support for interval queries, we have to be able to
pass the lowest and highest keys accessible via a pointer.  There's no
clean way to pass this back through the cursor's record field.
Therefore, pass the key directly back to xfs_btree_insert() the same
way that we pass the btree_ptr.

As a bonus, we no longer need init_rec_from_key and can drop it from the
codebase.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:02:39 +10:00
Darrick J. Wong
0d309791bd xfs: set *stat=1 after iroot realloc
If we make the inode root block of a btree unfull by expanding the
root, we must set *stat to 1 to signal success, rather than leaving
it uninitialized.
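
The general pattern, an out-parameter that must be defined on every successful return, might look like the sketch below. The function, its parameters and its control flow are hypothetical simplifications and do not mirror the actual btree code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical "make room" helper.  Returns 0 or a negative error, and
 * on success reports through *stat whether room was made (1) or not (0).
 */
static int make_block_unfull(bool root_in_inode, bool can_grow_root,
			     bool can_shift, int *stat)
{
	assert(stat);
	*stat = 0;	/* defined on every path, including early returns */

	if (root_in_inode && can_grow_root) {
		/* ... expand the inode root here ... */
		*stat = 1;	/* signal success after growing the root */
		return 0;
	}
	if (can_shift) {
		/* ... shift records to a sibling here ... */
		*stat = 1;
		return 0;
	}
	return 0;	/* no error, but no room either */
}
```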

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:01:25 +10:00
Darrick J. Wong
f4a0660de3 xfs: fix locking of the rt bitmap/summary inodes
When we're deleting realtime extents, we need to lock the summary
inode in case we need to update the summary info to prevent an assert
on the rsumip inode lock on a debug kernel.  While we're at it, fix
the locking annotations so that we avoid triggering lockdep warnings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:00:42 +10:00
Darrick J. Wong
3dadf901dd xfs: fix attr shortform structure alignment on cris
Apparently cris doesn't require structure stride to align with the
largest type in the struct, so list[0] isn't at offset 4 like it is
everywhere else.  Fix this... insofar as existing XFSes on cris are
screwed.
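
One general way to make such a layout independent of the ABI's padding rules is to spell the padding out and assert the offset at build time. The structures below are illustrative stand-ins with made-up names, and this is not necessarily how the patch itself does it.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a short-form header followed by an entry list. */
struct sf_hdr {
	uint16_t totsize;
	uint8_t  count;
	uint8_t  padding;	/* explicit pad instead of relying on the ABI */
};

struct sf_entry {
	uint8_t namelen;
	uint8_t valuelen;
	uint8_t flags;
	uint8_t nameval[1];
};

struct sf {
	struct sf_hdr   hdr;
	struct sf_entry list[1];
};

/* Fails the build if an ABI places the list anywhere but offset 4. */
_Static_assert(offsetof(struct sf, list) == 4, "list[0] must sit at offset 4");

/* Runtime view of the same fact, handy for debugging output. */
static size_t sf_list_offset(void)
{
	return offsetof(struct sf, list);
}
```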

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 10:59:42 +10:00
Darrick J. Wong
0facef7fb0 xfs: in _attrlist_by_handle, copy the cursor back to userspace
When we're iterating inode xattrs by handle, we have to copy the
cursor back to userspace so that a subsequent invocation actually
retrieves subsequent contents.
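
The effect of copying the cursor back can be modelled in userspace. In this sketch the names are hypothetical and memcpy() stands in for the kernel's copy_from_user() and copy_to_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct list_cursor { size_t next; };	/* hypothetical resume position */

static const char *const names[] = {
	"user.a", "user.b", "user.c", "user.d", "user.e"
};
#define NR_NAMES (sizeof(names) / sizeof(names[0]))

/*
 * Fill out[] with up to max names starting at the caller's cursor.
 * The cursor is copied in, advanced on a private copy, then copied
 * back so that the next call resumes where this one stopped.
 */
static size_t list_names(struct list_cursor *ucursor, const char **out,
			 size_t max)
{
	struct list_cursor kc;
	size_t n = 0;

	assert(ucursor && out);
	memcpy(&kc, ucursor, sizeof(kc));	/* stands in for copy_from_user() */
	while (n < max && kc.next < NR_NAMES)
		out[n++] = names[kc.next++];
	memcpy(ucursor, &kc, sizeof(kc));	/* stands in for copy_to_user() */
	return n;
}
```

If the second memcpy() were dropped, the caller's cursor would never advance and every call would return the same leading entries.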

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 10:58:53 +10:00
Linus Torvalds
8cbdd85bda orangefs: kernel side caching and executable bugfix

Merge tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linux

Pull orangefs update from Martin Brandenburg:
 "Kernel side caching and executable bugfix

  This allows OrangeFS to utilize the dcache and adds an in kernel
  attribute cache.  We previously used the user side client for this
  purpose.

  We see a modest performance increase on small file operations.  For
  example, without the cache, compiling coreutils takes about 17
  minutes.  With the patch and a 50 millisecond timeout for
  dcache_timeout_msecs and getattr_timeout_msecs (the default),
  compiling coreutils takes about 6 minutes 20 seconds.  On the same
  hardware, compiling coreutils on an xfs filesystem takes 90 seconds.
  We see similar improvements with mdtest and a test involving writing,
  reading, and deleting a large number of small files.

  Interested parties can review more data at the following URL.

    https://docs.google.com/spreadsheets/d/1v4aUeppKexIbRMz_Yn9k4eaM3uy2KCaPoe_93YKWOtA/pubhtml

  The eventual goal of this is to allow getdents to turn into a
  readdirplus to the OrangeFS server.  The cache will be filled then,
  which should provide a performance benefit to the common case of
  readdir followed by getattr on each entry (i.e.  ls -l).

  This also fixes a bug.  When orangefs_inode_permission was added, it
  did not collect i_size from the OrangeFS server, since this presses an
  unnecessary load on the OrangeFS server.  However, it left a case
  where i_size is never initialized.  Then running an executable could
  fail.

  With this patch, size is always collected to be inserted into the
  cache.  Thus the bug disappears.  If this patch is not accepted during
  this merge window, we will send a one-line band-aid for this bug
  instead"

* tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linux:
  Orangefs: update orangefs.txt
  orangefs: Account for jiffies wraparound.
  orangefs: Change default dcache and getattr timeout to 50 msec.
  orangefs: Allow dcache and getattr cache time to be configured.
  orangefs: Cache getattr results.
  orangefs: Use d_time to avoid excessive lookups
2016-08-02 19:47:06 -04:00
Linus Torvalds
72b5ac54d6 The highlights are:

Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client

Pull Ceph updates from Ilya Dryomov:
 "The highlights are:

   - RADOS namespace support in libceph and CephFS (Zheng Yan and
     myself).  The stopgaps added in 4.5 to deny access to inodes in
     namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature
     bit is now fully supported

   - A large rework of the MDS cap flushing code (Zheng Yan)

   - Handle some of ->d_revalidate() in RCU mode (Jeff Layton).  We were
     overly pessimistic before, bailing at the first sight of LOOKUP_RCU

  On top of that we've got a few CephFS bug fixes, a couple of cleanups
  and Arnd's workaround for a weird genksyms issue"

* tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits)
  ceph: fix symbol versioning for ceph_monc_do_statfs
  ceph: Correctly return NXIO errors from ceph_llseek
  ceph: Mark the file cache as unreclaimable
  ceph: optimize cap flush waiting
  ceph: cleanup ceph_flush_snaps()
  ceph: kick cap flushes before sending other cap message
  ceph: introduce an inode flag to indicates if snapflush is needed
  ceph: avoid sending duplicated cap flush message
  ceph: unify cap flush and snapcap flush
  ceph: use list instead of rbtree to track cap flushes
  ceph: update types of some local variables
  ceph: include 'follows' of pending snapflush in cap reconnect message
  ceph: update cap reconnect message to version 3
  ceph: mount non-default filesystem by name
  libceph: fsmap.user subscription support
  ceph: handle LOOKUP_RCU in ceph_d_revalidate
  ceph: allow dentry_lease_is_valid to work under RCU walk
  ceph: clear d_fsinfo pointer under d_lock
  ceph: remove ceph_mdsc_lease_release
  ceph: don't use ->d_time
  ...
2016-08-02 19:39:09 -04:00
Jeff Mahoney
0a11b9aae4 reiserfs: fix "new_insert_key may be used uninitialized ..."
new_insert_key only makes any sense when it's associated with a
new_insert_ptr, which is initialized to NULL and changed to a
buffer_head when we also initialize new_insert_key.  We can key off of
that to avoid the uninitialized warning.

Link: http://lkml.kernel.org/r/5eca5ffb-2155-8df2-b4a2-f162f105efed@suse.com
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:22 -04:00
Ryusuke Konishi
e63e88bc53 nilfs2: move ioctl interface and disk layout to uapi separately
The header file "include/linux/nilfs2_fs.h" is composed of parts for
ioctl and disk format, and both are intended to be shared with user
space programs.

This moves them to the uapi directory "include/uapi/linux", splitting the
file into "nilfs2_api.h" and "nilfs2_ondisk.h".  The following minor
changes accompany this migration:

 - nilfs_direct_node struct in nilfs2/direct.h is merged into
   nilfs2_ondisk.h because it's an on-disk structure.
 - inline functions nilfs_rec_len_from_disk() and
   nilfs_rec_len_to_disk() are moved to nilfs2/dir.c.

Link: http://lkml.kernel.org/r/1465825507-3407-4-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:21 -04:00
Ryusuke Konishi
4ce5c3426c nilfs2: use BIT() macro
Replace bit shifts by BIT macro for clarity.
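As a rough illustration of the kind of substitution described above, the sketch below defines a userspace stand-in for BIT() and uses it for two flag constants. The flag names and the helper function are hypothetical and are not taken from nilfs2.

```c
/* Userspace stand-in for the kernel's BIT() helper (assumption for this sketch). */
#define BIT(nr) (1UL << (nr))

/* Hypothetical flag names, for illustration only. */
#define EXAMPLE_FLAG_DIRTY    BIT(0)	/* instead of (1 << 0) */
#define EXAMPLE_FLAG_VOLATILE BIT(3)	/* instead of (1 << 3) */

/* Return nonzero when every bit in mask is set in flags. */
static int example_test_flags(unsigned long flags, unsigned long mask)
{
	return (flags & mask) == mask;
}
```

The generated constants are unchanged; only the spelling of the shift becomes uniform.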

Link: http://lkml.kernel.org/r/1465825507-3407-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:21 -04:00
Ryusuke Konishi
ad980c9ab7 nilfs2: fix misuse of a semaphore in sysfs code
Variables ns_seg_seq, ns_segnum, ns_nextnum, ns_pseg_offset, ns_cno,
ns_ctime, ns_nongc_ctime, and ns_ndirtyblks are protected by
ns_segctor_sem, but ns_sem is wrongly used by the nilfs sysfs code when
reading these variables.  This fixes the misuse and clarifies which
semaphore protects them in the comment of the_nilfs struct.

Link: http://lkml.kernel.org/r/1465825507-3407-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:20 -04:00
Ryusuke Konishi
a7d3f104da nilfs2: refactor parser of snapshot mount option
Move parser of snapshot mount option to a separate function
nilfs_parse_snapshot_option(), replace simple_strtoull() with
kstrtoull() to avoid checkpatch.pl warning "WARNING: simple_strtoull is
obsolete, use kstrtoull instead", and refine the error message of the
parser.
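The practical difference between a lenient and a strict numeric parser can be sketched in userspace. The helper below is a hypothetical approximation built on strtoull(); it is not the kernel's kstrtoull() and does not reproduce all of its rules, but it shows the "whole string must be a number, errors are returned as codes" behavior that makes a strict parser preferable for mount options.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical strict parser: succeeds only if the entire string is an
 * unsigned decimal number.  On failure *res is left untouched.
 */
static int example_strict_strtoull(const char *s, unsigned long long *res)
{
	char *end;
	unsigned long long val;

	/* Reject NULL, empty strings, signs and leading whitespace. */
	if (s == NULL || !isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "123abc" */

	*res = val;
	return 0;
}
```

A lenient parser would silently accept "123abc" as 123; here the caller gets an error code it can turn into a clear message.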

Link: http://lkml.kernel.org/r/1464875891-5443-9-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:20 -04:00
Ryusuke Konishi
aceb4170bb nilfs2: do not use yield()
Use cond_resched() instead of yield() in the loop of
nilfs_transaction_lock() since the usage corresponds to the "be nice for
others" case described in the comment of yield().

This removes the following checkpatch.pl warning:

 "WARNING: Using yield() is generally wrong. See yield() kernel-doc
  (sched/core.c)"

Link: http://lkml.kernel.org/r/1464875891-5443-8-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:19 -04:00
Ryusuke Konishi
39a9dcca61 nilfs2: emit error message when I/O error is detected
When nilfs returns -EIO as an error code, it's not always clear whether
it came from the underlying block device.  This addresses the issue by
having the low-level I/O routines of nilfs output an error message when
they detect an I/O error.

Link: http://lkml.kernel.org/r/1464875891-5443-7-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:19 -04:00
Ryusuke Konishi
d6517deb01 nilfs2: replace nilfs_warning() with nilfs_msg()
Use nilfs_msg() to output warning messages and get rid of
nilfs_warning() function.  This also removes function names from the
messages unless we embed them explicitly in format strings.  Instead,
some messages are revised to clarify the context.

[arnd@arndb.de: avoid warning about unused variables]
  Link: http://lkml.kernel.org/r/20160615201945.3348205-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1464875891-5443-6-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:18 -04:00
Ryusuke Konishi
feee880fa5 nilfs2: reduce bare use of printk() with nilfs_msg()
Replace most uses of printk() in the nilfs2 implementation with
nilfs_msg(), and reduce occurrences of the following checkpatch.pl warning:

  "WARNING: Prefer [subsystem eg: netdev]_crit([subsystem]dev, ...
   then dev_crit(dev, ... then pr_crit(...  to printk(KERN_CRIT ..."

This patch also fixes a minor checkpatch warning "WARNING: quoted string
split across lines" that often accompanies the prior warning, and amends
message format as needed.

Link: http://lkml.kernel.org/r/1464875891-5443-5-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:17 -04:00
Ryusuke Konishi
6625689e15 nilfs2: embed a back pointer to super block instance in nilfs object
Insert a back pointer to the super block instance in the nilfs object so
that nilfs2 functions can easily refer to the super block instance.  This
simplifies the replacement of printk() in the subsequent change.

Link: http://lkml.kernel.org/r/1464875891-5443-4-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:17 -04:00
Ryusuke Konishi
a66dfb0a91 nilfs2: add nilfs_msg() message interface
Define a dedicated output routine to replace bare use of the printk()
function.  The output routine is implemented with a macro and a helper
function, which are named nilfs_msg() and __nilfs_msg(), respectively.

__nilfs_msg() formats a message like "NILFS (<device-name>): <message>",
prefixing it with a given log level, and terminates the statement with a
newline.  The "device-name" is optional to make it available in early
stages; it will be omitted if a NULL pointer is passed as the super block
instance argument.  nilfs_msg() wraps __nilfs_msg() and is removed if
CONFIG_PRINTK is not set.
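The message layout described above can be sketched in userspace as follows. Everything here is an assumption made for illustration: the function name, the textual "<3>" level tag, and in particular the "NILFS: " form used when no device name is given (the description only says the name is omitted).

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical formatter: writes "<level>NILFS (<dev>): <message>\n" into
 * buf, or "<level>NILFS: <message>\n" when dev is NULL.
 */
static int example_format_msg(char *buf, size_t len, const char *level,
			      const char *dev, const char *fmt, ...)
{
	char body[256];
	va_list args;

	/* Expand the caller's format string first. */
	va_start(args, fmt);
	vsnprintf(body, sizeof(body), fmt, args);
	va_end(args);

	/* Add level prefix, optional device name, and trailing newline. */
	if (dev)
		return snprintf(buf, len, "%sNILFS (%s): %s\n", level, dev, body);
	return snprintf(buf, len, "%sNILFS: %s\n", level, body);
}
```

Because the newline is appended by the formatter, callers pass format strings without a trailing "\n".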

Link: http://lkml.kernel.org/r/1464875891-5443-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:16 -04:00
Ryusuke Konishi
cae3d4ca6f nilfs2: hide function name argument from nilfs_error()
Simplify nilfs_error(), an output function used to report critical
issues in the file system.  This renames the original nilfs_error()
function to __nilfs_error() and redefines nilfs_error() as a macro to hide
the function name argument within the macro.

Every call site of nilfs_error() is changed to strip __func__ argument
except nilfs_bmap_convert_error(); nilfs_bmap_convert_error() directly
calls __nilfs_error() because it inherits caller's function name.
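The macro pattern described above can be sketched as follows. All names are hypothetical stand-ins; the sketch only shows how a wrapper macro supplies __func__ so call sites no longer pass it, while a helper that reports on behalf of its caller still calls the underlying function directly.

```c
#include <stdio.h>

/* Underlying reporter: takes the function name explicitly. */
static int example_error_fn(char *buf, size_t len, const char *function,
			    const char *msg)
{
	return snprintf(buf, len, "error in %s: %s", function, msg);
}

/* Wrapper macro: hides the function name argument from call sites. */
#define example_error(buf, len, msg) \
	example_error_fn(buf, len, __func__, msg)

/* Ordinary call site: no __func__ argument needed any more. */
static int example_check_block(char *buf, size_t len)
{
	return example_error(buf, len, "bad block");
}

/* Helper that inherits its caller's name, so it bypasses the macro. */
static int example_convert_error(char *buf, size_t len, const char *caller)
{
	return example_error_fn(buf, len, caller, "conversion failed");
}
```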

Link: http://lkml.kernel.org/r/1464875891-5443-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:16 -04:00
Daniel Wagner
a310dcb7a4 fs/binfmt_em86.c: fix incompatible pointer type
Since -Wincompatible-pointer-types is reported as an error, alpha no
longer builds.  Let's fix it in a minimal way.

  fs/binfmt_em86.c:73:35: error: passing argument 2 of `copy_strings_kernel' from incompatible pointer type [-Werror=incompatible-pointer-types]
     retval = copy_strings_kernel(1, &i_arg, bprm);
                                     ^            ^
  fs/binfmt_em86.c:77:34: error: passing argument 2 of `copy_strings_kernel' from incompatible pointer type [-Werror=incompatible-pointer-types]
    retval = copy_strings_kernel(1, &i_name, bprm);
                                    ^
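For readers unfamiliar with this class of diagnostic, here is a userspace sketch. It assumes the callee takes a const-qualified string vector (`const char *const *`), and it shows one possible way to satisfy such a parameter; it is not necessarily the change made by this patch, and all names are hypothetical.

```c
#include <string.h>

/* Hypothetical stand-in for a function taking a const-qualified vector. */
static size_t example_copy_strings(int argc, const char *const *argv)
{
	size_t total = 0;

	for (int i = 0; i < argc; i++)
		total += strlen(argv[i]) + 1;	/* include the NUL byte */
	return total;
}

static size_t example_caller(char *i_arg)
{
	/*
	 * Passing &i_arg (a char **) directly is what triggers the
	 * incompatible-pointer-types diagnostic in C.  A const-qualified
	 * local has a compatible type, so no cast is needed.
	 */
	const char *arg = i_arg;

	return example_copy_strings(1, &arg);
}
```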

Link: http://lkml.kernel.org/r/1469525978-23359-1-git-send-email-wagi@monom.org
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:15 -04:00
Kees Cook
0036d1f7eb binfmt_elf: fix calculations for bss padding
A double-bug exists in the bss calculation code, where an overflow can
happen in the "last_bss - elf_bss" calculation, but vm_brk internally
aligns the argument, underflowing it and wrapping back around to a safe
value.  We shouldn't depend on these bugs staying in sync, so this cleans
up the bss padding handling to avoid the overflow.

This moves the bss padzero() before the last_bss > elf_bss case, since
the zero-filling of the ELF_PAGE should have nothing to do with the
relationship of last_bss and elf_bss: any trailing portion should be
zeroed, and a zero size is already handled by padzero().

Then it handles the math on elf_bss vs last_bss correctly.  These need
to both be ELF_PAGE aligned to get the comparison correct, since that's
the expected granularity of the mappings.  Since elf_bss already had
alignment-based padding happen in padzero(), the "start" of the new
vm_brk() should be moved forward as done in the original code.  However,
since the "end" of the vm_brk() area will already become PAGE_ALIGNed in
vm_brk() then last_bss should get aligned here to avoid hiding it as a
side-effect.

Additionally, this makes a cosmetic change to the initial last_bss
calculation so it's easier to read in comparison to the load_addr
calculation above it (i.e.  the only difference is p_filesz vs p_memsz).
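The arithmetic behind the two cancelling bugs can be worked through in a small userspace sketch. The 4096-byte page size, the macro, and both function names are assumptions for illustration and are not the kernel code. The first function mimics the fragile pattern (align only the start, then subtract); the second aligns both ends before comparing.

```c
#define EXAMPLE_PAGE_SIZE 4096UL
#define EXAMPLE_PAGE_ALIGN(x) \
	(((x) + EXAMPLE_PAGE_SIZE - 1) & ~(EXAMPLE_PAGE_SIZE - 1))

/*
 * Fragile pattern: only the start is aligned up, so the subtraction can
 * wrap around when last_bss lies in the same page as elf_bss.
 */
static unsigned long example_naive_len(unsigned long elf_bss,
				       unsigned long last_bss)
{
	return last_bss - EXAMPLE_PAGE_ALIGN(elf_bss);
}

/*
 * Safer pattern: align both ends first, then compare.  Returns the number
 * of bytes of extra mapping needed, or 0 when none is required.
 */
static unsigned long example_bss_extra(unsigned long elf_bss,
				       unsigned long last_bss)
{
	elf_bss = EXAMPLE_PAGE_ALIGN(elf_bss);
	last_bss = EXAMPLE_PAGE_ALIGN(last_bss);
	if (last_bss > elf_bss)
		return last_bss - elf_bss;
	return 0;
}
```

With elf_bss = 0x1234 and last_bss = 0x1800, the naive length wraps to a huge value, and page-aligning that value wraps again to 0, which is the accidental rescue the message describes. The second function reaches 0 without relying on wraparound.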

Link: http://lkml.kernel.org/r/1468014494-25291-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Ismael Ripoll Ripoll <iripoll@upv.es>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:14 -04:00