50807 Commits

Author SHA1 Message Date
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Darrick J. Wong
4cc1ee5e65 xfs: simplify the rmap code in xfs_bmse_merge
In Christoph's patch to refactor xfs_bmse_merge, the updated rmap code
does more work than it needs to (because map-extent auto-merges
records).  Remove the unnecessary unmap and save ourselves a deferred
op.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-01 13:08:26 -07:00
Eric Sandeen
f91fb956f2 xfs: remove unused flags arg from xfs_file_iomap_begin_delay
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Amir Goldstein
47c7d0b195 xfs: fix incorrect log_flushed on fsync
When calling into _xfs_log_force{,_lsn}() with a pointer
to log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffers

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

The write X is not guarantied to be on persistent storage
when PREFLUSH request in completed, because write A was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops of a file and record ops on log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
742d842907 xfs: disable per-inode DAX flag
Currently flag switching can be used to easily crash the kernel.  Disable
the per-inode DAX flag until that is sorted out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
8bfadd8d03 xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
Use the existing functionality instead of directly poking into the extent
list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
e17a5c6f0e xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
This avoids poking into the internals of the extent list.  Also return
the number of extents as the return value instead of an additional
by reference argument, and make it available to callers outside of
xfs_bmap_util.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
4c35445b59 xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
This abstracts the function away from details of the low-level extent
list implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
4da6b514ea xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
This abstracts the function away from details of the low-level extent
list implementation.

Note that it seems like the previous implementation of rmap for
the merge case was completely broken, but it no seems appear to
trigger that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
05b7c8ab2b xfs: move some code around inside xfs_bmap_shift_extents
For the first right move we need to look up next_fsb.  That means
our last fsb that contains next_fsb must also be the current extent,
so take advantage of that by moving the code around a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Oleg Nesterov
138e4ad67a epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
The race was introduced by me in commit 971316f0503a ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 13:07:35 -07:00
Serge E. Hallyn
8db6c34f1d Introduce v3 namespaced file capabilities
Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr.  If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable.  Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege.  For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise.  The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr.  It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change.  Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid.  This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace.  The kernel will
translate that rootid to an absolute uid, and write that to disk.  After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file.  A task may overwrite an existing xattr so long as it is
privileged over the inode.  Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr.  This check can be re-added if deemed useful.

This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.

Example using tar:

 $ cp /bin/sleep sleepx
 $ mkdir b1 b2
 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
 $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
 $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
   b2/sleepx = cap_sys_admin+ep
 # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
   v3 xattr, rootid is 100001

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
   Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
   Nov 07 2016: convert rootid from and to fs user_ns
   (From ebiederm: mar 28 2017)
     commoncap.c: fix typos - s/v4/v3
     get_vfs_caps_from_disk: clarify the fs_ns root access check
     nsfscaps: change the code split for cap_inode_setxattr()
   Apr 09 2017:
       don't return v3 cap for caps owned by current root.
      return a v2 cap for a true v2 cap in non-init ns
   Apr 18 2017:
      . Change the flow of fscap writing to support s_user_ns writing.
      . Remove refuse_fcap_overwrite().  The value of the previous
        xattr doesn't matter.
   Apr 24 2017:
      . incorporate Eric's incremental diff
      . move cap_convert_nscap to setxattr and simplify its usage
   May 8, 2017:
      . fix leaking dentry refcount in cap_inode_getsecurity

Signed-off-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-09-01 14:57:15 -05:00
Linus Torvalds
b8a78bb4d1 ceph fscache page locking fix from Zheng, marked for stable.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZqYGlAAoJEEp/3jgCEfOLx58H/1jnP79H03/kchVBCGPLCKjs
 E+pgHpb2922EGeYmEUoxfq627SiCODap/jo6JFVpsd+JnmHLZiMzmEzGpDce6fn9
 /YY5u3WNtmnKtyPvl0kzspK0ujaeCuiRyarULXBiHveL2ZQINKus4F9MiZphNnt4
 X4hgo866+esEf6LocuEkMoEGvgN7vk/Q9nDPgD/YoFrhCuwdvLJpBnE65CGbQyk5
 n3g0qlBR+yorDr1stdlSyVUDPkF5FQjhQTqkpi1oPAhsNPKgVPyZzRIEQEA+nI+N
 wTsQ0SMKfST4PNaRNdUuO1xwszYziqqlLZ2KwaaLIDHlElcbQR1S3GUKz6hddJc=
 =oesm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "ceph fscache page locking fix from Zheng, marked for stable"

* tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client:
  ceph: fix readpage from fscache
2017-09-01 12:46:30 -07:00
Christoph Hellwig
f2285c148c xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
Use the bmap abstraction instead of open-coding bmbt details here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
50bb44c286 xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
Use the helper instead of open coding it, to provide a better abstraction
for the scalable extent list work.  This also gets an additional assert
and trace point for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
67e4e69cb2 xfs: add a xfs_iext_update_extent helper
This helper is used to update an extent record based on the extent index,
and can be used to provide a level of abstractions between callers that
want to modify in-core extent records and the details of the extent list
implementation.

Also switch all users of the xfs_bmbt_set_all(xfs_iext_get_ext(...))
pattern to this new helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
d522d569d6 xfs: consolidate the various page fault handlers
Add a new __xfs_filemap_fault helper that implements all four page fault
callouts, and make these methods themselves small stubs that set the
correct write_fault flag, and exit early for the non-DAX case for the
hugepage related ones.

Also remove the extra size checking in the pfn_fault path, which is now
handled in the core DAX code.

Life would be so much simpler if we only had one method for all this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
e7647fb491 iomap: return VM_FAULT_* codes from iomap_page_mkwrite
All callers will need the VM_FAULT_* flags, so convert in the helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
2dd3d709fc xfs: relog dirty buffers during swapext bmbt owner change
The owner change bmbt scan that occurs during extent swap operations
does not handle ordered buffer failures. Buffers that cannot be
marked ordered must be physically logged so previously dirty ranges
of the buffer can be relogged in the transaction.

Since the bmbt scan may need to process and potentially log a large
number of blocks, we can't expect to complete this operation in a
single transaction. Update extent swap to use a permanent
transaction with enough log reservation to physically log a buffer.
Update the bmbt scan to physically log any buffers that cannot be
ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the
caller rolls the transaction and restarts the scan. Finally, update
the bmbt scan helper function to skip bmbt blocks that already match
the expected owner so they are not reprocessed after scan restarts.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix the xfs_trans_roll call]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
a5814bceea xfs: disallow marking previously dirty buffers as ordered
Ordered buffers are used in situations where the buffer is not
physically logged but must pass through the transaction/logging
pipeline for a particular transaction. As a result, ordered buffers
are not unpinned and written back until the transaction commits to
the log. Ordered buffers have a strict requirement that the target
buffer must not be currently dirty and resident in the log pipeline
at the time it is marked ordered. If a dirty+ordered buffer is
committed, the buffer is reinserted to the AIL but not physically
relogged at the LSN of the associated checkpoint. The buffer log
item is assigned the LSN of the latest checkpoint and the AIL
effectively releases the previously logged buffer content from the
active log before the buffer has been written back. If the tail
pushes forward and a filesystem crash occurs while in this state, an
inconsistent filesystem could result.

It is currently the caller responsibility to ensure an ordered
buffer is not already dirty from a previous modification. This is
unclear and error prone when not used in situations where it is
guaranteed a buffer has not been previously modified (such as new
metadata allocations).

To facilitate general purpose use of ordered buffers, update
xfs_trans_ordered_buf() to conditionally order the buffer based on
state of the log item and return the status of the result. If the
bli is dirty, do not order the buffer and return false. The caller
must either physically log the buffer (having acquired the
appropriate log reservation) or push it from the AIL to clean it
before it can be marked ordered in the current transaction.

Note that ordered buffers are currently only used in two situations:
1.) inode chunk allocation where previously logged buffers are not
possible and 2.) extent swap which will be updated to handle ordered
buffer failures in a separate patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
6fb10d6d22 xfs: move bmbt owner change to last step of extent swap
The extent swap operation currently resets bmbt block owners before
the inode forks are swapped. The bmbt buffers are marked as ordered
so they do not have to be physically logged in the transaction.

This use of ordered buffers is not safe as bmbt buffers may have
been previously physically logged. The bmbt owner change algorithm
needs to be updated to physically log buffers that are already dirty
when/if they are encountered. This means that an extent swap will
eventually require multiple rolling transactions to handle large
btrees. In addition, all inode related changes must be logged before
the bmbt owner change scan begins and can roll the transaction for
the first time to preserve fs consistency via log recovery.

In preparation for such fixes to the bmbt owner change algorithm,
refactor the bmbt scan out of the extent fork swap code to the last
operation before the transaction is committed. Update
xfs_swap_extent_forks() to only set the inode log flags when an
owner change scan is necessary. Update xfs_swap_extents() to trigger
the owner change based on the inode log flags. Note that since the
owner change now occurs after the extent fork swap, the inode btrees
must be fixed up with the inode number of the current inode (similar
to log recovery).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
99c794c639 xfs: skip bmbt block ino validation during owner change
Extent swap uses xfs_btree_visit_blocks() to fix up bmbt block
owners on v5 (!rmapbt) filesystems. The bmbt scan uses
xfs_btree_lookup_get_block() to read bmbt blocks which verifies the
current owner of the block against the parent inode of the bmbt.
This works during extent swap because the bmbt owners are updated to
the opposite inode number before the inode extent forks are swapped.

The modified bmbt blocks are marked as ordered buffers which allows
everything to commit in a single transaction. If the transaction
commits to the log and the system crashes such that recovery of the
extent swap is required, log recovery restarts the bmbt scan to fix
up any bmbt blocks that may have not been written back before the
crash. The log recovery bmbt scan occurs after the inode forks have
been swapped, however. This causes the bmbt block owner verification
to fail, leads to log recovery failure and requires xfs_repair to
zap the log to recover.

Define a new invalid inode owner flag to inform the btree block
lookup mechanism that the current inode may be invalid with respect
to the current owner of the bmbt block. Set this flag on the cursor
used for change owner scans to allow this operation to work at
runtime and during log recovery.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: bb3be7e7c ("xfs: check for bogus values in btree block headers")
Cc: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
8dc518dfa7 xfs: don't log dirty ranges for ordered buffers
Ordered buffers are attached to transactions and pushed through the
logging infrastructure just like normal buffers with the exception
that they are not actually written to the log. Therefore, we don't
need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is
called on ordered buffers to set up all of the dirty state on the
transaction, buffer and log item and prepare the buffer for I/O.

Now that xfs_trans_dirty_buf() is available, call it from
xfs_trans_ordered_buf() so the latter is now mutually exclusive with
xfs_trans_log_buf(). This reflects the implementation of ordered
buffers and helps eliminate confusion over the need to log ranges of
ordered buffers just to set up internal log state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
9684010d38 xfs: refactor buffer logging into buffer dirtying helper
xfs_trans_log_buf() is responsible for logging the dirty segments of
a buffer along with setting all of the necessary state on the
transaction, buffer, bli, etc., to ensure that the associated items
are marked as dirty and prepared for I/O. We have a couple use cases
that need to to dirty a buffer in a transaction without actually
logging dirty ranges of the buffer.  One existing use case is
ordered buffers, which are currently logged with arbitrary ranges to
accomplish this even though the content of ordered buffers is never
written to the log. Another pending use case is to relog an already
dirty buffer across rolled transactions within the deferred
operations infrastructure. This is required to prevent a held
(XFS_BLI_HOLD) buffer from pinning the tail of the log.

Refactor xfs_trans_log_buf() into a new function that contains all
of the logic responsible to dirty the transaction, lidp, buffer and
bli. This new function can be used in the future for the use cases
outlined above. This patch does not introduce functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
e9385cc6fb xfs: ordered buffer log items are never formatted
Ordered buffers pass through the logging infrastructure without ever
being written to the log. The way this works is that the ordered
buffer status is transferred to the log vector at commit time via
the ->iop_size() callback. In xlog_cil_insert_format_items(),
ordered log vectors bypass ->iop_format() processing altogether.

Therefore it is unnecessary for xfs_buf_item_format() to handle
ordered buffers. Remove the unnecessary logic and assert that an
ordered buffer never reaches this point.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
6453c65d35 xfs: remove unnecessary dirty bli format check for ordered bufs
xfs_buf_item_unlock() historically checked the dirty state of the
buffer by manually checking the buffer log formats for dirty
segments. The introduction of ordered buffers invalidated this check
because ordered buffers have dirty bli's but no dirty (logged)
segments. The check was updated to accommodate ordered buffers by
looking at the bli state first and considering the blf only if the
bli is clean.

This logic is safe but unnecessary. There is no valid case where the
bli is clean yet the blf has dirty segments. The bli is set dirty
whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).

Remove the conditional blf dirty checks and replace with an assert
that should catch any discrepencies between bli and blf dirty
states. Refactor the old blf dirty check into a helper function to
be used by the assert.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
a4f6cf6b2b xfs: open-code xfs_buf_item_dirty()
It checks a single flag and has one caller. It probably isn't worth
its own function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
8ad7c629b1 xfs: remove the ip argument to xfs_defer_finish
And instead require callers to explicitly join the inode using
xfs_defer_ijoin.  Also consolidate the defer error handling in
a few places using a goto label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
882d8785fb xfs: rename xfs_defer_join to xfs_defer_ijoin
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
411350df14 xfs: refactor xfs_trans_roll
Split xfs_trans_roll into a low-level helper that just rolls the
actual transaction and a new higher level xfs_trans_roll_inode
that takes care of logging and rejoining the inode.  This gets
rid of the NULL inode case, and allows to simplify the special
cases in the deferred operation code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Omar Sandoval
f2e9ad212d xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster()
After xfs_ifree_cluster() finds an inode in the radix tree and verifies
that the inode number is what it expected, xfs_reclaim_inode() can swoop
in and free it. xfs_ifree_cluster() will then happily continue working
on the freed inode. Most importantly, it will mark the inode stale,
which will probably be overwritten when the inode slab object is
reallocated, but if it has already been reallocated then we can end up
with an inode spuriously marked stale.

In 8a17d7ddedb4 ("xfs: mark reclaimed inodes invalid earlier") we added
a second check to xfs_iflush_cluster() to detect this race, but the
similar RCU lookup in xfs_ifree_cluster() needs the same treatment.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Darrick J. Wong
799ea9e9c5 xfs: evict all inodes involved with log redo item
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery.  This also had the
effect of putting linked inodes on the lru instead of evicting them.

Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.

Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.

Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-09-01 10:55:30 -07:00
Dan Carpenter
ef13ecbc13 kernfs: checking for IS_ERR() instead of NULL
The kernfs_get_inode() returns NULL on error, it never returns error
pointers.

Fixes: aa8188253474 ("kernfs: add exportfs operations")
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-01 08:17:55 -06:00
Boris Brezillon
d1f936d736 This pull request contains the following core changes:
* Fix memory leaks in the core
 * Remove unused NAND locking support
 * Rename nand.h into rawnand.h (preparing support for spi NANDs)
 * Use NAND_MAX_ID_LEN where appropriate
 * Fix support for 20nm Hynix chips
 * Fix support for Samsung and Hynix SLC NANDs
 
 and the following driver changes:
 
 * Various cleanup, improvements and fixes in the qcom driver
 * Fixes for bugs detected by various static code analysis tools
 * Fix mxc ooblayout definition
 * Add a new part_parsers to tmio and sharpsl platform data in order to
   define a custom list of partition parsers
 * Request the reset line in exclusive mode in the sunxi driver
 * Fix a build error in the orion-nand driver when compiled for ARMv4
 * Allow 64-bit mvebu platforms to select the PXA3XX driver
 -----BEGIN PGP SIGNATURE-----
 
 iQJABAABCAAqBQJZpwMZIxxib3Jpcy5icmV6aWxsb25AZnJlZS1lbGVjdHJvbnMu
 Y29tAAoJEGXtNgF+CLcAJEoP/jRnEjPznWT3+ngw6k/rnykkn/wexKV3iyX/6b71
 MQT/ZFuT3HsHnUjyprPvyRWJeKun6XyIH5fk7FlXIei9TaWCt6/UGTKousaPKeR2
 maggGeEjxGVpHJM/jpIYCyjt83zpezBqTupv52XhXxPaU7ROSpuHCd92YcPzIaT5
 tcn8JrI7TGuGlBrBbA2y8ZrPtuug3IKqUfpIiQmoqr0jzQR+AbZKHg0kk5a8piOn
 OguK67uhxeOvq831bGPehCPDbuE0loNi4CssayJ1HrisfS95kH/cqrveapgKsUG/
 fxaHh1i65I0lxa8sgUgeUiU04Zsy1YcgNbCj41AY4AHnjJ0+Qp1cV6KAB/x5/wH9
 ES/fW06how+1BLEeLvOr+rIQ41WeP0qV2H3r/PtkeswKKAV3gSERBXVHmg1E7Yum
 HkmPqzhu+nSk3mP7p3yxpd7EwWQh2xpvVYrfQ5vQbtdfm8Uw9n6S8x+O89ch9wWi
 +KbMWFsmF78nuRWW3WTsaiOFKcTRLBT5RoU2z4i9hCvQT2Pnx3SBhHrIj6xqBI1S
 8MpeDdlXHmRQfZxj+jDqU77JYqVEmy/5it9OjhjMpOqxfCf4K6Nlb75TEdR5Nh/9
 BA1qqTBEslg3UqS8ofGFHGFZWrW3JHf02SYo2zU9IvBninh3HrqHiFhBc3p1xNDE
 7GwD
 =bC1+
 -----END PGP SIGNATURE-----

Merge tag 'nand/for-4.14' of git://git.infradead.org/l2-mtd into mtd/next

From Boris:
"
This pull request contains the following core changes:

* Fix memory leaks in the core
* Remove unused NAND locking support
* Rename nand.h into rawnand.h (preparing support for spi NANDs)
* Use NAND_MAX_ID_LEN where appropriate
* Fix support for 20nm Hynix chips
* Fix support for Samsung and Hynix SLC NANDs

and the following driver changes:

* Various cleanup, improvements and fixes in the qcom driver
* Fixes for bugs detected by various static code analysis tools
* Fix mxc ooblayout definition
* Add a new part_parsers to tmio and sharpsl platform data in order to
  define a custom list of partition parsers
* Request the reset line in exclusive mode in the sunxi driver
* Fix a build error in the orion-nand driver when compiled for ARMv4
* Allow 64-bit mvebu platforms to select the PXA3XX driver
"
2017-09-01 15:34:30 +02:00
Steve French
7e682f766f Fix warning messages when mounting to older servers
When mounting to older servers, such as Windows XP (or even Windows 7),
the limited error messages that can be passed back to user space can
get confusing since the default dialect has changed from SMB1 (CIFS) to
more secure SMB3 dialect. Log additional information when the user chooses
to use the default dialects and when the server does not support the
dialect requested.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-01 00:18:44 -05:00
Linus Torvalds
e89ce1f89f two cifs bug fixes for stable
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZpxWIAAoJEIosvXAHck9Rl1AL/2wNdEvDvDYZ48IeJN/R0Agb
 nhsXycDshjjUSMghm5IMXvaOitwQL3bcbVeGBDOF4UYoDgJZSuOeOvzYhAJaykiy
 Cs2rwve9Vuw9pliY1D4705g2jduvca6KP4Rzcmf8zeRqG8siQLDlFI8MqYcWiz6d
 F8O9EH+n+Daqf4L5b9IB/3aXZiQLDXKC5HjwJJlPjF6fhkqMbwFLSgTYPmCoM2M2
 oC4jDIPIvM7+p5K8Hs3UvkUaeUoMV1Hx/bba3gGFLK0lnGiD2davZPNBLR46nt6j
 L6l0ujg1pv0kOnTsyfiZp/m43UF5IGeLacBj5DczDrio/8FYKcLCoTIXpWB/dMn/
 NR9kGw/Y9M7jv+qumwrFzMaqv4ztMc2QssFsm37gHEtkEIBPiPTYixRIG/OkVRo7
 u8sZic1DvFot0YalZqpIX6Ss6egiY5CI4q7AWU8XwpkiW+K/RBN8g31Cmcx6scU2
 6/ywpNm2tP5D7AfH13jQh8CQOstnrFH50fUCyXpASA==
 =VgIs
 -----END PGP SIGNATURE-----

Merge tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two cifs bug fixes for stable"

* tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: remove endian related sparse warning
  CIFS: Fix maximum SMB2 header size
2017-08-31 18:45:04 -07:00
Linus Torvalds
ea25c43179 Merge branch 'mmu_notifier_fixes'
Merge mmu_notifier fixes from Jérôme Glisse:
 "The invalidate_page callback suffered from 2 pitfalls. First it used
  to happen after page table lock was release and thus a new page might
  have been setup for the virtual address before the call to
  invalidate_page().

  This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
  try_to_unmap_one() to use page_vma_mapped_walk()") which moved the
  callback under the page table lock. Which also broke several existing
  user of the mmu_notifier API that assumed they could sleep inside this
  callback.

  The second pitfall was invalidate_page being the only callback not
  taking a range of address in respect to invalidation but was giving an
  address and a page. Lot of the callback implementer assumed this could
  never be THP and thus failed to invalidate the appropriate range for
  THP pages.

  By killing this callback we unify the mmu_notifier callback API to
  always take a virtual address range as input.

  There is now two clear API (I am not mentioning the youngess API which
  is seldomly used):

   - invalidate_range_start()/end() callback (which allow you to sleep)

   - invalidate_range() where you can not sleep but happen right after
     page table update under page table lock

  Note that a lot of existing user feels broken in respect to
  range_start/ range_end. Many user only have range_start() callback but
  there is nothing preventing them to undo what was invalidated in their
  range_start() callback after it returns but before any CPU page table
  update take place.

  The code pattern use in kvm or umem odp is an example on how to
  properly avoid such race. In a nutshell use some kind of sequence
  number and active range invalidation counter to block anything that
  might undo what the range_start() callback did.

  If you do not care about keeping fully in sync with CPU page table (ie
  you can live with CPU page table pointing to new different page for a
  given virtual address) then you can take a reference on the pages
  inside the range_start callback and drop it in range_end or when your
  driver is done with those pages.

  Last alternative is to use invalidate_range() if you can do
  invalidation without sleeping as invalidate_range() callback happens
  under the CPU page table spinlock right after the page table is
  updated.

  The first two patches convert existing mmu_notifier_invalidate_page()
  calls to mmu_notifier_invalidate_range() and bracket those call with
  call to mmu_notifier_invalidate_range_start()/end().

  The next ten patches remove existing invalidate_page() callback as it
  can no longer happen.

  Finally the last page remove the invalidate_page() callback completely
  so it can RIP.

  Changes since v1:
   - remove more dead code in kvm (no testing impact)
   - more accurate end address computation (patch 2) in page_mkclean_one
     and try_to_unmap_one
   - added tested-by/reviewed-by gotten so far"

* emailed patches from Jérôme Glisse <jglisse@redhat.com>:
  mm/mmu_notifier: kill invalidate_page
  KVM: update to new mmu_notifier semantic v2
  xen/gntdev: update to new mmu_notifier semantic
  sgi-gru: update to new mmu_notifier semantic
  misc/mic/scif: update to new mmu_notifier semantic
  iommu/intel: update to new mmu_notifier semantic
  iommu/amd: update to new mmu_notifier semantic
  IB/hfi1: update to new mmu_notifier semantic
  IB/umem: update to new mmu_notifier semantic
  drm/amdgpu: update to new mmu_notifier semantic
  powerpc/powernv: update to new mmu_notifier semantic
  mm/rmap: update to new mmu_notifier semantic v2
  dax: update to new mmu_notifier semantic
2017-08-31 17:30:01 -07:00
Dave Kleikamp
c227390c91 jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes
jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't
accounted for the whole 32-bit index range on 32-bit systems.  That has
been fixed by commit 0cc3b0ec23ce ("Clarify (and fix) MAX_LFS_FILESIZE
macros"), so we can simplify the code now.

Suggested by Andreas Dilger.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 17:02:21 -07:00
Jérôme Glisse
a4d1a88525 dax: update to new mmu_notifier semantic
Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
and make sure it is bracketed by calls to *_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionaly report an invalidation as
happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:12:59 -07:00
Yan, Zheng
dd2bc47348 ceph: fix readpage from fscache
ceph_readpage() unlocks page prematurely prematurely in the case
that page is reading from fscache. Caller of readpage expects that
page is uptodate when it get unlocked. So page shoule get locked
by completion callback of fscache_read_or_alloc_pages()

Cc: stable@vger.kernel.org # 4.1+, needs backporting for < 4.7
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-01 00:04:26 +02:00
Christoph Hellwig
ddef7ed2b5 annotate RWF_... flags
[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-31 17:32:38 -04:00
Dan Williams
5e405595e5 ext4: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 11:12:13 -07:00
Dan Williams
8cf037a8b2 ext2: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 09:33:30 -07:00
Dan Williams
486aff5e04 xfs: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 09:31:47 -07:00
Andreas Dilger
b5f515735b ext4: avoid Y2038 overflow in recently_deleted()
Avoid a 32-bit time overflow in recently_deleted() since i_dtime
(inode deletion time) is stored only as a 32-bit value on disk.
Since i_dtime isn't used for much beyond a boolean value in e2fsck
and is otherwise only used in this function in the kernel, there is
no benefit to use more space in the inode for this field on disk.

Instead, compare only the relative deletion time with the low
32 bits of the time using the newly-added time_before32() helper,
which is similar to time_before() and time_after() for jiffies.

Increase RECENTCY_DIRTY to 300s based on Ted's comments about
usage experience at Google.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2017-08-31 11:09:45 -04:00
Ernesto A. Fernández
309e8cda59 gfs2: preserve i_mode if __gfs2_set_acl() fails
When changing a file's acl mask, __gfs2_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-31 07:53:15 -05:00
Ernesto A. Fernández
54aae14bee gfs2: don't return ENODATA in __gfs2_xattr_set unless replacing
The function __gfs2_xattr_set() will return -ENODATA when called to
remove a xattr that does not exist. The result is that setfacl will
show an exit status of 1 when called to set only a file's mode bits
(on a file with no ACLs), despite succeeding. A "No data available"
error will be printed as well.

To fix this return 0 instead, except when the XATTR_REPLACE flag is
set, in which case -ENODATA is appropriate. This is consistent with
how most other xattr setting functions work, in other filesystems.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-31 07:43:03 -05:00
Steve French
6e3c1529c3 CIFS: remove endian related sparse warning
Recent patch had an endian warning ie
cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-08-30 14:43:11 -05:00
Pavel Shilovsky
9e37b1784f CIFS: Fix maximum SMB2 header size
Currently the maximum size of SMB2/3 header is set incorrectly which
leads to hanging of directory listing operations on encrypted SMB3
connections. Fix this by setting the maximum size to 170 bytes that
is calculated as RFC1002 length field size (4) + transform header
size (52) + SMB2 header size (64) + create response size (56).

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2017-08-30 14:42:30 -05:00
Bob Peterson
c4a9d1892f GFS2: Fix non-recursive truncate bug
Before this patch if you truncated a file to a smaller size it
wasn't freeing all the blocks properly. There are two reasons.

First, the metapath comparison was not comparing previous heights.
I added a function, mp_eq_to_hgt, which checks the metapath at
all heights prior to the target height.

Second, in function find_nonnull_ptr, it needed to zero out all
pointers for heights following the target height. Translated into
decimal integer terms, this way a number like 299, when incremented,
becomes 300, not 399. The 2 gets incremented to 3, and the following
digits need to be reset.

These two things allow the truncate state machine to properly find
the blocks it needs to delete.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30 13:29:22 -05:00