45422 Commits

Author SHA1 Message Date
Dan Williams
14df6a4e7e fs/dax: remove wmb_pmem()
Flushing posted-write queues is now deferred to REQ_FLUSH context, or
otherwise handled by an ADR event at the platform level.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12 15:13:48 -07:00
Sachin Prabhu
8d9535b6ef cifs: Check for existing directory when opening file with O_CREAT
When opening a file with O_CREAT flag, check to see if the file opened
is an existing directory.

This prevents the directory from being opened which subsequently causes
a crash when the close function for directories cifs_closedir() is called
which frees up the file->private_data memory while the file is still
listed on the open file list for the tcon.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
2016-07-12 16:09:38 -05:00
Bob Peterson
44f52122a2 GFS2: Check rs_free with rd_rsspin protection
For the last process to close a file opened for write, function
gfs2_rsqa_delete was deleting the file's inode's block reservation
out of the rgrp reservations tree. Then it was checking to make sure
rs_free was 0, but it was performing the check outside the protection
of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
to prevent a race between the process freeing the reservation and
another who is allocating a new set of blocks inside the same rgrp
for the same inode, thus changing its value.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-12 11:48:22 -05:00
Linus Torvalds
08d27eb206 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  posix_acl: de-union a_refcount and a_rcu
  nfs_atomic_open(): prevent parallel nfs_lookup() on a negative hashed
  Use the right predicate in ->atomic_open() instances
2016-07-12 16:49:01 +09:00
Sachin Prabhu
5b23c97d7e Add MF-Symlinks support for SMB 2.0
We should be able to use the same helper functions used for SMB 2.1 and
later versions.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-07-11 22:20:54 -05:00
Chuck Lever
a4e187d83d NFS: Don't drop CB requests with invalid principals
Before commit 778be232a207 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.

Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Tal Shorer
3dc3afadeb configfs: don't set buffer_needs_fill to zero if show() returns error
A confgifs attribute's show() callback is called once the first time
the user attempts to read from it. If it returns an error, that
error is returned to the user. However, the open file's
buffer_needs_fill is still set to zero and consecutive read() calls
will find an empty buffer that doesn't need filling and return 0 to
the user. This could give the user the wrong impression that the
attribute was read successfully.

Fix this by not setting buffer_needs_fill if show() returns an error,
making consecutive read() calls call show() again and either get an
error again or get data.

Signed-off-by: Tal Shorer <tal.shorer@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-07-10 21:02:18 +09:00
Mauro Carvalho Chehab
cbb5c8355a Merge branch 'topic/cec' into patchwork
* topic/cec:
  [media] DocBook/media: add CEC documentation
  [media] s5p_cec: get rid of an unused var
  [media] move s5p-cec to staging
  [media] vivid: add CEC emulation
  [media] cec: s5p-cec: Add s5p-cec driver
  [media] cec: adv7511: add cec support
  [media] cec: adv7842: add cec support
  [media] cec: adv7604: add cec support
  [media] cec: add compat32 ioctl support
  [media] cec/TODO: add TODO file so we know why this is still in staging
  [media] cec: add HDMI CEC framework (api)
  [media] cec: add HDMI CEC framework (adapter)
  [media] cec: add HDMI CEC framework (core)
  [media] cec-funcs.h: static inlines to pack/unpack CEC messages
  [media] cec.h: add cec header
  [media] cec-edid: add module for EDID CEC helper functions
  [media] cec.txt: add CEC framework documentation
  [media] rc: Add HDMI CEC protocol handling
2016-07-08 18:16:10 -03:00
Jaegeuk Kim
b56ab837a0 f2fs: avoid mark_inode_dirty
Let's check inode's dirtiness before calling mark_inode_dirty.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:34:09 -07:00
Jaegeuk Kim
a2ee0a3003 f2fs: move i_size_write in f2fs_write_end
We don't need to do i_size_write under page lock.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:35 -07:00
Chao Yu
c24a0fd655 f2fs: fix to avoid redundant discard during fstrim
With below test steps, f2fs will issue redundant discard when doing fstrim,
the reason is that we issue discards for both prefree segments and
consecutive freed region user wants to trim, part regions they covered are
overlapped, here, we change to do not to issue any discards for prefree
segments in trimmed range.

1. mount -t f2fs -o discard /dev/zram0 /mnt/f2fs
2. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/
3. dd if=/dev/zero  of=/mnt/f2fs/a bs=2M count=1
4. dd if=/dev/zero  of=/mnt/f2fs/b bs=1M count=1
5. sync
6. rm /mnt/f2fs/a /mnt/f2fs/b
7. fstrim -o 0 -l 3221225472 -m 2097152 -v /mnt/f2fs/

Before:
<...>-5428  [001] ...1  9511.052125: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x200
<...>-5428  [001] ...1  9511.052787: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300

After:
<...>-6764  [000] ...1  9720.382504: f2fs_issue_discard: dev = (251,0), blkstart = 0x2200, blklen = 0x300

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:34 -07:00
Yunlei He
c7b41e1613 f2fs: avoid mismatching block range for discard
This patch skip discard block range smaller than trim_minlen,
and can not be merged by neighbour

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:33 -07:00
Chao Yu
3e6d0b4d9c f2fs: fix incorrect f_bfree calculation in ->statfs
As manual described, f_bfree indicates total free blocks in fs, in f2fs, it
includes two parts: visible free blocks and over-provision blocks. This
patch corrrects the calculation.

fsblkcnt_t   f_bfree;   /* free blocks in fs */

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:32 -07:00
Jaegeuk Kim
ec795418c4 f2fs: use percpu_rw_semaphore
This patch replaces rw_semaphore with percpu_rw_semaphore for:
sbi->cp_rwsem
nm_i->nat_tree_lock

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:31 -07:00
Jaegeuk Kim
3bdad3c7ee f2fs: skip to check the block address of node page
If the node page is up-to-date, it should be alive.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:31 -07:00
Jaegeuk Kim
2555a2d558 f2fs: shrink critical region in spin_lock
This patch shrinks the critical region in spin_lock.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:30 -07:00
Jaegeuk Kim
237c0790e5 f2fs: call SetPageUptodate if needed
SetPageUptodate() issues memory barrier, resulting in performance degrdation.
Let's avoid that.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:29 -07:00
Jaegeuk Kim
fe76b796fc f2fs: introduce f2fs_set_page_dirty_nobuffer
This patch adds f2fs_set_page_dirty_nobuffer() copied from __set_page_dirty_buffer.
When appending 4KB blocks in f2fs on pmem with multiple cores, this improves the
overall performance.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:28 -07:00
Tiezhu Yang
a0995af695 f2fs: remove unnecessary goto statement
When base_addr is NULL, there is no need to call kzfree,
it should return -ENOMEM directly. Additionally, it is
better to initialize variable 'error' with 0.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:27 -07:00
Chao Yu
64058be9c8 f2fs: add nodiscard mount option
This patch adds 'nodiscard' mount option.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:26 -07:00
Chao Yu
72e1c797b5 f2fs: fix to redirty page if fail to gc data page
If we fail to move data page during foreground GC, we should give another
chance to writeback that page which was set dirty previously by writer.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:26 -07:00
Chao Yu
1563ac75e7 f2fs: fix to detect truncation prior rather than EIO during read
In procedure of synchonized read, after sending out the read request, reader
will try to lock the page for waiting device to finish the read jobs and
unlock the page, but meanwhile, truncater will race with reader, so after
reader get lock of the page, it should check page's mapping to detect
whether someone has truncated the page in advance, then reader has the
chance to do the retry if truncation was done, otherwise read can be failed
due to previous condition check.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:25 -07:00
Chao Yu
78682f7944 f2fs: fix to avoid reading out encrypted data in page cache
For encrypted inode, if user overwrites data of the inode, f2fs will read
encrypted data into page cache, and then do the decryption.

However reader can race with overwriter, and it will see encrypted data
which has not been decrypted by overwriter yet. Fix it by moving decrypting
work to background and keep page non-uptodated until data is decrypted.

Thread A				Thread B
- f2fs_file_write_iter
 - __generic_file_write_iter
  - generic_perform_write
   - f2fs_write_begin
    - f2fs_submit_page_bio
					- generic_file_read_iter
					 - do_generic_file_read
					  - lock_page_killable
					  - unlock_page
					  - copy_page_to_iter
					  hit the encrypted data in updated page
    - lock_page
    - fscrypt_decrypt_page

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-08 10:33:24 -07:00
Linus Torvalds
b987c759d2 eCryptfs fixes for 4.7-rc7:
- Provide a more concise fix for CVE-2016-1583
   + Additionally fixes linux-stable regressions caused by the cherry-picking of
     the original fix
 - Some very minor changes that have queued up
   + Fix typos in code comments
   + Remove unnecessary check for NULL before destroying kmem_cache
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJXf8nnAAoJENaSAD2qAscKwXgP/0awhY1z40dL/igP6fPv2ack
 HbqrOjUVO2DzxinvKB3vRLNy93zwESxe8UpwPsl84IJ85zOQjkUkJ8PYk1oyBf0N
 dVWqO11g6AKNZ+VQFspconvMhZATwSrsv8z3BzvwNGLsPhPuUQ+JmbBe8xMdrsZ5
 qVaWswsMtMlhM3p/zFh57vWO64fT1xiabpxSkKpG2LHJN6h6QAQxkfBfa2FuXCsN
 hZIw+ULcUJfdawXGq8lAfcYzbDmFpNt70fFquJgfJHrXFrOuensYfLcWUvhrSNbc
 HZ6imRK9LCG4IKjJTBNmCmBR8ho71yGzdKuup81Eap+2zx2kC7twokS1d5fha8iL
 Kzkx0NMDriY2N+tIfufHYk2IIenFzWG6Yuj0STswtJX4YhQGBc0H6VxcgrxE0PgW
 k1iKUV7jnJGxxN+d6lmV4+fX0vKGgBMsQq1Q76CkYLN1BAvdwz6GnWSfqP8hWz3o
 sNVyNtYh+/TXY8JMWKDBlps7Ib8W88qDW3K7YcAf2VPYAqIWm5Va1MR5m5s+UIeR
 QiCD32X/0PfDp13QRiKAHJ6C9CInyu0r+fF/g8ZMqLuWgLxoahxpr6ML/CnHoGl5
 IcDydJO3/bLBq9If8WxYsOQvVKCa4e7N7o7ZHPKd8U7O39mCGNfbQx7/FlMjtvf6
 +4HAxamUC1ogpLTkpWxI
 =Bt4P
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Provide a more concise fix for CVE-2016-1583:
   - Additionally fixes linux-stable regressions caused by the
     cherry-picking of the original fix

  Some very minor changes that have queued up:
   - Fix typos in code comments
   - Remove unnecessary check for NULL before destroying kmem_cache"

* tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: don't allow mmap when the lower fs doesn't support it
  Revert "ecryptfs: forbid opening files without mmap handler"
  ecryptfs: fix spelling mistakes
  eCryptfs: fix typos in comment
  ecryptfs: drop null test before destroy functions
2016-07-08 09:48:28 -07:00
Jeff Mahoney
f0fe970df3 ecryptfs: don't allow mmap when the lower fs doesn't support it
There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs.  We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2016-07-08 10:35:28 -05:00
Jeff Mahoney
78c4e17241 Revert "ecryptfs: forbid opening files without mmap handler"
This reverts commit 2f36db71009304b3f0b95afacd8eba1f9f046b87.

It fixed a local root exploit but also introduced a dependency on
the lower file system implementing an mmap operation just to open a file,
which is a bit of a heavy hammer.  The right fix is to have mmap depend
on the existence of the mmap handler instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2016-07-07 18:47:57 -05:00
Linus Torvalds
ac904ae6e6 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Three small fixes that have been queued up and tested for this series:

   - A bug fix for xen-blkfront from Bob Liu, fixing an issue with
     incomplete requests during migration.

   - A fix for an ancient issue in retrieving the IO priority of a
     different PID than self, preventing that task from going away while
     we access it.  From Omar.

   - A writeback fix from Tahsin, fixing a case where we'd call ihold()
     with a zero ref count inode"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix use-after-free in sys_ioprio_get()
  writeback: inode cgroup wb switch should not call ihold()
  xen-blkfront: save uncompleted reqs in blkfront_resume()
2016-07-07 15:34:09 -07:00
Linus Torvalds
4c2a8499a4 Configfs fixes for Linux 4.7:
- a fix from Marek for ppos handling in configfs_write_bin_file,
    which was introduced in Linux 4.5, but didn't have any users
    until recently.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXfjK1AAoJEA+eU2VSBFGDJ0kQAMZHONK7pHwM+IxDUeTsDTfa
 FX+EplF1rLEtmUGOl01XbjgQp7acsP19YWikQfC09+ZjF6Vn1zFAFlNoU3qM+i/2
 zdukzIRBaSM+4w4HFDQ548zGGc8e9mesIIUHrHt6n/nL0OLKTU0XzbmRMXmvXAUJ
 u0nuB0OYlFJEVQFIlDYfG6E2rJy37FPilToonfw+AryVDenRm9iiUt0iMFoA8wPG
 EpogUinelxrKZ+ysOEeibaTGxxLLbd3AbWeUQbkhmsk4FfxuV7GFSfGPbhmJ1LeU
 n5X3LbK8lixG8goGdAW1NYulLnTjprZ6emUL0jwxYmI+MzOP7DUcsLqsGmdh5LEa
 Uw5gzQnzNDOUFR8XFt5CgARUHRXeDyJfzKeWvKFd4JUink6wE9R0yQJ2bPBXu3pY
 7t0P4qQSKWdeGjmYg54/JBamMba7BLQOLZuuiAplTTAt5Dg4tEi9Zuje2sUmBcDn
 3YnG4dnGxPeP9EKCElh1WWtwiRItUKT+YtaqSinL1Rh2j1aWu9WJQ3M1C+s3hFQ3
 vGR/CchllLtP4xmpY9TXEpUbBp4ZnTercWAxLRczi1/kOm3SdMIMUak26U6pTBmN
 QfkEzSx+K5F1wGaoqa2j7MuTMSNsxfT1R0xA6aBid4oOyCoPyHOVJ3mqV0AziyOl
 c7B92HgR/aGFMEevVA1G
 =mZ2H
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.7' of git://git.infradead.org/users/hch/configfs

Pull configfs fix from Christoph Hellwig:
 "A fix from Marek for ppos handling in configfs_write_bin_file, which
  was introduced in Linux 4.5, but didn't have any users until recently"

* tag 'configfs-for-4.7' of git://git.infradead.org/users/hch/configfs:
  configfs: Remove ppos increment in configfs_write_bin_file
2016-07-07 15:32:17 -07:00
Josef Bacik
8ca17f0f59 Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes
We used to allow you to set FLUSH_ALL and then just wouldn't do things like
commit transactions or wait on ordered extents if we noticed you were in a
transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
we've lost the ability to tell, and we could end up deadlocking.  So instead use
FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
we error out to preserve the previous behavior.  I've also added an ASSERT() to
catch anybody else who tries to do this.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ac2fabac42 Btrfs: fill relocation block rsv after allocation
Since we set the reloc control before we've reserved our space for relocation we
could race with a root being dirtied and not actually have space to do our init
reloc root.  So once we've allocated it and set it up go ahead and make our
reservation before setting the relocate control, that way anybody who tries to
do the reloc root init has space to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
40acc3eede Btrfs: always use trans->block_rsv for orphans
This is the case all the time anyway except for relocation which could be doing
a reloc root for a non ref counted root, in which case we'd end up with some
random block rsv rather than the one we have our reservation in.  If there isn't
enough space in the block rsv we are trying to steal from we'll BUG() because we
expect there to be space for the orphan to make its reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
ae2e472881 Btrfs: change how we calculate the global block rsv
Traditionally we've calculated the global block rsv by guessing how much of the
metadata used amount was the extent tree, and then taking the data size and
figuring out how large the csum tree would have to be to hold that much data.

This is imprecise and falls down on MIXED file systems as we can't trust the
data used amount.  This resulted in failures for xfstests generic/333 because it
creates lots of clones, which explodes out the extent tree.  Our global reserve
calculations were woefully inaccurate in this case which meant we got into a
situation where we did not have enough reserved to do our work.

We know we only use the global block rsv for the extent, csum, and root trees,
so just get the bytes used for these trees and use that as the basis of our
global reserve.  Since these are not reference counted trees the bytes_used
value will be accurate.  This fixed the transaction aborts seen with
generic/333.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
87241c2e68 Btrfs: use root when checking need_async_flush
Instead of doing fs_info->fs_root in need_async_flush, which may not be set
during recovery when mounting, just pass the root itself in, which makes more
sense as thats what btrfs_calc_reclaim_metadata_size takes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d38b349c39 Btrfs: don't bother kicking async if there's nothing to reclaim
We do this check when we start the async reclaimer thread, might as well check
before we kick it off to save us some cycles.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
31bada7c4e Btrfs: fix release reserved extents trace points
We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space.  We
were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
only actually free the reservation instead of pinning the extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f376df2b7d Btrfs: add tracepoints for flush events
We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
f485c9ee32 Btrfs: fix delalloc reservation amount tracepoint
We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c51e7bb184 Btrfs: trace pinned extents
Pinned extents are an important metric to keep track of for enospc.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
957780eb27 Btrfs: introduce ticketed enospc infrastructure
Our enospc flushing sucks.  It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.  So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation.  This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks.  These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c83f8effef Btrfs: add tracepoint for adding block groups
I'm writing a tool to visualize the enospc system inside btrfs, I need this
tracepoint in order to keep track of the block groups in the system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
d555b6c380 Btrfs: warn_on for unaccounted spaces
These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
c48f49d63d Btrfs: change delayed reservation fallback behavior
We reserve space for the inode update when we first reserve space for writing to
a file.  However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents.  Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve.  The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly.  In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations.  Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
48c3d480e4 Btrfs: always reserve metadata for delalloc extents
There are a few races in the metadata reservation stuff.  First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes.  So use the normal
btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation.  So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
25d609f86d Btrfs: fix callers of btrfs_block_rsv_migrate
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv.  This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size.  So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Josef Bacik
e40edf2da4 Btrfs: add bytes_readonly to the spaceinfo at once
For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info.  This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit.  Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Ingo Molnar
4b4b20852d Merge branch 'timers/fast-wheel' into timers/core 2016-07-07 10:35:28 +02:00
Jaegeuk Kim
ac6f199984 f2fs: avoid latency-critical readahead of node pages
The f2fs_map_blocks is very related to the performance, so let's avoid any
latency to read ahead node pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-06 10:44:10 -07:00
Jaegeuk Kim
2c237ebaa4 f2fs: avoid writing node/metapages during writes
Let's keep more node/meta pages in run time.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-06 10:44:09 -07:00
Jaegeuk Kim
ad4edb8314 f2fs: produce more nids and reduce readahead nats
The readahead nat pages are more likely to be reclaimed quickly, so it'd better
to gather more free nids in advance.

And, let's keep some free nids as much as possible.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-06 10:44:08 -07:00