linux

iv/linux

Author	SHA1	Message	Date
Nikolay Borisov	ec7d6dfd73	btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c Those functions are going to be used even after inode cache is removed so moved them to a more appropriate place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:05 +01:00
David Sterba	1201b58b67	btrfs: drop casts of bio bi_sector Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the result of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:05 +01:00
Naohiro Aota	12659251ca	btrfs: implement log-structured superblock for ZONED mode Superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place superblock in the zone. One easy solution is limiting superblock and copies to be placed only in conventional zones. However, this method has two downsides: one is reduced number of superblock copies. The location of the second copy of superblock is 256GB, which is in a sequential write required zone on typical devices in the market today. So, the number of superblock and copies is limited to be two. Second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second one. Then, when both zones are filled up and before starting to write to the first zone again, it reset the first zone. We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it If these reserved zones are conventional, superblock is written fixed at the start of the zone without logging. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	a589dde0bc	btrfs: disallow mixed-bg in ZONED mode Placing both data and metadata in a block group is impossible in ZONED mode. For data, we can allocate a space for it and write it immediately after the allocation. For metadata, however, we cannot do that, because the logical addresses are recorded in other metadata buffers to build up the trees. As a result, a data buffer can be placed after a metadata buffer, which is not written yet. Writing out the data buffer will break the sequential write rule. Check and disallow MIXED_BG with ZONED mode. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	f1569c4c10	btrfs: disable fallocate in ZONED mode fallocate() is implemented by reserving actual extent instead of reservations. This can result in exposing the sequential write constraint of host-managed zoned block devices to the application, which would break the POSIX semantic for the fallocated file. To avoid this, report fallocate() as not supported when in ZONED mode for now. In the future, we may be able to implement "in-memory" fallocate() in ZONED mode by utilizing space_info->bytes_may_use or similar, so this returns EOPNOTSUPP. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	d206e9c9c5	btrfs: disallow NODATACOW in ZONED mode NODATACOW implies overwriting the file data on a device, which is impossible in sequential required zones. Disable NODATACOW globally with mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	5d1ab66c56	btrfs: disallow space_cache in ZONED mode As updates to the space cache v1 are in-place, the space cache cannot be located over sequential zones and there is no guarantees that the device will have enough conventional zones to store this cache. Resolve this problem by disabling completely the space cache v1. This does not introduce any problems with sequential block groups: all the free space is located after the allocation pointer and no free space before the pointer. There is no need to have such cache. Note: we can technically use free-space-tree (space cache v2) on ZONED mode. But, since ZONED mode now always allocates extents in a block group sequentially regardless of underlying device zone type, it's no use to enable and maintain the tree. For the same reason, NODATACOW is also disabled. In summary, ZONED will disable: \| Disabled features \| Reason \| \|-------------------+-----------------------------------------------------\| \| RAID/DUP \| Cannot handle two zone append writes to different \| \| \| zones \| \|-------------------+-----------------------------------------------------\| \| space_cache (v1) \| In-place updating \| \| NODATACOW \| In-place updating \| \|-------------------+-----------------------------------------------------\| \| fallocate \| Reserved extent will be a write hole \| \|-------------------+-----------------------------------------------------\| \| MIXED_BG \| Allocated metadata region will be write holes for \| \| \| data writes \| Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	862931c763	btrfs: introduce max_zone_append_size The zone append write command has a maximum IO size restriction it accepts. This is because a zone append write command cannot be split, as we ask the device to place the data into a specific target zone and the device responds with the actual written location of the data. Introduce max_zone_append_size to zone_info and fs_info to track the value, so we can limit all I/O to a zoned block device that we want to write using the zone append command to the device's limits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:04 +01:00
Naohiro Aota	b70f509774	btrfs: check and enable ZONED mode Introduce function btrfs_check_zoned_mode() to check if ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:03 +01:00
Naohiro Aota	5b31646898	btrfs: get zone information of zoned block devices If a zoned block device is found, get its zone information (number of zones and zone size). To avoid costly run-time zone report commands to test the device zones type during block allocation, attach the seq_zones bitmap to the device structure to indicate if a zone is sequential or accept random writes. Also it attaches the empty_zones bitmap to indicate if a zone is empty or not. This patch also introduces the helper function btrfs_dev_is_sequential() to test if the zone storing a block is a sequential write required zone and btrfs_dev_is_empty_zone() to test if the zone is a empty zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:15:57 +01:00
Hui Su	0288e7fa35	fs/kernfs: remove the double check of dentry->inode In both kernfs_node_from_dentry() and in kernfs_dentry_node(), we will check the dentry->inode is NULL or not, which is superfluous. So remove the check in kernfs_node_from_dentry(). Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hui Su <sh_def@163.com> Link: https://lore.kernel.org/r/20201113132143.GA119541@rlk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-12-09 19:03:49 +01:00
Eric Sandeen	207ddc0ef4	xfs: don't catch dax+reflink inodes as corruption in verifier We don't yet support dax on reflinked files, but that is in the works. Further, having the flag set does not automatically mean that the inode is actually "in the CPU direct access state," which depends on several other conditions in addition to the flag being set. As such, we should not catch this as corruption in the verifier - simply not actually enabling S_DAX on reflinked files is enough for now. Fixes: 4f435ebe7d04 ("xfs: don't mix reflink and DAX mode for now") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [darrick: fix the scrubber too] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	a5336d6bb2	xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks In commit 27c14b5daa82 we started tracking the last inode seen during an inode walk to avoid infinite loops if a corrupt inobt record happens to have a lower ir_startino than the record preceeding it. Unfortunately, the assertion trips over the case where there are completely empty inobt records (which can happen quite easily on 64k page filesystems) because we advance the tracking cursor without actually putting the empty record into the processing buffer. Fix the assert to allow for this case. Reported-by: zlang@redhat.com Fixes: 27c14b5daa82 ("xfs: ensure inobt record walks always make forward progress") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-12-09 09:49:38 -08:00
Joseph Qi	2e984badbc	xfs: remove unneeded return value check for init_cursor() Since init_cursor() can always return a valid cursor, the NULL check in caller is unneeded. So clean them up. This also keeps the behavior consistent with other callers. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Gao Xiang	7bc1fea9d3	xfs: introduce xfs_validate_stripe_geometry() Introduce a common helper to consolidate stripe validation process. Also make kernel code xfs_validate_sb_common() use it first. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	237d7887ae	xfs: show the proper user quota options The quota option 'usrquota' should be shown if both the XFS_UQUOTA_ACCT and XFS_UQUOTA_ENFD flags are set. The option 'uqnoenforce' should be shown when only the XFS_UQUOTA_ACCT flag is set. The current code logic seems wrong, Fix it and show proper options. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	afbd914776	xfs: remove the unused XFS_B_FSB_OFFSET macro There are no callers of the XFS_B_FSB_OFFSET macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	88269b880a	xfs: remove unnecessary null check in xfs_generic_create The function posix_acl_release() test the passed-in argument and move on only when it is non-null, so maybe the null check in xfs_generic_create is unnecessary. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	b3b29cd106	xfs: directly return if the delta equal to zero The xfs_trans_mod_dquot() function will allocate new tp->t_dqinfo if it is NULL and make the changes in the tp->t_dqinfo->dqs[XFS_QM_TRANS _{USR,GRP,PRJ}]. Nowadays seems none of the callers want to join the dquots to the transaction and push them to device when the delta is zero. Actually, most of time the caller would check the delta and go on only when the delta value is not zero, so we should bail out when it is zero. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	04a58620a1	xfs: check tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag Nowadays the only things that the XFS_TRANS_DQ_DIRTY flag seems to do are indicates the tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}] values changed and check in xfs_trans_apply_dquot_deltas() and the unreserve variant xfs_trans_unreserve_and_mod_dquots(). Actually, we also can use the tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag, that is to say, we allocate the new tp->t_dqinfo only when the qtrx values changed, so the tp->t_dqinfo value isn't NULL equals the XFS_TRANS_DQ_DIRTY flag is set, we only need to check if tp->t_dqinfo == NULL in xfs_trans_apply_dquot_deltas() and its unreserve variant to determine whether lock all of the dquots and join them to the transaction. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Kaixu Xia	a9382fa9a9	xfs: delete duplicated tp->t_dqinfo null check and allocation The function xfs_trans_mod_dquot_byino() wraps around xfs_trans_mod_dquot() to account for quotas, and also there is the function call chain xfs_trans_reserve_quota_bydquots -> xfs_trans_dqresv -> xfs_trans_mod_dquot, both of them do the duplicated null check and allocation. Thus we can delete the duplicated operation from them. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	1e5c39dfd3	xfs: rename xfs_fc_* back to xfs_fs_* Get rid of this one-off namespace since we're done converting things to fscontext now. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	33005fd0a5	xfs: refactor file range validation Refactor all the open-coded validation of file block ranges into a single helper, and teach the bmap scrubber to check the ranges. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	18695ad425	xfs: refactor realtime volume extent validation Refactor all the open-coded validation of realtime device extents into a single helper. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	67457eb0d2	xfs: refactor data device extent validation Refactor all the open-coded validation of non-static data device extents into a single helper. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	4b80ac6445	xfs: scrub should mark a directory corrupt if any entries cannot be iget'd It's possible that xfs_iget can return EINVAL for inodes that the inobt thinks are free, or ENOENT for inodes that look free. If this is the case, mark the directory corrupt immediately when we check ftype. Note that we already check the ftype of the '.' and '..' entries, so we can skip the iget part since we already know the inode type for '.' and we have a separate parent pointer scrubber for '..'. Fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	da531cc46e	xfs: fix parent pointer scrubber bailing out on unallocated inodes xfs_iget can return -ENOENT for a file that the inobt thinks is allocated but has zeroed mode. This currently causes scrub to exit with an operational error instead of flagging this as a corruption. The end result is that scrub mistakenly reports the ENOENT to the user instead of "directory parent pointer corrupt" like we do for EINVAL. Fixes: 5927268f5a04 ("xfs: flag inode corruption if parent ptr doesn't get us a real inode") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	acf104c233	xfs: detect overflows in bmbt records Detect file block mappings with a blockcount that's either so large that integer overflows occur or are zero, because neither are valid in the filesystem. Worse yet, attempting directory modifications causes the iext code to trip over the bmbt key handling and takes the filesystem down. We can fix most of this by preventing the bad metadata from entering the incore structures in the first place. Found by setting blockcount=0 in a directory data fork mapping and watching the fireworks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	6337032689	xfs: trace log intent item recovery failures Add a trace point so that we can capture when a recovered log intent item fails to recover. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	da5de11029	xfs: validate feature support when recovering rmap/refcount intents The rmap, and refcount log intent items were added to support the rmap and reflink features. Because these features come with changes to the ondisk format, the log items aren't tied to a log incompat flag. However, the log recovery routines don't actually check for those feature flags. The kernel has no business replayng an intent item for a feature that isn't enabled, so check that as part of recovered log item validation. (Note that kernels pre-dating rmap and reflink already fail log recovery on the unknown log item type code.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	7396c7fbe0	xfs: improve the code that checks recovered extent-free intent items The code that validates recovered extent-free intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	3c15df3de0	xfs: hoist recovered extent-free intent checks out of xfs_efi_item_recover When we recover a extent-free intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	0d79781a1a	xfs: improve the code that checks recovered refcount intent items The code that validates recovered refcount intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	ed64f8343a	xfs: hoist recovered refcount intent checks out of xfs_cui_item_recover When we recover a refcount intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	c447ad62dc	xfs: improve the code that checks recovered rmap intent items The code that validates recovered rmap intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	dda7ba65bf	xfs: hoist recovered rmap intent checks out of xfs_rui_item_recover When we recover a rmap intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	67d8679bd3	xfs: improve the code that checks recovered bmap intent items The code that validates recovered bmap intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	bc525cf455	xfs: hoist recovered bmap intent checks out of xfs_bui_item_recover When we recover a bmap intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	96f65bad7c	xfs: enable the needsrepair feature Make it so that libxfs recognizes the needsrepair feature. Note that the kernel will still refuse to mount these. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-12-09 09:49:38 -08:00
Darrick J. Wong	80c720b8eb	xfs: define a new "needrepair" feature Define an incompat feature flag to indicate that the filesystem needs to be repaired. While libxfs will recognize this feature, the kernel will refuse to mount if the feature flag is set, and only xfs_repair will be able to clear the flag. The goal here is to force the admin to run xfs_repair to completion after upgrading the filesystem, or if we otherwise detect anomalies. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2020-12-09 09:48:13 -08:00
Trond Myklebust	716a8bc7f7	nfsd: Record NFSv4 pre/post-op attributes as non-atomic For the case of NFSv4, specify to the client that the pre/post-op attributes were not recorded atomically with the main operation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Trond Myklebust	01cbf38539	nfsd: Set PF_LOCAL_THROTTLE on local filesystems only Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they aren't expected to ever be subject to double buffering. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Trond Myklebust	2e19d10c14	nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE If the underlying filesystem times out, then we want knfsd to return NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Trond Myklebust	d045465fc6	exportfs: Add a function to return the raw output from fh_to_dentry() In order to allow nfsd to accept return values that are not acceptable to overlayfs and others, add a new function. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Jeff Layton	7f84b488f9	nfsd: close cached files prior to a REMOVE or RENAME that would replace target It's not uncommon for some workloads to do a bunch of I/O to a file and delete it just afterward. If knfsd has a cached open file however, then the file may still be open when the dentry is unlinked. If the underlying filesystem is nfs, then that could trigger it to do a sillyrename. On a REMOVE or RENAME scan the nfsd_file cache for open files that correspond to the inode, and proactively unhash and put their references. This should prevent any delete-on-last-close activity from occurring, solely due to knfsd's open file cache. This must be done synchronously though so we use the variants that call flush_delayed_fput. There are deadlock possibilities if you call flush_delayed_fput while holding locks, however. In the case of nfsd_rename, we don't even do the lookups of the dentries to be renamed until we've locked for rename. Once we've figured out what the target dentry is for a rename, check to see whether there are cached open files associated with it. If there are, then unwind all of the locking, close them all, and then reattempt the rename. None of this is really necessary for "typical" filesystems though. It's mostly of use for NFS, so declare a new export op flag and use that to determine whether to close the files beforehand. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Jeff Layton	ba5e8187c5	nfsd: allow filesystems to opt out of subtree checking When we start allowing NFS to be reexported, then we have some problems when it comes to subtree checking. In principle, we could allow it, but it would mean encoding parent info in the filehandles and there may not be enough space for that in a NFSv3 filehandle. To enforce this at export upcall time, we add a new export_ops flag that declares the filesystem ineligible for subtree checking. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
Jeff Layton	daab110e47	nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations With NFSv3 nfsd will always attempt to send along WCC data to the client. This generally involves saving off the in-core inode information prior to doing the operation on the given filehandle, and then issuing a vfs_getattr to it after the op. Some filesystems (particularly clustered or networked ones) have an expensive ->getattr inode operation. Atomicity is also often difficult or impossible to guarantee on such filesystems. For those, we're best off not trying to provide WCC information to the client at all, and to simply allow it to poll for that information as needed with a GETATTR RPC. This patch adds a new flags field to struct export_operations, and defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate that nfsd should not attempt to provide WCC info in NFSv3 replies. It also adds a blurb about the new flags field and flag to the exporting documentation. The server will also now skip collecting this information for NFSv2 as well, since that info is never used there anyway. Note that this patch does not add this flag to any filesystem export_operations structures. This was originally developed to allow reexporting nfs via nfsd. Other filesystems may want to consider enabling this flag too. It's hard to tell however which ones have export operations to enable export via knfsd and which ones mostly rely on them for open-by-filehandle support, so I'm leaving that up to the individual maintainers to decide. I am cc'ing the relevant lists for those filesystems that I think may want to consider adding this though. Cc: HPDD-discuss@lists.01.org Cc: ceph-devel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: fuse-devel@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
J. Bruce Fields	1631087ba8	Revert "nfsd4: support change_attr_type attribute" This reverts commit a85857633b04d57f4524cca0a2bfaf87b2543f9f. We're still factoring ctime into our change attribute even in the IS_I_VERSION case. If someone sets the system time backwards, a client could see the change attribute go backwards. Maybe we can just say "well, don't do that", but there's some question whether that's good enough, or whether we need a better guarantee. Also, the client still isn't actually using the attribute. While we're still figuring this out, let's just stop returning this attribute. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
J. Bruce Fields	942b20dc24	nfsd4: don't query change attribute in v2/v3 case inode_query_iversion() has side effects, and there's no point calling it when we're not even going to use it. We check whether we're currently processing a v4 request by checking fh_maxsize, which is arguably a little hacky; we could add a flag to svc_fh instead. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:38 -05:00
J. Bruce Fields	4b03d99794	nfsd: minor nfsd4_change_attribute cleanup Minor cleanup, no change in behavior. Also pull out a common helper that'll be useful elsewhere. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:39:37 -05:00

... 7 8 9 10 11 ...

68382 Commits