linux

iv/linux

Author	SHA1	Message	Date
Josef Bacik	bb385bedde	btrfs: fix error handling in __btrfs_update_delayed_inode If we get an error while looking up the inode item we'll simply bail without cleaning up the delayed node. This results in this style of warning happening on commit: WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90 CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90 RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000 RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060 RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000 R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00 R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90 FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0 Call Trace: btrfs_commit_transaction+0x43c/0xb00 ? finish_wait+0x80/0x80 ? vfs_fsync_range+0x90/0x90 iterate_supers+0x8c/0x100 ksys_sync+0x50/0x90 __do_sys_sync+0xa/0x10 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Because the iref isn't dropped and this leaves an elevated node->count, so any release just re-queues it onto the delayed inodes list. Fix this by going to the out label to handle the proper cleanup of the delayed node. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Josef Bacik	a4cb90dc01	btrfs: make btrfs_release_delayed_iref handle the !iref case Right now we only cleanup the delayed iref if we have BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error conditions that need to cleanup the iref if it still exists, so to make this code cleaner move the test_bit into btrfs_release_delayed_iref itself and unconditionally call it in each of the cases instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
David Sterba	eb3b505366	btrfs: scrub: per-device bandwidth control Add sysfs interface to limit io during scrub. We relied on the ionice interface to do that, eg. the idle class let the system usable while scrub was running. This has changed when mq-deadline got widespread and did not implement the scheduling classes. That was a CFQ thing that got deleted. We've got numerous complaints from users about degraded performance. Currently only BFQ supports that but it's not a common scheduler and we can't ask everybody to switch to it. Alternatively the cgroup io limiting can be used but that also a non-trivial setup (v2 required, the controller must be enabled on the system). This can still be used if desired. Other ideas that have been explored: piggy-back on ionice (that is set per-process and is accessible) and interpret the class and classdata as bandwidth limits, but this does not have enough flexibility as there are only 8 allowed and we'd have to map fixed limits to each value. Also adjusting the value would need to lookup the process that currently runs scrub on the given device, and the value is not sticky so would have to be adjusted each time scrub runs. Running out of options, sysfs does not look that bad: - it's accessible from scripts, or udev rules - the name is similar to what MD-RAID has (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max) - the value is sticky at least for filesystem mount time - adjusting the value has immediate effect - sysfs is available in constrained environments (eg. system rescue) - the limit also applies to device replace Sysfs: - raw value is in bytes - values written to the file accept suffixes like K, M - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max - 0 means use default priority of IO The scheduler is a simple deadline one and the accuracy is up to nearest 128K. Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Johannes Thumshirn	e7ff9e6b8e	btrfs: zoned: factor out zoned device lookup To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Tian Tao	50535db8fb	btrfs: return EAGAIN if defrag is canceled When inode defrag is canceled, the error is set to EAGAIN but then overwritten by number of defragmented bytes. As this would hide the error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it will print the error and continue to next file (as it does in for any other error). Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	1245835d24	btrfs: remove io_failure_record::in_validation The io_failure_record::in_validation was introduced to handle failed bio which cross several sectors. In such case, we still need to verify which sectors are corrupted. But since we've changed the way how we handle corrupted sectors, by only submitting repair for each corrupted sector, there is no need for extra validation any more. This patch will cleanup all io_failure_record::in_validation related code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	150e4b0597	btrfs: submit read time repair only for each corrupted sector Currently btrfs_submit_read_repair() has some extra check on whether the failed bio needs extra validation for repair. But we can avoid all these extra mechanisms if we submit the repair for each sector. By this, each read repair can be easily handled without the need to verify which sector is corrupted. This will also benefit subpage, as one subpage bvec can contain several sectors, making the extra verification more complex. So this patch will: - Introduce repair_one_sector() The main code submitting repair, which is more or less the same as old btrfs_submit_read_repair(). But this time, it only repairs one sector. - Make btrfs_submit_read_repair() to handle sectors differently There are 3 different cases: * Good sector We need to release the page and extent, set the range uptodate. * Bad sector and failed to submit repair bio We need to release the page and extent, but not set the range uptodate. * Bad sector but repair bio submitted The page and extent release will be handled by the submitted repair bio. Nothing needs to be done. Since btrfs_submit_read_repair() will handle the page and extent release now, we need to skip to next bvec even we hit some error. - Change the lifespan of @uptodate in end_bio_extent_readpage() Since now btrfs_submit_read_repair() will handle the full bvec which contains any corruption, we don't need to bother updating @uptodate bit anymore. Just let @uptodate to be local variable inside the main loop, so that any error from one bvec won't affect later bvec. - Only export btrfs_repair_one_sector(), unexport btrfs_submit_read_repair() The only outside caller for read repair is DIO, which already submits its repair for just one sector. Only export btrfs_repair_one_sector() for DIO. This patch will focus on the change on the repair path, the extra validation code is still kept as is, and will be cleaned up later. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Qu Wenruo	08508fea07	btrfs: make btrfs_verify_data_csum() to return a bitmap This will provide the basis for later per-sector repair for subpage, while still keeping the existing code happy. As if all csums match, the return value will be 0, same as now. Only when csum mismatches, the return value is different. The new return value will be a bitmap, for 4K sectorsize and 4K page size, it will be either 1, instead of the -EIO (which is not used directly by the callers, no effective change). But for 4K sectorsize and 64K page size, aka subpage case, since the bvec can contain multiple sectors, knowing which sectors are corrupted will allow us to submit repair only for corrupted sectors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:05 +02:00
Johannes Thumshirn	f4dcfb3045	btrfs: rename check_async_write and let it return bool The 'check_async_write' function is a helper used in 'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be used for metadata. Make the function return bool and get rid of the local variable async in btrfs_submit_metadata_bio storing the result of check_async_write's tests. As this is touching all function call sites, also rename it to should_async_write as this is more in line with the naming we use. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Johannes Thumshirn	06e1e7f422	btrfs: zoned: bail out if we can't read a reliable write pointer If we can't read a reliable write pointer from a sequential zone fail creating the block group with an I/O error. Also if the read write pointer is beyond the end of the respective zone, fail the creation of the block group on this zone with an I/O error. While this could also happen in real world scenarios with misbehaving drives, this issue addresses a problem uncovered by fstests' test case generic/475. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Naohiro Aota	47cdfb5e1d	btrfs: zoned: print message when zone sanity check type fails This extends patch `784daf2b96` ("btrfs: zoned: sanity check zone type"), the message was supposed to be there but was lost during merge. We want to make the error noticeable so add it. Fixes: `784daf2b96` ("btrfs: zoned: sanity check zone type") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	385f421f18	btrfs: handle preemptive delalloc flushing slightly differently If we decide to flush delalloc from the preemptive flusher, we really do not want to wait on ordered extents, as it gains us nothing. However there was logic to go ahead and wait on ordered extents if there was more ordered bytes than delalloc bytes. We do not want this behavior, so pass through whether this flushing is for preemption, and do not wait for ordered extents if that's the case. Also break out of the shrink loop after the first flushing, as we just want to one shot shrink delalloc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	3e10156997	btrfs: only ignore delalloc if delalloc is much smaller than ordered While testing heavy delalloc workloads I noticed that sometimes we'd just stop preemptively flushing when we had loads of delalloc available to flush. This is because we skip preemptive flushing if delalloc <= ordered. However if we start with say 4gib of delalloc, and we flush 2gib of that, we'll stop flushing there, when we still have 2gib of delalloc to flush. Instead adjust the ordered bytes down by half, this way if 2/3 of our outstanding delalloc reservations are tied up by ordered extents we don't bother preemptive flushing, as we're getting close to the state where we need to wait on ordered extents. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	30acce4eb0	btrfs: don't include the global rsv size in the preemptive used amount When deciding if we should preemptively flush space, we will add in the amount of space used by all block rsvs. However this also includes the global block rsv, which isn't flushable so shouldn't be accounted for in this calculation. If we decide to use ->bytes_may_use in our used calculation we need to subtract the global rsv size from this amount so it most closely matches the flushable space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	1239e2da16	btrfs: use the global rsv size in the preemptive thresh calculation We calculate the amount of "free" space available for normal reservations by taking the total space and subtracting out the hard used space, which is readonly, used, and reserved space. However we weren't taking into account the global block rsv, which is essentially hard used space. Handle this by subtracting it from the available free space, so that our threshold more closely mirrors reality. We need to do the check because it's possible that the global_rsv_size + used is > total_bytes, sometimes the global reserve can end up being calculated as larger than the available size (think small filesystems where we only have the original 8MiB chunk of metadata). It doesn't usually happen, but that can get us into trouble so this is safer. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	610a6ef44e	btrfs: take into account global rsv in need_preemptive_reclaim Global rsv can't be used for normal allocations, and for very full file systems we can decide to try and async flush constantly even though there's really not a lot of space to reclaim. Deal with this by including the global block rsv size in the "total used" calculation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	0aae4ca9e9	btrfs: only clamp the first time we have to start flushing We were clamping the threshold for preemptive reclaim any time we added a ticket to wait on, which if we have a lot of threads means we'd essentially max out the clamp the first time we start to flush. Instead of doing this, simply do it every time we have to start flushing, this will make us ramp up gradually instead of going to max clamping as soon as we start needing to do flushing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Josef Bacik	ed738ba7f9	btrfs: check worker before need_preemptive_reclaim need_preemptive_reclaim() does some calculations, which aren't heavy, but if we're already running preemptive reclaim there's no reason to do them at all, so re-order the checks so that we don't do the calculation if we're already doing reclaim. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Su Yue	94358c35d8	btrfs: remove stale comment for argument seed of btrfs_find_device Commit `b2598edf8b` ("btrfs: remove unused argument seed from btrfs_find_device") removed the argument seed from btrfs_find_device but forgot the comment, so remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Goldwyn Rodrigues	dc56219fe2	btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage() try_lock_extent() returns 1 on success or 0 for failure and not an error code. If try_lock_extent() fails, read_extent_buffer_subpage() returns zero indicating subpage extent read success. Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the extent. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-21 15:19:04 +02:00
Steve French	e0ae8a9aae	smb311: remove dead code for non compounded posix query info Although we may need this in some cases in the future, remove the currently unused, non-compounded version of POSIX query info, SMB11_posix_query_info (instead smb311_posix_query_path_info is now called e.g. when revalidating dentries or retrieving info for getattr) Addresses-Coverity: 1495708 ("Resource leaks") Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Steve French	e39df24169	cifs: fix SMB1 error path in cifs_get_file_info_unix We were trying to fill in uninitialized file attributes in the error case. Addresses-Coverity: 139689 ("Uninitialized variables") Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Steve French	ff93b71a3e	smb3: fix uninitialized value for port in witness protocol move Although in practice this can not occur (since IPv4 and IPv6 are the only two cases currently supported), it is cleaner to avoid uninitialized variable warnings. Addresses smatch warning: fs/cifs/cifs_swn.c:468 cifs_swn_store_swn_addr() error: uninitialized symbol 'port'. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Steve French	3559134ecc	cifs: fix unneeded null check tcon can not be null in SMB2_tcon function so the check is not relevant and removing it makes Coverity happy. Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Addresses-Coverity: 13250131 ("Dereference before null check") Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Steve French	929be906fa	cifs: use SPDX-Licence-Identifier Add SPDX license identifier and replace license boilerplate. Corrects various checkpatch errors with the older format for noting the LGPL license. Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Baokun Li	a506ccb47c	cifs: convert list_for_each to entry variant in cifs_debug.c convert list_for_each() to list_for_each_entry() where applicable. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Baokun Li	647f592734	cifs: convert list_for_each to entry variant in smb2misc.c convert list_for_each() to list_for_each_entry() where applicable. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Ronnie Sahlberg	ca38fabc31	cifs: avoid extra calls in posix_info_parse In posix_info_parse() we call posix_info_sid_size twice for each of the owner and the group sid. The first time to check that it is valid, i.e. >= 0 and the second time to just pass it in as a length to memcpy(). As this is a pure function we know that it can not be negative the second time and this is technically a false warning in coverity. However, as it is a pure function we are just wasting cycles by calling it a second time. Record the length from the first time we call it and save some cycles as well as make Coverity happy. Addresses-Coverity-ID: 1491379 ("Argument can not be negative") Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Thiago Rafael Becker	6efa994e35	cifs: retry lookup and readdir when EAGAIN is returned. According to the investigation performed by Jacob Shivers at Red Hat, cifs_lookup and cifs_readdir leak EAGAIN when the user session is deleted on the server. Fix this issue by implementing a retry with limits, as is implemented in cifs_revalidate_dentry_attr. Reproducer based on the work by Jacob Shivers: ~~~ $ cat readdir-cifs-test.sh #!/bin/bash # Install and configure powershell and sshd on the windows # server as descibed in # https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview # This script uses expect(1) USER=dude SERVER=192.168.0.2 RPATH=root PASS='password' function debug_funcs { for line in $@ ; do echo "func $line +p" > /sys/kernel/debug/dynamic_debug/control done } function setup { echo 1 > /proc/fs/cifs/cifsFYI debug_funcs wait_for_compound_request \ smb2_query_dir_first cifs_readdir \ compound_send_recv cifs_reconnect_tcon \ generic_ip_connect cifs_reconnect \ smb2_reconnect_server smb2_reconnect \ cifs_readv_from_socket cifs_readv_receive tcpdump -i eth0 -w cifs.pcap host 192.168.2.182 & sleep 5 dmesg -C } function test_call { if [[ $1 == 1 ]] ; then tracer="strace -tt -f -s 4096 -o trace-$(date -Iseconds).txt" fi # Change the command here to anything appropriate $tracer ls $2 > /dev/null res=$? if [[ $1 == 1 ]] ; then if [[ $res == 0 ]] ; then 1>&2 echo success else 1>&2 echo "failure ($res)" fi fi } mountpoint /mnt > /dev/null \|\| mount -t cifs -o username=$USER,pass=$PASS //$SERVER/$RPATH /mnt test_call 0 /mnt/ /usr/bin/expect << EOF set timeout 60 spawn ssh $USER@$SERVER expect "yes/no" { send "yes\r" expect "?assword" { send "$PASS\r" } } "?assword" { send "$PASS\r" } expect ">" { send "powershell close-smbsession -force\r" } expect ">" { send "exit\r" } expect eof EOF sysctl -w vm.drop_caches=2 > /dev/null sysctl -w vm.drop_caches=2 > /dev/null setup test_call 1 /mnt/ ~~~ Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	889c2a7007	cifs: fix check of dfs interlinks Interlink is a special type of DFS link that resolves to a different DFS domain-based namespace. To determine whether it is an interlink or not, check if ReferralServers and StorageServers bits are set to 1 and 0 respectively in ReferralHeaderFlags, as specified in MS-DFSC 3.1.5.4.5 Determining Whether a Referral Response is an Interlink. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Hyunchul Lee	0475c3655e	cifs: decoding negTokenInit with generic ASN1 decoder Decode negTokenInit with lib/asn1_decoder. For that, add OIDs in linux/oid_registry.h and a negTokenInit ASN1 file, "spnego_negtokeninit.asn1". And define decoder's callback functions, which are the gssapi_this_mech for checking SPENGO oid and the neg_token_init_mech_type for getting authentication mechanisms supported by a server. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	1023e90b73	cifs: avoid starvation when refreshing dfs cache When refreshing the DFS cache, keep SMB2 IOCTL calls as much outside critical sections as possible and avoid read/write starvation when getting new DFS referrals by using broken or slow connections. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Steve French	0d52df81e0	cifs: enable extended stats by default CONFIG_CIFS_STATS2 can be very useful since it shows latencies by command, and allows enabling the slow response dynamic tracepoint which can be useful to identify performance problems. For example: Total time spent processing by command. Time units are jiffies (1000 per second) SMB3 CMD Number Total Time Fastest Slowest -------- ------ ---------- ------- ------- 0 1 2 2 2 1 2 6 2 4 2 0 0 0 0 3 4 11 2 4 4 2 16 5 11 5 4546 34104 2 487 6 4421 32901 2 487 7 0 0 0 0 8 695 2781 2 39 9 391 1708 2 27 10 0 0 0 0 11 4 6 1 2 12 0 0 0 0 13 0 0 0 0 14 3887 17696 0 128 15 0 0 0 0 16 1471 9950 1 487 17 169 2695 9 116 18 80 381 2 10 1 2 6 2 4 2 0 0 0 0 3 4 11 2 4 4 2 16 5 11 5 4546 34104 2 487 6 4421 32901 2 487 7 0 0 0 0 8 695 2781 2 39 9 391 1708 2 27 10 0 0 0 0 11 4 6 1 2 12 0 0 0 0 13 0 0 0 0 14 3887 17696 0 128 15 0 0 0 0 16 1471 9950 1 487 17 169 2695 9 116 18 80 381 2 10 Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Shyam Prasad N	e695a9ad03	cifs: missed ref-counting smb session in find When we lookup an smb session based on session id, we did not up the ref-count for the session. This can potentially cause issues if the session is freed from under us. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	f3c852b0b0	cifs: do not share tcp servers with dfs mounts It isn't enough to have unshared tcons because multiple DFS mounts can connect to same target server and failover to different servers, so we can't use a single tcp server for such cases. For the simplest solution, use nosharesock option to achieve that. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	c950fc7af9	cifs: set a minimum of 2 minutes for refreshing dfs cache We don't want to refresh the dfs cache in very short intervals, so setting a minimum interval of 2 minutes is OK. If it needs to be refreshed immediately, one could have the cache cleared with $ echo 0 > /proc/fs/cifs/dfscache and then remounting the dfs share. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	42caeba713	cifs: fix path comparison and hash calc Fix cache lookup and hash calculations when handling paths with different cases. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:17 -05:00
Paulo Alcantara	c870a8e70e	cifs: handle different charsets in dfs cache Convert all dfs paths to dfs cache's local codepage (@cache_cp) and avoid mixing them with different charsets. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Paulo Alcantara	c9f7110399	cifs: keep referral server sessions alive At every mount, keep all sessions alive that were used for chasing the DFS referrals as long as the dfs mounts are active. Use those sessions in DFS cache to refresh all active tcons as well as cached entries. They will be managed by a list of mount_group structures that will be indexed by a randomly generated uuid at mount time, so we can put all the sessions related to specific dfs mounts and avoid leaking them. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Paulo Alcantara	2b133b7e21	cifs: get rid of @noreq param in __dfs_cache_find() @noreq param isn't used anywhere, so just remove it. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Paulo Alcantara	f3191fc800	cifs: do not send tree disconnect to ipc shares On session close, the IPC is closed and the server must release all tcons of the session. It doesn't matter if we send a ipc close or not. Besides, it will make the server to not close durable and resilient files on session close, as specified in MS-SMB2 3.3.5.6 Receiving an SMB2 LOGOFF Request. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Ronnie Sahlberg	966a3cb7c7	cifs: improve fallocate emulation RHBZ: 1866684 We don't have a real fallocate in the SMB2 protocol so we used to emulate fallocate by simply switching the file to become non-sparse. But as that could potantially consume a lot more data than we intended to fallocate (large sparse file and fallocating a thin slice in the middle) we would only do this IFF the fallocate request was for virtually the entire file. This patch improves this and starts allowing us to fallocate smaller chunks of a file by overwriting the region with 0, for the parts that are unallocated. The method used is to first query the server for FSCTL_QUERY_ALLOCATED_RANGES to find what is unallocated in the fallocate range and then to only overwrite-with-zero the unallocated ranges to fill in the holes. As overwriting-with-zero is different from just allocating blocks, and potentially much more expensive, we limit this to only allow fallocate ranges up to 1Mb in size. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Aurelien Aptel <aaptel@suse.com> Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Baokun Li	aaf36df3ed	cifs: fix doc warnings in cifs_dfs_ref.c Add description for `cifs_compose_mount_options` to fix the W=1 warnings: fs/cifs/cifs_dfs_ref.c:139: warning: Function parameter or member 'devname' not described in 'cifs_compose_mount_options' Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Colin Ian King	032e091d3e	cifs: remove redundant initialization of variable rc The variable rc is being initialized with a value that is never read, the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Rikard Falkeborn	57c8ce7ab3	cifs: Constify static struct genl_ops The only usage of cifs_genl_ops[] is to assign its address to the ops field in the genl_family struct, which is a pointer to const. Make it const to allow the compiler to put it in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
YueHaibing	a23a71abca	cifs: Remove unused inline function is_sysvol_or_netlogon() is_sysvol_or_netlogon() is never used, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Steve French	f2756527d3	cifs: remove duplicated prototype smb2_find_smb_ses was defined twice in smb2proto.h Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Aurelien Aptel	5e538959f0	cifs: fix ipv6 formating in cifs_ses_add_channel Use %pI6 for IPv6 addresses Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-06-20 21:28:16 -05:00
Linus Torvalds	6fab154a33	for-5.13-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDNEFMACgkQxWXV+ddt WDuZQg/7BpGG3IDhxydM7fUrNT0xmW2/0VG8blXAgNTiaUO1zOrlrlDKm38+dtW6 yEv3ruf68tggrPNRCkyh51n45+ExqNwc7WwrxaKIRKmvYhYDsxnt8JLiNkv64isi R/CQVETX1cKsMuRhBuqmUq3Sy6VJZoi6coUHIC7ddBcLqnz8c9p7oGqzxBT8J9u3 1CkDSeLM4HKlISlVKhmT4lRG28cQTuy3mSABUt7N5ljJvrrpQAvEN1HCOE9XUQFQ wHH2DjNnBMvfB7mrGCBL7XGf8DF6ucgcyfofuOj6CQLFJ8bOnVKsk8dk/8XUQod+ rQoNIrVwW91LjmEO/I767JmjrRMtHbXvl3DEw3BvaD/O4efw78SN2VN+DRi4j7Xx aMtAWWfakfIyyJNZu9IEDa736iCdp+yl4bnq+hZpqmOYRqTq8n/zWuCMWZ5ohNay JyjxCm+xgo3vH9VEgzje6GDUki3I4Bwe7VlsaMr9F6F5GKzFp/4fb9lCewBrH6le +Y4gWxRT09plThsC2N3qmBQ9uVIJUyzmvcsYiMJ95tb24srdcPUTCG0C9zBvuMCC nm+1n5d3ENSEBaRNKQsC3MYcjKIh8VDEaKnntJrHAzHP41hrD+makrw3LVs6wLzu amGYz40XNq8zK2Xxv/N8O/U/PwQWKGj4bxq/2c1Wi9p9HACWfgk= =JbJO -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix, for a space accounting bug in zoned mode. It happens when a block group is switched back rw->ro and unusable bytes (due to zoned constraints) are subtracted twice. It has user visible effects so I consider it important enough for late -rc inclusion and backport to stable" * tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix negative space_info->bytes_readonly	2021-06-18 16:39:03 -07:00
Matthew Wilcox (Oracle)	9620ad86d0	afs: Re-enable freezing once a page fault is interrupted If a task is killed during a page fault, it does not currently call sb_end_pagefault(), which means that the filesystem cannot be frozen at any time thereafter. This may be reported by lockdep like this: ==================================== WARNING: fsstress/10757 still has locks held! 5.13.0-rc4-build4+ #91 Not tainted ------------------------------------ 1 lock held by fsstress/10757: #0: ffff888104eac530 ( sb_pagefaults as filesystem freezing is modelled as a lock. Fix this by removing all the direct returns from within the function, and using 'ret' to indicate whether we were interrupted or successful. Fixes: `1cf7a1518a` ("afs: Implement shared-writeable mmap") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-18 13:49:07 -07:00
Peter Zijlstra	2f064a59a1	sched: Change task_struct::state Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org	2021-06-18 11:43:09 +02:00
Ingo Molnar	b2c0931a07	Merge branch 'sched/urgent' into sched/core, to resolve conflicts This commit in sched/urgent moved the cfs_rq_is_decayed() function: `a7b359fc6a`: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: `9e077b52d8`: ("sched/pelt: Check that _avg are null when _sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-06-18 11:31:25 +02:00
Linus Torvalds	39519f6a56	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDLL74ACgkQnJ2qBz9k QNleSAf/XikH+tsM6K9yDEeU93GGSqKUB71n9clSQBIiGZ7/UliG0wotrUjec9Rg vBTZlh3JEdfboeBei+mG3hmOdAoYK4HMsJJikqRGPyWOTujh1eOZlT1LOXaY5zNM 631A9pWe8edlpr4Mq7Wb4nO4FToEZ91iXDLliFF371aV8kP/yuv5ZjHwIn5Pt5gI DPnWwaJ+meW9KZ4gVKAfvZLVkKFat2xJ9r2LDpqbIkH9SBcfjBmeHOy0gFyCKx6l yma5iANgtWLhesP6ZwSeaRb1+T9altSLCCFZrYdKH9PXTMFUqzrbiZ8tfVmllePZ GaUOWcHYiLmvqvXnaAREiHnMFT6prg== =kevs -----END PGP SIGNATURE----- Merge tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and fanotify fixes from Jan Kara: "A fixup finishing disabling of quotactl_path() syscall (I've missed archs using different way to declare syscalls) and a fix of an fd leak in error handling path of fanotify" * tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: finish disable quotactl_path syscall fanotify: fix copy_event_to_user() fid error clean up	2021-06-17 09:49:48 -07:00
Naohiro Aota	f9f28e5bd0	btrfs: zoned: fix negative space_info->bytes_readonly Consider we have a using block group on zoned btrfs. \|<- ZU ->\|<- used ->\|<---free--->\| `- Alloc offset ZU: Zone unusable Marking the block group read-only will migrate the zone unusable bytes to the read-only bytes. So, we will have this. \|<- RO ->\|<- used ->\|<--- RO --->\| RO: Read only When marking it back to read-write, btrfs_dec_block_group_ro() subtracts the above "RO" bytes from the space_info->bytes_readonly. And, it moves the zone unusable bytes back and again subtracts those bytes from the space_info->bytes_readonly, leading to negative bytes_readonly. This can be observed in the output as eg.: Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688 This commit fixes the issue by reordering the operations. Link: https://github.com/naota/linux/issues/37 Reported-by: David Sterba <dsterba@suse.com> Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-17 11:12:14 +02:00
Kees Cook	1d1f6cc581	pstore/blk: Include zone in pstore_device_info Information was redundant between struct pstore_zone_info and struct pstore_device_info. Use struct pstore_zone_info, with member name "zone". Additionally untangle the logic for the "best effort" block device instance. Signed-off-by: Kees Cook <keescook@chromium.org> Fixed-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/lkml/20210617005424.182305-1-pulehui@huawei.com	2021-06-16 21:09:31 -07:00
Kees Cook	c811659bb9	pstore/blk: Fix kerndoc and redundancy on blkdev param Remove redundant details of blkdev and fix up resulting kerndoc. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org>	2021-06-16 09:27:32 -07:00
Kees Cook	7bb9557b48	pstore/blk: Use the normal block device I/O path Stop poking into block layer internals and just open the block device file an use kernel_read and kernel_write on it. Note that this means the transformation from name_to_dev_t can't be used anymore when pstore_blk is loaded as a module: a full filesystem device path name must be used instead. Additionally removes ":internal:" kerndoc link, since no such documentation remains. Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org>	2021-06-16 09:26:56 -07:00
Mike Kravetz	846be08578	mm/hugetlb: expand restore_reserve_on_error functionality The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/ Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com Fixes: `96b96a96dd` ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-16 09:24:42 -07:00
Kees Cook	2a03ddbde1	pstore/blk: Move verify_size() macro out of function There's no good reason for the verify_size macro to live inside the function. Move it up with the check_size() macro and fix indenting. Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2021-06-16 08:19:40 -07:00
Kees Cook	6eed261f48	pstore/blk: Improve failure reporting There was no feedback on bad registration attempts. Add details on the failure cause. Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2021-06-16 08:19:37 -07:00
Linus Torvalds	94f0b2d4a1	proc: only require mm_struct for writing Commit `591a22c14d` ("proc: Track /proc/$pid/attr/ opener mm_struct") we started using __mem_open() to track the mm_struct at open-time, so that we could then check it for writes. But that also ended up making the permission checks at open time much stricter - and not just for writes, but for reads too. And that in turn caused a regression for at least Fedora 29, where NIC interfaces fail to start when using NetworkManager. Since only the write side wanted the mm_struct test, ignore any failures by __mem_open() at open time, leaving reads unaffected. The write() time verification of the mm_struct pointer will then catch the failure case because a NULL pointer will not match a valid 'current->mm'. Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/ Fixes: `591a22c14d` ("proc: Track /proc/$pid/attr/ opener mm_struct") Reported-and-tested-by: Leon Romanovsky <leon@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-15 10:47:51 -07:00
Dan Carpenter	a33d62662d	afs: Fix an IS_ERR() vs NULL check The proc_symlink() function returns NULL on error, it doesn't return error pointers. Fixes: `5b86d4ff5d` ("afs: Implement network namespacing") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/YLjMRKX40pTrJvgf@mwanda/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-15 07:42:26 -07:00
Matthew Bobrowski	f644bc449b	fanotify: fix copy_event_to_user() fid error clean up Ensure that clean up is performed on the allocated file descriptor and struct file object in the event that an error is encountered while copying fid info objects. Currently, we return directly to the caller when an error is experienced in the fid info copying helper, which isn't ideal given that the listener process could be left with a dangling file descriptor in their fdtable. Fixes: `5e469c830f` ("fanotify: copy event fid info to user") Fixes: `44d705b037` ("fanotify: report name info for FAN_DIR_MODIFY event") Link: https://lore.kernel.org/linux-fsdevel/YMKv1U7tNPK955ho@google.com/T/#m15361cd6399dad4396aad650de25dbf6b312288e Link: https://lore.kernel.org/r/1ef8ae9100101eb1a91763c516c2e9a3a3b112bd.1623376346.git.repnop@google.com Signed-off-by: Matthew Bobrowski <repnop@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-06-14 12:16:37 +02:00
Linus Torvalds	960f0716d8	NFS client bugfixes for Linux 5.13 Highlights include Stable fixes: - Fix use-after-free in nfs4_init_client() Bugfixes: - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() - Fix second deadlock in nfs4_evict_inode() - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP - Fix setting of the NFS_CAP_SECURITY_LABEL capability -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmDGJPEACgkQZwvnipYK APLH8xAAsdoKVCW35P+FtlzQvq0iWoTvk15i4Jv8+SyFtqAZe6y6pEj9+RT47CAV kt/uNa6CQ9KjxxgwBf2XoGTuf4MrOUU34kQBF/tRLy9zDdXUsZH263vapopmel6L BVHEEsID6hz8+BUt1LFsr+8sWxG+12UiimEu0CVo4BE8SgYushWpJOQ9iL/zxi1O gXmlAfA9g38I9aUApke4hOPSHVTGaQaAKl5LbSoycQlJblzgA1yIXdU9sVTHDJY6 sco9O9M+NPY8gefS4d7iXSihZin5V9rNuSJ9SKiCPikTEjZYgZbw1umGj6VnF/5e QD47QGgOwXKeCOBv6Oe4VYxE2JISoUFZw8+pxjy4eDO+EcJv3IrHOM8UrsiddGAA DLHzbbrMUx6mGdgibw/ktkwx0Q/DvGrfrvKidk33cs16DPWgTZAG//n7spuqYTmT 8fQbJF6DDjsYM7v+WdImf7VBA8dreXb/QcHwxCtH7uG+hGyRiYoDSOmH3mGBKpLX idkjz6Hvj7V7Y1z4qd+nvh4Ch1V0b9BX+J/+6dKHRykpmSJTIMIlQw7/wA6a8Lp6 WJX4KbUzZHojvqM1BMzRL34+qidihUso0RIj0VjCB1JQyosRnIeTPorfHLQZTOM0 IjP8h48BB7E7cJeJP1dmhvm7Hb8SpFVDxDHoWRtscbQflO3Wdkw= =PABi -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - Fix use-after-free in nfs4_init_client() Bugfixes: - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() - Fix second deadlock in nfs4_evict_inode() - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP - Fix setting of the NFS_CAP_SECURITY_LABEL capability" * tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix second deadlock in nfs4_evict_inode() NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() NFS: FMODE_READ and friends are C macros, not enum types NFS: Fix a potential NULL dereference in nfs_get_client() NFS: Fix use-after-free in nfs4_init_client() NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.	2021-06-13 12:32:59 -07:00
Linus Torvalds	87a7f7368b	Driver core fix for 5.13-rc6 Here is a single debugfs fix for 5.13-rc6. It fixes a bug in debugfs_read_file_str() that showed up in 5.13-rc1. It has been in linux-next for a full week with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYMTWug8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+yl5MQCeMMEMCGsoQdeXI1t2WMAMTmWRTZYAn1GqGliM b3RkczkNgKnEfDB2+M1r =wWW8 -----END PGP SIGNATURE----- Merge tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "A single debugfs fix for 5.13-rc6, fixing a bug in debugfs_read_file_str() that showed up in 5.13-rc1. It has been in linux-next for a full week with no reported problems" * tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Fix debugfs_read_file_str()	2021-06-12 12:18:49 -07:00
Linus Torvalds	b2568eeb96	io_uring-5.13-2021-06-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDEwEEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpu2uEACIZXc0e4Jz2tJmtlLzhm0T+YUXu88/n0Ki 3HsCfjyk0k2tvGjAmzLgBruR+0dxuoTlC8ZyLWkCgYFvRxCQMrjxB4+Q53WAAPud ictv/5C992eWfmkk5lKWYh/SVUZU0nN/HlcITggFzH+/Ek4RgqBJK6rYPpN4YM6W OifSZ22xwjZy9i8svzCPzGUbS5d5qbNeRSaacfADWFmzTqqzllWz/KkN633UFefR tkqWy610P0O8fz3xe5HcECIOc3aNRZuk5zrNqCJPvxcOdYlqlL/HfsWMACEiC/g1 N3ahNGrUzJqhB1QNAIKATKAlh8hzAws9t/alLJQzSHZWRu7vso0qctoVJT3i6xRp qD17EAQgrC0R0fQxdHmoMzRHEnKPCXQx36wb/mhZbG60/Q+scmSrFXp86XvbKZiI uzHTsUL/80bRXHuVrKXT+JWTRCzpv1yk9ufIVzSOheVCl/H6bxZ29cabBL2/XvvI d+OljDsy7oMH6rOBFi3XYmwZShEoUqeATeFoFf5isjkWfe7qdiMVu4apD8fBhIjX 8rNLjp0nIKN+5IjHwFkAXRwp8P1SJQ8c7Tl4I6xY82FsMQxUUgMhjSqrn58i2g9d Lem9YHKaXIbw1yfWcaf8erA6d0S4rujG+j3miG0y248kOTb9FeMbfbRgjj8v99m1 XB7F9SIQUw== =MbrN -----END PGP SIGNATURE----- Merge tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Just an API change for the registration changes that went into this release. Better to get it sorted out now than before it's too late" * tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block: io_uring: add feature flag for rsrc tags io_uring: change registration/upd/rsrc tagging ABI	2021-06-12 11:53:20 -07:00
Pavel Begunkov	9690557e22	io_uring: add feature flag for rsrc tags Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-10 16:33:51 -06:00
Pavel Begunkov	992da01aa9	io_uring: change registration/upd/rsrc tagging ABI There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-10 16:33:51 -06:00
Eric W. Biederman	06af867944	coredump: Limit what can interrupt coredumps Olivier Langlois has been struggling with coredumps being incompletely written in processes using io_uring. Olivier Langlois <olivier@trillion01.com> writes: > io_uring is a big user of task_work and any event that io_uring made a > task waiting for that occurs during the core dump generation will > generate a TIF_NOTIFY_SIGNAL. > > Here are the detailed steps of the problem: > 1. io_uring calls vfs_poll() to install a task to a file wait queue > with io_async_wake() as the wakeup function cb from io_arm_poll_handler() > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling > set_notify_signal() The coredump code deliberately supports being interrupted by SIGKILL, and depends upon prepare_signal to filter out all other signals. Now that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack in dump_emitted by the coredump code no longer works. Make the coredump code more robust by explicitly testing for all of the wakeup conditions the coredump code supports. This prevents new wakeup conditions from breaking the coredump code, as well as fixing the current issue. The filesystem code that the coredump code uses already limits itself to only aborting on fatal_signal_pending. So it should not develop surprising wake-up reasons either. v2: Don't remove the now unnecessary code in prepare_signal. Cc: stable@vger.kernel.org Fixes: `12db8b6900` ("entry: Add support for TIF_NOTIFY_SIGNAL") Reported-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-10 14:02:29 -07:00
Linus Torvalds	cc6cf827dd	for-5.13-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDAtXUACgkQxWXV+ddt WDtbdA//ccQ8JL5yC/x/j0ZXLJ2INqXpxIUPjadwwEjtTgOllvx+f1nU0QazeYfM XvvzDDvpemWajC2Ii54s2HCQbG+dAzO1YBl1XCyve91T0GeNGhzytZwM0pVxZePQ A+aOyVH7IcfFcmBy9T0yctqiGgtD3lre208kU9kolidsIyomLHxBckBhMYDXvJCK BOdrjq3f6H5J0zqOqAnWdc/Wc5z5pw3CHxlIuoA3Tp0Gv9TIx366Z/IvmFfCyvCt kYv2qnUaw10OlFLiqhetlZyv49ibW4waj0RbyY/rZx+69sE/PM4961NYAjLoFJc2 6OoZZO4OHWrNZpBJfbyyX9KVLspix075FID7qVhE/AVW4CYZGOFu5wJyXQiYlysH 1qqkihK3gbKEsB2429UeLZktupmx79LBIgg346+DSQYiMXMTGR8iZY1onbBM2wlf bep65hsiHhxoC6Z/KhxrTGZM2jyYW2nICw3o0xikhWv7MZPWKfKHrH9NJQ9Lpuhy gxut0ef9HbPXWP9PgRmY0Z8PsUi8RT1bv0bHVw7EnhLbi62neJLyxY3Q++W+7vBG LYeaxKWLTTJu73wpBQHLI0pD0UifXLrTkiCI+4gN8zVfzxUl+90mGz2AdSRRFI+U kNdX/haEHi00WBqYxWt33ae/FuSHjPuYXjiPQA7Kiy/C3n9GAB0= =mGAq -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that people hit during testing. Zoned mode fix: - fix 32bit value wrapping when calculating superblock offsets Error handling fixes: - properly check filesystema and device uuids - properly return errors when marking extents as written - do not write supers if we have an fs error" * tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: promote debugging asserts to full-fledged checks in validate_super btrfs: return value from btrfs_mark_extent_written() in case of error btrfs: zoned: fix zone number to sector/physical calculation btrfs: do not write supers if we have an fs error	2021-06-09 13:34:48 -07:00
Kees Cook	591a22c14d	proc: Track /proc/$pid/attr/ opener mm_struct Commit `bfb819ea20` ("proc: Check /proc/$pid/attr/ writes against file opener") tried to make sure that there could not be a confusion between the opener of a /proc/$pid/attr/ file and the writer. It used struct cred to make sure the privileges didn't change. However, there were existing cases where a more privileged thread was passing the opened fd to a differently privileged thread (during container setup). Instead, use mm_struct to track whether the opener and writer are still the same process. (This is what several other proc files already do, though for different reasons.) Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Reported-by: Andrea Righi <andrea.righi@canonical.com> Tested-by: Andrea Righi <andrea.righi@canonical.com> Fixes: `bfb819ea20` ("proc: Check /proc/$pid/attr/ writes against file opener") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-08 10:24:09 -07:00
Marc Dionne	dc2557308e	afs: Fix partial writeback of large files on fsync and close In commit `e87b03f583` ("afs: Prepare for use of THPs"), the return value for afs_write_back_from_locked_page was changed from a number of pages to a length in bytes. The loop in afs_writepages_region uses the return value to compute the index that will be used to find dirty pages in the next iteration, but treats it as a number of pages and wrongly multiplies it by PAGE_SIZE. This gives a very large index value, potentially skipping any dirty data that was not covered in the first pass, which is limited to 256M. This causes fsync(), and indirectly close(), to only do a partial writeback of a large file's dirty data. The rest is eventually written back by background threads after dirty_expire_centisecs. Fixes: `e87b03f583` ("afs: Prepare for use of THPs") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-07 12:56:05 -07:00
Gao Xiang	c5fcb51111	erofs: clean up file headers & footers - Remove my outdated misleading email address; - Get rid of all unnecessary trailing newline by accident. Link: https://lore.kernel.org/r/20210602160634.10757-1-xiang@kernel.org Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2021-06-08 00:41:24 +08:00
Yue Hu	7dea3de7d3	erofs: remove the occupied parameter from z_erofs_pagevec_enqueue() No any behavior to variable occupied in z_erofs_attach_page() which is only caller to z_erofs_pagevec_enqueue(). Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Gao Xiang <xiang@kernel.org>	2021-06-08 00:40:18 +08:00
Wei Yongjun	0508c1ad0f	erofs: fix error return code in erofs_read_superblock() 'ret' will be overwritten to 0 if erofs_sb_has_sb_chksum() return true, thus 0 will return in some error handling cases. Fix to return negative error code -EINVAL instead of 0. Link: https://lore.kernel.org/r/20210519141657.3062715-1-weiyongjun1@huawei.com Fixes: `b858a4844c` ("erofs: support superblock checksum") Cc: stable <stable@vger.kernel.org> # 5.5+ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <xiang@kernel.org>	2021-06-08 00:40:18 +08:00
Linus Torvalds	20e41d9bc8	Miscellaneous ext4 bug fixes for v5.13 -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmC82AQACgkQ8vlZVpUN gaOkAgf+KH57P/P0sB6aVBHpAzqa9jTKJWMA5kpCqYUDkYlfF7n2hwsjMzWpJ5MY ZvFpKAflmRnve/ULUZQX6+zrcbieNs3e+6VFZrZ0PmxN0dupyISLY7jnvCRDleA7 BFO34AcH+QEst9zXJmgta9eoy3LA8sawhQ/d7ujVY+IRFk40m26fuAMiaGznlQJ5 dmrx7pHZWKFIDFIg2TdFlP+Voqbxs2VTT16gmWpGBdTyWYHKjbSOLKJFc9DwYeE9 aANf6iIzwXz7y9pZiOnTrGuKDEJcIZNESkbIqw62YgqsoObLbsbCZNmNcqxyHpYQ Mh3L59KtmjANW3iOxQfyxkNTugxchw== =BSnf -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Only advertise encrypted_casefold when encryption and unicode are enabled ext4: fix no-key deletion for encrypt+casefold ext4: fix memory leak in ext4_fill_super ext4: fix fast commit alignment issues ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed ext4: fix accessing uninit percpu counter variable with fast_commit ext4: fix memory leak in ext4_mb_init_backend on error path.	2021-06-06 14:24:13 -07:00
Daniel Rosenberg	e71f99f2df	ext4: Only advertise encrypted_casefold when encryption and unicode are enabled Encrypted casefolding is only supported when both encryption and casefolding are both enabled in the config. Fixes: `471fbbea7f` ("ext4: handle casefolding with encryption") Cc: stable@vger.kernel.org # 5.13+ Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-06-06 10:10:23 -04:00
Daniel Rosenberg	63e7f12893	ext4: fix no-key deletion for encrypt+casefold commit `471fbbea7f` ("ext4: handle casefolding with encryption") is missing a few checks for the encryption key which are needed to support deleting enrypted casefolded files when the key is not present. This bug made it impossible to delete encrypted+casefolded directories without the encryption key, due to errors like: W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key Repro steps in kvm-xfstests test appliance: mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc mount /vdc mkdir /vdc/dir chattr +F /vdc/dir keyid=$(head -c 64 /dev/zero \| xfs_io -c add_enckey /vdc \| awk '{print $NF}') xfs_io -c "set_encpolicy $keyid" /vdc/dir for i in `seq 1 100`; do mkdir /vdc/dir/$i done xfs_io -c "rm_enckey $keyid" /vdc rm -rf /vdc/dir # fails with the bug Fixes: `471fbbea7f` ("ext4: handle casefolding with encryption") Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-06-06 10:10:23 -04:00
Alexey Makhalov	afd09b617d	ext4: fix memory leak in ext4_fill_super Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffer head, and being busy, they were not freed and instead leaked. This can easily be reproduced by calling an infinite loop of: systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount ... since systemd creates a cgroup for each slice which it mounts, and the bh leak get amplified by a dying memory cgroup that also never gets freed, and memory consumption is much more easily noticed. Fixes: `ce40733ce9` ("ext4: Check for return value from sb_set_blocksize") Fixes: `ac27a0ec11` ("ext4: initial copy of files from ext3") Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com Signed-off-by: Alexey Makhalov <amakhalov@vmware.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2021-06-06 10:10:23 -04:00
Harshad Shirwadkar	a7ba36bc94	ext4: fix fast commit alignment issues Fast commit recovery data on disk may not be aligned. So, when the recovery code reads it, this patch makes sure that fast commit info found on-disk is first memcpy-ed into an aligned variable before accessing it. As a consequence of it, we also remove some macros that could resulted in unaligned accesses. Cc: stable@kernel.org Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-06-06 10:10:23 -04:00
Ye Bin	082cd4ec24	ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed We got follow bug_on when run fsstress with injecting IO fault: [130747.323114] kernel BUG at fs/ext4/extents_status.c:762! [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP ...... [130747.334329] Call trace: [130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4] [130747.334975] ext4_cache_extents+0x64/0xe8 [ext4] [130747.335368] ext4_find_extent+0x300/0x330 [ext4] [130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4] [130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4] [130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4] [130747.336995] ext4_readpage+0x54/0x100 [ext4] [130747.337359] generic_file_buffered_read+0x410/0xae8 [130747.337767] generic_file_read_iter+0x114/0x190 [130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4] [130747.338556] __vfs_read+0x11c/0x188 [130747.338851] vfs_read+0x94/0x150 [130747.339110] ksys_read+0x74/0xf0 This patch's modification is according to Jan Kara's suggestion in: https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/ "I see. Now I understand your patch. Honestly, seeing how fragile is trying to fix extent tree after split has failed in the middle, I would probably go even further and make sure we fix the tree properly in case of ENOSPC and EDQUOT (those are easily user triggerable). Anything else indicates a HW problem or fs corruption so I'd rather leave the extent tree as is and don't try to fix it (which also means we will not create overlapping extents)." Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-06-06 10:09:55 -04:00
Junxiao Bi	6bba4471f0	ocfs2: fix data corruption by fallocate When fallocate punches holes out of inode size, if original isize is in the middle of last cluster, then the part from isize to the end of the cluster will be zeroed with buffer write, at that time isize is not yet updated to match the new size, if writeback is kicked in, it will invoke ocfs2_writepage()->block_write_full_page() where the pages out of inode size will be dropped. That will cause file corruption. Fix this by zero out eof blocks when extending the inode size. Running the following command with qemu-image 4.2.1 can get a corrupted coverted image file easily. qemu-img convert -p -t none -T none -f qcow2 $qcow_image \ -O qcow2 -o compat=1.1 $qcow_image.conv The usage of fallocate in qemu is like this, it first punches holes out of inode size, then extend the inode size. fallocate(11, FALLOC_FL_KEEP_SIZE\|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0 fallocate(11, 0, 2276196352, 65536) = 0 v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/ Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-05 08:58:12 -07:00
Eric Biggers	2fc2b430f5	fscrypt: fix derivation of SipHash keys on big endian CPUs Typically, the cryptographic APIs that fscrypt uses take keys as byte arrays, which avoids endianness issues. However, siphash_key_t is an exception. It is defined as 'u64 key[2];', i.e. the 128-bit key is expected to be given directly as two 64-bit words in CPU endianness. fscrypt_derive_dirhash_key() and fscrypt_setup_iv_ino_lblk_32_key() forgot to take this into account. Therefore, the SipHash keys used to index encrypted+casefolded directories differ on big endian vs. little endian platforms, as do the SipHash keys used to hash inode numbers for IV_INO_LBLK_32-encrypted directories. This makes such directories non-portable between these platforms. Fix this by always using the little endian order. This is a breaking change for big endian platforms, but this should be fine in practice since these features (encrypt+casefold support, and the IV_INO_LBLK_32 flag) aren't known to actually be used on any big endian platforms yet. Fixes: `aa408f835d` ("fscrypt: derive dirhash key for casefolded directories") Fixes: `e3b1078bed` ("fscrypt: add support for IV_INO_LBLK_32 policies") Cc: <stable@vger.kernel.org> # v5.6+ Link: https://lore.kernel.org/r/20210605075033.54424-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2021-06-05 00:52:52 -07:00
Eric Biggers	77f30bfcfc	fscrypt: don't ignore minor_hash when hash is 0 When initializing a no-key name, fscrypt_fname_disk_to_usr() sets the minor_hash to 0 if the (major) hash is 0. This doesn't make sense because 0 is a valid hash code, so we shouldn't ignore the filesystem-provided minor_hash in that case. Fix this by removing the special case for 'hash == 0'. This is an old bug that appears to have originated when the encryption code in ext4 and f2fs was moved into fs/crypto/. The original ext4 and f2fs code passed the hash by pointer instead of by value. So 'if (hash)' actually made sense then, as it was checking whether a pointer was NULL. But now the hashes are passed by value, and filesystems just pass 0 for any hashes they don't have. There is no need to handle this any differently from the hashes actually being 0. It is difficult to reproduce this bug, as it only made a difference in the case where a filename's 32-bit major hash happened to be 0. However, it probably had the largest chance of causing problems on ubifs, since ubifs uses minor_hash to do lookups of no-key names, in addition to using it as a readdir cookie. ext4 only uses minor_hash as a readdir cookie, and f2fs doesn't use minor_hash at all. Fixes: `0b81d07790` ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") Cc: <stable@vger.kernel.org> # v4.6+ Link: https://lore.kernel.org/r/20210527235236.2376556-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2021-06-05 00:22:53 -07:00
Dietmar Eggemann	f501b6a231	debugfs: Fix debugfs_read_file_str() Read the entire size of the buffer, including the trailing new line character. Discovered while reading the sched domain names of CPU0: before: cat /sys/kernel/debug/sched/domains/cpu0/domain/name SMTMCDIE after: cat /sys/kernel/debug/sched/domains/cpu0/domain/name SMT MC DIE Fixes: `9af0440ec8` ("debugfs: Implement debugfs_create_str()") Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210527091105.258457-1-dietmar.eggemann@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-04 15:01:08 +02:00
Nikolay Borisov	aefd7f7065	btrfs: promote debugging asserts to full-fledged checks in validate_super Syzbot managed to trigger this assert while performing its fuzzing. Turns out it's better to have those asserts turned into full-fledged checks so that in case buggy btrfs images are mounted the users gets an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT disabled such image would have been erroneously allowed to be mounted. Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add uuids to the messages ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-04 13:12:06 +02:00
Ritesh Harjani	e7b2ec3d3d	btrfs: return value from btrfs_mark_extent_written() in case of error We always return 0 even in case of an error in btrfs_mark_extent_written(). Fix it to return proper error value in case of a failure. All callers handle it. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-04 13:11:58 +02:00
Naohiro Aota	5b434df877	btrfs: zoned: fix zone number to sector/physical calculation In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t sector" by shifting it. But, this "sector" is calculated in 32bit, leading it to be 0 for the 2nd superblock copy. Since zone number is u32, shifting it to sector (sector_t) or physical address (u64) can easily trigger a missing cast bug like this. This commit introduces helpers to convert zone number to sector/LBA, so we won't fall into the same pitfall again. Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com> Fixes: `12659251ca` ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.11+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-04 13:11:50 +02:00
Josef Bacik	165ea85f14	btrfs: do not write supers if we have an fs error Error injection testing uncovered a pretty severe problem where we could end up committing a super that pointed to the wrong tree roots, resulting in transid mismatch errors. The way we commit the transaction is we update the super copy with the current generations and bytenrs of the important roots, and then copy that into our super_for_commit. Then we allow transactions to continue again, we write out the dirty pages for the transaction, and then we write the super. If the write out fails we'll bail and skip writing the supers. However since we've allowed a new transaction to start, we can have a log attempting to sync at this point, which would be blocked on fs_info->tree_log_mutex. Once the commit fails we're allowed to do the log tree commit, which uses super_for_commit, which now points at fs tree's that were not written out. Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the tree_log_mutex. This way if the transaction commit fails we're sure to see this bit set and we can skip writing the super out. This patch fixes this specific transid mismatch error I was seeing with this particular error path. CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-04 13:11:38 +02:00
Linus Torvalds	ec95502396	io_uring-5.13-2021-06-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmC5BrwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq3tD/9FGANoxDDpLbQg/FCiK1pNoSf0EyoEWSdg ysTF5KPAPC3msQOmuPYwRZfRFCkvtOHmrexPAZAaorCxEYPjiVAZ9b/a0hBC4Zc1 vVW8RcTp6hSonAp1kk6VgLEHulJMcLANjAx3Me3NDRB/g0KGW5gevqkUXIJ+nXiR nqZcxaK7MD90v74IomO7y4P1GgwCbRhKYUL0JGQ4tXndYLxBYJnXBSnIKS2WdLZD PCBf+TDDFAZeioueZ/GrXRhWBmy97j8sEKUJLRqjI5YG8VVZSofgPlwNBi1e42C8 l3ZEmXldyk18O8KDsZCI2E8axt62gLjuD7Tu6+gv0GBJTcdyXP/FaZYbkBMWjMBH yq4Dk4QyJWfMFHJ886ukbGpwj1HJT1cJqzg4UUdkV3BlMNKtmZD8XrKTBw4HcPww EmB+yywRiuH+XqamxPglFXUEOa4bJH/EAsQ0R5NNxAT/X/9iIOLUDBDAGvtWtBr0 7cz+7jTQchqmV11gN+JcgN2LvG14m6Xq4Xtv5oHhIy/FHbRCNPPrC7KJ22TOBSaD d9mS5VM12+O9r9plYW7Cqdhdhnho/7/VfB+puiHg/lVcsXrMlrr0sc/WyrUixZeL AUlhDtmoROcyFpdcA49LCBEFvacu13ivEstkxIonx997Ct4MW7joYds2YfHCfuoO YlPVGdqeag== =3m3m -----END PGP SIGNATURE----- Merge tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "Just a single one-liner fix for an accounting regression in this release" * tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block: io_uring: fix misaccounting fix buf pinned pages	2021-06-03 11:41:00 -07:00
Linus Torvalds	fd2ff2774e	for-5.13-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmC435cACgkQxWXV+ddt WDuh5w/+IGfsUFfKikJZpZUP7q/2gC0t0dzZemxeZMutJbT/KCZCDd4CjLf6YH6r oV9uYIgOWGd3aem9fe0R60ErJ4htgszIgeydCw3s2EuTms6WvAVA6Wp+wK/3UNx3 vQgYsqYkhMzIYKm/D4q8G+bqA2nPbBTDRNsXDIDrZYONxwSb+dNbQCGVknBRzRPa hiCqYhUSyXA7E6UZdlma7MvpDOquZN+iW3RRVx1AULLqVs01PCnG/CEN+0oQm2JE r9IyRxOZUvSeW6opT80yzZFCoboNSduMjPENTfzLY6Q1xzS/EtP4kM86fB/7AoJv UI0c3Sr84SC9vOsBsbGJaBHpxP3OpzxohKU///jVQgEDpGv4STPlkVfxk23BHcux Fdfg7wodkXeLU1Ff4dlJhvCqNYqc5V8lT5Kl52ai9Scct6D4yZBAq4KJp2LmYFC0 cHv6xFxBUv5zFZP1j6NMOmiLlCdDEkOruku2mMweQOBWYW/lHYNU469V5RCvfbLl HlbDrtZdnQ3m2IhpQrXiTnT47Ib4DPYWkhRVfWbyVJHA+CbcOV62RQfl+r95Bc7j FB1gM5vwUTJV7wgzErrq7+BD8quxG6/NuLDFjHYRcIj1kSIMK4/I1fOWruzuK+CL 6n7LLvBOojYfFo+ruQMSp2imDn3JJucBuh0/ssOlUWl2zsy6lDA= =8066 -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Error handling improvements, caught by error injection: - handle errors during checksum deletion - set error on mapping when ordered extent io cannot be finished - inode link count fixup in tree-log - missing return value checks for inode updates in tree-log - abort transaction in rename exchange if adding second reference fails Fixes: - fix fsync failure after writes to prealloc extents - fix deadlock when cloning inline extents and low on available space - fix compressed writes that cross stripe boundary" * tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: add btrfs IRC link btrfs: fix deadlock when cloning inline extents and low on available space btrfs: fix fsync failure and transaction abort after writes to prealloc extents btrfs: abort in rename_exchange if we fail to insert the second ref btrfs: check error value from btrfs_update_inode in tree log btrfs: fixup error handling in fixup_inode_link_counts btrfs: mark ordered extent and inode with error if we fail to finish btrfs: return errors from btrfs_del_csums in cleanup_ref_head btrfs: fix error handling in btrfs_del_csums btrfs: fix compressed writes that cross stripe boundary	2021-06-03 11:37:14 -07:00
Ingo Molnar	a9e906b71f	Merge branch 'sched/urgent' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-06-03 19:00:49 +02:00
Trond Myklebust	c3aba897c6	NFSv4: Fix second deadlock in nfs4_evict_inode() If the inode is being evicted but has to return a layout first, then that too can cause a deadlock in the corner case where the server reboots. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Trond Myklebust	dfe1fe75e0	NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() If the inode is being evicted, but has to return a delegation first, then it can cause a deadlock in the corner case where the server reboots before the delegreturn completes, but while the call to iget5_locked() in nfs4_opendata_get_inode() is waiting for the inode free to complete. Since the open call still holds a session slot, the reboot recovery cannot proceed. In order to break the logjam, we can turn the delegation return into a privileged operation for the case where we're evicting the inode. We know that in that case, there can be no other state recovery operation that conflicts. Reported-by: zhangxiaoxu (A) <zhangxiaoxu5@huawei.com> Fixes: `5fcdfacc01` ("NFSv4: Return delegations synchronously in evict_inode") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Chuck Lever	d1b5c230e9	NFS: FMODE_READ and friends are C macros, not enum types Address a sparse warning: CHECK fs/nfs/nfstrace.c fs/nfs/nfstrace.c: note: in included file (through /home/cel/src/linux/rpc-over-tls/include/trace/trace_events.h, /home/cel/src/linux/rpc-over-tls/include/trace/define_trace.h, ...): fs/nfs/./nfstrace.h:424:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:424:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:424:1: got restricted fmode_t [usertype] fs/nfs/./nfstrace.h:425:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:425:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:425:1: got restricted fmode_t [usertype] fs/nfs/./nfstrace.h:426:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:426:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:426:1: got restricted fmode_t [usertype] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Dan Carpenter	09226e8303	NFS: Fix a potential NULL dereference in nfs_get_client() None of the callers are expecting NULL returns from nfs_get_client() so this code will lead to an Oops. It's better to return an error pointer. I expect that this is dead code so hopefully no one is affected. Fixes: `31434f496a` ("nfs: check hostname in nfs_get_client") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Anna Schumaker	476bdb04c5	NFS: Fix use-after-free in nfs4_init_client() KASAN reports a use-after-free when attempting to mount two different exports through two different NICs that belong to the same server. Olga was able to hit this with kernels starting somewhere between 5.7 and 5.10, but I traced the patch that introduced the clear_bit() call to 4.13. So something must have changed in the refcounting of the clp pointer to make this call to nfs_put_client() the very last one. Fixes: `8dcbec6d20` ("NFSv41: Handle EXCHID4_FLAG_CONFIRMED_R during NFSv4.1 migration") Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Scott Mayhew	0b4f132b15	NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate Commit `ce62b114bb` ("NFS: Split attribute support out from the server capabilities") removed the logic from _nfs4_server_capabilities() that sets the NFS_CAP_SECURITY_LABEL capability based on the presence of FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response. Now NFS_CAP_SECURITY_LABEL is never set, which breaks labelled NFS. This was replaced with logic that clears the NFS_ATTR_FATTR_V4_SECURITY_LABEL bit in the newly added fattr_valid field based on the absence of FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response. This essentially has no effect since there's nothing looks for that bit in fattr_supported. So revert that part of the commit, but adding the logic that sets NFS_CAP_SECURITY_LABEL near where the other capabilities are set in _nfs4_server_capabilities(). Fixes: `ce62b114bb` ("NFS: Split attribute support out from the server capabilities") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-03 10:14:42 -04:00
Ritesh Harjani	b45f189a19	ext4: fix accessing uninit percpu counter variable with fast_commit When running generic/527 with fast_commit configuration, the following issue is seen on Power. With fast_commit, during ext4_fc_replay() (which can be called from ext4_fill_super()), if inode eviction happens then it can access an uninitialized percpu counter variable. This patch adds the check before accessing the counters in ext4_free_inode() path. [ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43 [ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none. [ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000 [ 323.619767] Faulting instruction address: 0xc000000000bae78c cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0] pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100 lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90 pid = 5593, comm = mount ext4_free_inode+0x780/0xb90 ext4_evict_inode+0xa8c/0xc60 evict+0xfc/0x1e0 ext4_fc_replay+0xc50/0x20f0 do_one_pass+0xfe0/0x1350 jbd2_journal_recover+0x184/0x2e0 jbd2_journal_load+0x1c0/0x4a0 ext4_fill_super+0x2458/0x4200 mount_bdev+0x1dc/0x290 ext4_mount+0x28/0x40 legacy_get_tree+0x4c/0xa0 vfs_get_tree+0x4c/0x120 path_mount+0xcf8/0xd70 do_mount+0x80/0xd0 sys_mount+0x3fc/0x490 system_call_exception+0x384/0x3d0 system_call_common+0xec/0x278 Cc: stable@kernel.org Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-06-02 21:40:42 -04:00
Andreas Gruenbacher	d5b8145455	Revert "gfs2: Fix mmap locking for write faults" This reverts commit `b7f55d928e`. As explained by Linus in [], write faults on a mmap region are reads from a filesysten point of view, so taking the inode glock exclusively on write faults is incorrect. Instead, when a page is marked writable, the .page_mkwrite vm operation will be called, which is where the exclusive lock taking needs to happen. I got this wrong because of a broken test case that made me believe .page_mkwrite isn't getting called when it actually is. [] https://lore.kernel.org/lkml/CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com/ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-06-01 23:16:42 +02:00
Dai Ngo	f8849e206e	NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error. Currently if __nfs4_proc_set_acl fails with NFS4ERR_BADOWNER it re-enables the idmapper by clearing NFS_CAP_UIDGID_NOMAP before retrying again. The NFS_CAP_UIDGID_NOMAP remains cleared even if the retry fails. This causes problem for subsequent setattr requests for v4 server that does not have idmapping configured. This patch modifies nfs4_proc_set_acl to detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skips the retry, since the kernel isn't involved in encoding the ACEs, and return -EINVAL. Steps to reproduce the problem: # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt # touch /tmp/mnt/file1 # chown 99 /tmp/mnt/file1 # nfs4_setfacl -a A::unknown.user@xyz.com:wrtncy /tmp/mnt/file1 Failed setxattr operation: Invalid argument # chown 99 /tmp/mnt/file1 chown: changing ownership of ‘/tmp/mnt/file1’: Invalid argument # umount /tmp/mnt # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt # chown 99 /tmp/mnt/file1 # v2: detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skip retry in nfs4_proc_set_acl. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-06-01 13:16:17 -04:00
Christian Brauner	dd8b477f9a	mount: Support "nosymfollow" in new mount api Commit `dab741e0e0` ("Add a "nosymfollow" mount option.") added support for the "nosymfollow" mount option allowing to block following symlinks when resolving paths. The mount option so far was only available in the old mount api. Make it available in the new mount api as well. Bonus is that it can be applied to a whole subtree not just a single mount. Cc: Christoph Hellwig <hch@lst.de> Cc: Mattias Nissler <mnissler@chromium.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <zwisler@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-06-01 12:09:27 +02:00
Linus Torvalds	c2131f7e73	Various gfs2 fixes -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmC0vkAUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqCGQ/+JiCdfHQao3/W9KsIeA5YO5fbsQXi tElY61L4eM7F+gEe1mbMzr8sefbejv73aAMGWJJD06gLPz/wIPeW/fYnC4/gcQEn +jLjVb7taGaxOn0fioCqjU+esGW4wstYrAXLp6XZmLnMETmr7PCbOhohRG7sK1TX m8si6riMOiNw20MOHhUK9DFZ3rF4Q5Rp/vYaTDwoGpcORIv5bpPoQKYT1FMCTD9h 5qI6ldOO2E4d9qXQXiCv2RqXElYqQxwxqvGP0Hj+HQLZQBCmJJYZNDqRwDunJTaN K9++1/XbTCFKEQz0UWz1x1k5fCDIewbzxX348aQjiLMkkpXr885AGhasAX8gRS3p D7Y4q6VCY3J5JzlCDfNWrTBd0abLJAjeJ70R71/kN/hgIY2PbU/CaPcyhUrp7rwH B6spZDXb2fBNdfYA5wmuUdSA9BRmw/MDpiGd9aQc5nv25YvZ5Apl9X4QSH2250vo MKTrlt90EyTmOgF6vRf28apVr41JO3PIXgMu+svZq769Ox2jSZJQT0UI4vzVThoP RGBsTDPtDL67OvNoC6H7Poc7ad+BRqtFxkwNCz7kkcwQlYkmVPUf49UC/pBnBV3M HtlkJdlhD7VEWqUPl3T02rTdLRXLuPIGw9Kk6gKiDCikONoD+icJ3fV7rWShMjhD O/KT/r3XM1V1sP8= =vV+Y -----END PGP SIGNATURE----- Merge tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: "Various gfs2 fixes" * tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix use-after-free in gfs2_glock_shrink_scan gfs2: Fix mmap locking for write faults gfs2: Clean up revokes on normal withdraws gfs2: fix a deadlock on withdraw-during-mount gfs2: fix scheduling while atomic bug in glocks gfs2: Fix I_NEW check in gfs2_dinode_in gfs2: Prevent direct-I/O write fallback errors from getting lost	2021-05-31 05:57:22 -10:00
Linus Torvalds	36c795513a	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmC0vhsACgkQnJ2qBz9k QNlI9ggAjZSqIvNNs1w6VafSRY7XP5vItKAe0jhguD0o1ZtUI1gM1JlOJzbgt2z5 gpm/4v4485h5JUXNrB5TeQ1woOOvFKzlUcIr+ZgUiyq2UgZj6PzvK599u2TFf1vc gLMAUx5YgWafr048orhcSBqaYic04LESQ17op+9UjgBB7ATbNjJmEBb/+WvGh9os 8c4V9JrCTMdNJ5Rpc5+JsWAksgZKrW9VjTw8mHisWB0NIIPQWGCML8Z4ACzNObCW CrXL9xWgaQDov1okJSA0ZNkdatGhh4h/NxIZ2sLGg2F3bDfZwN+kFu6gqpxhTEVV v83aTAP3UxbK8bwRj0+lm/LImxULjA== =t4P5 -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "A fix for permission checking with fanotify unpriviledged groups. Also there's a small update in MAINTAINERS file for fanotify" * tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix permission model of unprivileged group MAINTAINERS: Add Matthew Bobrowski as a reviewer	2021-05-31 05:52:22 -10:00
Hillf Danton	1ab19c5de4	gfs2: Fix use-after-free in gfs2_glock_shrink_scan The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to remove the glock from the lru list in __gfs2_glock_put(). On the shrink scan path, the same flag is cleared under lru_lock but because of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the put side can be made without deleting the glock from the lru list. Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to ensure correct behavior on both sides - clear GLF_LRU after list_del under lru_lock. Reported-by: syzbot <syzbot+34ba7ddbf3021981a228@syzkaller.appspotmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-31 12:03:28 +02:00
Linus Torvalds	75b9c727af	Fixes for 5.13-rc4: - Fix a bug where unmapping operations end earlier than expected, which can cause chaos on multi-block directory and symlink shrink operations. - Fix an erroneous assert that can trigger if we try to transition a bmap structure from btree format to extents format with zero extents. This was exposed by xfs/538. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCvxs0ACgkQ+H93GTRK tOsfAQ//fAtDZkjKYKHhWUFoyG6kYNsIZr7wf+kow8jJgeWUibwtUYYQQV/RCRtJ zR+Tiys9ZorAReYpzq69s1LbZg/Zz1GT4bPgq/9Icni9x8EXIS6MVaWJHjnkFtKD 7IqDztV3tC3XgSuAEsjey5PA1V1xpSgxxVtaT1Q2BcY8zqf2bnEPzM/rpKdmE++x jlTYrgLBctI24nbmTX2Y/+Te1UWXjM4QiV/EBHiUPedAqJZhwA0hU7hJJv/9I/EG /GOjisxhAonKR7fr7wPE+LwJMaxxK4LAt3aLZmGKpm3smSYX8O6sGnJv9VI/stsS wRD9c3wzLvfmqL5MXeAYq83u3s5DuFsfqmYD2U49xHlFF9tvLTT5S0Pdi/Qiq962 n3wabi0slBCdzeY3xXXr9M4cCLL6utYY8Vfi7KvBiDHdtCRZUU33/SAwxZzvhHQv 0XN+2sqnIn3jM9xg342+/BZbi4+SX7h28qixmgxCo+hez96GHuwhdN5GUVa5lF0r 4uRPn+VVaOJPcNRx69/iTkrJ1R4YPqedCkgLShs6lZX5Ct92UtANLzqmm0xCZ6U5 Pe7WjXO6aVAugEsVd9qnPdx4o/+sabs6CEQ65BbYyKvOBVg1XWH0yUEeKyGRYDt9 21ol7QSKJ6+HIg473j8A3omM5s2XKUT8NZMmRm4EojNI+j4O3R4= =y7+K -----END PGP SIGNATURE----- Merge tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: "This week's pile mitigates some decades-old problems in how extent size hints interact with realtime volumes, fixes some failures in online shrink, and fixes a problem where directory and symlink shrinking on extremely fragmented filesystems could fail. The most user-notable change here is to point users at our (new) IRC channel on OFTC. Freedom isn't free, it costs folks like you and me; and if you don't kowtow, they'll expel everyone and take over your channel. (Ok, ok, that didn't fit the song lyrics...) Summary: - Fix a bug where unmapping operations end earlier than expected, which can cause chaos on multi-block directory and symlink shrink operations. - Fix an erroneous assert that can trigger if we try to transition a bmap structure from btree format to extents format with zero extents. This was exposed by xfs/538" * tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: bunmapi has unnecessary AG lock ordering issues xfs: btree format inode forks can have zero extents xfs: add new IRC channel to MAINTAINERS xfs: validate extsz hints against rt extent size when rtinherit is set xfs: standardize extent size hint validation xfs: check free AG space when making per-AG reservations	2021-05-29 17:47:19 -10:00
Pavel Begunkov	216e583596	io_uring: fix misaccounting fix buf pinned pages As Andres reports "... io_sqe_buffer_register() doesn't initialize imu. io_buffer_account_pin() does imu->acct_pages++, before calling io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM. Initialise the field. Reported-by: Andres Freund <andres@anarazel.de> Fixes: `41edf1a5ec` ("io_uring: keep table of pointers to ubufs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-29 19:27:21 -06:00
Linus Torvalds	e1a9e3db3b	Driver core fixes for 5.13-rc4 Here are 3 small driver core / debugfs fixes for 5.13-rc4: - debugfs fix for incorrect "lockdown" mode for selinux accesses - 2 device link changes, one bugfix and one cleanup All of these have been in linux-next for over a week with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYLJMrQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ynf8ACgvsZCX7Wi3GYtFovfomHsCRKpZBsAn0sqfSAL TXHePEnj2tJ5c22TSqSt =Zx6Z -----END PGP SIGNATURE----- Merge tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are three small driver core / debugfs fixes for 5.13-rc4: - debugfs fix for incorrect "lockdown" mode for selinux accesses - two device link changes, one bugfix and one cleanup All of these have been in linux-next for over a week with no reported problems" * tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: drivers: base: Reduce device link removal code duplication drivers: base: Fix device link removal debugfs: fix security_locked_down() call for SELinux	2021-05-29 06:33:28 -10:00
Linus Torvalds	b3dbbae609	io_uring-5.13-2021-05-28 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCxY4wQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqJnD/sEHg2ZVzc3CUtvLI11C+O4nkqzUpetOD8I iKtvCYKYNTATOPLGQjsznNTTVcUhN4Mud9XWHjyR3nli98fwRrzLuK3EfJjuq1cL v6DZVuYKq4k6s0QN6K8yTMslYBQTmk85l8rvXs06jVqDadnnVc+JdfWWBDducs0e 56Wtmlse18PhzfDjqtsjAOQBjpv4bhQaJTrYOHcEIqFiih2ZpSvyP3SLED7/nvoe Q8MNF0Htff/oVbUEzp/NfhHoOFIZ17wwPV3fRC7zat2Dp4R9ZxpScmozLn8PkdO9 DW+rKpuCbYTYwY1p11cQ5EhiNWNfPMxX4YXovUP9z+M2cgGUK1IhWQRM83L9bAXt r/9Md5WjnNpeDr6/YW6uMe1lOrrEy2ZJfNJ2JJbiXo6CWiz+g2qfHLOxwVsEnfoy vZoSbDD8ItZDooaXDFGEp1PLpkka4vt/6Ebg0fUtEeG8QQ48eG5L9xpPMSjm90y9 /UKZdS1pvSl/x6he+RDPg4aVGBWIhGJhv+Q22hNTO3g5u5QE+hXLvFh0QvoOkDQK FGlhIa431EiOdm3rdFCG2I4kH1QzQTO6XLHpoVabGXJULPvS2ztnHCz3pYqOU9w1 Mh12t1RtWzvcTkyOutfsjVqszV3kTl6O6GkI8CiqqjomnbbfORj6CDsi7h9RFZI+ HtnY2GbSJg== =dfLl -----END PGP SIGNATURE----- Merge tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "A few minor fixes: - Fix an issue with hashed wait removal on exit (Zqiang, Pavel) - Fix a recent data race introduced in this series (Marco)" * tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block: io_uring: fix data race to avoid potential NULL-deref io-wq: Fix UAF when wakeup wqe in hash waitqueue io_uring/io-wq: close io-wq full-stop gap	2021-05-28 14:35:55 -10:00
Linus Torvalds	7c0ec89d31	3 SMB3 fixes, two for stable, and the other fixes a problem pointed out with a recently added ioctl -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCxbv4ACgkQiiy9cAdy T1GkPwwAqq0tvNDZnQ6aAur1jwHMiOIAydvpgNKXlKkiXYu+qQbMRgOdUsOtjM00 Idi8PHKkID2X63HeiwLwoQTfTXGcs6I4UM1iOslEs2ZaX+Fkgo5PG25lIFRTAsqR tqYGqGi6yL6TWE7PlVJxr3QuwGeMr7B8X9A0lTZ7YJwslhByK8ymasPdF+jSgPQI zDOuAeXiYvlph8sCftWX7gF34aBfKgiH8LhA6M2SY5S16g7LwtXUJjq1PJactoD3 +nEPyCtRoN6ohScKNVnM8JDpOKIrM+mJ42RG28ZLo6//8so0SFcUdC8VhECBxOWL 9WkoL2GxRV0LoRnzCZS30EpAi/eQU+QlTrPueGp+n8GjauJMDPoxJ2l6UXox6CLm 8CqwxKATG6WbrdcGhaVbIxVbAWC7Ze271C/7L61R5K+RmDTXc6jI4vAIw1Pib4o+ CG6XtxHya5PM0zvyLgU28M6aY+WExbwnkSQKvI2FJZkOVG0xdFCy2O1QLLRBChmn a6hsA05a =8nZ2 -----END PGP SIGNATURE----- Merge tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Three SMB3 fixes. Two for stable, and the other fixes a problem pointed out with a recently added ioctl" * tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: change format of CIFS_FULL_KEY_DUMP ioctl cifs: fix string declarations and assignments in tracepoints cifs: set server->cipher_type to AES-128-CCM for SMB3.0	2021-05-28 14:15:47 -10:00
Linus Torvalds	5ff2756afd	NFS client bugfixes for Linux 5.13 Highlights include: Stable fixes - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config - Fix Oops in xs_tcp_send_request() when transport is disconnected - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() Bugfixes - Fix instances where signal_pending() should be fatal_signal_pending() - fix an incorrect limit in filelayout_decode_layout() - Fixes for the SUNRPC backlogged RPC queue - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() - Revert commit `586a0787ce` ("Clean up rpcrdma_prepare_readch()") -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCvomgACgkQZwvnipYK APLg6xAAqlR/HLNYOLAToQ6d9wzrL6Po3x8Lx7VURjCkFaoKB/jMq3Zbu/K8mZ+X CC6/XFtB9AikloK7sle6lRPuwwPL6y+vOML0Ais/dkYPNkbhe9ylf0rsYQiPljXT 8PAcqn8FXTZ9fKpKU8Quw24X1Jfkk6zUEeMy50HYDBfTx+gYEojMEKa6cl4URGzO 2JpuBO4Ku/vWDOPj7bWBX9wi7wkrJjGSYDnx1A5SOgUdV87H8VJkbTo9vVdEwFoE OtE8MQmFhdton0u9+MKImFQdVfxoYLB1Ig1G45NXGHee91dwfYU0U05THj7E/xP9 RQWtmJcKdvY1w8sRK/PNEHo43Vkow4usffSrIWNBZ6aO5EkbQFn1tmKMSDtsrkZ2 ONMfKBiEhhQSy+QRXMR/RC86t4dsQ8SApu62qQT4VuuXqzYhrBum2DqkW0X6Zcti gi17+PfjRbgWNvul2yegBvDU016H324aCeT9nfWe0D9iwF7tPK4xsuNTYrWwbFOA YFAecIXoyBRtbIV6NZ95/+P5HEFBLAYewEVLpdAOBGQ9fjO023ERiC2sitl5P+ku v6V+4HAtBgcPfm/8BZwUYUYBpXnnTZFTizqdJdGGydPXeC671gANPJe6e4xOttCK frXFGd9OOqPSXdsRZLUVvhczTOFOGa/UVVG0GxIr4ggy8oKjDMk= =66ks -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Stable fixes: - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config - Fix Oops in xs_tcp_send_request() when transport is disconnected - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() Bugfixes: - Fix instances where signal_pending() should be fatal_signal_pending() - fix an incorrect limit in filelayout_decode_layout() - Fixes for the SUNRPC backlogged RPC queue - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() - Revert commit `586a0787ce` ("Clean up rpcrdma_prepare_readch()")" * tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Remove trailing semicolon in macros xprtrdma: Revert `586a0787ce` NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config NFS: Clean up reset of the mirror accounting variables NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() NFS: Fix an Oopsable condition in __nfs_pageio_add_request() SUNRPC: More fixes for backlog congestion SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() SUNRPC in case of backlog, hand free slots directly to waiting task pNFS/NFSv4: Remove redundant initialization of 'rd_size' NFS: fix an incorrect limit in filelayout_decode_layout() fs/nfs: Use fatal_signal_pending instead of signal_pending	2021-05-28 08:53:19 -10:00
Christian Brauner	cfe80306a0	open: don't silently ignore unknown O-flags in openat2() The new openat2() syscall verifies that no unknown O-flag values are set and returns an error to userspace if they are while the older open syscalls like open() and openat() simply ignore unknown flag values: #define O_FLAG_CURRENTLY_INVALID (1 << 31) struct open_how how = { .flags = O_RDONLY \| O_FLAG_CURRENTLY_INVALID, .resolve = 0, }; /* fails / fd = openat2(-EBADF, "/dev/null", &how, sizeof(how)); / succeeds / fd = openat(-EBADF, "/dev/null", O_RDONLY \| O_FLAG_CURRENTLY_INVALID); However, openat2() silently truncates the upper 32 bits meaning: #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31) #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40) struct open_how how_lowe32 = { .flags = O_RDONLY \| O_FLAG_CURRENTLY_INVALID_LOWER32, }; struct open_how how_upper32 = { .flags = O_RDONLY \| O_FLAG_CURRENTLY_INVALID_UPPER32, }; / fails / fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32)); / succeeds / fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32)); Fix this by preventing the immediate truncation in build_open_flags(). There's a snafu here though stripping FMODE_ directly from flags would cause the upper 32 bits to be truncated as well due to integer promotion rules since FMODE_* is unsigned int, O_* are signed ints (yuck). In addition, struct open_flags currently defines flags to be 32 bit which is reasonable. If we simply were to bump it to 64 bit we would need to change a lot of code preemptively which doesn't seem worth it. So simply add a compile-time check verifying that all currently known O_* flags are within the 32 bit range and fail to build if they aren't anymore. This change shouldn't regress old open syscalls since they silently truncate any unknown values anyway. It is a tiny semantic change for openat2() but it is very unlikely people pass ing > 32 bit unknown flags and the syscall is relatively new too. Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reported-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-05-28 17:44:37 +02:00
Filipe Manana	76a6d5cd74	btrfs: fix deadlock when cloning inline extents and low on available space There are a few cases where cloning an inline extent requires copying data into a page of the destination inode. For these cases we are allocating the required data and metadata space while holding a leaf locked. This can result in a deadlock when we are low on available space because allocating the space may flush delalloc and two deadlock scenarios can happen: 1) When starting writeback for an inode with a very small dirty range that fits in an inline extent, we deadlock during the writeback when trying to insert the inline extent, at cow_file_range_inline(), if the extent is going to be located in the leaf for which we are already holding a read lock; 2) After successfully starting writeback, for non-inline extent cases, the async reclaim thread will hang waiting for an ordered extent to complete if the ordered extent completion needs to modify the leaf for which the clone task is holding a read lock (for adding or replacing file extent items). So the cloning task will wait forever on the async reclaim thread to make progress, which in turn is waiting for the ordered extent completion which in turn is waiting to acquire a write lock on the same leaf. So fix this by making sure we release the path (and therefore the leaf) every time we need to copy the inline extent's data into a page of the destination inode, as by that time we do not need to have the leaf locked. Fixes: `05a5a7621c` ("Btrfs: implement full reflink support for inline extents") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:52 +02:00
Filipe Manana	ea7036de0d	btrfs: fix fsync failure and transaction abort after writes to prealloc extents When doing a series of partial writes to different ranges of preallocated extents with transaction commits and fsyncs in between, we can end up with a checksum items in a log tree. This causes an fsync to fail with -EIO and abort the transaction, turning the filesystem to RO mode, when syncing the log. For this to happen, we need to have a full fsync of a file following one or more fast fsyncs. The following example reproduces the problem and explains how it happens: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt # Create our test file with 2 preallocated extents. Leave a 1M hole # between them to ensure that we get two file extent items that will # never be merged into a single one. The extents are contiguous on disk, # which will later result in the checksums for their data to be merged # into a single checksum item in the csums btree. # $ xfs_io -f \ -c "falloc 0 1M" \ -c "falloc 3M 3M" \ /mnt/foobar # Now write to the second extent and leave only 1M of it as unwritten, # which corresponds to the file range [4M, 5M[. # # Then fsync the file to flush delalloc and to clear full sync flag from # the inode, so that a future fsync will use the fast code path. # # After the writeback triggered by the fsync we have 3 file extent items # that point to the second extent we previously allocated: # # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the # file range [3M, 4M[ # # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers # the file range [4M, 5M[ # # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the # file range [5M, 6M[ # # All these file extent items have a generation of 6, which is the ID of # the transaction where they were created. The split of the original file # extent item is done at btrfs_mark_extent_written() when ordered extents # complete for the file ranges [3M, 4M[ and [5M, 6M[. # $ xfs_io -c "pwrite -S 0xab 3M 1M" \ -c "pwrite -S 0xef 5M 1M" \ -c "fsync" \ /mnt/foobar # Commit the current transaction. This wipes out the log tree created by # the previous fsync. sync # Now write to the unwritten range of the second extent we allocated, # corresponding to the file range [4M, 5M[, and fsync the file, which # triggers the fast fsync code path. # # The fast fsync code path sees that there is a new extent map covering # the file range [4M, 5M[ and therefore it will log a checksum item # covering the range [1M, 2M[ of the second extent we allocated. # # Also, after the fsync finishes we no longer have the 3 file extent # items that pointed to 3 sections of the second extent we allocated. # Instead we end up with a single file extent item pointing to the whole # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the # current transaction ID). This is due to the file extent item merging we # do when completing ordered extents into ranges that point to unwritten # (preallocated) extents. This merging is done at # btrfs_mark_extent_written(). # $ xfs_io -c "pwrite -S 0xcd 4M 1M" \ -c "fsync" \ /mnt/foobar # Now do some write to our file outside the range of the second extent # that we allocated with fallocate() and truncate the file size from 6M # down to 5M. # # The truncate operation sets the full sync runtime flag on the inode, # forcing the next fsync to use the slow code path. It also changes the # length of the second file extent item so that it represents the file # range [3M, 5M[ and not the range [3M, 6M[ anymore. # # Finally fsync the file. Since this is a fsync that triggers the slow # code path, it will remove all items associated to the inode from the # log tree and then it will scan for file extent items in the # fs/subvolume tree that have a generation matching the current # transaction ID, which is 7. This means it will log 2 file extent # items: # # 1) One for the first extent we allocated, covering the file range # [0, 1M[ # # 2) Another for the first 2M of the second extent we allocated, # covering the file range [3M, 5M[ # # When logging the first file extent item we log a single checksum item # that has all the checksums for the entire extent. # # When logging the second file extent item, we also lookup for the # checksums that are associated with the range [0, 2M[ of the second # extent we allocated (file range [3M, 5M[), and then we log them with # btrfs_csum_file_blocks(). However that results in ending up with a log # that has two checksum items with ranges that overlap: # # 1) One for the range [1M, 2M[ of the second extent we allocated, # corresponding to the file range [4M, 5M[, which we logged in the # previous fsync that used the fast code path; # # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second # extents, respectively, corresponding to the files ranges [0, 1M[ # and [3M, 5M[. This one was added during this last fsync that uses # the slow code path and overlaps with the previous one logged by # the previous fast fsync. # # This happens because when logging the checksums for the second # extent, we notice they start at an offset that matches the end of the # checksums item that we logged for the first extent, and because both # extents are contiguous on disk, btrfs_csum_file_blocks() decides to # extend that existing checksums item and append the checksums for the # second extent to this item. The end result is we end up with two # checksum items in the log tree that have overlapping ranges, as # listed before, resulting in the fsync to fail with -EIO and aborting # the transaction, turning the filesystem into RO mode. # $ xfs_io -c "pwrite -S 0xff 0 1M" \ -c "truncate 5M" \ -c "fsync" \ /mnt/foobar fsync: Input/output error After running the example, dmesg/syslog shows the tree checker complained about the checksum items with overlapping ranges and we aborted the transaction: $ dmesg (...) [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610 [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929 [756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160 [756289.564649] inode generation 6 size 5242880 mode 100600 [756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16 [756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53 [756289.567725] extent data disk bytenr 13631488 nr 1048576 [756289.568697] extent data offset 0 nr 1048576 ram 1048576 [756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53 [756289.570682] extent data disk bytenr 0 nr 0 [756289.571363] extent data offset 0 nr 2097152 ram 2097152 [756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53 [756289.573246] extent data disk bytenr 14680064 nr 3145728 [756289.574121] extent data offset 0 nr 2097152 ram 3145728 [756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072 [756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024 [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected [756289.578644] ------------[ cut here ]------------ [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs] [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...) [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1 [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [756289.595122] Code: 5d c3 e8 76 60 (...) [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282 [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000 [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000 [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000 [756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000 [756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0 [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [756289.606400] Call Trace: [756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs] [756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs] [756289.608040] submit_one_bio+0x61/0x70 [btrfs] [756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs] [756289.609258] ? free_debug_processing+0x1d5/0x240 [756289.609812] ? __module_address+0x28/0xf0 [756289.610298] ? lock_acquire+0x1a0/0x3e0 [756289.610754] ? lock_acquired+0x19f/0x430 [756289.611220] ? lock_acquire+0x1a0/0x3e0 [756289.611675] do_writepages+0x43/0xf0 [756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100 [756289.612800] __filemap_fdatawrite_range+0xc5/0x100 [756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs] [756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs] [756289.614661] ? finish_wait+0x90/0x90 [756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0 [756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs] [756289.616338] ? lock_acquire+0x1a0/0x3e0 [756289.616801] ? lock_acquired+0x19f/0x430 [756289.617284] ? lock_acquire+0x1a0/0x3e0 [756289.617750] ? lock_release+0x214/0x470 [756289.618221] ? lock_acquired+0x19f/0x430 [756289.618704] ? dput+0x20/0x4a0 [756289.619079] ? dput+0x20/0x4a0 [756289.619452] ? lockref_put_or_lock+0x9/0x30 [756289.619969] ? lock_release+0x214/0x470 [756289.620445] ? lock_release+0x214/0x470 [756289.620924] ? lock_release+0x214/0x470 [756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs] [756289.621982] do_fsync+0x38/0x70 [756289.622395] __x64_sys_fsync+0x10/0x20 [756289.622907] do_syscall_64+0x33/0x80 [756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae [756289.624063] RIP: 0033:0x7f08b27fbb7b [756289.624588] Code: 0f 05 48 3d 00 (...) [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003 [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0 [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001 [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480 [756289.631819] irq event stamp: 0 [756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0 [756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0 [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0 [756289.635606] ---[ end trace 0a039fdc16ff3fef ]--- [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure [756289.637082] BTRFS info (device sdc): forced readonly Having checksum items covering ranges that overlap is dangerous as in some cases it can lead to having extent ranges for which we miss checksums after log replay or getting the wrong checksum item. There were some fixes in the past for bugs that resulted in this problem, and were explained and fixed by the following commits: `27b9a8122f` ("Btrfs: fix csum tree corruption, duplicate and outdated checksums") `b84b8390d6` ("Btrfs: fix file read corruption after extent cloning and fsync") `40e046acbd` ("Btrfs: fix missing data checksums after replaying a log tree") `e289f03ea7` ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents") Fix the issue by making btrfs_csum_file_blocks() taking into account the start offset of the next checksum item when it decides to extend an existing checksum item, so that it never extends the checksum to end at a range that goes beyond the start range of the next checksum item. When we can not access the next checksum item without releasing the path, simply drop the optimization of extending the previous checksum item and fallback to inserting a new checksum item - this happens rarely and the optimization is not significant enough for a log tree in order to justify the extra complexity, as it would only save a few bytes (the size of a struct btrfs_item) of leaf space. This behaviour is only needed when inserting into a log tree because for the regular checksums tree we never have a case where we try to insert a range of checksums that overlap with a range that was previously inserted. A test case for fstests will follow soon. Reported-by: Philipp Fent <fent@in.tum.de> Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ CC: stable@vger.kernel.org # 5.4+ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:36 +02:00
Josef Bacik	dc09ef3562	btrfs: abort in rename_exchange if we fail to insert the second ref Error injection stress uncovered a problem where we'd leave a dangling inode ref if we failed during a rename_exchange. This happens because we insert the inode ref for one side of the rename, and then for the other side. If this second inode ref insert fails we'll leave the first one dangling and leave a corrupt file system behind. Fix this by aborting if we did the insert for the first inode ref. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:16 +02:00
Josef Bacik	f96d44743a	btrfs: check error value from btrfs_update_inode in tree log Error injection testing uncovered a case where we ended up with invalid link counts on an inode. This happened because we failed to notice an error when updating the inode while replaying the tree log, and committed the transaction with an invalid file system. Fix this by checking the return value of btrfs_update_inode. This resolved the link count errors I was seeing, and we already properly handle passing up the error values in these paths. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:13 +02:00
Josef Bacik	011b28acf9	btrfs: fixup error handling in fixup_inode_link_counts This function has the following pattern while (1) { ret = whatever(); if (ret) goto out; } ret = 0 out: return ret; However several places in this while loop we simply break; when there's a problem, thus clearing the return value, and in one case we do a return -EIO, and leak the memory for the path. Fix this by re-arranging the loop to deal with ret == 1 coming from btrfs_search_slot, and then simply delete the ret = 0; out: bit so everybody can break if there is an error, which will allow for proper error handling to occur. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:08 +02:00
Josef Bacik	d61bec08b9	btrfs: mark ordered extent and inode with error if we fail to finish While doing error injection testing I saw that sometimes we'd get an abort that wouldn't stop the current transaction commit from completing. This abort was coming from finish ordered IO, but at this point in the transaction commit we should have gotten an error and stopped. It turns out the abort came from finish ordered io while trying to write out the free space cache. It occurred to me that any failure inside of finish_ordered_io isn't actually raised to the person doing the writing, so we could have any number of failures in this path and think the ordered extent completed successfully and the inode was fine. Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and marking the mapping of the inode with mapping_set_error, so any callers that simply call fdatawait will also get the error. With this we're seeing the IO error on the free space inode when we fail to do the finish_ordered_io. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:01 +02:00
Josef Bacik	856bd270dc	btrfs: return errors from btrfs_del_csums in cleanup_ref_head We are unconditionally returning 0 in cleanup_ref_head, despite the fact that btrfs_del_csums could fail. We need to return the error so the transaction gets aborted properly, fix this by returning ret from btrfs_del_csums in cleanup_ref_head. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:30:55 +02:00
Josef Bacik	b86652be7c	btrfs: fix error handling in btrfs_del_csums Error injection stress would sometimes fail with checksums on disk that did not have a corresponding extent. This occurred because the pattern in btrfs_del_csums was while (1) { ret = btrfs_search_slot(); if (ret < 0) break; } ret = 0; out: btrfs_free_path(path); return ret; If we got an error from btrfs_search_slot we'd clear the error because we were breaking instead of goto out. Instead of using goto out, simply handle the cases where we may leave a random value in ret, and get rid of the ret = 0; out: pattern and simply allow break to have the proper error reporting. With this fix we properly abort the transaction and do not commit thinking we successfully deleted the csum. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:30:49 +02:00
Qu Wenruo	4c80a97d7b	btrfs: fix compressed writes that cross stripe boundary [BUG] When running btrfs/027 with "-o compress" mount option, it always crashes with the following call trace: BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192 ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:6651! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G OE 5.13.0-rc2-custom+ #26 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs] Call Trace: btrfs_submit_compressed_write+0x2d7/0x470 [btrfs] submit_compressed_extents+0x3b0/0x470 [btrfs] ? mark_held_locks+0x49/0x70 btrfs_work_helper+0x131/0x3e0 [btrfs] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 ? process_one_work+0x5d0/0x5d0 kthread+0x141/0x160 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 ---[ end trace 63113a3a91f34e68 ]--- [CAUSE] The critical message before the crash means we have a bio at logical bytenr 298901504 length 12288, but only 8192 bytes can fit into one stripe, the remaining 4096 bytes go to another stripe. In btrfs, all bios are properly split to avoid cross stripe boundary, but commit `764c7c9a46` ("btrfs: zoned: fix parallel compressed writes") changed the behavior for compressed writes. Previously if we find our new page can't be fitted into current stripe, ie. "submit == 1" case, we submit current bio without adding current page. submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); page->mapping = NULL; if (submit \|\| bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { But after the modification, we will add the page no matter if it crosses stripe boundary, leading to the above crash. submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); if (pg_index == 0 && use_append) len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); else len = bio_add_page(bio, page, PAGE_SIZE, 0); page->mapping = NULL; if (submit \|\| len < PAGE_SIZE) { [FIX] It's no longer possible to revert to the original code style as we have two different bio_add__page() calls now. The new fix is to skip the bio_add__page() call if @submit is true. Also to avoid @len to be uninitialized, always initialize it to zero. If @submit is true, @len will not be checked. If @submit is not true, @len will be the return value of bio_add_*_page() call. Either way, the behavior is still the same as the old code. Reported-by: Josef Bacik <josef@toxicpanda.com> Fixes: `764c7c9a46` ("btrfs: zoned: fix parallel compressed writes") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:30:38 +02:00
Aurelien Aptel	1bb5681067	cifs: change format of CIFS_FULL_KEY_DUMP ioctl Make CIFS_FULL_KEY_DUMP ioctl able to return variable-length keys. * userspace needs to pass the struct size along with optional session_id and some space at the end to store keys * if there is enough space kernel returns keys in the extra space and sets the length of each key via xyz_key_length fields This also fixes the build error for get_user() on ARM. Sample program: #include <stdlib.h> #include <stdio.h> #include <stdint.h> #include <sys/fcntl.h> #include <sys/ioctl.h> struct smb3_full_key_debug_info { uint32_t in_size; uint64_t session_id; uint16_t cipher_type; uint8_t session_key_length; uint8_t server_in_key_length; uint8_t server_out_key_length; uint8_t data[]; /* * return this struct with the keys appended at the end: * uint8_t session_key[session_key_length]; * uint8_t server_in_key[server_in_key_length]; * uint8_t server_out_key[server_out_key_length]; / } __attribute__((packed)); #define CIFS_IOCTL_MAGIC 0xCF #define CIFS_DUMP_FULL_KEY _IOWR(CIFS_IOCTL_MAGIC, 10, struct smb3_full_key_debug_info) void dump(const void p, size_t len) { const char hex = "0123456789ABCDEF"; const uint8_t b = p; for (int i = 0; i < len; i++) printf("%c%c ", hex[(b[i]>>4)&0xf], hex[b[i]&0xf]); putchar('\n'); } int main(int argc, char *argv) { struct smb3_full_key_debug_info keys; uint8_t buf[sizeof(keys)+1024] = {0}; size_t off = 0; int fd, rc; keys = (struct smb3_full_key_debug_info )&buf; keys->in_size = sizeof(buf); fd = open(argv[1], O_RDONLY); if (fd < 0) perror("open"), exit(1); rc = ioctl(fd, CIFS_DUMP_FULL_KEY, keys); if (rc < 0) perror("ioctl"), exit(1); printf("SessionId "); dump(&keys->session_id, 8); printf("Cipher %04x\n", keys->cipher_type); printf("SessionKey "); dump(keys->data+off, keys->session_key_length); off += keys->session_key_length; printf("ServerIn Key "); dump(keys->data+off, keys->server_in_key_length); off += keys->server_in_key_length; printf("ServerOut Key "); dump(keys->data+off, keys->server_out_key_length); return 0; } Usage: $ gcc -o dumpkeys dumpkeys.c Against Windows Server 2020 preview (with AES-256-GCM support): # mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.0,seal" # ./dumpkeys /mnt/somefile SessionId 0D 00 00 00 00 0C 00 00 Cipher 0002 SessionKey AB CD CC 0D E4 15 05 0C 6F 3C 92 90 19 F3 0D 25 ServerIn Key 73 C6 6A C8 6B 08 CF A2 CB 8E A5 7D 10 D1 5B DC ServerOut Key 6D 7E 2B A1 71 9D D7 2B 94 7B BA C4 F0 A5 A4 F8 # umount /mnt With 256 bit keys: # echo 1 > /sys/module/cifs/parameters/require_gcm_256 # mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.11,seal" # ./dumpkeys /mnt/somefile SessionId 09 00 00 00 00 0C 00 00 Cipher 0004 SessionKey 93 F5 82 3B 2F B7 2A 50 0B B9 BA 26 FB 8C 8B 03 ServerIn Key 6C 6A 89 B2 CB 7B 78 E8 04 93 37 DA 22 53 47 DF B3 2C 5F 02 26 70 43 DB 8D 33 7B DC 66 D3 75 A9 ServerOut Key 04 11 AA D7 52 C7 A8 0F ED E3 93 3A 65 FE 03 AD 3F 63 03 01 2B C0 1B D7 D7 E5 52 19 7F CC 46 B4 Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-27 15:26:32 -05:00
Shyam Prasad N	eb06881805	cifs: fix string declarations and assignments in tracepoints We missed using the variable length string macros in several tracepoints. Fixed them in this change. There's probably more useful macros that we can use to print others like flags etc. But I'll submit sepawrate patches for those at a future date. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: <stable@vger.kernel.org> # v5.12 Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-27 14:04:32 -05:00
Aurelien Aptel	6d2fcfe6b5	cifs: set server->cipher_type to AES-128-CCM for SMB3.0 SMB3.0 doesn't have encryption negotiate context but simply uses the SMB2_GLOBAL_CAP_ENCRYPTION flag. When that flag is present in the neg response cifs.ko uses AES-128-CCM which is the only cipher available in this context. cipher_type was set to the server cipher only when parsing encryption negotiate context (SMB3.1.1). For SMB3.0 it was set to 0. This means cipher_type value can be 0 or 1 for AES-128-CCM. Fix this by checking for SMB3.0 and encryption capability and setting cipher_type appropriately. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-27 14:03:47 -05:00
David Howells	f610a5a29c	afs: Fix the nlink handling of dir-over-dir rename Fix rename of one directory over another such that the nlink on the deleted directory is cleared to 0 rather than being decremented to 1. This was causing the generic/035 xfstest to fail. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-27 06:23:58 -10:00
Dave Chinner	0fe0bbe00a	xfs: bunmapi has unnecessary AG lock ordering issues large directory block size operations are assert failing because xfs_bunmapi() is not completely removing fragmented directory blocks like so: XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677 .... Call Trace: xfs_dir2_shrink_inode+0x1a8/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_rename+0xb79/0xc50 ? avc_has_perm+0x8d/0x1a0 ? avc_has_perm_noaudit+0x9a/0x120 xfs_vn_rename+0xdb/0x150 vfs_rename+0x719/0xb50 ? __lookup_hash+0x6a/0xa0 do_renameat2+0x413/0x5e0 __x64_sys_rename+0x45/0x50 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae We are aborting the bunmapi() pass because of this specific chunk of code: /* * Make sure we don't touch multiple AGF headers out of order * in a single transaction, as that could cause AB-BA deadlocks. */ if (!wasdel && !isrt) { agno = XFS_FSB_TO_AGNO(mp, del.br_startblock); if (prev_agno != NULLAGNUMBER && prev_agno > agno) break; prev_agno = agno; } This is designed to prevent deadlocks in AGF locking when freeing multiple extents by ensuring that we only ever lock in increasing AG number order. Unfortunately, this also violates the "bunmapi will always succeed" semantic that some high level callers depend on, such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and xfs_inactive_symlink_rmt(). This AG lock ordering was introduced back in 2017 to fix deadlocks triggered by generic/299 as reported here: https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/ This codebase is old enough that it was before we were defering all AG based extent freeing from within xfs_bunmapi(). THat is, we never actually lock AGs in xfs_bunmapi() any more - every non-rt based extent free is added to the defer ops list, as is all BMBT block freeing. And RT extents are not RT based, so there's no lock ordering issues associated with them. Hence this AGF lock ordering code is both broken and dead. Let's just remove it so that the large directory block code works reliably again. Tested against xfs/538 and generic/299 which is the original test that exposed the deadlocks that this code fixed. Fixes: `5b094d6dac` ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-05-27 08:11:24 -07:00
Dave Chinner	991c2c5980	xfs: btree format inode forks can have zero extents xfs/538 is assert failing with this trace when testing with directory block sizes of 64kB: XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608 .... Call Trace: xfs_bmap_btree_to_extents+0x2a9/0x470 ? kmem_cache_alloc+0xe7/0x220 __xfs_bunmapi+0x4ca/0xdf0 xfs_bunmapi+0x1a/0x30 xfs_dir2_shrink_inode+0x71/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_remove+0x244/0x460 xfs_vn_unlink+0x53/0xa0 ? selinux_inode_unlink+0x13/0x20 vfs_unlink+0x117/0x220 do_unlinkat+0x1a2/0x2d0 __x64_sys_unlink+0x42/0x60 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae This is a check to ensure that the extents have been read into memory before we are doing a ifork btree manipulation. This assert is bogus in the above case. We have a fragmented directory block that has more extents in it than can fit in extent format, so the inode data fork is in btree format. xfs_dir2_shrink_inode() asks to remove all remaining 16 filesystem blocks from the inode so it can convert to short form, and __xfs_bunmapi() removes all the extents. We now have a data fork in btree format but have zero extents in the fork. This incorrectly trips the xfs_need_iread_extents() assert because it assumes that an empty extent btree means the extent tree has not been read into memory yet. This is clearly not the case with xfs_bunmapi(), as it has an explicit call to xfs_iread_extents() in it to pull the extents into memory before it starts unmapping. Also, the assert directly after this bogus one is: ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE); Which covers the context in which it is legal to call xfs_bmap_btree_to_extents just fine. Hence we should just remove the bogus assert as it is clearly wrong and causes a regression. The returns the test behaviour to the pre-existing assert failure in xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to remove all the extents in the range it was asked to unmap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-05-27 08:11:24 -07:00
Marco Elver	b16ef427ad	io_uring: fix data race to avoid potential NULL-deref Commit `ba5ef6dc8a` ("io_uring: fortify tctx/io_wq cleanup") introduced setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to detect a data race between accesses to tctx->io_wq: write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1: io_uring_clean_tctx fs/io_uring.c:9042 [inline] __io_uring_cancel fs/io_uring.c:9136 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit kernel/exit.c:781 do_group_exit kernel/exit.c:923 get_signal kernel/signal.c:2835 arch_do_signal_or_restart arch/x86/kernel/signal.c:789 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] ... read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0: io_uring_try_cancel_iowq fs/io_uring.c:8911 [inline] io_uring_try_cancel_requests fs/io_uring.c:8933 io_ring_exit_work fs/io_uring.c:8736 process_one_work kernel/workqueue.c:2276 ... With the config used, KCSAN only reports data races with value changes: this implies that in the case here we also know that tctx->io_wq was non-NULL. Therefore, depending on interleaving, we may end up with: [CPU 0] \| [CPU 1] io_uring_try_cancel_iowq() \| io_uring_clean_tctx() if (!tctx->io_wq) // false \| ... ... \| tctx->io_wq = NULL io_wq_cancel_cb(tctx->io_wq, ...) \| ... -> NULL-deref \| Note: It is likely that thus far we've gotten lucky and the compiler optimizes the double-read into a single read into a register -- but this is never guaranteed, and can easily change with a different config! Fix the data race by restoring the previous behaviour, where both setting io_wq to NULL and put of the wq are _serialized_ after concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock and removal of the node in io_uring_del_task_file(). Fixes: `ba5ef6dc8a` ("io_uring: fortify tctx/io_wq cleanup") Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-27 07:44:49 -06:00
Huilong Deng	a799b68a7c	nfs: Remove trailing semicolon in macros Macros should not use a trailing semicolon. Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-27 09:19:33 -04:00
Zhang Xiaoxu	e67afa7ee4	NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config Since commit `bdcc2cd14e` ("NFSv4.2: handle NFS-specific llseek errors"), nfs42_proc_llseek would return -EOPNOTSUPP rather than -ENOTSUPP when SEEK_DATA on NFSv4.0/v4.1. This will lead xfstests generic/285 not run on NFSv4.0/v4.1 when set the CONFIG_NFS_V4_2, rather than run failed. Fixes: `bdcc2cd14e` ("NFSv4.2: handle NFS-specific llseek errors") Cc: <stable.vger.kernel.org> # 4.2 Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-27 08:46:19 -04:00
Gustavo A. R. Silva	53004ee78d	xfs: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix the following warnings by replacing /* fall through / comments, and its variants, with the new pseudo-keyword macro fallthrough: fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] Notice that Clang doesn't recognize / fall through */ comments as implicit fall-through markings, so in order to globally enable -Wimplicit-fallthrough for Clang, these comments need to be replaced with fallthrough; in the whole codebase. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2021-05-26 14:51:26 -05:00
Zqiang	3743c1723b	io-wq: Fix UAF when wakeup wqe in hash waitqueue BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802 Call Trace: __dump_stack [inline] dump_stack+0x141/0x1d7 print_address_description.constprop.0.cold+0x5b/0x2c6 __kasan_report [inline] kasan_report.cold+0x7c/0xd8 __wake_up_common+0x637/0x650 __wake_up_common_lock+0xd0/0x130 io_worker_handle_work+0x9dd/0x1790 io_wqe_worker+0xb2a/0xd40 ret_from_fork+0x1f/0x30 Allocated by task 28798: kzalloc_node [inline] io_wq_create+0x3c4/0xdd0 io_init_wq_offload [inline] io_uring_alloc_task_context+0x1bf/0x6b0 __io_uring_add_task_file+0x29a/0x3c0 io_uring_add_task_file [inline] io_uring_install_fd [inline] io_uring_create [inline] io_uring_setup+0x209a/0x2bd0 do_syscall_64+0x3a/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 28798: kfree+0x106/0x2c0 io_wq_destroy+0x182/0x380 io_wq_put [inline] io_wq_put_and_exit+0x7a/0xa0 io_uring_clean_tctx [inline] __io_uring_cancel+0x428/0x530 io_uring_files_cancel do_exit+0x299/0x2a60 do_group_exit+0x125/0x310 get_signal+0x47f/0x2150 arch_do_signal_or_restart+0x2a8/0x1eb0 handle_signal_work[inline] exit_to_user_mode_loop [inline] exit_to_user_mode_prepare+0x171/0x280 __syscall_exit_to_user_mode_work [inline] syscall_exit_to_user_mode+0x19/0x60 do_syscall_64+0x47/0xb0 entry_SYSCALL_64_after_hwframe There are the following scenarios, hash waitqueue is shared by io-wq1 and io-wq2. (note: wqe is worker) io-wq1:worker2 \| locks bit1 io-wq2:worker1 \| waits bit1 io-wq1:worker3 \| waits bit1 io-wq1:worker2 \| completes all wqe bit1 work items io-wq1:worker2 \| drop bit1, exit io-wq2:worker1 \| locks bit1 io-wq1:worker3 \| can not locks bit1, waits bit1 and exit io-wq1 \| exit and free io-wq1 io-wq2:worker1 \| drops bit1 io-wq1:worker3 \| be waked up, even though wqe is freed After all iou-wrk belonging to io-wq1 have exited, remove wqe form hash waitqueue, it is guaranteed that there will be no more wqe belonging to io-wq1 in the hash waitqueue. Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-26 09:03:56 -06:00
Trond Myklebust	70536bf4eb	NFS: Clean up reset of the mirror accounting variables Now that nfs_pageio_do_add_request() resets the pg_count, we don't need these other inlined resets. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-26 06:36:13 -04:00
Trond Myklebust	0d0ea30935	NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() The value of mirror->pg_bytes_written should only be updated after a successful attempt to flush out the requests on the list. Fixes: `a7d42ddb30` ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-26 06:36:13 -04:00
Trond Myklebust	56517ab958	NFS: Fix an Oopsable condition in __nfs_pageio_add_request() Ensure that nfs_pageio_error_cleanup() resets the mirror array contents, so that the structure reflects the fact that it is now empty. Also change the test in nfs_pageio_do_add_request() to be more robust by checking whether or not the list is empty rather than relying on the value of pg_count. Fixes: `a7d42ddb30` ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-26 06:36:13 -04:00
Pavel Begunkov	17a91051fe	io_uring/io-wq: close io-wq full-stop gap There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-25 19:39:58 -06:00
Kees Cook	bfb819ea20	proc: Check /proc/$pid/attr/ writes against file opener Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/ files need to check the opener credentials, since these fds do not transition state across execve(). Without this, it is possible to trick another process (which may have different credentials) to write to its own /proc/$pid/attr/ files, leading to unexpected and possibly exploitable behaviors. [1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-25 10:24:41 -10:00
Linus Torvalds	ad9f25d338	netfslib fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCs8qYACgkQ+7dXa6fL C2s3QhAAkRSWoZQEmisGMcRKOhDJQh8Qc7lV4aKXFIa4EaLdeBPvhJhnJqMld2KY m35g4bSU/RsUjzSCLXVnEiHa9jdFKyK0C/XWshyidzrTDUk0HN6NXsBpp3ztWKlq iMOvQYnKWKoWr4seIdC1fAKSFcQ3uRlVnDnmm0GtB5ahu5ThNQtqf8nYMSuULZbo K9SybNUVCrDsORqDu2595gfK63MCOVn72Hj066s8owHrbD8Io52Kf6Q7jP1CkMGL x6Kl0pwjql6usUsaDEaqmNT3ck7UjlLp5h1EZnt/7SWbgInpNzk6BLP33DwCis+4 rUpu+Zf8TEeOYDU5if8QpVszwsMyoKtkp9AjgjZkvxbedCqHkXJjxrnkk6/H7yJc 4Zvi8sIU52D9PcZO0bD8zP/8eYm/ZTVjMjDt8PvIbTA583oGNWsfRBbvJYi1huxB i3G0PNVbqH0U3Z78XH4dmrkE1oMxbq2O5fg9ZNCuStxqD2vrZyyo/CcfidElLnCq vcT+obEI+NYFphMzk7rwSL4pH4OPwPziJfiudKANmUDei8rOejQ8nrw18CVF7neN Ewj1XiHOdi4JgGq92owpmCmTvle7GG9KNuCvfd4U67S9KOJAPT5UrSD696PrJJN7 YpcBHJMqS9XLXwrGuKD7oDroxEEpvJEunRH+yt3YPa5OQtX3wIA= =poNo -----END PGP SIGNATURE----- Merge tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull netfs fixes from David Howells: "A couple of fixes to the new netfs lib: - Pass the AOP flags through from netfs_write_begin() into grab_cache_page_write_begin(). - Automatically enable in Kconfig netfs lib rather than presenting an option for manual enablement" * tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual netfs: Pass flags through to grab_cache_page_write_begin()	2021-05-25 07:31:49 -10:00
Gustavo A. R. Silva	b2db6c35ba	afs: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple fallthrough pseudo-keywords in places where the code is intended to fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/51150b54e0b0431a2c401cd54f2c4e7f50e94601.1605896059.git.gustavoars@kernel.org/ # v1 Link: https://lore.kernel.org/r/20210420211615.GA51432@embeddedor/ # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-25 07:30:34 -10:00
David Howells	b71c791254	netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual Make the netfs helper library selected automatically by the things that use it rather than being manually configured, even though it's required[1]. Fixes: `3a5829fefd` ("netfs: Make a netfs helper module") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/CAMuHMdXJZ7iNQE964CdBOU=vRKVMFzo=YF_eiwsGgqzuvZ+TuA@mail.gmail.com [1] Link: https://lore.kernel.org/r/162090298141.3166007.2971118149366779916.stgit@warthog.procyon.org.uk # v1	2021-05-25 13:48:04 +01:00
David Howells	19dee61381	netfs: Pass flags through to grab_cache_page_write_begin() In netfs_write_begin(), pass the AOP flags through to grab_cache_page_write_begin() so that a request to use GFP_NOFS is honoured. Fixes: `e1b1240c1f` ("netfs: Add write_begin helper") Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/162090295383.3165945.13595101698295243662.stgit@warthog.procyon.org.uk # v1	2021-05-25 13:46:32 +01:00
Amir Goldstein	a8b98c808e	fanotify: fix permission model of unprivileged group Reporting event->pid should depend on the privileges of the user that initialized the group, not the privileges of the user reading the events. Use an internal group flag FANOTIFY_UNPRIV to record the fact that the group was initialized by an unprivileged user. To be on the safe side, the premissions to setup filesystem and mount marks now require that both the user that initialized the group and the user setting up the mark have CAP_SYS_ADMIN. Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiA77_P5vtv7e83g0+9d7B5W9ZTE4GfQEYbWmfT1rA=VA@mail.gmail.com/ Fixes: `7cea2a3c50` ("fanotify: support limited functionality for unprivileged users") Cc: <Stable@vger.kernel.org> # v5.12+ Link: https://lore.kernel.org/r/20210524135321.2190062-1-amir73il@gmail.com Reviewed-by: Matthew Bobrowski <repnop@google.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-05-25 12:21:14 +02:00
Darrick J. Wong	603f000b15	xfs: validate extsz hints against rt extent size when rtinherit is set The RTINHERIT bit can be set on a directory so that newly created regular files will have the REALTIME bit set to store their data on the realtime volume. If an extent size hint (and EXTSZINHERIT) are set on the directory, the hint will also be copied into the new file. As pointed out in previous patches, for realtime files we require the extent size hint be an integer multiple of the realtime extent, but we don't perform the same validation on a directory with both RTINHERIT and EXTSZINHERIT set, even though the only use-case of that combination is to propagate extent size hints into new realtime files. This leads to inode corruption errors when the bad values are propagated. Because there may be existing filesystems with such a configuration, we cannot simply amend the inode verifier to trip on these directories and call it a day because that will cause previously "working" filesystems to start throwing errors abruptly. Note that it's valid to have directories with rtinherit set even if there is no realtime volume, in which case the problem does not manifest because rtinherit is ignored if there's no realtime device; and it's possible that someone set the flag, crashed, repaired the filesystem (which clears the hint on the realtime file) and continued. Therefore, mitigate this issue in several ways: First, if we try to write out an inode with both rtinherit/extszinherit set and an unaligned extent size hint, turn off the hint to correct the error. Second, if someone tries to misconfigure a directory via the fssetxattr ioctl, fail the ioctl. Third, reverify both extent size hint values when we propagate heritable inode attributes from parent to child, to prevent misconfigurations from spreading. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2021-05-24 18:01:04 -07:00
Darrick J. Wong	6b69e48589	xfs: standardize extent size hint validation While chasing a bug involving invalid extent size hints being propagated into newly created realtime files, I noticed that the xfs_ioctl_setattr checks for the extent size hints weren't the same as the ones now encoded in libxfs and used for validation in repair and mkfs. Because the checks in libxfs are more stringent than the ones in the ioctl, it's possible for a live system to set inode flags that immediately result in corruption warnings. Specifically, it's possible to set an extent size hint on an rtinherit directory without checking if the hint is aligned to the realtime extent size, which makes no sense since that combination is used only to seed new realtime files. Replace the open-coded and inadequate checks with the libxfs verifier versions and update the code comments a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2021-05-24 18:01:04 -07:00
Darrick J. Wong	0f9342513c	xfs: check free AG space when making per-AG reservations The new online shrink code exposed a gap in the per-AG reservation code, which is that we only return ENOSPC to callers if the entire fs doesn't have enough free blocks. Except for debugging mode, the reservation init code doesn't ever check that there's enough free space in that AG to cover the reservation. Not having enough space is not considered an immediate fatal error that requires filesystem offlining because (a) it's shouldn't be possible to wind up in that state through normal file operations and (b) even if one did, freeing data blocks would recover the situation. However, online shrink now needs to know if shrinking would not leave enough space so that it can abort the shrink operation. Hence we need to promote this assertion into an actual error return. Observed by running xfs/168 with a 1k block size, though in theory this could happen with any configuration. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2021-05-24 18:01:04 -07:00
Mike Kravetz	e32905e573	userfaultfd: hugetlbfs: fix new flag usage in error path In commit `d6995da311` ("hugetlb: use page.private for hugetlb specific page flags") the use of PagePrivate to indicate a reservation count should be restored at free time was changed to the hugetlb specific flag HPageRestoreReserve. Changes to a userfaultfd error path as well as a VM_BUG_ON() in remove_inode_hugepages() were overlooked. Users could see incorrect hugetlb reserve counts if they experience an error with a UFFDIO_COPY operation. Specifically, this would be the result of an unlikely copy_huge_page_from_user error. There is not an increased chance of hitting the VM_BUG_ON. Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com Fixes: `d6995da311` ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasry.mina@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-22 15:09:07 -10:00
Linus Torvalds	4ff2473bdb	block-5.13-2021-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCpPO4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgI0EACitV5OwfX+saZdQEj3LF4dAo7uZkMV0cZK GJ3m1NWsMDXJJofcczyVTEs0iNT4fpb1dKE9cyOVjAFDoH8Dn7C+UZ163QWu+SCk WGgyiY+Qdwr7cyl6+2+WQkLBeLcyuFVjGtYHTxYWY2O+DpyhRw94Oiih1bfnI/6i KZTpaA3z+pZs/KFIE7eUnkI/iWC39VShZ1T8/gXO9vmIhUkA67j1o9i3LYpGYnXx Awza8Lpql7s3tfWcDL6FNHQmFPUjiowCSUNupzdnHgjggWwUCosJTTcL+mfdTHOJ YuYM3qRuzTbIeXXy/5JTZUt5AOkS8SCre7BpclSDrhZBiL/dkvAndN43ce/6vc7i FrgvnbY/Ik2PWQwcbxiXZzcEKxT9dzXbsyJG08ePZwQ5s+8M5KVZv+ElrV+T7/nJ DYjnWahQ674tHv2Z7Bp4hAjnchwiypxqie8OnOKBI+WseT2D8Pjs2sinUHSYKYDk 3m2e0BVsw+FAYt3bcdhocDQnrJwMNrhSuA9Rtyh6qeMG34yxOXJmZvrHNrbg2fG/ a/xgVewn/P4sDxGCwS3XH/zILYgvJAwTFWIfDeRXE4epqsPZ9h8FBq3Fzl5asL7V yl9iQlWuE1+Ks8IQMjunbJfQSTEghPCjJWHVQQVJm+rT33qI80Ac4a0vdd99TaXh 8P58LE+0jg== =ADzj -----END PGP SIGNATURE----- Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Fix BLKRRPART and deletion race (Gulam, Christoph) - NVMe pull request (Christoph): - nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith Busch) - nvme-fc teardown fix (James Smart) - nvmet/nvme-loop memory leak fixes (Wu Bo)" * tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block: block: fix a race between del_gendisk and BLKRRPART block: prevent block device lookups at the beginning of del_gendisk nvme-fc: clear q_live at beginning of association teardown nvme-tcp: rerun io_work if req_list is not empty nvme-tcp: fix possible use-after-completion nvme-loop: fix memory leak in nvme_loop_create_ctrl() nvmet: fix memory leak in nvmet_alloc_ctrl()	2021-05-22 07:40:34 -10:00
Linus Torvalds	b9231dfbcb	io_uring-5.13-2021-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCpPQcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptBfD/0XV2+UXT7GTq3zfAgeCRojxzjH8YSh/fu1 uSYA2fME0J7B2lpgxwHmDwgW/JkxkQ9oal+QxoNUJnmTF4CN3c7edQYxaA+QAnb/ XiEY6s5slpBtopJCXQqPlE6dUnn1yjc0wNIm3EjvmmMaFjb6MVfMZWqWn9AANNTd yiRtk8a7KQuYdBeQMPVQG4v1ue37VTL5B9D9tM3p03W2ngNhtWw2Yxy5k/ePseip HYhPm+SKcbpmSFS+KN/a4aBLHyW89FRnhBWZF50sBmdUD+HLgz09IyFdSqyo9s2f wb7h3u3FbzTk3JofcJhYfqoQXkwmYhHrwNGCMjhK/zy+qloCIOu8Nw4jkcH+VYwK Rf7cFu+CZDRgcIu4Op/W5CPHNPY680Rxd/yBKlG/n4aZ/zxuuOu08992Z5BSaxfw UpIFMOWMuDvbBRUk71R34ME0o1wNhWL75Rljh97dAMRZLez1h8CmGdktT9g4keuo 71Swq51AQk7fWXW0yQK2kIpbTjazfh6+AEvdF4c/Njss83K7PHCq00xeKI9PeNXN aQvPBpFifTeN1B1IENH5wEHO8F7e38eU45WHPwgNJUuSEpuBoXQGoLBlf6WXzNUS sIt6UjGDCFTZddIYwVfVISl7+DLBLCxYRYnw0Mx99x1shUpH+6q0HdpPgOxiKmoH ZgdG/q8rVg== =tipG -----END PGP SIGNATURE----- Merge tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "One fix for a regression with poll in this merge window, and another just hardens the io-wq exit path a bit" * tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block: io_uring: fortify tctx/io_wq cleanup io_uring: don't modify req->poll for rw	2021-05-22 07:36:36 -10:00
Linus Torvalds	a3969ef463	Fixes for 5.13-rc3: - Fix some math errors in the realtime allocator when extent size hints are applied. - Fix unnecessary short writes to realtime files when free space is fragmented. - Fix a crash when using scrub tracepoints. - Restore ioctl uapi definitions that were accidentally removed in 5.13-rc1. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCmgSoACgkQ+H93GTRK tOvInhAAj9JbQLz/BPt14qVdteNDNBpdFdl/SC5Bow5ABOWi8FSOu9N/F32nMy9y fGPXP2G//sYzfW5jwE+ZPEEq7e+K55rLZHxHsbtWXjCL96t9edEDGFS5p6LmgnRW aJwE1QxhFHFJZEX1+Y08zmB+fSnwHJ4HzZihYb1f9sTI5cJRh7pvvj26HiqfUk9I FUT5Oo/Dy8gSSRmNCPWlLWhXJurrSwAYHmoE44vNHoEYHcodVwAK+ZR/Dj/5DQaS DTYgLeHZQmPijcW7/B/RKcEz96hMQ9afg7vBdgZwhG/oFCRKV2m3rYjjjo/AgKv4 4FkmUoP5+CTT1vt9UKrdyl9uLTMqxiHuU1qvb82DM5XbbsLEFBBa+e3QrWmJqR8D 3lBp6ogtOmbNEJpgLxCdbVl80HOjB+yaIWUB536nauz4USZvCcRGvdYDQ922q6ig 1eT5Q6KCNgO3e6WIQ3W5kNJsM+/gXlNvwwhN/jHCQKy//bgVRnNYbODC6K46mjp/ H8+NWyzQXpFjRWjmQ60LD2/RlyJVbbLbMkPTO1g/vjyZWjgp0fV2wtG6Ag/FlwVF DERUdp/0m6V+lPRcKbWCasjEc+pis6TDnvn+tCftXthh3fXGcalC3bnmv3yH9rkP YAejnDvuuLVtgPyVARA9DTQR5ch5CSRuFc6sU9SFL5nv2p59RtU= =IvUb -----END PGP SIGNATURE----- Merge tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: - Fix some math errors in the realtime allocator when extent size hints are applied. - Fix unnecessary short writes to realtime files when free space is fragmented. - Fix a crash when using scrub tracepoints. - Restore ioctl uapi definitions that were accidentally removed in 5.13-rc1. * tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: restore old ioctl definitions xfs: fix deadlock retry tracepoint arguments xfs: retry allocations when locality-based search fails xfs: adjust rt allocation minlen when extszhint > rtextsize	2021-05-21 18:45:09 -10:00
Linus Torvalds	45af60e7ce	for-5.13-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCoEQkACgkQxWXV+ddt WDsn6Q//XXQVextL6g6Wjx0SR9b5C1ndSV841jNY+KQ0drBPSOBs+0SXI+nIWAK1 iTpmj3s2qrRElZZ6DT4fKP28KnbUJed9+CcirNnN3IMOeauI760CLobXZLsw1wGH o0HKKgcPhw/v9o9jqX22rSfzDZ2Rx2KhZ8iEb1ZXIG5iJNFcnXCCoFOqk4I+UEvH /5734KU8RI3sCRhziSf/vDCF50p+BIWr8VilQkmZUzi0oa6Y1wXm0qd9j0unhICR NxcBk1NYdOosAvVRhSqync1BNLhXSctg4rwhLlSI5SDvt/Ivz5tguNr9HcizOvmW zyb0g1c3Pq0p2wQJLybbs1zn67d0+7Q23UPWx1C+IKU3nmX5mGWzToxjVOQASYaZ 8UbzYAjUHtJpLDB4dp6+k5Pv/yfVGyhxXI+qLMWow77qRPPf7/vw5nEwTXmjcPRH 9st0TopZVXI4IEpZP+HeNFdNONuPL3CqV0t1+MnC73WMhmUfXR5E8Yq5H3MscuFl smkrWUq/g+cmkiOw5r4MyadFuN1MsXGw4rOdbYjY4JqVht6gPkOp3P73Hme5rD3H Txw/1WKEl+w3I6wS0Dl/NFcMGOyl8gEv4rATDyRWkxfmCue2mcTGS/3jjjWWguu4 +Q7e6p1390PLAvMV/rEDoYmFCoPSYp6trvupW+5fkZdOyei1SZM= =98LW -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes: - fix unaligned compressed writes in zoned mode - fix false positive lockdep warning when cloning inline extent - remove wrong BUG_ON in tree-log error handling" * tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix parallel compressed writes btrfs: zoned: pass start block to btrfs_use_zone_append btrfs: do not BUG_ON in link_to_fixup_dir btrfs: release path before starting transaction when cloning inline extent	2021-05-21 13:24:12 -10:00
Linus Torvalds	8bb14ca171	7 SMB3 fixes, one for stable, 3 others fix problems found in testing handle leases, and a compounded request fix -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCn2x8ACgkQiiy9cAdy T1HjnQv+M87Xx++VVaJzeLQQlKGA/vfkhM7YLEkIwxmbUpt8JURORoK91xVa/RZA eS/K2tYOilAuuV7VXXw6ng6WNCWE/l+BNT5FHZ4WJt71pE1/tN/NIACtOhBB01GO r+JhAE08zYLu8vA1Ax1EBtSSBjTLUjDX0fWMfwD4C/BBABw5VZISnkSEj2lC6wT9 vovEalU9amMRrvlhK9Z+MRJRJFzxY4LingiEVlFIdLczCGia5PgSl3NXRY1//rNO wc//34cCGxBNc5Su5Bvn1kTZT5mdBFR98mLOuD+Dw55LlIlShKDnhZHGQDGPyQGT ey2w2b+pNAr3rwVNtU6JNmI7AiUllNHiDu5UsyB0ctDWJljzrILd4uPaWofcNXAh 5qPRvuGsqjo3D/10DPshla1pJtmFr8eKXy8o6UVfMYQSHDo1LbqMll7ArGgV3Fxn B2g5N+ax1+DXZlykKJGhYBBkvGANuUBU/tq810i5BvLhfrc1dx+pJlZAeO5OxCSA SBUiirq4 =neWC -----END PGP SIGNATURE----- Merge tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Seven smb3 fixes: one for stable, three others fix problems found in testing handle leases, and a compounded request fix" * tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6: Fix KASAN identified use-after-free issue. Defer close only when lease is enabled. Fix kernel oops when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. cifs: Fix inconsistent indenting cifs: fix memory leak in smb2_copychunk_range SMB3: incorrect file id in requests compounded with open cifs: remove deadstore in cifs_close_all_deferred_files()	2021-05-21 13:12:51 -10:00
Linus Torvalds	a0e31f3a38	Merge branch 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo fix from Eric Biederman: "During the merge window an issue with si_perf and the siginfo ABI came up. The alpha and sparc siginfo structure layout had changed with the addition of SIGTRAP TRAP_PERF and the new field si_perf. The reason only alpha and sparc were affected is that they are the only architectures that use si_trapno. Looking deeper it was discovered that si_trapno is used for only a few select signals on alpha and sparc, and that none of the other _sigfault fields past si_addr are used at all. Which means technically no regression on alpha and sparc. While the alignment concerns might be dismissed the abuse of si_errno by SIGTRAP TRAP_PERF does have the potential to cause regressions in existing userspace. While we still have time before userspace starts using and depending on the new definition siginfo for SIGTRAP TRAP_PERF this set of changes cleans up siginfo_t. - The si_trapno field is demoted from magic alpha and sparc status and made an ordinary union member of the _sigfault member of siginfo_t. Without moving it of course. - si_perf is replaced with si_perf_data and si_perf_type ending the abuse of si_errno. - Unnecessary additions to signalfd_siginfo are removed" * 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo signal: Deliver all of the siginfo perf data in _perf signal: Factor force_sig_perf out of perf_sigtrap signal: Implement SIL_FAULT_TRAPNO siginfo: Move si_trapno inside the union inside _si_fault	2021-05-21 06:12:52 -10:00
Phillip Potter	a8867f4e38	ext4: fix memory leak in ext4_mb_init_backend on error path. Fix a memory leak discovered by syzbot when a file system is corrupted with an illegally large s_log_groups_per_flex. Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-05-20 23:29:32 -04:00
Andreas Gruenbacher	b7f55d928e	gfs2: Fix mmap locking for write faults When a write fault occurs, we need to take the inode glock of the underlying inode in exclusive mode. Otherwise, there's no guarantee that the dirty page will be written back to disk. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-21 05:16:38 +02:00
Rohith Surabattula	9687c85dfb	Fix KASAN identified use-after-free issue. [ 612.157429] ================================================================== [ 612.158275] BUG: KASAN: use-after-free in process_one_work+0x90/0x9b0 [ 612.158801] Read of size 8 at addr ffff88810a31ca60 by task kworker/2:9/2382 [ 612.159611] CPU: 2 PID: 2382 Comm: kworker/2:9 Tainted: G OE 5.13.0-rc2+ #98 [ 612.159623] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 [ 612.159640] Workqueue: 0x0 (deferredclose) [ 612.159669] Call Trace: [ 612.159685] dump_stack+0xbb/0x107 [ 612.159711] print_address_description.constprop.0+0x18/0x140 [ 612.159733] ? process_one_work+0x90/0x9b0 [ 612.159743] ? process_one_work+0x90/0x9b0 [ 612.159754] kasan_report.cold+0x7c/0xd8 [ 612.159778] ? lock_is_held_type+0x80/0x130 [ 612.159789] ? process_one_work+0x90/0x9b0 [ 612.159812] kasan_check_range+0x145/0x1a0 [ 612.159834] process_one_work+0x90/0x9b0 [ 612.159877] ? pwq_dec_nr_in_flight+0x110/0x110 [ 612.159914] ? spin_bug+0x90/0x90 [ 612.159967] worker_thread+0x3b6/0x6c0 [ 612.160023] ? process_one_work+0x9b0/0x9b0 [ 612.160038] kthread+0x1dc/0x200 [ 612.160051] ? kthread_create_worker_on_cpu+0xd0/0xd0 [ 612.160092] ret_from_fork+0x1f/0x30 [ 612.160399] Allocated by task 2358: [ 612.160757] kasan_save_stack+0x1b/0x40 [ 612.160768] __kasan_kmalloc+0x9b/0xd0 [ 612.160778] cifs_new_fileinfo+0xb0/0x960 [cifs] [ 612.161170] cifs_open+0xadf/0xf20 [cifs] [ 612.161421] do_dentry_open+0x2aa/0x6b0 [ 612.161432] path_openat+0xbd9/0xfa0 [ 612.161441] do_filp_open+0x11d/0x230 [ 612.161450] do_sys_openat2+0x115/0x240 [ 612.161460] __x64_sys_openat+0xce/0x140 When mod_delayed_work is called to modify the delay of pending work, it might return false and queue a new work when pending work is already scheduled or when try to grab pending work failed. So, Increase the reference count when new work is scheduled to avoid use-after-free. Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-20 12:20:42 -05:00
Linus Torvalds	50f09a3dd5	Char/misc driver fixes for 5.13-rc3 Here is a big set of char/misc/other driver fixes for 5.13-rc3. The majority here is the fallout of the umn.edu re-review of all prior submissions. That resulted in a bunch of reverts along with the "correct" changes made, such that there is no regression of any of the potential fixes that were made by those individuals. I would like to thank the over 80 different developers who helped with the review and fixes for this mess. Other than that, there's a few habanna driver fixes for reported issues, and some dyndbg fixes for reported problems. All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYKZCBg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ynhRQCdGk6ri4oluyn/Z/2KAjvXDOmTmvgAn12VP42d S1Zmh4qRH2OWaLOBg7c2 =qtxj -----END PGP SIGNATURE----- Merge tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here is a big set of char/misc/other driver fixes for 5.13-rc3. The majority here is the fallout of the umn.edu re-review of all prior submissions. That resulted in a bunch of reverts along with the "correct" changes made, such that there is no regression of any of the potential fixes that were made by those individuals. I would like to thank the over 80 different developers who helped with the review and fixes for this mess. Other than that, there's a few habanna driver fixes for reported issues, and some dyndbg fixes for reported problems. All of these have been in linux-next for a while with no reported problems" * tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (82 commits) misc: eeprom: at24: check suspend status before disable regulator uio_hv_generic: Fix another memory leak in error handling paths uio_hv_generic: Fix a memory leak in error handling paths uio/uio_pci_generic: fix return value changed in refactoring Revert "Revert "ALSA: usx2y: Fix potential NULL pointer dereference"" dyndbg: drop uninformative vpr_info dyndbg: avoid calling dyndbg_emit_prefix when it has no work binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION cdrom: gdrom: initialize global variable at init time brcmfmac: properly check for bus register errors Revert "brcmfmac: add a check for the status of usb_register" video: imsttfb: check for ioremap() failures Revert "video: imsttfb: fix potential NULL pointer dereferences" net: liquidio: Add missing null pointer checks Revert "net: liquidio: fix a NULL pointer dereference" media: gspca: properly check for errors in po1030_probe() Revert "media: gspca: Check the return value of write_bridge for timeout" media: gspca: mt9m111: Check write_bridge for timeout Revert "media: gspca: mt9m111: Check write_bridge for timeout" media: dvb: Add check on sp8870_readreg return ...	2021-05-20 06:31:52 -10:00
Linus Torvalds	7ac177143c	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCmN9AACgkQnJ2qBz9k QNn5ZwgAwnLdgBuILDqJwPaYpXOzvMhjjG8AwBDzhMYhhpt+OOCUevoRm7mDU7J2 t/DlwWGMhpp80ku+x+AURR/ltOfFvw4QAHeIXPWjkoieFKcLOEvAjWWZP6oIFC12 5e/QVXqK58fuRJwveYp4jZ+AXvDMoHJrDXsoTFezjBDIQQgzlIlrMzPavS/6UzUN mAF2sapE9lcQoRMfU8kktBWPVM/GpFkus2Q48EYFCZ1rp3aRyw/aahTVuvSUZCV0 XiY6f2F7qgFLtomK6UurlxTc7rPsrG+UmNvGWuXf3R81UawegmKQeG5zcaMGrZs1 kHyJQcP9nGYPLDXt/4kW9cY0s8oOKg== =RbOE -----END PGP SIGNATURE----- Merge tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota fixes from Jan Kara: "The most important part in the pull is disablement of the new syscall quotactl_path() which was added in rc1. The reason is some people at LWN discussion pointed out dirfd would be useful for this path based syscall and Christian Brauner agreed. Without dirfd it may be indeed problematic for containers. So let's just disable the syscall for now when it doesn't have users yet so that we have more time to mull over how to best specify the filesystem we want to work on" * tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Disable quotactl_path syscall quota: Use 'hlist_for_each_entry' to simplify code	2021-05-20 06:20:15 -10:00
Anna Schumaker	a421d21860	NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() Commit `de144ff423` changes _pnfs_return_layout() to call pnfs_mark_matching_lsegs_return() passing NULL as the struct pnfs_layout_range argument. Unfortunately, pnfs_mark_matching_lsegs_return() doesn't check if we have a value here before dereferencing it, causing an oops. I'm able to hit this crash consistently when running connectathon basic tests on NFS v4.1/v4.2 against Ontap. Fixes: `de144ff423` ("NFSv4: Don't discard segments marked for return in _pnfs_return_layout()") Cc: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-20 12:17:08 -04:00
Yang Li	d1d973950a	pNFS/NFSv4: Remove redundant initialization of 'rd_size' Variable 'rd_size' is being initialized however this value is never read as 'rd_size' is assigned a new value in for statement. Remove the redundant assignment. Clean up clang warning: fs/nfs/pnfs.c:2681:6: warning: Value stored to 'rd_size' during its initialization is never read [clang-analyzer-deadcode.DeadStores] Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-20 12:17:08 -04:00
Dan Carpenter	769b01ea68	NFS: fix an incorrect limit in filelayout_decode_layout() The "sizeof(struct nfs_fh)" is two bytes too large and could lead to memory corruption. It should be NFS_MAXFHSIZE because that's the size of the ->data[] buffer. I reversed the size of the arguments to put the variable on the left. Fixes: `16b374ca43` ("NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-20 12:17:08 -04:00
zhouchuangao	bb00238890	fs/nfs: Use fatal_signal_pending instead of signal_pending We set the state of the current process to TASK_KILLABLE via prepare_to_wait(). Should we use fatal_signal_pending() to detect the signal here? Fixes: `b4868b44c5` ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE") Signed-off-by: zhouchuangao <zhouchuangao@vivo.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-05-20 12:15:35 -04:00
Darrick J. Wong	e3c2b04747	xfs: restore old ioctl definitions These ioctl definitions in xfs_fs.h are part of the userspace ABI and were mistakenly removed during the 5.13 merge window. Fixes: `9fefd5db08` ("xfs: convert to fileattr") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2021-05-20 08:31:22 -07:00
Darrick J. Wong	16c9de54dc	xfs: fix deadlock retry tracepoint arguments sc->ip is the inode that's being scrubbed, which means that it's not set for scrub types that don't involve inodes. If one of those scrubbers (e.g. inode btrees) returns EDEADLOCK, we'll trip over the null pointer. Fix that by reporting either the file being examined or the file that was used to call scrub. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>	2021-05-20 08:31:22 -07:00
Darrick J. Wong	676a659b60	xfs: retry allocations when locality-based search fails If a realtime allocation fails because we can't find a sufficiently large free extent satisfying locality rules, relax the locality rules and try again. This reduces the occurrence of short writes to realtime files when the write size is large and the free space is fragmented. This was originally discovered by running generic/186 with the realtime reflink patchset and a 128k cow extent size hint, but the short write symptoms can manifest with a 128k extent size hint and no reflink, so apply the fix now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>	2021-05-20 08:28:34 -07:00
Gulam Mohamed	bc6a385132	block: fix a race between del_gendisk and BLKRRPART When BLKRRPART is called concurrently with del_gendisk, the partitions rescan can create a stale partition that will never be be cleaned up. Fix this by checking the the disk is up before rescanning partitions while under bd_mutex. Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210514131842.1600568-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-20 07:59:35 -06:00
Christoph Hellwig	6c60ff048c	block: prevent block device lookups at the beginning of del_gendisk As an artifact of how gendisk lookup used to work in earlier kernels, GENHD_FL_UP is only cleared very late in del_gendisk, and a global lock is used to prevent opens from succeeding while del_gendisk is tearing down the gendisk. Switch to clearing the flag early and under bd_mutex so that callers can use bd_mutex to stabilize the flag, which removes the need for the global mutex. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210514131842.1600568-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-20 07:59:35 -06:00
Johannes Thumshirn	764c7c9a46	btrfs: zoned: fix parallel compressed writes When multiple processes write data to the same block group on a compressed zoned filesystem, the underlying device could report I/O errors and data corruption is possible. This happens because on a zoned file system, compressed data writes where sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel submission it cannot be guaranteed that the data is always submitted aligned to the underlying zone's write pointer. The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned filesystem is non intrusive on a regular file system or when submitting to a conventional zone on a zoned filesystem, as it is guarded by btrfs_use_zone_append. Reported-by: David Sterba <dsterba@suse.com> Fixes: `9d294a685f` ("btrfs: zoned: enable to mount ZONED incompat flag") CC: stable@vger.kernel.org # 5.12.x: `e380adfc21`: btrfs: zoned: pass start block to btrfs_use_zone_append CC: stable@vger.kernel.org # 5.12.x Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-20 15:51:07 +02:00
Johannes Thumshirn	e380adfc21	btrfs: zoned: pass start block to btrfs_use_zone_append btrfs_use_zone_append only needs the passed in extent_map's block_start member, so there's no need to pass in the full extent map. This also enables the use of btrfs_use_zone_append in places where we only have a start byte but no extent_map. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-20 15:50:49 +02:00
Pavel Begunkov	ba5ef6dc8a	io_uring: fortify tctx/io_wq cleanup We don't want anyone poking into tctx->io_wq awhile it's being destroyed by io_wq_put_and_exit(), and even though it shouldn't even happen, if buggy would be preferable to get a NULL-deref instead of subtle delayed failure or UAF. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-20 07:29:11 -06:00
Bob Peterson	f5456b5d67	gfs2: Clean up revokes on normal withdraws Before this patch, the system ail lists were cleaned up if the logd process withdrew, but on other withdraws, they were not cleaned up. This included the cleaning up of the revokes as well. This patch reorganizes things a bit so that all withdraws (not just logd) clean up the ail lists, including any pending revokes. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-20 13:31:37 +02:00
Bob Peterson	865cc3e9cc	gfs2: fix a deadlock on withdraw-during-mount Before this patch, gfs2 would deadlock because of the following sequence during mount: mount gfs2_fill_super gfs2_make_fs_rw <--- Detects IO error with glock kthread_stop(sdp->sd_quotad_process); <--- Blocked waiting for quotad to finish logd Detects IO error and the need to withdraw calls gfs2_withdraw gfs2_make_fs_ro kthread_stop(sdp->sd_quotad_process); <--- Blocked waiting for quotad to finish gfs2_quotad gfs2_statfs_sync gfs2_glock_wait <---- Blocked waiting for statfs glock to be granted glock_work_func do_xmote <---Detects IO error, can't release glock: blocked on withdraw glops->go_inval glock_blocked_by_withdraw requeue glock work & exit <--- work requeued, blocked by withdraw This patch makes a special exception for the statfs system inode glock, which allows the statfs glock UNLOCK to proceed normally. That allows the quotad daemon to exit during the withdraw, which allows the logd daemon to exit during the withdraw, which allows the mount to exit. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-20 13:31:37 +02:00
Bob Peterson	20265d9a67	gfs2: fix scheduling while atomic bug in glocks Before this patch, in the unlikely event that gfs2_glock_dq encountered a withdraw, it would do a wait_on_bit to wait for its journal to be recovered, but it never released the glock's spin_lock, which caused a scheduling-while-atomic error. This patch unlocks the lockref spin_lock before waiting for recovery. Fixes: `601ef0d52e` ("gfs2: Force withdraw to replay journals and wait for it to finish") Cc: stable@vger.kernel.org # v5.7+ Reported-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-20 13:31:37 +02:00
Bob Peterson	4194dec4b4	gfs2: Fix I_NEW check in gfs2_dinode_in Patch `4a378d8a0d` added a new check for I_NEW inodes, but unfortunately it used the wrong variable, i_flags. This caused GFS2 to withdraw when gfs2_lookup_by_inum needed to refresh an I_NEW inode. This patch switches to use the correct variable, i_state. Fixes: `4a378d8a0d` ("gfs2: be careful with inode refresh") Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-20 13:31:37 +02:00
Andreas Gruenbacher	43a511c44e	gfs2: Prevent direct-I/O write fallback errors from getting lost When a direct I/O write falls entirely and falls back to buffered I/O and the buffered I/O fails, the write failed with return value 0 instead of the error number reported by the buffered I/O. Fix that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-05-20 13:31:36 +02:00
Rohith Surabattula	0ab95c2510	Defer close only when lease is enabled. When smb2 lease parameter is disabled on server. Server grants batch oplock instead of RHW lease by default on open, inode page cache needs to be zapped immediatley upon close as cache is not valid. Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-19 21:11:28 -05:00
Rohith Surabattula	860b69a9d7	Fix kernel oops when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. Removed oplock_break_received flag which was added to achieve synchronization between oplock handler and open handler by earlier commit. It is not needed because there is an existing lock open_file_lock to achieve the same. find_readable_file takes open_file_lock and then traverses the openFileList. Similarly, cifs_oplock_break while closing the deferred handle (i.e cifsFileInfo_put) takes open_file_lock and then sends close to the server. Added comments for better readability. Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-19 21:11:26 -05:00
Jiapeng Chong	e83aa3528a	cifs: Fix inconsistent indenting Eliminate the follow smatch warning: fs/cifs/fs_context.c:1148 smb3_fs_context_parse_param() warn: inconsistent indenting. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-19 21:11:09 -05:00
Ronnie Sahlberg	d201d7631c	cifs: fix memory leak in smb2_copychunk_range When using smb2_copychunk_range() for large ranges we will run through several iterations of a loop calling SMB2_ioctl() but never actually free the returned buffer except for the final iteration. This leads to memory leaks everytime a large copychunk is requested. Fixes: `9bf0c9cd43` ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files") Cc: <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-19 19:19:20 -05:00
Linus Torvalds	c3d0e3fd41	fs.idmapped.mount_setattr.v5.13-rc3 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYKT8wgAKCRCRxhvAZXjc op2mAP9hyc3sp2/HvEuTYDc6LmljPNCqdKeCP1eiX5SZR0yMVwEAwR5xym7YeqYZ LRj+xjuvOmrSNhcxpZqLpXuPOcY78wM= =E4a/ -----END PGP SIGNATURE----- Merge tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux Pull mount_setattr fix from Christian Brauner: "This makes an underlying idmapping assumption more explicit. We currently don't have any filesystems that support idmapped mounts which are mountable inside a user namespace, i.e. where s_user_ns != init_user_ns. That was a deliberate decision for now as userns root can just mount the filesystem themselves. Express this restriction explicitly and enforce it until there's a real use-case for this. This way we can notice it and will have a chance to adapt and audit our translation helpers and fstests appropriately if we need to support such filesystems" * tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux: fs/mount_setattr: tighten permission checks	2021-05-19 06:12:31 -10:00
Steve French	c0d46717b9	SMB3: incorrect file id in requests compounded with open See MS-SMB2 3.2.4.1.4, file ids in compounded requests should be set to 0xFFFFFFFFFFFFFFFF (we were treating it as u32 not u64 and setting it incorrectly). Signed-off-by: Steve French <stfrench@microsoft.com> Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>	2021-05-19 10:10:58 -05:00
Eric W. Biederman	922e301304	signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo With the addition of ssi_perf_data and ssi_perf_type struct signalfd_siginfo is dangerously close to running out of space. All that remains is just enough space for two additional 64bit fields. A practice of adding all possible siginfo_t fields into struct singalfd_siginfo can not be supported as adding the missing fields ssi_lower, ssi_upper, and ssi_pkey would require two 64bit fields and one 32bit fields. In practice the fields ssi_perf_data and ssi_perf_type can never be used by signalfd as the signal that generates them always delivers them synchronously to the thread that triggers them. Therefore until someone actually needs the fields ssi_perf_data and ssi_perf_type in signalfd_siginfo remove them. This leaves a bit more room for future expansion. v1: https://lkml.kernel.org/r/20210503203814.25487-12-ebiederm@xmission.com v2: https://lkml.kernel.org/r/20210505141101.11519-12-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20210517195748.8880-5-ebiederm@xmission.com Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2021-05-18 16:20:54 -05:00
Eric W. Biederman	0683b53197	signal: Deliver all of the siginfo perf data in _perf Don't abuse si_errno and deliver all of the perf data in _perf member of siginfo_t. Note: The data field in the perf data structures in a u64 to allow a pointer to be encoded without needed to implement a 32bit and 64bit version of the same structure. There already exists a 32bit and 64bit versions siginfo_t, and the 32bit version can not include a 64bit member as it only has 32bit alignment. So unsigned long is used in siginfo_t instead of a u64 as unsigned long can encode a pointer on all architectures linux supports. v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2021-05-18 16:20:54 -05:00
Eric W. Biederman	9abcabe311	signal: Implement SIL_FAULT_TRAPNO Now that si_trapno is part of the union in _si_fault and available on all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid. There is room for future changes to reduce when si_trapno is valid but this is all that is needed to make si_trapno and the other members of the the union in _sigfault mutually exclusive. Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO and have the same code ignore si_trapno in in all other cases. v1: https://lkml.kernel.org/r/m1o8dvs7s7.fsf_-_@fess.ebiederm.org v2: https://lkml.kernel.org/r/20210505141101.11519-6-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20210517195748.8880-2-ebiederm@xmission.com Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2021-05-18 16:20:34 -05:00
Ondrej Mosnacek	5881fa8dc2	debugfs: fix security_locked_down() call for SELinux When (ia->ia_valid & (ATTR_MODE \| ATTR_UID \| ATTR_GID)) is zero, then the SELinux implementation of the locked_down hook might report a denial even though the operation would actually be allowed. To fix this, make sure that security_locked_down() is called only when the return value will be taken into account (i.e. when changing one of the problematic attributes). Note: this was introduced by commit `5496197f9b` ("debugfs: Restrict debugfs when the kernel is locked down"), but it didn't matter at that time, as the SELinux support came in later. Fixes: `59438b4647` ("security,lockdown,selinux: implement SELinux lockdown") Cc: stable <stable@vger.kernel.org> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Link: https://lore.kernel.org/r/20210507125304.144394-1-omosnace@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-18 18:05:59 +02:00
Gustavo A. R. Silva	c3754da3b7	reiserfs: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2021-05-17 19:01:38 -05:00
Linus Torvalds	8ac91e6c60	for-5.13-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCibywACgkQxWXV+ddt WDs8QhAAlJ1INZGF01lP2mUhzesVIctIAPGBf/77Zsxmcu0rA6E66RVVsYMgGU54 +FWd+LwuFCtC1364OnDa2DnmYtvHfgR4If7EGowpk3qzZFeZQSLqayOFa5tZLYPG tJStjY32QTerfZRoxPJ1QPcoWjxNMxYqYw/s68G3tTTSHEYtlH9zNHbLm9ny507x uPHpxqKXdv3/LYHLt6XUypFqsZkMoDW98oOKvo0MZE/fjcqiDcrvAoYe+y8raFC3 FztlfA2TBmmp/PouDXLCspXAksLpVo9mgTQ0kW4K7152cC0X/zWXYNH01uQ+qTAS OFNKt2DSRIq5TR56ZmReYvRgq0FNMotYpRpxoePSF/rwL+wnsTl7QI3r/d/h/uxQ IzBeBv1Wd+1ZJcqnmEGx8Mws3nGswKyl4W65x8yin41djVoHgM4tYu3nGqielu+w ifEBmU5tUGo05z2HA5kpLjDzc6MwWaCIduQvjH/I4Vgo9fhDo6pQO2dyPC50Nkk5 DQ5jfxiXJ/ZSh5NbWtIkB/OQuwkVL1nDy2jtj3qnK06HDKstK1zui5nccFKFNOiX wtYjnGqd3+vIGIZniMuu9rbPLtG4CCerq44v1gyS6LSEycNW9/r2cOXRaiQk5pej CoYMdnmAqzwidtn4FZPRNQ7JgyckKCXQQSGCazN2vvLCXisCUrw= =ue6o -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes: - fix fiemap to print extents that could get misreported due to internal extent splitting and logical merging for fiemap output - fix RCU stalls during delayed iputs - fix removed dentries still existing after log is synced" * tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix removed dentries still existing after log is synced btrfs: return whole extents in fiemap btrfs: avoid RCU stalls while running delayed iputs btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error	2021-05-17 09:55:10 -07:00
Josef Bacik	91df99a6eb	btrfs: do not BUG_ON in link_to_fixup_dir While doing error injection testing I got the following panic kernel BUG at fs/btrfs/tree-log.c:1862! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 7836 Comm: mount Not tainted 5.13.0-rc1+ #305 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:link_to_fixup_dir+0xd5/0xe0 RSP: 0018:ffffb5800180fa30 EFLAGS: 00010216 RAX: fffffffffffffffb RBX: 00000000fffffffb RCX: ffff8f595287faf0 RDX: ffffb5800180fa37 RSI: ffff8f5954978800 RDI: 0000000000000000 RBP: ffff8f5953af9450 R08: 0000000000000019 R09: 0000000000000001 R10: 000151f408682970 R11: 0000000120021001 R12: ffff8f5954978800 R13: ffff8f595287faf0 R14: ffff8f5953c77dd0 R15: 0000000000000065 FS: 00007fc5284c8c40(0000) GS:ffff8f59bbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc5287f47c0 CR3: 000000011275e002 CR4: 0000000000370ee0 Call Trace: replay_one_buffer+0x409/0x470 ? btree_read_extent_buffer_pages+0xd0/0x110 walk_up_log_tree+0x157/0x1e0 walk_log_tree+0xa6/0x1d0 btrfs_recover_log_trees+0x1da/0x360 ? replay_one_extent+0x7b0/0x7b0 open_ctree+0x1486/0x1720 btrfs_mount_root.cold+0x12/0xea ? __kmalloc_track_caller+0x12f/0x240 legacy_get_tree+0x24/0x40 vfs_get_tree+0x22/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? vfs_parse_fs_string+0x4d/0x90 legacy_get_tree+0x24/0x40 vfs_get_tree+0x22/0xb0 path_mount+0x433/0xa10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae We can get -EIO or any number of legitimate errors from btrfs_search_slot(), panicing here is not the appropriate response. The error path for this code handles errors properly, simply return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-17 15:49:24 +02:00
Filipe Manana	6416954ca7	btrfs: release path before starting transaction when cloning inline extent When cloning an inline extent there are a few cases, such as when we have an implicit hole at file offset 0, where we start a transaction while holding a read lock on a leaf. Starting the transaction results in a call to sb_start_intwrite(), which results in doing a read lock on a percpu semaphore. Lockdep doesn't like this and complains about it: [46.580704] ====================================================== [46.580752] WARNING: possible circular locking dependency detected [46.580799] 5.13.0-rc1 #28 Not tainted [46.580832] ------------------------------------------------------ [46.580877] cloner/3835 is trying to acquire lock: [46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0 [46.581167] [46.581167] but task is already holding lock: [46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0 [46.581293] [46.581293] which lock already depends on the new lock. [46.581293] [46.581351] [46.581351] the existing dependency chain (in reverse order) is: [46.581410] [46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}: [46.581464] down_read_nested+0x68/0x200 [46.581536] __btrfs_tree_read_lock+0x70/0x1d0 [46.581577] btrfs_read_lock_root_node+0x88/0x200 [46.581623] btrfs_search_slot+0x298/0xb70 [46.581665] btrfs_set_inode_index+0xfc/0x260 [46.581708] btrfs_new_inode+0x26c/0x950 [46.581749] btrfs_create+0xf4/0x2b0 [46.581782] lookup_open.isra.57+0x55c/0x6a0 [46.581855] path_openat+0x418/0xd20 [46.581888] do_filp_open+0x9c/0x130 [46.581920] do_sys_openat2+0x2ec/0x430 [46.581961] do_sys_open+0x90/0xc0 [46.581993] system_call_exception+0x3d4/0x410 [46.582037] system_call_common+0xec/0x278 [46.582078] [46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}: [46.582135] __lock_acquire+0x1e90/0x2c50 [46.582176] lock_acquire+0x2b4/0x5b0 [46.582263] start_transaction+0x3cc/0x950 [46.582308] clone_copy_inline_extent+0xe4/0x5a0 [46.582353] btrfs_clone+0x5fc/0x880 [46.582388] btrfs_clone_files+0xd8/0x1c0 [46.582434] btrfs_remap_file_range+0x3d8/0x590 [46.582481] do_clone_file_range+0x10c/0x270 [46.582558] vfs_clone_file_range+0x1b0/0x310 [46.582605] ioctl_file_clone+0x90/0x130 [46.582651] do_vfs_ioctl+0x874/0x1ac0 [46.582697] sys_ioctl+0x6c/0x120 [46.582733] system_call_exception+0x3d4/0x410 [46.582777] system_call_common+0xec/0x278 [46.582822] [46.582822] other info that might help us debug this: [46.582822] [46.582888] Possible unsafe locking scenario: [46.582888] [46.582942] CPU0 CPU1 [46.582984] ---- ---- [46.583028] lock(btrfs-tree-00); [46.583062] lock(sb_internal#2); [46.583119] lock(btrfs-tree-00); [46.583174] lock(sb_internal#2); [46.583212] [46.583212] * DEADLOCK * [46.583212] [46.583266] 6 locks held by cloner/3835: [46.583299] #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130 [46.583382] #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0 [46.583477] #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0 [46.583574] #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590 [46.583657] #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590 [46.583743] #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0 [46.583828] [46.583828] stack backtrace: [46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28 [46.583931] Call Trace: [46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable) [46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400 [46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190 [46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50 [46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0 [46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950 [46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0 [46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880 [46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0 [46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590 [46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270 [46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310 [46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130 [46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0 [46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120 [46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410 [46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278 [46.585114] --- interrupt: c00 at 0x7ffff7e22990 [46.585160] NIP: 00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000 [46.585224] REGS: c0000000167c7e80 TRAP: 0c00 Not tainted (5.13.0-rc1) [46.585280] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28000244 XER: 00000000 [46.585374] IRQMASK: 0 [46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004 [46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000 [46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000 [46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000 [46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000 [46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004 [46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f [46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990 [46.585964] LR [00000001000010ec] 0x1000010ec [46.586010] --- interrupt: c00 This should be a false positive, as both locks are acquired in read mode. Nevertheless, we don't need to hold a leaf locked when we start the transaction, so just release the leaf (path) before starting it. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-17 15:49:19 +02:00
Pavel Begunkov	7a27472770	io_uring: don't modify req->poll for rw __io_queue_proc() is used by both poll and apoll, so we should not access req->poll directly but selecting right struct io_poll_iocb depending on use case. Reported-and-tested-by: syzbot+a84b8783366ecb1c65d0@syzkaller.appspotmail.com Fixes: `ea6a693d86` ("io_uring: disable multishot poll for double poll add cases") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a6a1de31142d8e0250fe2dfd4c8923d82a5bbfc.1621251795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-17 07:28:48 -06:00
Andy Shevchenko	776797f1bd	nilfs2: Switch to use %ptTs Use %ptTs instead of open coded variant to print contents of time64_t type in human readable form. Use sysfs_emit() at the same time in the changed functions. Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: linux-nilfs@vger.kernel.org Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210511153958.34527-3-andriy.shevchenko@linux.intel.com	2021-05-17 12:01:46 +02:00
wenhuizhang	4236a26a6b	cifs: remove deadstore in cifs_close_all_deferred_files() Deadstore detected by Lukas Bulwahn's CodeChecker Tool (ELISA group). line 741 struct cifsInodeInfo *cinode; line 747 cinode = CIFS_I(d_inode(cfile->dentry)); could be deleted. cinode on filesystem should not be deleted when files are closed, they are representations of some data fields on a physical disk, thus no further action is required. The virtual inode on vfs will be handled by vfs automatically, and the denotation is inode, which is different from the cinode. Signed-off-by: wenhuizhang <wenhui@gwmail.gwu.edu> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-05-16 23:05:46 -05:00
Darrick J. Wong	9d5e8492ee	xfs: adjust rt allocation minlen when extszhint > rtextsize xfs_bmap_rtalloc doesn't handle realtime extent files with extent size hints larger than the rt volume's extent size properly, because xfs_bmap_extsize_align can adjust the offset/length parameters to try to fit the extent size hint. Under these conditions, minlen has to be large enough so that any allocation returned by xfs_rtallocate_extent will be large enough to cover at least one of the blocks that the caller asked for. If the allocation is too short, bmapi_write will return no mapping for the requested range, which causes ENOSPC errors in other parts of the filesystem. Therefore, adjust minlen upwards to fix this. This can be found by running generic/263 (g/127 or g/522) with a realtime extent size hint that's larger than the rt volume extent size. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>	2021-05-16 18:45:03 -07:00
Linus Torvalds	a4147415bd	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: resource, squashfs, hfsplus, modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan, pagemap, and ioremap)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/ioremap: fix iomap_max_page_shift docs: admin-guide: update description for kernel.modprobe sysctl hfsplus: prevent corruption in shrinking truncate mm/filemap: fix readahead return types kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled mm: fix struct page layout on 32-bit systems ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()" userfaultfd: release page in error path to avoid BUG_ON squashfs: fix divide error in calculate_skip() kernel/resource: fix return code check in __request_free_mem_region mm, slub: move slub_debug static key enabling outside slab_mutex mm/hugetlb: fix cow where page writtable in child mm/hugetlb: fix F_SEAL_FUTURE_WRITE	2021-05-15 09:42:27 -07:00
Linus Torvalds	5601591035	io_uring-5.13-2021-05-14 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCew+oQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnAAD/0RGU6BTpYX0AjSuHtHPsGxAWlLroe7Yvew BBXX58uL9LqSYDe+FfCherA7GbyLdXrN9yvbeKVEZH7wmFV0u6dGX/RiK8lpWvfd pMTSf14QkASkoQ5bMQURdSv73OruzKQN7CisN3btwD2sDcqZqz7RsFuWf5Fuxs4r UjyQdJpt+sNs1UTvHBjqQcCrAipEVWePH93/jhayx8iBykab4+aNKFtysjqYJdD0 LL5NG5LihP5G2WECD5Q7vDmb9+km3cN5TJLhHSDsmQg4Ln6U9zd4X3bnvEXNtlWk edNNhKVmS8rtwK2qiZCoVlR4HrSjCCjUg/0h6hyOL8AYNV9vPup/0EuWfRKxLE+3 l3TRTO02/SM8Tjdu27lYtxFYnIkIgRv+w2/ZmURzwnPpIvjwbdfth5DN+10bFnUV IPKcEvMXhbgdyQ5OtA1oPk3udWesrk836s2W6kqBLSEeqFrb0UbI8A40VXxoAfVQ Ig5LmuuDAlZzt4fCu3GYhVZS1jj2CXuBsGrsbSVZaJGbMu9MPbmMUoz6XBS3lsY6 gnhYv2paMuOo/hD6q4XeCH4j1jveLXgzenW3fzEP4E0wxfvMkybyWCwfW14a15Q+ Sr8VEEUTc74RfW5pP0ZTvrYGnR+oJwB1RacdbU5WpOrB01A5bWkmb0fRNHfj8vjH h49oIdqZKw== =5+hs -----END PGP SIGNATURE----- Merge tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Just a few minor fixes/changes: - Fix issue with double free race for linked timeout completions - Fix reference issue with timeouts - Remove last few places that make SQPOLL special, since it's just an io thread now. - Bump maximum allowed registered buffers, as we don't allocate as much anymore" * tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block: io_uring: increase max number of reg buffers io_uring: further remove sqpoll limits on opcodes io_uring: fix ltout double free on completion race io_uring: fix link timeout refs	2021-05-15 08:43:44 -07:00
Linus Torvalds	41f035c062	Changes since last update: - update documentation to fix the broken illustration due to ReST conversion by accident at that time and complete the big pcluster introduction; - fix 1 lcluster-sized pclusters for the big pcluster feature. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYJ8DGBEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBAC0AQDaap8fSTWMLroLLBCcr1MwTqoS6wf44tx8 iq2FFcU/hQD+PqrnCFJW7wjWjMC84weOudRvh2/lu/GKH2a5LgJ5Xgs= =UTkq -----END PGP SIGNATURE----- Merge tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "This mainly fixes 1 lcluster-sized pclusters for the big pcluster feature, which can be forcely generated by mkfs as a specific on-disk case for per-(sub)file compression strategies but missed to handle in runtime properly. Also, documentation updates are included to fix the broken illustration due to the ReST conversion by accident and complete the big pcluster introduction. Summary: - update documentation to fix the broken illustration due to ReST conversion by accident at that time and complete the big pcluster introduction - fix 1 lcluster-sized pclusters for the big pcluster feature" * tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix 1 lcluster-sized pcluster for big pcluster erofs: update documentation about data compression erofs: fix broken illustration in documentation	2021-05-15 08:37:21 -07:00
Linus Torvalds	393f42f113	dax fixes for 5.13-rc2 - Fix a hang condition (missed wakeups with virtiofs when invalidating entries) -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAmCfBboACgkQHtKRamZ9 iAIiHQ/+LqD0USAXxWQFcDupTATVy0Z/hpUCBWcEKII/ljluUWLLkGUT2/Gy3TXE 0HZmJBWyJyqNRyWtzNZ8hu4FpxSawtYVkqTv0/ODAjrpva9m8p4eVYFp0UpTHn3d KL/DD+VeLWs1yoPIXgqd2dSwV2YsAJSEYYXcF0CYeHOWH4BVGrOglQBL7kJyra6n IQsnXGJQMXkOoDMB/5xTI7LgYD0R09OevsHE6Eupxm9SI8ud2qUQlBLde8Eh+7qb pMhkeNNjG2w461C8215rhGPzCweMMasiBwUz1EHXDpXebZSsDfURwBWMCFbe/H7p x3u0s3hlJydTZmUnaMeWje+wR1Ku8YXiBeelMobpXi4RzNyebhZ0Fap3fMDbrR8/ 5mro6H9blEYGZ1kISHSdvZUfh6uzWiL8hs+uBb/ANICZouValjyVrHuTauwncyQP PHaKZYo/kh6Hj3j1LYDHbMs69Cbr+E0x/JFnYAxIkZSggYJeXN9+3K9hhUXcQNIf Lh4p1F/t7DmIXzljFu6qwJl9JmCC+yx4PcSgOqa6vPvm2H6KEH+rMCLHtu+WgaXq 1Gj9EI1sshTXgot8Y1xlPCCTLNqxhV0O30L+EsasmjNCjWwVRi2zz+FjkgFAeDvo 7LZUNVepC9YMffknBNGkfNibfVBn5/DxbGR/9SWygHy8ahECoLc= =cWwB -----END PGP SIGNATURE----- Merge tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fixes from Dan Williams: "A fix for a hang condition due to missed wakeups in the filesystem-dax core when exercised by virtiofs. This bug has been there from the beginning, but the condition has not triggered on other filesystems since they hold a lock over invalidation events" * tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Wake up all waiters after invalidating dax entry dax: Add a wakeup mode parameter to put_unlocked_entry() dax: Add an enum for specifying dax wakup mode	2021-05-15 08:28:08 -07:00
Jouni Roivas	c3187cf322	hfsplus: prevent corruption in shrinking truncate I believe there are some issues introduced by commit `31651c6071` ("hfsplus: avoid deadlock on file truncation") HFS+ has extent records which always contains 8 extents. In case the first extent record in catalog file gets full, new ones are allocated from extents overflow file. In case shrinking truncate happens to middle of an extent record which locates in extents overflow file, the logic in hfsplus_file_truncate() was changed so that call to hfs_brec_remove() is not guarded any more. Right action would be just freeing the extents that exceed the new size inside extent record by calling hfsplus_free_extents(), and then check if the whole extent record should be removed. However since the guard (blk_cnt > start) is now after the call to hfs_brec_remove(), this has unfortunate effect that the last matching extent record is removed unconditionally. To reproduce this issue, create a file which has at least 10 extents, and then perform shrinking truncate into middle of the last extent record, so that the number of remaining extents is not under or divisible by 8. This causes the last extent record (8 extents) to be removed totally instead of truncating into middle of it. Thus this causes corruption, and lost data. Fix for this is simply checking if the new truncated end is below the start of this extent record, making it safe to remove the full extent record. However call to hfs_brec_remove() can't be moved to it's previous place since we're dropping ->tree_lock and it can cause a race condition and the cached info being invalidated possibly corrupting the node data. Another issue is related to this one. When entering into the block (blk_cnt > start) we are not holding the ->tree_lock. We break out from the loop not holding the lock, but hfs_find_exit() does unlock it. Not sure if it's possible for someone else to take the lock under our feet, but it can cause hard to debug errors and premature unlocking. Even if there's no real risk of it, the locking should still always be kept in balance. Thus taking the lock now just before the check. Link: https://lkml.kernel.org/r/20210429165139.3082828-1-jouni.roivas@tuxera.com Fixes: `31651c6071` ("hfsplus: avoid deadlock on file truncation") Signed-off-by: Jouni Roivas <jouni.roivas@tuxera.com> Reviewed-by: Anton Altaparmakov <anton@tuxera.com> Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-14 19:41:32 -07:00
Matthew Wilcox (Oracle)	076171a677	mm/filemap: fix readahead return types A readahead request will not allocate more memory than can be represented by a size_t, even on systems that have HIGHMEM available. Change the length functions from returning an loff_t to a size_t. Link: https://lkml.kernel.org/r/20210510201201.1558972-1-willy@infradead.org Fixes: `32c0a6bcaa` ("btrfs: add and use readahead_batch_length") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-14 19:41:32 -07:00
Phillip Lougher	d6e621de1f	squashfs: fix divide error in calculate_skip() Sysbot has reported a "divide error" which has been identified as being caused by a corrupted file_size value within the file inode. This value has been corrupted to a much larger value than expected. Calculate_skip() is passed i_size_read(inode) >> msblk->block_log. Due to the file_size value corruption this overflows the int argument/variable in that function, leading to the divide error. This patch changes the function to use u64. This will accommodate any unexpectedly large values due to corruption. The value returned from calculate_skip() is clamped to be never more than SQUASHFS_CACHED_BLKS - 1, or 7. So file_size corruption does not lead to an unexpectedly large return result here. Link: https://lkml.kernel.org/r/20210507152618.9447-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: <syzbot+e8f781243ce16ac2f962@syzkaller.appspotmail.com> Reported-by: <syzbot+7b98870d4fec9447b951@syzkaller.appspotmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-14 19:41:32 -07:00
Peter Xu	22247efd82	mm/hugetlb: fix F_SEAL_FUTURE_WRITE Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2. Hugh reported issue with F_SEAL_FUTURE_WRITE not applied correctly to hugetlbfs, which I can easily verify using the memfd_test program, which seems that the program is hardly run with hugetlbfs pages (as by default shmem). Meanwhile I found another probably even more severe issue on that hugetlb fork won't wr-protect child cow pages, so child can potentially write to parent private pages. Patch 2 addresses that. After this series applied, "memfd_test hugetlbfs" should start to pass. This patch (of 2): F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day. There is a test program for that and it fails constantly. $ ./memfd_test hugetlbfs memfd-hugetlb: CREATE memfd-hugetlb: BASIC memfd-hugetlb: SEAL-WRITE memfd-hugetlb: SEAL-FUTURE-WRITE mmap() didn't fail as expected Aborted (core dumped) I think it's probably because no one is really running the hugetlbfs test. Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in shmem_mmap(). Generalize a helper for that. Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com Fixes: `ab3948f58f` ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Hugh Dickins <hughd@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-14 19:41:32 -07:00

... 2 3 4 5 6 ...

71020 Commits