IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
We do not want to create initialized extents beyond end of file because
for e2fsck it is impossible to distinguish them from a case of corrupted
file size / extent tree and so it complains like:
Inode 12, i_size is 147456, should be 163840. Fix? no
Code in ext4_ext_convert_to_initialized() and
ext4_split_convert_extents() try to make sure it does not create
initialized extents beyond inode size however they check against
inode->i_size which is wrong. They should instead check against
EXT4_I(inode)->i_disksize which is the current inode size on disk.
That's what e2fsck is going to see in case of crash before all dirty
data is written. This bug manifests as generic/456 test failure (with
recent enough fstests where fsx got fixed to properly pass
FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock
mount option.
CC: stable@vger.kernel.org
Fixes: 21ca087a38 ("ext4: Do not zero out uninitialized extents beyond i_size")
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit a8ac900b81 ("ext4: use non-movable memory for the
superblock") buffers for ext4 superblock were allocated using
the sb_bread_unmovable() helper which allocated buffer heads
out of non-movable memory blocks. It was necessarily to not block
page migrations and do not cause cma allocation failures.
However commit 85c8f176a6 ("ext4: preload block group descriptors")
broke this by introducing pre-reading of the ext4 superblock.
The problem is that __breadahead() is using __getblk() underneath,
which allocates buffer heads out of movable memory.
It resulted in page migration failures I've seen on a machine
with an ext4 partition and a preallocated cma area.
Fix this by introducing sb_breadahead_unmovable() and
__breadahead_gfp() helpers which use non-movable memory for buffer
head allocations and use them for the ext4 superblock readahead.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Fixes: 85c8f176a6 ("ext4: preload block group descriptors")
Signed-off-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Found a read performance issue when linux kernel page size is 64KB.
If linux kernel page size is 64KB and mount options cache=strict &
vers=2.1+, it does not support cifs_readpages(). Instead, it is using
cifs_readpage() and cifs_read() with maximum read IO size 16KB, which is
much slower than read IO size 1MB when negotiated SMB 2.1+. Since modern
SMB server supported SMB 2.1+ and Max Read Size can reach more than 64KB
(for example 1MB ~ 8MB), this patch check max_read instead of maxBuf to
determine whether server support readpages() and improve read performance
for page size 64KB & cache=strict & vers=2.1+, and for SMB1 it is more
cleaner to initialize server->max_read to server->maxBuf.
The client is a linux box with linux kernel 4.2.8,
page size 64KB (CONFIG_ARM64_64K_PAGES=y),
cpu arm 1.7GHz, and use mount.cifs as smb client.
The server is another linux box with linux kernel 4.2.8,
share a file '10G.img' with size 10GB,
and use samba-4.7.12 as smb server.
The client mount a share from the server with different
cache options: cache=strict and cache=none,
mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy>
mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy>
The client download a 10GbE file from the server across 1GbE network,
dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240
dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240
Found that cache=strict (without patch) is slower read throughput and
smaller read IO size than cache=none.
cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB
cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB
cache=none: read throughput 109MB/s, read IO size is 1MB
Looks like if page size is 64KB,
cifs_set_ops() would use cifs_addr_ops_smallbuf instead of cifs_addr_ops,
/* check if server can support readpages */
if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf <
PAGE_SIZE + MAX_CIFS_HDR_SIZE)
inode->i_data.a_ops = &cifs_addr_ops_smallbuf;
else
inode->i_data.a_ops = &cifs_addr_ops;
maxBuf is came from 2 places, SMB2_negotiate() and CIFSSMBNegotiate(),
(SMB2_MAX_BUFFER_SIZE is 64KB)
SMB2_negotiate():
/* set it to the maximum buffer size value we can send with 1 credit */
server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),
SMB2_MAX_BUFFER_SIZE);
CIFSSMBNegotiate():
server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize);
Page size 64KB and cache=strict lead to read_pages() use cifs_readpage()
instead of cifs_readpages(), and then cifs_read() using maximum read IO
size 16KB, which is much slower than maximum read IO size 1MB.
(CIFSMaxBufSize is 16KB by default)
/* FIXME: set up handlers for larger reads and/or convert to async */
rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize);
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jones Syue <jonessyue@qnap.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We already dump these keys for SMB3, lets also dump it for SMB2
sessions so that we can use the session key in wireshark to check and validate
that the signatures are correct.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
When checking for draining with __req_need_defer(), it tries to match
how many requests were sent before a current one with number of already
completed. Dropped SQEs are included in req->sequence, and they won't
ever appear in CQ. To compensate for that, __req_need_defer() substracts
ctx->cached_sq_dropped.
However, what it should really use is number of SQEs dropped __before__
the current one. In other words, any submitted request shouldn't
shouldn't affect dequeueing from the drain queue of previously submitted
ones.
Instead of saving proper ctx->cached_sq_dropped in each request,
substract from req->sequence it at initialisation, so it includes number
of properly submitted requests.
note: it also changes behaviour of timeouts, but
1. it's already diverge from the description because of using SQ
2. the description is ambiguous regarding dropped SQEs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->timeout.count and req->io->timeout.seq_offset store the same value,
which is sqe->off. Kill the second one
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_timeout() can be executed asynchronously by a worker and without
holding ctx->uring_lock
1. using ctx->cached_sq_head there is racy there
2. it should count events from a moment of timeout's submission, but
not execution
Use req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull btrfs fixes from David Sterba:
"We have a few regressions and one fix for stable:
- revert fsync optimization
- fix lost i_size update
- fix a space accounting leak
- build fix, add back definition of a deprecated ioctl flag
- fix search condition for old roots in relocation"
* tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: re-instantiate the removed BTRFS_SUBVOL_CREATE_ASYNC definition
btrfs: fix reclaim counter leak of space_info objects
btrfs: make full fsyncs always operate on the entire file again
btrfs: fix lost i_size update after cloning inline extent
btrfs: check commit root generation in should_ignore_root
We need to drop the inode spinlock while calling nfs4_select_rw_stateid(),
since nfs4_copy_delegation_stateid() could take the delegation lock.
Note that it is safe to do this, since all other calls to
pnfs_update_layout() for that inode will find themselves blocked by
the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET.
Fixes: fc51b1cf39 ("NFS: Beware when dereferencing the delegation cred")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The new async dirops callback routines can pass ERR_PTR values to
ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
even if ceph_mdsc_build_path fails.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the request has been marked as canceled, don't try and issue it.
Instead just fill a canceled event and finish the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We added this for just the regular poll requests in commit a6ba632d2c
("io_uring: retry poll if we got woken with non-matching mask"), we
should do the same for the poll handler used pollable async requests.
Move the re-wait check and arm into a helper, and call it from
io_async_task_func() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized. Specifically, br_state was not
being copied from the passed-in struct to the uirec. This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The filesystem freeze sequence in XFS waits on any background
eofblocks or cowblocks scans to complete before the filesystem is
quiesced. At this point, the freezer has already stopped the
transaction subsystem, however, which means a truncate or cowblock
cancellation in progress is likely blocked in transaction
allocation. This results in a deadlock between freeze and the
associated scanner.
Fix this problem by holding superblock write protection across calls
into the block reapers. Since protection for background scans is
acquired from the workqueue task context, trylock to avoid a similar
deadlock between freeze and blocking on the write lock.
Fixes: d6b636ebb1 ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
New struct nfsd4_blocked_lock allocated in find_or_allocate_block()
does not initialized nbl_list and nbl_lru.
If conflock allocation fails rollback can call list_del_init()
access uninitialized fields and corrupt memory.
v2: just initialize nbl_list and nbl_lru right after nbl allocation.
Fixes: 76d348fadf ("nfsd: have nfsd4_lock use blocking locks for v4.1+ lock")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If a dentry's version is somewhere between invalid_before and the current
directory version, we should be setting it forward to the current version,
not backwards to the invalid_before version. Note that we're only doing
this at all because dentry::d_fsdata isn't large enough on a 32-bit system.
Fix this by using a separate variable for invalid_before so that we don't
accidentally clobber the current dir version.
Fixes: a4ff7401fb ("afs: Keep track of invalid-before version for dentry coherency")
Signed-off-by: David Howells <dhowells@redhat.com>
AFS directories are retained locally as a structured file, with lookup
being effected by a local search of the file contents. When a modification
(such as mkdir) happens, the dir file content is modified locally rather
than redownloading the directory.
The directory contents are accessed in a number of ways, with a number of
different locks schemes:
(1) Download of contents - dvnode->validate_lock/write in afs_read_dir().
(2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(),
downgrading from (1) if necessary.
(3) d_revalidate of child dentry - dvnode->validate_lock/read in
afs_do_lookup_one() downgrading from (1) if necessary.
(4) Edit of dir after modification - page locks on individual dir pages.
Unfortunately, because (4) uses different locking scheme to (1) - (3),
nothing protects against the page being scanned whilst the edit is
underway. Even download is not safe as it doesn't lock the pages - relying
instead on the validate_lock to serialise as a whole (the theory being that
directory contents are treated as a block and always downloaded as a
block).
Fix this by write-locking dvnode->validate_lock around the edits. Care
must be taken in the rename case as there may be two different dirs - but
they need not be locked at the same time. In any case, once the lock is
taken, the directory version must be rechecked, and the edit skipped if a
later version has been downloaded by revalidation (there can't have been
any local changes because the VFS holds the inode lock, but there can have
been remote changes).
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the length of the dump of a bad YFSFetchStatus record. The function
was copied from the AFS version, but the YFS variant contains bigger fields
and extra information, so expand the dump to match.
Signed-off-by: David Howells <dhowells@redhat.com>
The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only
decode the second file status returned unless the parent directories are
different - unfortunately, this means that the xdr pointer isn't advanced
and the volsync record will be read incorrectly in such an instance.
Fix this by always decoding the second status into the second
status/callback block which wasn't being used if the dirs were the same.
The afs_update_dentry_version() calls that update the directory data
version numbers on the dentries can then unconditionally use the second
status record as this will always reflect the state of the destination dir
(the two records will be identical if the destination dir is the same as
the source dir)
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
If we're decoding an AFSFetchStatus record and we see that the version is 1
and the abort code is set and we're expecting inline errors, then we store
the abort code and ignore the remaining status record (which is correct),
but we don't set the flag to say we got a valid abort code.
This can affect operation of YFS.RemoveFile2 when removing a file and the
operation of {,Y}FS.InlineBulkStatus when prospectively constructing or
updating of a set of inodes during a lookup.
Fix this to indicate the reception of a valid abort code.
Fixes: a38a75581e ("afs: Fix unlink to handle YFS.RemoveFile2 better")
Signed-off-by: David Howells <dhowells@redhat.com>
If we receive a status record that has VNOVNODE set in the abort field,
xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() don't advance
the XDR pointer, thereby corrupting anything subsequent decodes from the
same block of data.
This has the potential to affect AFS.InlineBulkStatus and
YFS.InlineBulkStatus operation, but probably doesn't since the status
records are extracted as individual blocks of data and the buffer pointer
is reset between blocks.
It does affect YFS.RemoveFile2 operation, corrupting the volsync record -
though that is not currently used.
Other operations abort the entire operation rather than returning an error
inline, in which case there is no decoding to be done.
Fix this by unconditionally advancing the xdr pointer.
Fixes: 684b0f68cf ("afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility")
Signed-off-by: David Howells <dhowells@redhat.com>
The splice file punt check uses file->f_mode to check for O_NONBLOCK,
but it should be checking file->f_flags. This leads to punting even
for files that have O_NONBLOCK set, which isn't necessary. This equates
to checking for FMODE_PATH, which will never be set on the fd in
question.
Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cifs fixes from Steve French:
"Ten cifs/smb fixes:
- five RDMA (smbdirect) related fixes
- add experimental support for swap over SMB3 mounts
- also a fix which improves performance of signed connections"
* tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: enable swap on SMB3 mounts
smb3: change noisy error message to FYI
smb3: smbdirect support can be configured by default
cifs: smbd: Do not schedule work to send immediate packet on every receive
cifs: smbd: Properly process errors on ib_post_send
cifs: Allocate crypto structures on the fly for calculating signatures of incoming packets
cifs: smbd: Update receive credits before sending and deal with credits roll back on failure before sending
cifs: smbd: Check send queue size before posting a send
cifs: smbd: Merge code to track pending packets
cifs: ignore cached share root handle closing errors
Pull NFS client bugfix from Trond Myklebust:
"Fix an RCU read lock leakage in pnfs_alloc_ds_commits_list()"
* tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS: Fix RCU lock leakage
Requests initialisation is scattered across several functions, namely
io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it
in io_init_req() for better data locality and code clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a good idea to not read sqe->flags twice, as it's prone to security
bugs. Instead of passing it around, embeed them in req->flags. It's
already so except for IOSQE_IO_LINK.
1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD
2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK
And leave req_set_fail_links() using new REQ_F_LINK, because it's more
sensible.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having only one place for cleaning up a request after a link assembly/
submission failure will play handy in the future. At least it allows
to remove duplicated cleanup sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As a preparation for extracting request init bits, remove self-coded mm
tracking from io_submit_sqes(), but rely on current->mm. It's more
convenient, than passing this piece of state in other functions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If io_submit_sqes() can't grab an mm, it fails and exits right away.
There is no need to track the fact of the failure. Remove @mm_fault.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Another brown paper bag moment. pnfs_alloc_ds_commits_list() is leaking
the RCU lock.
Fixes: a9901899b6 ("pNFS: Add infrastructure for cleaning up per-layout commit structures")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge yet more updates from Andrew Morton:
- Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
gup, hugetlb, pagemap, memremap)
- Various other things (hfs, ocfs2, kmod, misc, seqfile)
* akpm: (34 commits)
ipc/util.c: sysvipc_find_ipc() should increase position index
kernel/gcov/fs.c: gcov_seq_next() should increase position index
fs/seq_file.c: seq_read(): add info message about buggy .next functions
drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
change email address for Pali Rohár
selftests: kmod: test disabling module autoloading
selftests: kmod: fix handling test numbers above 9
docs: admin-guide: document the kernel.modprobe sysctl
fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
kmod: make request_module() return an error when autoloading is disabled
mm/memremap: set caching mode for PCI P2PDMA memory to WC
mm/memory_hotplug: add pgprot_t to mhp_params
powerpc/mm: thread pgprot_t through create_section_mapping()
x86/mm: introduce __set_memory_prot()
x86/mm: thread pgprot_t through init_memory_mapping()
mm/memory_hotplug: rename mhp_restrictions to mhp_params
mm/memory_hotplug: drop the flags field from struct mhp_restrictions
mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
mm/vma: introduce VM_ACCESS_FLAGS
mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
...
Pull orangefs updates from Mike Marshall:
"A fix and two cleanups.
Fix:
- Christoph Hellwig noticed that some logic I added to
orangefs_file_read_iter introduced a race condition, so he sent a
reversion patch. I had to modify his patch since reverting at this
point broke Orangefs.
Cleanups:
- Christoph Hellwig noticed that we were doing some unnecessary work
in orangefs_flush, so he sent in a patch that removed the un-needed
code.
- Al Viro told me he had trouble building Orangefs. Orangefs should
be easy to build, even for Al :-).
I looked back at the test server build notes in orangefs.txt, just
in case that's where the trouble really is, and found a couple of
typos and made a couple of clarifications"
* tag 'for-linus-5.7-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: clarify build steps for test server in orangefs.txt
orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush
orangefs: get rid of knob code...
Patch series "seq_file .next functions should increase position index".
In Aug 2018 NeilBrown noticed commit 1f4aace60b ("fs/seq_file.c:
simplify seq_file iteration code and interface")
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed. A simple
demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size
larger than the size of /proc/swaps. This will always show the whole
last line of /proc/swaps"
Described problem is still actual. If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.
$ dd if=/proc/swaps bs=1 # usual output
Filename Type Size Used Priority
/dev/dm-0 partition 4194812 97536 -2
104+0 records in
104+0 records out
104 bytes copied
$ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0 partition 4194812 97536 -2
/dev/dm-0 partition 4194812 97536 -2
3+1 records in
3+1 records out
131 bytes copied
There are lot of other affected files, I've found 30+ including
/proc/net/ip_tables_matches and /proc/sysvipc/*
I've sent patches into maillists of affected subsystems already, this
patch-set fixes the problem in files related to pstore, tracing, gcov,
sysvipc and other subsystems processed via linux-kernel@ mailing list
directly
https://bugzilla.kernel.org/show_bug.cgi?id=206283
This patch (of 4):
Add debug code to seq_read() to detect missed or out-of-tree incorrect
.next seq_file functions.
[akpm@linux-foundation.org: s/pr_info/pr_info_ratelimited/, per Qian Cai]
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com
Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When removing files containing extended attributes, the hfsplus driver may
remove the wrong entries from the attributes b-tree, causing major
filesystem damage and in some cases even kernel crashes.
To remove a file, all its extended attributes have to be removed as well.
The driver does this by looking up all keys in the attributes b-tree with
the cnid of the file. Each of these entries then gets deleted using the
key used for searching, which doesn't contain the attribute's name when it
should. Since the key doesn't contain the name, the deletion routine will
not find the correct entry and instead remove the one in front of it. If
parent nodes have to be modified, these become corrupt as well. This
causes invalid links and unsorted entries that not even macOS's fsck_hfs
is able to fix.
To fix this, modify the search key before an entry is deleted from the
attributes b-tree by copying the found entry's key into the search key,
therefore ensuring that the correct entry gets removed from the tree.
Signed-off-by: Simon Gander <simon@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200327155541.1521-1-simon@tuxera.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull proc fix from Eric Biederman:
"A brown paper bag slipped through my proc changes, and syzcaller
caught it when the code ended up in your tree.
I have opted to fix it the simplest cleanest way I know how, so there
is no reasonable chance for the bug to repeat"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use a dedicated lock in struct pid
Add experimental support for allowing a swap file to be on an SMB3
mount. There are use cases where swapping over a secure network
filesystem is preferable. In some cases there are no local
block devices large enough, and network block devices can be
hard to setup and secure. And in some cases there are no
local block devices at all (e.g. with the recent addition of
remote boot over SMB3 mounts).
There are various enhancements that can be added later e.g.:
- doing a mandatory byte range lock over the swapfile (until
the Linux VFS is modified to notify the file system that an open
is for a swapfile, when the file can be opened "DENY_ALL" to prevent
others from opening it).
- pinning more buffers in the underlying transport to minimize memory
allocations in the TCP stack under the fs
- documenting how to create ACLs (on the server) to secure the
swapfile (or adding additional tools to cifs-utils to make it easier)
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Pull io_uring fixes from Jens Axboe:
"Here's a set of fixes that either weren't quite ready for the first,
or came about from some intensive testing on memcached with 350K+
sockets.
Summary:
- Fixes for races or deadlocks around poll handling
- Don't double account fixed files against RLIMIT_NOFILE
- IORING_OP_OPENAT LFS fix
- Poll retry handling (Bijan)
- Missing finish_wait() for SQPOLL (Hillf)
- Cleanup/split of io_kiocb alloc vs ctx references (Pavel)
- Fixed file unregistration and init fixes (Xiaoguang)
- Various little fixes (Xiaoguang, Pavel, Colin)"
* tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block:
io_uring: punt final io_ring_ctx wait-and-free to workqueue
io_uring: fix fs cleanup on cqe overflow
io_uring: don't read user-shared sqe flags twice
io_uring: remove req init from io_get_req()
io_uring: alloc req only after getting sqe
io_uring: simplify io_get_sqring
io_uring: do not always copy iovec in io_req_map_rw()
io_uring: ensure openat sets O_LARGEFILE if needed
io_uring: initialize fixed_file_data lock
io_uring: remove redundant variable pointer nxt and io_wq_assign_next call
io_uring: fix ctx refcounting in io_submit_sqes()
io_uring: process requests completed with -EAGAIN on poll list
io_uring: remove bogus RLIMIT_NOFILE check in file registration
io_uring: use io-wq manager as backup task if task is exiting
io_uring: grab task reference for poll requests
io_uring: retry poll if we got woken with non-matching mask
io_uring: add missing finish_wait() in io_sq_thread()
io_uring: refactor file register/unregister/update handling
Pull more xfs updates from Darrick Wong:
"As promised last week, this batch changes how xfs interacts with
memory reclaim; how the log batches and throttles log items; how hard
writes near ENOSPC will try to squeeze more space out of the
filesystem; and hopefully fix the last of the umount hangs after a
catastrophic failure.
Summary:
- Validate the realtime geometry in the superblock when mounting
- Refactor a bunch of tricky flag handling in the log code
- Flush the CIL more judiciously so that we don't wait until there
are millions of log items consuming a lot of memory.
- Throttle transaction commits to prevent the xfs frontend from
flooding the CIL with too many log items.
- Account metadata buffers correctly for memory reclaim.
- Mark slabs properly for memory reclaim. These should help reclaim
run more effectively when XFS is using a lot of memory.
- Don't write a garbage log record at unmount time if we're trying to
trigger summary counter recalculation at next mount.
- Don't block the AIL on locked dquot/inode buffers; instead trigger
its backoff mechanism to give the lock holder a chance to finish
up.
- Ratelimit writeback flushing when buffered writes encounter ENOSPC.
- Other minor cleanups.
- Make reflink a synchronous operation when the fs is mounted with
wsync or sync, which means that now we force the log to disk to
record the changes"
* tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: reflink should force the log out if mounted with wsync
xfs: factor out a new xfs_log_force_inode helper
xfs: fix inode number overflow in ifree cluster helper
xfs: remove redundant variable assignment in xfs_symlink()
xfs: ratelimit inode flush on buffered write ENOSPC
xfs: return locked status of inode buffer on xfsaild push
xfs: trylock underlying buffer on dquot flush
xfs: remove unnecessary ternary from xfs_create
xfs: don't write a corrupt unmount record to force summary counter recalc
xfs: factor inode lookup from xfs_ifree_cluster
xfs: tail updates only need to occur when LSN changes
xfs: factor common AIL item deletion code
xfs: correctly acount for reclaimable slabs
xfs: Improve metadata buffer reclaim accountability
xfs: don't allow log IO to be throttled
xfs: Throttle commits on delayed background CIL push
xfs: Lower CIL flush limit for large logs
xfs: remove some stale comments from the log code
xfs: refactor unmount record writing
xfs: merge xlog_commit_record with xlog_write_done
...
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.
This was reproduced with hundreds of thousands of sockets running
memcached, haven't been able to reproduce this synthetically.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The noisy posix error message in readdir was supposed
to be an FYI (not enabled by default)
CIFS VFS: XXX dev 66306, reparse 0, mode 755
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.
Plus there's also a few fixes and other minor things.
Summary:
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and
turn it off by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"
* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&pid->wait_pidfd);
> local_irq_disable();
> lock(tasklist_lock);
> lock(&pid->wait_pidfd);
> <Interrupt>
> lock(tasklist_lock);
>
> *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:
The problem is that because wait_pidfd.lock is taken under the tasklist
lock. It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.
Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock. That should be safe and I think the code will eventually
do that. It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.
It is also true that my pre-merge window testing was insufficient. So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things. I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.
It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals. That will require taking pid->lock with irqs disabled.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>