IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
No need to postpone this to the commit release path, since no packets
are walking over this object, this is accessed from control plane only.
This helped uncovered UAF triggered by races with the netlink notifier.
Fixes: 9dd732e0bdf5 ("netfilter: nf_tables: memleak flow rule from commit path")
Reported-by: syzbot+8f747f62763bc6c32916@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit release path is invoked via call_rcu and it runs lockless to
release the objects after rcu grace period. The netlink notifier handler
might win race to remove objects that the transaction context is still
referencing from the commit release path.
Call rcu_barrier() to ensure pending rcu callbacks run to completion
if the list of transactions to be destroyed is not empty.
Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership")
Reported-by: syzbot+8f747f62763bc6c32916@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On 32-bit kernels, 64-bit syscall arguments are split into two
registers. For that to work with syscall wrappers, the prototype of the
syscall must have the argument split so that the wrapper macro properly
unpacks the arguments from pt_regs.
The fanotify_mark() syscall is one such syscall, which already has a
split prototype, guarded behind ARCH_SPLIT_ARG64.
So select ARCH_SPLIT_ARG64 to get that prototype and fix fanotify_mark()
on 32-bit kernels with syscall wrappers.
Note also that fanotify_mark() is the only usage of ARCH_SPLIT_ARG64.
Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221101034852.2340319-1-mpe@ellerman.id.au
Recently, we got two syzkaller problems because of oversize packet
when napi frags enabled.
One of the problems is because the first seg size of the iov_iter
from user space is very big, it is 2147479538 which is bigger than
the threshold value for bail out early in __alloc_pages(). And
skb->pfmemalloc is true, __kmalloc_reserve() would use pfmemalloc
reserves without __GFP_NOWARN flag. Thus we got a warning as following:
========================================================
WARNING: CPU: 1 PID: 17965 at mm/page_alloc.c:5295 __alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
...
Call trace:
__alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
__alloc_pages_node include/linux/gfp.h:550 [inline]
alloc_pages_node include/linux/gfp.h:564 [inline]
kmalloc_large_node+0x94/0x350 mm/slub.c:4038
__kmalloc_node_track_caller+0x620/0x8e4 mm/slub.c:4545
__kmalloc_reserve.constprop.0+0x1e4/0x2b0 net/core/skbuff.c:151
pskb_expand_head+0x130/0x8b0 net/core/skbuff.c:1654
__skb_grow include/linux/skbuff.h:2779 [inline]
tun_napi_alloc_frags+0x144/0x610 drivers/net/tun.c:1477
tun_get_user+0x31c/0x2010 drivers/net/tun.c:1835
tun_chr_write_iter+0x98/0x100 drivers/net/tun.c:2036
The other problem is because odd IPv6 packets without NEXTHDR_NONE
extension header and have big packet length, it is 2127925 which is
bigger than ETH_MAX_MTU(65535). After ipv6_gso_pull_exthdrs() in
ipv6_gro_receive(), network_header offset and transport_header offset
are all bigger than U16_MAX. That would trigger skb->network_header
and skb->transport_header overflow error, because they are all '__u16'
type. Eventually, it would affect the value for __skb_push(skb, value),
and make it be a big value. After __skb_push() in ipv6_gro_receive(),
skb->data would less than skb->head, an out of bounds memory bug occurred.
That would trigger the problem as following:
==================================================================
BUG: KASAN: use-after-free in eth_type_trans+0x100/0x260
...
Call trace:
dump_backtrace+0xd8/0x130
show_stack+0x1c/0x50
dump_stack_lvl+0x64/0x7c
print_address_description.constprop.0+0xbc/0x2e8
print_report+0x100/0x1e4
kasan_report+0x80/0x120
__asan_load8+0x78/0xa0
eth_type_trans+0x100/0x260
napi_gro_frags+0x164/0x550
tun_get_user+0xda4/0x1270
tun_chr_write_iter+0x74/0x130
do_iter_readv_writev+0x130/0x1ec
do_iter_write+0xbc/0x1e0
vfs_writev+0x13c/0x26c
To fix the problems, restrict the packet size less than
(ETH_MAX_MTU - NET_SKB_PAD - NET_IP_ALIGN) which has considered reserved
skb space in napi_alloc_skb() because transport_header is an offset from
skb->head. Add len check in tun_napi_alloc_frags() simply.
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221029094101.1653855-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Changed maintainers for vnic driver, since Dany has new responsibilities.
Also added Nick Child as reviewer.
Signed-off-by: Rick Lindsley <ricklind@us.ibm.com>
Link: https://lore.kernel.org/r/20221028203509.4070154-1-ricklind@us.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
blk_mq_flush_plug_list() empties ->mq_list and request we'd peeked there
before that call is gone; in any case, we are not dealing with a mix
of requests for different queues now - there's no requests left in the
plug.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the introduction of syscall wrappers all wrappers for syscalls with
64-bit arguments must be handled specially, not only those that have
unaligned 64-bit arguments. This left out the fallocate() and
sync_file_range2() syscalls.
Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper")
Fixes: e23750623835 ("powerpc/32: fix syscall wrappers with 64-bit arguments of unaligned register-pairs")
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/87mt9cxd6g.fsf_-_@igel.home
The macros are defined backwards.
This affects the following compat syscalls:
- compat_sys_truncate64()
- compat_sys_ftruncate64()
- compat_sys_fallocate()
- compat_sys_sync_file_range()
- compat_sys_fadvise64_64()
- compat_sys_readahead()
- compat_sys_pread64()
- compat_sys_pwrite64()
Fixes: 43d5de2b67d7 ("asm-generic: compat: Support BE for long long args in 32-bit ABIs")
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Add list of affected syscalls]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/871qqoyvni.fsf_-_@igel.home
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNfzNwACgkQxWXV+ddt
WDuC6Q//a72PAq1sjwvQqAcr+OOe3PWnmlwYZCnXxiab5c74Kc7rDhDZcO3m/Qt5
3YTwgK5FT4Y0AI8RN1NXx3+UOAYCWp/TGeBdbPHg35XIYKAnCh4pfql84Uiw1Awz
HbqmSTma7sqVdRMehkKCkd7w4YoyAAsDdyXFQlSFm4ah9WHFZDswBc+m6xQZuWvU
QVQS6wUTxkxuBZp0UComWGBNHiDeDZbga7VqO8UHPYOB394IV2mYP6fh8l0oB/BS
bfKgsHjV9e0S0Ul0oPVADCGCiJcTbdnw3IA+Cje7MSgZ3kds/4Bo5IJWT5QRb94A
yDAFpxc+t3+FgpoKS3/tZK7imXwgpXueiT2bBj+BjDDWD2VUVVBG4QmXYIW6tuqY
vtEFw9+NCAvS2gRetHyXxQshYh/QW//+AZSkuI6/fuPSM+lRG5E0lnDxqrZiOMIo
e6SJOGH3tCmtusL5VSXIQ8DPaLI9PBg4OXChytwmLHwPIusbQOvD5sTDpd99UezB
dLXqZOGGScAc11HU1AFyZfAxTBybUgUxX/xCviJtf7ZOWKdcwiFrzSJOL5upSPz3
8qZTVjrD71mJlEa0Z8wj0Utuu4Psecp0GN+fs5JJxmqsFO0cYApU17OqPZ22+yEV
RU26YNpqurYVarHVER4WxyXYraBYd1Cr6s6bFVDnuZynfiCOYIw=
=3tvc
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes and regression fixes:
- fix a corner case when handling tree-mod-log chagnes in reallocated
notes
- fix crash on raid0 filesystems created with <5.4 mkfs.btrfs that
could lead to division by zero
- add missing super block checksum verification after thawing
filesystem
- handle one more case in send when dealing with orphan files
- fix parameter type mismatch for generation when reading dentry
- improved error handling in raid56 code
- better struct bio packing after recent cleanups"
* tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't use btrfs_chunk::sub_stripes from disk
btrfs: fix type of parameter generation in btrfs_get_dentry
btrfs: send: fix send failure of a subcase of orphan inodes
btrfs: make thaw time super block check to also verify checksum
btrfs: fix tree mod log mishandling of reallocated nodes
btrfs: reorder btrfs_bio for better packing
btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed
btrfs: raid56: properly handle the error when unable to find the missing stripe
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmNfpvEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXM4wBAAr3iQ2y+j88aZKbgHMp+uT5FF8fp6
xTAI+Zyqn6KUD3H2VC8DYm1crlyibA6bZhscO3Al14ustS4wyVxXqBkXBTukkXxE
exTzfmyx8SHCcke5vEfWvF1M/w9nHGRLTwtMwc2W0GR3Qz1uB65ezsxTikDwjlyP
Ax5nXoC9r0DMsunfkYuLlRpfoe3Vwbz2in93odemB4cHSDiqj0V0Llk5z/kidcqF
XrPf/GknVZblqS9NDZYg9accZGe8cLuIVHEeiXhmCt21mVoX13PycUWRzSnAvG7/
9M+Wb3KExpZFn+8J3G0HK89P7v+PUmpOUMsH03kQARdHS0br35jE7eAqfEwo96xk
UWJKbJCCEqURKmR9nzG6tuHqbUA2e8Sw/fqCMFRTxYBhAl64ptRqJPD5hqwY50Od
P6khJo75F8uIuwJtW+0fQ9kAIrJqjzVHiObOMEZmt9vSiOOGHqjriGsEitWIMe6+
cVxVSqwuNeaUyux5sj9IiKyKnFelPt0qMpMncrePZ8l2y4ATf9MQFX28X6HhskPt
7JD2nIprsCsMHUSjUf4Z+fBZC8IFw8yWSQbM+9S/ErnV2zieq5/OxlnJs87vro6W
3skrgwsB1C4TQoW9qRf3bDbT5O31kbu4lmUcD5mgUUzQd/V+L257DY2d+rF1rB3w
QMDyRxPPR/BP6bE=
=L2Xt
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20221031' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull LSM fix from Paul Moore:
"A single patch to the capabilities code to fix a potential memory leak
in the xattr allocation error handling"
* tag 'lsm-pr-20221031' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
capabilities: fix potential memleak on error path from vfs_getxattr_alloc()
There are two capabilities related to ring-based dirty page tracking:
KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are
supported by x86. However, arm64 supports KVM_CAP_DIRTY_LOG_RING_ACQ_REL
only when the feature is supported on arm64. The userspace doesn't have
to enable the advertised capability, meaning KVM_CAP_DIRTY_LOG_RING can
be enabled on arm64 by userspace and it's wrong.
Fix it by double checking if the capability has been advertised prior to
enabling it. It's rejected to enable the capability if it hasn't been
advertised.
Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
Starting with 6.1-rc1, CONFIG_FORTIFY_SOURCE checks became smart enough
to detect memcpy() callers that copy beyond what seems to be the end of
a struct. Unfortunately, gcc has a bug wherein it cannot reliably
compute the size of a struct containing another struct containing a flex
array at the end. This is the case with the xfs log item format
structures, which means that -rc1 starts complaining all over the place.
Fix these problems by memcpying the struct head and the flex arrays
separately. Although it's tempting to use the FLEX_ARRAY macros, the
structs involved are part of the ondisk log format. Some day we're
going to want to make the ondisk log contents endian-safe, which means
that we will have to stop using memcpy entirely.
While we're at it, fix some deficiencies in the validation of recovered
log intent items -- if the size of the recovery buffer is not even large
enough to cover the flex array record count in the head, we should abort
the recovery of that item immediately.
The last patch of this series changes the EFI/EFD sizeof functions names
and behaviors to be consistent with the similarly named sizeof helpers
for other log intent items.
v2: fix more inadequate log intent done recovery validation and dump
corrupt recovered items
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmNf8LkACgkQ+H93GTRK
tOt3QQ//SuyxzE4i2Vr8o7dwFQ6qtQeSt9RixtgKUG3ay+eZLCgpA7KS8po0Dv7W
/8aAY6K712Mp2IzmdJUIHb/Pch5UbRSN5rw0169CsNDOmU/R9njqfeMWMfDr9ixS
HAWfo13yh/QSmBTyioijZhP08N0TpyNVFsM9s5/4hKU7UGV4h5g2kz+hyDHrsSmB
KXAM7FAh6SX8eBjxpj3iKLgsdEW7mcsDYurSVOnfmgWkXvgZXoLOvPt84e09A+s3
tLq5AEiLr261o45VbfExrjqn0qvwE7HdMdLPJrTa/tp6ztfsU2SJ6AxmG/XTTlBj
jnIcYL9unu8JOndmJjLZxuhXmXXwZ3eFfsUgn0/tluSeR/nMMc3CCItZ58Ox5zk7
kUpN0JnY1+ecYmDw1Qz8LhhSIReOiA5Rw2SwVQ8wB3Oit9/cBQsxtM9YxxOne3MN
od2096CiyvCYjpm6EGTRCkxQuz2nleJ5LajXb7dmkw91IiPdvoWbTPT+trtjO/63
gYbD0A4Qko9iDW0bWCCvWPD6vBZhN1q6r1j1lu77Az+z/45W47ut6MGokK4NHzo3
fTarDMqbVDxyeSrhW713iQO7PypLoOv7b72HD1+SvSkHzKwdi42kIlWe8e5B8Rew
GkH2ycfQtaq+UR2fT4rSs4wWWuxoZLyo9Utdh0DA0ZF5QZfAqFo=
=wP07
-----END PGP SIGNATURE-----
Merge tag 'fix-log-recovery-misuse-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.1-fixes
xfs: fix various problems with log intent item recovery
Starting with 6.1-rc1, CONFIG_FORTIFY_SOURCE checks became smart enough
to detect memcpy() callers that copy beyond what seems to be the end of
a struct. Unfortunately, gcc has a bug wherein it cannot reliably
compute the size of a struct containing another struct containing a flex
array at the end. This is the case with the xfs log item format
structures, which means that -rc1 starts complaining all over the place.
Fix these problems by memcpying the struct head and the flex arrays
separately. Although it's tempting to use the FLEX_ARRAY macros, the
structs involved are part of the ondisk log format. Some day we're
going to want to make the ondisk log contents endian-safe, which means
that we will have to stop using memcpy entirely.
While we're at it, fix some deficiencies in the validation of recovered
log intent items -- if the size of the recovery buffer is not even large
enough to cover the flex array record count in the head, we should abort
the recovery of that item immediately.
The last patch of this series changes the EFI/EFD sizeof functions names
and behaviors to be consistent with the similarly named sizeof helpers
for other log intent items.
v2: fix more inadequate log intent done recovery validation and dump
corrupt recovered items
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'fix-log-recovery-misuse-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: dump corrupt recovered log intent items to dmesg consistently
xfs: actually abort log recovery on corrupt intent-done log items
xfs: refactor all the EFI/EFD log item sizeof logic
xfs: fix memcpy fortify errors in EFI log format copying
xfs: fix memcpy fortify errors in RUI log format copying
xfs: fix memcpy fortify errors in CUI log format copying
xfs: fix memcpy fortify errors in BUI log format copying
xfs: fix validation in attr log item recovery
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag. Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We're supposed to initialize the list head of an object before adding it
to another list. Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the refcount records are at the right side of
the refcount btree. This works because an AG is never allowed to have
more than (1U << 31) blocks in it. If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.
However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.
Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.
The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we have an explicit enum for shared and CoW staging extents, we
can get rid of the old FIND_RCEXT flags. Omit a couple of conversions
that disappear in the next patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount. We'll put this to more use in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've broken out the startblock and shared/cow domain in the
incore refcount extent record structure, update the tracepoints to
report the domain.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents. The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock. We don't allow AGs to have more than 2^31
blocks, so the bit was free.
Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set. This is a weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.
Fuzzing demonstrates practical exploits of this weakness. If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain. This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.
Found by fuzzing keys[2].cowflag = add in xfs/464.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Consolidate the open-coded xfs_refcount_irec fields into an actual
struct and use the existing _btrec_to_irec to decode the ondisk record.
This will reduce code churn in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If log recovery decides that an intent item is corrupt and wants to
abort the mount, capture a hexdump of the corrupt log item in the kernel
log for further analysis. Some of the log item code already did this,
so we're fixing the rest to do it consistently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Structure definitions for incore objects do not belong in the ondisk
format header. Move them to the incore types header where they belong.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If log recovery picks up intent-done log items that are not of the
correct size it needs to abort recovery and fail the mount. Debug
assertions are not good enough.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent. Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.
If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block. If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.
The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner than later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor all the open-coded sizeof logic for EFI/EFD log item and log
format structures into common helper functions whose names reflect the
struct names.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a predicate function to verify that a given agbno/blockcount pair
fit entirely within a single allocation group and don't suffer
mathematical overflows. Refactor the existng open-coded logic; we're
going to add more calls to this function in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Since we're already fixing problems with BUI item copying, we
should fix it everything else.
An extra difficulty here is that the ef[id]_extents arrays are declared
as single-element arrays. This is not the convention for flex arrays in
the modern kernel, and it causes all manner of problems with static
checking tools, since they often cannot tell the difference between a
single element array and a flex array.
So for starters, change those array[1] declarations to array[]
declarations to signal that they are proper flex arrays and adjust all
the "size-1" expressions to fit the new declaration style.
Next, refactor the xfs_efi_copy_format function to handle the copying of
the head and the flex array members separately. While we're at it, fix
a minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
such that the end of the range would not be in the middle of a refcount
record. If this is no longer the case, something is seriously wrong
with the btree. Bail out with a corruption error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Since we're already fixing problems with BUI item copying, we
should fix it everything else.
Refactor the xfs_rui_copy_format function to handle the copying of the
head and the flex array members separately. While we're at it, fix a
minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Since we're already fixing problems with BUI item copying, we
should fix it everything else.
Refactor the xfs_cui_copy_format function to handle the copying of the
head and the flex array members separately. While we're at it, fix a
minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Unfortunately, it doesn't handle flex arrays correctly:
------------[ cut here ]------------
memcpy: detected field-spanning write (size 48) of single field "dst_bui_fmt" at fs/xfs/xfs_bmap_item.c:628 (size 16)
Fix this by refactoring the xfs_bui_copy_format function to handle the
copying of the head and the flex array members separately. While we're
at it, fix a minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Before we start fixing all the complaints about memcpy'ing log items
around, let's fix some inadequate validation in the xattr log item
recovery code and get rid of the (now trivial) copy_format function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.
This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().
For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).
So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.
A test case will later be sent for fstests, after this is fixed in Linus'
tree.
Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel robot complained about this:
>> fs/xfs/xfs_file.c:1266:31: sparse: sparse: incorrect type in return expression (different base types) @@ expected int @@ got restricted vm_fault_t @@
fs/xfs/xfs_file.c:1266:31: sparse: expected int
fs/xfs/xfs_file.c:1266:31: sparse: got restricted vm_fault_t
fs/xfs/xfs_file.c:1314:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted vm_fault_t [usertype] ret @@ got int @@
fs/xfs/xfs_file.c:1314:21: sparse: expected restricted vm_fault_t [usertype] ret
fs/xfs/xfs_file.c:1314:21: sparse: got int
Fix the incorrect return type for these two functions.
While we're at it, make the !fsdax version return VM_FAULT_SIGBUS
because a zero return value will cause some callers to try to lock
vmf->page, which we never set here.
Fixes: ea6c49b784f0 ("xfs: support CoW in fsdax mode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
After allocation 'dip' is tested instead of 'dip->csums'. Fix it.
Fixes: 642c5d34da53 ("btrfs: allocate the btrfs_dio_private as part of the iomap dio bio")
CC: stable@vger.kernel.org # 5.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Avoid that the hardware path is shown twice in the kernel log, and clean
up the output of the version numbers to show up in the same order as
they are listed in the hardware database in the hardware.c file.
Additionally, optimize the memory footprint of the hardware database
and mark some code as init code.
Fixes: cab56b51ec0e ("parisc: Fix device names in /proc/iomem")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v4.9+
There is a kmemleak caused by modprobe null_blk.ko
unreferenced object 0xffff8881acb1f000 (size 1024):
comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s)
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff .........S......
backtrace:
[<000000004a10c249>] kmalloc_node_trace+0x22/0x60
[<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350
[<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0
[<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440
[<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0
[<00000000d10c98c3>] 0xffffffffc450d69d
[<00000000b9299f48>] 0xffffffffc4538392
[<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0
[<00000000b389383b>] do_init_module+0x1a4/0x680
[<0000000087cf3542>] load_module+0x6249/0x7110
[<00000000beba61b8>] __do_sys_finit_module+0x140/0x200
[<00000000fdcfff51>] do_syscall_64+0x35/0x80
[<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
That is because q->ma_ops is set to NULL before blk_release_queue is
called.
blk_mq_init_queue_data
blk_mq_init_allocated_queue
blk_mq_realloc_hw_ctxs
for (i = 0; i < set->nr_hw_queues; i++) {
old_hctx = xa_load(&q->hctx_table, i);
if (!blk_mq_alloc_and_init_hctx(.., i, ..)) [1]
if (!old_hctx)
break;
xa_for_each_start(&q->hctx_table, j, hctx, j)
blk_mq_exit_hctx(q, set, hctx, j); [2]
if (!q->nr_hw_queues) [3]
goto err_hctxs;
err_exit:
q->mq_ops = NULL; [4]
blk_put_queue
blk_release_queue
if (queue_is_mq(q)) [5]
blk_mq_release(q);
[1]: blk_mq_alloc_and_init_hctx failed at i != 0.
[2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and
will be cleaned up in blk_mq_release.
[3]: q->nr_hw_queues is 0.
[4]: Set q->mq_ops to NULL.
[5]: queue_is_mq returns false due to [4]. And blk_mq_release
will not be called. The hctxs in q->unused_hctx_list are leaked.
To fix it, call blk_release_queue in exception path.
Fixes: 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit a5810f551d0a ("drm/i915: Allow more varied alternate
fixed modes for panels") intel_panel_add_edid_alt_fixed_modes()
no longer considers vrr vs. drrs separately. So no reason to
pass them as separate parameters either.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220927180615.25476-2-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit eb89e83c152b122a94e79527d63cb7c79823c37e)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
kmemleak reported memory leaks in device_add_disk():
kmemleak: 3 new suspected memory leaks
unreferenced object 0xffff88800f420800 (size 512):
comm "modprobe", pid 4275, jiffies 4295639067 (age 223.512s)
hex dump (first 32 bytes):
04 00 00 00 08 00 00 00 01 00 00 00 00 00 00 00 ................
00 e1 f5 05 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000d3662699>] kmalloc_trace+0x26/0x60
[<00000000edc7aadc>] wbt_init+0x50/0x6f0
[<0000000069601d16>] wbt_enable_default+0x157/0x1c0
[<0000000028fc393f>] blk_register_queue+0x2a4/0x420
[<000000007345a042>] device_add_disk+0x6fd/0xe40
[<0000000060e6aab0>] nbd_dev_add+0x828/0xbf0 [nbd]
...
It is because the memory allocated in wbt_enable_default() is not
released in device_add_disk() error path.
Normally, these memory are freed in:
del_gendisk()
rq_qos_exit()
rqos->ops->exit(rqos);
wbt_exit()
So rq_qos_exit() is called to free the rq_wb memory for wbt_init().
However in the error path of device_add_disk(), only
blk_unregister_queue() is called and make rq_wb memory leaked.
Add rq_qos_exit() to the error path to fix it.
Fixes: 83cbce957446 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221029071355.35462-1-chenzhongjin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add helper of ublk_queue_cmd() so that both ublk_queue_rq()
and ublk_handle_need_get_data() can reuse this helper.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring cmd is supposed to be used in ubq daemon context mainly,
and we should try to avoid to touch it in ublk io submission context,
otherwise this data could become shared between the two contexts,
and performance is hurt.
So link request into one per-queue list, and use same batching policy
of io_uring command, just avoid to touch ucmd in blk-mq io context.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add help info for choosing to build ublk_drv as module or builtin.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
UBLK_F_URING_CMD_COMP_IN_TASK needs to be set and returned to userspace
if ublk driver is built as module, otherwise userspace may get wrong
flags shown.
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call intel_sdvo_select_ddc_bus() before initializing any
of the outputs. And before that is functional (assuming no VBT)
we have to set up the controlled_outputs thing. Otherwise DDC
won't be functional during the output init but LVDS really
needs it for the fixed mode setup.
Note that the whole multi output support still looks very
bogus, and more work will be needed to make it correct.
But for now this should at least fix the LVDS EDID fixed mode
setup.
Cc: stable@vger.kernel.org
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/7301
Fixes: aa2b88074a56 ("drm/i915/sdvo: Fix multi function encoder stuff")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221026101134.20865-3-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit 64b7b557dc8a96d9cfed6aedbf81de2df80c025d)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
We try to filter out the corresponding xxx1 output
if the xxx0 output is not present. But the way that is
being done is pretty awkward. Make it less so.
Cc: stable@vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221026101134.20865-2-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit cc1e66394daaa7e9f005e2487a84e34a39f9308b)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
swiotlb_max_segment used to return either the maximum size that swiotlb
could bounce, or for Xen PV PAGE_SIZE even if swiotlb could bounce buffer
larger mappings. This made i915 on Xen PV work as it bypasses the
coherency aspect of the DMA API and can't cope with bounce buffering
and this avoided bounce buffering for the Xen/PV case.
So instead of adding this hack back, check for Xen/PV directly in i915
for the Xen case and otherwise use the proper DMA API helper to query
the maximum mapping size.
Replace swiotlb_max_segment() calls with dma_max_mapping_size().
In i915_gem_object_get_pages_internal() no longer consider max_segment
only if CONFIG_SWIOTLB is enabled. There can be other (iommu related)
causes of specific max segment sizes.
Fixes: a2daa27c0c61 ("swiotlb: simplify swiotlb_max_segment")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[hch: added the Xen hack, rewrote the changelog]
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221020110308.1582518-1-hch@lst.de
(cherry picked from commit 78a07fe777c42800bd1adaec12abe5dcee43919e)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Accessing the TypeC DKL PHY registers during modeset-commit,
-verification, DP link-retraining and AUX power well toggling is racy
due to these code paths being concurrent and the PHY register bank
selection register (HIP_INDEX_REG) being shared between PHY instances
(aka TC ports) and the bank selection being not atomic wrt. the actual
PHY register access.
Add the required locking around each PHY register bank selection->
register access sequence.
Kudos to Ville for noticing the race conditions.
v2:
- Add the DKL PHY register accessors to intel_dkl_phy.[ch]. (Jani)
- Make the DKL_REG_TC_PORT macro independent of PHY internals.
- Move initing the DKL PHY lock to a more logical place.
v3:
- Fix parameter reuse in the DKL_REG_TC_PORT definition.
- Document the usage of phy_lock.
v4:
- Fix adding TC_PORT_1 offset in the DKL_REG_TC_PORT definition.
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: <stable@vger.kernel.org> # v5.5+
Acked-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221025114457.2191004-1-imre.deak@intel.com
(cherry picked from commit 89cb0ba4ceee6bed1059904859c5723b3f39da68)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
We can't use "skb" again after passing it to qdisc_enqueue(). This is
basically identical to commit 2f09707d0c97 ("sch_sfb: Also store skb
len before calling child enqueue").
Fixes: d7f4f332f082 ("sch_red: update backlog as well")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>