IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The rpcgss_context trace event acceptor field is a dynamically sized
string that records the "data" parameter. But this parameter is also
dependent on the "len" field to determine the size of the data.
It needs to use __string_len() helper macro where the length can be passed
in. It also incorrectly uses strncpy() to save it instead of
__assign_str(). As these macros can change, it is not wise to open code
them in trace events.
As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
__assign_str() can be used for both __string() and __string_len() fields.
Before that commit, __assign_str_len() is required to be used. This needs
to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
Rework __assign_str() and __string() to not duplicate getting the string")
is the commit that makes __string_str_len() obsolete).
Cc: stable@vger.kernel.org
Fixes: 0c77668ddb4e ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Main user visible change:
- User events can now have "multi formats"
The current user events have a single format. If another event is created
with a different format, it will fail to be created. That is, once an
event name is used, it cannot be used again with a different format. This
can cause issues if a library is using an event and updates its format.
An application using the older format will prevent an application using
the new library from registering its event.
A task could also DOS another application if it knows the event names, and
it creates events with different formats.
The multi-format event is in a different name space from the single
format. Both the event name and its format are the unique identifier.
This will allow two different applications to use the same user event name
but with different payloads.
- Added support to have ftrace_dump_on_oops dump out instances and
not just the main top level tracing buffer.
Other changes:
- Add eventfs_root_inode
Only the root inode has a dentry that is static (never goes away) and
stores it upon creation. There's no reason that the thousands of other
eventfs inodes should have a pointer that never gets set in its
descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode
descriptor and a dentry pointer, and only the root inode will use this.
- Added WARN_ON()s in eventfs
There's some conditionals remaining in eventfs that should never be hit,
but instead of removing them, add WARN_ON() around them to make sure that
they are never hit.
- Have saved_cmdlines allocation also include the map_cmdline_to_pid array
The saved_cmdlines structure allocates a large amount of data to hold its
mappings. Within it, it has three arrays. Two are already apart of it:
map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by
also including the map_cmdline_to_pid[] array as well.
- Restructure __string() and __assign_str() macros used in TRACE_EVENT().
Dynamic strings in TRACE_EVENT() are declared with:
__string(name, source)
And assigned with:
__assign_str(name, source)
In the tracepoint callback of the event, the __string() is used to get the
size needed to allocate on the ring buffer and __assign_str() is used to
copy the string into the ring buffer. There's a helper structure that is
created in the TRACE_EVENT() macro logic that will hold the string length
and its position in the ring buffer which is created by __string().
There are several trace events that have a function to create the string
to save. This function is executed twice. Once for __string() and again
for __assign_str(). There's no reason for this. The helper structure could
also save the string it used in __string() and simply copy that into
__assign_str() (it also already has its length).
By using the structure to store the source string for the assignment, it
means that the second argument to __assign_str() is no longer needed.
It will be removed in the next merge window, but for now add a warning if
the source string given to __string() is different than the source string
given to __assign_str(), as the source to __assign_str() isn't even used
and will be going away.
- Added checks to make sure that the source of __string() is also the
source of __assign_str() so that it can be safely removed in the next
merge window.
Included fixes that the above check found.
- Other minor clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfhbUBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrhJAP9bfnYO7tfNGZVNPmTT7Fz0z4zCU1Pb
P8M+24yiFTeFWwD/aIPlMFZONVkTdFAlLdffl6kJOKxZ7vW4XzUjfNWb6wo=
=z/D6
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Main user visible change:
- User events can now have "multi formats"
The current user events have a single format. If another event is
created with a different format, it will fail to be created. That
is, once an event name is used, it cannot be used again with a
different format. This can cause issues if a library is using an
event and updates its format. An application using the older format
will prevent an application using the new library from registering
its event.
A task could also DOS another application if it knows the event
names, and it creates events with different formats.
The multi-format event is in a different name space from the single
format. Both the event name and its format are the unique
identifier. This will allow two different applications to use the
same user event name but with different payloads.
- Added support to have ftrace_dump_on_oops dump out instances and
not just the main top level tracing buffer.
Other changes:
- Add eventfs_root_inode
Only the root inode has a dentry that is static (never goes away)
and stores it upon creation. There's no reason that the thousands
of other eventfs inodes should have a pointer that never gets set
in its descriptor. Create a eventfs_root_inode desciptor that has a
eventfs_inode descriptor and a dentry pointer, and only the root
inode will use this.
- Added WARN_ON()s in eventfs
There's some conditionals remaining in eventfs that should never be
hit, but instead of removing them, add WARN_ON() around them to
make sure that they are never hit.
- Have saved_cmdlines allocation also include the map_cmdline_to_pid
array
The saved_cmdlines structure allocates a large amount of data to
hold its mappings. Within it, it has three arrays. Two are already
apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory
can be saved by also including the map_cmdline_to_pid[] array as
well.
- Restructure __string() and __assign_str() macros used in
TRACE_EVENT()
Dynamic strings in TRACE_EVENT() are declared with:
__string(name, source)
And assigned with:
__assign_str(name, source)
In the tracepoint callback of the event, the __string() is used to
get the size needed to allocate on the ring buffer and
__assign_str() is used to copy the string into the ring buffer.
There's a helper structure that is created in the TRACE_EVENT()
macro logic that will hold the string length and its position in
the ring buffer which is created by __string().
There are several trace events that have a function to create the
string to save. This function is executed twice. Once for
__string() and again for __assign_str(). There's no reason for
this. The helper structure could also save the string it used in
__string() and simply copy that into __assign_str() (it also
already has its length).
By using the structure to store the source string for the
assignment, it means that the second argument to __assign_str() is
no longer needed.
It will be removed in the next merge window, but for now add a
warning if the source string given to __string() is different than
the source string given to __assign_str(), as the source to
__assign_str() isn't even used and will be going away.
- Added checks to make sure that the source of __string() is also the
source of __assign_str() so that it can be safely removed in the
next merge window.
Included fixes that the above check found.
- Other minor clean ups and fixes"
* tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
tracing: Add __string_src() helper to help compilers not to get confused
tracing: Use strcmp() in __assign_str() WARN_ON() check
tracepoints: Use WARN() and not WARN_ON() for warnings
tracing: Use div64_u64() instead of do_div()
tracing: Support to dump instance traces by ftrace_dump_on_oops
tracing: Remove second parameter to __assign_rel_str()
tracing: Add warning if string in __assign_str() does not match __string()
tracing: Add __string_len() example
tracing: Remove __assign_str_len()
ftrace: Fix most kernel-doc warnings
tracing: Decrement the snapshot if the snapshot trigger fails to register
tracing: Fix snapshot counter going between two tracers that use it
tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
tracing: Use ? : shortcut in trace macros
tracing: Do not calculate strlen() twice for __string() fields
tracing: Rework __assign_str() and __string() to not duplicate getting the string
cxl/trace: Properly initialize cxl_poison region name
net: hns3: tracing: fix hclgevf trace event strings
drm/i915: Add missing ; to __assign_str() macros in tracepoint code
NFSD: Fix nfsd_clid_class use of __string_len() macro
...
The TRACE_EVENT macros has some dependency if a __string() field is NULL,
where it will save "(null)" as the string. This string is also used by
__assign_str(). It's better to create a single macro instead of having
something that will not be caught by the compiler if there is an
unfortunate typo.
Link: https://lore.kernel.org/linux-trace-kernel/20240222211443.106216915@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Highlights include:
Bugfixes:
- Fix for an Oops in the NFSv4.2 listxattr handler
- Correct an incorrect buffer size in listxattr
- Fix for an Oops in the pNFS flexfiles layout
- Fix a refcount leak in NFS O_DIRECT writes
- Fix missing locking in NFS O_DIRECT
- Avoid an infinite loop in pnfs_update_layout
- Fix an overflow in the RPC waitqueue queue length counter
- Ensure that pNFS I/O is also protected by TLS when xprtsec
is specified by the mount options
- Fix a leaked folio lock in the netfs read code
- Fix a potential deadlock in fscache
- Allow setting the fscache uniquifier in NFSv4
- Fix an off by one in root_nfs_cat()
- Fix another off by one in rpc_sockaddr2uaddr()
- nfs4_do_open() can incorrectly trigger state recovery.
- Various fixes for connection shutdown
Features and cleanups:
- Ensure that containers only see their own RPC and NFS stats
- Enable nconnect for RDMA
- Remove dead code from nfs_writepage_locked()
- Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO, and
mount options.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmX14K0ACgkQZwvnipYK
APLCeg/7Bdah7158TdNxSQAHPo3jzDqZmc933eZC0H8C9whNlu6XIa9fyT6ZrsQr
qkQ/ztSwsB6yp6vLPSnVdDh5KsndwrInTB874H8y6+8x+KwwuhSQ7Uy8epg5wrO0
kgiaRYSH7HB7EgUdNY14fHNXkA/DMLHz1F1aw2NVGCYmVCMg7kGV4wYCOH6bI2Ea
Wu8amZce6D1AbktbdSZcEz2ricR3lGXjCUPMnzRCaSpUmdd2t7d/rsnjTeKU1gb4
p9zLlOZs9Xe2vMT0ZQI8SEI+Scze82LBy7ykSKyhOjOt4AurVpzQFAvK+3dFZoIq
lzIHJwabBGNui26CR1k90ZqERLkkk+24i3ccT28HwhTqe5eM/qDCKOVQmuP0F1F8
QYsnIM+NnmPZveSGAMdOQwlGFQTyJbT5Na1blHTW2R2rjqBzgvfn8fR0vV4L5P7B
0J8ShmZKVkvb7mtJJhaaI4LF41ciCF8+I5zwpnYQi0tsX370XPNNFbzS3BmPUVFL
k0uEMVfNy69PkaH4DJWQT9GoE3qiAamkO+EdAlPad6b8QMdJJZxXOmaUzL8YsCHV
sX5ugsih/Hf5/+QFBCbHEy7G3oeeHsT80yO8nvGT+yy94bv4F+WcM/tviyRbKrls
t5audBDNRfrAeUlqAQkXfFmAyqP2CGNr29oL62cXL2muFG7d7ys=
=5n+X
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Bugfixes:
- Fix for an Oops in the NFSv4.2 listxattr handler
- Correct an incorrect buffer size in listxattr
- Fix for an Oops in the pNFS flexfiles layout
- Fix a refcount leak in NFS O_DIRECT writes
- Fix missing locking in NFS O_DIRECT
- Avoid an infinite loop in pnfs_update_layout
- Fix an overflow in the RPC waitqueue queue length counter
- Ensure that pNFS I/O is also protected by TLS when xprtsec is
specified by the mount options
- Fix a leaked folio lock in the netfs read code
- Fix a potential deadlock in fscache
- Allow setting the fscache uniquifier in NFSv4
- Fix an off by one in root_nfs_cat()
- Fix another off by one in rpc_sockaddr2uaddr()
- nfs4_do_open() can incorrectly trigger state recovery
- Various fixes for connection shutdown
Features and cleanups:
- Ensure that containers only see their own RPC and NFS stats
- Enable nconnect for RDMA
- Remove dead code from nfs_writepage_locked()
- Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO,
and mount options"
* tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits)
nfs: fix panic when nfs4_ff_layout_prepare_ds() fails
NFS: trace the uniquifier of fscache
NFS: Read unlock folio on nfs_page_create_from_folio() error
NFS: remove unused variable nfs_rpcstat
nfs: fix UAF in direct writes
nfs: properly protect nfs_direct_req fields
NFS: enable nconnect for RDMA
NFSv4: nfs4_do_open() is incorrectly triggering state recovery
NFS: avoid infinite loop in pnfs_update_layout.
NFS: remove sync_mode test from nfs_writepage_locked()
NFSv4.1/pnfs: fix NFS with TLS in pnfs
NFS: Fix an off by one in root_nfs_cat()
nfs: make the rpc_stat per net namespace
nfs: expose /proc/net/sunrpc/nfs in net namespaces
sunrpc: add a struct rpc_stats arg to rpc_create_args
nfs: remove unused NFS_CALL macro
NFSv4.1: add tracepoint to trunked nfs4_exchange_id calls
NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt
SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned int
nfs: fix regression in handling of fsc= option in NFSv4
...
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is hotplugged
as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving policy
wherein we allocate memory across nodes in a weighted fashion rather
than uniformly. This is beneficial in heterogeneous memory environments
appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the process
has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown situations.
The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's series
"Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page faults.
He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test",
Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
our handling of DAX on archiecctures which have virtually aliasing data
caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
improvements in worst-case mmap_lock hold times during certain
userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements
in his series "Mitigate a vmap lock contention". It realizes a 12x
improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging of
large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages() to
an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which are
configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
=TG55
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
This was a relatively calm development cycle. Most of changes are
rather small device-specific fixes and enhancements. The only
significant changes in ALSA core are code refactoring with the recent
cleanup infrastructure, which should bring no functionality changes.
Some highlights below:
Core:
- Lots of cleanups in ALSA core code with automatic kfree cleanup
and locking guard macros
- New ALSA core kunit test
ASoC:
- SoundWire support for AMD ACP 6.3 systems
- Support for reporting version information for AVS firmware
- Support DSPless mode for Intel Soundwire systems
- Support for configuring CS35L56 amplifiers using EFI calibration
data
- Log which component is being operated on as part of power management
trace events.
- Support for Microchip SAM9x7, NXP i.MX95 and Qualcomm WCD939x
HD- and USB-audio:
- More Cirrus HD-audio codec support
- TAS2781 HD-audio codec fixes
- Scarlett2 mixer fixes
Others:
- Enhancement of virtio driver for audio control supports
- Cleanups of legacy PM code with new macros
- Firewire sound updates
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmXyzFQOHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE80WQ//bQeLEUF9HQqprCW96jFiGeO3/0Zb5pdCCrZw
VYRxzeGBfMfVFvXSC4/Rp3zr4Dbc+sOg9GXAD6PVAo/QudIDkuX1pk/gRN2NFXQ5
bimdZ6obM4WCl7isbDIbn/ifOx05F7p0+J9T9nAPrvBG4lpzXoMhGz75YnwaPlrh
q5MKEZcuONlZPHZrBy/UsrYqWrnWUi2yWgQ5gRg/PTM4dgUAy2pH7NpKNxOiRntJ
eqBfdvglSWQDH9kPgmeTggtFN8Axy+pd+g9M5pi/KOJfoBpWuv2nK31gnymdqV4H
UrmwU/VAL2Y0zU34RCZQvPFre6S+487FEf/g+qgVTDqi0kxxFT2btcaTjggjLwEy
p/SJlqNnA7W7D67/qf4MPNOEp88Dd6o1YN7o01vyC9RoX5FAbzvNLF8oH4BwGxs+
HI+5aJUY1f2MGwN3NpPW5E12d1RSgSi9L9l/R8oAQmonARr3drj3tkndhFjndgXG
IctwHlkYRSibe6m5k6sDEcil70UNl5M6sr/IjPmDvYudjdKHisowrxqF+nPrAYdM
0z3fW333+OQf0XVd9iPLBmq+PpiAY1AhCJeF/hPr3D5qDZInhcd8CouFie+QGkHT
Z5j5CvhNLgRdmlW9jvfBPBBCT7u8jr6JFszA3g6wpWUx6ndAGsI1z6iC+h23NpZj
dxmJU00=
=h9kz
-----END PGP SIGNATURE-----
Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"This was a relatively calm development cycle. Most of changes are
rather small device-specific fixes and enhancements. The only
significant changes in ALSA core are code refactoring with the recent
cleanup infrastructure, which should bring no functionality changes.
Some highlights below:
Core:
- Lots of cleanups in ALSA core code with automatic kfree cleanup and
locking guard macros
- New ALSA core kunit test
ASoC:
- SoundWire support for AMD ACP 6.3 systems
- Support for reporting version information for AVS firmware
- Support DSPless mode for Intel Soundwire systems
- Support for configuring CS35L56 amplifiers using EFI calibration
data
- Log which component is being operated on as part of power
management trace events.
- Support for Microchip SAM9x7, NXP i.MX95 and Qualcomm WCD939x
HD- and USB-audio:
- More Cirrus HD-audio codec support
- TAS2781 HD-audio codec fixes
- Scarlett2 mixer fixes
Others:
- Enhancement of virtio driver for audio control supports
- Cleanups of legacy PM code with new macros
- Firewire sound updates"
* tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (307 commits)
ALSA: usb-audio: Stop parsing channels bits when all channels are found.
ALSA: hda/tas2781: remove unnecessary runtime_pm calls
ALSA: hda/realtek - ALC236 fix volume mute & mic mute LED on some HP models
ALSA: aaci: Delete unused variable in aaci_do_suspend
ALSA: scarlett2: Fix Scarlett 4th Gen input gain range again
ALSA: scarlett2: Fix Scarlett 4th Gen input gain range
ALSA: scarlett2: Fix Scarlett 4th Gen autogain status values
ALSA: scarlett2: Fix Scarlett 4th Gen 4i4 low-voltage detection
ALSA: hda/tas2781: restore power state after system_resume
ALSA: hda/tas2781: do not call pm_runtime_force_* in system_resume/suspend
ALSA: hda/tas2781: do not reset cur_* values in runtime_suspend
ALSA: hda/tas2781: add lock to system_suspend
ALSA: hda/tas2781: use dev_dbg in system_resume
ALSA: hda/realtek: fix ALC285 issues on HP Envy x360 laptops
platform/x86: serial-multi-instantiate: Add support for CS35L54 and CS35L57
ALSA: hda: cs35l56: Add support for CS35L54 and CS35L57
ASoC: cs35l56: Add support for CS35L54 and CS35L57
ASoC: Intel: catpt: Carefully use PCI bitwise constants
ALSA: hda: hda_component: Include sound/hda_codec.h
ALSA: hda: hda_component: Add missing #include guards
...
Highlights:
- acer-wmi: New HW support
- amd/pmf: Support for new revision of heartbeat notify
- asus-wmi: Correctly handle HW without LEDs
- fujitsu-laptop: Battery charge control support
- hp-wmi: Support for new thermal profiles
- ideapad-laptop: Support for refresh rate key
- intel/pmc: Put AI accelerator (GNA) into D3 if it has no
driver to allow entry into low-power modes, and
temporarily removed Lunar Lake SSRAM support due
to breaking FW changes causing probe fail
(further breaking FW changes are still pending)
- pmc/punit_atom: Report devices that prevent reacing low power
levels
- surface: Fan speed function support
- thinkpad_acpi: Support for more sperial keys and complete the
list of models with non-standard fan registers
- touchscreen_dmi: New HW support
- wmi: Continued modernization efforts
- Removal of obsoleted ledtrig-audio call and the related dependency
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
The following is an automated shortlog grouped by driver:
acer-wmi:
- Add predator_v4 module parameter
- Add support for Acer PH16-71
amd/hsmp:
- Add support for ACPI based probing
- Cache pci_dev in struct hsmp_socket
- Change devm_kzalloc() to devm_kcalloc()
- Check num_sockets against MAX_AMD_SOCKETS
- Create static func to handle platdev
- Define a struct to hold mailbox regs
- Move dev from platdev to hsmp_socket
- Move hsmp_test to probe
- Non-ACPI support for AMD F1A_M00~0Fh
- Remove extra parenthesis and add a space
- Restructure sysfs group creation
amd/pmf:
- Add missing __iomem attribute to policy_base
- Add support to get APTS index numbers for static slider
- Add support to get sbios requests in PMF driver
- Add support to get sps default APTS index values
- Add support to notify sbios heart beat event
- Differentiate PMF ACPI versions
- Disable debugfs support for querying power thermals
- Do not use readl() for policy buffer access
- Fix possible out-of-bound memory accesses
- Fix return value of amd_pmf_start_policy_engine()
- Update sps power thermals according to the platform-profiles
- Use struct for cookie header
asus-wmi:
- Consider device is absent when the read is ~0
- Revert: Support WMI event queue
clk: x86:
- Move clk-pmc-atom register defines to include/linux/platform_data/x86/pmc_atom.h
dell-privacy:
- Remove usage of wmi_has_guid()
Documentation/x86/amd/hsmp:
- Updating urls
drivers/mellanox:
- Convert snprintf to sysfs_emit
fujitsu-laptop:
- Add battery charge control support
hp-wmi:
- Add thermal profile support for 8BAD boards
- Tidy up module source code
ideapad-laptop:
- map Fn + R key to KEY_REFRESH_RATE_TOGGLE
- support Fn+R dual-function key
Input:
- allocate keycode for Display refresh rate toggle
intel/ifs:
- Add an entry rendezvous for SAF
- Add current batch number to trace output
- Remove unnecessary initialization of 'ret'
- Replace the exit rendezvous with an entry rendezvous for ARRAY_BIST
- Trace on all HT threads when executing a test
intel/pmc/arl:
- Put GNA device in D3
intel/pmc:
- Improve PKGC residency counters debug
intel/pmc/lnl:
- Remove SSRAM support
intel_scu_ipcutil:
- Make scu static
intel_scu_pcidrv:
- Remove unused intel-mid.h
intel_scu_wdt:
- Remove unused intel-mid.h
intel/tpmi:
- Change vsec offset to u64
intel/vsec:
- Remove nuisance message
ISST:
- Allow reading core-power state on HWP disabled systems
mlxbf-pmc:
- Cleanup signed/unsigned mix-up
- fix signedness bugs
- Ignore unsupported performance blocks
mlxbf-pmc: mlxbf_pmc_event_list():
- make size ptr optional
mlxbf-pmc:
- Replace uintN_t with kernel-style types
mlxreg-hotplug:
- Remove redundant NULL-check
pmc_atom:
- Annotate d3_sts register bit defines
- Check state of PMC clocks on s2idle
- Check state of PMC managed devices on s2idle
silicom-platform:
- clean up a check
surface: aggregator_registry:
- add entry for fan speed
thinkpad_acpi:
- Add more ThinkPads with non-standard reg address for fan
- Fix to correct wrong temp reporting on some ThinkPads
- remove redundant assignment to variable i
- Simplify thermal mode checking
- Support for mode FN key
touchscreen_dmi:
- Add an extra entry for a variant of the Chuwi Vi8 tablet
wmi:
- Always evaluate _WED when receiving an event
- Check if event data is not NULL
- Check if WMxx control method exists
- Do not instantiate older WMI drivers multiple times
- Ignore duplicated GUIDs in legacy matches
- Make input buffer mandatory when evaluating methods
- Prevent incompatible event driver from probing
- Remove obsolete duplicate GUID allowlist
- Remove unnecessary out-of-memory message
- Replace pr_err() with dev_err()
- Stop using ACPI device class
- Update documentation regarding _WED
- Use ACPI device name in netlink event
- Use FW_BUG when warning about missing control methods
x86/atom:
- Check state of Punit managed devices on s2idle
x86: ibm_rtl:
- make rtl_subsys const
x86: wmi:
- make wmi_bus_type const
platform/x86:
- make fw_attr_class constant
- remove obsolete calls to ledtrig_audio_get
Merges:
- Merge tag 'platform-drivers-x86-v6.8-2' into pdx/for-next
- Merge tag 'platform-drivers-x86-v6.8-4' into pdx86/for-next
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCZfLZKgAKCRBZrE9hU+XO
MWqnAQCZW0KiSzXbJkTN4GWlMOqnlaJsiflnPeVNxH59bDUTeQEA/OdSzyiDUqKr
zJcGnOyILuQ3wCvQ5SuqRCwjFHXOQg0=
=8y6r
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Ilpo Järvinen:
- New acer-wmi HW support
- Support for new revision of amd/pmf heartbeat notify
- Correctly handle asus-wmi HW without LEDs
- fujitsu-laptop battery charge control support
- Support for new hp-wmi thermal profiles
- Support ideapad-laptop refresh rate key
- Put intel/pmc AI accelerator (GNA) into D3 if it has no driver to
allow entry into low-power modes, and temporarily removed Lunar Lake
SSRAM support due to breaking FW changes causing probe fail (further
breaking FW changes are still pending)
- Report pmc/punit_atom devices that prevent reacing low power levels
- Surface Fan speed function support
- Support for more sperial keys and complete the list of models with
non-standard fan registers in thinkpad_acpi
- New DMI touchscreen HW support
- Continued modernization efforts of wmi
- Removal of obsoleted ledtrig-audio call and the related dependency
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
* tag 'platform-drivers-x86-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (87 commits)
platform/x86/intel/pmc: Improve PKGC residency counters debug
platform/x86: asus-wmi: Consider device is absent when the read is ~0
Documentation/x86/amd/hsmp: Updating urls
platform/mellanox: mlxreg-hotplug: Remove redundant NULL-check
platform/x86/amd/pmf: Update sps power thermals according to the platform-profiles
platform/x86/amd/pmf: Add support to get sps default APTS index values
platform/x86/amd/pmf: Add support to get APTS index numbers for static slider
platform/x86/amd/pmf: Add support to notify sbios heart beat event
platform/x86/amd/pmf: Add support to get sbios requests in PMF driver
platform/x86/amd/pmf: Disable debugfs support for querying power thermals
platform/x86/amd/pmf: Differentiate PMF ACPI versions
x86/platform/atom: Check state of Punit managed devices on s2idle
platform/x86: pmc_atom: Check state of PMC clocks on s2idle
platform/x86: pmc_atom: Check state of PMC managed devices on s2idle
platform/x86: pmc_atom: Annotate d3_sts register bit defines
clk: x86: Move clk-pmc-atom register defines to include/linux/platform_data/x86/pmc_atom.h
platform/x86: make fw_attr_class constant
platform/x86/intel/tpmi: Change vsec offset to u64
platform/x86: intel_scu_pcidrv: Remove unused intel-mid.h
platform/x86: intel_scu_wdt: Remove unused intel-mid.h
...
- Allow the Energy Model to be updated dynamically (Lukasz Luba).
- Add support for LZ4 compression algorithm to the hibernation image
creation and loading code (Nikhil V).
- Fix and clean up system suspend statistics collection (Rafael
Wysocki).
- Simplify device suspend and resume handling in the power management
core code (Rafael Wysocki).
- Fix PCI hibernation support description (Yiwei Lin).
- Make hibernation take set_memory_ro() return values into account as
appropriate (Christophe Leroy).
- Set mem_sleep_current during kernel command line setup to avoid an
ordering issue with handling it (Maulik Shah).
- Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
driver's system suspend callback (Qingliang Li).
- Simplify pm_runtime_get_if_active() usage and add a replacement for
pm_runtime_put_autosuspend() (Sakari Ailus).
- Add a tracepoint for runtime_status changes tracking (Vilas Bhat).
- Fix section title markdown in the runtime PM documentation (Yiwei
Lin).
- Enable preferred core support in the amd-pstate cpufreq driver (Meng
Li).
- Fix min_perf assignment in amd_pstate_adjust_perf() and make the
min/max limit perf values in amd-pstate always stay within the
(highest perf, lowest perf) range (Tor Vic, Meng Li).
- Allow intel_pstate to assign model-specific values to strings used in
the EPP sysfs interface and make it do so on Meteor Lake (Srinivas
Pandruvada).
- Drop long-unused cpudata::prev_cummulative_iowait from the
intel_pstate cpufreq driver (Jiri Slaby).
- Prevent scaling_cur_freq from exceeding scaling_max_freq when the
latter is an inefficient frequency (Shivnandan Kumar).
- Change default transition delay in cpufreq to 2ms (Qais Yousef).
- Remove references to 10ms minimum sampling rate from comments in the
cpufreq code (Pierre Gondois).
- Honour transition_latency over transition_delay_us in cpufreq (Qais
Yousef).
- Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar).
- General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
Belova).
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
- Make the SCMI cpufreq driver get a transition delay value from
firmware (Pierre Gondois).
- Prevent the haltpoll cpuidle governor from shrinking guest
poll_limit_ns below grow_start (Parshuram Sangle).
- Avoid potential overflow in integer multiplication when computing
cpuidle state parameters (C Cheng).
- Adjust MWAIT hint target C-state computation in the ACPI cpuidle
driver and in intel_idle to return a correct value for C0 (He
Rongguang).
- Address multiple issues in the TPMI RAPL driver and add support for
new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui).
- Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
Lezcano).
- Fix kernel-doc for dtpm_create_hierarchy() (Yang Li).
- Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
Norway Ananda).
- Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil).
- Fix a couple of warnings in the OPP core code related to W=1
builds (Viresh Kumar).
- Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
Kumar).
- Extend dev_pm_opp_data with turbo support (Sibi Sankar).
- dt-bindings: drop maxItems from inner items (David Heidelberg).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmXvI/ISHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx24sP/jxg6fOGme8raHQvpTXG3/H56wlGzQ4P
YUvvKUXnfD3yf1zNISsUl7VQebZqDt8rygkwSdymXlUVZX1eubN0RpCFc0F8GZuc
THG/YQhYQr/9zro3FpKhfDj5evk21PCQzjf+dGvfQF9qVMxNPG1JzEFK6PnolT5X
2BvkonY1XFWZjCMbZ83B/jt35lTDb0cmeNbCpfD5UJgcnxmMOtZYpORdyfPWTJpG
GVCwmAFVVXxXlust/AIpt3mmOpKzSA9GnrtJkhtQe5GN+Y4OjnJiFJmTC7EfCctj
JlWgVUA716mtFMUrjXgjfI54firF2oQpqaSa2HG/V/A96JWQqjarGz5dAV1IrPEt
ZmYpvMe4E90S411wF1OWyrEqjXUuDnH1OWUvUdWSt4E7DhFw3esDi/jLW2tyVKAT
hIy+/O4wzbDSTX/h9Cgt1Qjhew6lKUIwvhEXclB3fuJ+JoviWNkC9lnK93e2H0A3
VYfkd/lpUD74035l0FrCJ/49MjX9kqrsn+TipHsIlSXAi8ZRdKbVvxOTD8RYudcI
GvCiDDrkMgNwGlyedgbtTBUepCvSg93b+vVmRj7YMPtBhioOUo3qCn6wpqhxfnth
9BCnPW7JxqUw/NJdlk9hKumaUZq+MK8G+kdYcIDg6xmAkWSUVP2QKlWavfMCxqRP
+dN6T2iHsKFe
=UePT
-----END PGP SIGNATURE-----
Merge tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"From the functional perspective, the most significant change here is
the addition of support for Energy Models that can be updated
dynamically at run time.
There is also the addition of LZ4 compression support for hibernation,
the new preferred core support in amd-pstate, new platforms support in
the Intel RAPL driver, new model-specific EPP handling in intel_pstate
and more.
Apart from that, the cpufreq default transition delay is reduced from
10 ms to 2 ms (along with some related adjustments), the system
suspend statistics code undergoes a significant rework and there is a
usual bunch of fixes and code cleanups all over.
Specifics:
- Allow the Energy Model to be updated dynamically (Lukasz Luba)
- Add support for LZ4 compression algorithm to the hibernation image
creation and loading code (Nikhil V)
- Fix and clean up system suspend statistics collection (Rafael
Wysocki)
- Simplify device suspend and resume handling in the power management
core code (Rafael Wysocki)
- Fix PCI hibernation support description (Yiwei Lin)
- Make hibernation take set_memory_ro() return values into account as
appropriate (Christophe Leroy)
- Set mem_sleep_current during kernel command line setup to avoid an
ordering issue with handling it (Maulik Shah)
- Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
driver's system suspend callback (Qingliang Li)
- Simplify pm_runtime_get_if_active() usage and add a replacement for
pm_runtime_put_autosuspend() (Sakari Ailus)
- Add a tracepoint for runtime_status changes tracking (Vilas Bhat)
- Fix section title markdown in the runtime PM documentation (Yiwei
Lin)
- Enable preferred core support in the amd-pstate cpufreq driver
(Meng Li)
- Fix min_perf assignment in amd_pstate_adjust_perf() and make the
min/max limit perf values in amd-pstate always stay within the
(highest perf, lowest perf) range (Tor Vic, Meng Li)
- Allow intel_pstate to assign model-specific values to strings used
in the EPP sysfs interface and make it do so on Meteor Lake
(Srinivas Pandruvada)
- Drop long-unused cpudata::prev_cummulative_iowait from the
intel_pstate cpufreq driver (Jiri Slaby)
- Prevent scaling_cur_freq from exceeding scaling_max_freq when the
latter is an inefficient frequency (Shivnandan Kumar)
- Change default transition delay in cpufreq to 2ms (Qais Yousef)
- Remove references to 10ms minimum sampling rate from comments in
the cpufreq code (Pierre Gondois)
- Honour transition_latency over transition_delay_us in cpufreq (Qais
Yousef)
- Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar)
- General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
Belova)
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan)
- Make the SCMI cpufreq driver get a transition delay value from
firmware (Pierre Gondois)
- Prevent the haltpoll cpuidle governor from shrinking guest
poll_limit_ns below grow_start (Parshuram Sangle)
- Avoid potential overflow in integer multiplication when computing
cpuidle state parameters (C Cheng)
- Adjust MWAIT hint target C-state computation in the ACPI cpuidle
driver and in intel_idle to return a correct value for C0 (He
Rongguang)
- Address multiple issues in the TPMI RAPL driver and add support for
new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui)
- Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
Lezcano)
- Fix kernel-doc for dtpm_create_hierarchy() (Yang Li)
- Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
Norway Ananda)
- Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil)
- Fix a couple of warnings in the OPP core code related to W=1 builds
(Viresh Kumar)
- Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
Kumar)
- Extend dev_pm_opp_data with turbo support (Sibi Sankar)
- dt-bindings: drop maxItems from inner items (David Heidelberg)"
* tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits)
dt-bindings: opp: drop maxItems from inner items
OPP: debugfs: Fix warning around icc_get_name()
OPP: debugfs: Fix warning with W=1 builds
cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
OPP: Extend dev_pm_opp_data with turbo support
Fix cpupower-frequency-info.1 man page typo
cpufreq: scmi: Set transition_delay_us
firmware: arm_scmi: Populate fast channel rate_limit
firmware: arm_scmi: Populate perf commands rate_limit
cpuidle: ACPI/intel: fix MWAIT hint target C-state computation
PM: sleep: wakeirq: fix wake irq warning in system suspend
powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
cpufreq: Don't unregister cpufreq cooling on CPU hotplug
PM: suspend: Set mem_sleep_current during kernel command line setup
cpufreq: Honour transition_latency over transition_delay_us
cpufreq: Limit resolving a frequency to policy min/max
Documentation: PM: Fix runtime_pm.rst markdown syntax
cpufreq: amd-pstate: adjust min/max limit perf
cpufreq: Remove references to 10ms min sampling rate
cpufreq: intel_pstate: Update default EPPs for Meteor Lake
...
Core & protocols
----------------
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps etc.)
lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core
instead of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length
and budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global config
variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug
of ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for
use on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code
--------------------------------------------
- Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena).
- Rework selftest harness to enable the use of the full range of
ksft exit code (pass, fail, skip, xfail, xpass).
Netfilter
---------
- Allow userspace to define a table that is exclusively owned by a daemon
(via netlink socket aliveness) without auto-removing this table when
the userspace program exits. Such table gets marked as orphaned and
a restarting management daemon can re-attach/regain ownership.
- Speed up element insertions to nftables' concatenated-ranges set type.
Compact a few related data structures.
BPF
---
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between BPF
program and user space where structures inside the arena can have
pointers to other areas of the arena, and pointers work seamlessly
for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the verifier
and the program. The verifier allows the program to loop assuming it's
behaving well, but reserves the right to terminate it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF objects.
Wireless
--------
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API
----------
- Major overhaul of the Energy Efficient Ethernet internals to support
new link modes (2.5GE, 5GE), share more code between drivers
(especially those using phylib), and encourage more uniform behavior.
Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc
----
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions,
and packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message encapsulation
or "Netlink polymorphism", where interpretation of nested attributes
depends on link type, classifier type or some other "class type".
Drivers
-------
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic
on CAN BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO) support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs to
have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmXv0mgACgkQMUZtbf5S
IrtgMxAAuRd+WJW++SENr4KxIWhYO1q6Xcxnai43wrNkan9swD24icG8TYALt4f3
yoT6idQvWReAb5JNlh9rUQz8R7E0nJXlvEFn5MtJwcthx2C6wFo/XkJlddlRrT+j
c2xGILwLjRhW65LaC0MZ2ECbEERkFz8xcGfK2SWzUgh6KYvPjcRfKFxugpM7xOQK
P/Wnqhs4fVRS/Mj/bCcXcO+yhwC121Q3qVeQVjGS0AzEC65hAW87a/kc2BfgcegD
EyI9R7mf6criQwX+0awubjfoIdr4oW/8oDVNvUDczkJkbaEVaLMQk9P5x/0XnnVS
UHUchWXyI80Q8Rj12uN1/I0h3WtwNQnCRBuLSmtm6GLfCAwbLvp2nGWDnaXiqryW
DVKUIHGvqPKjkOOMOVfSvfB3LvkS3xsFVVYiQBQCn0YSs/gtu4CoF2Nty9CiLPbK
tTuxUnLdPDZDxU//l0VArZmP8p2JM7XQGJ+JH8GFH4SBTyBR23e0iyPSoyaxjnYn
RReDnHMVsrS1i7GPhbqDJWn+uqMSs7N149i0XmmyeqwQHUVSJN3J2BApP2nCaDfy
H2lTuYly5FfEezt61NvCE4qr/VsWeEjm1fYlFQ9dFn4pGn+HghyCpw+xD1ZN56DN
lujemau5B3kk1UTtAT4ypPqvuqjkRFqpNV2LzsJSk/Js+hApw8Y=
=oY52
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
The bulk of the patches for this release are optimizations, code
clean-ups, and minor bug fixes.
One new feature to mention is that NFSD administrators now have the
ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support
has had this capability for some time.
As always I am grateful to NFSD contributors, reviewers, and
testers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXwV4QACgkQM2qzM29m
f5c7cg/8CRe0mGbeEMonoSycBjANDuiRolCM+DhVccUvSyWPqf4blF5yrNHcf5zN
WmjQHVXIJUMVpLovcakj+4aBIuXGgdSmBJamFTy9fVfcFadiWYRceNgMMXpLMDDI
fMAszRUyfL/r0Evj0Zajt86R5/gGn+W9X6HlDc1k7VV0Z+fzRw9WMxADy11cgHLp
mh2bzyPmwu0EfBYlWNWLqzWVZm1C5UCGnlInyr0KXImCLOkpJqAVXTDvDkGFW2Qw
1kJhodyabf6fRV2ZqPjLUuR4aRqABey83rB0N5z7MumO/dJUBW3CHR3uNMqvkmh3
XevI8bPzS2Kypijcx7dONtkDWwU+fsvCdepNpmVDB73B19BFiLG+HDbMypJ0dmp+
rvvfILRDCmIb+FA1DUeT3lIc6ac1f1+qAVc7hi3E7rGctEJWeHDsZg+E1PuTvpxM
3XfRaFnucY5vwyiB2/uI4eblBHcVXoKho+pUqQMegLPRbgsEUyFUfg3+ZMtntagd
OVUXvWYIARP97HNh0J5ChcGI72UpXtFWMlbbiTiCzYx4FeiCffeczIERXNJ4FYAg
fKUaiBhdAN1PPFCRXJORZ5XlSIeZttUNSJUPfmuOpkscMdkpRUIhuEUYo9K8/1eL
O+YZeGW/kTG+llxOERfEHJoekLf1TgGdU7oBmTIgQIK03hTUih8=
=75G4
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are optimizations, code
clean-ups, and minor bug fixes.
One new feature to mention is that NFSD administrators now have the
ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has
had this capability for some time.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits)
NFSD: Clean up nfsd4_encode_replay()
NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
NFSD: Document nfsd_setattr() fill-attributes behavior
nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
nfsd: Fix a regression in nfsd_setattr()
NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
NFSD: handle GETATTR conflict with write delegation
NFSD: add support for CB_GETATTR callback
NFSD: Document the phases of CREATE_SESSION
NFSD: Fix the NFSv4.1 CREATE_SESSION operation
nfsd: clean up comments over nfs4_client definition
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Post WRs for Write chunks in svc_rdma_sendto()
svcrdma: Post the Reply chunk and Send WR together
svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt
svcrdma: Post Send WR chain
svcrdma: Fix retry loop in svc_rdma_send()
svcrdma: Prevent a UAF in svc_rdma_send()
svcrdma: Fix SQ wake-ups
svcrdma: Increase the per-transport rw_ctx count
...
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer wheel
of a CPU which is likely to be busy at the time of expiry. This is done
to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is close
to zero.
2) Due to #1 it is possible that timers are accumulated on a
single target CPU
3) The required computation in the enqueue path is just overhead for
dubious value especially under the consideration that the vast
majority of timer wheel timers are either canceled or rearmed
before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on which
they get armed.
This is achieved by having separate wheels for CPU pinned timers and
global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they expire.
- If the first expiring timer is a global timer, then the expiry time
is propagated into the timer pull hierarchy and the CPU makes sure
to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to the
point where no further aggregation of groups is required, i.e. the
number of levels is log8(NR_CPUS). The magic number of eight has been
established by experimention, but can be adjusted if needed.
In each group one busy CPU acts as the migrator. It's only one CPU to
avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether there
are other CPUs in the group which have gone idle and have global timers
to expire. If there are global timers to expire, the migrator locks the
remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can require
to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point the
CPU is the systemwide migrator at the top of the hierarchy and it
therefore cannot delegate to the hierarchy. It needs to arm its own
timer device to expire either at the first expiring timer in the
hierarchy or at the first CPU local timer, which ever expires first.
This completely removes the overhead from the enqueue path, which is
e.g. for networking a true hotpath and trades it for a slightly more
complex idle path.
This has been in development for a couple of years and the final series
has been extensively tested by various teams from silicon vendors and
ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them to
power down a die completely on a mult-die socket for the first time in
a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific overloaded
netperf test which is currently investigated, but the rest is either
positive or neutral performance wise and positive on the power
management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware timers
and interpolated them back to clock MONOTONIC. The changes address a
few corner cases in the interpolation code which got the math and logic
wrong.
- Simplifcation of the clocksource watchdog retry logic to automatically
adjust to handle larger systems correctly instead of having more
incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ
6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY
mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt
1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG
D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST
s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv
lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP
ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ
FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz
S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN
RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh
rQ23UBms6ZRR+A==
=iQlu
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer
wheel of a CPU which is likely to be busy at the time of expiry.
This is done to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is
close to zero.
2) Due to #1 it is possible that timers are accumulated on
a single target CPU
3) The required computation in the enqueue path is just overhead
for dubious value especially under the consideration that the
vast majority of timer wheel timers are either canceled or
rearmed before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on
which they get armed.
This is achieved by having separate wheels for CPU pinned timers
and global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they
expire.
- If the first expiring timer is a global timer, then the expiry
time is propagated into the timer pull hierarchy and the CPU
makes sure to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to
the point where no further aggregation of groups is required, i.e.
the number of levels is log8(NR_CPUS). The magic number of eight
has been established by experimention, but can be adjusted if
needed.
In each group one busy CPU acts as the migrator. It's only one CPU
to avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether
there are other CPUs in the group which have gone idle and have
global timers to expire. If there are global timers to expire, the
migrator locks the remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can
require to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point
the CPU is the systemwide migrator at the top of the hierarchy and
it therefore cannot delegate to the hierarchy. It needs to arm its
own timer device to expire either at the first expiring timer in
the hierarchy or at the first CPU local timer, which ever expires
first.
This completely removes the overhead from the enqueue path, which
is e.g. for networking a true hotpath and trades it for a slightly
more complex idle path.
This has been in development for a couple of years and the final
series has been extensively tested by various teams from silicon
vendors and ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them
to power down a die completely on a mult-die socket for the first
time in a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific
overloaded netperf test which is currently investigated, but the
rest is either positive or neutral performance wise and positive on
the power management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware
timers and interpolated them back to clock MONOTONIC. The changes
address a few corner cases in the interpolation code which got the
math and logic wrong.
- Simplifcation of the clocksource watchdog retry logic to
automatically adjust to handle larger systems correctly instead of
having more incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place"
* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuD/AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsojEACNlJKqsebZv24szCR5ViBGqoDi/A5v5vZv
1p7f0sVgpwFLuDu3CCb9IG1tuAiuhBa5yvBKKpyGuGglQd+7Sxqsgdc2Bv/76D7S
Ej/fc1x5dxuvAvAetYk4yH2idPhYIBVIx3g2oz44bO4Ur3jFZ/yXzp+JtuKEuTba
7kQmAXfN7c497XDsmSv1eJM/+D/LKjmvjqMX2gnXprw2qPgdAklXcUSnBYaS2JEt
o4HGWAImJOV416d7QkOWgKfk6ksJbO3lFzQ6R+JdQCl6KVqc0+5u0oT06ZGVpSUf
fQqfcV+cJw41dQB47Qr017ku0EdDI19L3YpL9/WOnNMBM421j1QER1cKiKfiHD2B
LCOn+tvunxcGMzYonAFfgSF4XXFJWSK33TpvmmVsU3w0+YSC9oIqFfCxOdHuAJqB
tHSuGHgzkufgqhNIQWHiWZEJJUW+MO4Dv2rUV6n+dfCz6JQG48Gs9clDv/tAEY4U
4NzErfYLCsWlNaMPQK1f/b9dWjBXAnpJA4yq8jPyYB3GqjnVuX3Ze14UfwOWgv0B
E++qgPsh30ShbP/NRHqS9tNQC2hIy27x/jzpTyKwxuoSs/nyeZg7lFXIPaQQo7wt
GZhGzsMasbhoylqblB171NFlxpRetY9aYvHZ3OfUP4xAt1THVOzR6hZrBurOKMv/
e8FBGBh/cg==
=Hy//
-----END PGP SIGNATURE-----
Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Make running of task_work internal loops more fair, and unify how the
different methods deal with them (me)
- Support for per-ring NAPI. The two minor networking patches are in a
shared branch with netdev (Stefan)
- Add support for truncate (Tony)
- Export SQPOLL utilization stats (Xiaobing)
- Multishot fixes (Pavel)
- Fix for a race in manipulating the request flags via poll (Pavel)
- Cleanup the multishot checking by making it generic, moving it out of
opcode handlers (Pavel)
- Various tweaks and cleanups (me, Kunwu, Alexander)
* tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
io_uring: Fix sqpoll utilization check racing with dying sqpoll
io_uring/net: dedup io_recv_finish req completion
io_uring: refactor DEFER_TASKRUN multishot checks
io_uring: fix mshot io-wq checks
io_uring/net: add io_req_msg_cleanup() helper
io_uring/net: simplify msghd->msg_inq checking
io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
io_uring/net: correctly handle multishot recvmsg retry setup
io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
io_uring: fix io_queue_proc modifying req->flags
io_uring: fix mshot read defer taskrun cqe posting
io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
io_uring/net: correct the type of variable
io_uring/sqpoll: statistics of the true utilization of sq threads
io_uring/net: move recv/recvmsg flags out of retry loop
io_uring/kbuf: flag request if buffer pool is empty after buffer pick
io_uring/net: improve the usercopy for sendmsg/recvmsg
io_uring/net: move receive multishot out of the generic msghdr path
io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4tQAKCRCRxhvAZXjc
ohnfAP4sm946PZfiC4y5Euk96WDC3hC8WCSBar+fpFmYVzeD9wEAy+NVCsjkMElz
vqNxwFULUwQjFxxvsM9gvhrgGUud1AE=
=UZk/
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull file locking updates from Christian Brauner:
"A few years ago struct file_lock_context was added to allow for
separate lists to track different types of file locks instead of using
a singly-linked list for all of them.
Now leases no longer need to be tracked using struct file_lock.
However, a lot of the infrastructure is identical for leases and locks
so separating them isn't trivial.
This splits a group of fields used by both file locks and leases into
a new struct file_lock_core. The new core struct is embedded in struct
file_lock. Coccinelle was used to convert a lot of the callers to deal
with the move, with the remaining 25% or so converted by hand.
Afterwards several internal functions in fs/locks.c are made to work
with struct file_lock_core. Ultimately this allows to split struct
file_lock into struct file_lock and struct file_lease. The file lease
APIs are then converted to take struct file_lease"
* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
filelock: fix deadlock detection in POSIX locking
filelock: always define for_each_file_lock()
smb: remove redundant check
filelock: don't do security checks on nfsd setlease calls
filelock: split leases out of struct file_lock
filelock: remove temporary compatibility macros
smb/server: adapt to breakup of struct file_lock
smb/client: adapt to breakup of struct file_lock
ocfs2: adapt to breakup of struct file_lock
nfsd: adapt to breakup of struct file_lock
nfs: adapt to breakup of struct file_lock
lockd: adapt to breakup of struct file_lock
fuse: adapt to breakup of struct file_lock
gfs2: adapt to breakup of struct file_lock
dlm: adapt to breakup of struct file_lock
ceph: adapt to breakup of struct file_lock
afs: adapt to breakup of struct file_lock
9p: adapt to breakup of struct file_lock
filelock: convert seqfile handling to use file_lock_core
filelock: convert locks_translate_pid to take file_lock_core
...
It is useful to expose skb addr and sock addr to user in tracepoint
tcp_probe, so that we can get more information while monitoring
receiving of tcp data, by ebpf or other ways.
For example, we need to identify a packet by seq and end_seq when
calculate transmit latency between layer 2 and layer 4 by ebpf, but which is
not available in tcp_probe, so we can only use kprobe hooking
tcp_rcv_established to get them. But we can use tcp_probe directly if skb
addr and sock addr are available, which is more efficient.
Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?
Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.
We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.
This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).
The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.
To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.
Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.
We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmXnrwoACgkQ+7dXa6fL
C2vtoA//UstImRxievynpzfaFy14vCY+qVgW3n88sDT9ITP/jCQH/hMorIft7B2K
W7xN15OrWJ7SmfSSiWgZRTPgc0y5J/25woinZk9zSrFyOPfYXEJUy/Rw2VpQQ1eU
lN1ArHvSzj/wWYZjb5FfzNds2mO9dpSvgP8gsBPaN/49OX4ZDpfwQO0ho20c8qE3
VEUAoEPSYFd9EUJ4NBvXhn5q0Z1dp/tl3sQZdOFIN3EZnQBloPFFkvBbDTuX2jtK
Q7F6cgPc/BWfQ+U4OwyPOA3QBrTPMmWCUtM1QCvYV+cj1O7xGaof372Fqin0E7z9
NVIN4fVk3SIW/LRLhpyHoVOc0jHIona9gAsawfRzSIz7sPdeoXy0oOFHyKWrf3dH
TdrdnPPkUaGLPbTv2/l6lzz+uCkMlQ6BOZV+y6gm1+OcM7BEvYBeA6RShPKEk0vo
XsCwn6Pl9y8TZTOpcjal0b+M6mR5Nz0hYzcdCjmBc3lY8ueURPdvaNyLgmAtMu5c
udsq8yMxaXyc3BblrtTTdV+sqdycTfU6/CoNY44uS4X9H+MsFYBgoba+HSvekUWj
b/DCE2YfjQXdRmdlfGhyHSjnvOZohd8wq7i9TEOc8PSNXHo0CV/U8Q6pgsRrwSpG
7DecipkrdFPYqzOl8PRIwWg/VwNGEI5ESzDD9Svs+cODSYJXJjE=
=Egry
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
Here are some changes to AF_RXRPC:
(1) Cache the transmission serial number of ACK and DATA packets in the
rxrpc_txbuf struct and log this in the retransmit tracepoint.
(2) Don't use atomics on rxrpc_txbuf::flags[*] and cache the intended wire
header flags there too to avoid duplication.
(3) Cache the wire checksum in rxrpc_txbuf to make it easier to create
jumbo packets in future (which will require altering the wire header
to a jumbo header and restoring it back again for retransmission).
(4) Fix the protocol names in the wire ACK trailer struct.
(5) Strip all the barriers and atomics out of the call timer tracking[*].
(6) Remove atomic handling from call->tx_transmitted and
call->acks_prev_seq[*].
(7) Don't bother resetting the DF flag after UDP packet transmission. To
change it, we now call directly into UDP code, so it's quick just to
set it every time.
(8) Merge together the DF/non-DF branches of the DATA transmission to
reduce duplication in the code.
(9) Add a kvec array into rxrpc_txbuf and start moving things over to it.
This paves the way for using page frags.
(10) Split (sub)packet preparation and timestamping out of the DATA
transmission function. This helps pave the way for future jumbo
packet generation.
(11) In rxkad, don't pick values out of the wire header stored in
rxrpc_txbuf, buf rather find them elsewhere so we can remove the wire
header from there.
(12) Move rxrpc_send_ACK() to output.c so that it can be merged with
rxrpc_send_ack_packet().
(13) Use rxrpc_txbuf::kvec[0] to access the wire header for the packet
rather than directly accessing the copy in rxrpc_txbuf. This will
allow that to be removed to a page frag.
(14) Switch from keeping the transmission buffers in rxrpc_txbuf allocated
in the slab to allocating them using page fragment allocators. There
are separate allocators for DATA packets (which persist for a while)
and control packets (which are discarded immediately).
We can then turn on MSG_SPLICE_PAGES when transmitting DATA and ACK
packets.
We can also get rid of the RCU cleanup on rxrpc_txbufs, preferring
instead to release the page frags as soon as possible.
(15) Parse received packets before handling timeouts as the former may
reset the latter.
(16) Make sure we don't retransmit DATA packets after all the packets have
been ACK'd.
(17) Differentiate traces for PING ACK transmission.
(18) Switch to keeping timeouts as ktime_t rather than a number of jiffies
as the latter is too coarse a granularity. Only set the call timer at
the end of the call event function from the aggregate of all the
timeouts, thereby reducing the number of timer calls made. In future,
it might be possible to reduce the number of timers from one per call
to one per I/O thread and to use a high-precision timer.
(19) Record RTT probes after successful transmission rather than recording
it before and then cancelling it after if unsuccessful[*]. This
allows a number of calls to get the current time to be removed.
(20) Clean up the resend algorithm as there's now no need to walk the
transmission buffer under lock[*]. DATA packets can be retransmitted
as soon as they're found rather than being queued up and transmitted
when the locked is dropped.
(21) When initially parsing a received ACK packet, extract some of the
fields from the ack info to the skbuff private data. This makes it
easier to do path MTU discovery in the future when the call to which a
PING RESPONSE ACK refers has been deallocated.
[*] Possible with the move of almost all code from softirq context to the
I/O thread.
Link: https://lore.kernel.org/r/20240301163807.385573-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20240304084322.705539-1-dhowells@redhat.com/ # v2
* tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (21 commits)
rxrpc: Extract useful fields from a received ACK to skb priv data
rxrpc: Clean up the resend algorithm
rxrpc: Record probes after transmission and reduce number of time-gets
rxrpc: Use ktimes for call timeout tracking and set the timer lazily
rxrpc: Differentiate PING ACK transmission traces.
rxrpc: Don't permit resending after all Tx packets acked
rxrpc: Parse received packets before dealing with timeouts
rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags
rxrpc: Use rxrpc_txbuf::kvec[0] instead of rxrpc_txbuf::wire
rxrpc: Move rxrpc_send_ACK() to output.c with rxrpc_send_ack_packet()
rxrpc: Don't pick values out of the wire header when setting up security
rxrpc: Split up the DATA packet transmission function
rxrpc: Add a kvec[] to the rxrpc_txbuf struct
rxrpc: Merge together DF/non-DF branches of data Tx function
rxrpc: Do lazy DF flag resetting
rxrpc: Remove atomic handling on some fields only used in I/O thread
rxrpc: Strip barriers and atomics off of timer tracking
rxrpc: Fix the names of the fields in the ACK trailer struct
rxrpc: Note cksum in txbuf
rxrpc: Convert rxrpc_txbuf::flags into a mask and don't use atomics
...
====================
Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the existing parameter and print the address of skbaddr
as other trace functions do.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Printing the addresses can help us identify the exact skb/sk
for those system in which it's not that easy to run BPF program.
As we can see, it already fetches those, then use it directly
and it will print like below:
...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Track the call timeouts as ktimes rather than jiffies as the latter's
granularity is too high and only set the timer at the end of the event
handling function.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
There are three points that transmit PING ACKs and all of them use the same
trace string. Change two of them to use different strings.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
alloc_contig_migrate_range has every information to be able to understand
big contiguous allocation latency. For example, how many pages are
migrated, how many times they were needed to unmap from page tables.
This patch adds the trace event to collect the allocation statistics. In
the field, it was quite useful to understand CMA allocation latency.
[akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com
Signed-off-by: Richard Chang <richardycc@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org.
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
In Android each installed application has a unique UID. Including
the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
Enables identification of the affected process.
- OOM Score
Will allow userspace to get additional insight of the relative kill
priority of the OOM victim. In Android, the oom_score_adj is used to
categorize app state (foreground, background, etc.), which aids in
analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
Amount of memory used by the victim that will, potentially, be freed up
by killing it.
[1] 246dc8fc95:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I'm updating __assign_str() and will be removing the second parameter. To
make sure that it does not break anything, I make sure that it matches the
__string() field, as that is where the string is actually going to be
saved in. To make sure there's nothing that breaks, I added a WARN_ON() to
make sure that what was used in __string() is the same that is used in
__assign_str().
In doing this change, an error was triggered as __assign_str() now expects
the string passed in to be a char * value. I instead had the following
warning:
include/trace/events/qdisc.h: In function ‘trace_event_raw_event_qdisc_reset’:
include/trace/events/qdisc.h:91:35: error: passing argument 1 of 'strcmp' from incompatible pointer type [-Werror=incompatible-pointer-types]
91 | __assign_str(dev, qdisc_dev(q));
That's because the qdisc_enqueue() and qdisc_reset() pass in qdisc_dev(q)
to __assign_str() and to __string(). But that function returns a pointer
to struct net_device and not a string.
It appears that these events are just saving the pointer as a string and
then reading it as a string as well.
Use qdisc_dev(q)->name to save the device instead.
Fixes: a34dac0b90552 ("net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the RPC transaction's svc_rdma_send_ctxt will stay around for
the duration of the RDMA Write operation, the write_info structure
for the Reply chunk can reside in the request's svc_rdma_send_ctxt
instead of being allocated separately.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
From AFS-3.3 a trailer containing extra info was added to the ACK packet
format - but AF_RXRPC has the names of some of the fields mixed up compared
to other AFS implementations.
Rename the struct and the fields to make them match.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Convert the transmission buffer flags into a mask and use | and & rather
than bitops functions (atomic ops are not required as only the I/O thread
can manipulate them once submitted for transmission).
The bottom byte can then correspond directly to the Rx protocol header
flags.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Each Rx protocol packet contains a per-connection monotonically increasing
serial number used to correlate outgoing messages with their replies -
something that can be used for RTT calculation.
Note this value in the rxrpc_txbuf struct in addition to the wire header
and then log it in the rxrpc_retransmit trace for reference.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
In order to get the latency per xprt under the same clientid this patch
adds xprt_id to the tracepoint output.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Existing runtime PM ftrace events (`rpm_suspend`, `rpm_resume`,
`rpm_return_int`) offer limited visibility into the exact timing of device
runtime power state transitions, particularly when asynchronous operations
are involved. When the `rpm_suspend` or `rpm_resume` functions are invoked
with the `RPM_ASYNC` flag, a return value of 0 i.e., success merely
indicates that the device power state request has been queued, not that
the device has yet transitioned.
A new ftrace event, `rpm_status`, is introduced. This event directly logs
the `power.runtime_status` value of a device whenever it changes providing
granular tracking of runtime power state transitions regardless of
synchronous or asynchronous `rpm_suspend` / `rpm_resume` usage.
Signed-off-by: Vilas Bhat <vilasbhat@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison
to ensure that enough freepages are isolated in isolate_freepages(),
however it just decreases the cc->nr_freepages without updating
cc->nr_migratepages in compaction_alloc(), which will waste more CPU
cycles and cause too many freepages to be isolated.
So we should also update the cc->nr_migratepages when allocating or
freeing the freepages to avoid isolating excess freepages. And I can see
fewer free pages are scanned and isolated when running thpcompact on my
Arm64 server:
k6.7 k6.7_patched
Ops Compaction pages isolated 120692036.00 118160797.00
Ops Compaction migrate scanned 131210329.00 154093268.00
Ops Compaction free scanned 1090587971.00 1080632536.00
Ops Compact scan efficiency 12.03 14.26
Moreover, I did not see an obvious latency improvements, this is likely
because isolating freepages is not the bottleneck in the thpcompact test
case.
k6.7 k6.7_patched
Amean fault-both-1 1089.76 ( 0.00%) 1080.16 * 0.88%*
Amean fault-both-3 1616.48 ( 0.00%) 1636.65 * -1.25%*
Amean fault-both-5 2266.66 ( 0.00%) 2219.20 * 2.09%*
Amean fault-both-7 2909.84 ( 0.00%) 2801.90 * 3.71%*
Amean fault-both-12 4861.26 ( 0.00%) 4733.25 * 2.63%*
Amean fault-both-18 7351.11 ( 0.00%) 6950.51 * 5.45%*
Amean fault-both-24 9059.30 ( 0.00%) 9159.99 * -1.11%*
Amean fault-both-30 10685.68 ( 0.00%) 11399.02 * -6.68%*
Link: https://lkml.kernel.org/r/6440493f18da82298152b6305d6b41c2962a3ce6.1708409245.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The timer pull logic needs proper debugging aids. Add tracepoints so the
hierarchical idle machinery can be diagnosed.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de
Current release - regressions:
- nic: intel: fix old compiler regressions
- netfilter: ipset: missing gc cancellations fixed
Current release - new code bugs:
- netfilter: ctnetlink: fix filtering for zone 0
Previous releases - regressions:
- core: fix from address in memcpy_to_iter_csum()
- netfilter: nfnetlink_queue: un-break NF_REPEAT
- af_unix: fix memory leak for dead unix_(sk)->oob_skb in GC.
- devlink: avoid potential loop in devlink_rel_nested_in_notify_work()
- iwlwifi:
- mvm: fix a battery life regression
- fix double-free bug
- mac80211: fix waiting for beacons logic
- nic: nfp: flower: prevent re-adding mac index for bonded port
Previous releases - always broken:
- rxrpc: fix generation of serial numbers to skip zero
- tipc: check the bearer type before calling tipc_udp_nl_bearer_add()
- tunnels: fix out of bounds access when building IPv6 PMTU error
- nic: hv_netvsc: register VF in netvsc_probe if NET_DEVICE_REGISTER missed
- nic: atlantic: fix DMA mapping for PTP hwts ring
Misc:
- selftests: more fixes to deal with very slow hosts
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmXEy4ISHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkd9EQALDZrYm67bPy7TX0+/EXS6wSBe4/ADNN
4tZ+iFnLS/HTKx/YGJmC8pW3VOTgg2+Hko9nfXXQOKXuEPmgMQO8+bYFe1a0ZpPv
1PH7+yq+OCniy16xUG66xv/+pDR5SjN6LuHvFYuCT3AZcmIr3jTXDa+XaCXCXZOu
KOdXZ0RqSNe4hsJoU0lRstSwRzHL0UH1XibahQe6OJet6kI2wa9udMXhecZ4xY1i
7FqRpB7b/vEYlxPTeb/h4U0PYchm1G/z0acV1BZ0+/PjuuvULT0gcWlHJm1X4K1l
IKGibpet1OobQ7MxUjA0zLjcFoybl2AKNcVaBKQty+uKCUfkUIDLMB1cmLvUiCTi
vV2993fvxQrwoZD5Y+LKVaAUjmlyLfkdMwjZ6b7YCmp1ENYeI+liho8xBxGN5eFI
WqbYepOeG4QSoHqHPg6ny1xW7fdVPBYpWM3zrJG3h+SkHwPEOI7j/5tDqHA2rU32
+rNpiB0r0/v54ymO3oahB3ttdA/LxWRls8OjRr8h4cUktwUnGtgW3WPmyHVCl4Q2
xV5B2PZnzxIEkU+UPPPUelZh4Q/wtqtS5oKVT92Io3U6MXRfSC37g75C67p7jCsW
TLV2RdhNk7RyuaybOC5VszZxKBgenOZNdAZZ6KJotYWzM/NQ+NCIKDBpDksM7Hva
hVDYTlZOP+1e
=ihj+
-----END PGP SIGNATURE-----
Merge tag 'net-6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi and netfilter.
Current release - regressions:
- nic: intel: fix old compiler regressions
- netfilter: ipset: missing gc cancellations fixed
Current release - new code bugs:
- netfilter: ctnetlink: fix filtering for zone 0
Previous releases - regressions:
- core: fix from address in memcpy_to_iter_csum()
- netfilter: nfnetlink_queue: un-break NF_REPEAT
- af_unix: fix memory leak for dead unix_(sk)->oob_skb in GC.
- devlink: avoid potential loop in devlink_rel_nested_in_notify_work()
- iwlwifi:
- mvm: fix a battery life regression
- fix double-free bug
- mac80211: fix waiting for beacons logic
- nic: nfp: flower: prevent re-adding mac index for bonded port
Previous releases - always broken:
- rxrpc: fix generation of serial numbers to skip zero
- tipc: check the bearer type before calling tipc_udp_nl_bearer_add()
- tunnels: fix out of bounds access when building IPv6 PMTU error
- nic: hv_netvsc: register VF in netvsc_probe if NET_DEVICE_REGISTER
missed
- nic: atlantic: fix DMA mapping for PTP hwts ring
Misc:
- selftests: more fixes to deal with very slow hosts"
* tag 'net-6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits)
netfilter: nft_set_pipapo: remove scratch_aligned pointer
netfilter: nft_set_pipapo: add helper to release pcpu scratch area
netfilter: nft_set_pipapo: store index in scratch maps
netfilter: nft_set_rbtree: skip end interval element from gc
netfilter: nfnetlink_queue: un-break NF_REPEAT
netfilter: nf_tables: use timestamp to check for set element timeout
netfilter: nft_ct: reject direction for ct id
netfilter: ctnetlink: fix filtering for zone 0
s390/qeth: Fix potential loss of L3-IP@ in case of network issues
netfilter: ipset: Missing gc cancellations fixed
octeontx2-af: Initialize maps.
net: ethernet: ti: cpsw: enable mac_managed_pm to fix mdio
net: ethernet: ti: cpsw_new: enable mac_managed_pm to fix mdio
netfilter: nft_set_pipapo: remove static in nft_pipapo_get()
netfilter: nft_compat: restrict match/target protocol to u16
netfilter: nft_compat: reject unused compat flag
netfilter: nft_compat: narrow down revision to unsigned 8-bits
net: intel: fix old compiler regressions
MAINTAINERS: Maintainer change for rds
selftests: cmsg_ipv6: repeat the exact packet
...
We no longer loop in task_work handling, hence delete the argument from
the tracepoint as it's always 1 and hence not very informative.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're out of space here, and none of the flags are easily reclaimable.
Bump it to 64-bits and re-arrange the struct a bit to avoid gaps.
Add a specific bitwise type for the request flags, io_request_flags_t.
This will help catch violations of casting this value to a smaller type
on 32-bit archs, like unsigned int.
This creates a hole in the io_kiocb, so move nr_tw up and rsrc_node down
to retain needing only cacheline 0 and 1 for non-polled opcodes.
No functional changes intended in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the counting of new acks and nacks when parsing a packet - something
that is used in congestion control.
As the code stands, it merely notes if there are any nacks whereas what we
really should do is compare the previous SACK table to the new one,
assuming we get two successive ACK packets with nacks in them. However, we
really don't want to do that if we can avoid it as the tables might not
correspond directly as one may be shifted from the other - something that
will only get harder to deal with once extended ACK tables come into full
use (with a capacity of up to 8192).
Instead, count the number of nacks shifted out of the old SACK, the number
of nacks retained in the portion still active and the number of new acks
and nacks in the new table then calculate what we need.
Note this ends up a bit of an estimate as the Rx protocol allows acks to be
withdrawn by the receiver and packets requested to be retransmitted.
Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new struct file_lease and move the lease-specific fields from
struct file_lock to it. Convert the appropriate API calls to take
struct file_lease instead, and convert the callers to use them.
There is zero overlap between the lock manager operations for file
locks and the ones for file leases, so split the lease-related
operations off into a new lease_manager_operations struct.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-47-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-35-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Both locks and leases deal with fl_blocker. Switch the fl_blocker
pointer in struct file_lock_core to point to the file_lock_core of the
blocker instead of a file_lock structure.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-26-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert fs/locks.c to access fl_core fields direcly rather than using
the backward-compatibility macros. Most of this was done with
coccinelle, with a few by-hand fixups.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-18-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
and extent handling code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmW/G4YACgkQ8vlZVpUN
gaPTpwf/c/Fk27GV8ge9PQtR6gmir/lyw2qkvK3Z+12aEsblZRmyvElyZWjAuNQG
bciQyltabIPOA4XxfsZOrdgYI42n0rTTFG7bmI0lr+BJM/HRw0tGGMN91FZla0FP
EXv/AiHKCqlT5OFZbD+8n1TzfdOgWotjug1VLteXve3YKjkDgt5IQm/0Gx9hKBld
IR8SrQlD/rYe+VPvaHz5G4u09Ne5pUE5fDj3xE23wxfU5cloVzuVRCSOGWUCTnCW
T0v6sHeKrmiLC8tIOZkBjer4nXC0MOu0p5+geAjwOArc9VJ1Lh2eAkH+GgDOVprx
ahdl2qmbIbacBYECIeQ/+1i78+O1yw==
=CmYr
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
and extent handling code"
* tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
ext4: make ext4_map_blocks() distinguish delalloc only extent
ext4: add a hole extent entry in cache after punch
ext4: correct the hole length returned by ext4_map_blocks()
ext4: convert to exclusive lock while inserting delalloc extents
ext4: refactor ext4_da_map_blocks()
ext4: remove 'needed' in trace_ext4_discard_preallocations
ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
ext4: remove unused return value of ext4_mb_release_group_pa
ext4: remove unused return value of ext4_mb_release_inode_pa
ext4: remove unused return value of ext4_mb_release
ext4: remove unused ext4_allocation_context::ac_groups_considered
ext4: remove unneeded return value of ext4_mb_release_context
ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
ext4: remove unused return value of __mb_check_buddy
ext4: mark the group block bitmap as corrupted before reporting an error
ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
...
In later patches we're going to introduce some macros with names that
clash with fields here. To prevent problems building, just rename the
fields in the trace entry structures.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-2-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add the current batch number in the trace output. When there are
failures, it's important to know which test content resulted in failure.
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
migration/0-18 [000] d..1. 527287.084668: ifs_status: batch: 02, start: 0000, stop: 007f, status: 0000000000007f80
migration/128-785 [128] d..1. 527287.084669: ifs_status: batch: 02, start: 0000, stop: 007f, status: 0000000000007f80
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240125082254.424859-4-ashok.raj@intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Enable the trace function on all HT threads. Currently, the trace is
called from some arbitrary CPU where the test was invoked.
This change gives visibility to the exact errors as seen by each
participating HT threads, and not just what was seen from the primary
thread.
Sample output below.
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
migration/0-18 [000] d..1. 527287.084668: start: 0000, stop: 007f, status: 0000000000007f80
migration/128-785 [128] d..1. 527287.084669: start: 0000, stop: 007f, status: 0000000000007f80
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240125082254.424859-3-ashok.raj@intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
When afs does a lookup, it tries to use FS.InlineBulkStatus to preemptively
look up a bunch of files in the parent directory and cache this locally, on
the basis that we might want to look at them too (for example if someone
does an ls on a directory, they may want want to then stat every file
listed).
FS.InlineBulkStatus can be considered a compound op with the normal abort
code applying to the compound as a whole. Each status fetch within the
compound is then given its own individual abort code - but assuming no
error that prevents the bulk fetch from returning the compound result will
be 0, even if all the constituent status fetches failed.
At the conclusion of afs_do_lookup(), we should use the abort code from the
appropriate status to determine the error to return, if any - but instead
it is assumed that we were successful if the op as a whole succeeded and we
return an incompletely initialised inode, resulting in ENOENT, no matter
the actual reason. In the particular instance reported, a vnode with no
permission granted to be accessed is being given a UAEACCES abort code
which should be reported as EACCES, but is instead being reported as
ENOENT.
Fix this by abandoning the inode (which will be cleaned up with the op) if
file[1] has an abort code indicated and turn that abort code into an error
instead.
Whilst we're at it, add a tracepoint so that the abort codes of the
individual subrequests of FS.InlineBulkStatus can be logged. At the moment
only the container abort code can be 0.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZabMrQAKCRCRxhvAZXjc
ovnUAQDgCOonb1tjtTvC8s8IMDUEoaVYZI91KVfsZQSJYN1sdQD+KfJmX1BhJnWG
l0cEffGfnWGXMZkZqDgLPHUIPzFrmws=
=1b3j
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This extends the netfs helper library that network filesystems can use
to replace their own implementations. Both afs and 9p are ported. cifs
is ready as well but the patches are way bigger and will be routed
separately once this is merged. That will remove lots of code as well.
The overal goal is to get high-level I/O and knowledge of the page
cache and ouf of the filesystem drivers. This includes knowledge about
the existence of pages and folios
The pull request converts afs and 9p. This removes about 800 lines of
code from afs and 300 from 9p. For 9p it is now possible to do writes
in larger than a page chunks. Additionally, multipage folio support
can be turned on for 9p. Separate patches exist for cifs removing
another 2000+ lines. I've included detailed information in the
individual pulls I took.
Summary:
- Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
calls to prevent these from happening at the same time.
- Support for direct and unbuffered I/O.
- Support for write-through caching in the page cache.
- O_*SYNC and RWF_*SYNC writes use write-through rather than writing
to the page cache and then flushing afterwards.
- Support for write-streaming.
- Support for write grouping.
- Skip reads for which the server could only return zeros or EOF.
- The fscache module is now part of the netfs library and the
corresponding maintainer entry is updated.
- Some helpers from the fscache subsystem are renamed to mark them as
belonging to the netfs library.
- Follow-up fixes for the netfs library.
- Follow-up fixes for the 9p conversion"
* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
netfs: Fix wrong #ifdef hiding wait
cachefiles: Fix signed/unsigned mixup
netfs: Fix the loop that unmarks folios after writing to the cache
netfs: Fix interaction between write-streaming and cachefiles culling
netfs: Count DIO writes
netfs: Mark netfs_unbuffered_write_iter_locked() static
netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
netfs: Rearrange netfs_io_subrequest to put request pointer first
9p: Use length of data written to the server in preference to error
9p: Do a couple of cleanups
9p: Fix initialisation of netfs_inode for 9p
cachefiles: Fix __cachefiles_prepare_write()
9p: Use netfslib read/write_iter
afs: Use the netfs write helpers
netfs: Export the netfs_sreq tracepoint
netfs: Optimise away reads above the point at which there can be no data
netfs: Implement a write-through caching option
netfs: Provide a launder_folio implementation
netfs: Provide a writepages implementation
netfs, cachefiles: Pass upper bound length to allow expansion
...
As 'needed' to trace_ext4_discard_preallocations is always 0 which
is meaningless. Just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>