44281 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Vincent Guittot
|
cd18bec668 |
sched/fair: Fix update of rd->sg_overutilized
sg_overloaded is used instead of sg_overutilized to update rd->sg_overutilized. Fixes: 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org |
||
Alexander Gordeev
|
89d6910cc5 |
sched/vtime: Get rid of generic vtime_task_switch() implementation
The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com |
||
Ingo Molnar
|
4475cd8bfd |
sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an unnecessary complication that only make the code harder to read and slower. We only ever set them separately: thule:~/tip> git grep SG_OVER kernel/sched/ kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED); kernel/sched/fair.c: *sg_status |= SG_OVERLOADED; kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED; kernel/sched/fair.c: *sg_status |= SG_OVERLOADED; kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED); kernel/sched/fair.c: sg_status & SG_OVERUTILIZED); kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) { kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED); kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */ kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */ kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED); And use them separately, which results in suboptimal code: /* update overload indicator if we are at root domain */ set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED); /* Update over-utilization (tipping point, U >= 0) indicator */ set_rd_overutilized_status(env->dst_rq->rd, Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats() and its lower level functions, and change all of them to 'bool'. Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Qais Yousef <qyousef@layalina.io> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com |
||
Ingo Molnar
|
4d0a63e5b8 |
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
The _status() postfix has no real meaning, simplify the naming and harmonize it with set_rd_overloaded(). Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com |
||
Ingo Molnar
|
7bda10ba7f |
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
Follow the rename of the root_domain::overloaded flag. Note that this also matches the SG_OVERUTILIZED flag better. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com |
||
Ingo Molnar
|
76cc4f9148 |
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
Follow the rename of the root_domain::overloaded flag. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com |
||
Ingo Molnar
|
dfb83ef7b8 |
sched/fair: Rename root_domain::overload to ::overloaded
It is silly to use an ambiguous noun instead of a clear adjective when naming such a flag ... Note how root_domain::overutilized already used a proper adjective. rd->overloaded is now set to 1 when the root domain is overloaded. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com |
||
Shrikanth Hegde
|
caac629172 |
sched/fair: Use helper functions to access root_domain::overload
Introduce two helper functions to access & set the root_domain::overload flag: get_rd_overload() set_rd_overload() To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods. No change in functionality intended. [ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ] Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com |
||
Shrikanth Hegde
|
c628db0a68 |
sched/fair: Check root_domain::overload value before update
The root_domain::overload flag is 1 when there's any rq in the root domain that has 2 or more running tasks. (Ie. it's overloaded.) The root_domain structure itself is a global structure per cpuset island. The ::overload flag is maintained the following way: - Set when adding a second task to the runqueue. - It is cleared in update_sd_lb_stats() during load balance, if none of the rqs have 2 or more running tasks. This flag is used during newidle balance to see if its worth doing a full load balance pass, which can be an expensive operation. If it is set, then newidle balance will try to aggressively pull a task. Since commit: 630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats parameters") ::overload is being written unconditionally, even if it has the same value. The change in value of this depends on the workload, but on typical workloads, it doesn't change all that often: a system is either dominantly overloaded for substantial amounts of time, or not. Extra writes to this semi-global structure cause unnecessary overhead, extra bus traffic, etc. - so avoid it as much as possible. Perf probe stats show that it's worth making this change (numbers are with patch applied): 1M probe:sched_balance_newidle_L38 139 probe:update_sd_lb_stats_L53 <====== 1->0 writes 129K probe:add_nr_running_L12 74 probe:add_nr_running_L13 <====== 0->1 writes 54K probe:update_sd_lb_stats_L50 <====== reads These numbers prove that actual change in the ::overload value is (much) less frequent: L50 is much larger at ~54,000 accesses vs L53+L13 of 139+74. [ mingo: Rewrote the changelog. ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com |
||
Shrikanth Hegde
|
902e786c4a |
sched/fair: Combine EAS check with root_domain::overutilized access
Access to root_domainoverutilized is always used with sched_energy_enabled in the pattern: if (sched_energy_enabled && !overutilized) do something So modify the helper function to utilize this pattern. This is more readable code as it would say, do something when root domain is not overutilized. This function always return true when EAS is disabled. No change in functionality intended. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240326152616.380999-1-sshegde@linux.ibm.com |
||
Shrikanth Hegde
|
c829d6818b |
sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
newidle(CPU_NEWLY_IDLE) balancing doesn't stop the load-balancing if the continue_balancing flag is reset, but the other two balancing (IDLE, BUSY) cases do that. newidle balance stops the load balancing if rq has a task or there is wakeup pending. The same checks are present in should_we_balance for newidle. Hence use the return value and simplify continue_balancing mechanism for newidle. Update the comment surrounding it as well. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20240325153926.274284-1-sshegde@linux.ibm.com |
||
Shrikanth Hegde
|
d0f5d3cefc |
sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized
The root_domain::overutilized field is READ_ONCE() accessed in multiple places, which could be simplified with a helper function. This might also make it more apparent that it needs to be used only in case of EAS. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-3-sshegde@linux.ibm.com |
||
Shrikanth Hegde
|
be3a51e68f |
sched/fair: Add EAS checks before updating root_domain::overutilized
root_domain::overutilized is only used for EAS(energy aware scheduler) to decide whether to do load balance or not. It is not used if EAS not possible. Currently enqueue_task_fair and task_tick_fair accesses, sometime updates this field. In update_sd_lb_stats it is updated often. This causes cache contention due to true sharing and burns a lot of cycles. ::overload and ::overutilized are part of the same cacheline. Updating it often invalidates the cacheline. That causes access to ::overload to slow down due to false sharing. Hence add EAS check before accessing/updating this field. EAS check is optimized at compile time or it is a static branch. Hence it shouldn't cost much. With the patch, both enqueue_task_fair and newidle_balance don't show up as hot routines in perf profile. 6.8-rc4: 7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair 6.78% s [kernel.vmlinux] [k] newidle_balance +patch: 0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair 0.00% swapper [kernel.vmlinux] [k] newidle_balance While at it: trace_sched_overutilized_tp expect that second argument to be bool. So do a int to bool conversion for that. Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-2-sshegde@linux.ibm.com |
||
Qais Yousef
|
58eeb2d79b |
sched/fair: Don't double balance_interval for migrate_misfit
It is not necessarily an indication of the system being busy and requires a backoff of the load balancer activities. But pushing it high could mean generally delaying other misfit activities or other type of imbalances. Also don't pollute nr_balance_failed because of misfit failures. The value is used for enabling cache hot migration and in migrate_util/load types. None of which should be impacted (skewed) by misfit failures. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-5-qyousef@layalina.io |
||
Qais Yousef
|
fa427e8e53 |
sched/topology: Remove root_domain::max_cpu_capacity
The value is no longer used as we now keep track of max_allowed_capacity for each task instead. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io |
||
Qais Yousef
|
22d5607400 |
sched/fair: Check if a task has a fitting CPU when updating misfit
If a misfit task is affined to a subset of the possible CPUs, we need to verify that one of these CPUs can fit it. Otherwise the load balancer code will continuously trigger needlessly leading the balance_interval to increase in return and eventually end up with a situation where real imbalances take a long time to address because of this impossible imbalance situation. This can happen in Android world where it's common for background tasks to be restricted to little cores. Similarly if we can't fit the biggest core, triggering misfit is pointless as it is the best we can ever get on this system. To be able to detect that; we use asym_cap_list to iterate through capacities in the system to see if the task is able to run at a higher capacity level based on its p->cpus_ptr. We do that when the affinity change, a fair task is forked, or when a task switched to fair policy. We store the max_allowed_capacity in task_struct to allow for cheap comparison in the fast path. Improve check_misfit_status() function by removing redundant checks. misfit_task_load will be 0 if the task can't move to a bigger CPU. And nohz_balancer_kick() already checks for cpu_check_capacity() before calling check_misfit_status(). Test: ===== Add trace_printk("balance_interval = %lu\n", interval) in get_sd_balance_interval(). run if [ "$MASK" != "0" ]; then adb shell "taskset -a $MASK cat /dev/zero > /dev/null" fi sleep 10 // parse ftrace buffer counting the occurrence of each valaue Where MASK is either: * 0: no busy task running * 1: busy task is pinned to 1 cpu; handled today to not cause misfit * f: busy task pinned to little cores, simulates busy background task, demonstrates the problem to be fixed Results: ======== Note how occurrence of balance_interval = 128 overshoots for MASK = f. BEFORE ------ MASK=0 1 balance_interval = 175 120 balance_interval = 128 846 balance_interval = 64 55 balance_interval = 63 215 balance_interval = 32 2 balance_interval = 31 2 balance_interval = 16 4 balance_interval = 8 1870 balance_interval = 4 65 balance_interval = 2 MASK=1 27 balance_interval = 175 37 balance_interval = 127 840 balance_interval = 64 167 balance_interval = 63 449 balance_interval = 32 84 balance_interval = 31 304 balance_interval = 16 1156 balance_interval = 8 2781 balance_interval = 4 428 balance_interval = 2 MASK=f 1 balance_interval = 175 1328 balance_interval = 128 44 balance_interval = 64 101 balance_interval = 63 25 balance_interval = 32 5 balance_interval = 31 23 balance_interval = 16 23 balance_interval = 8 4306 balance_interval = 4 177 balance_interval = 2 AFTER ----- Note how the high values almost disappear for all MASK values. The system has background tasks that could trigger the problem without simulate it even with MASK=0. MASK=0 103 balance_interval = 63 19 balance_interval = 31 194 balance_interval = 8 4827 balance_interval = 4 179 balance_interval = 2 MASK=1 131 balance_interval = 63 1 balance_interval = 31 87 balance_interval = 8 3600 balance_interval = 4 7 balance_interval = 2 MASK=f 8 balance_interval = 127 182 balance_interval = 63 3 balance_interval = 31 9 balance_interval = 16 415 balance_interval = 8 3415 balance_interval = 4 21 balance_interval = 2 Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-3-qyousef@layalina.io |
||
Qais Yousef
|
77222b0d12 |
sched/topology: Export asym_cap_list
So that we can use it to iterate through available capacities in the system. Sort asym_cap_list in descending order as expected users are likely to be interested on the highest capacity first. Make the list RCU protected to allow for cheap access in hot paths. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io |
||
Ingo Molnar
|
f4566a1e73 |
Linux 6.9-rc1
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYAlq0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYqwH/0fb4pRbVtULpiIK Cs7/e/IWzRRWLBq+Jj2KVVTxwjyiKFNOq6K/CHHnljIWo1yN2CIWeOgbHfTI0WfN xmBdJP7OtK8MCN9PwwoWhZxMLcyv4pFCERrrkGa7AD+cdN4j/ytQ3mH5V8f/21fd rnpQSdpgGXB2SSMHd520Y+e56+gxrrTmsDXjZWM08Wt0bbqAWJrjNe58BMz5hI1t yQtcgYRTdUuZBn5TMkT99lK9EFQslV38YCo7RUP5D0DWXS1jSfWlgnCD1Nc1ziF4 ps/xPdUMDJAc5Tslg/hgJOciSuLqgMzIUsVgZrKysuu3NhwDY1LDWGORmH1t8E8W RC25950= =F+01 -----END PGP SIGNATURE----- Merge tag 'v6.9-rc1' into sched/core, to pick up fixes and to refresh the branch Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
864ad046c1 |
dma-mapping fixes for Linux 6.9
This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now. -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmX/bmILHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYOuKQ//cUR3EywszAc04x8dIYsfegFGdQxUeJD0+1elAPss ELiqrlg5A/Yn4uHKpXjWbvJ+v1Ywh3o8+vlgUiG4aFeg4xEd+FsJqm2SDa3jhdMP 2hV8pwB92kpkKCxyCAqx8O/4o4fY++KCFsOtnammEudFjurJaCrRTlauOn6D1t/i JsBYCFtjFIhIPHQe7jmZ6dNiLEfiIJ+q8ImW+UxuB+gOGgU8C4VVW3tHuo3KeU7n yVOcz4yJrQ4xYzG3RKtaU0FE0ybA860xwiA5oPvqpI9A2ISGovv7ik0QCUlHXhff z+iL8Lj/KsOucq5pBDhbRYeN2n4VVogEwb/hut6mgyqj1ESjqeZaLioVHqOTDbmB +vNTVBt6OGTOq1YkNKttK9vBBXs5RdZSBalzBG/QO1ewmrNVVZ7z8fWXVRDipoIl sAIXmI8xAy5TNL6UbJ+RDfYeLlTzHjXGKQGB49gumOA8s4w5P5v9diYegX6GcVZV PKkYLOvprwcyi8Xxx2mNxFDxh+LWqzMYqzwsN7AoRTW4TRc7Tel0G6Axs+V/cL/Y 23IHfFfT2HqDUM5PuBfUcgCrtw1hinuD80xqXVcvaU+AYoQhrGHJFLHkj6lTwV2b hmuul170froI2A/vm8yGGqcn2Me55AexlpMab+UWL+iisGtqFTWi9b9vK/2Vi+Zj wBg= =Xaob -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: "This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now" * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE iommu/dma: Force swiotlb_max_mapping_size on an untrusted device swiotlb: Fix alignment checks when both allocation and DMA masks are present swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() swiotlb: Enforce page alignment in swiotlb_alloc() swiotlb: Fix double-allocation of slots due to broken alignment handling |
||
Linus Torvalds
|
70293240c5 |
Two regression fixes for the timer and timer migration code:
1) Prevent endless timer requeuing which is caused by two CPUs racing out of idle. This happens when the last CPU goes idle and therefore has to ensure to expire the pending global timers and some other CPU come out of idle at the same time and the other CPU wins the race and expires the global queue. This causes the last CPU to chase ghost timers forever and reprogramming it's clockevent device endlessly. Cure this by re-evaluating the wakeup time unconditionally. 2) The split into local (pinned) and global timers in the timer wheel caused a regression for NOHZ full as it broke the idle tracking of global timers. On NOHZ full this prevents an self IPI being sent which in turn causes the timer to be not programmed and not being expired on time. Restore the idle tracking for the global timer base so that the self IPI condition for NOHZ full is working correctly again. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmX/Mn8THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYX0D/93fKa9qp+qBs72Vvctj4bJuk6auel7 GRWx3Vc//kk4VOJFCteQ2eykr1/fsLuibPb67iiKp41stvaWKeJPj0kD+RUuVf8E dPGPpPU+qY5ynhoiqJekZL7+5NSA48y4bDc00a4U31MPpEcJB5y94zFOCKMiCWtk 6Tf6I6168bsSFYvKqb2LImVoowu/bf7bXLVUk1HcdNnSC7bfx+yN8nkQ1zSy3K2M IyE1CQqMyDmdfKW9Vs68ooTIpuA0n7bxOuXbVaFdJyiJ035v3Z3+m2vQmrHHLDdz MfqHbFDEomDC+zfiugFvuxyxLIi2Gf/NXPibu6OxLkVe2Pu1KUJkhFbgZVUR2W6A EU6SmZr77zMPAMZyqG8OJqTqlCPiJfJX2KMWDF+ezbXBt+sbMe6LfazPqj9TopnN /ECMCt77xl1POCEnP81hPWizKsqf8HCTDDZEi9UlqbIxT3TgrZFlTnKfmmciwWiP uGoUXgZmi8qJ+lSGNTVUgbTmIbazvUz43sKgfndUy2yxeCb/SBlx1/8Ys2ntszOy XDOF8QroPH0zXlBaYo0QVZbOdB4O0/En1qZuGScBmoUY7bRr0NRD/C3ObhvQHI7C iguAsnB+zirwwZSTDzwhQDhXAtWgSaqBB7pb4aCDxC0AvnKtL5HKdDNeIafpPJeA 4Xh40iu44u/V9w== =lY5O -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two regression fixes for the timer and timer migration code: - Prevent endless timer requeuing which is caused by two CPUs racing out of idle. This happens when the last CPU goes idle and therefore has to ensure to expire the pending global timers and some other CPU come out of idle at the same time and the other CPU wins the race and expires the global queue. This causes the last CPU to chase ghost timers forever and reprogramming it's clockevent device endlessly. Cure this by re-evaluating the wakeup time unconditionally. - The split into local (pinned) and global timers in the timer wheel caused a regression for NOHZ full as it broke the idle tracking of global timers. On NOHZ full this prevents an self IPI being sent which in turn causes the timer to be not programmed and not being expired on time. Restore the idle tracking for the global timer base so that the self IPI condition for NOHZ full is working correctly again" * tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix removed self-IPI on global timer's enqueue in nohz_full timers/migration: Fix endless timer requeue after idle interrupts |
||
Linus Torvalds
|
976b029d06 |
A single fix for the generic entry code:
THe trace_sys_enter() tracepoint can modify the syscall number via kprobes or BPF in pt_regs, but that requires that the syscall number is re-evaluted from pt_regs after the tracepoint. A seccomp fix in that area removed the re-evaluation so the change does not take effect as the code just uses the locally cached number. Restore the original behaviour by re-evaluating the syscall number after the tracepoint. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmX/LJoTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodkJEACFbJqxdlpoWvc8lpYd0htqP+EyN6li quMk+acE8IOtiqOetphOrmKGzBYdtpNu+7uWVeRTyCarm2Gb5lX1GnIMwyEK8R25 R+fVbPaECldSUKypAbTze2RpJl/vp2jeKhCpOn3A5MpITd0vV9SEs+9hecBz3ucZ CXB4vPR7n8VxHX5ArbHqe05fF5TnSZ4CnpeWONQghGsEaSJ87XrljVfC/fxmb7mZ yB2KAIKeYqslvoFoIWSAJk6xvwcGC08vY8mxkM6n61yXmAZ33BWvpnafGHsstF/V gT9Kmvs7F6Tn6V5j0SUuC4/nUO/dtEDhvra793OSvR/rB3+AmZfECPQaDNIJqNaG 0pG558E+hrnpV/YWGzRYJCtqUgtTS0kV5JJY9ffoJRQB3Nb7GCeyykDcs823Yt7j A5o3JUQU0kW4X08hGFMU6DSFppxp9nS0eXDoKMlVRmDB8l30bLK+tMPMaID6yGse 0UafVMWRhtM85mqNVZIZKTsRuONvT+8PV2UH5cNryGmjjbykLmjSLFRvXBmk2Jl+ uhcc0QaApQleA/H/fSNoyPFOPCqvDBEo4xM9e3ypc8TflOQegZfpzrjAITHzv9hY r4ci8faKXTR7jY70a6TiEgcGYYLYDmyvOL0I+TwYA9qHB7tEaJMF6YfU8UDsM9K2 Gb4l6n8fNQubTg== =iW0L -----END PGP SIGNATURE----- Merge tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry fix from Thomas Gleixner: "A single fix for the generic entry code: The trace_sys_enter() tracepoint can modify the syscall number via kprobes or BPF in pt_regs, but that requires that the syscall number is re-evaluted from pt_regs after the tracepoint. A seccomp fix in that area removed the re-evaluation so the change does not take effect as the code just uses the locally cached number. Restore the original behaviour by re-evaluating the syscall number after the tracepoint" * tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Respect changes to system call number by trace_sys_enter() |
||
Linus Torvalds
|
c150b809f7 |
RISC-V Patches for the 6.9 Merge Window
* Support for various vector-accelerated crypto routines. * Hibernation is now enabled for portable kernel builds. * mmap_rnd_bits_max is larger on systems with larger VAs. * Support for fast GUP. * Support for membarrier-based instruction cache synchronization. * Support for the Andes hart-level interrupt controller and PMU. * Some cleanups around unaligned access speed probing and Kconfig settings. * Support for ACPI LPI and CPPC. * Various cleanus related to barriers. * A handful of fixes. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmX9icgTHHBhbG1lckBk YWJiZWx0LmNvbQAKCRAuExnzX7sYib+UD/4xyL6UMixx6A06BVBL9UT4vOrxRvNr JIihG5y5QNMjes9DHWL35mZTMqFtQ0tq94ViWFLmJWloV/8KRVM2C9R9KX7vplf3 M/OwvP106spxgvNHoeQbycgs42RU1t2mpqT7N1iK2hCjqieP3vLn6hsSLXWTAG0L 3gQbQw6XCLC3hPyLq+nbFY2i4faeCmpXWmixoy/IvQ5calZQrRU0LNlP6lcMBhVo uocjG0uGAhrahw2s81jxcMZcxa3AvUCiplapdD5H5v9rBM85SkYJj2Q9SqdSorkb xzuimRnKPI5s47yM3pTfZY0qnQUYHV7PXXuw4WujpCQVQdhaG+Ggq63UUZA61J9t IzZK2zdcfHqICrGTtXImUzRT3dcc3oq+IFq4tTY+rEJm29hrXkAtx+qBm5xtMvax fJz5feJ/iT0u7MDj4Oq24n+Kpl+Olm+MJaZX3m5Ovi/9V6a9iK9HXqxg9/Fs0fMO +J/0kTgd8Vu9CYH7KNWz3uztcO9eMAH3VyzuXuab4BGj1i1Y/9EjpALQi7rDN73S OsYQX6NnzMkBV4dvElJVLXiPlvNlMHZZwdak5CqPb48jaJu6iiIZAuvOrG6/naGP wnQSLVA2WWWoOkl3AJhxfpa11CLhbMl9E2gYm1VtNvASXoSFIxlAq1Yv3sG8yjty 4ZT0rYFJOstYiQ== =3dL5 -----END PGP SIGNATURE----- Merge tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for various vector-accelerated crypto routines - Hibernation is now enabled for portable kernel builds - mmap_rnd_bits_max is larger on systems with larger VAs - Support for fast GUP - Support for membarrier-based instruction cache synchronization - Support for the Andes hart-level interrupt controller and PMU - Some cleanups around unaligned access speed probing and Kconfig settings - Support for ACPI LPI and CPPC - Various cleanus related to barriers - A handful of fixes * tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits) riscv: Fix syscall wrapper for >word-size arguments crypto: riscv - add vector crypto accelerated AES-CBC-CTS crypto: riscv - parallelize AES-CBC decryption riscv: Only flush the mm icache when setting an exec pte riscv: Use kcalloc() instead of kzalloc() riscv/barrier: Add missing space after ',' riscv/barrier: Consolidate fence definitions riscv/barrier: Define RISCV_FULL_BARRIER riscv/barrier: Define __{mb,rmb,wmb} RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ cpufreq: Move CPPC configs to common Kconfig and add RISC-V ACPI: RISC-V: Add CPPC driver ACPI: Enable ACPI_PROCESSOR for RISC-V ACPI: RISC-V: Add LPI driver cpuidle: RISC-V: Move few functions to arch/riscv riscv: Introduce set_compat_task() in asm/compat.h riscv: Introduce is_compat_thread() into compat.h riscv: add compile-time test into is_compat_task() riscv: Replace direct thread flag check with is_compat_task() riscv: Improve arch_get_mmap_end() macro ... |
||
Linus Torvalds
|
3faae16b5a |
RTC for 6.9
Subsytem: - rtc_class is now const Drivers: - ds1511: driver cleanup, set date and time range and alarm offset limit - max31335: fix interrupt handler - pcf8523: improve suspend support -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmX8uCcACgkQY6TcMGxw OjI1hA//fK1Cs2eDL2E7e8bQCJtDJ+gdzVT210LwT9TCOgUV+arqLjF/i/pq8m5j CK/ukk5OLZL3v5dS7rsOZAiTg1N9Q+3SEoaTP029kRfA2SxOE2isR9W5nNAvyJC0 L4Y4YEcEMsJpluK94G1/4ult1We9vCGSlqlNknChp3BL5ELxg7T/Ea5wda2OLSRx +ZgQpB6O+LIIQucm8mgCAlucMJrAj4+qEF7HO0AeElDh+md4OvjrF8Y30vTfs82L hGazWQgu8J4Jmy4u6q00s5IhfsvwNZVYgaASv8ivB8/9qgQLktZ6HFHRSwloRXg2 o9OqcM/OR7Lw4taH/6pnVQ37UUAQ/Vy9QJy2GHza1uDzlBBjtqVz8zFqheu/rmnJ 3AX7hlowBx9fPpxmStP2Ac9WIvtepV25eOlDDSlDiS8/4T/DRJHwpRQghSTDnFJ4 fRU1HFbaj3ZKRgBzu5zJU4XvQO4X/DKxSHVMy+rVtsndWzDa8q11QZrQJAP4ZpLM /+zfar2RFnsnop1Dg9cz/hq0gWbQUMJ16qN3cFYl3jVVzC3JSld7B0+DhZdII1Lg YsCXuUQusnOTFm9GtyC6doHgmC3Dy2wentz6Vn6ynyBOR4NKjJPkI7PICdgzkIDi AJI9qSxQcT6gOSw2PAlzOcYb7AD3cOJCeg3ukt7aDl8S38RKR3w= =Is7n -----END PGP SIGNATURE----- Merge tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "Subsytem: - rtc_class is now const Drivers: - ds1511: cleanup, set date and time range and alarm offset limit - max31335: fix interrupt handler - pcf8523: improve suspend support" * tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (28 commits) MAINTAINER: Include linux-arm-msm for Qualcomm RTC patches dt-bindings: rtc: zynqmp: Add support for Versal/Versal NET SoCs rtc: class: make rtc_class constant dt-bindings: rtc: abx80x: Improve checks on trickle charger constraints MAINTAINERS: adjust file entry in ARM/Mediatek RTC DRIVER rtc: nct3018y: fix possible NULL dereference rtc: max31335: fix interrupt status reg rtc: mt6397: select IRQ_DOMAIN instead of depending on it dt-bindings: rtc: abx80x: convert to yaml rtc: m41t80: Use the unified property API get the wakeup-source property dt-bindings: at91rm9260-rtt: add sam9x7 compatible dt-bindings: rtc: convert MT7622 RTC to the json-schema dt-bindings: rtc: convert MT2717 RTC to the json-schema rtc: pcf8523: add suspend handlers for alarm IRQ rtc: ds1511: set alarm offset limit rtc: ds1511: set range rtc: ds1511: drop inline/noinline hints rtc: ds1511: rename pdata rtc: ds1511: implement ds1511_rtc_read_alarm properly rtc: ds1511: remove partial alarm support ... |
||
Linus Torvalds
|
cba9ffdb99 |
Including fixes from CAN, netfilter, wireguard and IPsec.
Current release - regressions: - rxrpc: fix use of page_frag_alloc_align(), it changed semantics and we added a new caller in a different subtree - xfrm: allow UDP encapsulation only in offload modes Current release - new code bugs: - tcp: fix refcnt handling in __inet_hash_connect() - Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets", conflicted with some expectations in BPF uAPI Previous releases - regressions: - ipv4: raw: fix sending packets from raw sockets via IPsec tunnels - devlink: fix devlink's parallel command processing - veth: do not manipulate GRO when using XDP - esp: fix bad handling of pages from page_pool Previous releases - always broken: - report RCU QS for busy network kthreads (with Paul McK's blessing) - tcp/rds: fix use-after-free on netns with kernel TCP reqsk - virt: vmxnet3: fix missing reserved tailroom with XDP Misc: - couple of build fixes for Documentation Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmX8bXsACgkQMUZtbf5S IrsfBg/+KzrEx0tB/Af57ZZGZ5PMjPy+XFDox4iFfHm338UFuGXVvZrXd7G+6YkH ZwWeF5YDPKzwIEiZ5D3hewZPlkLH0Eg88q74chlE0gUv7t1jhuQHUdIVeFnPcLbN t/8AcCZCJ2fENbr1iNnzZON1RW0fVOl+SDxhiSYFeFqii6FywDfqWL/h0u86H/AF KRktgb0LzH0waH6IiefVV1NZyjnZwmQ6+UVQerTzUnQmWhV1xQKoO3MQpZuFRvr6 O+kPZMkrqnTCCy7RO1BexS5cefqc80i5Z25FLGcaHgpnYd2pDNDMMxqrhqO9Y0Pv 6u/tLgRxzVUDXWouzREIRe50Z9GJswkg78zilAhpqYiHRjd8jaBH6y+9mhGFc7F8 iVAx02WfJhlk0aynFf2qZmR7PQIb9XjtFJ7OAeJrno9UD7zAubtikGM/6m6IZfRV TD1mze95RVnNjbHZMeg6oNLFUMJXVTobtvtqk5pTQvsNsmSYGFvkvWC5/P6ycyYt pMx6E0PA/ZCnQAlThCOCzFa5BO+It3RJHcQJhgbOzHrlWKwmrjBKcKJcLLcxFSUt 4wwjdEcG1Bo2wdnsjwsQwJDHQW+M9TSLdLM3YVptM9jbqOMizoqr6/xSykg3H4wZ t/dSiYSsEr06z7lvwbAjUXJ/mfszZ+JsVAFXAN7ahcM4OZb5WTQ= =gpLl -----END PGP SIGNATURE----- Merge tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from CAN, netfilter, wireguard and IPsec. I'd like to highlight [ lowlight? - Linus ] Florian W stepping down as a netfilter maintainer due to constant stream of bug reports. Not sure what we can do but IIUC this is not the first such case. Current release - regressions: - rxrpc: fix use of page_frag_alloc_align(), it changed semantics and we added a new caller in a different subtree - xfrm: allow UDP encapsulation only in offload modes Current release - new code bugs: - tcp: fix refcnt handling in __inet_hash_connect() - Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets", conflicted with some expectations in BPF uAPI Previous releases - regressions: - ipv4: raw: fix sending packets from raw sockets via IPsec tunnels - devlink: fix devlink's parallel command processing - veth: do not manipulate GRO when using XDP - esp: fix bad handling of pages from page_pool Previous releases - always broken: - report RCU QS for busy network kthreads (with Paul McK's blessing) - tcp/rds: fix use-after-free on netns with kernel TCP reqsk - virt: vmxnet3: fix missing reserved tailroom with XDP Misc: - couple of build fixes for Documentation" * tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits) selftests: forwarding: Fix ping failure due to short timeout MAINTAINERS: step down as netfilter maintainer netfilter: nf_tables: Fix a memory leak in nf_tables_updchain net: dsa: mt7530: fix handling of all link-local frames net: dsa: mt7530: fix link-local frames that ingress vlan filtering ports bpf: report RCU QS in cpumap kthread net: report RCU QS on threaded NAPI repolling rcu: add a helper to report consolidated flavor QS ionic: update documentation for XDP support lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc netfilter: nf_tables: do not compare internal table flags on updates netfilter: nft_set_pipapo: release elements in clone only from destroy path octeontx2-af: Use separate handlers for interrupts octeontx2-pf: Send UP messages to VF only when VF is up. octeontx2-pf: Use default max_active works instead of one octeontx2-pf: Wait till detach_resources msg is complete octeontx2: Detect the mbox up or down message via register devlink: fix port new reply cmd type tcp: Clear req->syncookie in reqsk_alloc(). net/bnx2x: Prevent access to a freed page in page_pool ... |
||
Linus Torvalds
|
1d35aae78f |
Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V dWH7BPecS44h4KXx =tFdl -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ... |
||
Linus Torvalds
|
241590e5a1 |
Driver core changes for 6.9-rc1
Here is the "big" set of driver core and kernfs changes for 6.9-rc1. Nothing all that crazy here, just some good updates that include: - automatic attribute group hiding from Dan Williams (he fixed up my horrible attempt at doing this.) - kobject lock contention fixes from Eric Dumazet - driver core cleanups from Andy - kernfs rcu work from Tejun - fw_devlink changes to resolve some reported issues - other minor changes, all details in the shortlog All of these have been in linux-next for a long time with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwsHg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ynT4ACePcNRAsYrINlOPPKPHimJtyP01yEAn0pZYnj2 0/UpqIqf3HVPu7zsLKTa =vR9S -----END PGP SIGNATURE----- Merge tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core and kernfs changes for 6.9-rc1. Nothing all that crazy here, just some good updates that include: - automatic attribute group hiding from Dan Williams (he fixed up my horrible attempt at doing this.) - kobject lock contention fixes from Eric Dumazet - driver core cleanups from Andy - kernfs rcu work from Tejun - fw_devlink changes to resolve some reported issues - other minor changes, all details in the shortlog All of these have been in linux-next for a long time with no reported issues" * tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits) device: core: Log warning for devices pending deferred probe on timeout driver: core: Use dev_* instead of pr_* so device metadata is added driver: core: Log probe failure as error and with device metadata of: property: fw_devlink: Add support for "post-init-providers" property driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link driver core: Adds flags param to fwnode_link_add() debugfs: fix wait/cancellation handling during remove device property: Don't use "proxy" headers device property: Move enum dev_dma_attr to fwnode.h driver core: Move fw_devlink stuff to where it belongs driver core: Drop unneeded 'extern' keyword in fwnode.h firmware_loader: Suppress warning on FW_OPT_NO_WARN flag sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group. firmware_loader: introduce __free() cleanup hanler platform-msi: Remove usage of the deprecated ida_simple_xx() API sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE() sysfs: Document new "group visible" helpers sysfs: Fix crash on empty group attributes array sysfs: Introduce a mechanism to hide static attribute_groups sysfs: Introduce a mechanism to hide static attribute_groups ... |
||
Linus Torvalds
|
3bcb0bf65c |
TTY/Serial driver update for 6.9-rc1
Here is the big set of TTY/Serial driver updates and cleanups for 6.9-rc1. Included in here are: - more tty cleanups from Jiri - loads of 8250 driver cleanups from Andy - max310x driver updates - samsung serial driver updates - uart_prepare_sysrq_char() updates for many drivers - platform driver remove callback void cleanups - stm32 driver updates - other small tty/serial driver updates All of these have been in linux-next for a long time with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwqow8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ynNegCffxTbsnbMGjWhVrQ326IJx/DFvNMAoI9csigv m+G3RzefzZLRx8nAma0c =GMfc -----END PGP SIGNATURE----- Merge tag 'tty-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial driver updates from Greg KH: "Here is the big set of TTY/Serial driver updates and cleanups for 6.9-rc1. Included in here are: - more tty cleanups from Jiri - loads of 8250 driver cleanups from Andy - max310x driver updates - samsung serial driver updates - uart_prepare_sysrq_char() updates for many drivers - platform driver remove callback void cleanups - stm32 driver updates - other small tty/serial driver updates All of these have been in linux-next for a long time with no reported issues" * tag 'tty-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits) dt-bindings: serial: stm32: add power-domains property serial: 8250_dw: Replace ACPI device check by a quirk serial: Lock console when calling into driver before registration serial: 8250_uniphier: Switch to use uart_read_port_properties() serial: 8250_tegra: Switch to use uart_read_port_properties() serial: 8250_pxa: Switch to use uart_read_port_properties() serial: 8250_omap: Switch to use uart_read_port_properties() serial: 8250_of: Switch to use uart_read_port_properties() serial: 8250_lpc18xx: Switch to use uart_read_port_properties() serial: 8250_ingenic: Switch to use uart_read_port_properties() serial: 8250_dw: Switch to use uart_read_port_properties() serial: 8250_bcm7271: Switch to use uart_read_port_properties() serial: 8250_bcm2835aux: Switch to use uart_read_port_properties() serial: 8250_aspeed_vuart: Switch to use uart_read_port_properties() serial: port: Introduce a common helper to read properties serial: core: Add UPIO_UNKNOWN constant for unknown port type serial: core: Move struct uart_port::quirks closer to possible values serial: sh-sci: Call sci_serial_{in,out}() directly serial: core: only stop transmit when HW fifo is empty serial: pch: Use uart_prepare_sysrq_char(). ... |
||
Yan Zhai
|
00bf631224 |
bpf: report RCU QS in cpumap kthread
When there are heavy load, cpumap kernel threads can be busy polling packets from redirect queues and block out RCU tasks from reaching quiescent states. It is insufficient to just call cond_resched() in such context. Periodically raise a consolidated RCU QS before cond_resched fixes the problem. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Linus Torvalds
|
fbd88dd057 |
More power management updates for 6.9-rc1
- Modify the Energy Model code to bail out and complain if the unit of power is not uW to prevent errors due to unit mismatches (Lukasz Luba). - Make the intel_rapl platform driver use a remove callback returning void (Uwe Kleine-König). - Fix typo in the suspend and interrupts document (Saravana Kannan). - Make per-policy boost flags actually take effect on platforms using cpufreq_boost_set_sw() (Sibi Sankar). - Enable boost support in the SCMI cpufreq driver (Sibi Sankar). - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating cpumasks to avoid using unitinialized memory (Marek Szyprowski). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmX5iRASHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxof8P/R+MrENvw2bIe5VGwqX99Nd47spbAe2J QchuY8R9RIhd4laJ74C+Nc7E1rawLukLHctVql1QzkygldyriAjhHnVeL98Tmh8C H+2928BEztPzbIzLMQ7uAVKY+iUD9BfTipPwqV6D118nswtRUbubQkLtrKAmssHQ Sm5BqYmEb91CiSIe9wjqCyfc7kq7ina7nqmGa7DgZNMUgCzjmAEDRz1vXV1oRjNl aIXpx9X1wtG699HRCXBkxt7SbgXN3bFim6ZFoTvxT4U2Rhcj9JJ/PxKufFL2vuGF 1M99eu1aMqL0iS3vhj4k88uPG5kDv37qIFYD6Z2vqRVTy040fL+6Sny5IKMlQb+Z PUB/TT1rulhH4kCjhkJXuErLqMBG2Ujf8D4drosQK2avp20pF0TCm2h+0LX3HVXM dcv+WirjOyljygWm4D0XyvY6U88gcKiGGY/Qf7YQ/GC8Jyrj0h6qMvDw38725XU+ VQ0JlYCoM1R24dT4iYWWrMbvK6pEQ5R9lwJNSgdtQWOFNmyQy8j8s7D4qyRrCdG/ JERwNIi+NF3oezg3ahF8s9cemnkndqxlqX1atMmjL9WCbqUPl2LCxJ7TWM5H/mfg /h3jNk9ai6d2cHStvoxFzbAYqZX/72IVEteOomKKiHdW7lxWUW+J2xZRy0Pv0163 iNNyshJAsuaf =dlQ/ -----END PGP SIGNATURE----- Merge tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These update the Energy Model to make it prevent errors due to power unit mismatches, fix a typo in power management documentation, convert one driver to using a platform remove callback returning void, address two cpufreq issues (one in the core and one in the DT driver), and enable boost support in the SCMI cpufreq driver. Specifics: - Modify the Energy Model code to bail out and complain if the unit of power is not uW to prevent errors due to unit mismatches (Lukasz Luba) - Make the intel_rapl platform driver use a remove callback returning void (Uwe Kleine-König) - Fix typo in the suspend and interrupts document (Saravana Kannan) - Make per-policy boost flags actually take effect on platforms using cpufreq_boost_set_sw() (Sibi Sankar) - Enable boost support in the SCMI cpufreq driver (Sibi Sankar) - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating cpumasks to avoid using unitinialized memory (Marek Szyprowski)" * tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: scmi: Enable boost support firmware: arm_scmi: Add support for marking certain frequencies as turbo cpufreq: dt: always allocate zeroed cpumask cpufreq: Fix per-policy boost behavior on SoCs using cpufreq_boost_set_sw() Documentation: power: Fix typo in suspend and interrupts doc PM: EM: Force device drivers to provide power in uW powercap: intel_rapl: Convert to platform remove callback returning void |
||
Frederic Weisbecker
|
0387703986 |
timers: Fix removed self-IPI on global timer's enqueue in nohz_full
While running in nohz_full mode, a task may enqueue a timer while the tick is stopped. However the only places where the timer wheel, alongside the timer migration machinery's decision, may reprogram the next event accordingly with that new timer's expiry are the idle loop or any IRQ tail. However neither the idle task nor an interrupt may run on the CPU if it resumes busy work in userspace for a long while in full dynticks mode. To solve this, the timer enqueue path raises a self-IPI that will re-evaluate the timer wheel on its IRQ tail. This asynchronous solution avoids potential locking inversion. This is supposed to happen both for local and global timers but commit: b2cf7507e186 ("timers: Always queue timers on the local CPU") broke the global timers case with removing the ->is_idle field handling for the global base. As a result, global timers enqueue may go unnoticed in nohz_full. Fix this with restoring the idle tracking of the global timer's base, allowing self-IPIs again on enqueue time. Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240318230729.15497-3-frederic@kernel.org |
||
Frederic Weisbecker
|
f55acb1e44 |
timers/migration: Fix endless timer requeue after idle interrupts
When a CPU is an idle migrator, but another CPU wakes up before it, becomes an active migrator and handles the queue, the initial idle migrator may end up endlessly reprogramming its clockevent, chasing ghost timers forever such as in the following scenario: [GRP0:0] migrator = 0 active = 0 nextevt = T1 / \ 0 1 active idle (T1) 0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is the active migrator. [GRP0:0] migrator = NONE active = NONE nextevt = T1 / \ 0 1 idle idle (T1) wakeup = T1 1) CPU 0 is now idle and is therefore the idle migrator. It has programmed its next timer interrupt to handle T1. [GRP0:0] migrator = 1 active = 1 nextevt = KTIME_MAX / \ 0 1 idle active wakeup = T1 2) CPU 1 has woken up, it is now active and it has just handled its own timer T1. 3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote() realize it is not the migrator anymore. So it early returns without observing that T1 has been expired already and therefore without updating its ->wakeup value. 4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns because it doesn't queue a timer of its own. So ->wakeup is left unchanged and the next timer is programmed to fire now. 5) goto 3) forever This results in timer interrupt storms in idle and also in nohz_full (as observed in rcutorture's TREE07 scenario). Fix this with forcing a re-evaluation of tmc->wakeup while trying remote timer handling when the CPU isn't the migrator anymmore. The check is inherently racy but in the worst case the CPU just races setting the KTIME_MAX value that a remote expiry also tries to set. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org |
||
Linus Torvalds
|
ad584d73a2 |
Tracing updates for 6.9:
Main user visible change: - User events can now have "multi formats" The current user events have a single format. If another event is created with a different format, it will fail to be created. That is, once an event name is used, it cannot be used again with a different format. This can cause issues if a library is using an event and updates its format. An application using the older format will prevent an application using the new library from registering its event. A task could also DOS another application if it knows the event names, and it creates events with different formats. The multi-format event is in a different name space from the single format. Both the event name and its format are the unique identifier. This will allow two different applications to use the same user event name but with different payloads. - Added support to have ftrace_dump_on_oops dump out instances and not just the main top level tracing buffer. Other changes: - Add eventfs_root_inode Only the root inode has a dentry that is static (never goes away) and stores it upon creation. There's no reason that the thousands of other eventfs inodes should have a pointer that never gets set in its descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode descriptor and a dentry pointer, and only the root inode will use this. - Added WARN_ON()s in eventfs There's some conditionals remaining in eventfs that should never be hit, but instead of removing them, add WARN_ON() around them to make sure that they are never hit. - Have saved_cmdlines allocation also include the map_cmdline_to_pid array The saved_cmdlines structure allocates a large amount of data to hold its mappings. Within it, it has three arrays. Two are already apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by also including the map_cmdline_to_pid[] array as well. - Restructure __string() and __assign_str() macros used in TRACE_EVENT(). Dynamic strings in TRACE_EVENT() are declared with: __string(name, source) And assigned with: __assign_str(name, source) In the tracepoint callback of the event, the __string() is used to get the size needed to allocate on the ring buffer and __assign_str() is used to copy the string into the ring buffer. There's a helper structure that is created in the TRACE_EVENT() macro logic that will hold the string length and its position in the ring buffer which is created by __string(). There are several trace events that have a function to create the string to save. This function is executed twice. Once for __string() and again for __assign_str(). There's no reason for this. The helper structure could also save the string it used in __string() and simply copy that into __assign_str() (it also already has its length). By using the structure to store the source string for the assignment, it means that the second argument to __assign_str() is no longer needed. It will be removed in the next merge window, but for now add a warning if the source string given to __string() is different than the source string given to __assign_str(), as the source to __assign_str() isn't even used and will be going away. - Added checks to make sure that the source of __string() is also the source of __assign_str() so that it can be safely removed in the next merge window. Included fixes that the above check found. - Other minor clean ups and fixes -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfhbUBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qrhJAP9bfnYO7tfNGZVNPmTT7Fz0z4zCU1Pb P8M+24yiFTeFWwD/aIPlMFZONVkTdFAlLdffl6kJOKxZ7vW4XzUjfNWb6wo= =z/D6 -----END PGP SIGNATURE----- Merge tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "Main user visible change: - User events can now have "multi formats" The current user events have a single format. If another event is created with a different format, it will fail to be created. That is, once an event name is used, it cannot be used again with a different format. This can cause issues if a library is using an event and updates its format. An application using the older format will prevent an application using the new library from registering its event. A task could also DOS another application if it knows the event names, and it creates events with different formats. The multi-format event is in a different name space from the single format. Both the event name and its format are the unique identifier. This will allow two different applications to use the same user event name but with different payloads. - Added support to have ftrace_dump_on_oops dump out instances and not just the main top level tracing buffer. Other changes: - Add eventfs_root_inode Only the root inode has a dentry that is static (never goes away) and stores it upon creation. There's no reason that the thousands of other eventfs inodes should have a pointer that never gets set in its descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode descriptor and a dentry pointer, and only the root inode will use this. - Added WARN_ON()s in eventfs There's some conditionals remaining in eventfs that should never be hit, but instead of removing them, add WARN_ON() around them to make sure that they are never hit. - Have saved_cmdlines allocation also include the map_cmdline_to_pid array The saved_cmdlines structure allocates a large amount of data to hold its mappings. Within it, it has three arrays. Two are already apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by also including the map_cmdline_to_pid[] array as well. - Restructure __string() and __assign_str() macros used in TRACE_EVENT() Dynamic strings in TRACE_EVENT() are declared with: __string(name, source) And assigned with: __assign_str(name, source) In the tracepoint callback of the event, the __string() is used to get the size needed to allocate on the ring buffer and __assign_str() is used to copy the string into the ring buffer. There's a helper structure that is created in the TRACE_EVENT() macro logic that will hold the string length and its position in the ring buffer which is created by __string(). There are several trace events that have a function to create the string to save. This function is executed twice. Once for __string() and again for __assign_str(). There's no reason for this. The helper structure could also save the string it used in __string() and simply copy that into __assign_str() (it also already has its length). By using the structure to store the source string for the assignment, it means that the second argument to __assign_str() is no longer needed. It will be removed in the next merge window, but for now add a warning if the source string given to __string() is different than the source string given to __assign_str(), as the source to __assign_str() isn't even used and will be going away. - Added checks to make sure that the source of __string() is also the source of __assign_str() so that it can be safely removed in the next merge window. Included fixes that the above check found. - Other minor clean ups and fixes" * tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) tracing: Add __string_src() helper to help compilers not to get confused tracing: Use strcmp() in __assign_str() WARN_ON() check tracepoints: Use WARN() and not WARN_ON() for warnings tracing: Use div64_u64() instead of do_div() tracing: Support to dump instance traces by ftrace_dump_on_oops tracing: Remove second parameter to __assign_rel_str() tracing: Add warning if string in __assign_str() does not match __string() tracing: Add __string_len() example tracing: Remove __assign_str_len() ftrace: Fix most kernel-doc warnings tracing: Decrement the snapshot if the snapshot trigger fails to register tracing: Fix snapshot counter going between two tracers that use it tracing: Use EVENT_NULL_STR macro instead of open coding "(null)" tracing: Use ? : shortcut in trace macros tracing: Do not calculate strlen() twice for __string() fields tracing: Rework __assign_str() and __string() to not duplicate getting the string cxl/trace: Properly initialize cxl_poison region name net: hns3: tracing: fix hclgevf trace event strings drm/i915: Add missing ; to __assign_str() macros in tracepoint code NFSD: Fix nfsd_clid_class use of __string_len() macro ... |
||
Thorsten Blum
|
d6cb38e108 |
tracing: Use div64_u64() instead of do_div()
Fixes Coccinelle/coccicheck warnings reported by do_div.cocci. Compared to do_div(), div64_u64() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lore.kernel.org/linux-trace-kernel/20240225164507.232942-2-thorsten.blum@toblux.com Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Huang Yiwei
|
19f0423fd5 |
tracing: Support to dump instance traces by ftrace_dump_on_oops
Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the tracing instance matching <instance_name> - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu], <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com Cc: Ross Zwisler <zwisler@google.com> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mcgrof@kernel.org> Cc: <keescook@chromium.org> Cc: <j.granados@samsung.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <corbet@lwn.net> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Randy Dunlap
|
d15304135c |
ftrace: Fix most kernel-doc warnings
Reduce the number of kernel-doc warnings from 52 down to 10, i.e., fix 42 kernel-doc warnings by (a) using the Returns: format for function return values or (b) using "@var:" instead of "@var -" for function parameter descriptions. Fix one return values list so that it is formatted correctly when rendered for output. Spell "non-zero" with a hyphen in several places. Link: https://lore.kernel.org/linux-trace-kernel/20240223054833.15471-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/oe-kbuild-all/202312180518.X6fRyDSN-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
2048fdc275 |
tracing: Decrement the snapshot if the snapshot trigger fails to register
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a snapshot trigger was added, even if that snapshot trigger failed. # cd /sys/kernel/tracing # echo "snapshot" > events/sched/sched_process_fork/trigger # echo "snapshot" > events/sched/sched_process_fork/trigger -bash: echo: write error: File exists That second one that fails increments the snapshot counter but doesn't decrement it. It needs to be decremented when the snapshot fails. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.729055907@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
cca990c7b5 |
tracing: Fix snapshot counter going between two tracers that use it
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a tracer that uses the snapshot is enabled even if the snapshot was used by the previous tracer. That is: # cd /sys/kernel/tracing # echo wakeup_rt > current_tracer # echo wakeup_dl > current_tracer # echo nop > current_tracer would leave the snapshot counter at 1 and not zero. That's because the enabling of wakeup_dl would increment the counter again but the setting the tracer to nop would only decrement it once. Do not arm the snapshot for a tracer if the previous tracer already had it armed. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.570525723@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
John Garry
|
ed89683763 |
tracing: Use init_utsname()->release
Instead of using UTS_RELEASE, use init_utsname()->release, which means that we don't need to rebuild the code just for the git head commit changing. Link: https://lore.kernel.org/linux-trace-kernel/20240222124639.65629-1-john.g.garry@oracle.com Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
64805e4039 |
tracing/user_events: Introduce multi-format events
Currently user_events supports 1 event with the same name and must have the exact same format when referenced by multiple programs. This opens an opportunity for malicious or poorly thought through programs to create events that others use with different formats. Another scenario is user programs wishing to use the same event name but add more fields later when the software updates. Various versions of a program may be running side-by-side, which is prevented by the current single format requirement. Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates the user program wishes to use the same user_event name, but may have several different formats of the event. When this flag is used, create the underlying tracepoint backing the user_event with a unique name per-version of the format. It's important that existing ABI users do not get this logic automatically, even if one of the multi format events matches the format. This ensures existing programs that create events and assume the tracepoint name will match exactly continue to work as expected. Add logic to only check multi-format events with other multi-format events and single-format events to only check single-format events during find. Change system name of the multi-format event tracepoint to ensure that multi-format events are isolated completely from single-format events. This prevents single-format names from conflicting with multi-format events if they end with the same suffix as the multi-format events. Add a register_name (reg_name) to the user_event struct which allows for split naming of events. We now have the name that was used to register within user_events as well as the unique name for the tracepoint. Upon registering events ensure matches based on first the reg_name, followed by the fields and format of the event. This allows for multiple events with the same registered name to have different formats. The underlying tracepoint will have a unique name in the format of {reg_name}.{unique_id}. For example, if both "test u32 value" and "test u64 value" are used with the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique tracepoints. The dynamic_events file would then show the following: u:test u64 count u:test u32 count The actual tracepoint names look like this: test.0 test.1 Both would be under the new user_events_multi system name to prevent the older ABI from being used to squat on multi-formatted events and block their use. Deleting events via "!u:test u64 count" would only delete the first tracepoint that matched that format. When the delete ABI is used all events with the same name will be attempted to be deleted. If per-version deletion is required, user programs should either not use persistent events or delete them via dynamic_events. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-3-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
1e953de9e9 |
tracing/user_events: Prepare find/delete for same name events
The current code for finding and deleting events assumes that there will never be cases when user_events are registered with the same name, but different formats. Scenarios exist where programs want to use the same name but have different formats. An example is multiple versions of a program running side-by-side using the same event name, but with updated formats in each version. This change does not yet allow for multi-format events. If user_events are registered with the same name but different arguments the programs see the same return values as before. This change simply makes it possible to easily accommodate for this. Update find_user_event() to take in argument parameters and register flags to accommodate future multi-format event scenarios. Have find validate argument matching and return error pointers to cover when an existing event has the same name but different format. Update callers to handle error pointer logic. Move delete_user_event() to use hash walking directly now that find_user_event() has changed. Delete all events found that match the register name, stop if an error occurs and report back to the user. Update user_fields_match() to cover list_empty() scenarios now that find_user_event() uses it directly. This makes the logic consistent across several callsites. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Vincent Donnefort
|
180e4e3909 |
tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or ring-buffer swap is possible. This means the snapshot feature is mutually exclusive with the memory mapping. Having a refcount on snapshot users will help to know if a mapping is possible or not. Instead of relying on the global trace_types_lock, a new spinlock is introduced to serialize accesses to trace_array->snapshot. This intends to allow access to that variable in a context where the mmap lock is already held. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-4-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b70f293824 |
ring-buffer: Make wake once of ring_buffer_wait() more robust
The default behavior of ring_buffer_wait() when passed a NULL "cond" parameter is to exit the function the first time it is woken up. The current implementation uses a counter that starts at zero and when it is greater than one it exits the wait_event_interruptible(). But this relies on the internal working of wait_event_interruptible() as that code basically has: if (cond) return; prepare_to_wait(); if (!cond) schedule(); finish_wait(); That is, cond is called twice before it sleeps. The default cond of ring_buffer_wait() needs to account for that and wait for its counter to increment twice before exiting. Instead, use the seq/atomic_inc logic that is used by the tracing code that calls this function. Add an atomic_t seq to rb_irq_work and when cond is NULL, have the default callback take a descriptor as its data that holds the rbwork and the value of the seq when it started. The wakeups will now increment the rbwork->seq and the cond callback will simply check if that number is different, and no longer have to rely on the implementation of wait_event_interruptible(). Link: https://lore.kernel.org/linux-trace-kernel/20240315063115.6cb5d205@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 7af9ded0c2ca ("ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
8048ba24e1 |
Fix timer migration bug that can result in long bootup
delays and other oddities. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmX2sR4RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gGYA/+PLcIFIaaoAtZProEYYwKBqf/G59RWCNF hP5ytKDy/59NlIP7VhRWXYUbxzZkiFpvNduOdnUaU7gYKiQGk2kpuD6EM479t8og H4nSd8/izuS995kBwuthpISssxggoWXYU30fA6JRIJU7Imfwjcapvlv6Nrj2+NEj vsw4HNAtXgj+e8ZSDmjzw00zsSaJhIQx539mGYNvMF3geGzDdRuM2XXHjXj27WSO /BPS+oFhR0IYIf+FUVCo5w57SgQamdL2rAobZq6y89visqnvOfej4g1Je/mgUjO/ I/dcoxhLFp3Ebp68c3l4KbH+3f9LheLU+6xCdMMxOUbVyZEtziuFBqXjsh10b3ey n4hYnGq35vhIYYRrlQsBYuPegn9MECb0A1rOvj2HwWM9IfpWSZ+6thrj78ZT0SEw cxxFqGBWLFwq6wRIwEQDriN8Rbc5FgybPHqRcmJjDwtlRFFJzs6u78IViZWHpq18 oYxLzlfhOrjhKG/0W2yVpMqNgZdolO1pX3Xz/5mF7JnnlcOibOC5GMIo8AvehxsN zCI6QhKs8ZcfHbY2QSZQnn3g07VbG+3IbaA2pH1ZgWAVEPOqRqZthKFCoBzsno+u if2dqLoOMmPrW4N4u+o6sAu1WYNUyI1bEGKsl6u/gXawlcCmbjbgr5OjQ53ehW+N mNFUKOIX4Pk= =MhsC -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix timer migration bug that can result in long bootup delays and other oddities" * tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timer/migration: Remove buggy early return on deactivation |
||
linke li
|
f1e30cb636 |
ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
In function ring_buffer_iter_empty(), cpu_buffer->commit_page is read while other threads may change it. It may cause the time_stamp that read in the next line come from a different page. Use READ_ONCE() to avoid having to reason about compiler optimizations now and in future. Link: https://lore.kernel.org/linux-trace-kernel/tencent_DFF7D3561A0686B5E8FC079150A02505180A@qq.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Vincent Donnefort
|
6b76323e5a |
ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-2-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
2cc621fd2e |
tracing: Move saved_cmdline code into trace_sched_switch.c
The code that handles saved_cmdlines is split between the trace.c file and the trace_sched_switch.c. There's some history to this. The trace_sched_switch.c was originally created to handle the sched_switch tracer that was deprecated due to sched_switch trace event making it obsolete. But that file did not get deleted as it had some code to help with saved_cmdlines. But trace.c has grown tremendously since then. Just move all the saved_cmdlines code into trace_sched_switch.c as that's the only reason that file still exists, and trace.c has gotten too big. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.497966629@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
e85d471c2b |
tracing: Move open coded processing of tgid_map into helper function
In preparation of moving the saved_cmdlines logic out of trace.c and into trace_sched_switch.c, replace the open coded manipulation of tgid_map in set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it can be easily moved into trace_sched_switch.c without changing existing functions in trace.c. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.338116216@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
0b18c852cc |
tracing: Have saved_cmdlines arrays all in one allocation
The saved_cmdlines have three arrays for mapping PIDs to COMMs: - map_pid_to_cmdline[] - map_cmdline_to_pid[] - saved_cmdlines The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index into the other arrays. The map_cmdline_to_pid[] is a mapping back to the full pid as it can be larger than PID_MAX_DEFAULT. And the saved_cmdlines[] just holds the COMMs associated to the pids. Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated together (in reality the saved_cmdlines is just in the memory of the rounding of the allocation of the structure as it is always allocated in powers of two). The map_cmdline_to_pid[] array is allocated separately. Since the rounding to a power of two is rather large (it allows for 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[] array. (This drops it to 6000 by default, which is still plenty for most use cases). This saves even more memory as the map_cmdline_to_pid[] array doesn't need to be allocated. Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Frederic Weisbecker
|
4b6f4c5a67 |
timer/migration: Remove buggy early return on deactivation
When a CPU enters into idle and deactivates itself from the timer migration hierarchy without any global timer of its own to propagate, the group event of that CPU is set to "ignore" and tmigr_update_events() accordingly performs an early return without considering timers queued by other CPUs. If the hierarchy has a single level, and the CPU is the last one to enter idle, it will ignore others' global timers, as in the following layout: [GRP0:0] migrator = 0 active = 0 nextevt = T0i / \ 0 1 active (T0i) idle (T1) 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. [GRP0:0] migrator = NONE active = NONE nextevt = T0i / \ 0 1 idle (T0i) idle (T1) 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". As a result tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1 unhandled. This isn't proper to single level hierarchy though. A similar issue, although slightly different, may arise on multi-level: [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = 0 migrator = NONE active = 0 active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 active idle idle idle 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2 [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". As a result tmigr_update_events() ignores T1. The change only propagated up to 1st level so far. [GRP1:0] migrator = NONE active = NONE nextevt = T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T0i nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 2) The change now propagates up to the top. tmigr_update_events() finds that the child event is ignored and thus removes it. The top level next event is now T2 which is returned to CPU 0 as its next effective expiry to take account for as the global idle migrator. However T1 has been ignored along the way, leaving it unhandled. Fix those issues with removing the buggy related early return. Ignored child events must not prevent from evaluating the other events within the same group. Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen |
||
Linus Torvalds
|
32a50540c3 |
bcachefs updates for 6.9
- Subvolume children btree; this is needed for providing a userspace interface for walking subvolumes, which will come later - Lots of improvements to directory structure checking - Improved journal pipelining, significantly improving performance on high iodepth write workloads - Discard path improvements: the discard path is more efficient, and no longer flushes the journal unnecessarily - Buffered write path can now avoid taking the inode lock - new mm helper: memalloc_flags_{save|restore} - mempool now does kvmalloc mempools -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXycEcACgkQE6szbY3K bnYUTg/+K4Nv2EdAqOCyHRTKaF2OgJDUb25ZDmbGpfT1XyPrNB7/+CxHqSdEP7/e FVuhtP61vnQAImDv82u9iZiab/TnuCZPUrjSobFEvrWYoGRtP9Bm9MyYB28NzmMa AXGmS4yJGVwtxxrFNxZP98IbiHYiHSoYbkqxX2E5VgLag8Ru8peb7oD0Ro3zw0rb z+6UM/seJ7on5i/9IJEMKKXFVEoZC2J5DAVoe1TghG2kgOw3cKu5OUdltLPOY5jL jkm5J5wa6Ep46nufHat92yiMxXIQrf4U9LkXxzTi5ThoSmt+Af2qXcBjqTTVqd2D 1dGxj+UG8iu4DCCbQC6EA7J5EMvxfJM0+9lk1ULUgxUs3X69co6nlI6XH1fwEMqk KpIqd35+Y/IYgogt9ioXI0dtXyL7dbaTVt6NZhc9SaPGPX+C2V0+l4bqToFdNaPH 0KATjjyQaJRE4ZFIjr6GliYOtKWDLi/HPEyoBivniUn7cF5vjSvti+cSQwNDSPpa 6jOd5Y923Iq9ZqDAPM3+mvTH8nNaaf2T2fmbPNrc5pdWbha9bGwOU71zvKHNFGm/ 66ZsnwhKSk+uwglTMZHPKSkJJXUYAHESw3slQtEWHZVlliArc55+pBHwE00bvRt7 KHUUqkqXBUPzbp/kdZGylMAdH9+8j9TE5QJ2RaoryFm/eCfexmI= =6xnj -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs Pull bcachefs updates from Kent Overstreet: - Subvolume children btree; this is needed for providing a userspace interface for walking subvolumes, which will come later - Lots of improvements to directory structure checking - Improved journal pipelining, significantly improving performance on high iodepth write workloads - Discard path improvements: the discard path is more efficient, and no longer flushes the journal unnecessarily - Buffered write path can now avoid taking the inode lock - new mm helper: memalloc_flags_{save|restore} - mempool now does kvmalloc mempools * tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits) bcachefs: time_stats: shrink time_stat_buffer for better alignment bcachefs: time_stats: split stats-with-quantiles into a separate structure bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet bcachefs: time_stats: add larger units bcachefs: pull out time_stats.[ch] bcachefs: reconstruct_alloc cleanup bcachefs: fix bch_folio_sector padding bcachefs: Fix btree key cache coherency during replay bcachefs: Always flush write buffer in delete_dead_inodes() bcachefs: Fix order of gc_done passes bcachefs: fix deletion of indirect extents in btree_gc bcachefs: Prefer struct_size over open coded arithmetic bcachefs: Kill unused flags argument to btree_split() bcachefs: Check for writing superblocks with nonsense member seq fields bcachefs: fix bch2_journal_buf_to_text() lib/generic-radix-tree.c: Make nodes more reasonably sized bcachefs: copy_(to|from)_user_errcode() bcachefs: Split out bkey_types.h bcachefs: fix lost journal buf wakeup due to improved pipelining bcachefs: intercept mountoption value for bool type ... |