8192 Commits
Author | SHA1 | Message | Date
---|---|---|---
Linus Torvalds
|
53b5e72b9d |
asm-generic updates for 6.4
Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann: "These are various cleanups, fixing a number of uapi header files to no longer reference CONFIG_* symbols, and one patch that introduces the new CONFIG_HAS_IOPORT symbol for architectures that provide working inb()/outb() macros, as a preparation for adding driver dependencies on those in the following release"

* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Kconfig: introduce HAS_IOPORT option and select it as necessary
  scripts: Update the CONFIG_* ignore list in headers_install.sh
  pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
  Move bp_type_idx to include/linux/hw_breakpoint.h
  Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
  Move COMPAT_ATM_ADDPARTY to net/atm/svc.c |
||
Linus Torvalds
|
e7989789c6 |
Timers and timekeeping updates:
Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timers and timekeeping updates from Thomas Gleixner:

- Improve the VDSO build time checks to cover all dynamic relocations

  The VDSO does not allow dynamic relocations, but the build time check was incomplete and fragile. It was based on architectures specifying the relocation types to search for and did not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a].dyn size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too.

  Remove the architecture specific relocation types to check for and validate strictly that no relocations other than R_*_NONE end up in the VDSO .so file.

- Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers

  Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also better distributes signal delivery when multiple timers fire in the context of different threads close to each other.

- Align the tick period properly (again)

  For a long time the tick started at CLOCK_MONOTONIC zero, which allowed user space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignment of the tick period was more by chance than by intention, as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there.

  The early enablement of sched_clock() broke this alignment as the time accumulated by sched_clock() is taken into account when timekeeping is initialized, so the base value of now(CLOCK_MONOTONIC) is no longer a multiple of tick periods, which breaks applications that relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e. 1000ms/CONFIG_HZ.

- A set of NOHZ fixes and enhancements:

   * Cure the concurrent writer race for idle and IO sleeptime statistics

     The statistic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value.

   * Protect idle/iowait sleep time with a sequence count

     Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may produce random and potentially backwards-going values. Protect this with a sequence count, which fixes the idle time statistics issue but cannot fix the iowait time problem, because iowait time accounting races with remote wakeups decrementing the remote runqueue's nr_iowait counter. The latter is impossible to fix, so the only way to deal with it is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that.

   * Restructure struct tick_sched for better cache layout

   * Some small cleanups and a better cache layout for struct tick_sched

- Implement the missing timer_wait_running() callback for POSIX CPU timers (see the sketch after this entry)

  For unknown reasons the introduction of the timer_wait_running() callback missed to fix up posix CPU timers, which went unnoticed for almost four years. While initially only targeted at preventing livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks.

  The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped, the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this ends up in a livelock when the preempting task and the expiry task are pinned on the same CPU.

  The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock that is held by the expiry code, and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers, as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way.

  Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it across the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems.

* tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Implement the missing timer_wait_running callback
  selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
  selftests/proc: Remove idle time monotonicity assertions
  MAINTAINERS: Remove stale email address
  timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
  timers/nohz: Add a comment about broken iowait counter update race
  timers/nohz: Protect idle/iowait sleep time under seqcount
  timers/nohz: Only ever update sleeptime from idle exit
  timers/nohz: Restructure and reshuffle struct tick_sched
  tick/common: Align tick period with the HZ tick.
  selftests/timers/posix_timers: Test delivery of signals across threads
  posix-timers: Prefer delivery of signals to the current thread
  vdso: Improve cmd_vdso_check to check all dynamic relocations |
||
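The expiry-lock idea in the last bullet of the entry above reduces to a small locking pattern. Below is a minimal sketch with hypothetical names (the real change extends struct posix_cputimers_work in the posix CPU timer code; this is an illustration of the scheme, not the actual implementation):

```c
#include <linux/mutex.h>

/* Hypothetical stand-in for the mutex added to struct posix_cputimers_work. */
struct cputimers_work_sketch {
	struct mutex mutex;
};

/* Expiry side: runs from task work; holds the mutex across the expiry run. */
static void run_expiry(struct cputimers_work_sketch *cw)
{
	mutex_lock(&cw->mutex);
	/* ... move expired timers to a stack-local list and fire them ... */
	mutex_unlock(&cw->mutex);
}

/* Deletion side: instead of spin-waiting until the (possibly preempted)
 * expiry task is scheduled back in, sleep on the mutex until expiry is done. */
static void wait_for_expiry(struct cputimers_work_sketch *cw)
{
	mutex_lock(&cw->mutex);		/* blocks while run_expiry() holds it */
	mutex_unlock(&cw->mutex);	/* uncontended: one extra lock/unlock pair */
}
```

The trade-off matches the changelog: one uncontended mutex_lock()/unlock() pair per expiry in exchange for eliminating the spin-wait and the livelock.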
Linus Torvalds
|
29e95a4b26 |
A single update to debugobjects:
Merge tag 'core-debugobjects-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core debugobjects update from Thomas Gleixner: "A single update to debugobjects: Prevent a race vs statically initialized objects. Such objects are usually not initialized via an init() function. They are special cased and detected on first use under the assumption that they are already correctly initialized via the static initializer. This works correctly unless there are two concurrent debug object operations on such an object.

The first one detects that the object is not yet tracked and tries to establish a tracking object after dropping the debug objects hash bucket lock. The concurrent operation does the same. The one which wins the race ends up modifying the state of the object, which makes the other one fail, resulting in a bogus debug objects warning.

Prevent this by making the detection of a static object and the allocation of a tracking object atomic under the hash bucket lock. So the first one to acquire the hash bucket lock will succeed and the second one will observe the correct tracking state.

This race existed forever but was only exposed when the timer wheel code added a debug_object_assert_init() call outside of the timer base locked region. This replaced the previous warning about timer::function being NULL, which had to be removed when the timer_shutdown() mechanics were added"

* tag 'core-debugobjects-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobject: Prevent init race with static objects |
||
Linus Torvalds
|
1be89faab3 |
linux-kselftest-kunit-6.4-rc1
Merge tag 'linux-kselftest-kunit-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit updates from Shuah Khan:

- several fixes to kunit tool
- new klist structure test
- support for m68k under QEMU
- support for overriding the QEMU serial port
- support for SH under QEMU

* tag 'linux-kselftest-kunit-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: add tests for using current KUnit test field
  kunit: tool: Add support for SH under QEMU
  kunit: tool: Add support for overriding the QEMU serial port
  .gitignore: Unignore .kunitconfig
  list: test: Test the klist structure
  kunit: increase KUNIT_LOG_SIZE to 2048 bytes
  kunit: Use gfp in kunit_alloc_resource() kernel-doc
  kunit: tool: fix pre-existing `mypy --strict` errors and update run_checks.py
  kunit: tool: remove unused imports and variables
  kunit: tool: add subscripts for type annotations where appropriate
  kunit: fix bug of extra newline characters in debugfs logs
  kunit: fix bug in the order of lines in debugfs logs
  kunit: fix bug in debugfs logs of parameterized tests
  kunit: tool: Add support for m68k under QEMU |
||
Linus Torvalds
|
5dfb75e842 |
RCU Changes for 6.4:
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

- Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window.

- Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask.

- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang.

- Improvements to rcu-tasks stall reporting by Neeraj.

- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure callers know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

- Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments.

- Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig.

- Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu(), contributed by Boqun. Previously lockdep could not detect such deadlocks; now it can.

- Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, a test_nmis module parameter, and more

- Miscellaneous changes, various code cleanups and comment improvements

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ... |
||
Linus Torvalds
|
487c20b016 |
iov: improve copy_iovec_from_user() code generation
Use the same pattern as the compat version of this code does: instead of copying the whole array to a kernel buffer and then having a separate phase of verifying it, just do it one entry at a time, verifying as you go. On Jens' /dev/zero readv() test this improves performance by ~6%. [ This was obviously triggered by Jens' ITER_UBUF updates series ] Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/de35d11d-bce7-e976-7372-1f2caf417103@kernel.dk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
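The "verify as you go" pattern the commit describes can be sketched as follows, loosely modeled on the compat variant it mentions (a kernel-context fragment for illustration, not the exact lib/iov_iter.c code):

```c
static int copy_iovec_from_user_sketch(struct iovec *iov,
				       const struct iovec __user *uiov,
				       unsigned long nr_segs)
{
	int ret = -EFAULT;

	if (!user_access_begin(uiov, nr_segs * sizeof(*uiov)))
		return -EFAULT;

	do {
		void __user *buf;
		ssize_t len;

		/* copy one entry at a time ... */
		unsafe_get_user(len, &uiov->iov_len, uaccess_end);
		unsafe_get_user(buf, &uiov->iov_base, uaccess_end);

		/* ... and verify it immediately: a size_t that does not
		 * fit in ssize_t is rejected on the spot. */
		if (len < 0)
			goto uaccess_end;
		iov->iov_base = buf;
		iov->iov_len = len;
		uiov++;
		iov++;
	} while (--nr_segs);

	ret = 0;
uaccess_end:
	user_access_end();
	return ret;
}
```

Compared with a bulk copy_from_user() of the whole array followed by a separate validation pass, this touches each entry once while it is already hot in cache, which is where the reported ~6% comes from.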
Linus Torvalds
|
b9dff2195f |
iter-ubuf.2-2023-04-21
Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux

Pull ITER_UBUF updates from Jens Axboe: "This turns single vector imports into ITER_UBUF, rather than ITER_IOVEC. The former is more trivial to iterate and advance, and hence a bit more efficient. From some very unscientific testing, ~60% of all iovec imports are single vector"

* tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux:
  iov_iter: Mark copy_compat_iovec_from_user() noinline
  iov_iter: import single vector iovecs as ITER_UBUF
  iov_iter: convert import_single_range() to ITER_UBUF
  iov_iter: overlay struct iovec and ubuf/len
  iov_iter: set nr_segs = 1 for ITER_UBUF
  iov_iter: remove iov_iter_iovec()
  iov_iter: add iter_iov_addr() and iter_iov_len() helpers
  ALSA: pcm: check for user backed iterator, not specific iterator type
  IB/qib: check for user backed iterator, not specific iterator type
  IB/hfi1: check for user backed iterator, not specific iterator type
  iov_iter: add iter_iovec() helper
  block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly |
||
Liam R. Howlett
|
06e8fd9993 |
maple_tree: fix mas_empty_area() search
The internal function of mas_awalk() was incorrectly skipping the last entry in a node, which could potentially be NULL. This is only a problem for the left-most node in the tree - otherwise that NULL would not exist. Fix mas_awalk() by using the metadata to obtain the end of the node for the loop and the logical pivot as opposed to the raw pivot value. Link: https://lkml.kernel.org/r/20230414145728.4067069-2-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
fad8e4291d |
maple_tree: make maple state reusable after mas_empty_area_rev()
Stop using maple state min/max for the range by passing through pointers for those values. This will allow the maple state to be reused without resetting. Also add some logic to fail out early on searching with invalid arguments. Link: https://lkml.kernel.org/r/20230414145728.4067069-1-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peng Zhang
|
1f5f12ece7 |
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
In mas_alloc_nodes(), "node->node_count = 0" is meant to initialize the node_count field of a new node, but the node may not be new: it may be a pre-existing node whose node_count already has a value, and setting it to 0 will cause a memory leak. At that point, mas->alloc->total will be greater than the actual number of nodes in the linked list, which may cause many other errors, for example out-of-bounds access in mas_pop_node(), and mas_pop_node() may return addresses that should not be used. Fix it by initializing node_count only for new nodes. In passing, an if-else statement was removed to simplify the code. Link: https://lkml.kernel.org/r/20230411041005.26205-1-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Thomas Gleixner
|
63a759694e |
debugobject: Prevent init race with static objects
Statically initialized objects are usually not initialized via the init() function of the subsystem. They are special cased and the subsystem provides a function to validate whether an object which is not yet tracked by debugobjects is statically initialized. This means the object is started to be tracked on first use, e.g. activation. This works perfectly fine, unless there are two concurrent operations on that object. Schspa decoded the problem:

```
T0                                      T1
debug_object_assert_init(addr)
  lock_hash_bucket()
  obj = lookup_object(addr);
  if (!obj) {
    unlock_hash_bucket();
    -> preemption
                                        lock_subsystem_object(addr);
                                        activate_object(addr)
                                          lock_hash_bucket();
                                          obj = lookup_object(addr);
                                          if (!obj) {
                                            unlock_hash_bucket();
                                            if (is_static_object(addr))
                                              init_and_track(addr);
                                            lock_hash_bucket();
                                            obj = lookup_object(addr);
                                            obj->state = ACTIVATED;
                                            unlock_hash_bucket();

                                        subsys function modifies content of
                                        addr, so static object detection no
                                        longer works.

                                        unlock_subsystem_object(addr);

    if (is_static_object(addr))  <- Fails
```

debugobject then emits a warning and invokes the fixup function, which reinitializes the already active object in the worst case.

This race has existed forever, but was never observed until mod_timer() got a debug_object_assert_init() added outside of the timer base locked section, right at the beginning of the function, to cover the lockless early exit points too.

Rework the code so that the lookup, the static object check and the tracking object association happen atomically under the hash bucket lock. This prevents the issue completely as all callers are serialized on the hash bucket lock and therefore cannot observe inconsistent state. A sketch of the resulting shape follows this entry.

Fixes: 3ac7fe5a4aab ("infrastructure to debug (dynamic) objects") Reported-by: syzbot+5093ba19745994288b53@syzkaller.appspotmail.com Debugged-by: Schspa Shi <schspa@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://syzkaller.appspot.com/bug?id=22c8a5938eab640d1c6bcc0e3dc7be519d878462 Link: https://lore.kernel.org/lkml/20230303161906.831686-1-schspa@gmail.com Link: https://lore.kernel.org/r/87zg7dzgao.ffs@tglx |
||
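Reduced to its essentials, the reworked path keeps all three steps under the bucket lock. A sketch with a hypothetical helper name (lookup_object(), alloc_object() and the is_static_object() descriptor hook are debugobjects internals; the composition here is illustrative, not the verbatim lib/debugobjects.c code):

```c
/* Sketch: the caller holds the hash bucket lock for the whole sequence. */
static struct debug_obj *lookup_or_track_static(void *addr,
						struct debug_bucket *db,
						const struct debug_obj_descr *descr)
{
	struct debug_obj *obj;

	obj = lookup_object(addr, db);			/* 1) lookup         */
	if (obj)
		return obj;

	if (descr->is_static_object &&
	    descr->is_static_object(addr))		/* 2) static check   */
		obj = alloc_object(addr, db, descr);	/* 3) start tracking */

	return obj;	/* NULL: neither tracked nor statically initialized */
}
```

Because lookup, check and allocation are now one lock-held unit, the second racing caller simply finds the tracking object the first one created.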
Josh Poimboeuf
|
50f9a76ef1 |
iov_iter: Mark copy_compat_iovec_from_user() noinline
After commit 6376ce56feb6 ("iov_iter: import single vector iovecs as ITER_UBUF"), GCC does an inter-procedural compiler optimization which moves the user_access_begin() out of copy_compat_iovec_from_user() and into its callers: lib/iov_iter.o: warning: objtool: .altinstr_replacement+0x0: redundant UACCESS disable lib/iov_iter.o: warning: objtool: iovec_from_user.part.0+0xc7: call to copy_compat_iovec_from_user.part.0() with UACCESS enabled lib/iov_iter.o: warning: objtool: __import_iovec+0x21d: call to copy_compat_iovec_from_user.part.0() with UACCESS enabled Enforce the "no UACCESS enable across function boundaries" rule by disabling cloning for copy_compat_iovec_from_user(). Fixes: 6376ce56feb6 ("iov_iter: import single vector iovecs as ITER_UBUF") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> https://lkml.kernel.org/lkml/20230327120017.6bb826d7@canb.auug.org.au Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Peng Zhang
|
c45ea315a6 |
maple_tree: fix a potential concurrency bug in RCU mode
There is a concurrency bug that may cause the wrong value to be loaded when a CPU is modifying the maple tree.

```
CPU1:
mtree_insert_range()
  mas_insert()
    mas_store_root()
      ...
      mas_root_expand()
        ...
        rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node));
        ma_set_meta(node, maple_leaf_64, 0, slot);   <---IP

CPU2:
mtree_load()
  mtree_lookup_walk()
    ma_data_end();
```

When CPU1 is about to execute the instruction pointed to by IP, the ma_data_end() executed by CPU2 may return the wrong end position, which will cause the value loaded by mtree_load() to be wrong.

An example of triggering the bug: add mdelay(100) between rcu_assign_pointer() and ma_set_meta() in mas_root_expand(), then run:

```c
static DEFINE_MTREE(tree);

int work(void *p)
{
	unsigned long val;

	for (int i = 0; i < 30; ++i) {
		val = (unsigned long)mtree_load(&tree, 8);
		mdelay(5);
		pr_info("%lu", val);
	}
	return 0;
}

mt_init_flags(&tree, MT_FLAGS_USE_RCU);
mtree_insert(&tree, 0, (void *)12345, GFP_KERNEL);
run_thread(work);
mtree_insert(&tree, 1, (void *)56789, GFP_KERNEL);
```

In RCU mode, mtree_load() should always return the value from either before or after the data structure was modified, but in this example mtree_load(&tree, 8) may return 56789, which is not expected; it should always return NULL. Fix it by putting ma_set_meta() before rcu_assign_pointer(). Link: https://lkml.kernel.org/r/20230314124203.91572-4-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
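The fix in this entry is the standard RCU publication rule: complete all stores to the new node before rcu_assign_pointer() makes it reachable to readers. Schematically, using the two calls quoted above:

```c
/* Before (buggy): a reader can observe the new root before its metadata. */
rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node));
ma_set_meta(node, maple_leaf_64, 0, slot);

/* After (fixed): finish the node, then publish it. */
ma_set_meta(node, maple_leaf_64, 0, slot);
rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node));
```

rcu_assign_pointer() contains the release ordering that guarantees a reader dereferencing the new root also sees every write made to it beforehand; writes made afterwards carry no such guarantee.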
Peng Zhang
|
ec07967d75 |
maple_tree: fix get wrong data_end in mtree_lookup_walk()
The check `if (likely(offset > end)) max = pivots[offset];` should use `if (likely(offset < end))`, which is the correct condition. This affects the correctness of ma_data_end(). It seems the final result will not be wrong, but it is best to change it. This patch does not make exactly the change shown above, because it also simplifies the code along the way. Link: https://lkml.kernel.org/r/20230314124203.91572-1-zhangpeng.00@bytedance.com Link: https://lkml.kernel.org/r/20230314124203.91572-2-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
790e1fa86b |
maple_tree: add RCU lock checking to rcu callback functions
Dereferencing RCU objects within the RCU callback without the RCU check has caused lockdep to complain. Fix the RCU dereferencing by using the RCU callback lock to ensure the operation is safe. Also stop creating a new lock to use for dereferencing during destruction of the tree or subtree. Instead, pass through a pointer to the tree that has the lock that is held for RCU dereferencing checking. It also does not make sense to use the maple state in the freeing scenario as the tree walk is a special case where the tree no longer has the normal encodings and parent pointers. Link: https://lkml.kernel.org/r/20230227173632.3292573-8-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
0a2b18d948 |
maple_tree: add smp_rmb() to dead node detection
Add an smp_rmb() before reading the parent pointer to ensure that anything read from the node prior to the parent pointer hasn't been reordered ahead of this check. This is necessary for RCU mode. Link: https://lkml.kernel.org/r/20230227173632.3292573-7-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
c13af03de4 |
maple_tree: fix write memory barrier of nodes once dead for RCU mode
During the development of the maple tree, the strategy of freeing multiple nodes changed and, in the process, the pivots were reused to store pointers to dead nodes. To ensure the readers see accurate pivots, the writers need to mark the nodes as dead and call smp_wmb() to ensure any readers can identify the node as dead before using the pivot values. There were two places where the old method of marking the node as dead without smp_wmb() was being used, which resulted in RCU readers seeing the wrong pivot value before seeing the node was dead. Fix this race condition by using mte_set_node_dead(), which has the smp_wmb() call to ensure the race is closed. Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being freed are marked as dead, so there are no other call paths besides the two updated paths. This is necessary for the RCU mode of the maple tree. Link: https://lkml.kernel.org/r/20230227173632.3292573-6-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
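The smp_rmb() fix two entries up and the smp_wmb() fix directly above are a matched pair. In generic form (the field and helper names here are hypothetical, not the actual maple node layout):

```c
/* Writer tearing a node down: mark it dead, then make that visible
 * before the pivots are reused to store free-list pointers. */
WRITE_ONCE(node->dead, 1);		/* hypothetical dead flag   */
smp_wmb();				/* pairs with reader's smp_rmb() */
reuse_pivots_for_free_list(node);	/* hypothetical             */

/* RCU reader: consume the data first, then check for death. */
val = READ_ONCE(node->pivots[offset]);
smp_rmb();				/* order the read above vs. the check */
if (READ_ONCE(node->dead))
	goto retry;			/* val may be stale, discard and rewalk */
```

Without the smp_wmb(), a reader can see the recycled pivot before the dead mark; without the smp_rmb(), the dead check can be satisfied by a read hoisted above the data access. Both barriers are needed to close the window.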
Liam Howlett
|
8372f4d83f |
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
The call to mte_set_dead_node() before the smp_wmb() already calls smp_wmb() so this is not needed. This is an optimization for the RCU mode of the maple tree. Link: https://lkml.kernel.org/r/20230227173632.3292573-5-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam Howlett
|
2e5b4921f8 |
maple_tree: fix freeing of nodes in rcu mode
The walk to destroy the nodes was not always setting the node type and would result in a destroy method potentially using the values as nodes. Avoid this by setting the correct node types. This is necessary for the RCU mode of the maple tree. Link: https://lkml.kernel.org/r/20230227173632.3292573-4-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam Howlett
|
a7b92d59c8 |
maple_tree: detect dead nodes in mas_start()
When initially starting a search, the root node may already be in the process of being replaced in RCU mode. Detect and restart the walk if this is the case. This is necessary for RCU mode of the maple tree. Link: https://lkml.kernel.org/r/20230227173632.3292573-3-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam Howlett
|
39d0bd86c4 |
maple_tree: be more cautious about dead nodes
Patch series "Fix VMA tree modification under mmap read lock". Syzbot reported a BUG_ON in mm/mmap.c which was found to be caused by an inconsistency between threads walking the VMA maple tree. The inconsistency is caused by the page fault handler modifying the maple tree while holding the mmap_lock for read. This only happens for stack VMAs. We had thought this was safe as it only modifies a single pivot in the tree. Unfortunately, syzbot constructed a test case where the stack had no guard page and grew the stack to abut the next VMA. This causes us to delete the NULL entry between the two VMAs and rewrite the node. We considered several options for fixing this, including dropping the mmap_lock, then reacquiring it for write; and relaxing the definition of the tree to permit a zero-length NULL entry in the node. We decided the best option was to backport some of the RCU patches from -next, which solve the problem by allocating a new node and RCU-freeing the old node. Since the problem exists in 6.1, we preferred a solution which is similar to the one we intended to merge next merge window. These patches have been in -next since next-20230301, and have received intensive testing in Android as part of the RCU page fault patchset. They were also sent as part of the "Per-VMA locks" v4 patch series. Patches 1 to 7 are bug fixes for RCU mode of the tree and patch 8 enables RCU mode for the tree. Performance v6.3-rc3 vs patched v6.3-rc3: Running these changes through mmtests showed there was a 15-20% performance decrease in will-it-scale/brk1-processes. This tests creating and inserting a single VMA repeatedly through the brk interface and isn't representative of any real world applications. This patch (of 8): ma_pivots() and ma_data_end() may be called with a dead node. Ensure to that the node isn't dead before using the returned values. This is necessary for RCU mode of the maple tree. Link: https://lkml.kernel.org/r/20230327185532.2354250-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230227173632.3292573-1-surenb@google.com Link: https://lkml.kernel.org/r/20230227173632.3292573-2-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chriscli@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: freak07 <michalechner92@googlemail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. 
McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Niklas Schnelle
|
fcbfe8121a
|
Kconfig: introduce HAS_IOPORT option and select it as necessary
We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O Port access. In a future patch HAS_IOPORT=n will disable compilation of the I/O accessor functions inb()/outb() and friends on architectures which can not meaningfully support legacy I/O spaces such as s390. The following architectures do not select HAS_IOPORT: * ARC * C-SKY * Hexagon * Nios II * OpenRISC * s390 * User-Mode Linux * Xtensa All other architectures select HAS_IOPORT at least conditionally. The "depends on" relations on HAS_IOPORT in drivers as well as ifdefs for HAS_IOPORT specific sections will be added in subsequent patches on a per subsystem basis. Co-developed-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Arnd Bergmann <arnd@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> # for ARCH=um Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
||
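As a hedged illustration of the per-subsystem ifdefs this commit prepares for (a made-up driver fragment, not from the patch; the commit itself only introduces the Kconfig symbol, and drivers would typically also gain a `depends on HAS_IOPORT` relation):

```c
#ifdef CONFIG_HAS_IOPORT
static u8 drv_read_status(unsigned long port)
{
	return inb(port);	/* legacy I/O port access is available */
}
#else
static u8 drv_read_status(unsigned long port)
{
	return 0xff;		/* no usable I/O port space on this architecture */
}
#endif
```

Once inb()/outb() stop being compiled on HAS_IOPORT=n architectures such as s390, code like the first branch must be unreachable at build time, which is exactly what the new symbol makes expressible.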
Rae Moar
|
a42077b787 |
kunit: add tests for using current KUnit test field
Create test suite called "kunit_current" to add test coverage for the use of current->kunit_test, which returns the current KUnit test. Add two test cases: - kunit_current_test to test current->kunit_test and the method kunit_get_current_test(), which utilizes current->kunit_test. - kunit_current_fail_test to test the method kunit_fail_current_test(), which utilizes current->kunit_test. Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
||
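Roughly what such a test can look like (a sketch along the lines of the description, assuming the kunit_get_current_test() helper from <kunit/test-bug.h>; not the exact tests from the patch):

```c
#include <kunit/test.h>
#include <kunit/test-bug.h>

/* Code under test can discover the running test via current->kunit_test. */
static void example_code_under_test(void)
{
	struct kunit *test = kunit_get_current_test();

	if (test)
		kunit_info(test, "called from within a KUnit test\n");
}

static void kunit_current_example(struct kunit *test)
{
	/* While the test runs, current->kunit_test points at @test. */
	KUNIT_EXPECT_PTR_EQ(test, kunit_get_current_test());
	example_code_under_test();
}
```

This is what makes fakes and stubs practical: library code can report failures against the currently running test (e.g. via kunit_fail_current_test()) without having a struct kunit pointer threaded through its API.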
Uladzislau Rezki (Sony)
|
c779b97281 |
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
The kvfree_rcu() macro's single-argument form is deprecated. Therefore switch to the new kvfree_rcu_mightsleep() variant. The goal is to avoid accidental use of the single-argument forms, which can introduce functionality bugs in atomic contexts and latency bugs in non-atomic contexts. Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> |
||
Sadiya Kazi
|
57b4f760f9 |
list: test: Test the klist structure
Add KUnit tests to the klist linked-list structure. These perform testing for different variations of node add and node delete in the klist data structure (<linux/klist.h>). Limitation: Since we use a static global variable, and if multiple instances of this test are run concurrently, the test may fail. Signed-off-by: Sadiya Kazi <sadiyakazi@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
||
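For context, the API surface being exercised, in a minimal sketch (this assumes klist's optional get/put node-reference callbacks may be passed as NULL, which lib/klist.c checks for; this is illustrative, not the actual test code):

```c
#include <linux/klist.h>

static struct klist example_klist;
static struct klist_node node_a, node_b;

static void klist_example(void)
{
	struct klist_iter iter;
	struct klist_node *n;

	klist_init(&example_klist, NULL, NULL);	/* no get/put callbacks */
	klist_add_head(&node_a, &example_klist);
	klist_add_tail(&node_b, &example_klist);	/* order: a, b */

	/* Iteration pins the current node so concurrent removal is safe. */
	klist_iter_init(&example_klist, &iter);
	while ((n = klist_next(&iter)))
		;	/* visits node_a, then node_b */
	klist_iter_exit(&iter);

	klist_del(&node_a);
	klist_del(&node_b);
}
```

The noted limitation about static global state means two concurrent runs of the suite would share `example_klist`-style variables, hence the possible flakiness.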
Jens Axboe
|
3b2deb0e46 |
iov_iter: import single vector iovecs as ITER_UBUF
Add a special case to __import_iovec(), which imports a single segment iovec as an ITER_UBUF rather than an ITER_IOVEC. ITER_UBUF is cheaper to iterate than ITER_IOVEC, and for a single segment iovec, there's no point in using a segmented iterator. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
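Conceptually, the single-segment fast path looks like the fragment below (simplified, with surrounding setup omitted; iov_iter_ubuf() and ITER_DEST are the real uio primitives, but the composition is illustrative rather than the exact __import_iovec() change):

```c
/* Single user segment: represent it directly as ITER_UBUF instead of
 * building a one-entry ITER_IOVEC (iov[0] was already copied from
 * user space at this point). */
if (nr_segs == 1) {
	void __user *buf = iov[0].iov_base;
	size_t len = iov[0].iov_len;

	if (unlikely(!access_ok(buf, len)))
		return -EFAULT;

	iov_iter_ubuf(&iter, ITER_DEST, buf, len); /* nr_segs is 1 internally */
	return 0;
}
/* ... multi-segment iovecs still become ITER_IOVEC ... */
```

An ITER_UBUF iterator carries just a base pointer and length, so advancing it is pointer arithmetic rather than segment bookkeeping.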
Jens Axboe
|
e03ad4ee27 |
iov_iter: convert import_single_range() to ITER_UBUF
Since we're just importing a single vector, we don't have to turn it into an ITER_IOVEC. Instead turn it into an ITER_UBUF, which is cheaper to iterate. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Jens Axboe
|
de4f5fed3f |
iov_iter: add iter_iovec() helper
This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Tiezhu Yang
|
f478b9987c |
lib/Kconfig.debug: correct help info of LOCKDEP_STACK_TRACE_HASH_BITS
We can see the following definition in kernel/locking/lockdep_internals.h: `#define STACK_TRACE_HASH_SIZE (1 << CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS)`. CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS is related to STACK_TRACE_HASH_SIZE, not to MAX_STACK_TRACE_ENTRIES; fix the help text accordingly. Link: https://lkml.kernel.org/r/1679380508-20830-1-git-send-email-yangtiezhu@loongson.cn Fixes: 5dc33592e955 ("lockdep: Allow tuning tracing capacity constants.") Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
ye xingchen
|
35260cf545 |
Kconfig.debug: fix SCHED_DEBUG dependency
The path for SCHED_DEBUG is /sys/kernel/debug/sched. So, SCHED_DEBUG should depend on DEBUG_FS, not PROC_FS. Link: https://lkml.kernel.org/r/202301291110098787982@zte.com.cn Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
f768b35a23 |
Fixes for 6.3-rc3:
Merge tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs percpu counter fixes from Darrick Wong: "We discovered a filesystem summary counter corruption problem that was traced to cpu hot-remove racing with the call to percpu_counter_sum that sets the free block count in the superblock when writing it to disk. The root cause is that percpu_counter_sum doesn't cull from dying cpus and hence misses those counter values if the cpu shutdown hooks have not yet run to merge the values.

I'm hoping this is a fairly painless fix to the problem, since the dying cpu mask should generally be empty. It's been in for-next for a week without any complaints from the bots.

- Fix a race in the percpu counters summation code where the summation failed to add in the values for any CPUs that were dying but not yet dead. This fixes some minor discrepancies and incorrect assertions when running generic/650"

* tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  pcpcntr: remove percpu_counter_sum_all()
  fork: remove use of percpu_counter_sum_all
  pcpcntrs: fix dying cpu summation race
  cpumask: introduce for_each_cpu_or |
||
Geert Uytterhoeven
|
13684e966d |
lib: dhry: fix unstable smp_processor_id(_) usage
When running the in-kernel Dhrystone benchmark with CONFIG_DEBUG_PREEMPT=y: BUG: using smp_processor_id() in preemptible [00000000] code: bash/938 Fix this by not using smp_processor_id() directly, but instead wrapping the whole benchmark inside a get_cpu()/put_cpu() pair. This makes sure the whole benchmark is run on the same CPU core, and the reported values are consistent. Link: https://lkml.kernel.org/r/b0d29932bb24ad82cea7f821e295c898e9657be0.1678890070.git.geert+renesas@glider.be Fixes: d5528cc16893f1f6 ("lib: add Dhrystone benchmark test") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reported-by: Tobias Klausmann <klausman@schwarzvogel.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217179 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
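The shape of the fix (simplified; the real change wraps the benchmark run in the dhrystone test module):

```c
int cpu = get_cpu();	/* disables preemption and returns the current CPU */

/* With preemption off, the run stays on one core, so per-CPU state and
 * smp_processor_id() (used by the benchmark) are legal and stable. */
pr_info("Running Dhrystone on CPU %d\n", cpu);
run_dhrystone_benchmark();	/* hypothetical: the whole measured run */
put_cpu();			/* re-enables preemption */
```

get_cpu()/put_cpu() is the idiomatic pairing here: it both silences the CONFIG_DEBUG_PREEMPT warning and makes the reported numbers consistent, since the measurement can no longer migrate between cores mid-run.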
Liam R. Howlett
|
4bd6dded63 |
test_maple_tree: add more testing for mas_empty_area()
Test robust filling of an entire area of the tree, then test one beyond. This tests walking back up the tree at the end of nodes and the error condition. Test inspired by the reproducer code provided by Snild Dolkow. The last test in the function tests for the case of a corrupted maple state caused by the incorrect limits set during mas_skip_node(). There needs to be a gap in the second last child and last child, but the search must rule out the second last child's gap. This would avoid correcting the maple state to the correct max limit and return an error. Link: https://lkml.kernel.org/r/20230307180247.2220303-3-Liam.Howlett@oracle.com Cc: Snild Dolkow <snild@sony.com> Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/ Fixes: e15e06a83923 ("lib/test_maple_tree: add testing for maple tree") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
0fa99fdfe1 |
maple_tree: fix mas_skip_node() end slot detection
Patch series "Fix mas_skip_node() for mas_empty_area()", v2. mas_empty_area() was incorrectly returning an error when there was room. The issue was tracked down to mas_skip_node() using the incorrect end-of-slot count. Instead of using the nodes hard limit, the limit of data should be used. mas_skip_node() was also setting the min and max to that of the child node, which was unnecessary. Within these limits being set, there was also a bug that corrupted the maple state's max if the offset was set to the maximum node pivot. The bug was without consequence unless there was a sufficient gap in the next child node which would cause an error to be returned. This patch set fixes these errors by removing the limit setting from mas_skip_node() and uses the mas_data_end() for slot limits, and adds tests for all failures discovered. This patch (of 2): mas_skip_node() is used to move the maple state to the node with a higher limit. It does this by walking up the tree and increasing the slot count. Since slot count may not be able to be increased, it may need to walk up multiple times to find room to walk right to a higher limit node. The limit of slots that was being used was the node limit and not the last location of data in the node. This would cause the maple state to be shifted outside actual data and enter an error state, thus returning -EBUSY. The result of the incorrect error state means that mas_awalk() would return an error instead of finding the allocation space. The fix is to use mas_data_end() in mas_skip_node() to detect the nodes data end point and continue walking the tree up until it is safe to move to a node with a higher limit. The walk up the tree also sets the maple state limits so remove the buggy code from mas_skip_node(). Setting the limits had the unfortunate side effect of triggering another bug if the parent node was full and the there was no suitable gap in the second last child, but room in the next child. mas_skip_node() may also be passed a maple state in an error state from mas_anode_descend() when no allocations are available. Return on such an error state immediately. Link: https://lkml.kernel.org/r/20230307180247.2220303-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230307180247.2220303-2-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Snild Dolkow <snild@sony.com> Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/ Tested-by: Snild Dolkow <snild@sony.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fangrui Song
|
aff69273af |
vdso: Improve cmd_vdso_check to check all dynamic relocations
The actual intention is that no dynamic relocation exists in the VDSO. For this the VDSO build validates that the resulting .so file does not have any relocations which are specified via $(ARCH_REL_TYPE_ABS) per architecture, which is fragile as e.g. ARM64 lacks an entry for R_AARCH64_RELATIVE. Aside from that, ARCH_REL_TYPE_ABS is a misnomer as it checks for relative relocations too. However, some GNU ld ports produce unneeded R_*_NONE relocation entries. If a port fails to determine the exact .rel[a].dyn size, the trailing zeros become R_*_NONE relocations (e.g. ld's powerpc port, recently fixed in https://sourceware.org/bugzilla/show_bug.cgi?id=29540). R_*_NONE relocations are generally a no-op in the dynamic loaders, so just ignore them. Remove the ARCH_REL_TYPE_ABS defines and just validate that the resulting .so file does not contain any R_* relocation entries except R_*_NONE. Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for aarch64 Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for vDSO, aarch64 Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20230310190750.3323802-1-maskray@google.com |
||
Dave Chinner
|
e9b60c7f97 |
pcpcntr: remove percpu_counter_sum_all()
percpu_counter_sum_all() is now redundant, as the race condition it was invented to handle is dealt with by percpu_counter_sum() directly and all users of percpu_counter_sum_all() have been removed. Remove it. This effectively reverts the changes made in f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface") except for the cpumask iteration that fixes percpu_counter_sum(), made earlier in this series. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
||
Dave Chinner
|
8b57b11cca |
pcpcntrs: fix dying cpu summation race
In commit f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface") a race condition between a cpu dying and percpu_counter_sum() iterating online CPUs was identified. The solution was to iterate all possible CPUs for summation via percpu_counter_sum_all(). We recently had a percpu_counter_sum() call in XFS trip over this same race condition and it fired a debug assert because the filesystem was unmounting and the counter *should* be zero just before we destroy it. That was reported here: https://lore.kernel.org/linux-kernel/20230314090649.326642-1-yebin@huaweicloud.com/ likely as a result of running generic/648, which exercises filesystems in the presence of CPU online/offline events. The solution of using percpu_counter_sum_all() is an awful one. We use percpu counters and percpu_counter_sum() for accurate and reliable threshold detection for space management, so a summation race condition during these operations can result in overcommit of available space, and that may result in filesystem shutdowns. As percpu_counter_sum_all() iterates all possible CPUs rather than just those online or even those present, the mask can include CPUs that aren't even installed in the machine, or, in the case of machines that can hot-plug CPU-capable nodes, CPUs whose physical sockets aren't even present in the machine. Fundamentally, this race condition is caused by the CPU being offlined being removed from the cpu_online_mask before the notifier that cleans up per-cpu state is run. Hence percpu_counter_sum() will not sum the count for a cpu currently being taken offline, regardless of whether the notifier has run or not. This is the root cause of the bug. The percpu counter notifier iterates all the registered counters, locks the counter and moves the percpu count to the global sum. This is serialised against other operations that move the percpu counter to the global sum as well as percpu_counter_sum() operations that sum the percpu counts while holding the counter lock. Hence the notifier is safe to run concurrently with sum operations, and the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no additional overhead except for a cpumask_or() operation. This change makes percpu_counter_sum() always do the right thing in the presence of CPU hot unplug events and makes percpu_counter_sum_all() unnecessary. This, in turn, means that filesystems like XFS, ext4, and btrfs don't have to work out when they should use percpu_counter_sum() vs percpu_counter_sum_all() in their space accounting algorithms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
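A simplified sketch of what the summation looks like with this change, based on the description above; debug/config variants of percpu_counter_sum() are elided and the function name here is illustrative:

```c
#include <linux/percpu_counter.h>

/* Sum the global count plus per-cpu deltas over the union of online
 * and dying CPUs, so a CPU that has already left cpu_online_mask but
 * whose teardown notifier hasn't run yet is still counted. */
static s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);

	return ret;
}
```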
||
Dave Chinner
|
1470afefc3 |
cpumask: introduce for_each_cpu_or
Equivalent of for_each_cpu_and, except it ORs the two masks together so it iterates all the CPUs present in either mask. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
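A toy illustration of the difference from for_each_cpu_and(); the function and messages are made up:

```c
#include <linux/cpumask.h>
#include <linux/printk.h>

static void walk_masks(const struct cpumask *a, const struct cpumask *b)
{
	int cpu;

	/* Intersection: CPUs set in both masks. */
	for_each_cpu_and(cpu, a, b)
		pr_info("cpu %d is in both masks\n", cpu);

	/* Union: CPUs set in either mask, each visited once. */
	for_each_cpu_or(cpu, a, b)
		pr_info("cpu %d is in at least one mask\n", cpu);
}
```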
||
Linus Torvalds
|
ed38ff164f |
Zstd fixes for v6.3
A small number of fixes for zstd-v1.5.2. I'm not pulling in zstd-v1.5.4 from upstream this release because it didn't have any time to bake in linux-next, but I'm aiming for the next update in v6.4. I've rebased my tree onto v6.2 to remove the incorrect back merges as suggested by Linus in my initial PR for v6.3 [0]. [0] https://lore.kernel.org/lkml/C8C4DFDA-998F-48AD-93C9-DE16F8080A02@meta.com/ Signed-off-by: Nick Terrell <terrelln@fb.com> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEmIwAqlFIzbQodPwyuzRpqaNEqPUFAmQPxq4ACgkQuzRpqaNE qPUC/w/+OnUlhZu4RKQuiZsHtmtFdWBgPxti3nCC/kYdNxTcX1LXKTvVU54WpgMG pbEh2CN9l5isKOcoCOtmU6Mbt0WxUWhK5P1OGqVjrgjt3bubMlwR5t5JZ7D1RIH4 TgAFOOB64x1Q0dWSuZmkLX8rFTsu1ig8jJGpCRiKkG+ckK+PqeJLszzEmtj+bmuT fRn4WpItg9DBcoS/SBWjC9/CC1K1rzsuZghwDWzo5OP6wBF+VugMuZ/wXT9uY3yT y0lhB3mBmIZZSZwD/t7gZN3aVD8550W8taZGJ7T3fdIsurmlPKEqefIJ1bFKalfc ZR7j3v/ro3t+uwvFlzZuxnnNXSavk7yz/wLjAnQhW4RYXDt7Gso3+pDCMDHha2oE An3DqAha5KaOsJlW97mka1527El6gmK0xsAHPQ29waj1H6a7IYq2fGaFdUA/3L7c s5qtuUuhn3FyVX8POr79jPJa9xNiT4gj63V18s/4lChuHKPHBAlS07OsbbUY33Ep q+O1zb6fYQpgcCV/gv4yHKoGMdCOZzpp2VCtKS9gz/XU40NV1qLurCWM+wYJc94Y Afkthf6BLX41kmqtNzS2g/CZUN1rH3mHJrG8RKm68+rIHB4dvzG55VUwjXdj+2gY OYakXRlEw4S4YiNn4uFg6OoaSlYJJASusVK1Ed7MpnkCiLNAxS4= =1tCu -----END PGP SIGNATURE----- Merge tag 'zstd-linus-v6.3-rc3' of https://github.com/terrelln/linux Pull zstd fixes from Nick Terrell: "A small number of fixes for zstd-v1.5.2. I'm not pulling in zstd-v1.5.4 from upstream this release because it didn't have any time to bake in linux-next, but I'm aiming for the next update in v6.4" * tag 'zstd-linus-v6.3-rc3' of https://github.com/terrelln/linux: zstd: Fix definition of assert() lib: zstd: Backport fix for in-place decompression lib: zstd: Fix -Wstringop-overflow warning |
||
Rae Moar
|
2c6a96dad5 |
kunit: fix bug of extra newline characters in debugfs logs
Fix a bug of extra newline characters in debugfs logs. When a line is added to debugfs with a newline character at the end, an extra line appears in the debugfs log. This is due to a discrepancy between how the lines are printed and how they are added to the logs. Remove this discrepancy by checking whether a newline character is already present before appending one. This should closely match the printk behavior. Add kunit_log_newline_test to provide test coverage for this issue. (Also, move kunit_log_test above the suite definition to remove the unnecessary forward declaration.) As an example, say we add these two lines to the log: kunit_log(..., "KTAP version 1\n"); kunit_log(..., "1..1"); The debugfs log before this fix (with an extra blank line between the two entries): KTAP version 1 1..1 The debugfs log after this fix: KTAP version 1 1..1 Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
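A hedged sketch of the check described above; the helper shape is paraphrased rather than copied from the commit, and KUNIT_LOG_SIZE is assumed to be the log buffer size from <kunit/test.h>:

```c
#include <linux/string.h>
#include <kunit/test.h>

/* Append a newline only if the last logged character isn't one
 * already, mirroring how printk treats trailing newlines. */
static void kunit_log_newline(char *log)
{
	size_t len = strlen(log);

	if (len && log[len - 1] != '\n')
		strncat(log, "\n", KUNIT_LOG_SIZE - len - 1);
}
```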
||
Rae Moar
|
f9a301c331 |
kunit: fix bug in the order of lines in debugfs logs
Fix a bug in debugfs logs that causes an incorrect order of lines in the debugfs log. Currently, the test-count lines that show the number of tests passed, failed, and skipped, as well as any suite diagnostic lines, appear prior to the individual results, which is a bug. Ensure the order of printing for the debugfs log is correct. Additionally, add a KTAP header so the debugfs logs can be valid KTAP. This is an example of a log prior to these fixes: KTAP version 1 # Subtest: kunit_status 1..2 # kunit_status: pass:2 fail:0 skip:0 total:2 # Totals: pass:2 fail:0 skip:0 total:2 ok 1 kunit_status_set_failure_test ok 2 kunit_status_mark_skipped_test ok 1 kunit_status Note the two lines with stats are out of order. This is the same debugfs log after the fixes (in combination with the third patch to remove the extra line): KTAP version 1 1..1 KTAP version 1 # Subtest: kunit_status 1..2 ok 1 kunit_status_set_failure_test ok 2 kunit_status_mark_skipped_test # kunit_status: pass:2 fail:0 skip:0 total:2 # Totals: pass:2 fail:0 skip:0 total:2 ok 1 kunit_status Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
||
Rae Moar
|
887d85a073 |
kunit: fix bug in debugfs logs of parameterized tests
Fix bug in debugfs logs that causes individual parameterized results to not appear because the log is reinitialized (cleared) when each parameter is run. Ensure these results appear in the debugfs logs, increase log size to allow for the size of parameterized results. As a result, append lines to the log directly rather than using an intermediate variable that can cause stack size warnings due to the increased log size. Here is the debugfs log of ext4_inode_test which uses parameterized tests before the fix: KTAP version 1 # Subtest: ext4_inode_test 1..1 # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test As you can see, this log does not include any of the individual parametrized results. After (in combination with the next two fixes to remove extra empty line and ensure KTAP valid format): KTAP version 1 1..1 KTAP version 1 # Subtest: ext4_inode_test 1..1 KTAP version 1 # Subtest: inode_test_xtimestamp_decoding ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits ... (the rest of the individual parameterized tests) ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra # inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16 ok 1 inode_test_xtimestamp_decoding # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
||
Jonathan Neuschäfer
|
6906598f1c |
zstd: Fix definition of assert()
assert(x) should emit a warning if x is false. WARN_ON(x) emits a warning if x is true. Thus, assert(x) should be defined as WARN_ON(!x) rather than WARN_ON(x). Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Nick Terrell <terrelln@fb.com> |
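In code, the fix is the one-line change the message describes (parenthesized here for macro hygiene):

```c
/* WARN_ON() fires when its condition is true, so assert(x) must
 * warn when x is *not* true. */
#define assert(x) WARN_ON(!(x))
```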
||
Nick Terrell
|
038505c41f |
lib: zstd: Backport fix for in-place decompression
Backport the relevant part of upstream commit 5b266196 [0]. This fixes in-place decompression for x86-64 kernel decompression. It uses a bound of 131072 + (uncompressed_size >> 8), which can be violated after upstream commit 6a7ede3d [1], as zstd can use part of the output buffer as temporary storage, and without this patch needs a bound of ~262144. The fix is for zstd to detect that the input and output buffers overlap, so that zstd knows it can't use the overlapping portion of the output buffer as temporary storage. If the margin is not large enough, this will ensure that zstd will fail the decompression, rather than overwriting part of the input data, and causing corruption. This fix has landed upstream and is in release v1.5.4. That commit also adds unit and fuzz tests to verify that the margin we use is respected, and correct. That means that the fix is well tested upstream. I have not been able to reproduce the potential bug in x86-64 kernel decompression locally, nor have I received reports of failures to decompress the kernel. It is possible that compression saves enough space to make it very hard for the issue to appear. I've boot tested the zstd compressed kernel on x86-64 and i386 with this patch, which uses in-place decompression, and sanity tested zstd compression in btrfs / squashfs to make sure that we don't see any issues, but other uses of zstd shouldn't be affected, because they don't use in-place decompression. Thanks to Vasily Gorbik <gor@linux.ibm.com> for debugging a related issue on s390, which was triggered by the same commit, but was a bug in how __decompress() was called [2]. And to Sasha Levin <sashal@kernel.org> for the CC alerting me of the issue. [0] |
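A minimal sketch of the overlap-detection idea; the function name and signature are illustrative, not zstd's internal API:

```c
#include <stddef.h>

/* Returns nonzero when the two buffers share any bytes; in that case
 * the decompressor must not reuse the overlapping tail of the output
 * buffer as temporary workspace, and should fail the decompression if
 * the remaining margin is too small. */
static int buffers_overlap(const void *dst, size_t dst_cap,
			   const void *src, size_t src_size)
{
	const unsigned char *d = dst;
	const unsigned char *s = src;

	return (d < s + src_size) && (s < d + dst_cap);
}
```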
||
Kees Cook
|
780f6a9afe |
lib: zstd: Fix -Wstringop-overflow warning
Fix the following -Wstringop-overflow warning when building with GCC 11+: lib/zstd/decompress/huf_decompress.c: In function ‘HUF_readDTableX2_wksp’: lib/zstd/decompress/huf_decompress.c:700:5: warning: ‘HUF_fillDTableX2.constprop’ accessing 624 bytes in a region of size 52 [-Wstringop-overflow=] 700 | HUF_fillDTableX2(dt, maxTableLog, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 701 | wksp->sortedSymbol, sizeOfSort, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 702 | wksp->rankStart0, wksp->rankVal, maxW, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 703 | tableLog+1, | ~~~~~~~~~~~ 704 | wksp->calleeWksp, sizeof(wksp->calleeWksp) / sizeof(U32)); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ lib/zstd/decompress/huf_decompress.c:700:5: note: referencing argument 6 of type ‘U32 (*)[13]’ {aka ‘unsigned int (*)[13]’} lib/zstd/decompress/huf_decompress.c:571:13: note: in a call to function ‘HUF_fillDTableX2.constprop’ 571 | static void HUF_fillDTableX2(HUF_DEltX2* DTable, const U32 targetLog, | ^~~~~~~~~~~~~~~~ by using pointer notation instead of array notation. This is one of the last remaining warnings to be fixed before globally enabling -Wstringop-overflow. Co-developed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Nick Terrell <terrelln@fb.com> |
||
Linus Torvalds
|
596ff4a09b |
cpumask: re-introduce constant-sized cpumask optimizations
Commit aa47a7c215e7 ("lib/cpumask: deprecate nr_cpumask_bits") resulted in the cpumask operations potentially becoming hugely less efficient, because suddenly the cpumask was always considered to be variable-sized. The optimization was then later added back in a limited form by commit 6f9c07be9d02 ("lib/cpumask: add FORCE_NR_CPUS config option"), but that FORCE_NR_CPUS option is not useful in a generic kernel and more of a special case for embedded situations with fixed hardware. Instead, just re-introduce the optimization, with some changes. Instead of depending on CPUMASK_OFFSTACK being false, and then always using the full constant cpumask width, this introduces three different cpumask "sizes": - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids. This is used for situations where we should use the exact size. - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it fits in a single word and the bitmap operations thus end up able to trigger the "small_const_nbits()" optimizations. This is used for the operations that have optimized single-word cases that get inlined, notably the bit find and scanning functions. - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it is a sufficiently small constant that makes simple "copy" and "clear" operations more efficient. This is arbitrarily set at four words or less. As an example of this situation, without this fixed size optimization, cpumask_clear() will generate code like movl nr_cpu_ids(%rip), %edx addq $63, %rdx shrq $3, %rdx andl $-8, %edx callq memset@PLT on x86-64, because it would calculate the "exact" number of longwords that need to be cleared. In contrast, with this patch, using a MAX_CPU of 64 (which is quite a reasonable value to use), the above becomes a single movq $0,cpumask instruction instead, because instead of caring to figure out exactly how many CPUs the system has, it just knows that the cpumask will be a single word and can just clear it all. Note that this does end up tightening the rules a bit from the original version in another way: operations that set bits in the cpumask are now limited to the actual nr_cpu_ids limit, whereas we used to do the nr_cpumask_bits thing almost everywhere in the cpumask code. But if you just clear bits, or scan for bits, we can use the simpler compile-time constants. In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()' which were not useful, and which fundamentally have to be limited to 'nr_cpu_ids'. Better remove them now than have somebody introduce use of them later. Of course, on x86-64 with MAXSMP there is no sane small compile-time constant for the cpumask sizes, and we end up using the actual CPU bits, and will generate the above kind of horrors regardless. Please don't use MAXSMP unless you really expect to have machines with thousands of cores. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
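A condensed sketch of the three sizes as described above; the exact preprocessor guards in include/linux/cpumask.h differ slightly:

```c
#if NR_CPUS <= BITS_PER_LONG
/* Whole mask fits in one word: everything constant-folds. */
#define small_cpumask_bits	((unsigned int)NR_CPUS)
#define large_cpumask_bits	((unsigned int)NR_CPUS)
#elif NR_CPUS <= 4 * BITS_PER_LONG
/* Small enough that fixed-size copy/clear beats a runtime length. */
#define small_cpumask_bits	nr_cpu_ids
#define large_cpumask_bits	((unsigned int)NR_CPUS)
#else
/* MAXSMP-style configs: fall back to the runtime CPU count. */
#define small_cpumask_bits	nr_cpu_ids
#define large_cpumask_bits	nr_cpu_ids
#endif

/* The exact size always tracks the actual number of CPU ids. */
#define nr_cpumask_bits		nr_cpu_ids
```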
||
Linus Torvalds
|
20fdfd55ab |
17 hotfixes. Eight are for MM and seven are for other parts of the
kernel. Seven are cc:stable and eight address post-6.3 issues or were judged unsuitable for -stable backporting. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZAO0bAAKCRDdBJ7gKXxA jo73AP0Sbgd+E0u5Hs+aACHW28FpxleVRdyexc5chXD5QsyLKgEAwjntE7jfHHYK GkUKsoWQJblgjm3ksRxdLbVkDSQ8sQE= =CQ0B -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. Eight are for MM and seven are for other parts of the kernel. Seven are cc:stable and eight address post-6.3 issues or were judged unsuitable for -stable backporting" * tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: map Dikshita Agarwal's old address to his current one mailmap: map Vikash Garodia's old address to his current one fs/cramfs/inode.c: initialize file_ra_state fs: hfsplus: fix UAF issue in hfsplus_put_super panic: fix the panic_print NMI backtrace setting lib: parser: update documentation for match_NUMBER functions kasan, x86: don't rename memintrinsics in uninstrumented files kasan: test: fix test for new meminstrinsic instrumentation kasan: treat meminstrinsic as builtins in uninstrumented files kasan: emit different calls for instrumentable memintrinsics ocfs2: fix non-auto defrag path not working issue ocfs2: fix defrag path triggering jbd2 ASSERT mailmap: map Georgi Djakov's old Linaro address to his current one mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH mm/damon/paddr: fix missing folio_put() mm/mremap: fix dup_anon_vma() in vma_merge() case 4 |
||
Eric Biggers
|
359d62559f |
lib: parser: update documentation for match_NUMBER functions
commit 67222c4ba8af ("lib: parser: optimize match_NUMBER apis to use local array") removed -ENOMEM as a possible return value, so update the comments accordingly. Link: https://lkml.kernel.org/r/20230224042618.9092-1-ebiggers@kernel.org Fixes: 67222c4ba8af ("lib: parser: optimize match_NUMBER apis to use local array") Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Li Lingfeng <lilingfeng3@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
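An illustrative caller reflecting the updated documentation; the wrapper and option name are hypothetical:

```c
#include <linux/parser.h>

/* Since commit 67222c4ba8af, the match_NUMBER helpers use a local
 * array instead of allocating, so -ENOMEM is no longer a possible
 * return; callers only need to expect -EINVAL (bad digits) or
 * -ERANGE (out of range for the type). */
static int parse_retry_count(substring_t *arg, int *retries)
{
	int err = match_int(arg, retries);

	if (err)
		return err;	/* -EINVAL or -ERANGE, never -ENOMEM */
	if (*retries < 0)
		return -EINVAL;
	return 0;
}
```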
||
Marco Elver
|
36be5cba99 |
kasan: treat meminstrinsic as builtins in uninstrumented files
Where the compiler instruments memintrinsics by generating calls to __asan/__hwasan_ prefixed functions, let the compiler consider memintrinsics as builtins again. To do so, never override memset/memmove/memcpy if the compiler does the correct instrumentation - even on !GENERIC_ENTRY architectures. [elver@google.com: powerpc: don't rename memintrinsics if compiler adds prefixes] Link: https://lore.kernel.org/all/20230224085942.1791837-1-elver@google.com/ [1] Link: https://lkml.kernel.org/r/20230227094726.3833247-1-elver@google.com Link: https://lkml.kernel.org/r/20230224085942.1791837-2-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mikhail Zaslonko
|
1c0a0af511 |
lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH
DFLTCC deflate with Z_NO_FLUSH might generate a corrupted stream when the output buffer is not large enough to fit all the deflate output at once. The problem occurs on closing the deflate block, since flush_pending() might leave some output bits unwritten. A similar problem for software deflate with the Z_BLOCK flush option (not supported by kernel zlib deflate) was fixed a while ago in userspace zlib, but the fix never made it into the kernel. Now flush_pending() flushes the bit buffer before copying out the byte buffer, in order to really flush as much as possible. Currently there are no users of DFLTCC deflate with the Z_NO_FLUSH option in the kernel, so the problem remained hidden for a while. This commit is based on the old zlib commit: https://github.com/madler/zlib/commit/0b828b4 Link: https://lkml.kernel.org/r/20230221131617.3369978-2-zaslonko@linux.ibm.com Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
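A sketch of the corrected flush_pending() following the referenced upstream zlib commit; simplified and using upstream zlib names (the kernel copy prefixes some of these with zlib_), not byte-for-byte the kernel patch:

```c
/* Flush as much pending output as possible. The added _tr_flush_bits()
 * call empties the bit buffer into the pending byte buffer first, so
 * closing a deflate block can no longer strand buffered bits. */
static void flush_pending(z_streamp strm)
{
    unsigned len;
    deflate_state *s = (deflate_state *)strm->state;

    _tr_flush_bits(s);              /* the fix: write out buffered bits */
    len = s->pending;
    if (len > strm->avail_out)
        len = strm->avail_out;
    if (len == 0)
        return;

    memcpy(strm->next_out, s->pending_out, len);
    strm->next_out  += len;
    s->pending_out  += len;
    strm->total_out += len;
    strm->avail_out -= len;
    s->pending      -= len;
    if (s->pending == 0)
        s->pending_out = s->pending_buf;
}
```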