Linus Torvalds cd4699c5fd prlimit and set/getpriority tasklist_lock optimizations
The tasklist_lock popped up as a scalability bottleneck on some testing
 workloads.  The readlocks in do_prlimit and set/getpriority are not
 necessary in all cases.
 
 Based on a cycles profile, it looked like ~87% of the time was spent in
 the kernel, ~42% of which was just trying to get *some* spinlock
 (queued_spin_lock_slowpath, not necessarily the tasklist_lock).
 
 The big offenders (with rough percentages in cycles of the overall trace):
 
 - do_wait 11%
 - setpriority 8% (this patchset)
 - kill 8%
 - do_exit 5%
 - clone 3%
 - prlimit64 2%   (this patchset)
 - getrlimit 1%   (this patchset)
 
 I can't easily test this patchset on the original workload for various
 reasons.  Instead, I used the microbenchmark below to at least verify
 there was some improvement.  This patchset had a 28% speedup (12% from
 baseline to set/getprio, then another 14% for prlimit).
 
 One interesting thing is that my libc's getrlimit() was calling
 prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
 had no effect - it essentially optimized the older syscalls only.  I
 didn't do that in this patchset, but figured I'd mention it since it was
 an option from the previous patch's discussion.
 
 v3: https://lkml.kernel.org/r/20220106172041.522167-1-brho@google.com
 v2: https://lore.kernel.org/lkml/20220105212828.197013-1-brho@google.com/
 - update_rlimit_cpu on the group_leader instead of for_each_thread.
 - update_rlimit_cpu still returns 0 or -ESRCH, even though we don't care
   about the error here.  It felt safer that way in case someone uses
   that function again.
 
 v1: https://lore.kernel.org/lkml/20211213220401.1039578-1-brho@google.com/
 
 int main(int argc, char **argv)
 {
         pid_t child;
         struct rlimit rlim[1];
 
         fork(); fork(); fork(); fork(); fork(); fork();
 
         for (int i = 0; i < 5000; i++) {
                 child = fork();
                 if (child < 0)
                         exit(1);
                 if (child > 0) {
                         usleep(1000);
                         kill(child, SIGTERM);
                         waitpid(child, NULL, 0);
                 } else {
                         for (;;) {
                                 setpriority(PRIO_PROCESS, 0,
                                             getpriority(PRIO_PROCESS, 0));
                                 getrlimit(RLIMIT_CPU, rlim);
                         }
                 }
         }
 
         return 0;
 }
 
 Barret Rhoden (3):
   setpriority: only grab the tasklist_lock for PRIO_PGRP
   prlimit: make do_prlimit() static
   prlimit: do not grab the tasklist_lock
 
  include/linux/posix-timers.h   |   2 +-
  include/linux/resource.h       |   2 -
  kernel/sys.c                   | 127 +++++++++++++++++----------------
  kernel/time/posix-cpu-timers.c |  12 +++-
  4 files changed, 76 insertions(+), 67 deletions(-)
 
 I have dropped the first change in this series as an almost identical
 change was merged as commit 7f8ca0edfe ("kernel/sys.c: only take
 tasklist_lock for get/setpriority(PRIO_PGRP)").
 
 Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmI7eCAACgkQC/v6Eiaj
 j0CN8w/+MEol1+sB/mDKgDgqbNE0sIXHTjQF37KPrsqB51aas9LSX7E7CBzvxF3M
 Y0MSk0VzSt4oGpmrNQOAEueeMeaMucPxI5JejGHEhtdHFBMqYXKpWuhqewIHx1pc
 lUcYpDeUOOBjwLO/VT5hfAKzIEMUl6tEDfzexl9IvpVwd661nVjDe+z12mDplJTi
 tjO8ZiSHkjkLE3cAYaTCajsaqpj7NLuIYB1d4CbbpU3vO5LYoffj/vtQ1e+7UxMB
 jhgaP/ylo0Ab8udYJ0PFIDmmQG/6s7csc3I1wtMgf8mqv88z4xspXNZBwYvf2hxa
 lBpSo+zD8Q88XipC+w63iBUa7YElLaai9xpLInO/Ir42G03/H/8TS9me1OLG+1Cz
 vloOid6CqH7KkNQ842txXeyj3xjW1DGR7U0QOrSxFQuWc6WZ2Q/l8KIZsuXuyt9G
 EwTjtoQvr1R+FNMtT/4g5WZ8sTYooIaHFvFQ745T6FzBp8mCVjINg4SUbVV3Wvck
 JRMxuHSFFBXj8IIJi9Bv6UE/j5APwa209KthvFCQayniNZU3XPKVa/bDWVoBk+SK
 Hch3M//QdAjKYmRf5gmDaBbRyqzaeiFjvX1MSnkbFryBX4/yIoEfo0/QsDRzSrJV
 vSSSU79h/XDI080gILOzNX4HiI4cpNcpOIB63Pmajyr6MxhrMqE=
 =VVGP
 -----END PGP SIGNATURE-----

Merge tag 'prlimit-tasklist_lock-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull tasklist_lock optimizations from Eric Biederman:
 "prlimit and getpriority tasklist_lock optimizations

  The tasklist_lock popped up as a scalability bottleneck on some
  testing workloads. The readlocks in do_prlimit and set/getpriority are
  not necessary in all cases.

  Based on a cycles profile, it looked like ~87% of the time was spent
  in the kernel, ~42% of which was just trying to get *some* spinlock
  (queued_spin_lock_slowpath, not necessarily the tasklist_lock).

  The big offenders (with rough percentages in cycles of the overall
  trace):
   - do_wait 11%
   - setpriority 8% (done previously in commit 7f8ca0edfe)
   - kill 8%
   - do_exit 5%
   - clone 3%
   - prlimit64 2%   (this patchset)
   - getrlimit 1%   (this patchset)

  I can't easily test this patchset on the original workload for various
  reasons. Instead, I used the microbenchmark below to at least verify
  there was some improvement. This patchset had a 28% speedup (12% from
  baseline to set/getprio, then another 14% for prlimit).

  This series used to do the setpriority case, but an almost identical
  change was merged as commit 7f8ca0edfe ("kernel/sys.c: only take
  tasklist_lock for get/setpriority(PRIO_PGRP)") so that has been
  dropped from here.

  One interesting thing is that my libc's getrlimit() was calling
  prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
  had no effect - it essentially optimized the older syscalls only. I
  didn't do that in this patchset, but figured I'd mention it since it
  was an option from the previous patch's discussion"
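The observation about getrlimit() and prlimit64 can be checked from user space. The following is a minimal sketch, not part of the patchset: the helper name rlimit_matches_prlimit64 and the struct krlimit64 are hypothetical, and it assumes a 64-bit Linux system where rlim_t is 64 bits wide. It only confirms that both paths report the same limits; seeing which syscall a given libc actually issues would need a tracer such as strace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mirror of the kernel's struct rlimit64: two u64 fields. */
struct krlimit64 {
	uint64_t cur;
	uint64_t max;
};

/*
 * Compare libc getrlimit() with a raw prlimit64 syscall on the calling
 * process (pid 0, no new limit supplied).
 * Returns 1 if both report the same limits, 0 if they differ,
 * -1 if either call fails.
 */
int rlimit_matches_prlimit64(int resource)
{
	struct rlimit via_libc;
	struct krlimit64 via_syscall;

	assert(resource >= 0);

	if (getrlimit(resource, &via_libc) != 0)
		return -1;
	if (syscall(SYS_prlimit64, 0, resource, NULL, &via_syscall) != 0)
		return -1;

	return (uint64_t)via_libc.rlim_cur == via_syscall.cur &&
	       (uint64_t)via_libc.rlim_max == via_syscall.max;
}
```

Calling rlimit_matches_prlimit64(RLIMIT_CPU) from a test program should return 1 on such a system, since both calls read the same per-process limit.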

microbenchmark.c:
---------------
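	#include <signal.h>
	#include <stdlib.h>
	#include <sys/resource.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(int argc, char **argv)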
	{
		pid_t child;
		struct rlimit rlim[1];

		fork(); fork(); fork(); fork(); fork(); fork();

		for (int i = 0; i < 5000; i++) {
			child = fork();
			if (child < 0)
				exit(1);
			if (child > 0) {
				usleep(1000);
				kill(child, SIGTERM);
				waitpid(child, NULL, 0);
			} else {
				for (;;) {
					setpriority(PRIO_PROCESS, 0,
						    getpriority(PRIO_PROCESS, 0));
					getrlimit(RLIMIT_CPU, rlim);
				}
			}
		}

		return 0;
	}

Link: https://lore.kernel.org/lkml/20211213220401.1039578-1-brho@google.com/ [v1]
Link: https://lore.kernel.org/lkml/20220105212828.197013-1-brho@google.com/ [v2]
Link: https://lore.kernel.org/lkml/20220106172041.522167-1-brho@google.com/ [v3]

* tag 'prlimit-tasklist_lock-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  prlimit: do not grab the tasklist_lock
  prlimit: make do_prlimit() static
2022-03-24 10:16:00 -07:00
arch ARM: DT updates for 5.18 2022-03-23 18:37:22 -07:00
block Filesystem folio changes for 5.18 2022-03-22 18:26:56 -07:00
certs KEYS: Introduce link restriction for machine keys 2022-03-08 13:55:52 +02:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2022-03-21 16:02:36 -07:00
Documentation ARM: DT updates for 5.18 2022-03-23 18:37:22 -07:00
drivers ARM: DT updates for 5.18 2022-03-23 18:37:22 -07:00
fs fs.rt.v5.18 2022-03-24 10:06:43 -07:00
include prlimit and set/getpriority tasklist_lock optimizations 2022-03-24 10:16:00 -07:00
init Changes in this cycle were: 2022-03-22 14:39:12 -07:00
ipc fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
kernel prlimit and set/getpriority tasklist_lock optimizations 2022-03-24 10:16:00 -07:00
lib asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
LICENSES LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers 2021-12-16 14:33:10 +01:00
mm asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
net asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
samples Tracing updates for 5.18: 2022-03-23 11:40:25 -07:00
scripts asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
security ARM driver updates for 5.18 2022-03-23 18:23:13 -07:00
sound ARM: SoC updates for 5.18 2022-03-23 18:20:09 -07:00
tools asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
usr reiserfs_xattr.h: add linux/reiserfs_xattr.h to UAPI compile-test coverage 2022-02-17 09:09:38 +01:00
virt KVM: Fix lockdep false negative during host resume 2022-02-17 09:52:50 -05:00
.clang-format genirq/msi: Make interrupt allocation less convoluted 2021-12-16 22:22:20 +01:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap MAINTAINERS: Update Jisheng's email address 2022-03-08 17:30:32 +01:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: replace a Microchip AT91 maintainer 2022-02-09 11:30:01 +01:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS ARM: DT updates for 5.18 2022-03-23 18:37:22 -07:00
Makefile Linux 5.17 2022-03-20 13:14:17 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.