/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * Queued spinlock
 *
 * A 'generic' spinlock implementation that is based on MCS locks. For an
 * architecture that's looking for a 'generic' spinlock, please first consider
 * ticket-lock.h and only come looking here when you've considered all the
 * constraints below and can show your hardware does actually perform better
 * with qspinlock.
 *
 * qspinlock relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no
 * weaker than RCtso if you're power), where regular code only expects atomic_t
 * to be RCpc.
 *
 * qspinlock relies on a far greater (compared to asm-generic/spinlock.h) set
 * of atomic operations to behave well together, please audit them carefully to
 * ensure they all have forward progress. Many atomic operations may default to
 * cmpxchg() loops which will not have good forward progress properties on
 * LL/SC architectures.
 *
 * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply)
 * do. Carefully read the patches that introduced
 * queued_fetch_set_pending_acquire().
 *
 * qspinlock also heavily relies on mixed size atomic operations, in specific
 * it requires architectures to have xchg16; something which many LL/SC
 * architectures need to implement as a 32-bit and+or in order to satisfy the
 * forward progress guarantees mentioned above.
 *
 * Further reading on mixed size atomics that might be relevant:
 *
 *   http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
 *
 * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
 * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
 *
 * Authors: Waiman Long <waiman.long@hpe.com>
 */
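The "xchg16 as a 32-bit and+or" remark above can be made concrete with a small sketch. This is an illustrative userspace helper using GCC's `__atomic` builtins, not kernel code; the function name and layout are invented here. Note that the CAS loop is exactly the kind of construct whose forward progress the comment asks you to audit on LL/SC machines:

```c
#include <stdint.h>

/* Hypothetical helper (not part of any kernel API): emulate a 16-bit
 * exchange of the low half of an aligned 32-bit word using a 32-bit
 * compare-and-swap loop, the way many LL/SC architectures must. */
static uint16_t xchg16_lo_via_cas32(uint32_t *word, uint16_t newv)
{
	uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
	uint32_t want;

	do {
		/* "and" out the low 16 bits, "or" in the new value */
		want = (old & 0xffff0000u) | newv;
	} while (!__atomic_compare_exchange_n(word, &old, want, 0,
					      __ATOMIC_ACQ_REL,
					      __ATOMIC_RELAXED));

	return (uint16_t)(old & 0xffffu);	/* previous low half */
}
```

Under contention on the upper half of the word, this loop can retry indefinitely, which is why the header insists the mixed-size operations be audited together rather than in isolation.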
#ifndef __ASM_GENERIC_QSPINLOCK_H
#define __ASM_GENERIC_QSPINLOCK_H

#include <asm-generic/qspinlock_types.h>
#include <linux/atomic.h>
/*
 * powerpc/64s: Implement queued spinlocks and rwlocks
 *
 * These have shown significantly improved performance and fairness when
 * spinlock contention is moderate to high on very large systems.
 *
 * With this series including subsequent patches, on a 16 socket 1536
 * thread POWER9, a stress test such as same-file open/close from all
 * CPUs gets big speedups, 11620op/s aggregate with simple spinlocks vs
 * 384158op/s (33x faster), where the difference in throughput between
 * the fastest and slowest thread goes from 7x to 1.4x.
 *
 * Thanks to the fast path being identical in terms of atomics and
 * barriers (after a subsequent optimisation patch), single threaded
 * performance is not changed (no measurable difference).
 *
 * On smaller systems, performance and fairness seem to be generally
 * improved. Using dbench on tmpfs as a test (that starts to run into
 * kernel spinlock contention), a 2-socket OpenPOWER POWER9 system was
 * tested with bare metal and KVM guest configurations. Results can be
 * found here:
 *
 *   https://github.com/linuxppc/issues/issues/305#issuecomment-663487453
 *
 * Observations are:
 *
 * - Queued spinlocks are equal when contention is insignificant, as
 *   expected and as measured with microbenchmarks.
 * - When there is contention, on bare metal queued spinlocks have better
 *   throughput and max latency at all points.
 * - When virtualised, queued spinlocks are slightly worse approaching
 *   peak throughput, but significantly better throughput and max latency
 *   at all points beyond peak, until queued spinlock maximum latency
 *   rises when clients are 2x vCPUs.
 *
 * The regressions haven't been analysed very well yet, there are a lot
 * of things that can be tuned, particularly the paravirtualised locking,
 * but the numbers already look like a good net win even on relatively
 * small systems.
 *
 * Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
 * Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
 * Acked-by: Waiman Long <longman@redhat.com>
 * Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
 * Link: https://lore.kernel.org/r/20200724131423.1362108-4-npiggin@gmail.com
 */
#ifndef queued_spin_is_locked
/**
 * queued_spin_is_locked - is the spinlock locked?
 * @lock: Pointer to queued spinlock structure
 * Return: 1 if it is locked, 0 otherwise
 */
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	/*
	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
	 * isn't immediately observable.
	 */
	return atomic_read(&lock->val);
}
#endif
/**
 * queued_spin_value_unlocked - is the spinlock structure unlocked?
 * @lock: queued spinlock structure
 * Return: 1 if it is unlocked, 0 otherwise
 *
 * N.B. Whenever there are tasks waiting for the lock, it is considered
 * locked wrt the lockref code, to prevent the lockref code from stealing
 * the lock and changing things underneath it. This also allows some
 * optimizations to be applied without conflict with lockref.
 */
static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
{
	return !lock.val.counter;
}
/**
 * queued_spin_is_contended - check if the lock is contended
 * @lock: Pointer to queued spinlock structure
 * Return: 1 if lock contended, 0 otherwise
 */
static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
{
	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
}
/**
 * queued_spin_trylock - try to acquire the queued spinlock
 * @lock: Pointer to queued spinlock structure
 * Return: 1 if lock acquired, 0 if failed
 */
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);

	if (unlikely(val))
		return 0;

	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
}
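The trylock fastpath above (a plain read, bail out if nonzero, then a single acquire CAS) can be mimicked in userspace with C11 atomics. The toy type and names below are invented for illustration and are not the kernel API; the point is that a failed read costs no CAS traffic, and only a successful CAS carries acquire ordering:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_LOCKED_VAL 1	/* stand-in for _Q_LOCKED_VAL */

struct toy_qspinlock {
	atomic_int val;		/* 0 = unlocked, nonzero = locked/contended */
};

static bool toy_trylock(struct toy_qspinlock *lock)
{
	int val = atomic_load_explicit(&lock->val, memory_order_relaxed);

	/* Already locked or contended: fail fast without a CAS. */
	if (val)
		return false;

	/* Acquire ordering on success pairs with the release in unlock. */
	return atomic_compare_exchange_strong_explicit(&lock->val, &val,
			TOY_LOCKED_VAL, memory_order_acquire,
			memory_order_relaxed);
}
```

The early read is not just an optimisation: avoiding a CAS on an already-held lock keeps the cache line shared among the spinners instead of bouncing it.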
extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
#ifndef queued_spin_lock
/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
#endif
2015-04-24 21:56:30 +03:00
# ifndef queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @ lock : Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_unlock ( struct qspinlock * lock )
{
/*
2016-06-03 11:38:14 +03:00
* unlock ( ) needs release semantics :
2015-04-24 21:56:30 +03:00
*/
2018-04-26 13:34:24 +03:00
smp_store_release ( & lock - > locked , 0 ) ;
2015-04-24 21:56:30 +03:00
}
# endif
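The release-store unlock above has a direct C11 analogue. This is a toy sketch, not the kernel type: the owner releases by storing 0 to the locked byte with release ordering, which pairs with the acquire ordering on the CAS the next owner used to take the lock, making the critical section's writes visible to it:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Invented toy type for illustration; mirrors only the "locked" byte. */
struct toy_lock {
	_Atomic uint8_t locked;
};

static void toy_unlock(struct toy_lock *lock)
{
	/* Release store: everything before this in program order is
	 * visible to whoever next acquires the lock. No CAS needed,
	 * since only the owner writes here. */
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

That a plain store suffices is the payoff of the mixed-size layout: the owner can clear just its own byte without touching the pending/tail bits that waiters are updating in the same word.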
#ifndef virt_spin_lock
static __always_inline bool virt_spin_lock(struct qspinlock *lock)
{
	return false;
}
#endif
/*
 * Remapping spinlock architecture specific functions to the corresponding
 * queued spinlock functions.
 */
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)

#endif /* __ASM_GENERIC_QSPINLOCK_H */