IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable
vruntime numbers.
time advanced in 100ms:
weight=2
64765.988352
67012.881408
88501.412352
weight=3
35496.181411
34130.971298
35497.411573
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make rt-limit tunables work again
Mark Glines reported:
> I've got an issue on x86-64 where I can't configure the system to allow
> RT tasks for a non-root user.
>
> In 2.6.26.5, I was able to do the following to set things up nicely:
> echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
> echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
>
> Seems like every value I try to echo into the /sys files returns EINVAL.
For UID grouping we initialize the root group with infinite bandwidth
which by default is actually more than the global limit, therefore the
bandwidth check always fails.
Because the root group is a phantom group (for UID grouping) we cannot
runtime adjust it, therefore we let it reflect the global bandwidth
settings.
Reported-by: Mark Glines <mark@glines.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a new avg_wakeup statistic.
avg_wakeup is a measure of how frequently a task wakes up other tasks, it
represents the average time between wakeups, with a limit of avg_runtime
for when it doesn't wake up anybody.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.
This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.
Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:
# ./test-mutex-shm V 16 10 | grep "^avg ops"
avg ops/sec: 296604
# ./test-mutex-shm V 16 10 | grep "^avg ops"
avg ops/sec: 85870
The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.
Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.
Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:
# echo NO_OWNER_SPIN > /debug/sched_features
This command re-enables spinning again (this is also the default):
# echo OWNER_SPIN > /debug/sched_features
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.
Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicity disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo found a build error in the scheduler when RT_GROUP_SCHED was
enabled, but SMP was not. This patch rearranges the code such
that it is a little more streamlined and compiles under all permutations
of SMP, UP and RT_GROUP_SCHED. It was boot tested on my 4-way x86_64
and it still passes preempt-test.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
This reverts commit 7317d7b87edb41a9135e30be1ec3f7ef817c53dd.
This has been reported (and bisected) by Alexey Zaytsev and
Kamalesh Babulal to produce annoying warnings during bootup
on both x86 and powerpc.
kernel_locked() is not a valid test in IRQ context (we update the
BKL's ->lock_depth and the preempt count separately and non-atomicalyy),
so we cannot put it into the generic preempt debugging checks which
can run in IRQ contexts too.
Reported-and-bisected-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Reported-and-bisected-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on certain configs
Added 'double_rq_lock' forward declaration, allowing double_rq_lock
to be used in _double_lock_balance().
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix panic on ia64 with NR_CPUS=1024
struct sched_domain is now a dangling structure; where we really want
static ones, we need to use static_sched_domain.
(As the FIXME in this file says, cpumask_var_t would be better, but
this code is hairy enough without trying to add initialization code to
the right places).
Reported-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Vaidyanathan Srinivasan reported:
> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.28-autotest-tip-sv #1
> ---------------------------------------------
> klogd/5062 is trying to acquire lock:
> (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
>
> but task is already holding lock:
> (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
With sched_mc at 2. (it is default-off)
Strictly speaking we'll not deadlock, because ttwu will not be able to
place the migration task on our rq, but since the code can deal with
both rqs getting unlocked, this seems the easiest way out.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
init_rootdomain() calls alloc_bootmem_cpumask_var() at system boot,
so does cpupri_init().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not the responsibility of init_rootdomain() to free root_domain
allocated by alloc_rootdomain().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- Make arch_reinit_sched_domains() static. It was exported to be used in
s390, but now rebuild_sched_domains() is used instead.
- Make it return void.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The only caller is cpu_dev_init() which is marked as __init.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
x86: setup_per_cpu_areas() cleanup
cpumask: fix compile error when CONFIG_NR_CPUS is not defined
cpumask: use alloc_cpumask_var_node where appropriate
cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
x86: use cpumask_var_t in acpi/boot.c
x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
sched: put back some stack hog changes that were undone in kernel/sched.c
x86: enable cpus display of kernel_max and offlined cpus
ia64: cpumask fix for is_affinity_mask_valid()
cpumask: convert RCU implementations, fix
xtensa: define __fls
mn10300: define __fls
m32r: define __fls
h8300: define __fls
frv: define __fls
cris: define __fls
cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
cpumask: zero extra bits in alloc_cpumask_var_node
cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
cpumask: convert mm/
...
Impact: prevents panic from stack overflow on numa-capable machines.
Some of the "removal of stack hogs" changes in kernel/sched.c by using
node_to_cpumask_ptr were undone by the early cpumask API updates, and
causes a panic due to stack overflow. This patch undoes those changes
by using cpumask_of_node() which returns a 'const struct cpumask *'.
In addition, cpu_coregoup_map is replaced with cpu_coregroup_mask further
reducing stack usage. (Both of these updates removed 9 FIXME's!)
Also:
Pick up some remaining changes from the old 'cpumask_t' functions to
the new 'struct cpumask *' functions.
Optimize memory traffic by allocating each percpu local_cpu_mask on the
same node as the referring cpu.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.
In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.
This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time.
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Michael Neuling <mikey@neuling.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: fix warning in kernel/hrtimer.c
x86: make sure we really have an hpet mapping before using it
x86: enable HPET on Fujitsu u9200
linux/timex.h: cleanup for userspace
posix-timers: simplify de_thread()->exit_itimers() path
posix-timers: check ->it_signal instead of ->it_pid to validate the timer
posix-timers: use "struct pid*" instead of "struct task_struct*"
nohz: suppress needless timer reprogramming
clocksource, acpi_pm.c: put acpi_pm_read_slow() under CONFIG_PCI
nohz: no softirq pending warnings for offline cpus
hrtimer: removing all ur callback modes, fix
hrtimer: removing all ur callback modes, fix hotplug
hrtimer: removing all ur callback modes
x86: correct link to HPET timer specification
rtc-cmos: export second NVRAM bank
Fixed up conflicts in sound/drivers/pcsp/pcsp.c and sound/core/hrtimer.c
manually.
* 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
stacktrace: provide save_stack_trace_tsk() weak alias
rcu: provide RCU options on non-preempt architectures too
printk: fix discarding message when recursion_bug
futex: clean up futex_(un)lock_pi fault handling
"Tree RCU": scalable classic RCU implementation
futex: rename field in futex_q to clarify single waiter semantics
x86/swiotlb: add default swiotlb_arch_range_needs_mapping
x86/swiotlb: add default phys<->bus conversion
x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
x86: add swiotlb allocation functions
swiotlb: consolidate swiotlb info message printing
swiotlb: support bouncing of HighMem pages
swiotlb: factor out copy to/from device
swiotlb: add arch hook to force mapping
swiotlb: allow architectures to override phys<->bus<->phys conversions
swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
rcu: fix rcutorture behavior during reboot
resources: skip sanity check of busy resources
swiotlb: move some definitions to header
swiotlb: allow architectures to override swiotlb pool allocation
...
Fix up trivial conflicts in
arch/x86/kernel/Makefile
arch/x86/mm/init_32.c
include/linux/hardirq.h
as per Ingo's suggestions.
The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis). When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted. This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.
When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state. Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over. The system will stay
in this state until the system falls back below the <= 1 queued RT
task per RQ.
However, the current implementation suffers from a limitation in the
push logic. Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if it was previously
unpushable (due to affinity, etc). Whats more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue. This causes two problems:
1) We can have the same tasks analyzed over and over again during each
push, which extends out the fast path in the scheduler for no
gain. Consider a RQ that has dozens of tasks that are bound to a
core. Each one of those tasks will be encountered and skipped
for each push operation while they are queued.
2) There may be lower-priority tasks under the unpushable task that
could have been successfully pushed, but will never be considered
until either the unpushable task is cleared, or a pull operation
succeeds. The net result is a potential latency source for mid
priority tasks.
This patch aims to rectify these two conditions by introducing a new
priority sorted list: "pushable_tasks". A task is added to the list
each time a task is activated or preempted. It is removed from the
list any time it is deactivated, made current, or fails to push.
This works because a task only needs to be attempted to push once.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper. This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable tasks. Now every task will have a push attempted (when
appropriate).
This reduces latency both by shorting the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.
[ rostedt: added a couple more BUG_ONs ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquistion. This is currently not a problem
as we only look at rq->rt.overloaded. However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule to
a bare minimum set. Therefore, we introduce a new member-func called
needs_post_schedule() which tests for the post_schedule condtion without
actually performing the work. Therefore it is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).
We will use this later in the series
[ rostedt: removed paranoid BUG_ON ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
double_lock balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs. This can result in higher latencies on
higher cpu-ids.
This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again. Since callsites to
double_lock_balance already consider it a potential preemption/reschedule
point, they have the proper logic to recheck for atomicity violations.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
newidle critical section length by stopping after at least one task
was moved. Further investigation has shown that there are other
paths nested further inside the algorithm which still remain that allow
long latencies to occur with newidle balancing. This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Nick Piggin <npiggin@suse.de>
We will use this later in the series to reduce the amount of rq-lock
contention during a pull operation
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
cpu_coregroup_map returned a cpumask_t: it's going away.
(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Impact: extend the wakeup tracepoint with the info whether the wakeup was real
Add the information needed to distinguish 'real' wakeups from 'false'
wakeups.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Morton reported:
> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
>
> This warning is correct - the code is buggy.
In sched.c load_balance_newidle, there's real potential use of
uninitialised variable - fix it.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix cpumask conversion bug
this warning:
kernel/sched.c: In function ‘find_busiest_group’:
kernel/sched.c:3429: warning: passing argument 1 of ‘__first_cpu’ from incompatible pointer type
shows that we forgot to convert a new patch to the new cpumask APIs.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: tweak task balancing to save power more agressively
Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.
This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
This logic provides effective consolidation of short running
daemon jobs in a almost idle system
The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend load-balancing code (no change in behavior yet)
When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.
Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.
This info can be stored in appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical. Hence this information is stored in root_domain
struct which is one copy per partitioned sched domain.
The root_domain can be accessed from each cpu's runqueue
and there is one copy per partitioned sched domain.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>