Commit Graph

13821 Commits

Author SHA1 Message Date
Linus Torvalds
bca1a5c0ea Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
  updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
  fixes) now that Arnaldo is back from vacation."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  uprobes: __replace_page() needs munlock_vma_page()
  uprobes: Rename vma_address() and make it return "unsigned long"
  uprobes: Fix register_for_each_vma()->vma_address() check
  uprobes: Introduce vaddr_to_offset(vma, vaddr)
  uprobes: Teach build_probe_list() to consider the range
  uprobes: Remove insert_vm_struct()->uprobe_mmap()
  uprobes: Remove copy_vma()->uprobe_mmap()
  uprobes: Fix overflow in vma_address()/find_active_uprobe()
  uprobes: Suppress uprobe_munmap() from mmput()
  uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
  uprobes: Clean up and document write_opcode()->lock_page(old_page)
  uprobes: Kill write_opcode()->lock_page(new_page)
  uprobes: __replace_page() should not use page_address_in_vma()
  uprobes: Don't recheck vma/f_mapping in write_opcode()
  perf/x86: Fix missing struct before structure name
  perf/x86: Fix format definition of SNB-EP uncore QPI box
  perf/x86: Make bitfield unsigned
  perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
  perf/x86: Add Intel Nehalem-EX uncore support
  perf/x86: Fix typo in format definition of uncore PCU filter
  ...
2012-07-31 15:34:13 -07:00
Octavian Purdila
65fed8f6f2 resource: make sure requested range is included in the root range
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.

This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff).  In
this case __request_resource would return the whole root range as
conflict range (i.e.  0-ffffffff).  Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.

This patch aborts looking for an usable range when the request does not
intersect with the root range.  When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.

When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.

[    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[    5.975150] Modules linked in:
[    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[    5.985324] Call Trace:
[    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
[    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
[    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
[    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
[    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
[    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
[    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
[    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
[    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
[    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
[    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
[    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
[    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
[    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Alan Cox
25353b3377 taskstats: check nla_reserve() return
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621

Reported-by: <rucsoftsec@gmail.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Steven Rostedt
fd4b616b0f sysctl: suppress kmemleak messages
register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table.  This header is a
handle to the table that is created, and can be used to unregister the
table.  But if the table is permanent and never unregistered, the header
acts the same as a static variable.

Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak in thinking that we have leaked memory.  For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:

unreferenced object 0xffff880079fb9d40 (size 192):
  comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
    [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff8110b852>] __kmalloc+0x107/0x153
    [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
    [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
    [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
    [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
    [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
    [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
    [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
    [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
    [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
    [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
    [<ffffffffffffffff>] 0xffffffffffffffff

The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.

Use kmemleak_not_leak() to suppress the kmemleak false positive.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Vivek Goyal
63dca8d5b5 kdump: append newline to the last lien of vmcoreinfo note
The last line of vmcoreinfo note does not end with \n.  Parsing all the
lines in note becomes easier if all lines end with \n instead of trying to
special case the last line.

I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
assumption that all lines end with \n.  I think it is a good idea to fix
it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Akinobu Mita
f19b9f74b7 fork: fix error handling in dup_task()
The function dup_task() may fail at the following function calls in the
following order.

0) alloc_task_struct_node()
1) alloc_thread_info_node()
2) arch_dup_task_struct()

Error by 0) is not a matter, it can just return.  But error by 1) requires
releasing task_struct allocated by 0) before it returns.  Likewise, error
by 2) requires releasing task_struct and thread_info allocated by 0) and
1).

The existing error handling calls free_task_struct() and
free_thread_info() which do not only release task_struct and thread_info,
but also call architecture specific arch_release_task_struct() and
arch_release_thread_info().

The problem is that task_struct and thread_info are not fully initialized
yet at this point, but arch_release_task_struct() and
arch_release_thread_info() are called with them.

For example, x86 defines its own arch_release_task_struct() that releases
a task_xstate.  If alloc_thread_info_node() fails in dup_task(),
arch_release_task_struct() is called with task_struct which is just
allocated and filled with garbage in this error handling.

This actually happened with tools/testing/fault-injection/failcmd.sh

	# env FAILCMD_TYPE=fail_page_alloc \
		./tools/testing/fault-injection/failcmd.sh --times=100 \
		--min-order=0 --ignore-gfp-wait=0 \
		-- make -C tools/testing/selftests/ run_tests

In order to fix this issue, make free_{task_struct,thread_info}() not to
call arch_release_{task_struct,thread_info}() and call
arch_release_{task_struct,thread_info}() implicitly where needed.

Default arch_release_task_struct() and arch_release_thread_info() are
defined as empty by default.  So this change only affects the
architectures which implement their own arch_release_task_struct() or
arch_release_thread_info() as listed below.

arch_release_task_struct(): x86, sh
arch_release_thread_info(): mn10300, tile

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton
87bec58a52 revert "sched: Fix fork() error path to not crash"
To make way for "fork: fix error handling in dup_task()", which fixes the
errors more completely.

Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Huang Shijie
b2412b7fa7 fork: use vma_pages() to simplify the code
The current code can be replaced by vma_pages().  So use it to simplify
the code.

[akpm@linux-foundation.org: initialise `len' at its definition site]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Tetsuo Handa
0f20784d4b kmod: avoid deadlock from recursive kmod call
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggers
call_usermodehelper(UMH_WAIT_PROC) request.

This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".

The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).

  # : > /tmp/dummy
  # chmod 755 /tmp/dummy
  # echo /tmp/dummy > /proc/sys/kernel/hotplug
  # modprobe whatever

call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module.  do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).

In order to avoid deadlock, as a for-now and easy-to-backport solution, do
not try to call wait_for_completion() in call_usermodehelper_exec() if the
worker thread was created by khelper thread with CLONE_VFORK flag.  Future
and fundamental solution might be replacing singleton khelper thread with
some workqueue so that recursive calls up to max_active dependency loop
can be handled without deadlock.

[akpm@linux-foundation.org: add comment to kmod_thread_locker]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton
79c743dd1e kernel/kmod.c: document call_usermodehelper_fns() a bit
This function's interface is, uh, subtle.  Attempt to apologise for it.

Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Joe Perches
088a52aac8 printk: only look for prefix levels in kernel messages
vprintk_emit() prefix parsing should only be done for internal kernel
messages.  This allows existing behavior to be kept in all cases.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:14 -07:00
Joe Perches
acc8fa41ad printk: add generic functions to find KERN_<LEVEL> headers
The current form of a KERN_<LEVEL> is "<.>".

Add printk_get_level and printk_skip_level functions to handle these
formats.

These functions centralize tests of KERN_<LEVEL> so a future modification
can change the KERN_<LEVEL> style and shorten the number of bytes consumed
by these headers.

[akpm@linux-foundation.org: fix build error and warning]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Kay Sievers
cdf5344136 kmsg: /dev/kmsg - properly return possible copy_from_user() failure
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Andrew Morton
b57b44ae69 kernel/sys.c: avoid argv_free(NULL)
If argv_split() failed, the code will end up calling argv_free(NULL).  Fix
it up and clean things up a bit.

Addresses Coverity report 703573.

Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Sameer Nanda
45226e944c NMI watchdog: fix for lockup detector breakage on resume
On the suspend/resume path the boot CPU does not go though an
offline->online transition.  This breaks the NMI detector post-resume
since it depends on PMU state that is lost when the system gets
suspended.

Fix this by forcing a CPU offline->online transition for the lockup
detector on the boot CPU during resume.

To provide more context, we enable NMI watchdog on Chrome OS.  We have
seen several reports of systems freezing up completely which indicated
that the NMI watchdog was not firing for some reason.

Debugging further, we found a simple way of repro'ing system freezes --
issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.

With this patch in place, the system freeze result in panics, as
expected.

These panics provide a nice stack trace for us to debug the actual issue
causing the freeze.

[akpm@linux-foundation.org: fiddle with code comment]
[akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
[akpm@linux-foundation.org: fix section errors]
Signed-off-by: Sameer Nanda <snanda@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Vikram Mulukutla
190320c3b6 panic: fix a possible deadlock in panic()
panic_lock is meant to ensure that panic processing takes place only on
one cpu; if any of the other cpus encounter a panic, they will spin
waiting to be shut down.

However, this causes a regression in this scenario:

1. Cpu 0 encounters a panic and acquires the panic_lock
   and proceeds with the panic processing.
2. There is an interrupt on cpu 0 that also encounters
   an error condition and invokes panic.
3. This second invocation fails to acquire the panic_lock
   and enters the infinite while loop in panic_smp_self_stop.

Thus all panic processing is stopped, and the cpu is stuck for eternity
in the while(1) inside panic_smp_self_stop.

To address this, disable local interrupts with local_irq_disable before
acquiring the panic_lock.  This will prevent interrupt handlers from
executing during the panic processing, thus avoiding this particular
problem.

Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Kees Cook
54b501992d coredump: warn about unsafe suid_dumpable / core_pattern combo
When suid_dumpable=2, detect unsafe core_pattern settings and warn when
they are seen.

Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:11 -07:00
Sasikantha babu
f1fd75bfa0 prctl: remove redunant assignment of "error" to zero
Just setting the "error" to error number is enough on failure and It
doesn't require to set "error" variable to zero in each switch case,
since it was already initialized with zero.  And also removed return 0
in switch case with break statement

Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:11 -07:00
Oleg Nesterov
194f8dcbe9 uprobes: __replace_page() needs munlock_vma_page()
Like do_wp_page(), __replace_page() should do munlock_vma_page()
for the case when the old page still has other !VM_LOCKED
mappings. Unfortunately this needs mm/internal.h.

Also, move put_page() outside of ptl lock. This doesn't really
matter but looks a bit better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182249.GA20372@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:25 +02:00
Oleg Nesterov
57683f72b8 uprobes: Rename vma_address() and make it return "unsigned long"
1. vma_address() returns loff_t, this looks confusing and this
   is unnecessary after the previous change. Make it return "ulong",
   all callers truncate the result anyway.

2. Its name conflicts with mm/rmap.c:vma_address(), rename it to
   offset_to_vaddr(), this matches vaddr_to_offset().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182247.GA20365@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:25 +02:00
Oleg Nesterov
f4d6dfe551 uprobes: Fix register_for_each_vma()->vma_address() check
1. register_for_each_vma() checks that vma_address() == vaddr,
   but this is not enough. We should also ensure that
   vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only.

2. After the prevous changes, register_for_each_vma() is the
   only reason why vma_address() has to return loff_t, all other
   users know that we have the valid mapping at this offset and
   thus the overflow is not possible.

   Change the code to use vaddr_to_offset() instead, imho this looks
   more clean/understandable and now we can change vma_address().

3. While at it, remove the unnecessary type-cast.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182244.GA20362@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:24 +02:00
Oleg Nesterov
cb113b47d0 uprobes: Introduce vaddr_to_offset(vma, vaddr)
Add the new helper, vaddr_to_offset(vma, vaddr) which returns
the offset in vma->vm_file this vaddr is mapped at.

Change build_probe_list() and find_active_uprobe() to use the
new helper, the next patch adds another user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182242.GA20355@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:24 +02:00
Oleg Nesterov
891c397081 uprobes: Teach build_probe_list() to consider the range
Currently build_probe_list() builds the list of all uprobes
attached to the given inode, and the caller should filter out
those who don't fall into the [start,end) range, this is
sub-optimal.

This patch turns find_least_offset_node() into
find_node_in_range() which returns the first node inside the
[min,max] range, and changes build_probe_list() to use this node
as a starting point for rb_prev() and rb_next() to find all
other nodes the caller needs. The resulting list is no longer
sorted but we do not care.

This can speed up both build_probe_list() and the callers, but
there is another reason to introduce find_node_in_range(). It
can be used to figure out whether the given vma has uprobes or
not, this will be needed soon.

While at it, shift INIT_LIST_HEAD(tmp_list) into
build_probe_list().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182240.GA20352@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:23 +02:00
Oleg Nesterov
aefd8933d4 uprobes: Fix overflow in vma_address()/find_active_uprobe()
vma->vm_pgoff is "unsigned long", it should be promoted to
loff_t before the multiplication to avoid the overflow.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182233.GA20339@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:21 +02:00
Oleg Nesterov
2fd611a991 uprobes: Suppress uprobe_munmap() from mmput()
uprobe_munmap() does get_user_pages() and it is also called from
the final mmput()->exit_mmap() path. This slows down
exit/mmput() for no reason, and I think  it is simply
dangerous/wrong to try to fault-in a page into the dying mm. If
nothing else, this happens after the last sync_mm_rss(), afaics
handle_mm_fault() can change the task->rss_stat and make the
subsequent check_mm() unhappy.

Change uprobe_munmap() to check mm->mm_users != 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182231.GA20336@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:21 +02:00
Oleg Nesterov
665605a2a2 uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the
usage of uprobe->pending_list").

Yes, we do not care about uprobe->pending_list after return and
nobody can remove the current list entry, but put_uprobe(uprobe)
can actually free it and thus we need list_for_each_safe().

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120729182229.GA20329@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:20 +02:00
Oleg Nesterov
9f92448cee uprobes: Clean up and document write_opcode()->lock_page(old_page)
The comment above write_opcode()->lock_page(old_page) tells
about the race with do_wp_page(). I don't really understand
which exactly race it means, but afaics this lock_page() was not
enough to close all races with do_wp_page().

Anyway, since:

   77fc4af1b5 uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing

this code is always called with ->mmap_sem held for writing,
so we can forget about do_wp_page().

However, we can't simply remove this lock_page(), and the only
(afaics) reason is __replace_page()->try_to_free_swap().

Nothing in write_opcode() needs it, move it into
__replace_page() and fix the comment.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182220.GA20322@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:20 +02:00
Oleg Nesterov
089ba999dc uprobes: Kill write_opcode()->lock_page(new_page)
write_opcode() does lock_page(new_page) for no reason. Nobody
can see this page until __replace_page() exposes it under ptl
lock, and we do nothing with this page after pte_unmap_unlock().

If nothing else, the similar code in do_wp_page() doesn't lock
the new page for page_add_new_anon_rmap/set_pte_at_notify.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182218.GA20315@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:19 +02:00
Oleg Nesterov
c517ee744b uprobes: __replace_page() should not use page_address_in_vma()
page_address_in_vma(old_page) in __replace_page() is ugly and
wrong. The caller already knows the correct virtual address,
this page was found by get_user_pages(vaddr).

However, page_address_in_vma() can actually fail if
page->mapping was cleared by __delete_from_page_cache() after
get_user_pages() returns. But this means the race with page
reclaim, write_opcode() should not fail, it should retry and
read this page again. Probably the race with remove_mapping() is
not possible due to page_freeze_refs() logic, but afaics at
least shmem_writepage()->shmem_delete_from_page_cache() can
clear ->mapping.

We could change __replace_page() to return -EAGAIN in this case,
but it would be better to simply use the caller's vaddr and rely
on page_check_address().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182216.GA20311@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:19 +02:00
Oleg Nesterov
f403072c61 uprobes: Don't recheck vma/f_mapping in write_opcode()
write_opcode() rechecks valid_vma() and ->f_mapping, this is
pointless. The caller, register_for_each_vma() or uprobe_mmap(),
has already done these checks under mmap_sem.

To clarify, uprobe_mmap() checks valid_vma() only, but we can
rely on build_probe_list(vm_file->f_mapping->host).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120729182212.GA20304@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 11:27:18 +02:00
Josh Boyer
8ded2bbc18 posix_types.h: Cleanup stale __NFDBITS and related definitions
Recently, glibc made a change to suppress sign-conversion warnings in
FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
kernel's definition of __NFDBITS if applications #include
<linux/types.h> after including <sys/select.h>.  A build failure would
be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
flags to gcc.

It was suggested that the kernel should either match the glibc
definition of __NFDBITS or remove that entirely.  The current in-kernel
uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
uses of the related __FDELT and __FDMASK defines.  Given that, we'll
continue the cleanup that was started with commit 8b3d1cda4f
("posix_types: Remove fd_set macros") and drop the remaining unused
macros.

Additionally, linux/time.h has similar macros defined that expand to
nothing so we'll remove those at the same time.

Reported-by: Jeff Law <law@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
[ .. and fix up whitespace as per akpm ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-26 13:36:43 -07:00
Linus Torvalds
79071638ce Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The biggest change is a performance improvement on SMP systems:

  | 4 socket 40 core + SMT Westmere box, single 30 sec tbench
  | runs, higher is better:
  |
  | clients     1       2       4        8       16       32       64      128
  |..........................................................................
  | pre        30      41     118      645     3769     6214    12233    14312
  | post      299     603    1211     2418     4697     6847    11606    14557
  |
  | A nice increase in performance.

  which speedup is particularly noticeable on heavily interacting
  few-tasks workloads, so the changes should help desktop-style Xorg
  workloads and interactivity as well, on multi-core CPUs.

  There are also cpuset suspend behavior fixes/restructuring and various
  smaller tweaks."

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix race in task_group()
  sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
  sched: Reset loop counters if all tasks are pinned and we need to redo load balance
  sched: Reorder 'struct lb_env' members to reduce its size
  sched: Improve scalability via 'CPU buddies', which withstand random perturbations
  cpusets: Remove/update outdated comments
  cpusets, hotplug: Restructure functions that are invoked during hotplug
  cpusets, hotplug: Implement cpuset tree traversal in a helper function
  CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
  sched/x86: Remove broken power estimation
2012-07-26 13:08:01 -07:00
Linus Torvalds
fa93669a19 Driver core merge for 3.6-rc1
Here's the big driver core pull request for 3.6-rc1.
 
 Unlike 3.5, this kernel should be a lot tamer, with the printk changes now
 settled down.  All we have here is some extcon driver updates, w1 driver
 updates, a few printk cleanups that weren't needed for 3.5, but are good to
 have now, and some other minor fixes/changes in the driver core.
 
 All of these have been in the linux-next releases for a while now.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAlARgIUACgkQMUfUDdst+ynDHgCfRNwIB9L+zZvjcKE5e1BhDbUl
 wVUAn398DFgbJ1+PjGkd1EMR2uVTh7Ou
 =MIFu
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core changes from Greg Kroah-Hartman:
 "Here's the big driver core pull request for 3.6-rc1.

  Unlike 3.5, this kernel should be a lot tamer, with the printk changes
  now settled down.  All we have here is some extcon driver updates, w1
  driver updates, a few printk cleanups that weren't needed for 3.5, but
  are good to have now, and some other minor fixes/changes in the driver
  core.

  All of these have been in the linux-next releases for a while now.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
  printk: Export struct log size and member offsets through vmcoreinfo
  Drivers: hv: Change the hex constant to a decimal constant
  driver core: don't trigger uevent after failure
  extcon: MAX77693: Add extcon-max77693 driver to support Maxim MAX77693 MUIC device
  sysfs: fail dentry revalidation after namespace change fix
  sysfs: fail dentry revalidation after namespace change
  extcon: spelling of detach in function doc
  extcon: arizona: Stop microphone detection if we give up on it
  extcon: arizona: Update cable reporting calls and split headset
  PM / Runtime: Do not increment device usage counts before probing
  kmsg - do not flush partial lines when the console is busy
  kmsg - export "continuation record" flag to /dev/kmsg
  kmsg - avoid warning for CONFIG_PRINTK=n compilations
  kmsg - properly print over-long continuation lines
  driver-core: Use kobj_to_dev instead of re-implementing it
  driver-core: Move kobj_to_dev from genhd.h to device.h
  driver core: Move deferred devices to the end of dpm_list before probing
  driver core: move uevent call to driver_register
  driver core: fix shutdown races with probe/remove(v3)
  Extcon: Arizona: Add driver for Wolfson Arizona class devices
  ...
2012-07-26 11:25:33 -07:00
Linus Torvalds
b13bc8dda8 Staging tree patches for 3.6-rc1
Here's the big staging tree merge for the 3.6-rc1 merge window.
 
 There are some patches in here outside of drivers/staging/, notibly the iio
 code (which is still stradeling the staging / not staging boundry), the pstore
 code, and the tracing code.  All of these have gotten ackes from the various
 subsystem maintainers to be included in this tree.  The pstore and tracing
 patches are related, and are coming here as they replace one of the android
 staging drivers.
 
 Otherwise, the normal staging mess.  Lots of cleanups and a few new drivers
 (some iio drivers, and the large csr wireless driver abomination.)
 
 Note, you will get a merge issue with the following files:
 	drivers/staging/comedi/drivers/s626.h
 	drivers/staging/gdm72xx/netlink_k.c
 both of which should be trivial for you to handle.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAlAQiD8ACgkQMUfUDdst+ykxhgCeMUjvc+1RTtSprzvkzpejgoUU
 6A4AnAleWMnkaCD8vruGnRdGl/Qtz51+
 =mN6M
 -----END PGP SIGNATURE-----

Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging tree patches from Greg Kroah-Hartman:
 "Here's the big staging tree merge for the 3.6-rc1 merge window.

  There are some patches in here outside of drivers/staging/, notibly
  the iio code (which is still stradeling the staging / not staging
  boundry), the pstore code, and the tracing code.  All of these have
  gotten acks from the various subsystem maintainers to be included in
  this tree.  The pstore and tracing patches are related, and are coming
  here as they replace one of the android staging drivers.

  Otherwise, the normal staging mess.  Lots of cleanups and a few new
  drivers (some iio drivers, and the large csr wireless driver
  abomination.)

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
drivers/staging/gdm72xx/netlink_k.c

* tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
  staging: csr: delete a bunch of unused library functions
  staging: csr: remove csr_utf16.c
  staging: csr: remove csr_pmem.h
  staging: csr: remove CsrPmemAlloc
  staging: csr: remove CsrPmemFree()
  staging: csr: remove CsrMemAllocDma()
  staging: csr: remove CsrMemCalloc()
  staging: csr: remove CsrMemAlloc()
  staging: csr: remove CsrMemFree() and CsrMemFreeDma()
  staging: csr: remove csr_util.h
  staging: csr: remove CsrOffSetOf()
  stating: csr: remove unneeded #includes in csr_util.c
  staging: csr: make CsrUInt16ToHex static
  staging: csr: remove CsrMemCpy()
  staging: csr: remove CsrStrLen()
  staging: csr: remove CsrVsnprintf()
  staging: csr: remove CsrStrDup
  staging: csr: remove CsrStrChr()
  staging: csr: remove CsrStrNCmp
  staging: csr: remove CsrStrCmp
  ...
2012-07-26 11:14:49 -07:00
Andrew Vagin
895dd92c03 sched: Deliver sched_switch events to the current task
Otherwise they can't be filtered for a defined task:

  perf record -e sched:sched_switch ./foo

This command doesn't report any events without this patch.

I think it isn't a security concern if someone knows who will
be executed next - this can already be observed by polling /proc
state. By default perf is disabled for non-root users in any case.

I need these events for profiling sleep times.  sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-26 12:23:10 +02:00
Linus Torvalds
bdc0077af5 SCSI misc on 20120724
The most important feature of this patch set is the new async infrastructure
 that makes sure async_synchronize_full() synchronizes all domains and allows
 us to remove all the hacks (like having scsi_complete_async_scans() in the
 device base code) and means that the async infrastructure will "just work" in
 future. The rest is assorted driver updates (aacraid, bnx2fc, virto-scsi,
 megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure work in
 sas and FC.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJQDjDCAAoJEDeqqVYsXL0M/sMH/jVgBfF1mjR+DQuTscKyD21w
 0BQLn5OmvDZDqo44iqQzNRObw7CxkBkUtHoozsknLijw+KggER653ZOAtUdIHfI/
 /uo7iJQ3J3D/Ezm99HYSpZiF2juZwsBRtFBoKkGqOpMlzFUx5o4hUbH5OcINxnHR
 VmvJU5K1kg8D77Q6zK+Atl14/Rfibc2IoufFmbYdplUAM/tV0BpBSSHJAJvqua76
 NGMl4KJcPZnXe/4LXcxZia5A2efdFFEzaQ2mM9rUVEAgHDAxc0Zg9IoDhGd08FX4
 G55NK+6+bKb9s7bgyva0T/iy817TRCzjteeYNFrb8nBRe7aQbAivaBHQFXIyvdQ=
 =y2sh
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull first round of SCSI updates from James Bottomley:
 "The most important feature of this patch set is the new async
  infrastructure that makes sure async_synchronize_full() synchronizes
  all domains and allows us to remove all the hacks (like having
  scsi_complete_async_scans() in the device base code) and means that
  the async infrastructure will "just work" in future.

  The rest is assorted driver updates (aacraid, bnx2fc, virto-scsi,
  megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure
  work in sas and FC.

  Signed-off-by: James Bottomley <JBottomley@Parallels.com>"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
  [SCSI] Revert "[SCSI] fix async probe regression"
  [SCSI] cleanup usages of scsi_complete_async_scans
  [SCSI] queue async scan work to an async_schedule domain
  [SCSI] async: make async_synchronize_full() flush all work regardless of domain
  [SCSI] async: introduce 'async_domain' type
  [SCSI] bfa: Fix to set correct return error codes and misc cleanup.
  [SCSI] aacraid: Series 7 Async. (performance) mode support
  [SCSI] aha152x: Allow use on 64bit systems
  [SCSI] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
  [SCSI] bfa: squelch lockdep complaint with a spin_lock_init
  [SCSI] qla2xxx: remove unnecessary reads of PCI_CAP_ID_EXP
  [SCSI] qla4xxx: remove unnecessary read of PCI_CAP_ID_EXP
  [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
  [SCSI] ufs: reverse the ufshcd_is_device_present logic
  [SCSI] ufs: use module_pci_driver
  [SCSI] usb-storage: update usb devices for write cache quirk in quirk list.
  [SCSI] usb-storage: add support for write cache quirk
  [SCSI] set to WCE if usb cache quirk is present.
  [SCSI] virtio-scsi: hotplug support for virtio-scsi
  [SCSI] virtio-scsi: split scatterlist per target
  ...
2012-07-24 18:11:22 -07:00
Linus Torvalds
614a6d4341 Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Nothing too interesting.  A minor bug fix and some cleanups."

* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Update remount documentation
  cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
  cgroup: Remove populate() documentation
  cgroup: remove hierarchy_mutex
2012-07-24 17:47:44 -07:00
Linus Torvalds
a08489c569 Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "There are three major changes.

   - WQ_HIGHPRI has been reimplemented so that high priority work items
     are served by worker threads with -20 nice value from dedicated
     highpri worker pools.

   - CPU hotplug support has been reimplemented such that idle workers
     are kept across CPU hotplug events.  This makes CPU hotplug cheaper
     (for PM) and makes the code simpler.

   - flush_kthread_work() has been reimplemented so that a work item can
     be freed while executing.  This removes an annoying behavior
     difference between kthread_worker and workqueue."

* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix spurious CPU locality WARN from process_one_work()
  kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
  kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
  workqueue: simplify CPU hotplug code
  workqueue: remove CPU offline trustee
  workqueue: don't butcher idle workers on an offline CPU
  workqueue: reimplement CPU online rebinding to handle idle workers
  workqueue: drop @bind from create_worker()
  workqueue: use mutex for global_cwq manager exclusion
  workqueue: ROGUE workers are UNBOUND workers
  workqueue: drop CPU_DYING notifier operation
  workqueue: perform cpu down operations from low priority cpu_notifier()
  workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
  workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
  workqueue: separate out worker_pool flags
  workqueue: use @pool instead of @gcwq or @cpu where applicable
  workqueue: factor out worker_pool from global_cwq
  workqueue: don't use WQ_HIGHPRI for unbound workqueues
2012-07-24 17:46:16 -07:00
Linus Torvalds
6dd53aa456 PCI changes for the 3.6 merge window:
Host bridge hotplug
     - Add MMCONFIG support for hot-added host bridges (Jiang Liu)
   Device hotplug
     - Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
     - Call FINAL fixups for hot-added devices, too (Myron Stowe)
     - Factor out generic code for P2P bridge hot-add (Yinghai Lu)
     - Remove all functions in a slot, not just those with _EJx (Amos Kong)
   Dynamic resource management
     - Track bus number allocation (struct resource tree per domain) (Yinghai Lu)
     - Make P2P bridge 1K I/O windows work with resource reassignment (Bjorn Helgaas, Yinghai Lu)
     - Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
   Power management
     - Add PCIe runtime D3cold support (Huang Ying)
   Virtualization
     - Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex Williamson)
     - Add quirks for devices with broken INTx masking (Jan Kiszka)
   Miscellaneous
     - Fix some PCI Express capability version issues (Myron Stowe)
     - Factor out some arch code with a weak, generic, pcibios_setup() (Myron Stowe)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQBy+9AAoJEPGMOI97Hn6zOpQP+wVFvA7pcteFj6HPs5nTq2Hc
 55oeRqCO0wBHoFMCKB0AjeTATjqxi9OhcjaiVrZejxNyWKC9MnrXuunpQ0l/hCbR
 M/TK+BCelfX2FU4eXNf+TBCCcOhOVWqQft9Gm6nYKwX8Y0msRVCceI4WwhZgSwtI
 vdtmnqlwolscdnq+8ThsnvUMtwkN0gExmn2FJRl6EoEgG0DTqhMkZ83uA+NPBhvv
 I+g0XbA6haaZph2nnSYR0hIW4Q7JkT/LgA6uVAQxamctwxLol7xxsjCRnfqrulkf
 kaRr2fAgBXfmaOIltro4UkXrCM52ZSyggCDfExHp6mWGPKMjE5ZcyK1YbGfmmumk
 DS3t1S0eBdDJXrnf9l/Yb8e95dQxRCYKelKzr1rTD9QAXsInE8rC40hvhfFaTa4s
 nZYRTz0SKv6coQihqaOR7shx1DNomLFk7jndaWEElfl9/cT/nQnZ8XLfVMzkJNNB
 Y4SM6zkiIaCL0aiSEE16MqVjmODYRjbURLYzQIrqr2KJQg8X6XjIRojQLjL6xEgA
 22ry2ZRPhqO68g7aLqvixiSDaTp0Z0Vw+JmgjtBqvkokwZcGQtm4umkpAdOi+Es8
 3bJaMY7ZUpDX53FE8iyP6AnmR/1k19rC1gNnNq/syWyjtYOYJ9i3QCTafFgvE1VC
 5coQ1L5tByHvpzK5PHwf
 =oo/A
 -----END PGP SIGNATURE-----

Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug:
    - Add MMCONFIG support for hot-added host bridges (Jiang Liu)
  Device hotplug:
    - Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
    - Call FINAL fixups for hot-added devices, too (Myron Stowe)
    - Factor out generic code for P2P bridge hot-add (Yinghai Lu)
    - Remove all functions in a slot, not just those with _EJx (Amos
      Kong)
  Dynamic resource management:
    - Track bus number allocation (struct resource tree per domain)
      (Yinghai Lu)
    - Make P2P bridge 1K I/O windows work with resource reassignment
      (Bjorn Helgaas, Yinghai Lu)
    - Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
  Power management:
    - Add PCIe runtime D3cold support (Huang Ying)
  Virtualization:
    - Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
      Williamson)
    - Add quirks for devices with broken INTx masking (Jan Kiszka)
  Miscellaneous:
    - Fix some PCI Express capability version issues (Myron Stowe)
    - Factor out some arch code with a weak, generic, pcibios_setup()
      (Myron Stowe)"

* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
  PCI: hotplug: ensure a consistent return value in error case
  PCI: fix undefined reference to 'pci_fixup_final_inited'
  PCI: build resource code for M68K architecture
  PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
  PCI: reorder __pci_assign_resource() (no change)
  PCI: fix truncation of resource size to 32 bits
  PCI: acpiphp: merge acpiphp_debug and debug
  PCI: acpiphp: remove unused res_lock
  sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
  PCI: call final fixups hot-added devices
  PCI: move final fixups from __init to __devinit
  x86/PCI: move final fixups from __init to __devinit
  MIPS/PCI: move final fixups from __init to __devinit
  PCI: support sizing P2P bridge I/O windows with 1K granularity
  PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
  PCI: disable MEM decoding while updating 64-bit MEM BARs
  PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
  PCI: never discard enable/suspend/resume_early/resume fixups
  PCI: release temporary reference in __nv_msi_ht_cap_quirk()
  PCI: restructure 'pci_do_fixups()'
  ...
2012-07-24 16:17:07 -07:00
Linus Torvalds
f14121ab35 Devicetree updates for 3.6
A small set of changes for devicetree:
 - Couple of Documentation fixes
 - Addition of new helper function of_node_full_name
 - Improve of_parse_phandle_with_args return values
 - Some NULL related sparse fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJQDwsgAAoJEMhvYp4jgsXiuwUH/Ri6ZSnqHcz4Wa/X4FxvNc3I
 3Xelo/Vt3WLYue3s/+OYiM5FK9+KH8T6x+U79Q4p7vePcfUh6GJII0AUbMeRghkS
 m3FjNd5syzYNJlnDnqdngQYRDpaz8U/SyftjXyMPjJ1VWiyLx/EJQUkj1EEwDLe/
 ZVabppnco3Y6OJpFuETONNvXx5mE7xq86isW5+aYmviMkWSMMwJPf8qofLJ78Dh5
 OAhWuCPRDooz548+Wkabt90qHjF6FU43w5fU7zZW26NT39ptppcbZ2bAXcTYqIIq
 sATp5YSitvwFqO2c1mA/drZ9nrgxDPCaw3qCDyiMdcbWgXqDirz2x7q1iauVHF4=
 =5TZ/
 -----END PGP SIGNATURE-----

Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux

Pull devicetree updates from Rob Herring:
 "A small set of changes for devicetree:
   - Couple of Documentation fixes
   - Addition of new helper function of_node_full_name
   - Improve of_parse_phandle_with_args return values
   - Some NULL related sparse fixes"

Grant's busy packing.

* tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux:
  of: mtd: nuke useless const qualifier
  devicetree: add helper inline for retrieving a node's full name
  of: return -ENOENT when no property
  usage-model.txt: fix typo machine_init->init_machine
  of: Fix null pointer related warnings in base.c file
  LED: Fix missing semicolon in OF documentation
  of: fix a few typos in the binding documentation
2012-07-24 14:07:22 -07:00
Linus Torvalds
3c4cfadef6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David S Miller:

 1) Remove the ipv4 routing cache.  Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache.  Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making.  Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought the be simple to fix at this
    point.  Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

 3) TCP SYN/ACK performance tweaks from Eric Dumazet.

 4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

 7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

 8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer.  Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket.  The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too.  From Eric
    Dumazet.

15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
  genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
  r8169: revert "add byte queue limit support".
  ipv4: Change rt->rt_iif encoding.
  net: Make skb->skb_iif always track skb->dev
  ipv4: Prepare for change of rt->rt_iif encoding.
  ipv4: Remove all RTCF_DIRECTSRC handliing.
  ipv4: Really ignore ICMP address requests/replies.
  decnet: Don't set RTCF_DIRECTSRC.
  net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
  ipv4: Remove redundant assignment
  rds: set correct msg_namelen
  openvswitch: potential NULL deref in sample()
  tcp: dont drop MTU reduction indications
  bnx2x: Add new 57840 device IDs
  tcp: avoid oops in tcp_metrics and reset tcpm_stamp
  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
  niu: Fix to check for dma mapping errors.
  net: Fix references to out-of-scope variables in put_cmsg_compat()
  net: ethernet: davinci_emac: add pm_runtime support
  net: ethernet: davinci_emac: Remove unnecessary #include
  ...
2012-07-24 10:01:50 -07:00
Peter Zijlstra
8323f26ce3 sched: Fix race in task_group()
Stefan reported a crash on a kernel before a3e5d1091c ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.

Looking at all that I found a lack of serialization and plain
wrong comments.

The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. Its not pretty,
but I can't really see another way given how all the cgroup
stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:58:20 +02:00
Srivatsa Vaddagiri
88b8dac0a1 sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

          Core0            Core1
         /     \          /     \
	C0     C1	 C2     C3
        |      |         |      |
        v      v         v      v
	F0     T1        F1     [idle]
			 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:58:06 +02:00
Prashanth Nageshappa
bbf18b1949 sched: Reset loop counters if all tasks are pinned and we need to redo load balance
While load balancing, if all tasks on the source runqueue are pinned,
we retry after excluding the corresponding source cpu. However, loop counters
env.loop and env.loop_break are not reset before retrying, which can lead
to failure in moving the tasks. In this patch we reset env.loop and
env.loop_break to their inital values before we retry.

Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:55:37 +02:00
Prashanth Nageshappa
85c1e7dae1 sched: Reorder 'struct lb_env' members to reduce its size
Members of 'struct lb_env' are not in appropriate order to reuse compiler
added padding on 64bit architectures. In this patch we reorder those struct
members and help reduce the size of the structure from 96 bytes to 80
bytes on 64 bit architectures.

Suggested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:55:20 +02:00
Mike Galbraith
970e178985 sched: Improve scalability via 'CPU buddies', which withstand random perturbations
Traversing an entire package is not only expensive, it also leads to tasks
bouncing all over a partially idle and possible quite large package.  Fix
that up by assigning a 'buddy' CPU to try to motivate.  Each buddy may try
to motivate that one other CPU, if it's busy, tough, it may then try its
SMT sibling, but that's all this optimization is allowed to cost.

Sibling cache buddies are cross-wired to prevent bouncing.

4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

 clients     1       2       4        8       16       32       64      128
 ..........................................................................
 pre        30      41     118      645     3769     6214    12233    14312
 post      299     603    1211     2418     4697     6847    11606    14557

A nice increase in performance.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:34 +02:00
Srivatsa S. Bhat
a1cd2b13f7 cpusets: Remove/update outdated comments
cpuset_track_online_cpus() is no longer present. So remove the
outdated comment and replace it with reference to cpuset_update_active_cpus()
which is its equivalent.

Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
out how it is dealt with. So update that comment as well.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:28 +02:00
Srivatsa S. Bhat
7ddf96b02f cpusets, hotplug: Restructure functions that are invoked during hotplug
Separate out the cpuset related handling for CPU/Memory online/offline.
This also helps us exploit the most obvious and basic level of optimization
that any notification mechanism (CPU/Mem online/offline) has to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!".

And while at it, rename scan_for_empty_cpusets() to
scan_cpusets_upon_hotplug(), which is more appropriate considering how
it is restructured.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:22 +02:00
Srivatsa S. Bhat
80d1fa6463 cpusets, hotplug: Implement cpuset tree traversal in a helper function
At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. And this takes a toll on optimization as well. For example,
the function cpuset_update_active_cpus() is called on both CPU offline and CPU
online events; and it invokes scan_for_empty_cpusets(), which makes sense
only for CPU offline events. And hence, the current code ends up unnecessarily
traversing the cpuset tree during CPU online also.

As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:18 +02:00
Srivatsa S. Bhat
d35be8bab9 CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (ie., restores the cpusets to exactly what it was before suspend, by
not touching it at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:14 +02:00