14689 Commits

Author SHA1 Message Date
Anton Vorontsov
717a9ef7f3 tracing: Remove unneeded checks from the stack tracer
It seems that 'ftrace_enabled' flag should not be used inside the tracer
functions. The ftrace core is using this flag for internal purposes, and
the flag wasn't meant to be used in tracers' runtime checks.

stack tracer is the only tracer that abusing the flag. So stop it from
serving as a bad example.

Also, there is a local 'stack_trace_disabled' flag in the stack tracer,
which is never updated; so it can be removed as well.

Link: http://lkml.kernel.org/r/1342637761-9655-1-git-send-email-anton.vorontsov@linaro.org

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-19 15:07:13 -05:00
Tejun Heo
0a950f65e1 cgroup: add cgroup->id
With the introduction of generic cgroup hierarchy iterators, css_id is
being phased out.  It was unnecessarily complex, id'ing the wrong
thing (cgroups need IDs, not CSSes) and has other oddities like not
being available at ->css_alloc().

This patch adds cgroup->id, which is a simple per-hierarchy
ida-allocated ID which is assigned before ->css_alloc() and released
after ->css_free().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2012-11-19 09:02:12 -08:00
Tejun Heo
033fa1c5f5 cgroup, cpuset: remove cgroup_subsys->post_clone()
Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone().  Now
that clone_children is cpuset specific, there's no reason to have this
rather odd option activation mechanism in cgroup core.  cpuset can
check the flag from its ->css_allocate() and take the necessary
action.

Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
remove cgroup_subsys->post_clone().

Loosely based on Glauber's "generalize post_clone into post_create"
patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Glauber Costa <glommer@parallels.com>
Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:39 -08:00
Tejun Heo
2260e7fc1f cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
clone_children is only meaningful for cpuset and will stay that way.
Rename the flag to reflect that and update documentation.  Also, drop
clone_children() wrapper in cgroup.c.  The thin wrapper is used only a
few times and one of them will go away soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:38 -08:00
Tejun Heo
92fb97487a cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are.  Also, update documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:38 -08:00
Tejun Heo
b1929db42f cgroup: allow ->post_create() to fail
There could be cases where controllers want to do initialization
operations which may fail from ->post_create().  This patch makes
->post_create() return -errno to indicate failure and online_css()
relay such failures.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:38 -08:00
Tejun Heo
4b8b47eb00 cgroup: update cgroup_create() failure path
cgroup_create() was ignoring failure of cgroupfs files.  Update it
such that, if file creation fails, it rolls back by calling
cgroup_destroy_locked() and returns failure.

Note that error out goto labels are renamed.  The labels are a bit
confusing but will become better w/ later cgroup operation renames.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:38 -08:00
Tejun Heo
b8a2df6a5b cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory
All cgroup directory i_mutexes nest outside cgroup_mutex; however, new
directory creation is a special case.  A new cgroup directory is
created while holding cgroup_mutex.  Populating the new directory
requires both the new directory's i_mutex and cgroup_mutex.  Because
all directory i_mutexes nest outside cgroup_mutex, grabbing both
requires releasing cgroup_mutex first, which isn't a good idea as the
new cgroup isn't yet ready to be manipulated by other cgroup
opreations.

This is worked around by grabbing the new directory's i_mutex while
holding cgroup_mutex before making it visible.  As there's no other
user at that point, grabbing the i_mutex under cgroup_mutex can't lead
to deadlock.

cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to
worry about the reverse locking order; however, this creates pseudo
locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true -
all directory i_mutexes are still nested outside cgroup_mutex.  This
pseudo locking dependency can lead to spurious lockdep warnings.

Use mutex_trylock() instead.  This will always succeed and lockdep
doesn't create any locking dependency for it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo
d19e19de48 cgroup: simplify cgroup_load_subsys() failure path
Now that cgroup_unload_subsys() can tell whether the root css is
online or not, we can safely call cgroup_unload_subsys() after idr
init failure in cgroup_load_subsys().

Replace the manual unrolling and invoke cgroup_unload_subsys() on
failure.  This drops cgroup_mutex inbetween but should be safe as the
subsystem will fail try_module_get() and thus can't be mounted
inbetween.  As this means that cgroup_unload_subsys() can be called
before css_sets are rehashed, remove BUG_ON() on %NULL
css_set->subsys[] from cgroup_unload_subsys().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo
a31f2d3ff7 cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers
New helpers on/offline_css() respectively wrap ->post_create() and
->pre_destroy() invocations.  online_css() sets CSS_ONLINE after
->post_create() is complete and offline_css() invokes ->pre_destroy()
iff CSS_ONLINE is set and clears it while also handling the temporary
dropping of cgroup_mutex.

This patch doesn't introduce any behavior change at the moment but
will be used to improve cgroup_create() failure path and allow
->post_create() to fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo
42809dd422 cgroup: separate out cgroup_destroy_locked()
Separate out cgroup_destroy_locked() from cgroup_destroy().  This will
be later used in cgroup_create() failure path.

While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move
@d and @parent assignments to their declarations.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo
02ae7486d0 cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys()
* If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[]
  before calilng ->destroy() making CSS inaccessible to the callback,
  and didn't unlink ss->sibling.  As no modular controller uses
  ->use_id, this doesn't cause any actual problems.

* cgroup_unload_subsys() was forgetting to free idr, call
  ->pre_destroy() and clear ->active.  As there currently is no
  modular controller which uses ->use_id, ->pre_destroy() or ->active,
  this doesn't cause any actual problems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo
648bb56d07 cgroup: lock cgroup_mutex in cgroup_init_subsys()
Make cgroup_init_subsys() grab cgroup_mutex while initializing a
subsystem so that all helpers and callbacks are called under the
context they expect.  This isn't strictly necessary as
cgroup_init_subsys() doesn't race with anybody but will allow adding
lockdep assertions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
b48c6a80a0 cgroup: trivial cleanup for cgroup_init/load_subsys()
Consistently use @css and @dummytop in these two functions instead of
referring to them indirectly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
38b53abaa3 cgroup: make CSS_* flags bit masks instead of bit positions
Currently, CSS_* flags are defined as bit positions and manipulated
using atomic bitops.  There's no reason to use atomic bitops for them
and bit positions are clunkier to deal with than bit masks.  Make
CSS_* bit masks instead and use the usual C bitwise operators to
access them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
febfcef60d cgroup: cgroup->dentry isn't a RCU pointer
cgroup->dentry is marked and used as a RCU pointer; however, it isn't
one - the final dentry put doesn't go through call_rcu().  cgroup and
dentry share the same RCU freeing rule via synchronize_rcu() in
cgroup_diput() (kfree_rcu() used on cgrp is unnecessary).  If cgrp is
accessible under RCU read lock, so is its dentry and dereferencing
cgrp->dentry doesn't need any further RCU protection or annotation.

While not being accurate, before the previous patch, the RCU accessors
served a purpose as memory barriers - cgroup->dentry used to be
assigned after the cgroup was made visible to cgroup_path(), so the
assignment and dereferencing in cgroup_path() needed the memory
barrier pair.  Now that list_add_tail_rcu() happens after
cgroup->dentry is assigned, this no longer is necessary.

Remove the now unnecessary and misleading RCU annotations from
cgroup->dentry.  To make up for the removal of rcu_dereference_check()
in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
the dereference rule of @cgrp, not cgrp->dentry.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
4e139afc22 cgroup: create directory before linking while creating a new cgroup
While creating a new cgroup, cgroup_create() links the newly allocated
cgroup into various places before trying to create its directory.
Because cgroup life-cycle is tied to the vfs objects, this makes it
impossible to use cgroup_rmdir() for rolling back creation - the
removal logic depends on having full vfs objects.

This patch moves directory creation above linking and collect linking
operations to one place.  This allows directory creation failure to
share error exit path with css allocation failures and any failure
sites afterwards (to be added later) can use cgroup_rmdir() logic to
undo creation.

Note that this also makes the memory barriers around cgroup->dentry,
which currently is misleadingly using RCU operations, unnecessary.
This will be handled in the next patch.

While at it, locking BUG_ON() on i_mutex is converted to
lockdep_assert_held().

v2: Patch originally removed %NULL dentry check in cgroup_path();
    however, Li pointed out that this patch doesn't make it
    unnecessary as ->create() may call cgroup_path().  Drop the
    change for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
28fd6f30ac cgroup: open-code cgroup_create_dir()
The operation order of cgroup creation is about to change and
cgroup_create_dir() is more of a hindrance than a proper abstraction.
Open-code it by moving the parent nlink adjustment next to self nlink
adjustment in cgroup_create_file() and the rest to cgroup_create().

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo
2243076ad1 cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping()
Not strictly necessary but it's annoying to have uninitialized
list_head around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:35 -08:00
Tejun Heo
175431635e cgroup: remove incorrect dget/dput() pair in cgroup_create_dir()
cgroup_create_dir() does weird dancing with dentry refcnt.  On
success, it gets and then puts it achieving nothing.  On failure, it
puts but there isn't no matching get anywhere leading to the following
oops if cgroup_create_file() fails for whatever reason.

  ------------[ cut here ]------------
  kernel BUG at /work/os/work/fs/dcache.c:552!
  invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in:
  CPU 2
  Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
  RIP: 0010:[<ffffffff811d9c0c>]  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
  RSP: 0018:ffff88001a3ebef8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
  RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
  RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
  R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
  R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
  FS:  00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
  Stack:
   ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
   ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
   ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
  Call Trace:
   [<ffffffff811cc889>] done_path_create+0x19/0x50
   [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80
   [<ffffffff811d2009>] sys_mkdir+0x19/0x20
   [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b
  Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
  RIP  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
   RSP <ffff88001a3ebef8>
  ---[ end trace 1277bcfd9561ddb0 ]---

Fix it by dropping the unnecessary dget/dput() pair.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2012-11-19 08:13:35 -08:00
Frederic Weisbecker
1017769bd0 vtime: No need to disable irqs on vtime_account()
vtime_account() is only called from irq entry. irqs
are always disabled at this point so we can safely
remove the irq disabling guards on that function.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 16:41:41 +01:00
Frederic Weisbecker
e3942ba040 vtime: Consolidate a bit the ctx switch code
On ia64 and powerpc, vtime context switch only consists
in flushing system and user pending time, plus a few
arch housekeeping.

Consolidate that into a generic implementation. s390 is
a special case because pending user and system time accounting
there is hard to dissociate. So it's keeping its own implementation.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 16:41:32 +01:00
Frederic Weisbecker
fd25b4c2f2 vtime: Remove the underscore prefix invasion
Prepending irq-unsafe vtime APIs with underscores was actually
a bad idea as the result is a big mess in the API namespace that
is even waiting to be further extended. Also these helpers
are always called from irq safe callers except kvm. Just
provide a vtime_account_system_irqsafe() for this specific
case so that we can remove the underscore prefix on other
vtime functions.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 16:40:16 +01:00
Eric W. Biederman
5eaf563e53 userns: Allow unprivileged users to create user namespaces.
Now that we have been through every permission check in the kernel
having uid == 0 and gid == 0 in your local user namespace no
longer adds any special privileges.  Even having a full set
of caps in your local user namespace is safe because capabilies
are relative to your local user namespace, and do not confer
unexpected privileges.

Over the long term this should allow much more of the kernels
functionality to be safely used by non-root users.  Functionality
like unsharing the mount namespace that is only unsafe because
it can fool applications whose privileges are raised when they
are executed.  Since those applications have no privileges in
a user namespaces it becomes safe to spoof and confuse those
applications all you want.

Those capabilities will still need to be enabled carefully because
we may still need things like rlimits on the number of unprivileged
mounts but that is to avoid DOS attacks not to avoid fooling root
owned processes.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:24 -08:00
Eric W. Biederman
771b137168 vfs: Add a user namespace reference from struct mnt_namespace
This will allow for support for unprivileged mounts in a new user namespace.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:19 -08:00
Eric W. Biederman
50804fe373 pidns: Support unsharing the pid namespace.
Unsharing of the pid namespace unlike unsharing of other namespaces
does not take affect immediately.  Instead it affects the children
created with fork and clone.  The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0.  From the point of view of the new pid namespace the process
that created it is pid 0, as it's pid does not map.

A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules.  The core reasons for the existence of unshare.

I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
	setup stuff include pam
	child = fork();
	if (!child) {
		setuid()
                exec /bin/bash
        }
        waitpid(child);

        pam and other cleanup
}

As you can see there is a fork to create the unprivileged user
space process.  Which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace.  Further
most login processes do not cope with extraneous children which
means shifting the duty of reaping extraneous child process to
the creator of those extraneous children makes the system more
comprehensible.

The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requres changing the struct
pid on a process comes with a lot more races and pain.  Not
the least of which is that glibc caches getpid().

These semantics are implemented by having two notions
of the pid namespace of a proces.  There is task_active_pid_ns
which is the pid namspace the process was created with
and the pid namespace that all pids are presented to
that process in.  The task_active_pid_ns is stored
in the struct pid of the task.

Then there is the pid namespace that will be used for children
that pid namespace is stored in task->nsproxy->pid_ns.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:16 -08:00
Eric W. Biederman
1c4042c29b pidns: Consolidate initialzation of special init task state
Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
for the system init process, and another way for pid namespace
init processes test pid->nr == 1 and use the same code for both.

For the global init this results in SIGNAL_UNKILLABLE being set
much earlier in the initialization process.

This is a small cleanup and it paves the way for allowing unshare and
enter of the pid namespace as that path like our global init also will
not set CLONE_NEWPID.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:15 -08:00
Eric W. Biederman
57e8391d32 pidns: Add setns support
- Pid namespaces are designed to be inescapable so verify that the
  passed in pid namespace is a child of the currently active
  pid namespace or the currently active pid namespace itself.

  Allowing the currently active pid namespace is important so
  the effects of an earlier setns can be cancelled.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:14 -08:00
Eric W. Biederman
225778d68d pidns: Deny strange cases when creating pid namespaces.
task_active_pid_ns(current) != current->ns_proxy->pid_ns will
soon be allowed to support unshare and setns.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of current->ns_proxy->pid_ns.  However
that leads to strange cases like trying to have a single process be
init in multiple pid namespaces, which is racy and hard to think
about.

The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of task_active_pid_ns(current).  While
that seems less racy it does not provide any utility.

Therefore define the semantics of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
pid namespace creation fails.  That is easy to implement and easy
to think about.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:13 -08:00
Eric W. Biederman
af4b8a83ad pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:12 -08:00
Eric W. Biederman
5e1182deb8 pidns: Don't allow new processes in a dead pid namespace.
Set nr_hashed to -1 just before we schedule the work to cleanup proc.
Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
fail.

This guaranteees that processes never enter a pid namespaces after we
have cleaned up the state to support processes in a pid namespace.

Currently sending SIGKILL to all of the process in a pid namespace as
init exists gives us this guarantee but we need something a little
stronger to support unsharing and joining a pid namespace.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:11 -08:00
Eric W. Biederman
0a01f2cc39 pidns: Make the pidns proc mount/umount logic obvious.
Track the number of pids in the proc hash table.  When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task.  Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue.  This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:10 -08:00
Eric W. Biederman
17cf22c33e pidns: Use task_active_pid_ns where appropriate
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.

So I have used task_active_pid_ns everywhere I can.

In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:09 -08:00
Eric W. Biederman
49f4d8b93c pidns: Capture the user namespace and filter ns_last_pid
- Capture the the user namespace that creates the pid namespace
- Use that user namespace to test if it is ok to write to
  /proc/sys/kernel/ns_last_pid.

Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
in when destroying a pid_ns.  I have foloded his patch into this one
so that bisects will work properly.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:57:31 -08:00
Eric W. Biederman
038e7332b8 userns: make each net (net_ns) belong to a user_ns
The user namespace which creates a new network namespace owns that
namespace and all resources created in it.  This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives.  Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.

This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-18 22:46:23 -08:00
Eric W. Biederman
d328b83682 userns: make each net (net_ns) belong to a user_ns
The user namespace which creates a new network namespace owns that
namespace and all resources created in it.  This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives.  Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.

This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:30:55 -05:00
Ingo Molnar
ec05a2311c Merge branch 'sched/urgent' into sched/core
Merge in fixes before we queue up dependent bits, to avoid conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-11-18 09:34:44 +01:00
David S. Miller
67f4efdce7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor line offset auto-merges.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17 22:00:43 -05:00
Greg Kroah-Hartman
1e619a1bf9 Merge 3.7-rc6 into tty-next 2012-11-16 18:26:00 -08:00
Paul E. McKenney
4e79752c25 sched: Mark RCU reader in sched_show_task()
When sched_show_task() is invoked from try_to_freeze_tasks(), there is
no RCU read-side critical section, resulting in the following splat:

[  125.780730] ===============================
[  125.780766] [ INFO: suspicious RCU usage. ]
[  125.780804] 3.7.0-rc3+ #988 Not tainted
[  125.780838] -------------------------------
[  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
[  125.780946]
[  125.780946] other info that might help us debug this:
[  125.780946]
[  125.781031]
[  125.781031] rcu_scheduler_active = 1, debug_locks = 0
[  125.781087] 4 locks held by s2ram/4211:
[  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
[  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
[  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
[  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
[  125.781543]
[  125.781543] stack backtrace:
[  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
[  125.781632] Call Trace:
[  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
[  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
[  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
[  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
[  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
[  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
[  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
[  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
[  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
[  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
[  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b

This commit therefore adds the needed RCU read-side critical section.

Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-11-16 10:05:58 -08:00
Paul E. McKenney
c635a4e1c2 rcu: Separate accounting of callbacks from callback-free CPUs
Currently, callback invocations from callback-free CPUs are accounted to
the CPU that registered the callback, but using the same field that is
used for normal callbacks.  This makes it impossible to determine from
debugfs output whether callbacks are in fact being diverted.  This commit
therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
in which diverted callback invocations are counted.  RCU's debugfs tracing
still displays normal callback invocations using ci=, but displayed
diverted callbacks with nci=.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-11-16 10:05:57 -08:00
Paul E. McKenney
3fbfbf7a3b rcu: Add callback-free CPUs
RCU callback execution can add significant OS jitter and also can
degrade both scheduling latency and, in asymmetric multiprocessors,
energy efficiency.  This commit therefore adds the ability for selected
CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
these kthreads will do polling, removing the need for the offloaded
CPUs to do wakeups.  At least one CPU must be doing normal callback
processing: currently CPU 0 cannot be selected as a no-CBs CPU.
In addition, attempts to offline the last normal-CBs CPU will fail.

This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
this commit includes fixes to problems located by Fengguang Wu's
kbuild test robot.

[ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-11-16 10:05:56 -08:00
Paul E. McKenney
aac1cda34b Merge branches 'urgent.2012.10.27a', 'doc.2012.11.16a', 'fixes.2012.11.13a', 'srcu.2012.10.27a', 'stall.2012.11.13a', 'tracing.2012.11.08a' and 'idle.2012.10.24a' into HEAD
urgent.2012.10.27a: Fix for RCU user-mode transition (already in -tip).

doc.2012.11.08a: Documentation updates, most notably codifying the
	memory-barrier guarantees inherent to grace periods.

fixes.2012.11.13a: Miscellaneous fixes.

srcu.2012.10.27a: Allow statically allocated and initialized srcu_struct
	structures (courtesy of Lai Jiangshan).

stall.2012.11.13a: Add more diagnostic information to RCU CPU stall
	warnings, also decrease from 60 seconds to 21 seconds.

hotplug.2012.11.08a: Minor updates to CPU hotplug handling.

tracing.2012.11.08a: Improved debugfs tracing, courtesy of Michael Wang.

idle.2012.10.24a: Updates to RCU idle/adaptive-idle handling, including
	a boot parameter that maps normal grace periods to expedited.

Resolved conflict in kernel/rcutree.c due to side-by-side change.
2012-11-16 09:59:58 -08:00
Oleg Nesterov
32cdba1e05 uprobes: Use percpu_rw_semaphore to fix register/unregister vs dup_mmap() race
This was always racy, but 268720903f87e0b84b161626c4447b81671b5d18
"uprobes: Rework register_for_each_vma() to make it O(n)" should be
blamed anyway, it made everything worse and I didn't notice.

register/unregister call build_map_info() and then do install/remove
breakpoint for every mm which mmaps inode/offset. This can obviously
race with fork()->dup_mmap() in between and we can miss the child.

uprobe_register() could be easily fixed but unregister is much worse,
the new mm inherits "int3" from parent and there is no way to detect
this if uprobe goes away.

So this patch simply adds percpu_down_read/up_read around dup_mmap(),
and percpu_down_write/up_write into register_for_each_vma().

This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
and fold it into uprobe_end_dup_mmap().

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2012-11-16 14:52:51 +01:00
Hiraku Toyooka
d60da506cb tracing: Add a resize function to make one buffer equivalent to another buffer
Trace buffer size is now per-cpu, so that there are the following two
patterns in resizing of buffers.

  (1) resize per-cpu buffers to same given size
  (2) resize per-cpu buffers to another trace_array's buffer size
      for each CPU (such as preparing the max_tr which is equivalent
      to the global_trace's size)

__tracing_resize_ring_buffer() can be used for (1), and had
implemented (2) inside it for resetting the global_trace to the
original size.

(2) was also implemented in another place. So this patch assembles
them in a new function - resize_buffer_duplicate_size().

Link: http://lkml.kernel.org/r/20121017025616.2627.91226.stgit@falsita

Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-15 17:10:21 -05:00
Dan Carpenter
70f77b3f7e ftrace: Clear bits properly in reset_iter_read()
There is a typo here where '&' is used instead of '|' and it turns the
statement into a noop.  The original code is equivalent to:

	iter->flags &= ~((1 << 2) & (1 << 4));

Link: http://lkml.kernel.org/r/20120609161027.GD6488@elgon.mountain

Cc: stable@vger.kernel.org # all of them
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-15 16:10:17 -05:00
Davidlohr Bueso
8316bd72c0 PM / Hibernate: use rb_entry
Since the software suspend extents are organized in an rbtree, use rb_entry
instead of container_of, as it is semantically more appropriate in order to
get a node as it is iterated.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:37:08 +01:00
Daniel Walter
883ee4f79d PM / sysfs: replace strict_str* with kstrto*
Replace strict_strtoul() with kstrtoul() in pm_async_store() and
pm_qos_power_write().

[rjw: Modified subject and changelog.]

Signed-off-by: Daniel Walter <sahne@0x90.at>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:37:08 +01:00
Youquan Song
69a37beabf cpuidle: Quickly notice prediction failure for repeat mode
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.

There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
 governor will predict it is repeat mode and there is another IPI wake up idle
 CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
 idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.

In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.

Below is another case which will clearly show the patch much benefit:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;

void usage(void)
{
	fprintf(stderr,
		"Usage: idle_predict [options]\n"
		"  --help	-h  Print this help\n"
		"  --thread	-n  Thread number\n"
		"  --loop     	-l  Loop times in shallow Cstate\n"
		"  --delay	-t  Sleep time (uS)in shallow Cstate\n");
}

void *simple_loop() {
	int idle_num = 1;
	while (!(*shutdown)) {
		*count = *count + 1;

		if (idle_num % loop)
			usleep(delay);
		else {
			/* sleep 1 second */
			usleep(1000000);
			idle_num = 0;
		}
		idle_num++;
	}

}

static void sighand(int sig)
{
	*shutdown = 1;
}

int main(int argc, char *argv[])
{
	sigset_t sigset;
	int signum = SIGALRM;
	int i, c, er = 0, thread_num = 8;
	pthread_t pt[1024];

	static char optstr[] = "n:l:t:h:";

	while ((c = getopt(argc, argv, optstr)) != EOF)
		switch (c) {
			case 'n':
				thread_num = atoi(optarg);
				break;
			case 'l':
				loop = atoi(optarg);
				break;
			case 't':
				delay = atoi(optarg);
				break;
			case 'h':
			default:
				usage();
				exit(1);
		}

	printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
	count = malloc(sizeof(long));
	shutdown = malloc(sizeof(int));
	*count = 0;
	*shutdown = 0;

	sigemptyset(&sigset);
	sigaddset(&sigset, signum);
	sigprocmask (SIG_BLOCK, &sigset, NULL);
	signal(SIGINT, sighand);
	signal(SIGTERM, sighand);

	for(i = 0; i < thread_num ; i++)
		pthread_create(&pt[i], NULL, simple_loop, NULL);

	for (i = 0; i < thread_num; i++)
		pthread_join(pt[i], NULL);

	exit(0);
}

Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop

We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.

While after patched the kernel, we find that deep C-state will keep >99.6%.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:34:19 +01:00
Yasuaki Ishimatsu
5e5041f352 ACPI / processor: prevent cpu from becoming online
Even if acpi_processor_handle_eject() offlines cpu, there is a chance
to online the cpu after that. So the patch closes the window by using
get/put_online_cpus().

Why does the patch change _cpu_up() logic?

The patch cares the race of hot-remove cpu and _cpu_up(). If the patch
does not change it, there is the following race.

hot-remove cpu                         |  _cpu_up()
------------------------------------- ------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
     call put_online_cpus()            |
                                       | start and continue _cpu_up()
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

So _cpu_up() can continue to itself. And hot-remove cpu can also continue
itself. If the patch changes _cpu_up() logic, the race disappears as below:

hot-remove cpu                         | _cpu_up()
-----------------------------------------------------------------------
call acpi_processor_handle_eject()     |
     call cpu_down()                   |
     call get_online_cpus()            |
                                       | call cpu_hotplug_begin() and stop here
     call arch_unregister_cpu()        |
     call acpi_unmap_lsapic()          |
          cpu's cpu_present is set     |
          to false by set_cpu_present()|
     call put_online_cpus()            |
                                       | start _cpu_up()
                                       | check cpu_present() and return -EINVAL
     return acpi_processor_remove()    |
continue hot-remove the cpu            |

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:16:00 +01:00