21855 Commits

Author SHA1 Message Date
Peter Zijlstra
07c4a77613 perf: Update locking order
Update the locking order to note that ctx::lock nests inside of
child_mutex, as per:

  perf_ioctl():                ctx::mutex
  -> perf_event_for_each():    event::child_mutex
    -> _perf_event_enable():   ctx::lock

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-29 08:35:29 +01:00
Peter Zijlstra
a0733e695b perf: Remove __free_event()
There is but a single caller, remove the function - we already have
_free_event(), the extra indirection is nonsensical..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-29 08:35:25 +01:00
Alexei Starovoitov
e03e7ee34f perf/bpf: Convert perf_event_array to use struct file
Robustify refcounting.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-29 08:35:25 +01:00
Peter Zijlstra
828b6f0e26 perf: Fix NULL deref
Dan reported:

  1229                  if (ctx->task == TASK_TOMBSTONE ||
  1230                      !atomic_inc_not_zero(&ctx->refcount)) {
  1231                          raw_spin_unlock(&ctx->lock);
  1232                          ctx = NULL;
                                ^^^^^^^^^^
ctx is NULL.

  1233                  }
  1234
  1235                  WARN_ON_ONCE(ctx->task != task);
                                     ^^^^^^^^^^^^^^^^^
The patch adds a NULL dereference.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 63b6da39bb38 ("perf: Fix perf_event_exit_task() race")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-29 08:35:24 +01:00
Linus Torvalds
26cd83670f This includes three minor fixes, mostly due to cut-and-paste issues.
The first is a cut and paste issue that changed the amount of stack
 to skip when tracing a stack dump from 0 to 6, which basically made
 the stack disappear for small stack traces.
 
 The second fix is just removing an unused field in a struct that is no
 longer used, and currently just wastes space.
 
 The third is another cut-and-paste fix that had a tracepoint recording
 the wrong field (it was recording the previous field a second time).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWqibPAAoJEKKk/i67LK/8/NkH/3M6WB7RIiMMd4O403imbKcs
 yIH0j9vH6Z5hwoAUUr0bEw+gHVgzsiRky5z+fP0f1J3QdVAdgEig6RgQtIbWRynu
 i7fohNAiSMBob0wOIHTohQDKkQjvgoO9gO5S8nY6Axgpf4iqOTy3RF2a/gcltULY
 qdgy9A0vLk6yMbP6c0P+kEzg4y+Q90DsUh8YzQKW7F1EJPneDmNdug3VM16gefTR
 4yrodSBHxr8NV3kAhN8G7FjWmK5cBDFwD66vsti64mKVCW00hjYRCQ+5BrgQ7h0V
 EDC7kHisckLb415SQxe8XdF4fKbfE1PuQYZhjTo02hx9XCMeyxDWbjTF2PrZCHw=
 =gab6
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull minor tracing fixes from Steven Rostedt:
 "This includes three minor fixes, mostly due to cut-and-paste issues.

  The first is a cut and paste issue that changed the amount of stack to
  skip when tracing a stack dump from 0 to 6, which basically made the
  stack disappear for small stack traces.

  The second fix is just removing an unused field in a struct that is no
  longer used, and currently just wastes space.

  The third is another cut-and-paste fix that had a tracepoint recording
  the wrong field (it was recording the previous field a second time)"

* tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/dma-buf/fence: Fix timeline str value on fence_annotate_wait_on
  ftrace: Remove unused nr_trampolines var
  tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
2016-01-28 17:00:50 -08:00
Peter Zijlstra
6a3351b612 perf: Fix race in perf_event_exit_task_context()
There is a race between perf_event_exit_task_context() and
orphans_remove_work() which results in a use-after-free.

We mark ctx->task with TASK_TOMBSTONE to indicate a context is
'dead', under ctx->lock. After which point event_function_call()
on any event of that context will NOP

A concurrent orphans_remove_work() will only hold ctx->mutex for
the list iteration and not serialize against this. Therefore its
possible that orphans_remove_work()'s perf_remove_from_context()
call will fail, but we'll continue to free the event, with the
result of free'd memory still being on lists and everything.

Once perf_event_exit_task_context() gets around to acquiring
ctx->mutex it too will iterate the event list, encounter the
already free'd event and proceed to free it _again_. This fails
with the WARN in free_event().

Plug the race by having perf_event_exit_task_context() hold
ctx::mutex over the whole tear-down, thereby 'naturally'
serializing against all other sites, including the orphan work.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: alexander.shishkin@linux.intel.com
Cc: dsahern@gmail.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/20160125130954.GY6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-28 20:06:36 +01:00
Peter Zijlstra
78cd2c748f perf: Fix orphan hole
We should set event->owner before we install the event,
otherwise there is a hole where the target task can fork() and
we'll not inherit the event because it thinks the event is
orphaned.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-28 20:06:35 +01:00
James Morris
1c1ecf172a Merge tag 'seccomp-4.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux into for-linus 2016-01-28 10:53:54 +11:00
Arnd Bergmann
993e9fe1e7 PM: APM_EMULATION does not depend on PM
The APM emulation code does multiple things, and some of them depend on
PM_SLEEP, while the battery management does not. However, selecting
the symbol like SHARPSL_PM does causes a Kconfig warning:

warning: (SHARPSL_PM && PMAC_APM_EMU) selects APM_EMULATION which has unmet direct dependencies (PM && SYS_SUPPORTS_APM_EMULATION)

From all I can tell, this is completely harmless, and we can simply allow
APM_EMULATION to be enabled here, even if PM is not.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-27 23:20:14 +01:00
Jann Horn
103502a35c seccomp: always propagate NO_NEW_PRIVS on tsync
Before this patch, a process with some permissive seccomp filter
that was applied by root without NO_NEW_PRIVS was able to add
more filters to itself without setting NO_NEW_PRIVS by setting
the new filter from a throwaway thread with NO_NEW_PRIVS.

Signed-off-by: Jann Horn <jann@thejh.net>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-01-27 07:38:25 -08:00
Wanpeng Li
1ca8ec532f tick/nohz: Set the correct expiry when switching to nohz/lowres mode
commit 0ff53d096422 sets the next tick interrupt to the last jiffies update,
i.e. in the past, because the forward operation is invoked before the set
operation. There is no resulting damage (yet), but we get an extra pointless
tick interrupt.

Revert the order so we get the next tick interrupt in the future.

Fixes: commit 0ff53d096422 "tick: sched: Force tick interrupt and get rid of softirq magic"
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-27 12:45:57 +01:00
Arnd Bergmann
7809998ab1 tick/sched: Hide unused oneshot timer code
A couple of functions in kernel/time/tick-sched.c are only
relevant for oneshot timer mode, i.e. when hires-timers or
nohz mode are enabled. If both are disabled, we get gcc warnings
about them:

kernel/time/tick-sched.c:98:16: warning: 'tick_init_jiffy_update' defined but not used [-Wunused-function]
 static ktime_t tick_init_jiffy_update(void)
                ^
kernel/time/tick-sched.c:112:13: warning: 'tick_sched_do_timer' defined but not used [-Wunused-function]
 static void tick_sched_do_timer(ktime_t now)
             ^
kernel/time/tick-sched.c:134:13: warning: 'tick_sched_handle' defined but not used [-Wunused-function]
 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
             ^

This encloses the whole set of functions in an appropriate ifdef
to avoid the warning and to make it clearer when they are used.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1453736525-1959191-1-git-send-email-arnd@arndb.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-26 16:26:06 +01:00
Marc Zyngier
530cbe100e irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token
Let's take the (outlandish) example of an interrupt controller
capable of handling both wired interrupts and PCI MSIs.

With the current code, the PCI MSI domain is going to be tagged
with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.

Things get hairy when we start looking up the domain for a wired
interrupt (typically when creating it based on some firmware
information - DT or ACPI).

In irq_create_fwspec_mapping(), we perform the lookup using
DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
us one chance out of two to end up with the wrong domain, and
we try to configure a wired interrupt with the MSI domain.
Everything grinds to a halt pretty quickly.

What we really need to do is to start looking for a domain that
would uniquely identify a wired interrupt domain, and only use
DOMAIN_BUS_ANY as a fallback.

In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
token, which is going to be used exactly as described above.
Of course, this depends on the irqchip to setup the domain
bus_token, and nobody had to implement this so far.

Only so far.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-26 16:00:14 +01:00
Thomas Gleixner
b4abf91047 rtmutex: Make wait_lock irq safe
Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0					CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
					local_irq_disable()
					spin_lock(&timer->it_lock)
					spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

     rcu_read_lock()
     x = lookup();
     if (x)
     	spin_lock_irqsave(&x->it_lock);
     rcu_read_unlock();
     return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2016-01-26 11:08:35 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
eadee0ce6f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
  find_filesystem() (checking that s1 is equal to prefix of s2 of given
  length can be done in many ways, but "compare strlen(s1) with length
  and then do strncmp()" is not a good one...)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [regression] fix braino in fs/dlm/user.c
  pipe: limit the per-user amount of pages allocated in pipes
  find_filesystem(): simplify comparison
2016-01-22 10:24:03 -08:00
Tejun Heo
8bb5ef79bc cgroup: make sure a parent css isn't freed before its children
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-01-22 10:42:58 -05:00
Tejun Heo
aa226ff4a1 cgroup: make sure a parent css isn't offlined before its children
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
2016-01-22 10:42:57 -05:00
Tejun Heo
e93ad19d05 cpuset: make mm migration asynchronous
If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
2016-01-22 10:22:46 -05:00
Gavin Guo
1dff76b92f sched/numa: Fix use-after-free bug in the task_numa_compare
The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [<ffffffff81a6ce35>] dump_stack+0x45/0x56
   [<ffffffff81244aed>] print_trailer+0xfd/0x170
   [<ffffffff8124ac36>] object_err+0x36/0x40
   [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
   [<ffffffff8124d260>] kasan_report+0x40/0x50
   [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
   [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
   [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
   [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
   [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
   [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
   [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
   [<ffffffff8120ef02>] do_numa_page+0x192/0x200
   [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
   [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
   [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
   [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
   [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
   [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
   [<ffffffff81a772e8>] page_fault+0x28/0x30
   [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
   [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
   [<ffffffff810233c9>] ? sched_clock+0x9/0x10
   [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
   [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
   [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
   [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
   [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
   [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
   [<ffffffff81022c89>] ? read_tsc+0x9/0x20
   [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
   [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
   [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
   [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f

As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-22 13:51:04 +01:00
John Stultz
dd4e17ab70 ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO
Recently, in commit 37cf4dc3370f I forgot to check if the timeval being passed
was actually a timespec (as is signaled with ADJ_NANO).

This resulted in that patch breaking ADJ_SETOFFSET users who set
ADJ_NANO, by rejecting valid timespecs that were compared with
valid timeval ranges.

This patch addresses this by checking for the ADJ_NANO flag and
using the timepsec check instead in that case.

Reported-by: Harald Hoyer <harald@redhat.com>
Reported-by: Kay Sievers <kay@vrfy.org>
Fixes: 37cf4dc3370f "time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow"
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Link: http://lkml.kernel.org/r/1453417415-19110-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-22 12:01:42 +01:00
Sudeep Holla
6f16886b7c cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze
Commit 51164251f5c3 "sched / idle: Drop default_idle_call() fallback
from call_cpuidle()" made find_deepest_state() return non-negative
value and check all the states with index > 0.  Also as a result,
find_deepest_state() returns 0 even when enter_freeze callbacks are not
implemented and enter_freeze_proper() is called which ends up crashing
the kernel.

This patch updates the check for index > 0 in cpuidle_enter_freeze and
cpuidle_idle_call(when idle_should_freeze is true) to restore the
suspend-to-idle functionality in absence of enter_freeze callback.

Fixes: 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()"
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-22 02:35:49 +01:00
Linus Torvalds
eae21770b4 Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:
 "I'm pretty much done for -rc1 now:

   - the rest of MM, basically

   - lib/ updates

   - checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit

   - cpu_mask simplifications

   - kexec, rapidio, MAINTAINERS etc, etc.

   - more dma-mapping cleanups/simplifications from hch"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
  MAINTAINERS: add/fix git URLs for various subsystems
  mm: memcontrol: add "sock" to cgroup2 memory.stat
  mm: memcontrol: basic memory statistics in cgroup2 memory controller
  mm: memcontrol: do not uncharge old page in page cache replacement
  Documentation: cgroup: add memory.swap.{current,max} description
  mm: free swap cache aggressively if memcg swap is full
  mm: vmscan: do not scan anon pages if memcg swap limit is hit
  swap.h: move memcg related stuff to the end of the file
  mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online
  mm: vmscan: pass memcg to get_scan_count()
  mm: memcontrol: charge swap to cgroup2
  mm: memcontrol: clean up alloc, online, offline, free functions
  mm: memcontrol: flatten struct cg_proto
  mm: memcontrol: rein in the CONFIG space madness
  net: drop tcp_memcontrol.c
  mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
  mm: memcontrol: allow to disable kmem accounting for cgroup2
  mm: memcontrol: account "kmem" consumers in cgroup2 memory controller
  mm: memcontrol: move kmem accounting code to CONFIG_MEMCG
  mm: memcontrol: separate kmem code from legacy tcp accounting code
  ...
2016-01-21 12:32:08 -08:00
Linus Torvalds
d43421565b PCI changes for the v4.5 merge window:
Enumeration
     Simplify config space size computation (Bjorn Helgaas)
     Avoid iterating through ROM outside the resource window (Edward O'Callaghan)
     Support PCIe devices with short cfg_size (Jason S. McMullan)
     Add Netronome vendor and device IDs (Jason S. McMullan)
     Limit config space size for Netronome NFP6000 family (Jason S. McMullan)
     Add Netronome NFP4000 PF device ID (Simon Horman)
     Limit config space size for Netronome NFP4000 (Simon Horman)
     Print warnings for all invalid expansion ROM headers (Vladis Dronov)
 
   Resource management
     Fix minimum allocation address overwrite (Christoph Biedl)
 
   PCI device hotplug
     acpiphp_ibm: Fix null dereferences on null ibm_slot (Colin Ian King)
     pciehp: Always protect pciehp_disable_slot() with hotplug mutex (Guenter Roeck)
     shpchp: Constify hpc_ops structure (Julia Lawall)
     ibmphp: Remove unneeded NULL test (Julia Lawall)
 
   Power management
     Make ASPM sysfs link_state_store() consistent with link_state_show() (Andy Lutomirski)
 
   Virtualization
     Add function 1 DMA alias quirk for Lite-On/Plextor M6e/Marvell 88SS9183 (Tim Sander)
 
   MSI
     Remove empty pci_msi_init_pci_dev() (Bjorn Helgaas)
     Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD (Grygorii Strashko)
     Initialize MSI capability for all architectures (Guilherme G. Piccoli)
     Relax msi_domain_alloc() to support parentless MSI irqdomains (Liu Jiang)
 
   ARM Versatile host bridge driver
     Remove unused pci_sys_data structures (Lorenzo Pieralisi)
 
   Broadcom iProc host bridge driver
     Hide CONFIG_PCIE_IPROC (Arnd Bergmann)
     Do not use 0x in front of %pap (Dmitry V. Krivenok)
     Update iProc PCIe device tree binding (Ray Jui)
     Add PAXC interface support (Ray Jui)
     Add iProc PCIe MSI device tree binding (Ray Jui)
     Add iProc PCIe MSI support (Ray Jui)
 
   Freescale i.MX6 host bridge driver
     Use gpio_set_value_cansleep() (Fabio Estevam)
     Add support for active-low reset GPIO (Petr Štetiar)
 
   HiSilicon host bridge driver
     Add support for HiSilicon Hip06 PCIe host controllers (Gabriele Paoloni)
 
   Intel VMD host bridge driver
     Export irq_domain_set_info() for module use (Keith Busch)
     x86/PCI: Allow DMA ops specific to a PCI domain (Keith Busch)
     Use 32 bit PCI domain numbers (Keith Busch)
     Add driver for Intel Volume Management Device (VMD) (Keith Busch)
 
   Qualcomm host bridge driver
     Document PCIe devicetree bindings (Stanimir Varbanov)
     Add Qualcomm PCIe controller driver (Stanimir Varbanov)
     dts: apq8064: add PCIe devicetree node (Stanimir Varbanov)
     dts: ifc6410: enable PCIe DT node for this board (Stanimir Varbanov)
 
   Renesas R-Car host bridge driver
     Add support for R-Car H3 to pcie-rcar (Harunobu Kurokawa)
     Allow DT to override default window settings (Phil Edworthy)
     Convert to DT resource parsing API (Phil Edworthy)
     Revert "PCI: rcar: Build pcie-rcar.c only on ARM" (Phil Edworthy)
     Remove unused pci_sys_data struct from pcie-rcar (Phil Edworthy)
     Add runtime PM support to pcie-rcar (Phil Edworthy)
     Add Gen2 PHY setup to pcie-rcar (Phil Edworthy)
     Add gen2 fallback compatibility string for pci-rcar-gen2 (Simon Horman)
     Add gen2 fallback compatibility string for pcie-rcar (Simon Horman)
 
   Synopsys DesignWare host bridge driver
     Simplify control flow (Bjorn Helgaas)
     Make config accessor override checking symmetric (Bjorn Helgaas)
     Ensure ATU is enabled before IO/conf space accesses (Stanimir Varbanov)
 
   Miscellaneous
     Add of_pci_get_host_bridge_resources() stub (Arnd Bergmann)
     Check for PCI_HEADER_TYPE_BRIDGE equality, not bitmask (Bjorn Helgaas)
     Fix all whitespace issues (Bogicevic Sasa)
     x86/PCI: Simplify pci_bios_{read,write} (Geliang Tang)
     Use to_pci_dev() instead of open-coding it (Geliang Tang)
     Use kobj_to_dev() instead of open-coding it (Geliang Tang)
     Use list_for_each_entry() to simplify code (Geliang Tang)
     Fix typos in <linux/msi.h> (Thomas Petazzoni)
     x86/PCI: Clarify AMD Fam10h config access restrictions comment (Tomasz Nowicki)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWoQIuAAoJEFmIoMA60/r8ckYP/0ZrkANeN1SB5cQVi2k7aceq
 kQb1Hk6ifxohJvgpJ/iwmVCHoApyeBfUBfrC+fUpIC2f7ncPsE5HNyjqpAWzFzj2
 sYWwY029yjBQ9g4mPhkvjBXfha+lNtLthWc+Xxcat5pdcyG63Dg4SfJKWm2ZYnbN
 0GJzyRZXIwAMnNf0KIr61Aqru0nXeHvi5wblyJ08UZ7AcNzCtB0wKLmE3S6SeZVF
 f2fry35zcGu+TFvQ1hAYemfl3XyDBJ87nPiKzJAwYSaKcWPFWt+72PBDPO6X9squ
 6prm4nmAgeG2Oo4Zu0fbkDlB2bEsWUc14/xT0i5Wfs35vcwzF+S1zirJAtVqoNir
 NgC7fSbEHbsS7FZOz0rBOBIvIkbb6NdfLFuZqUFv0X1M5bRFywjo8lZRfAYoGJzK
 Mmus0uKbklx5m6RT5adf9+Plev1YJT6XZW9XrDpGnxrwRyPjHmyvuTWsYkumxY7Q
 CE5Wr3p7q2I2+MtrQVv2D9Nzsb+4zQ6BgHrd2vwR/IxTsfdXLU7+B691wkUDX8No
 UKFxBd0FiVCn+srG96u7lWQvdoUqoNCogTZSVzGR5gFBv3zAN9gi8HS7NbV558Mg
 Io3Xw+6dcbG33uvWdU6jHEDLMQsohZcp05Q5esCgRQNV4cGJbPxBDtOZEO/ezvW4
 FAI7lfgYTFiQK3NzE3Ng
 =z9mQ
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes for the v4.5 merge window:

  Enumeration:
   - Simplify config space size computation (Bjorn Helgaas)
   - Avoid iterating through ROM outside the resource window (Edward O'Callaghan)
   - Support PCIe devices with short cfg_size (Jason S. McMullan)
   - Add Netronome vendor and device IDs (Jason S. McMullan)
   - Limit config space size for Netronome NFP6000 family (Jason S. McMullan)
   - Add Netronome NFP4000 PF device ID (Simon Horman)
   - Limit config space size for Netronome NFP4000 (Simon Horman)
   - Print warnings for all invalid expansion ROM headers (Vladis Dronov)

  Resource management:
   - Fix minimum allocation address overwrite (Christoph Biedl)

  PCI device hotplug:
   - acpiphp_ibm: Fix null dereferences on null ibm_slot (Colin Ian King)
   - pciehp: Always protect pciehp_disable_slot() with hotplug mutex (Guenter Roeck)
   - shpchp: Constify hpc_ops structure (Julia Lawall)
   - ibmphp: Remove unneeded NULL test (Julia Lawall)

  Power management:
   - Make ASPM sysfs link_state_store() consistent with link_state_show() (Andy Lutomirski)

  Virtualization
   - Add function 1 DMA alias quirk for Lite-On/Plextor M6e/Marvell 88SS9183 (Tim Sander)

  MSI:
   - Remove empty pci_msi_init_pci_dev() (Bjorn Helgaas)
   - Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD (Grygorii Strashko)
   - Initialize MSI capability for all architectures (Guilherme G. Piccoli)
   - Relax msi_domain_alloc() to support parentless MSI irqdomains (Liu Jiang)

  ARM Versatile host bridge driver:
   - Remove unused pci_sys_data structures (Lorenzo Pieralisi)

  Broadcom iProc host bridge driver:
   - Hide CONFIG_PCIE_IPROC (Arnd Bergmann)
   - Do not use 0x in front of %pap (Dmitry V. Krivenok)
   - Update iProc PCIe device tree binding (Ray Jui)
   - Add PAXC interface support (Ray Jui)
   - Add iProc PCIe MSI device tree binding (Ray Jui)
   - Add iProc PCIe MSI support (Ray Jui)

  Freescale i.MX6 host bridge driver:
   - Use gpio_set_value_cansleep() (Fabio Estevam)
   - Add support for active-low reset GPIO (Petr Štetiar)

  HiSilicon host bridge driver:
   - Add support for HiSilicon Hip06 PCIe host controllers (Gabriele Paoloni)

  Intel VMD host bridge driver:
   - Export irq_domain_set_info() for module use (Keith Busch)
   - x86/PCI: Allow DMA ops specific to a PCI domain (Keith Busch)
   - Use 32 bit PCI domain numbers (Keith Busch)
   - Add driver for Intel Volume Management Device (VMD) (Keith Busch)

  Qualcomm host bridge driver:
   - Document PCIe devicetree bindings (Stanimir Varbanov)
   - Add Qualcomm PCIe controller driver (Stanimir Varbanov)
   - dts: apq8064: add PCIe devicetree node (Stanimir Varbanov)
   - dts: ifc6410: enable PCIe DT node for this board (Stanimir Varbanov)

  Renesas R-Car host bridge driver:
   - Add support for R-Car H3 to pcie-rcar (Harunobu Kurokawa)
   - Allow DT to override default window settings (Phil Edworthy)
   - Convert to DT resource parsing API (Phil Edworthy)
   - Revert "PCI: rcar: Build pcie-rcar.c only on ARM" (Phil Edworthy)
   - Remove unused pci_sys_data struct from pcie-rcar (Phil Edworthy)
   - Add runtime PM support to pcie-rcar (Phil Edworthy)
   - Add Gen2 PHY setup to pcie-rcar (Phil Edworthy)
   - Add gen2 fallback compatibility string for pci-rcar-gen2 (Simon Horman)
   - Add gen2 fallback compatibility string for pcie-rcar (Simon Horman)

  Synopsys DesignWare host bridge driver:
   - Simplify control flow (Bjorn Helgaas)
   - Make config accessor override checking symmetric (Bjorn Helgaas)
   - Ensure ATU is enabled before IO/conf space accesses (Stanimir Varbanov)

  Miscellaneous:
   - Add of_pci_get_host_bridge_resources() stub (Arnd Bergmann)
   - Check for PCI_HEADER_TYPE_BRIDGE equality, not bitmask (Bjorn Helgaas)
   - Fix all whitespace issues (Bogicevic Sasa)
   - x86/PCI: Simplify pci_bios_{read,write} (Geliang Tang)
   - Use to_pci_dev() instead of open-coding it (Geliang Tang)
   - Use kobj_to_dev() instead of open-coding it (Geliang Tang)
   - Use list_for_each_entry() to simplify code (Geliang Tang)
   - Fix typos in <linux/msi.h> (Thomas Petazzoni)
   - x86/PCI: Clarify AMD Fam10h config access restrictions comment (Tomasz Nowicki)"

* tag 'pci-v4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (58 commits)
  PCI: Add function 1 DMA alias quirk for Lite-On/Plextor M6e/Marvell 88SS9183
  PCI: Limit config space size for Netronome NFP4000
  PCI: Add Netronome NFP4000 PF device ID
  x86/PCI: Add driver for Intel Volume Management Device (VMD)
  PCI/AER: Use 32 bit PCI domain numbers
  x86/PCI: Allow DMA ops specific to a PCI domain
  irqdomain: Export irq_domain_set_info() for module use
  PCI: host: Add of_pci_get_host_bridge_resources() stub
  genirq/MSI: Relax msi_domain_alloc() to support parentless MSI irqdomains
  PCI: rcar: Add Gen2 PHY setup to pcie-rcar
  PCI: rcar: Add runtime PM support to pcie-rcar
  PCI: designware: Make config accessor override checking symmetric
  PCI: ibmphp: Remove unneeded NULL test
  ARM: dts: ifc6410: enable PCIe DT node for this board
  ARM: dts: apq8064: add PCIe devicetree node
  PCI: hotplug: Use list_for_each_entry() to simplify code
  PCI: rcar: Remove unused pci_sys_data struct from pcie-rcar
  PCI: hisi: Add support for HiSilicon Hip06 PCIe host controllers
  PCI: Avoid iterating through memory outside the resource window
  PCI: acpiphp_ibm: Fix null dereferences on null ibm_slot
  ...
2016-01-21 11:52:16 -08:00
Alexander Shishkin
45c815f06b perf: Synchronously free aux pages in case of allocation failure
We are currently using asynchronous deallocation in the error path in
AUX mmap code, which is unnecessary and also presents a problem for users
that wish to probe for the biggest possible buffer size they can get:
they'll get -EINVAL on all subsequent attemts to allocate a smaller
buffer before the asynchronous deallocation callback frees up the pages
from the previous unsuccessful attempt.

Currently, gdb does that for allocating AUX buffers for Intel PT traces.
More specifically, overwrite mode of AUX pmus that don't support hardware
sg (some implementations of Intel PT, for instance) is limited to only
one contiguous high order allocation for its buffer and there is no way
of knowing its size without trying.

This patch changes error path freeing to be synchronous as there won't
be any contenders for the AUX pages at that point.

Reported-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1453216469-9509-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:27 +01:00
Peter Zijlstra
63b6da39bb perf: Fix perf_event_exit_task() race
There is a race against perf_event_exit_task() vs
event_function_call(),find_get_context(),perf_install_in_context()
(iow, everyone).

Since there is no permanent marker on a context that its dead, it is
quite possible that we access (and even modify) a context after its
passed through perf_event_exit_task().

For instance, find_get_context() might find the context still
installed, but by the time we get to perf_install_in_context() it
might already have passed through perf_event_exit_task() and be
considered dead, we will however still add the event to it.

Solve this by marking a ctx dead by setting its ctx->task value to -1,
it must be !0 so we still know its a (former) task context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:25 +01:00
Peter Zijlstra
c97f473643 perf: Add more assertions
Try to trigger warnings before races do damage.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:25 +01:00
Peter Zijlstra
fae3fde651 perf: Collapse and fix event_function_call() users
There is one common bug left in all the event_function_call() users,
between loading ctx->task and getting to the remote_function(),
ctx->task can already have been changed.

Therefore we need to double check and retry if ctx->task != current.

Insert another trampoline specific to event_function_call() that
checks for this and further validates state. This also allows getting
rid of the active/inactive functions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:24 +01:00
Peter Zijlstra
32132a3d0d perf: Specialize perf_event_exit_task()
The perf_remove_from_context() usage in __perf_event_exit_task() is
different from the other usages in that this site has already
detached and scheduled out the task context.

This will stand in the way of stronger assertions checking the (task)
context scheduling invariants.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:24 +01:00
Peter Zijlstra
39a4364076 perf: Fix task context scheduling
There is a very nasty problem wrt disabling the perf task scheduling
hooks.

Currently we {set,clear} ctx->is_active on every
__perf_event_task_sched_{in,out}, _however_ this means that if we
disable these calls we'll have task contexts with ->is_active set that
are not active and 'active' task contexts without ->is_active set.

This can result in event_function_call() looping on the ctx->is_active
condition basically indefinitely.

Resolve this by changing things such that contexts without events do
not set ->is_active like we used to. From this invariant it trivially
follows that if there are no (task) events, every task ctx is inactive
and disabling the context switch hooks is harmless.

This leaves two places that need attention (and already had
accumulated weird and wonderful hacks to work around, without
recognising this actual problem).

Namely:

 - perf_install_in_context() will need to deal with installing events
   in an inactive context, meaning it cannot rely on ctx-is_active for
   its IPIs.

 - perf_remove_from_context() will have to mark a context as inactive
   when it removes the last event.

For specific detail, see the patch/comments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:23 +01:00
Peter Zijlstra
63e30d3e52 perf: Make ctx->is_active and cpuctx->task_ctx consistent
For no apparent reason and to great confusion the rules for
ctx->is_active and cpuctx->task_ctx are different. This means that its
not always possible to find all active (task) contexts.

Fix this such that if ctx->is_active gets set, we also set (or verify)
cpuctx->task_ctx.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:23 +01:00
Peter Zijlstra
25432ae96a perf: Optimize perf_sched_events() usage
It doesn't make sense to take up-to _4_ references on
perf_sched_events() per event, avoid doing this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:22 +01:00
Peter Zijlstra
aee7dbc45f perf: Simplify/fix perf_event_enable() event scheduling
Like perf_enable_on_exec(), perf_event_enable() event scheduling has problems
respecting the context hierarchy when trying to schedule events (for
example, it will try and add a pinned event without first removing
existing flexible events).

So simplify it by using the new ctx_resched() call which will DTRT.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:22 +01:00
Peter Zijlstra
8833d0e286 perf: Use task_ctx_sched_out()
We have a function that does exactly what we want here, use it. This
reduces the amount of cpuctx->task_ctx muckery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:21 +01:00
Peter Zijlstra
3e349507d1 perf: Fix perf_enable_on_exec() event scheduling
There are two problems with the current perf_enable_on_exec() event
scheduling:

  - the newly enabled events will be immediately scheduled
    irrespective of their ctx event list order.

  - there's a hole in the ctx->lock between scheduling the events
    out and putting them back on.

Esp. the latter issue is a real problem because a hole in event
scheduling leaves the thing in an observable inconsistent state,
confusing things.

Fix both issues by first doing the enable iteration and at the end,
when there are newly enabled events, reschedule the ctx in one go.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:20 +01:00
Peter Zijlstra
5947f6576e perf: Remove stale comment
The comment here is horribly out of date, remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:20 +01:00
Peter Zijlstra
70a0165752 perf: Fix cgroup scheduling in perf_enable_on_exec()
There is a comment that states that perf_event_context_sched_in() will
also switch in the cgroup events, I cannot find it does so. Therefore
all the resulting logic goes out the window too.

Clean that up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:19 +01:00
Peter Zijlstra
7e41d17753 perf: Fix cgroup event scheduling
There appears to be a problem in __perf_event_task_sched_in() wrt
cgroup event scheduling.

The normal event scheduling order is:

	CPU pinned
	Task pinned
	CPU flexible
	Task flexible

And since perf_cgroup_sched*() only schedules the cpu context, we must
call this _before_ adding the task events.

Note: double check what happens on the ctx switch optimization where
the task ctx isn't scheduled.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:19 +01:00
Peter Zijlstra
c994d61367 perf: Add lockdep assertions
Make various bugs easier to see.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21 18:54:18 +01:00
Linus Torvalds
30f05309bd More power management and ACPI updates for v4.5-rc1
- Modify the driver core and the USB subsystem to allow USB devices
    to stay suspended over system suspend/resume cycles if they have
    been runtime-suspended already beforehand and fix some bugs on
    top of these changes (Tomeu Vizoso, Rafael Wysocki).
 
  - Update ACPICA to upstream revision 20160108, including updates
    of the ACPICA's copyright notices, a code fixup resulting from
    a regression fix that was necessary in the upstream code only
    (the regression fixed by it has never been present in Linux)
    and a compiler warning fix (Bob Moore, Lv Zheng).
 
  - Fix a recent regression in the cpuidle menu governor that broke
    it on practically all architectures other than x86 and make a
    couple of optimizations on top of that fix (Rafael Wysocki).
 
  - Clean up the selection of cpuidle governors depending on whether
    or not the kernel is configured for tickless systems (Jean Delvare).
 
  - Revert a recent commit that introduced a regression in the ACPI
    backlight driver, address the problem it attempted to fix in a
    different way and revert one more cosmetic change depending on
    the problematic commit (Hans de Goede).
 
  - Add two more ACPI backlight quirks (Hans de Goede).
 
  - Fix a few minor problems in the core devfreq code, clean it up
    a bit and update the MAINTAINERS information related to it
    (Chanwoo Choi, MyungJoo Ham).
 
  - Improve an error message in the ACPI fan driver (Andy Lutomirski).
 
  - Fix a recent build regression in the cpupower tool (Shreyas Prabhu).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJWoCQ4AAoJEILEb/54YlRxLscQALEFVKSRnNaco72OqqRZs9Bu
 1RI6TgHTpZxR+Ef0+QWqE1QMnDwfImGhKDbSRm/t3S2sMYYZbAOL8cu4y6GmkBv4
 bOon/f9WEoPlQCFoo/6U4u8H45rNT5W9zX5+Bva8x+4Wu3n2J1QdvirnS5JHeHe1
 o6tGLaHuZXSwX8SLnCk8gJYK1VhATxbubJtpcVtvlnAhO11qUAwsscCrkUmB60i7
 5hLyrZb06hoa/hZVcIefGFuSd9qPhzDMQE2M20EohQ7UVkNJQdY9QNHMqCk2P42T
 nMWCNSwGnwfiO1p9ByXqunOFBCmyL7P+KV/DHsz6TFCVjz+jeG53Kqey9SkSJ/2W
 iaAE80K9MfOMvg8j7rib6fTn5uXBwRfqdeUDF/Hr64QqJoRn3R2LX4HmZe4L8ufb
 zA1rece67o8FD+7p7GkNbT3rPV/kA62tn/moFk446X5N+b261Kz90t1DVci8kRVf
 k+1gcvEdqO0GPpEHoirfXrBvQFixqkXakKj4r2aAob/DldQeLX7CkOUuRRJ1ykec
 bxwI9R0v8MlVe5rDxg+rPB0I9EFxRDmxqxpU5j0MRWxKnMRzLvBtHuk8YNVS/eU1
 xwyJOGcwF6yI0PaCFggPqmhebSrWLE7wJxaK+3bC+yiDTvHYPjB+4MfQrmkRAwwM
 azgb+ZgXDYx5wXeb8EjB
 =bKJ9
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management and ACPI updates from Rafael Wysocki:
 "This includes fixes on top of the previous batch of PM+ACPI updates
  and some new material as well.

  From the new material perspective the most significant are the driver
  core changes that should allow USB devices to stay suspended over
  system suspend/resume cycles if they have been runtime-suspended
  already beforehand.  Apart from that, ACPICA is updated to upstream
  revision 20160108 (cosmetic mostly, but including one fixup on top of
  the previous ACPICA update) and there are some devfreq updates the
  didn't make it before (due to timing).

  A few recent regressions are fixed, most importantly in the cpuidle
  menu governor and in the ACPI backlight driver and some x86 platform
  drivers depending on it.

  Some more bugs are fixed and cleanups are made on top of that.

  Specifics:

   - Modify the driver core and the USB subsystem to allow USB devices
     to stay suspended over system suspend/resume cycles if they have
     been runtime-suspended already beforehand and fix some bugs on top
     of these changes (Tomeu Vizoso, Rafael Wysocki).

   - Update ACPICA to upstream revision 20160108, including updates of
     the ACPICA's copyright notices, a code fixup resulting from a
     regression fix that was necessary in the upstream code only (the
     regression fixed by it has never been present in Linux) and a
     compiler warning fix (Bob Moore, Lv Zheng).

   - Fix a recent regression in the cpuidle menu governor that broke it
     on practically all architectures other than x86 and make a couple
     of optimizations on top of that fix (Rafael Wysocki).

   - Clean up the selection of cpuidle governors depending on whether or
     not the kernel is configured for tickless systems (Jean Delvare).

   - Revert a recent commit that introduced a regression in the ACPI
     backlight driver, address the problem it attempted to fix in a
     different way and revert one more cosmetic change depending on the
     problematic commit (Hans de Goede).

   - Add two more ACPI backlight quirks (Hans de Goede).

   - Fix a few minor problems in the core devfreq code, clean it up a
     bit and update the MAINTAINERS information related to it (Chanwoo
     Choi, MyungJoo Ham).

   - Improve an error message in the ACPI fan driver (Andy Lutomirski).

   - Fix a recent build regression in the cpupower tool (Shreyas
     Prabhu)"

* tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
  cpuidle: menu: Avoid pointless checks in menu_select()
  sched / idle: Drop default_idle_call() fallback from call_cpuidle()
  cpupower: Fix build error in cpufreq-info
  cpuidle: Don't enable all governors by default
  cpuidle: Default to ladder governor on ticking systems
  time: nohz: Expose tick_nohz_enabled
  ACPICA: Update version to 20160108
  ACPICA: Silence a -Wbad-function-cast warning when acpi_uintptr_t is 'uintptr_t'
  ACPICA: Additional 2016 copyright changes
  ACPICA: Reduce regression fix divergence from upstream ACPICA
  ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830
  ACPI / video: Revert "thinkpad_acpi: Use acpi_video_handles_brightness_key_presses()"
  ACPI / video: Document acpi_video_handles_brightness_key_presses() a bit
  ACPI / video: Fix using an uninitialized mutex / list_head in acpi_video_handles_brightness_key_presses()
  ACPI / video: Revert "ACPI / video: driver must be registered before checking for keypresses"
  ACPI / fan: Improve acpi_device_update_power error message
  ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700
  cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0
  MAINTAINERS: Add devfreq-event entry
  MAINTAINERS: Add missing git repository and directory for devfreq
  ...
2016-01-20 19:06:49 -08:00
Mateusz Guzik
ddf1d398e5 prctl: take mmap sem for writing to protect against others
An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.

proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
        BUG_ON(arg_start > arg_end);
        BUG_ON(env_start > env_end);

These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:

  kernel BUG at fs/proc/base.c:240!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: virtio_net
  CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
  RIP: proc_pid_cmdline_read+0x520/0x530
  RSP: 0018:ffff8800784d3db8  EFLAGS: 00010206
  RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
  RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
  RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
  R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
  FS:  00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
  Call Trace:
    __vfs_read+0x37/0x100
    vfs_read+0x82/0x130
    SyS_read+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
  Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
  RIP   proc_pid_cmdline_read+0x520/0x530
  ---[ end trace 97882617ae9c6818 ]---

Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.

Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.

The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.

The second patch is optional and puts locking around aformentioned
consumers for safety.  Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.

This patch (of 2):

The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.

The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.

Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Andrey Ryabinin
5c9cf8af2e kernel: printk: specify alignment for struct printk_log
On architectures that have support for efficient unaligned access struct
printk_log has 4-byte alignment.  Specify alignment attribute in type
declaration.

The whole point of this patch is to fix deadlock which happening when
UBSAN detects unaligned access in printk() thus UBSAN recursively calls
printk() with logbuf_lock held by top printk() call.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Kees Cook
41662f5cc5 sysctl: enable strict writes
SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014.  Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2].  As such, it
appears safe to flip this to the strict state now.

[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Xunlei Pang
978e30c9b4 kexec: move some memembers and definitions within the scope of CONFIG_KEXEC_FILE
Move the stuff currently only used by the kexec file code within
CONFIG_KEXEC_FILE (and CONFIG_KEXEC_VERIFY_SIG).

Also move internal "struct kexec_sha_region" and "struct kexec_buf" into
"kexec_internal.h".

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Geliang Tang
2b24692b92 kernel/kexec_core.c: use list_for_each_entry_safe in kimage_free_page_list
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Xunlei Pang
cdf4b3fa03 kexec: set KEXEC_TYPE_CRASH before sanity_check_segment_list()
sanity_check_segment_list() checks KEXEC_TYPE_CRASH flag to ensure all the
segments of the loaded crash kernel are within the kernel crash resource
limits, so set the flag beforehand.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Rasmus Villemoes
9425676a36 kernel/cpu.c: make set_cpu_* static inlines
Almost all callers of the set_cpu_* functions pass an explicit true or
false.  Making them static inline thus replaces the function calls with a
simple set_bit/clear_bit, saving some .text.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Rasmus Villemoes
5aec01b834 kernel/cpu.c: eliminate cpu_*_mask
Replace the variables cpu_possible_mask, cpu_online_mask, cpu_present_mask
and cpu_active_mask with macros expanding to expressions of the same type
and value, eliminating some indirection.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Rasmus Villemoes
4b804c85dc kernel/cpu.c: export __cpu_*_mask
Exporting the cpumasks __cpu_possible_mask and friends will allow us to
remove the extra indirection through the cpu_*_mask variables.  It will
also allow the set_cpu_* functions to become static inlines, which will
give a .text reduction.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Rasmus Villemoes
c4c54dd1ca kernel/cpu.c: change type of cpu_possible_bits and friends
Change cpu_possible_bits and friends (online, present, active) from being
bitmaps that happen to have the right size to actually being struct
cpumasks.  Also rename them to __cpu_xyz_mask.  This is mostly a small
cleanup in preparation for exporting them and, eventually, eliminating the
extra indirection through the cpu_xyz_mask variables.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00