15086 Commits

Author SHA1 Message Date
Paul E. McKenney
dc975e94f3 tracing: Export trace_clock_local()
The rcutorture tests need to be able to trace the time of the
beginning of an RCU read-side critical section, and thus need access
to trace_clock_local().  This commit therefore adds a the needed
EXPORT_SYMBOL_GPL().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-01-08 14:14:55 -08:00
Paul Gortmaker
1b0048a44c rcu: Make rcu_nocb_poll an early_param instead of module_param
The as-documented rcu_nocb_poll will fail to enable this feature
for two reasons.  (1) there is an extra "s" in the documented
name which is not in the code, and (2) since it uses module_param,
it really is expecting a prefix, akin to "rcutree.fanout_leaf"
and the prefix isn't documented.

However, there are several reasons why we might not want to
simply fix the typo and add the prefix:

1) we'd end up with rcutree.rcu_nocb_poll, and rather probably make
a change to rcutree.nocb_poll

2) if we did #1, then the prefix wouldn't be consistent with the
rcu_nocbs=<cpumap> parameter (i.e. one with, one without prefix)

3) the use of module_param in a header file is less than desired,
since it isn't immediately obvious that it will get processed
via rcutree.c and get the prefix from that (although use of
module_param_named() could clarify that.)

4) the implied export of /sys/module/rcutree/parameters/rcu_nocb_poll
data to userspace via module_param() doesn't really buy us anything,
as it is read-only and we can tell if it is enabled already without
it, since there is a printk at early boot telling us so.

In light of all that, just change it from a module_param() to an
early_setup() call, and worry about adding it to /sys later on if
we decide to allow a dynamic setting of it.

Also change the variable to be tagged as read_mostly, since it
will only ever be fiddled with at most, once at boot.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-01-08 14:12:19 -08:00
Paul Gortmaker
353af9c9a8 rcu: Prevent soft-lockup complaints about no-CBs CPUs
The wait_event() at the head of the rcu_nocb_kthread() can result in
soft-lockup complaints if the CPU in question does not register RCU
callbacks for an extended period.  This commit therefore changes
the wait_event() to a wait_event_interruptible().

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-01-08 14:12:18 -08:00
Olof Johansson
981302783e ARM/...: timer and clock events cleanup, and remove struct sys_timer
This branch contains a number of cleanups and unifications to various
 timer- clock-events- and ARM timer code. The main points are:
 
 1) Convert arch_gettimeoffset to a pointer, so that architectures with
    multiple timer implementations can simply set this standard pointer
    rather than maintaining their own arch-specific pointers for the
    same purpose. Various architectures are converted to using this new
    feature.
 
 2) Conversion of ARM timer implementations to use clock_event_devices's
    suspend/resume operations, rather than the ARM-specific sys_timer
    versions. Thus, the ARM code begins to use more common infra-structure
    rather than arch-specific code.
 
 3) Removal of ARM's struct sys_timer completely, now that everything uses
    common code.
 
 4) Introduction of drivers/clocksource/clksrc-of.c, which allows ARM clock
    source implementations to be moved into drivers/clocksource, with the
    need to add SoC-specific header files for each timer initialization
    function; instead, all enabled implementations are registered into a
    table which a single core function iterates over, and calls the
    relevant initialization functions based on device tree. At least the
    Tegra and BCM2835 clocksource implementations will use this feature in
    the 3.9 kernel cycle.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ5yJiAAoJEMzrak5tbycxaPQQAI4gxhVfJk44R8NudWX/Q+6l
 xq60iILp9JfdOYmQY7FzzxON5DkDrosbI8TKS8K74VV7Lx7sP7iKc7BS1jtVKbFP
 ow/oqFlDlPL00Ne/Zzf7A4zDI0CYvGwNvkND+YcARrW+PRO+omGcU1qnUVhMOUCp
 s5v6Xa5rgUZWJ6QslKagfgHRpZMFtf1e74waS4zVWP0HymyWU5v9x8GaGKSt6Aj6
 tk1Of/Bc7dhmsul2QZdolSqZbu1lTR3QnaKltMAFklbHjJKNR6w7llQBCkbp5Qyg
 OUzMVUpDjQzfPMQyu2ovzEQ6gMZYX/XuVNlnAmLcy9b1A0TExkRHICiEpkGdVn4b
 Kh4MQWW4/V19pOgROs8+L/XfRmi96EDEMeb/kaVo7r5iMO85UwouRpBP/KLPKvZ+
 2pXTdZmbOAQhu5OKzx1q8pD9gm+quMs3fy8Fc7F0hZkXQUlqWHAcBcV8Bm7hW8u+
 32gUpUcZV45sdK6x+POLY+6F836aFWdgBug90fhnGKqq9HcDDvzST09ctrtKoSSZ
 LfSvklv8szYnVu09vvO5HZTKMcMSC5Y8Uo4RZkBKKTOMWFbhWvkpqOMEW/+R+r2X
 FrUVRI7SShbXV2W2EVRFRsTwMALy6gr859MclQGmNzwbdjM68xjeKdnsv0TTAAwM
 emqx0qzopWUN2Btp3yPM
 =DOD6
 -----END PGP SIGNATURE-----

Merge tag 'swarren-for-3.9-arm-timer-rework' of git://git.kernel.org/pub/scm/linux/kernel/git/swarren/linux-tegra into next/cleanup

From Stephen Warren:
ARM/...: timer and clock events cleanup, and remove struct sys_timer

This branch contains a number of cleanups and unifications to various
timer- clock-events- and ARM timer code. The main points are:

1) Convert arch_gettimeoffset to a pointer, so that architectures with
   multiple timer implementations can simply set this standard pointer
   rather than maintaining their own arch-specific pointers for the
   same purpose. Various architectures are converted to using this new
   feature.

2) Conversion of ARM timer implementations to use clock_event_devices's
   suspend/resume operations, rather than the ARM-specific sys_timer
   versions. Thus, the ARM code begins to use more common infra-structure
   rather than arch-specific code.

3) Removal of ARM's struct sys_timer completely, now that everything uses
   common code.

4) Introduction of drivers/clocksource/clksrc-of.c, which allows ARM clock
   source implementations to be moved into drivers/clocksource, with the
   need to add SoC-specific header files for each timer initialization
   function; instead, all enabled implementations are registered into a
   table which a single core function iterates over, and calls the
   relevant initialization functions based on device tree. At least the
   Tegra and BCM2835 clocksource implementations will use this feature in
   the 3.9 kernel cycle.

* tag 'swarren-for-3.9-arm-timer-rework' of git://git.kernel.org/pub/scm/linux/kernel/git/swarren/linux-tegra:
  clocksource: add common of_clksrc_init() function
  ARM: delete struct sys_timer
  ARM: remove struct sys_timer suspend and resume fields
  ARM: samsung: register syscore_ops for timer resume directly
  ARM: ux500: convert timer suspend/resume to clock_event_device
  ARM: sa1100: convert timer suspend/resume to clock_event_device
  ARM: pxa: convert timer suspend/resume to clock_event_device
  ARM: at91: convert timer suspend/resume to clock_event_device
  ARM: set arch_gettimeoffset directly
  m68k: set arch_gettimeoffset directly
  time: convert arch_gettimeoffset to a pointer
  cris: move usec/nsec conversion to do_slow_gettimeoffset

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-01-08 05:53:53 -08:00
Tejun Heo
c431069fe4 cpuset: remove cpuset->parent
cgroup already tracks the hierarchy.  Follow cgroup->parent to find
the parent and drop cpuset->parent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:08 -08:00
Tejun Heo
fc560a26ac cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()
Implement cpuset_for_each_descendant_pre() and replace the
cpuset-specific tree walking using cpuset->stack_list with it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:08 -08:00
Tejun Heo
5d21cc2db0 cpuset: replace cgroup_mutex locking with cpuset internal locking
Supposedly for historical reasons, cpuset depends on cgroup core for
locking.  It depends on cgroup_mutex in cgroup callbacks and grabs
cgroup_mutex from other places where it wants to be synchronized.
This is majorly messy and highly prone to introducing circular locking
dependency especially because cgroup_mutex is supposed to be one of
the outermost locks.

As previous patches already plugged possible races which may happen by
decoupling from cgroup_mutex, replacing cgroup_mutex with cpuset
specific cpuset_mutex is mostly straight-forward.  Introduce
cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
cpuset_mutex locking to places which inherited cgroup_mutex from
cgroup core.

The only complication is from cpuset wanting to initiate task
migration when a cpuset loses all cpus or memory nodes.  Task
migration may go through full cgroup and all subsystem locking and
should be initiated without holding any cpuset specific lock; however,
a previous patch already made hotplug handled asynchronously and
moving the task migration part outside other locks is easy.
cpuset_propagate_hotplug_workfn() now invokes
remove_tasks_in_empty_cpuset() without holding any lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:08 -08:00
Tejun Heo
02bb586372 cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty
cpuset is scheduled to be decoupled from cgroup_lock which will make
hotplug handling race with task migration.  cpus or mems will be
allowed to go offline between ->can_attach() and ->attach().  If
hotplug takes down all cpus or mems of a cpuset while attach is in
progress, ->attach() may end up putting tasks into an empty cpuset.

This patchset makes ->attach() schedule hotplug propagation if the
cpuset is empty after attaching is complete.  This will move the tasks
to the nearest ancestor which can execute and the end result would be
as if hotplug handling happened after the tasks finished attaching.

cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
wait for propagations scheduled directly by cpuset_attach().

This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:08 -08:00
Tejun Heo
452477fa68 cpuset: pin down cpus and mems while a task is being attached
cpuset is scheduled to be decoupled from cgroup_lock which will make
configuration updates race with task migration.  Any config update
will be allowed to happen between ->can_attach() and ->attach().  If
such config update removes either all cpus or mems, by the time
->attach() is called, the condition verified by ->can_attach(), that
the cpuset is capable of hosting the tasks, is no longer true.

This patch adds cpuset->attach_in_progress which is incremented from
->can_attach() and decremented when the attach operation finishes
either successfully or not.  validate_change() treats cpusets w/
non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
remove all cpus or mems from it.

This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
8d03394877 cpuset: make CPU / memory hotplug propagation asynchronous
cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
directly to propagate hotplug updates to !root cpusets; however, this
has the following problems.

* cpuset locking is scheduled to be decoupled from cgroup_mutex,
  cgroup_mutex will be unexported, and cgroup_attach_task() will do
  cgroup locking internally, so propagation can't synchronously move
  tasks to a parent cgroup while walking the hierarchy.

* We can't use cgroup generic tree iterator because propagation to
  each cpuset may sleep.  With propagation done asynchronously, we can
  lose the rather ugly cpuset specific iteration.

Convert cpuset_propagate_hotplug() to
cpuset_propagate_hotplug_workfn() and execute it from newly added
cpuset->hotplug_work.  The work items are run on an ordered workqueue,
so the propagation order is preserved.  cpuset_hotplug_workfn()
schedules all propagations while holding cgroup_mutex and waits for
completion without cgroup_mutex.  Each in-flight propagation holds a
reference to the cpuset->css.

This patch doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
699140ba83 cpuset: drop async_rebuild_sched_domains()
In general, we want to make cgroup_mutex one of the outermost locks
and be able to use get_online_cpus() and friends from cgroup methods.
With cpuset hotplug made async, get_online_cpus() can now be nested
inside cgroup_mutex.

Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
by bouncing sched_domain rebuilding to a work item.  As such nesting
is allowed now, remove the workqueue bouncing code and always rebuild
sched_domains synchronously.  This also nests sched_domains_mutex
inside cgroup_mutex, which is intended and should be okay.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
3a5a6d0c2b cpuset: don't nest cgroup_mutex inside get_online_cpus()
CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
event notifications.  We want to separate cpuset locking from cgroup
core and make cgroup_mutex outer to hotplug synchronization so that,
among other things, mechanisms which depend on get_online_cpus() can
be used from cgroup callbacks.  In general, we want to keep
cgroup_mutex the outermost lock to minimize locking interactions among
different controllers.

Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
schedule it from the hotplug notifications.  As the function can
already handle multiple mixed events without any input, converting it
to a work function is mostly trivial; however, one complication is
that cpuset_update_active_cpus() needs to update sched domains
synchronously to reflect an offlined cpu to avoid confusing the
scheduler.  This is worked around by falling back to the the default
single sched domain synchronously before scheduling the actual hotplug
work.  This makes sched domain rebuilt twice per CPU hotplug event but
the operation isn't that heavy and a lot of the second operation would
be noop for systems w/ single sched domain, which is the common case.

This decouples cpuset hotplug handling from the notification callbacks
and there can be an arbitrary delay between the actual event and
updates to cpusets.  Scheduler and mm can handle it fine but moving
tasks out of an empty cpuset may race against writes to the cpuset
restoring execution resources which can lead to confusing behavior.
Flush hotplug work item from cpuset_write_resmask() to avoid such
confusions.

v2: Synchronous sched domain rebuilding using the fallback sched
    domain added.  This fixes various issues caused by confused
    scheduler putting tasks on a dead CPU, including the one reported
    by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
deb7aa308e cpuset: reorganize CPU / memory hotplug handling
Reorganize hotplug path to prepare for async hotplug handling.

* Both CPU and memory hotplug handlings are collected into a single
  function - cpuset_handle_hotplug().  It doesn't take any argument
  but compares the current setttings of top_cpuset against what's
  actually available to determine what happened.  This function
  directly updates top_cpuset.  If there are CPUs or memory nodes
  which are taken down, cpuset_propagate_hotplug() in invoked on all
  !root cpusets.

* cpuset_propagate_hotplug() is responsible for updating the specified
  cpuset so that it doesn't include any resource which isn't available
  to top_cpuset.  If no CPU or memory is left after update, all tasks
  are moved to the nearest ancestor with both resources.

* update_tasks_cpumask() and update_tasks_nodemask() are now always
  called after cpus or mems masks are updated even if the cpuset
  doesn't have any task.  This is for brevity and not expected to have
  any measureable effect.

* cpu_active_mask and N_HIGH_MEMORY are read exactly once per
  cpuset_handle_hotplug() invocation, all cpusets share the same view
  of what resources are available, and cpuset_handle_hotplug() can
  handle multiple resources going up and down.  These properties will
  allow async operation.

The reorganization, while drastic, is equivalent and shouldn't cause
any behavior difference.  This will enable making hotplug handling
async and remove get_online_cpus() -> cgroup_mutex nesting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
4e4c9a140f cpuset: cleanup cpuset[_can]_attach()
cpuset_can_attach() prepare global variables cpus_attach and
cpuset_attach_nodemask_{to|from} which are used by cpuset_attach().
There is no reason to prepare in cpuset_can_attach().  The same
information can be accessed from cpuset_attach().

Move the prepartion logic from cpuset_can_attach() to cpuset_attach()
and make the global variables static ones inside cpuset_attach().

With this change, there's no reason to keep
cpuset_attach_nodemask_{from|to} global.  Move them inside
cpuset_attach().  Unfortunately, we need to keep cpus_attach global as
it can't be allocated from cpuset_attach().

v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
    Russell.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2013-01-07 08:51:07 -08:00
Tejun Heo
ae8086ce15 cpuset: introduce cpuset_for_each_child()
Instead of iterating cgroup->children directly, introduce and use
cpuset_for_each_child() which wraps cgroup_for_each_child() and
performs online check.  As it uses the generic iterator, it requires
RCU read locking too.

As cpuset is currently protected by cgroup_mutex, non-online cpusets
aren't visible to all the iterations and this patch currently doesn't
make any functional difference.  This will be used to de-couple cpuset
locking from cgroup core.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
efeb77b2f1 cpuset: introduce CS_ONLINE
Add CS_ONLINE which is set from css_online() and cleared from
css_offline().  This will enable using generic cgroup iterator while
allowing decoupling cpuset from cgroup internal locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
c8f699bb56 cpuset: introduce ->css_on/offline()
Add cpuset_css_on/offline() and rearrange css init/exit such that,

* Allocation and clearing to the default values happen in css_alloc().
  Allocation now uses kzalloc().

* Config inheritance and registration happen in css_online().

* css_offline() undoes what css_online() did.

* css_free() frees.

This doesn't introduce any visible behavior changes.  This will help
cleaning up locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
0772324ae6 cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()
The function isn't that hot, the overhead of missing the fast exit is
low, the test itself depends heavily on cgroup internals, and it's
gonna be a hindrance when trying to decouple cpuset locking from
cgroup core.  Remove the fast exit path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
01c889cf4f cpuset: remove unused cpuset_unlock()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:51:07 -08:00
Tejun Heo
12a9d2fef1 cgroup: implement cgroup_rightmost_descendant()
Implement cgroup_rightmost_descendant() which returns the right most
descendant of the specified cgroup.  This can be used to skip the
cgroup's subtree while iterating with
cgroup_for_each_descendant_pre().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:50:28 -08:00
Linus Torvalds
d0631c6e09 Merge branch 'akpm' (fixes from Andrew)
Merge emailed fixes from Andrew Morton:
 "Bunch of fixes:

   - delayed IPC updates.  I held back on this because of some possible
     outstanding bug reports, but they appear to have been addressed in
     later versions

   - A bunch of MAINTAINERS updates

   - Yet Another RTC driver.  I'd held this back while a couple of
     little issues were being worked out.

  I'm expecting an intrusive-but-simple patchset from Joe Perches which
  splits up printk.c into kernel/printk/*.  That will be a pig to
  maintain for two months so if it passes testing I'd like to get it
  upstream after a week or so."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  printk: fix incorrect length from print_time() when seconds > 99999
  drivers/rtc/rtc-vt8500.c: fix handling of data passed in struct rtc_time
  drivers/rtc/rtc-vt8500.c: correct handling of CR_24H bitfield
  rtc: add RTC driver for TPS6586x
  MAINTAINERS: fix drivers/staging/sm7xx/
  MAINTAINERS: remove include/linux/of_pwm.h
  MAINTAINERS: remove arch/*/lib/perf_event*.c
  MAINTAINERS: remove drivers/mmc/host/imxmmc.*
  MAINTAINERS: fix Documentation/mei/
  MAINTAINERS: remove arch/x86/platform/mrst/pmu.*
  MAINTAINERS: remove firmware/isci/
  MAINTAINERS: fix drivers/ieee802154/
  MAINTAINERS: fix .../plat-mxc/include/mach/imxfb.h
  MAINTAINERS: remove drivers/video/epson1355fb.c
  MAINTAINERS: fix drivers/media/usb/dvb-usb/cxusb*
  MAINTAINERS: adjust for UAPI
  MAINTAINERS: fix drivers/media/platform/atmel-isi.c
  MAINTAINERS: fix arch/arm/mach-at91/include/mach/at_hdmac.h
  MAINTAINERS: fix drivers/rtc/rtc-vt8500.c
  MAINTAINERS: remove arch/arm/plat-s5p/
  ...
2013-01-07 07:42:38 -08:00
Oleg Nesterov
0c4a842349 signals: set_current_blocked() can use __set_current_blocked()
Cleanup.  And I think we need more cleanups, in particular
__set_current_blocked() and sigprocmask() should die.  Nobody should
ever block SIGKILL or SIGSTOP.

 - Change set_current_blocked() to use __set_current_blocked()

 - Change sys_sigprocmask() to use set_current_blocked(), this way it
   should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-05 19:34:54 -08:00
Oleg Nesterov
5ba53ff648 signals: sys_ssetmask() uses uninitialized newmask
Commit 77097ae503b1 ("most of set_current_blocked() callers want
SIGKILL/SIGSTOP removed from set") removed the initialization of newmask
by accident, causing ltp to complain like this:

  ssetmask01    1  TFAIL  :  sgetmask() failed: TEST_ERRNO=???(0): Success

Restore the proper initialization.

Reported-and-tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org	# v3.5+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-05 19:34:54 -08:00
Roland Dreier
35dac27ced printk: fix incorrect length from print_time() when seconds > 99999
print_prefix() passes a NULL buf to print_time() to get the length of
the time prefix; when printk times are enabled, the current code just
returns the constant 15, which matches the format "[%5lu.%06lu] " used
to print the time value.  However, this is obviously incorrect when the
whole seconds part of the time gets beyond 5 digits (100000 seconds is a
bit more than a day of uptime).

The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
length of the time prefix.  This could be micro-optimized but it seems
better to have simpler, more readable code here.

The bug leads to the syslog system call miscomputing which messages fit
into the userspace buffer.  If there are enough messages to fill
log_buf_len and some have a timestamp >= 100000, dmesg may fail with:

    # dmesg
    klogctl: Bad address

When this happens, strace shows that the failure is indeed EFAULT due to
the kernel mistakenly accessing past the end of dmesg's buffer, since
dmesg asks the kernel how big a buffer it needs, allocates a bit more,
and then gets an error when it asks the kernel to fill it:

    syslog(0xa, 0, 0)                       = 1048576
    mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
    syslog(0x3, 0x7fa4d25d2010, 0x100008)   = -1 EFAULT (Bad address)

As far as I can see, the bug has been there as long as print_time(),
which comes from commit 084681d14e42 ("printk: flush continuation lines
immediately to console") in 3.5-rc5.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04 16:11:48 -08:00
Sasha Levin
52441fa8f2 module: prevent warning when finit_module a 0 sized file
If we try to finit_module on a file sized 0 bytes vmalloc will
scream and spit out a warning.

Since modules have to be bigger than 0 bytes anyways we can just
check that beforehand and avoid the warning.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-01-03 11:10:32 +10:30
Li Zefan
923c753823 userns: Allow unprivileged reboot
In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-12-26 20:29:30 -08:00
Al Viro
b2ddedcd21 x32: fix sigtimedwait
It needs 64bit timespec.  As it is, we end up truncating the timeout
to whole seconds; usually it doesn't matter, but for having all
sub-second timeouts truncated to one jiffy is visibly wrong.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26 01:15:03 -05:00
Al Viro
a566c28882 x32: fix waitid()
It needs 64bit rusage and 32bit siginfo.  glibc never calls it with
non-NULL rusage pointer, or we would've seen breakage already...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26 01:15:03 -05:00
Al Viro
8d9807b109 switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
Strictly speaking, ppc64 needs it for C ABI compliance.  Realistically
I would be very surprised if e.g. passing 0xffffffff as 'options'
argument to waitid() from 32bit task would cause problems, but yes,
it puts us into undefined behaviour territory.  ppc64 expects int
argument to be passed in 64bit register with bits 31..63 containing
the same value.  SYSCALL_DEFINE on ppc provides a wrapper that normalizes
the value passed from userland; so does COMPAT_SYSCALL_DEFINE.  Plain
declaration of compat_sys_something() with an int argument obviously
doesn't.  Again, for wait4 and waitid I would be extremely surprised
if gcc started to produce code depending on that value having been
properly sign-extended - the argument(s) in question end up passed
blindly to sys_wait4 and sys_waitid resp. and normalization for native
syscalls takes care of their use there.  Still, better to use
COMPAT_SYSCALL_DEFINE here than worry about nasal daemons...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26 01:15:02 -05:00
Al Viro
90228fc110 switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
Makes sigaltstack conversion easier to split into per-architecture
parts.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26 01:15:02 -05:00
Eric W. Biederman
c876ad7682 pidns: Stop pid allocation when init dies
Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 16:10:05 -08:00
Eric W. Biederman
8382fcac1b pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

Creates a new process in the new pid namespace without setting
pid_ns->child_reaper.  After forking this results in a NULL
pointer dereference.

Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_prodcess.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Acked-by:  Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-24 22:53:14 -08:00
Stephen Warren
7b1f62076b time: convert arch_gettimeoffset to a pointer
Currently, whenever CONFIG_ARCH_USES_GETTIMEOFFSET is enabled, each
arch core provides a single implementation of arch_gettimeoffset(). In
many cases, different sub-architectures, different machines, or
different timer providers exist, and so the arch ends up implementing
arch_gettimeoffset() as a call-through-pointer anyway. Examples are
ARM, Cris, M68K, and it's arguable that the remaining architectures,
M32R and Blackfin, should be doing this anyway.

Modify arch_gettimeoffset so that it itself is a function pointer, which
the arch initializes. This will allow later changes to move the
initialization of this function into individual machine support or timer
drivers. This is particularly useful for code in drivers/clocksource
which should rely on an arch-independant mechanism to register their
implementation of arch_gettimeoffset().

This patch also converts the Cris architecture to set arch_gettimeoffset
directly to the final implementation in time_init(), because Cris already
had separate time_init() functions per sub-architecture. M68K and ARM
are converted to set arch_gettimeoffset to the final implementation in
later patches, because they already have function pointers in place for
this purpose.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
2012-12-24 09:36:07 -07:00
Linus Torvalds
96680d2b91 Merge branch 'for-next' of git://git.infradead.org/users/eparis/notify
Pull filesystem notification updates from Eric Paris:
 "This pull mostly is about locking changes in the fsnotify system.  By
  switching the group lock from a spin_lock() to a mutex() we can now
  hold the lock across things like iput().  This fixes a problem
  involving unmounting a fs and having inodes be busy, first pointed out
  by FAT, but reproducible with tmpfs.

  This also restores signal driven I/O for inotify, which has been
  broken since about 2.6.32."

Ugh.  I *hate* the timing of this.  It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes.  That's just crap.  But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.

Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.

* 'for-next' of git://git.infradead.org/users/eparis/notify:
  inotify: automatically restart syscalls
  inotify: dont skip removal of watch descriptor if creation of ignored event failed
  fanotify: dont merge permission events
  fsnotify: make fasync generic for both inotify and fanotify
  fsnotify: change locking order
  fsnotify: dont put marks on temporary list when clearing marks by group
  fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
  fsnotify: pass group to fsnotify_destroy_mark()
  fsnotify: use a mutex instead of a spinlock to protect a groups mark list
  fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
  fsnotify: take groups mark_lock before mark lock
  fsnotify: use reference counting for groups
  fsnotify: introduce fsnotify_get_group()
  inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
2012-12-20 20:11:52 -08:00
Linus Torvalds
4c9a44aebe Merge branch 'akpm' (Andrew's patch-bomb)
Merge the rest of Andrew's patches for -rc1:
 "A bunch of fixes and misc missed-out-on things.

  That'll do for -rc1.  I still have a batch of IPC patches which still
  have a possible bug report which I'm chasing down."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  keys: use keyring_alloc() to create module signing keyring
  keys: fix unreachable code
  sendfile: allows bypassing of notifier events
  SGI-XP: handle non-fatal traps
  fat: fix incorrect function comment
  Documentation: ABI: remove testing/sysfs-devices-node
  proc: fix inconsistent lock state
  linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
  memcg: don't register hotcpu notifier from ->css_alloc()
  checkpatch: warn on uapi #includes that #include <uapi/...
  revert "rtc: recycle id when unloading a rtc driver"
  mm: clean up transparent hugepage sysfs error messages
  hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
  hfsplus: rework processing of hfs_btree_write() returned error
  hfsplus: rework processing errors in hfsplus_free_extents()
  hfsplus: avoid crash on failed block map free
  kcmp: include linux/ptrace.h
  drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
  mm: cma: WARN if freed memory is still in use
  exec: do not leave bprm->interp on stack
  ...
2012-12-20 20:00:43 -08:00
Linus Torvalds
54d46ea993 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "sigaltstack infrastructure + conversion for x86, alpha and um,
  COMPAT_SYSCALL_DEFINE infrastructure.

  Note that there are several conflicts between "unify
  SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
  resolution is trivial - just remove definitions of SS_ONSTACK and
  SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
  include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to generic sigaltstack
  new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
  generic compat_sys_sigaltstack()
  introduce generic sys_sigaltstack(), switch x86 and um to it
  new helper: compat_user_stack_pointer()
  new helper: restore_altstack()
  unify SS_ONSTACK/SS_DISABLE definitions
  new helper: current_user_stack_pointer()
  missing user_stack_pointer() instances
  Bury the conditionals from kernel_thread/kernel_execve series
  COMPAT_SYSCALL_DEFINE: infrastructure
2012-12-20 18:05:28 -08:00
David Howells
cfde819088 keys: use keyring_alloc() to create module signing keyring
Use keyring_alloc() to create special keyrings now that it has
a permissions parameter rather than using key_alloc() +
key_instantiate_and_link().

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:21 -08:00
Cyrill Gorcunov
44fd07e989 kcmp: include linux/ptrace.h
This makes it compile on s390. After all the ptrace_may_access
(which we use this file) is declared exactly in linux/ptrace.h.

This is preparatory work to wire this syscall up on all archs.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Alexander Kartashov <alekskartashov@parallels.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Hugh Dickins
2832bc19f6 sched: numa: ksm: fix oops in task_numa_placment()
task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got
called in the handling of break_ksm() for ksmd.  That might be a
peculiar case, which perhaps KSM could takes steps to avoid? but it's
more robust if task_numa_placement() allows for such a possibility.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 07:06:56 -08:00
Linus Torvalds
7005cd3970 A few /dev/random improvements for the v3.8 merge window.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQz+SFAAoJENNvdpvBGATww0QP+gLGPbydQbW25SF2SUjcBdAA
 tGLTFmEIAATxfQihsuMsBjnNIuc9gLbQMTvEd0flhkxkc6wFAcaJA5Q6SuEv64jV
 frz+T36v1hLP3xCq2b0z93yHAadRq1twALgGzCjSQh9Od73kY4DOOqj/1DZO9CvA
 cPbP7FIqlVhHLYtfLv7m8OMVkTjgyKvDhWcKZyaN5ticVzZImSbOMHXQ7SX9jnpc
 ktz+vHc48Lnix8NGmodZF81QEtLWheGhKRwOiifpBq7BKmFyiUJNEDOaQHofcgCb
 LRjNvsGkhKo36xf/T84pXPj17fmhOHKChAfOABarGY8SzNRbgD7DcsEqT0YXO71r
 MV17L9kxS34ULYPdbXs8QRO9q0v0vS2YQletT/oykFdb895cp8oX4rFHu4TFgoPV
 S6oDR0UD7T/OsJ9nsvqjxxH2UJeCTrYMi5JD71ywsY805WOEn4gUc3TLsfscqmte
 gMVzxQP46JuNBVEsZVKf4oIeeRSMH/Ja8pHLPjOLvQ4nszqnLl+WaSqJQWSSfCv8
 5hJfIpX+CX+mJuEiskiHatbam8anZYD5m/TXaizjAdG80YiAgaMBA7fh7oK/mgTq
 1OjKAnEQhOAlDCTCp9szk7ye1f3ivdCy0Hr6MvvrGTdQEY5b3Y7lt74gEQmhjTrv
 Fhb2FX7lLcDv7NGvyBqQ
 =tTqL
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random updates from Ted Ts'o:
 "A few /dev/random improvements for the v3.8 merge window."

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: Mix cputime from each thread that exits to the pool
  random: prime last_data value per fips requirements
  random: fix debug format strings
  random: make it possible to enable debugging without rebuild
2012-12-19 20:23:37 -08:00
Al Viro
c40702c49f new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
note that they are relying on access_ok() already checked by caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:41 -05:00
Al Viro
9026843952 generic compat_sys_sigaltstack()
Again, conditional on CONFIG_GENERIC_SIGALTSTACK

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:41 -05:00
Al Viro
6bf9adfc90 introduce generic sys_sigaltstack(), switch x86 and um to it
Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
select it are completely unaffected

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:40 -05:00
Al Viro
5c49574ffd new helper: restore_altstack()
to be used by rt_sigreturn instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:40 -05:00
Al Viro
ae903caae2 Bury the conditionals from kernel_thread/kernel_execve series
All architectures have
	CONFIG_GENERIC_KERNEL_THREAD
	CONFIG_GENERIC_KERNEL_EXECVE
	__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:38 -05:00
Bjørn Mork
3935e89505 watchdog: Fix disable/enable regression
Commit 8d4516904b39 ("watchdog: Fix CPU hotplug regression") causes an
oops or hard lockup when doing

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

and the kernel is booted with nmi_watchdog=1 (default)

Running laptop-mode-tools and disconnecting/connecting AC power will
cause this to trigger, making it a common failure scenario on laptops.

Instead of bailing out of watchdog_disable() when !watchdog_enabled we
can initialize the hrtimer regardless of watchdog_enabled status.  This
makes it safe to call watchdog_disable() in the nmi_watchdog=0 case,
without the negative effect on the enabled => disabled => enabled case.

All these tests pass with this patch:
- nmi_watchdog=1
  echo 0 > /proc/sys/kernel/nmi_watchdog
  echo 1 > /proc/sys/kernel/nmi_watchdog

- nmi_watchdog=0
  echo 0 > /sys/devices/system/cpu/cpu1/online

- nmi_watchdog=0
  echo mem > /sys/power/state

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 12:10:33 -08:00
Tejun Heo
023f27d3d6 workqueue: fix find_worker_executing_work() brekage from hashtable conversion
42f8570f43 ("workqueue: use new hashtable implementation") incorrectly
made busy workers hashed by the pointer value of worker instead of
work.  This broke find_worker_executing_work() which in turn broke a
lot of fundamental operations of workqueue - non-reentrancy and
flushing among others.  The flush malfunction triggered warning in
disk event code in Fengguang's automated test.

 write_dev_root_ (3265) used greatest stack depth: 2704 bytes left
 ------------[ cut here ]------------
 WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0x\
cf/0x108()
 Hardware name: Bochs
 Modules linked in:
 Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167
 Call Trace:
  [<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c
  [<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff816aea77>] disk_clear_events+0xcf/0x108
  [<ffffffff811bd8be>] check_disk_change+0x27/0x59
  [<ffffffff822e48e2>] cdrom_open+0x49/0x68b
  [<ffffffff81ab0291>] idecd_open+0x88/0xb7
  [<ffffffff811be58f>] __blkdev_get+0x102/0x3ec
  [<ffffffff811bea08>] blkdev_get+0x18f/0x30f
  [<ffffffff811bebfd>] blkdev_open+0x75/0x80
  [<ffffffff8118f510>] do_dentry_open+0x1ea/0x295
  [<ffffffff8118f5f0>] finish_open+0x35/0x41
  [<ffffffff8119c720>] do_last+0x878/0xa25
  [<ffffffff8119c993>] path_openat+0xc6/0x333
  [<ffffffff8119cf37>] do_filp_open+0x38/0x86
  [<ffffffff81190170>] do_sys_open+0x6c/0xf9
  [<ffffffff8119021e>] sys_open+0x21/0x23
  [<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
2012-12-19 11:24:06 -08:00
Linus Torvalds
7a684c452e Nothing all that exciting; a new module-from-fd syscall for those who want
to verify the source of the module (ChromeOS) and/or use standard IMA on it
 or other security hooks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ0VKlAAoJENkgDmzRrbjxjuEQALVHpD1cSmryOzVwkNn7rVGP
 PV3KVbUs+qzUCm2c3AafIIlSBm2LOUl+cR3uNC7di8aHarRF3VHkK2OQ4Fx97ECd
 KKBqAyY3R0q1mAKujb/MWwiK0YgosEDIOzGGn2yQhNFsxKqnMB02P4j82IO7+g+w
 Cc3XuDyWHoH2I+ySgz0Q8NHAqufD/DMZUKud7jw2Lsv6PuICJ1Oqgl/Gd/muxort
 4a5tV3tjhRGywHS/8b2fbDUXkybC5NKK0FN+gyoaROmJ/THeHEQDGXZT9bc2vmVx
 HvRy/5k8dzQ6LAJ2mLnPvy0pmv0u7NYMvjxTxxUlUkFMkYuVticikQfwSYDbDPt4
 mbsLxchpgi8z4x8HltEERffCX5tldo/5hz1uemqhqIsMRIrRFnlHkSIgkGjVHf2u
 LXQBLT8uTm6C0VyNQPrI/hUZzIax7WtKbPSoK9lmExNbKqloEFh/mVXvfQxei2kp
 wnUZcnmPIqSvw7b4CWu7HibMYu2VvGBgm3YIfJRi4AQme1mzFYLpZoxF5Pj+Ykbt
 T//Hb1EsNQTTFCg7MZhnJSAw/EVUvNDUoullORClyqw6+xxjVKqWpPJgYDRfWOlJ
 Xa+s7DNrL+Oo1WWR8l5ruoQszbR8szIyeyPKKxRUcQj2zsqghoWuzKAx2saSEw3W
 pNkoJU+dGC7kG/yVAS8N
 =uoJj
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module update from Rusty Russell:
 "Nothing all that exciting; a new module-from-fd syscall for those who
  want to verify the source of the module (ChromeOS) and/or use standard
  IMA on it or other security hooks."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  MODSIGN: Fix kbuild output when using default extra_certificates
  MODSIGN: Avoid using .incbin in C source
  modules: don't hand 0 to vmalloc.
  module: Remove a extra null character at the top of module->strtab.
  ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants
  ASN.1: Define indefinite length marker constant
  moduleparam: use __UNIQUE_ID()
  __UNIQUE_ID()
  MODSIGN: Add modules_sign make target
  powerpc: add finit_module syscall.
  ima: support new kernel module syscall
  add finit_module syscall to asm-generic
  ARM: add finit_module syscall to ARM
  security: introduce kernel_module_from_file hook
  module: add flags arg to sys_finit_module()
  module: add syscall to load module from fd
2012-12-19 07:55:08 -08:00
Linus Torvalds
673ab8783b Merge branch 'akpm' (more patches from Andrew)
Merge patches from Andrew Morton:
 "Most of the rest of MM, plus a few dribs and drabs.

  I still have quite a few irritating patches left around: ones with
  dubious testing results, lack of review, ones which should have gone
  via maintainer trees but the maintainers are slack, etc.

  I need to be more activist in getting these things wrapped up outside
  the merge window, but they're such a PITA."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (48 commits)
  mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
  vmscan: comment too_many_isolated()
  mm/kmemleak.c: remove obsolete simple_strtoul
  mm/memory_hotplug.c: improve comments
  mm/hugetlb: create hugetlb cgroup file in hugetlb_init
  mm/mprotect.c: coding-style cleanups
  Documentation: ABI: /sys/devices/system/node/
  slub: drop mutex before deleting sysfs entry
  memcg: add comments clarifying aspects of cache attribute propagation
  kmem: add slab-specific documentation about the kmem controller
  slub: slub-specific propagation changes
  slab: propagate tunable values
  memcg: aggregate memcg cache values in slabinfo
  memcg/sl[au]b: shrink dead caches
  memcg/sl[au]b: track all the memcg children of a kmem_cache
  memcg: destroy memcg caches
  sl[au]b: allocate objects from memcg cache
  sl[au]b: always get the cache from its page in kmem_cache_free()
  memcg: skip memcg kmem allocations in specified code regions
  memcg: infrastructure to match an allocation to the right cache
  ...
2012-12-18 15:08:12 -08:00
Glauber Costa
2ad306b17c fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.

For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path.  Once the slab is also tracked by memcg, we can
get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 15:02:13 -08:00