IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The rcutorture tests need to be able to trace the time of the
beginning of an RCU read-side critical section, and thus need access
to trace_clock_local(). This commit therefore adds a the needed
EXPORT_SYMBOL_GPL().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The as-documented rcu_nocb_poll will fail to enable this feature
for two reasons. (1) there is an extra "s" in the documented
name which is not in the code, and (2) since it uses module_param,
it really is expecting a prefix, akin to "rcutree.fanout_leaf"
and the prefix isn't documented.
However, there are several reasons why we might not want to
simply fix the typo and add the prefix:
1) we'd end up with rcutree.rcu_nocb_poll, and rather probably make
a change to rcutree.nocb_poll
2) if we did #1, then the prefix wouldn't be consistent with the
rcu_nocbs=<cpumap> parameter (i.e. one with, one without prefix)
3) the use of module_param in a header file is less than desired,
since it isn't immediately obvious that it will get processed
via rcutree.c and get the prefix from that (although use of
module_param_named() could clarify that.)
4) the implied export of /sys/module/rcutree/parameters/rcu_nocb_poll
data to userspace via module_param() doesn't really buy us anything,
as it is read-only and we can tell if it is enabled already without
it, since there is a printk at early boot telling us so.
In light of all that, just change it from a module_param() to an
early_setup() call, and worry about adding it to /sys later on if
we decide to allow a dynamic setting of it.
Also change the variable to be tagged as read_mostly, since it
will only ever be fiddled with at most, once at boot.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The wait_event() at the head of the rcu_nocb_kthread() can result in
soft-lockup complaints if the CPU in question does not register RCU
callbacks for an extended period. This commit therefore changes
the wait_event() to a wait_event_interruptible().
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This branch contains a number of cleanups and unifications to various
timer- clock-events- and ARM timer code. The main points are:
1) Convert arch_gettimeoffset to a pointer, so that architectures with
multiple timer implementations can simply set this standard pointer
rather than maintaining their own arch-specific pointers for the
same purpose. Various architectures are converted to using this new
feature.
2) Conversion of ARM timer implementations to use clock_event_devices's
suspend/resume operations, rather than the ARM-specific sys_timer
versions. Thus, the ARM code begins to use more common infra-structure
rather than arch-specific code.
3) Removal of ARM's struct sys_timer completely, now that everything uses
common code.
4) Introduction of drivers/clocksource/clksrc-of.c, which allows ARM clock
source implementations to be moved into drivers/clocksource, with the
need to add SoC-specific header files for each timer initialization
function; instead, all enabled implementations are registered into a
table which a single core function iterates over, and calls the
relevant initialization functions based on device tree. At least the
Tegra and BCM2835 clocksource implementations will use this feature in
the 3.9 kernel cycle.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQ5yJiAAoJEMzrak5tbycxaPQQAI4gxhVfJk44R8NudWX/Q+6l
xq60iILp9JfdOYmQY7FzzxON5DkDrosbI8TKS8K74VV7Lx7sP7iKc7BS1jtVKbFP
ow/oqFlDlPL00Ne/Zzf7A4zDI0CYvGwNvkND+YcARrW+PRO+omGcU1qnUVhMOUCp
s5v6Xa5rgUZWJ6QslKagfgHRpZMFtf1e74waS4zVWP0HymyWU5v9x8GaGKSt6Aj6
tk1Of/Bc7dhmsul2QZdolSqZbu1lTR3QnaKltMAFklbHjJKNR6w7llQBCkbp5Qyg
OUzMVUpDjQzfPMQyu2ovzEQ6gMZYX/XuVNlnAmLcy9b1A0TExkRHICiEpkGdVn4b
Kh4MQWW4/V19pOgROs8+L/XfRmi96EDEMeb/kaVo7r5iMO85UwouRpBP/KLPKvZ+
2pXTdZmbOAQhu5OKzx1q8pD9gm+quMs3fy8Fc7F0hZkXQUlqWHAcBcV8Bm7hW8u+
32gUpUcZV45sdK6x+POLY+6F836aFWdgBug90fhnGKqq9HcDDvzST09ctrtKoSSZ
LfSvklv8szYnVu09vvO5HZTKMcMSC5Y8Uo4RZkBKKTOMWFbhWvkpqOMEW/+R+r2X
FrUVRI7SShbXV2W2EVRFRsTwMALy6gr859MclQGmNzwbdjM68xjeKdnsv0TTAAwM
emqx0qzopWUN2Btp3yPM
=DOD6
-----END PGP SIGNATURE-----
Merge tag 'swarren-for-3.9-arm-timer-rework' of git://git.kernel.org/pub/scm/linux/kernel/git/swarren/linux-tegra into next/cleanup
From Stephen Warren:
ARM/...: timer and clock events cleanup, and remove struct sys_timer
This branch contains a number of cleanups and unifications to various
timer- clock-events- and ARM timer code. The main points are:
1) Convert arch_gettimeoffset to a pointer, so that architectures with
multiple timer implementations can simply set this standard pointer
rather than maintaining their own arch-specific pointers for the
same purpose. Various architectures are converted to using this new
feature.
2) Conversion of ARM timer implementations to use clock_event_devices's
suspend/resume operations, rather than the ARM-specific sys_timer
versions. Thus, the ARM code begins to use more common infra-structure
rather than arch-specific code.
3) Removal of ARM's struct sys_timer completely, now that everything uses
common code.
4) Introduction of drivers/clocksource/clksrc-of.c, which allows ARM clock
source implementations to be moved into drivers/clocksource, with the
need to add SoC-specific header files for each timer initialization
function; instead, all enabled implementations are registered into a
table which a single core function iterates over, and calls the
relevant initialization functions based on device tree. At least the
Tegra and BCM2835 clocksource implementations will use this feature in
the 3.9 kernel cycle.
* tag 'swarren-for-3.9-arm-timer-rework' of git://git.kernel.org/pub/scm/linux/kernel/git/swarren/linux-tegra:
clocksource: add common of_clksrc_init() function
ARM: delete struct sys_timer
ARM: remove struct sys_timer suspend and resume fields
ARM: samsung: register syscore_ops for timer resume directly
ARM: ux500: convert timer suspend/resume to clock_event_device
ARM: sa1100: convert timer suspend/resume to clock_event_device
ARM: pxa: convert timer suspend/resume to clock_event_device
ARM: at91: convert timer suspend/resume to clock_event_device
ARM: set arch_gettimeoffset directly
m68k: set arch_gettimeoffset directly
time: convert arch_gettimeoffset to a pointer
cris: move usec/nsec conversion to do_slow_gettimeoffset
Signed-off-by: Olof Johansson <olof@lixom.net>
cgroup already tracks the hierarchy. Follow cgroup->parent to find
the parent and drop cpuset->parent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Implement cpuset_for_each_descendant_pre() and replace the
cpuset-specific tree walking using cpuset->stack_list with it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Supposedly for historical reasons, cpuset depends on cgroup core for
locking. It depends on cgroup_mutex in cgroup callbacks and grabs
cgroup_mutex from other places where it wants to be synchronized.
This is majorly messy and highly prone to introducing circular locking
dependency especially because cgroup_mutex is supposed to be one of
the outermost locks.
As previous patches already plugged possible races which may happen by
decoupling from cgroup_mutex, replacing cgroup_mutex with cpuset
specific cpuset_mutex is mostly straight-forward. Introduce
cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
cpuset_mutex locking to places which inherited cgroup_mutex from
cgroup core.
The only complication is from cpuset wanting to initiate task
migration when a cpuset loses all cpus or memory nodes. Task
migration may go through full cgroup and all subsystem locking and
should be initiated without holding any cpuset specific lock; however,
a previous patch already made hotplug handled asynchronously and
moving the task migration part outside other locks is easy.
cpuset_propagate_hotplug_workfn() now invokes
remove_tasks_in_empty_cpuset() without holding any lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cpuset is scheduled to be decoupled from cgroup_lock which will make
hotplug handling race with task migration. cpus or mems will be
allowed to go offline between ->can_attach() and ->attach(). If
hotplug takes down all cpus or mems of a cpuset while attach is in
progress, ->attach() may end up putting tasks into an empty cpuset.
This patchset makes ->attach() schedule hotplug propagation if the
cpuset is empty after attaching is complete. This will move the tasks
to the nearest ancestor which can execute and the end result would be
as if hotplug handling happened after the tasks finished attaching.
cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
wait for propagations scheduled directly by cpuset_attach().
This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cpuset is scheduled to be decoupled from cgroup_lock which will make
configuration updates race with task migration. Any config update
will be allowed to happen between ->can_attach() and ->attach(). If
such config update removes either all cpus or mems, by the time
->attach() is called, the condition verified by ->can_attach(), that
the cpuset is capable of hosting the tasks, is no longer true.
This patch adds cpuset->attach_in_progress which is incremented from
->can_attach() and decremented when the attach operation finishes
either successfully or not. validate_change() treats cpusets w/
non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
remove all cpus or mems from it.
This currently doesn't make any functional difference as everything is
protected by cgroup_mutex but enables decoupling the locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
directly to propagate hotplug updates to !root cpusets; however, this
has the following problems.
* cpuset locking is scheduled to be decoupled from cgroup_mutex,
cgroup_mutex will be unexported, and cgroup_attach_task() will do
cgroup locking internally, so propagation can't synchronously move
tasks to a parent cgroup while walking the hierarchy.
* We can't use cgroup generic tree iterator because propagation to
each cpuset may sleep. With propagation done asynchronously, we can
lose the rather ugly cpuset specific iteration.
Convert cpuset_propagate_hotplug() to
cpuset_propagate_hotplug_workfn() and execute it from newly added
cpuset->hotplug_work. The work items are run on an ordered workqueue,
so the propagation order is preserved. cpuset_hotplug_workfn()
schedules all propagations while holding cgroup_mutex and waits for
completion without cgroup_mutex. Each in-flight propagation holds a
reference to the cpuset->css.
This patch doesn't cause any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
In general, we want to make cgroup_mutex one of the outermost locks
and be able to use get_online_cpus() and friends from cgroup methods.
With cpuset hotplug made async, get_online_cpus() can now be nested
inside cgroup_mutex.
Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
by bouncing sched_domain rebuilding to a work item. As such nesting
is allowed now, remove the workqueue bouncing code and always rebuild
sched_domains synchronously. This also nests sched_domains_mutex
inside cgroup_mutex, which is intended and should be okay.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
event notifications. We want to separate cpuset locking from cgroup
core and make cgroup_mutex outer to hotplug synchronization so that,
among other things, mechanisms which depend on get_online_cpus() can
be used from cgroup callbacks. In general, we want to keep
cgroup_mutex the outermost lock to minimize locking interactions among
different controllers.
Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
schedule it from the hotplug notifications. As the function can
already handle multiple mixed events without any input, converting it
to a work function is mostly trivial; however, one complication is
that cpuset_update_active_cpus() needs to update sched domains
synchronously to reflect an offlined cpu to avoid confusing the
scheduler. This is worked around by falling back to the the default
single sched domain synchronously before scheduling the actual hotplug
work. This makes sched domain rebuilt twice per CPU hotplug event but
the operation isn't that heavy and a lot of the second operation would
be noop for systems w/ single sched domain, which is the common case.
This decouples cpuset hotplug handling from the notification callbacks
and there can be an arbitrary delay between the actual event and
updates to cpusets. Scheduler and mm can handle it fine but moving
tasks out of an empty cpuset may race against writes to the cpuset
restoring execution resources which can lead to confusing behavior.
Flush hotplug work item from cpuset_write_resmask() to avoid such
confusions.
v2: Synchronous sched domain rebuilding using the fallback sched
domain added. This fixes various issues caused by confused
scheduler putting tasks on a dead CPU, including the one reported
by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reorganize hotplug path to prepare for async hotplug handling.
* Both CPU and memory hotplug handlings are collected into a single
function - cpuset_handle_hotplug(). It doesn't take any argument
but compares the current setttings of top_cpuset against what's
actually available to determine what happened. This function
directly updates top_cpuset. If there are CPUs or memory nodes
which are taken down, cpuset_propagate_hotplug() in invoked on all
!root cpusets.
* cpuset_propagate_hotplug() is responsible for updating the specified
cpuset so that it doesn't include any resource which isn't available
to top_cpuset. If no CPU or memory is left after update, all tasks
are moved to the nearest ancestor with both resources.
* update_tasks_cpumask() and update_tasks_nodemask() are now always
called after cpus or mems masks are updated even if the cpuset
doesn't have any task. This is for brevity and not expected to have
any measureable effect.
* cpu_active_mask and N_HIGH_MEMORY are read exactly once per
cpuset_handle_hotplug() invocation, all cpusets share the same view
of what resources are available, and cpuset_handle_hotplug() can
handle multiple resources going up and down. These properties will
allow async operation.
The reorganization, while drastic, is equivalent and shouldn't cause
any behavior difference. This will enable making hotplug handling
async and remove get_online_cpus() -> cgroup_mutex nesting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cpuset_can_attach() prepare global variables cpus_attach and
cpuset_attach_nodemask_{to|from} which are used by cpuset_attach().
There is no reason to prepare in cpuset_can_attach(). The same
information can be accessed from cpuset_attach().
Move the prepartion logic from cpuset_can_attach() to cpuset_attach()
and make the global variables static ones inside cpuset_attach().
With this change, there's no reason to keep
cpuset_attach_nodemask_{from|to} global. Move them inside
cpuset_attach(). Unfortunately, we need to keep cpus_attach global as
it can't be allocated from cpuset_attach().
v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
Russell.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Instead of iterating cgroup->children directly, introduce and use
cpuset_for_each_child() which wraps cgroup_for_each_child() and
performs online check. As it uses the generic iterator, it requires
RCU read locking too.
As cpuset is currently protected by cgroup_mutex, non-online cpusets
aren't visible to all the iterations and this patch currently doesn't
make any functional difference. This will be used to de-couple cpuset
locking from cgroup core.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Add CS_ONLINE which is set from css_online() and cleared from
css_offline(). This will enable using generic cgroup iterator while
allowing decoupling cpuset from cgroup internal locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Add cpuset_css_on/offline() and rearrange css init/exit such that,
* Allocation and clearing to the default values happen in css_alloc().
Allocation now uses kzalloc().
* Config inheritance and registration happen in css_online().
* css_offline() undoes what css_online() did.
* css_free() frees.
This doesn't introduce any visible behavior changes. This will help
cleaning up locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The function isn't that hot, the overhead of missing the fast exit is
low, the test itself depends heavily on cgroup internals, and it's
gonna be a hindrance when trying to decouple cpuset locking from
cgroup core. Remove the fast exit path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Implement cgroup_rightmost_descendant() which returns the right most
descendant of the specified cgroup. This can be used to skip the
cgroup's subtree while iterating with
cgroup_for_each_descendant_pre().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Merge emailed fixes from Andrew Morton:
"Bunch of fixes:
- delayed IPC updates. I held back on this because of some possible
outstanding bug reports, but they appear to have been addressed in
later versions
- A bunch of MAINTAINERS updates
- Yet Another RTC driver. I'd held this back while a couple of
little issues were being worked out.
I'm expecting an intrusive-but-simple patchset from Joe Perches which
splits up printk.c into kernel/printk/*. That will be a pig to
maintain for two months so if it passes testing I'd like to get it
upstream after a week or so."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
printk: fix incorrect length from print_time() when seconds > 99999
drivers/rtc/rtc-vt8500.c: fix handling of data passed in struct rtc_time
drivers/rtc/rtc-vt8500.c: correct handling of CR_24H bitfield
rtc: add RTC driver for TPS6586x
MAINTAINERS: fix drivers/staging/sm7xx/
MAINTAINERS: remove include/linux/of_pwm.h
MAINTAINERS: remove arch/*/lib/perf_event*.c
MAINTAINERS: remove drivers/mmc/host/imxmmc.*
MAINTAINERS: fix Documentation/mei/
MAINTAINERS: remove arch/x86/platform/mrst/pmu.*
MAINTAINERS: remove firmware/isci/
MAINTAINERS: fix drivers/ieee802154/
MAINTAINERS: fix .../plat-mxc/include/mach/imxfb.h
MAINTAINERS: remove drivers/video/epson1355fb.c
MAINTAINERS: fix drivers/media/usb/dvb-usb/cxusb*
MAINTAINERS: adjust for UAPI
MAINTAINERS: fix drivers/media/platform/atmel-isi.c
MAINTAINERS: fix arch/arm/mach-at91/include/mach/at_hdmac.h
MAINTAINERS: fix drivers/rtc/rtc-vt8500.c
MAINTAINERS: remove arch/arm/plat-s5p/
...
Cleanup. And I think we need more cleanups, in particular
__set_current_blocked() and sigprocmask() should die. Nobody should
ever block SIGKILL or SIGSTOP.
- Change set_current_blocked() to use __set_current_blocked()
- Change sys_sigprocmask() to use set_current_blocked(), this way it
should not worry about SIGKILL/SIGSTOP.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
print_prefix() passes a NULL buf to print_time() to get the length of
the time prefix; when printk times are enabled, the current code just
returns the constant 15, which matches the format "[%5lu.%06lu] " used
to print the time value. However, this is obviously incorrect when the
whole seconds part of the time gets beyond 5 digits (100000 seconds is a
bit more than a day of uptime).
The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
length of the time prefix. This could be micro-optimized but it seems
better to have simpler, more readable code here.
The bug leads to the syslog system call miscomputing which messages fit
into the userspace buffer. If there are enough messages to fill
log_buf_len and some have a timestamp >= 100000, dmesg may fail with:
# dmesg
klogctl: Bad address
When this happens, strace shows that the failure is indeed EFAULT due to
the kernel mistakenly accessing past the end of dmesg's buffer, since
dmesg asks the kernel how big a buffer it needs, allocates a bit more,
and then gets an error when it asks the kernel to fill it:
syslog(0xa, 0, 0) = 1048576
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
syslog(0x3, 0x7fa4d25d2010, 0x100008) = -1 EFAULT (Bad address)
As far as I can see, the bug has been there as long as print_time(),
which comes from commit 084681d14e42 ("printk: flush continuation lines
immediately to console") in 3.5-rc5.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we try to finit_module on a file sized 0 bytes vmalloc will
scream and spit out a warning.
Since modules have to be bigger than 0 bytes anyways we can just
check that beforehand and avoid the warning.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
It needs 64bit timespec. As it is, we end up truncating the timeout
to whole seconds; usually it doesn't matter, but for having all
sub-second timeouts truncated to one jiffy is visibly wrong.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It needs 64bit rusage and 32bit siginfo. glibc never calls it with
non-NULL rusage pointer, or we would've seen breakage already...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Strictly speaking, ppc64 needs it for C ABI compliance. Realistically
I would be very surprised if e.g. passing 0xffffffff as 'options'
argument to waitid() from 32bit task would cause problems, but yes,
it puts us into undefined behaviour territory. ppc64 expects int
argument to be passed in 64bit register with bits 31..63 containing
the same value. SYSCALL_DEFINE on ppc provides a wrapper that normalizes
the value passed from userland; so does COMPAT_SYSCALL_DEFINE. Plain
declaration of compat_sys_something() with an int argument obviously
doesn't. Again, for wait4 and waitid I would be extremely surprised
if gcc started to produce code depending on that value having been
properly sign-extended - the argument(s) in question end up passed
blindly to sys_wait4 and sys_waitid resp. and normalization for native
syscalls takes care of their use there. Still, better to use
COMPAT_SYSCALL_DEFINE here than worry about nasal daemons...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.
Can lead to processes attempting access their child reaper and
instead following a stale pointer.
That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.
Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.
Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)
Creates a new process in the new pid namespace without setting
pid_ns->child_reaper. After forking this results in a NULL
pointer dereference.
Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_prodcess.
Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently, whenever CONFIG_ARCH_USES_GETTIMEOFFSET is enabled, each
arch core provides a single implementation of arch_gettimeoffset(). In
many cases, different sub-architectures, different machines, or
different timer providers exist, and so the arch ends up implementing
arch_gettimeoffset() as a call-through-pointer anyway. Examples are
ARM, Cris, M68K, and it's arguable that the remaining architectures,
M32R and Blackfin, should be doing this anyway.
Modify arch_gettimeoffset so that it itself is a function pointer, which
the arch initializes. This will allow later changes to move the
initialization of this function into individual machine support or timer
drivers. This is particularly useful for code in drivers/clocksource
which should rely on an arch-independant mechanism to register their
implementation of arch_gettimeoffset().
This patch also converts the Cris architecture to set arch_gettimeoffset
directly to the final implementation in time_init(), because Cris already
had separate time_init() functions per sub-architecture. M68K and ARM
are converted to set arch_gettimeoffset to the final implementation in
later patches, because they already have function pointers in place for
this purpose.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Pull filesystem notification updates from Eric Paris:
"This pull mostly is about locking changes in the fsnotify system. By
switching the group lock from a spin_lock() to a mutex() we can now
hold the lock across things like iput(). This fixes a problem
involving unmounting a fs and having inodes be busy, first pointed out
by FAT, but reproducible with tmpfs.
This also restores signal driven I/O for inotify, which has been
broken since about 2.6.32."
Ugh. I *hate* the timing of this. It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes. That's just crap. But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.
Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.
* 'for-next' of git://git.infradead.org/users/eparis/notify:
inotify: automatically restart syscalls
inotify: dont skip removal of watch descriptor if creation of ignored event failed
fanotify: dont merge permission events
fsnotify: make fasync generic for both inotify and fanotify
fsnotify: change locking order
fsnotify: dont put marks on temporary list when clearing marks by group
fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
fsnotify: pass group to fsnotify_destroy_mark()
fsnotify: use a mutex instead of a spinlock to protect a groups mark list
fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
fsnotify: take groups mark_lock before mark lock
fsnotify: use reference counting for groups
fsnotify: introduce fsnotify_get_group()
inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
Merge the rest of Andrew's patches for -rc1:
"A bunch of fixes and misc missed-out-on things.
That'll do for -rc1. I still have a batch of IPC patches which still
have a possible bug report which I'm chasing down."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
keys: use keyring_alloc() to create module signing keyring
keys: fix unreachable code
sendfile: allows bypassing of notifier events
SGI-XP: handle non-fatal traps
fat: fix incorrect function comment
Documentation: ABI: remove testing/sysfs-devices-node
proc: fix inconsistent lock state
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
memcg: don't register hotcpu notifier from ->css_alloc()
checkpatch: warn on uapi #includes that #include <uapi/...
revert "rtc: recycle id when unloading a rtc driver"
mm: clean up transparent hugepage sysfs error messages
hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
hfsplus: rework processing of hfs_btree_write() returned error
hfsplus: rework processing errors in hfsplus_free_extents()
hfsplus: avoid crash on failed block map free
kcmp: include linux/ptrace.h
drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
mm: cma: WARN if freed memory is still in use
exec: do not leave bprm->interp on stack
...
Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.
Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."
Fixed up conflicts as per Al.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure
Use keyring_alloc() to create special keyrings now that it has
a permissions parameter rather than using key_alloc() +
key_instantiate_and_link().
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes it compile on s390. After all the ptrace_may_access
(which we use this file) is declared exactly in linux/ptrace.h.
This is preparatory work to wire this syscall up on all archs.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Alexander Kartashov <alekskartashov@parallels.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got
called in the handling of break_ksm() for ksmd. That might be a
peculiar case, which perhaps KSM could takes steps to avoid? but it's
more robust if task_numa_placement() allows for such a possibility.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQz+SFAAoJENNvdpvBGATww0QP+gLGPbydQbW25SF2SUjcBdAA
tGLTFmEIAATxfQihsuMsBjnNIuc9gLbQMTvEd0flhkxkc6wFAcaJA5Q6SuEv64jV
frz+T36v1hLP3xCq2b0z93yHAadRq1twALgGzCjSQh9Od73kY4DOOqj/1DZO9CvA
cPbP7FIqlVhHLYtfLv7m8OMVkTjgyKvDhWcKZyaN5ticVzZImSbOMHXQ7SX9jnpc
ktz+vHc48Lnix8NGmodZF81QEtLWheGhKRwOiifpBq7BKmFyiUJNEDOaQHofcgCb
LRjNvsGkhKo36xf/T84pXPj17fmhOHKChAfOABarGY8SzNRbgD7DcsEqT0YXO71r
MV17L9kxS34ULYPdbXs8QRO9q0v0vS2YQletT/oykFdb895cp8oX4rFHu4TFgoPV
S6oDR0UD7T/OsJ9nsvqjxxH2UJeCTrYMi5JD71ywsY805WOEn4gUc3TLsfscqmte
gMVzxQP46JuNBVEsZVKf4oIeeRSMH/Ja8pHLPjOLvQ4nszqnLl+WaSqJQWSSfCv8
5hJfIpX+CX+mJuEiskiHatbam8anZYD5m/TXaizjAdG80YiAgaMBA7fh7oK/mgTq
1OjKAnEQhOAlDCTCp9szk7ye1f3ivdCy0Hr6MvvrGTdQEY5b3Y7lt74gEQmhjTrv
Fhb2FX7lLcDv7NGvyBqQ
=tTqL
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random updates from Ted Ts'o:
"A few /dev/random improvements for the v3.8 merge window."
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: Mix cputime from each thread that exits to the pool
random: prime last_data value per fips requirements
random: fix debug format strings
random: make it possible to enable debugging without rebuild
Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
select it are completely unaffected
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All architectures have
CONFIG_GENERIC_KERNEL_THREAD
CONFIG_GENERIC_KERNEL_EXECVE
__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 8d4516904b39 ("watchdog: Fix CPU hotplug regression") causes an
oops or hard lockup when doing
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
and the kernel is booted with nmi_watchdog=1 (default)
Running laptop-mode-tools and disconnecting/connecting AC power will
cause this to trigger, making it a common failure scenario on laptops.
Instead of bailing out of watchdog_disable() when !watchdog_enabled we
can initialize the hrtimer regardless of watchdog_enabled status. This
makes it safe to call watchdog_disable() in the nmi_watchdog=0 case,
without the negative effect on the enabled => disabled => enabled case.
All these tests pass with this patch:
- nmi_watchdog=1
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
- nmi_watchdog=0
echo 0 > /sys/devices/system/cpu/cpu1/online
- nmi_watchdog=0
echo mem > /sys/power/state
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661
Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
42f8570f43 ("workqueue: use new hashtable implementation") incorrectly
made busy workers hashed by the pointer value of worker instead of
work. This broke find_worker_executing_work() which in turn broke a
lot of fundamental operations of workqueue - non-reentrancy and
flushing among others. The flush malfunction triggered warning in
disk event code in Fengguang's automated test.
write_dev_root_ (3265) used greatest stack depth: 2704 bytes left
------------[ cut here ]------------
WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0x\
cf/0x108()
Hardware name: Bochs
Modules linked in:
Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167
Call Trace:
[<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c
[<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c
[<ffffffff816aea77>] disk_clear_events+0xcf/0x108
[<ffffffff811bd8be>] check_disk_change+0x27/0x59
[<ffffffff822e48e2>] cdrom_open+0x49/0x68b
[<ffffffff81ab0291>] idecd_open+0x88/0xb7
[<ffffffff811be58f>] __blkdev_get+0x102/0x3ec
[<ffffffff811bea08>] blkdev_get+0x18f/0x30f
[<ffffffff811bebfd>] blkdev_open+0x75/0x80
[<ffffffff8118f510>] do_dentry_open+0x1ea/0x295
[<ffffffff8118f5f0>] finish_open+0x35/0x41
[<ffffffff8119c720>] do_last+0x878/0xa25
[<ffffffff8119c993>] path_openat+0xc6/0x333
[<ffffffff8119cf37>] do_filp_open+0x38/0x86
[<ffffffff81190170>] do_sys_open+0x6c/0xf9
[<ffffffff8119021e>] sys_open+0x21/0x23
[<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
to verify the source of the module (ChromeOS) and/or use standard IMA on it
or other security hooks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQ0VKlAAoJENkgDmzRrbjxjuEQALVHpD1cSmryOzVwkNn7rVGP
PV3KVbUs+qzUCm2c3AafIIlSBm2LOUl+cR3uNC7di8aHarRF3VHkK2OQ4Fx97ECd
KKBqAyY3R0q1mAKujb/MWwiK0YgosEDIOzGGn2yQhNFsxKqnMB02P4j82IO7+g+w
Cc3XuDyWHoH2I+ySgz0Q8NHAqufD/DMZUKud7jw2Lsv6PuICJ1Oqgl/Gd/muxort
4a5tV3tjhRGywHS/8b2fbDUXkybC5NKK0FN+gyoaROmJ/THeHEQDGXZT9bc2vmVx
HvRy/5k8dzQ6LAJ2mLnPvy0pmv0u7NYMvjxTxxUlUkFMkYuVticikQfwSYDbDPt4
mbsLxchpgi8z4x8HltEERffCX5tldo/5hz1uemqhqIsMRIrRFnlHkSIgkGjVHf2u
LXQBLT8uTm6C0VyNQPrI/hUZzIax7WtKbPSoK9lmExNbKqloEFh/mVXvfQxei2kp
wnUZcnmPIqSvw7b4CWu7HibMYu2VvGBgm3YIfJRi4AQme1mzFYLpZoxF5Pj+Ykbt
T//Hb1EsNQTTFCg7MZhnJSAw/EVUvNDUoullORClyqw6+xxjVKqWpPJgYDRfWOlJ
Xa+s7DNrL+Oo1WWR8l5ruoQszbR8szIyeyPKKxRUcQj2zsqghoWuzKAx2saSEw3W
pNkoJU+dGC7kG/yVAS8N
=uoJj
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"Nothing all that exciting; a new module-from-fd syscall for those who
want to verify the source of the module (ChromeOS) and/or use standard
IMA on it or other security hooks."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Fix kbuild output when using default extra_certificates
MODSIGN: Avoid using .incbin in C source
modules: don't hand 0 to vmalloc.
module: Remove a extra null character at the top of module->strtab.
ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants
ASN.1: Define indefinite length marker constant
moduleparam: use __UNIQUE_ID()
__UNIQUE_ID()
MODSIGN: Add modules_sign make target
powerpc: add finit_module syscall.
ima: support new kernel module syscall
add finit_module syscall to asm-generic
ARM: add finit_module syscall to ARM
security: introduce kernel_module_from_file hook
module: add flags arg to sys_finit_module()
module: add syscall to load module from fd
Merge patches from Andrew Morton:
"Most of the rest of MM, plus a few dribs and drabs.
I still have quite a few irritating patches left around: ones with
dubious testing results, lack of review, ones which should have gone
via maintainer trees but the maintainers are slack, etc.
I need to be more activist in getting these things wrapped up outside
the merge window, but they're such a PITA."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (48 commits)
mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
vmscan: comment too_many_isolated()
mm/kmemleak.c: remove obsolete simple_strtoul
mm/memory_hotplug.c: improve comments
mm/hugetlb: create hugetlb cgroup file in hugetlb_init
mm/mprotect.c: coding-style cleanups
Documentation: ABI: /sys/devices/system/node/
slub: drop mutex before deleting sysfs entry
memcg: add comments clarifying aspects of cache attribute propagation
kmem: add slab-specific documentation about the kmem controller
slub: slub-specific propagation changes
slab: propagate tunable values
memcg: aggregate memcg cache values in slabinfo
memcg/sl[au]b: shrink dead caches
memcg/sl[au]b: track all the memcg children of a kmem_cache
memcg: destroy memcg caches
sl[au]b: allocate objects from memcg cache
sl[au]b: always get the cache from its page in kmem_cache_free()
memcg: skip memcg kmem allocations in specified code regions
memcg: infrastructure to match an allocation to the right cache
...
Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.
This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.
This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.
For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path. Once the slab is also tracked by memcg, we can
get rid of that flag.
Tested to successfully protect against :(){ :|:& };:
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>