2005-04-17 02:20:36 +04:00
/*
* linux / init / main . c
*
* Copyright ( C ) 1991 , 1992 Linus Torvalds
*
* GK 2 / 5 / 95 - Changed to support mounting root fs via NFS
* Added initrd & change_root : Werner Almesberger & Hans Lermen , Feb ' 96
* Moan early if gcc is old , avoiding bogus kernels - Paul Gortmaker , May ' 96
2014-08-09 01:23:44 +04:00
* Simplified starting of init : Michael A . Griffith < grif @ acm . org >
2005-04-17 02:20:36 +04:00
*/
2013-04-30 03:18:20 +04:00
# define DEBUG /* Enable initcall_debug */
2005-04-17 02:20:36 +04:00
# include <linux/types.h>
# include <linux/module.h>
# include <linux/proc_fs.h>
# include <linux/kernel.h>
# include <linux/syscalls.h>
2008-02-14 11:41:09 +03:00
# include <linux/stackprotector.h>
2005-04-17 02:20:36 +04:00
# include <linux/string.h>
# include <linux/ctype.h>
# include <linux/delay.h>
# include <linux/ioport.h>
# include <linux/init.h>
# include <linux/initrd.h>
# include <linux/bootmem.h>
2009-06-13 04:42:08 +04:00
# include <linux/acpi.h>
2005-04-17 02:20:36 +04:00
# include <linux/tty.h>
# include <linux/percpu.h>
# include <linux/kmod.h>
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 07:27:03 +04:00
# include <linux/vmalloc.h>
2005-04-17 02:20:36 +04:00
# include <linux/kernel_stat.h>
2006-12-07 04:14:08 +03:00
# include <linux/start_kernel.h>
2005-04-17 02:20:36 +04:00
# include <linux/security.h>
2008-06-26 13:21:34 +04:00
# include <linux/smp.h>
2005-04-17 02:20:36 +04:00
# include <linux/profile.h>
# include <linux/rcupdate.h>
# include <linux/moduleparam.h>
# include <linux/kallsyms.h>
# include <linux/writeback.h>
# include <linux/cpu.h>
# include <linux/cpuset.h>
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 10:39:30 +04:00
# include <linux/cgroup.h>
2005-04-17 02:20:36 +04:00
# include <linux/efi.h>
2007-02-16 12:28:01 +03:00
# include <linux/tick.h>
2007-02-18 08:22:39 +03:00
# include <linux/interrupt.h>
2006-07-14 11:24:40 +04:00
# include <linux/taskstats_kern.h>
2006-07-14 11:24:36 +04:00
# include <linux/delayacct.h>
2005-04-17 02:20:36 +04:00
# include <linux/unistd.h>
# include <linux/rmap.h>
# include <linux/mempolicy.h>
# include <linux/key.h>
2006-06-27 13:53:54 +04:00
# include <linux/buffer_head.h>
mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.
This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.
Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.
To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.
Others are completely same with previous extension code in memcg.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:55:46 +03:00
# include <linux/page_ext.h>
2006-07-03 11:24:33 +04:00
# include <linux/debug_locks.h>
2008-04-30 11:55:01 +04:00
# include <linux/debugobjects.h>
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 11:24:50 +04:00
# include <linux/lockdep.h>
2009-06-11 16:22:39 +04:00
# include <linux/kmemleak.h>
2006-12-08 13:38:01 +03:00
# include <linux/pid_namespace.h>
2006-12-20 00:01:28 +03:00
# include <linux/device.h>
2007-05-09 13:34:32 +04:00
# include <linux/kthread.h>
2007-11-10 00:39:39 +03:00
# include <linux/sched.h>
2008-02-06 12:36:44 +03:00
# include <linux/signal.h>
2008-04-29 12:03:13 +04:00
# include <linux/idr.h>
2010-05-21 06:04:29 +04:00
# include <linux/kgdb.h>
2008-08-14 23:45:08 +04:00
# include <linux/ftrace.h>
2009-01-07 19:45:46 +03:00
# include <linux/async.h>
2008-04-04 02:51:41 +04:00
# include <linux/kmemcheck.h>
2009-08-14 23:13:46 +04:00
# include <linux/sfi.h>
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
very early at kernel initialization, before any driver-core device
is registered. Every device with a major/minor will provide a
device node in devtmpfs.
Devtmpfs can be changed and altered by userspace at any time,
and in any way needed - just like today's udev-mounted tmpfs.
Unmodified udev versions will run just fine on top of it, and will
recognize an already existing kernel-created device node and use it.
The default node permissions are root:root 0600. Proper permissions
and user/group ownership, meaningful symlinks, all other policy still
needs to be applied by userspace.
If a node is created by devtmps, devtmpfs will remove the device node
when the device goes away. If the device node was created by
userspace, or the devtmpfs created node was replaced by userspace, it
will no longer be removed by devtmpfs.
If it is requested to auto-mount it, it makes init=/bin/sh work
without any further userspace support. /dev will be fully populated
and dynamic, and always reflect the current device state of the kernel.
With the commonly used dynamic device numbers, it solves the problem
where static devices nodes may point to the wrong devices.
It is intended to make the initial bootup logic simpler and more robust,
by de-coupling the creation of the inital environment, to reliably run
userspace processes, from a complex userspace bootstrap logic to provide
a working /dev.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Tested-By: Harald Hoyer <harald@redhat.com>
Tested-By: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 17:23:42 +04:00
# include <linux/shmem_fs.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2010-11-18 01:17:33 +03:00
# include <linux/perf_event.h>
2012-06-24 09:56:45 +04:00
# include <linux/file.h>
2012-10-11 05:28:25 +04:00
# include <linux/ptrace.h>
2013-01-19 02:05:56 +04:00
# include <linux/blkdev.h>
# include <linux/elevator.h>
2013-06-02 10:39:40 +04:00
# include <linux/sched_clock.h>
2013-07-11 21:12:32 +04:00
# include <linux/context_tracking.h>
2013-09-10 18:52:35 +04:00
# include <linux/random.h>
2014-06-05 03:12:17 +04:00
# include <linux/list.h>
2014-11-05 18:01:15 +03:00
# include <linux/integrity.h>
take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-01 17:57:28 +03:00
# include <linux/proc_ns.h>
2005-04-17 02:20:36 +04:00
# include <asm/io.h>
# include <asm/bugs.h>
# include <asm/setup.h>
2005-07-29 08:15:30 +04:00
# include <asm/sections.h>
2006-01-06 11:12:01 +03:00
# include <asm/cacheflush.h>
2005-04-17 02:20:36 +04:00
2007-02-26 18:45:41 +03:00
static int kernel_init ( void * ) ;
2005-04-17 02:20:36 +04:00
extern void init_IRQ ( void ) ;
extern void fork_init ( unsigned long ) ;
extern void radix_tree_init ( void ) ;
2006-01-06 11:12:01 +03:00
# ifndef CONFIG_DEBUG_RODATA
static inline void mark_rodata_ro ( void ) { }
# endif
2005-04-17 02:20:36 +04:00
2011-01-20 14:06:35 +03:00
/*
* Debug helper : via this flag we know that we are in ' early bootup code '
* where only the boot processor is running with IRQ disabled . This means
* two things - IRQ must not be enabled before the flag is cleared and some
* operations which are not allowed with IRQ disabled are allowed while the
* flag is set .
*/
bool early_boot_irqs_disabled __read_mostly ;
rcu: Teach RCU that idle task is not quiscent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. And cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preeempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 05:03:42 +03:00
enum system_states system_state __read_mostly ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( system_state ) ;
/*
* Boot command - line arguments
*/
# define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
# define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
extern void time_init ( void ) ;
/* Default late time init is NULL. archs can override this later. */
2009-01-07 01:41:10 +03:00
void ( * __initdata late_time_init ) ( void ) ;
2005-04-17 02:20:36 +04:00
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 11:53:52 +03:00
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line [ COMMAND_LINE_SIZE ] ;
/* Untouched saved command line (eg. for /proc) */
char * saved_command_line ;
/* Command line for parameter parsing */
static char * static_command_line ;
2013-10-31 07:15:39 +04:00
/* Command line for per-initcall parameter parsing */
static char * initcall_command_line ;
2005-04-17 02:20:36 +04:00
static char * execute_command ;
2005-09-07 02:17:19 +04:00
static char * ramdisk_execute_command ;
2005-04-17 02:20:36 +04:00
2013-10-19 23:48:53 +04:00
/*
* Used to generate warnings if static_key manipulation functions are used
* before jump_label_init is called .
*/
2014-08-09 01:23:44 +04:00
bool static_key_initialized __read_mostly ;
2013-10-19 23:48:53 +04:00
EXPORT_SYMBOL_GPL ( static_key_initialized ) ;
2007-07-16 10:41:07 +04:00
/*
* If set , this is an indication to the drivers that reset the underlying
* device before going ahead with the initialization otherwise driver might
* rely on the BIOS and skip the reset operation .
*
* This is useful if kernel is booting in an unreliable environment .
* For ex . kdump situaiton where previous kernel has crashed , BIOS has been
* skipped and devices will be in unknown state .
*/
unsigned int reset_devices ;
EXPORT_SYMBOL ( reset_devices ) ;
2005-04-17 02:20:36 +04:00
2006-09-27 12:50:44 +04:00
static int __init set_reset_devices ( char * str )
{
reset_devices = 1 ;
return 1 ;
}
__setup ( " reset_devices " , set_reset_devices ) ;
2014-08-09 01:23:44 +04:00
static const char * argv_init [ MAX_INIT_ARGS + 2 ] = { " init " , NULL , } ;
const char * envp_init [ MAX_INIT_ENVS + 2 ] = { " HOME=/ " , " TERM=linux " , NULL , } ;
2005-04-17 02:20:36 +04:00
static const char * panic_later , * panic_param ;
2010-08-12 09:04:18 +04:00
extern const struct obs_kernel_param __setup_start [ ] , __setup_end [ ] ;
2005-04-17 02:20:36 +04:00
static int __init obsolete_checksetup ( char * line )
{
2010-08-12 09:04:18 +04:00
const struct obs_kernel_param * p ;
2006-09-26 12:52:32 +04:00
int had_early_param = 0 ;
2005-04-17 02:20:36 +04:00
p = __setup_start ;
do {
int n = strlen ( p - > str ) ;
2011-10-10 02:03:37 +04:00
if ( parameqn ( line , p - > str , n ) ) {
2005-04-17 02:20:36 +04:00
if ( p - > early ) {
2006-09-26 12:52:32 +04:00
/* Already done in parse_early_param?
* ( Needs exact match on param part ) .
* Keep iterating , as we can have early
* params and __setups of same names 8 ( */
2005-04-17 02:20:36 +04:00
if ( line [ n ] = = ' \0 ' | | line [ n ] = = ' = ' )
2006-09-26 12:52:32 +04:00
had_early_param = 1 ;
2005-04-17 02:20:36 +04:00
} else if ( ! p - > setup_func ) {
2013-04-30 03:18:20 +04:00
pr_warn ( " Parameter %s is obsolete, ignored \n " ,
p - > str ) ;
2005-04-17 02:20:36 +04:00
return 1 ;
} else if ( p - > setup_func ( line + n ) )
return 1 ;
}
p + + ;
} while ( p < __setup_end ) ;
2006-09-26 12:52:32 +04:00
return had_early_param ;
2005-04-17 02:20:36 +04:00
}
/*
* This should be approx 2 Bo * oMips to start ( note initial shift ) , and will
* still work even if initially too large , it will just take slightly longer
*/
unsigned long loops_per_jiffy = ( 1 < < 12 ) ;
EXPORT_SYMBOL ( loops_per_jiffy ) ;
static int __init debug_kernel ( char * str )
{
2014-06-05 03:11:46 +04:00
console_loglevel = CONSOLE_LOGLEVEL_DEBUG ;
2008-02-08 15:21:58 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
static int __init quiet_kernel ( char * str )
{
2014-06-05 03:11:46 +04:00
console_loglevel = CONSOLE_LOGLEVEL_QUIET ;
2008-02-08 15:21:58 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2008-02-08 15:21:58 +03:00
early_param ( " debug " , debug_kernel ) ;
early_param ( " quiet " , quiet_kernel ) ;
2005-04-17 02:20:36 +04:00
static int __init loglevel ( char * str )
{
2011-09-21 11:51:40 +04:00
int newlevel ;
/*
* Only update loglevel value when a correct setting was passed ,
* to prevent blind crashes ( when loglevel being set to 0 ) that
* are quite hard to debug
*/
if ( get_option ( & str , & newlevel ) ) {
console_loglevel = newlevel ;
return 0 ;
}
return - EINVAL ;
2005-04-17 02:20:36 +04:00
}
2008-02-08 15:21:58 +03:00
early_param ( " loglevel " , loglevel ) ;
2005-04-17 02:20:36 +04:00
2012-04-06 20:53:50 +04:00
/* Change NUL term back to "=", to make "param" the whole string. */
2012-05-03 01:33:37 +04:00
static int __init repair_env_string ( char * param , char * val , const char * unused )
2005-04-17 02:20:36 +04:00
{
if ( val ) {
/* param=val or param="val"? */
if ( val = = param + strlen ( param ) + 1 )
val [ - 1 ] = ' = ' ;
else if ( val = = param + strlen ( param ) + 2 ) {
val [ - 2 ] = ' = ' ;
memmove ( val - 1 , val , strlen ( val ) + 1 ) ;
val - - ;
} else
BUG ( ) ;
}
2012-04-06 20:53:50 +04:00
return 0 ;
}
2014-04-28 06:04:33 +04:00
/* Anything after -- gets handed straight to init. */
static int __init set_init_arg ( char * param , char * val , const char * unused )
{
unsigned int i ;
if ( panic_later )
return 0 ;
repair_env_string ( param , val , unused ) ;
for ( i = 0 ; argv_init [ i ] ; i + + ) {
if ( i = = MAX_INIT_ARGS ) {
panic_later = " init " ;
panic_param = param ;
return 0 ;
}
}
argv_init [ i ] = param ;
return 0 ;
}
2012-04-06 20:53:50 +04:00
/*
* Unknown boot options get handed to init , unless they look like
* unused parameters ( modprobe will find them in / proc / cmdline ) .
*/
2012-05-03 01:33:37 +04:00
static int __init unknown_bootoption ( char * param , char * val , const char * unused )
2012-04-06 20:53:50 +04:00
{
2012-05-03 01:33:37 +04:00
repair_env_string ( param , val , unused ) ;
2005-04-17 02:20:36 +04:00
/* Handle obsolete-style parameters */
if ( obsolete_checksetup ( param ) )
return 0 ;
2009-12-01 07:26:44 +03:00
/* Unused module parameter. */
if ( strchr ( param , ' . ' ) & & ( ! val | | strchr ( param , ' . ' ) < val ) )
2005-04-17 02:20:36 +04:00
return 0 ;
if ( panic_later )
return 0 ;
if ( val ) {
/* Environment option */
unsigned int i ;
for ( i = 0 ; envp_init [ i ] ; i + + ) {
if ( i = = MAX_INIT_ENVS ) {
2014-01-24 03:54:56 +04:00
panic_later = " env " ;
2005-04-17 02:20:36 +04:00
panic_param = param ;
}
if ( ! strncmp ( param , envp_init [ i ] , val - param ) )
break ;
}
envp_init [ i ] = param ;
} else {
/* Command line option */
unsigned int i ;
for ( i = 0 ; argv_init [ i ] ; i + + ) {
if ( i = = MAX_INIT_ARGS ) {
2014-01-24 03:54:56 +04:00
panic_later = " init " ;
2005-04-17 02:20:36 +04:00
panic_param = param ;
}
}
argv_init [ i ] = param ;
}
return 0 ;
}
static int __init init_setup ( char * str )
{
unsigned int i ;
execute_command = str ;
/*
* In case LILO is going to boot us with default command line ,
* it prepends " auto " before the whole cmdline which makes
* the shell think it should execute a script with such name .
* So we ignore all arguments entered _before_ init = . . . [ MJ ]
*/
for ( i = 1 ; i < MAX_INIT_ARGS ; i + + )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( " init= " , init_setup ) ;
2005-09-07 02:17:19 +04:00
static int __init rdinit_setup ( char * str )
{
unsigned int i ;
ramdisk_execute_command = str ;
/* See "auto" comment in init_setup */
for ( i = 1 ; i < MAX_INIT_ARGS ; i + + )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( " rdinit= " , rdinit_setup ) ;
2005-04-17 02:20:36 +04:00
# ifndef CONFIG_SMP
2011-03-23 02:34:06 +03:00
static const unsigned int setup_max_cpus = NR_CPUS ;
2008-03-27 00:23:48 +03:00
static inline void setup_nr_cpu_ids ( void ) { }
2005-04-17 02:20:36 +04:00
static inline void smp_prepare_cpus ( unsigned int maxcpus ) { }
# endif
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 11:53:52 +03:00
/*
* We need to store the untouched command line for future reference .
* We also need to store the touched command line since the parameter
* parsing is performed in place , and we should allow a component to
* store reference of name / value for future reference .
*/
static void __init setup_command_line ( char * command_line )
{
2014-01-22 03:50:21 +04:00
saved_command_line =
memblock_virt_alloc ( strlen ( boot_command_line ) + 1 , 0 ) ;
initcall_command_line =
memblock_virt_alloc ( strlen ( boot_command_line ) + 1 , 0 ) ;
static_command_line = memblock_virt_alloc ( strlen ( command_line ) + 1 , 0 ) ;
2014-08-09 01:23:44 +04:00
strcpy ( saved_command_line , boot_command_line ) ;
strcpy ( static_command_line , command_line ) ;
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 11:53:52 +03:00
}
2005-04-17 02:20:36 +04:00
/*
* We need to finalize in a non - __init function or else race conditions
* between the root thread and the init thread may cause start_kernel to
* be reaped by free_initmem before the root thread has proceeded to
* cpu_idle .
*
* gcc - 3.4 accidentally inlines this function , so use noinline .
*/
2010-06-28 18:51:01 +04:00
static __initdata DECLARE_COMPLETION ( kthreadd_done ) ;
2009-01-07 01:40:38 +03:00
static noinline void __init_refok rest_init ( void )
2005-04-17 02:20:36 +04:00
{
2007-05-09 13:34:32 +04:00
int pid ;
2009-09-03 01:01:24 +04:00
rcu_scheduler_starting ( ) ;
2010-06-28 18:51:01 +04:00
/*
2010-06-30 12:37:11 +04:00
* We need to spawn init first so that it obtains pid 1 , however
2010-06-28 18:51:01 +04:00
* the init task will end up wanting to create kthreads , which , if
* we schedule it before we create kthreadd , will OOPS .
*/
2014-06-05 03:12:19 +04:00
kernel_thread ( kernel_init , NULL , CLONE_FS ) ;
2005-04-17 02:20:36 +04:00
numa_default_policy ( ) ;
2007-05-09 13:34:32 +04:00
pid = kernel_thread ( kthreadd , NULL , CLONE_FS | CLONE_FILES ) ;
2010-02-23 04:04:50 +03:00
rcu_read_lock ( ) ;
2008-04-30 11:54:24 +04:00
kthreadd_task = find_task_by_pid_ns ( pid , & init_pid_ns ) ;
2010-02-23 04:04:50 +03:00
rcu_read_unlock ( ) ;
2010-06-28 18:51:01 +04:00
complete ( & kthreadd_done ) ;
2005-06-28 18:40:42 +04:00
/*
* The boot idle thread must execute schedule ( )
2007-07-09 20:51:58 +04:00
* at least once to get things moving :
2005-06-28 18:40:42 +04:00
*/
2007-07-09 20:51:58 +04:00
init_idle_bootup_task ( current ) ;
2011-03-21 14:33:18 +03:00
schedule_preempt_disabled ( ) ;
2005-11-09 08:39:01 +03:00
/* Call into cpu_idle with preempt disabled */
2013-03-22 01:49:34 +04:00
cpu_startup_entry ( CPUHP_ONLINE ) ;
2007-07-09 20:51:58 +04:00
}
2005-04-17 02:20:36 +04:00
/* Check for early params. */
params: add 3rd arg to option handler callback signature
Add a 3rd arg, named "doing", to unknown-options callbacks invoked
from parse_args(). The arg is passed as:
"Booting kernel" from start_kernel(),
initcall_level_names[i] from do_initcall_level(),
mod->name from load_module(), via parse_args(), parse_one()
parse_args() already has the "name" parameter, which is renamed to
"doing" to better reflect current uses 1,2 above. parse_args() passes
it to an altered parse_one(), which now passes it down into the
unknown option handler callbacks.
The mod->name will be needed to handle dyndbg for loadable modules,
since params passed by modprobe are not qualified (they do not have a
"$modname." prefix), and by the time the unknown-param callback is
called, the module name is not otherwise available.
Minor tweaks:
Add param-name to parse_one's pr_debug(), current message doesnt
identify the param being handled, add it.
Add a pr_info to print current level and level_name of the initcall,
and number of registered initcalls at that level. This adds 7 lines
to dmesg output, like:
initlevel:6=device, 172 registered initcalls
Drop "parameters" from initcall_level_names[], its unhelpful in the
pr_info() added above. This array is passed into parse_args() by
do_initcall_level().
CC: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-28 00:30:34 +04:00
static int __init do_early_param ( char * param , char * val , const char * unused )
2005-04-17 02:20:36 +04:00
{
2010-08-12 09:04:18 +04:00
const struct obs_kernel_param * p ;
2005-04-17 02:20:36 +04:00
for ( p = __setup_start ; p < __setup_end ; p + + ) {
2011-10-10 02:03:37 +04:00
if ( ( p - > early & & parameq ( param , p - > str ) ) | |
serial: convert early_uart to earlycon for 8250
Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h. the serial8250_ports need to be probed late in
serial initializing stage. the console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup will return -ENDEV, and console
ttyS0 can not be enabled at that time. need to wait till uart_add_one_port in
drivers/serial/serial_core.c to call register_console to get console ttyS0.
that is too late.
Make early_uart to use early_param, so uart console can be used earlier. Make
it to be bootconsole with CON_BOOT flag, so can use console handover feature.
and it will switch to corresponding normal serial console automatically.
new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
it will print in very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
later for console it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 10:37:59 +04:00
( strcmp ( param , " console " ) = = 0 & &
strcmp ( p - > str , " earlycon " ) = = 0 )
) {
2005-04-17 02:20:36 +04:00
if ( p - > setup_func ( val ) ! = 0 )
2013-04-30 03:18:20 +04:00
pr_warn ( " Malformed early option '%s' \n " , param ) ;
2005-04-17 02:20:36 +04:00
}
}
/* We accept everything at this stage. */
return 0 ;
}
2009-03-31 01:37:25 +04:00
void __init parse_early_options ( char * cmdline )
{
2012-03-26 06:20:51 +04:00
parse_args ( " early options " , cmdline , NULL , 0 , 0 , 0 , do_early_param ) ;
2009-03-31 01:37:25 +04:00
}
2005-04-17 02:20:36 +04:00
/* Arch code calls this early on, or if not, just before other parsing. */
void __init parse_early_param ( void )
{
2014-08-09 01:23:44 +04:00
static int done __initdata ;
static char tmp_cmdline [ COMMAND_LINE_SIZE ] __initdata ;
2005-04-17 02:20:36 +04:00
if ( done )
return ;
/* All fall through to do_early_param. */
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 11:53:52 +03:00
strlcpy ( tmp_cmdline , boot_command_line , COMMAND_LINE_SIZE ) ;
2009-03-31 01:37:25 +04:00
parse_early_options ( tmp_cmdline ) ;
2005-04-17 02:20:36 +04:00
done = 1 ;
}
/*
* Activate the first processor .
*/
2006-03-23 13:59:44 +03:00
static void __init boot_cpu_init ( void )
{
int cpu = smp_processor_id ( ) ;
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
2009-01-01 02:42:15 +03:00
set_cpu_online ( cpu , true ) ;
2009-12-16 20:04:31 +03:00
set_cpu_active ( cpu , true ) ;
2009-01-01 02:42:15 +03:00
set_cpu_present ( cpu , true ) ;
set_cpu_possible ( cpu , true ) ;
2006-03-23 13:59:44 +03:00
}
2008-04-18 10:56:18 +04:00
void __init __weak smp_setup_processor_id ( void )
2006-06-30 12:55:50 +04:00
{
}
2012-11-02 17:20:42 +04:00
# if THREAD_SIZE >= PAGE_SIZE
2008-04-18 10:56:15 +04:00
void __init __weak thread_info_cache_init ( void )
{
}
2012-11-02 17:20:42 +04:00
# endif
2008-04-18 10:56:15 +04:00
2009-06-11 19:29:06 +04:00
/*
* Set up kernel memory allocators
*/
static void __init mm_init ( void )
{
mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.
This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.
Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.
To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.
Others are completely same with previous extension code in memcg.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:55:46 +03:00
/*
* page_ext requires contiguous pages ,
* bigger than MAX_ORDER unless SPARSEMEM .
*/
page_ext_init_flatmem ( ) ;
2009-06-11 19:29:06 +04:00
mem_init ( ) ;
kmem_cache_init ( ) ;
2010-06-27 20:50:00 +04:00
percpu_init_late ( ) ;
2014-01-22 03:49:07 +04:00
pgtable_init ( ) ;
2009-06-11 19:29:06 +04:00
vmalloc_init ( ) ;
}
2014-05-02 02:44:38 +04:00
asmlinkage __visible void __init start_kernel ( void )
2005-04-17 02:20:36 +04:00
{
2014-08-09 01:23:44 +04:00
char * command_line ;
char * after_dashes ;
2006-06-30 12:55:50 +04:00
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 11:24:50 +04:00
/*
* Need to run as early as possible , to initialize the
* lockdep hash :
*/
lockdep_init ( ) ;
2014-09-12 17:16:17 +04:00
set_task_stack_end_magic ( & init_task ) ;
2011-11-17 09:34:31 +04:00
smp_setup_processor_id ( ) ;
2008-04-30 11:55:01 +04:00
debug_objects_early_init ( ) ;
2008-02-14 11:44:08 +03:00
/*
* Set up the the initial canary ASAP :
*/
boot_init_stack_canary ( ) ;
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 10:39:30 +04:00
cgroup_init_early ( ) ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 11:24:50 +04:00
local_irq_disable ( ) ;
2011-01-20 14:06:35 +03:00
early_boot_irqs_disabled = true ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 11:24:50 +04:00
2005-04-17 02:20:36 +04:00
/*
* Interrupts are still disabled . Do necessary setups , then
* enable them
*/
2006-03-23 13:59:44 +03:00
boot_cpu_init ( ) ;
2005-04-17 02:20:36 +04:00
page_address_init ( ) ;
2013-04-30 03:18:20 +04:00
pr_notice ( " %s " , linux_banner ) ;
2005-04-17 02:20:36 +04:00
setup_arch ( & command_line ) ;
2011-05-29 22:32:28 +04:00
mm_init_cpumask ( & init_mm ) ;
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 11:53:52 +03:00
setup_command_line ( command_line ) ;
2008-03-27 00:23:48 +03:00
setup_nr_cpu_ids ( ) ;
2009-07-21 12:11:50 +04:00
setup_per_cpu_areas ( ) ;
2006-03-23 13:59:44 +03:00
smp_prepare_boot_cpu ( ) ; /* arch-specific boot-cpu hooks */
2005-04-17 02:20:36 +04:00
2012-08-01 03:43:28 +04:00
build_all_zonelists ( NULL , NULL ) ;
2009-06-10 20:40:04 +04:00
page_alloc_init ( ) ;
2013-04-30 03:18:20 +04:00
pr_notice ( " Kernel command line: %s \n " , boot_command_line ) ;
2009-06-10 20:40:04 +04:00
parse_early_param ( ) ;
2014-04-28 06:04:33 +04:00
after_dashes = parse_args ( " Booting kernel " ,
static_command_line , __start___param ,
__stop___param - __start___param ,
- 1 , - 1 , & unknown_bootoption ) ;
2014-11-11 08:59:46 +03:00
if ( ! IS_ERR_OR_NULL ( after_dashes ) )
2014-04-28 06:04:33 +04:00
parse_args ( " Setting init args " , after_dashes , NULL , 0 , - 1 , - 1 ,
set_init_arg ) ;
2011-10-13 03:17:54 +04:00
jump_label_init ( ) ;
2009-06-10 20:40:04 +04:00
/*
* These use large bootmem allocations and must precede
* kmem_cache_init ( )
*/
2011-05-25 04:13:20 +04:00
setup_log_buf ( 0 ) ;
2009-06-10 20:40:04 +04:00
pidhash_init ( ) ;
vfs_caches_init_early ( ) ;
sort_main_extable ( ) ;
trap_init ( ) ;
2009-06-11 19:29:06 +04:00
mm_init ( ) ;
2011-05-25 04:12:15 +04:00
2005-04-17 02:20:36 +04:00
/*
* Set up the scheduler prior starting any interrupts ( such as the
* timer interrupt ) . Full topology setup happens at smp_init ( )
* time - but meanwhile we still have a functioning scheduler .
*/
sched_init ( ) ;
/*
* Disable preemption - early bootup scheduling is extremely
* fragile until we cpu_idle ( ) for the first time .
*/
preempt_disable ( ) ;
2014-08-09 01:23:44 +04:00
if ( WARN ( ! irqs_disabled ( ) ,
" Interrupts were enabled *very* early, fixing it \n " ) )
2007-01-06 03:36:19 +03:00
local_irq_disable ( ) ;
2010-11-18 01:17:35 +03:00
idr_init_cache ( ) ;
2005-04-17 02:20:36 +04:00
rcu_init ( ) ;
2014-12-13 04:05:10 +03:00
/* trace_printk() and trace points may be used after this */
trace_init ( ) ;
2013-07-11 21:12:32 +04:00
context_tracking_init ( ) ;
2010-02-10 12:20:33 +03:00
radix_tree_init ( ) ;
2008-12-06 05:58:31 +03:00
/* init some links before init_ISA_irqs() */
early_irq_init ( ) ;
2005-04-17 02:20:36 +04:00
init_IRQ ( ) ;
2013-03-05 18:14:05 +04:00
tick_init ( ) ;
2014-10-13 17:44:12 +04:00
rcu_init_nohz ( ) ;
2005-04-17 02:20:36 +04:00
init_timers ( ) ;
2006-01-10 07:52:32 +03:00
hrtimers_init ( ) ;
2005-04-17 02:20:36 +04:00
softirq_init ( ) ;
2006-06-26 11:25:06 +04:00
timekeeping_init ( ) ;
2006-07-03 11:24:04 +04:00
time_init ( ) ;
2013-06-02 10:39:40 +04:00
sched_clock_postinit ( ) ;
perf: Use hrtimers for event multiplexing
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.
This patch changes the timer source for multiplexing. It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.
The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.
In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated. It is stopped when the PMU is not overcommitted.
In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().
The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-03 16:21:33 +04:00
perf_event_init ( ) ;
2006-07-03 11:24:24 +04:00
profile_init ( ) ;
2011-03-29 20:35:04 +04:00
call_function_init ( ) ;
init: scream bloody murder if interrupts are enabled too early
As I was testing a lot of my code recently, and having several
"successes", I accidentally noticed in the dmesg this little line:
start_kernel(): bug: interrupts were enabled *very* early, fixing it
Sure enough, one of my patches two commits ago enabled interrupts early.
The sad part here is that I never noticed it, and I ran several tests with
ktest too, and ktest did not notice this line.
What ktest looks for (and so does many other automated testing scripts) is
a back trace produced by a WARN_ON() or BUG(). As a back trace was never
produced, my buggy patch could have slipped into linux-next, or even
worse, mainline.
Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:
PID hash table entries: 4096 (order: 3, 32768 bytes)
__ex_table already sorted, skipping sort
Checking aperture...
No AGP bridge found
Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
Hardware name: To Be Filled By O.E.M.
Interrupts were enabled *very* early, fixing it
Modules linked in:
Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
Call Trace:
warn_slowpath_common+0x83/0x9b
warn_slowpath_fmt+0x46/0x48
start_kernel+0x21e/0x415
x86_64_start_reservations+0x10e/0x112
x86_64_start_kernel+0x102/0x111
---[ end trace 007d8b0491b4f5d8 ]---
Preemptible hierarchical RCU implementation.
RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
NR_IRQS:4352 nr_irqs:712 16
Console: colour VGA+ 80x25
console [ttyS0] enabled, bootconsole disabled
Do you see it?
The original version of this patch just slapped a WARN_ON() in there and
kept the printk(). Ard van Breemen suggested using the WARN() interface,
which makes the code a bit cleaner.
Also, while examining other warnings in init/main.c, I found two other
locations that deserve a bloody murder scream if their conditions are hit,
and updated them accordingly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 03:18:18 +04:00
WARN ( ! irqs_disabled ( ) , " Interrupts were enabled early \n " ) ;
2011-01-20 14:06:35 +03:00
early_boot_irqs_disabled = false ;
2006-07-03 11:24:24 +04:00
local_irq_enable ( ) ;
2009-06-18 07:24:12 +04:00
2009-06-12 15:03:06 +04:00
kmem_cache_init_late ( ) ;
2005-04-17 02:20:36 +04:00
/*
* HACK ALERT ! This is early . We ' re enabling the console before
* we ' ve done PCI setups etc , and console_init ( ) must be aware of
* this . But we do want output early , in case something goes wrong .
*/
console_init ( ) ;
if ( panic_later )
2014-01-24 03:54:56 +04:00
panic ( " Too many boot %s vars at `%s' " , panic_later ,
panic_param ) ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 11:24:50 +04:00
lockdep_info ( ) ;
2006-07-03 11:24:33 +04:00
/*
* Need to run this when irqs are enabled , because it wants
* to self - test [ hard / soft ] - irqs on / off lock inversion bugs
* too :
*/
locking_selftest ( ) ;
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_BLK_DEV_INITRD
if ( initrd_start & & ! initrd_below_start_ok & &
2008-07-30 09:33:36 +04:00
page_to_pfn ( virt_to_page ( ( void * ) initrd_start ) ) < min_low_pfn ) {
2013-04-30 03:18:20 +04:00
pr_crit ( " initrd overwritten (0x%08lx < 0x%08lx) - disabling it. \n " ,
2008-07-30 09:33:36 +04:00
page_to_pfn ( virt_to_page ( ( void * ) initrd_start ) ) ,
min_low_pfn ) ;
2005-04-17 02:20:36 +04:00
initrd_start = 0 ;
}
# endif
mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.
This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.
Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.
To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.
Others are completely same with previous extension code in memcg.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:55:46 +03:00
page_ext_init ( ) ;
2008-04-30 11:55:01 +04:00
debug_objects_mem_init ( ) ;
2011-05-19 19:25:30 +04:00
kmemleak_init ( ) ;
2005-06-22 04:14:47 +04:00
setup_per_cpu_pageset ( ) ;
2005-04-17 02:20:36 +04:00
numa_policy_init ( ) ;
if ( late_time_init )
late_time_init ( ) ;
2009-08-22 00:01:12 +04:00
sched_clock_init ( ) ;
2005-04-17 02:20:36 +04:00
calibrate_delay ( ) ;
pidmap_init ( ) ;
anon_vma_init ( ) ;
2014-03-13 03:22:58 +04:00
acpi_early_init ( ) ;
2012-12-16 03:15:24 +04:00
# ifdef CONFIG_X86
2012-11-14 13:42:35 +04:00
if ( efi_enabled ( EFI_RUNTIME_SERVICES ) )
2012-12-16 03:15:24 +04:00
efi_enter_virtual_mode ( ) ;
x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. This
causes some 16-bit software to break, but it also leaks kernel state
to user space. We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.
In checkin:
b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.
This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart. When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace. The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.
(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)
Special thanks to:
- Andy Lutomirski, for the suggestion of using very small stack slots
and copy (as opposed to map) the IRET frame there, and for the
suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
2014-04-30 03:46:09 +04:00
# endif
2014-05-04 21:00:49 +04:00
# ifdef CONFIG_X86_ESPFIX64
x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. This
causes some 16-bit software to break, but it also leaks kernel state
to user space. We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.
In checkin:
b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.
This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart. When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace. The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.
(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)
Special thanks to:
- Andy Lutomirski, for the suggestion of using very small stack slots
and copy (as opposed to map) the IRET frame there, and for the
suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
2014-04-30 03:46:09 +04:00
/* Should be run before the first non-init thread is created */
init_espfix_bsp ( ) ;
2012-12-16 03:15:24 +04:00
# endif
2008-04-18 10:56:15 +04:00
thread_info_cache_init ( ) ;
CRED: Inaugurate COW credentials
Inaugurate copy-on-write credentials management. This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.
A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().
With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:
struct cred *new = prepare_creds();
int ret = blah(new);
if (ret < 0) {
abort_creds(new);
return ret;
}
return commit_creds(new);
There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.
To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const. The purpose of this is compile-time
discouragement of altering credentials through those pointers. Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:
(1) Its reference count may incremented and decremented.
(2) The keyrings to which it points may be modified, but not replaced.
The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
This now prepares and commits credentials in various places in the
security code rather than altering the current creds directly.
(2) Temporary credential overrides.
do_coredump() and sys_faccessat() now prepare their own credentials and
temporarily override the ones currently on the acting thread, whilst
preventing interference from other threads by holding cred_replace_mutex
on the thread being dumped.
This will be replaced in a future patch by something that hands down the
credentials directly to the functions being called, rather than altering
the task's objective credentials.
(3) LSM interface.
A number of functions have been changed, added or removed:
(*) security_capset_check(), ->capset_check()
(*) security_capset_set(), ->capset_set()
Removed in favour of security_capset().
(*) security_capset(), ->capset()
New. This is passed a pointer to the new creds, a pointer to the old
creds and the proposed capability sets. It should fill in the new
creds or return an error. All pointers, barring the pointer to the
new creds, are now const.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
Changed; now returns a value, which will cause the process to be
killed if it's an error.
(*) security_task_alloc(), ->task_alloc_security()
Removed in favour of security_prepare_creds().
(*) security_cred_free(), ->cred_free()
New. Free security data attached to cred->security.
(*) security_prepare_creds(), ->cred_prepare()
New. Duplicate any security data attached to cred->security.
(*) security_commit_creds(), ->cred_commit()
New. Apply any security effects for the upcoming installation of new
security by commit_creds().
(*) security_task_post_setuid(), ->task_post_setuid()
Removed in favour of security_task_fix_setuid().
(*) security_task_fix_setuid(), ->task_fix_setuid()
Fix up the proposed new credentials for setuid(). This is used by
cap_set_fix_setuid() to implicitly adjust capabilities in line with
setuid() changes. Changes are made to the new credentials, rather
than the task itself as in security_task_post_setuid().
(*) security_task_reparent_to_init(), ->task_reparent_to_init()
Removed. Instead the task being reparented to init is referred
directly to init's credentials.
NOTE! This results in the loss of some state: SELinux's osid no
longer records the sid of the thread that forked it.
(*) security_key_alloc(), ->key_alloc()
(*) security_key_permission(), ->key_permission()
Changed. These now take cred pointers rather than task pointers to
refer to the security context.
(4) sys_capset().
This has been simplified and uses less locking. The LSM functions it
calls have been merged.
(5) reparent_to_kthreadd().
This gives the current thread the same credentials as init by simply using
commit_thread() to point that way.
(6) __sigqueue_alloc() and switch_uid()
__sigqueue_alloc() can't stop the target task from changing its creds
beneath it, so this function gets a reference to the currently applicable
user_struct which it then passes into the sigqueue struct it returns if
successful.
switch_uid() is now called from commit_creds(), and possibly should be
folded into that. commit_creds() should take care of protecting
__sigqueue_alloc().
(7) [sg]et[ug]id() and co and [sg]et_current_groups.
The set functions now all use prepare_creds(), commit_creds() and
abort_creds() to build and check a new set of credentials before applying
it.
security_task_set[ug]id() is called inside the prepared section. This
guarantees that nothing else will affect the creds until we've finished.
The calling of set_dumpable() has been moved into commit_creds().
Much of the functionality of set_user() has been moved into
commit_creds().
The get functions all simply access the data directly.
(8) security_task_prctl() and cap_task_prctl().
security_task_prctl() has been modified to return -ENOSYS if it doesn't
want to handle a function, or otherwise return the return value directly
rather than through an argument.
Additionally, cap_task_prctl() now prepares a new set of credentials, even
if it doesn't end up using it.
(9) Keyrings.
A number of changes have been made to the keyrings code:
(a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
all been dropped and built in to the credentials functions directly.
They may want separating out again later.
(b) key_alloc() and search_process_keyrings() now take a cred pointer
rather than a task pointer to specify the security context.
(c) copy_creds() gives a new thread within the same thread group a new
thread keyring if its parent had one, otherwise it discards the thread
keyring.
(d) The authorisation key now points directly to the credentials to extend
the search into rather pointing to the task that carries them.
(e) Installing thread, process or session keyrings causes a new set of
credentials to be created, even though it's not strictly necessary for
process or session keyrings (they're shared).
(10) Usermode helper.
The usermode helper code now carries a cred struct pointer in its
subprocess_info struct instead of a new session keyring pointer. This set
of credentials is derived from init_cred and installed on the new process
after it has been cloned.
call_usermodehelper_setup() allocates the new credentials and
call_usermodehelper_freeinfo() discards them if they haven't been used. A
special cred function (prepare_usermodeinfo_creds()) is provided
specifically for call_usermodehelper_setup() to call.
call_usermodehelper_setkeys() adjusts the credentials to sport the
supplied keyring as the new session keyring.
(11) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) selinux_setprocattr() no longer does its check for whether the
current ptracer can access processes with the new SID inside the lock
that covers getting the ptracer's SID. Whilst this lock ensures that
the check is done with the ptracer pinned, the result is only valid
until the lock is released, so there's no point doing it inside the
lock.
(12) is_single_threaded().
This function has been extracted from selinux_setprocattr() and put into
a file of its own in the lib/ directory as join_session_keyring() now
wants to use it too.
The code in SELinux just checked to see whether a task shared mm_structs
with other tasks (CLONE_VM), but that isn't good enough. We really want
to know if they're part of the same thread group (CLONE_THREAD).
(13) nfsd.
The NFS server daemon now has to use the COW credentials to set the
credentials it is going to use. It really needs to pass the credentials
down to the functions it calls, but it can't do that until other patches
in this series have been applied.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 02:39:23 +03:00
cred_init ( ) ;
2009-09-22 04:03:05 +04:00
fork_init ( totalram_pages ) ;
2005-04-17 02:20:36 +04:00
proc_caches_init ( ) ;
buffer_init ( ) ;
key_init ( ) ;
security_init ( ) ;
2010-05-21 06:04:29 +04:00
dbg_late_init ( ) ;
2009-09-22 04:03:05 +04:00
vfs_caches_init ( totalram_pages ) ;
2005-04-17 02:20:36 +04:00
signals_init ( ) ;
/* rootfs populating might need page-writeback */
page_writeback_init ( ) ;
proc_root_init ( ) ;
take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-01 17:57:28 +03:00
nsfs_init ( ) ;
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 10:39:30 +04:00
cgroup_init ( ) ;
2005-04-17 02:20:36 +04:00
cpuset_init ( ) ;
2006-07-14 11:24:40 +04:00
taskstats_init_early ( ) ;
2006-07-14 11:24:36 +04:00
delayacct_init ( ) ;
2005-04-17 02:20:36 +04:00
check_bugs ( ) ;
2009-08-14 23:13:46 +04:00
sfi_init_late ( ) ;
2005-04-17 02:20:36 +04:00
2012-11-14 13:42:35 +04:00
if ( efi_enabled ( EFI_RUNTIME_SERVICES ) ) {
2012-09-29 04:57:05 +04:00
efi_late_init ( ) ;
2012-09-29 04:55:44 +04:00
efi_free_boot_services ( ) ;
2012-09-29 04:57:05 +04:00
}
2012-09-29 04:55:44 +04:00
2008-08-14 23:45:08 +04:00
ftrace_init ( ) ;
2005-04-17 02:20:36 +04:00
/* Do the rest non-__init'ed, we're now alive */
rest_init ( ) ;
}
2009-06-18 03:28:03 +04:00
/* Call all constructor functions linked into the kernel. */
static void __init do_ctors ( void )
{
# ifdef CONFIG_CONSTRUCTORS
2009-12-15 05:00:18 +03:00
ctor_fn_t * fn = ( ctor_fn_t * ) __ctors_start ;
2009-06-18 03:28:03 +04:00
2009-12-15 05:00:18 +03:00
for ( ; fn < ( ctor_fn_t * ) __ctors_end ; fn + + )
( * fn ) ( ) ;
2009-06-18 03:28:03 +04:00
# endif
}
2012-01-13 03:02:18 +04:00
bool initcall_debug ;
2008-10-22 19:00:23 +04:00
core_param ( initcall_debug , initcall_debug , bool , 0644 ) ;
2005-04-17 02:20:36 +04:00
2014-06-05 03:12:17 +04:00
# ifdef CONFIG_KALLSYMS
struct blacklist_entry {
struct list_head next ;
char * buf ;
} ;
static __initdata_or_module LIST_HEAD ( blacklisted_initcalls ) ;
static int __init initcall_blacklist ( char * str )
{
char * str_entry ;
struct blacklist_entry * entry ;
/* str argument is a comma-separated list of functions */
do {
str_entry = strsep ( & str , " , " ) ;
if ( str_entry ) {
pr_debug ( " blacklisting initcall %s \n " , str_entry ) ;
entry = alloc_bootmem ( sizeof ( * entry ) ) ;
entry - > buf = alloc_bootmem ( strlen ( str_entry ) + 1 ) ;
strcpy ( entry - > buf , str_entry ) ;
list_add ( & entry - > next , & blacklisted_initcalls ) ;
}
} while ( str_entry ) ;
return 0 ;
}
static bool __init_or_module initcall_blacklisted ( initcall_t fn )
{
struct list_head * tmp ;
struct blacklist_entry * entry ;
char * fn_name ;
fn_name = kasprintf ( GFP_KERNEL , " %pf " , fn ) ;
if ( ! fn_name )
return false ;
list_for_each ( tmp , & blacklisted_initcalls ) {
entry = list_entry ( tmp , struct blacklist_entry , next ) ;
if ( ! strcmp ( fn_name , entry - > buf ) ) {
pr_debug ( " initcall %s blacklisted \n " , fn_name ) ;
kfree ( fn_name ) ;
return true ;
}
}
kfree ( fn_name ) ;
return false ;
}
# else
static int __init initcall_blacklist ( char * str )
{
pr_warn ( " initcall_blacklist requires CONFIG_KALLSYMS \n " ) ;
return 0 ;
}
static bool __init_or_module initcall_blacklisted ( initcall_t fn )
{
return false ;
}
# endif
__setup ( " initcall_blacklist= " , initcall_blacklist ) ;
2010-08-10 04:20:32 +04:00
static int __init_or_module do_one_initcall_debug ( initcall_t fn )
2005-04-17 02:20:36 +04:00
{
2008-11-12 01:24:42 +03:00
ktime_t calltime , delta , rettime ;
2010-05-26 14:57:53 +04:00
unsigned long long duration ;
int ret ;
2005-04-17 02:20:36 +04:00
2014-06-05 03:12:16 +04:00
printk ( KERN_DEBUG " calling %pF @ %i \n " , fn , task_pid_nr ( current ) ) ;
2010-08-10 04:20:32 +04:00
calltime = ktime_get ( ) ;
2010-05-26 14:57:53 +04:00
ret = fn ( ) ;
2010-08-10 04:20:32 +04:00
rettime = ktime_get ( ) ;
delta = ktime_sub ( rettime , calltime ) ;
duration = ( unsigned long long ) ktime_to_ns ( delta ) > > 10 ;
2014-06-05 03:12:16 +04:00
printk ( KERN_DEBUG " initcall %pF returned %d after %lld usecs \n " ,
2013-04-30 03:18:20 +04:00
fn , ret , duration ) ;
2005-04-17 02:20:36 +04:00
2010-08-10 04:20:32 +04:00
return ret ;
}
2010-08-10 04:20:32 +04:00
int __init_or_module do_one_initcall ( initcall_t fn )
2010-08-10 04:20:32 +04:00
{
int count = preempt_count ( ) ;
int ret ;
2013-07-04 02:05:37 +04:00
char msgbuf [ 64 ] ;
2010-08-10 04:20:32 +04:00
2014-06-05 03:12:17 +04:00
if ( initcall_blacklisted ( fn ) )
return - EPERM ;
2010-08-10 04:20:32 +04:00
if ( initcall_debug )
ret = do_one_initcall_debug ( fn ) ;
else
ret = fn ( ) ;
2007-05-08 11:28:26 +04:00
2008-05-16 05:14:01 +04:00
msgbuf [ 0 ] = 0 ;
2008-05-13 01:02:22 +04:00
2008-05-16 05:14:01 +04:00
if ( preempt_count ( ) ! = count ) {
2013-05-01 21:35:51 +04:00
sprintf ( msgbuf , " preemption imbalance " ) ;
2013-08-14 16:55:24 +04:00
preempt_count_set ( count ) ;
2005-04-17 02:20:36 +04:00
}
2008-05-16 05:14:01 +04:00
if ( irqs_disabled ( ) ) {
2008-05-16 00:52:41 +04:00
strlcat ( msgbuf , " disabled interrupts " , sizeof ( msgbuf ) ) ;
2008-05-16 05:14:01 +04:00
local_irq_enable ( ) ;
}
init: scream bloody murder if interrupts are enabled too early
As I was testing a lot of my code recently, and having several
"successes", I accidentally noticed in the dmesg this little line:
start_kernel(): bug: interrupts were enabled *very* early, fixing it
Sure enough, one of my patches two commits ago enabled interrupts early.
The sad part here is that I never noticed it, and I ran several tests with
ktest too, and ktest did not notice this line.
What ktest looks for (and so does many other automated testing scripts) is
a back trace produced by a WARN_ON() or BUG(). As a back trace was never
produced, my buggy patch could have slipped into linux-next, or even
worse, mainline.
Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:
PID hash table entries: 4096 (order: 3, 32768 bytes)
__ex_table already sorted, skipping sort
Checking aperture...
No AGP bridge found
Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
Hardware name: To Be Filled By O.E.M.
Interrupts were enabled *very* early, fixing it
Modules linked in:
Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
Call Trace:
warn_slowpath_common+0x83/0x9b
warn_slowpath_fmt+0x46/0x48
start_kernel+0x21e/0x415
x86_64_start_reservations+0x10e/0x112
x86_64_start_kernel+0x102/0x111
---[ end trace 007d8b0491b4f5d8 ]---
Preemptible hierarchical RCU implementation.
RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
NR_IRQS:4352 nr_irqs:712 16
Console: colour VGA+ 80x25
console [ttyS0] enabled, bootconsole disabled
Do you see it?
The original version of this patch just slapped a WARN_ON() in there and
kept the printk(). Ard van Breemen suggested using the WARN() interface,
which makes the code a bit cleaner.
Also, while examining other warnings in init/main.c, I found two other
locations that deserve a bloody murder scream if their conditions are hit,
and updated them accordingly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 03:18:18 +04:00
WARN ( msgbuf [ 0 ] , " initcall %pF returned with %s \n " , fn , msgbuf ) ;
2008-07-30 23:49:02 +04:00
2010-05-26 14:57:53 +04:00
return ret ;
2008-05-16 05:14:01 +04:00
}
2012-03-26 06:20:51 +04:00
extern initcall_t __initcall_start [ ] ;
extern initcall_t __initcall0_start [ ] ;
extern initcall_t __initcall1_start [ ] ;
extern initcall_t __initcall2_start [ ] ;
extern initcall_t __initcall3_start [ ] ;
extern initcall_t __initcall4_start [ ] ;
extern initcall_t __initcall5_start [ ] ;
extern initcall_t __initcall6_start [ ] ;
extern initcall_t __initcall7_start [ ] ;
extern initcall_t __initcall_end [ ] ;
static initcall_t * initcall_levels [ ] __initdata = {
__initcall0_start ,
__initcall1_start ,
__initcall2_start ,
__initcall3_start ,
__initcall4_start ,
__initcall5_start ,
__initcall6_start ,
__initcall7_start ,
__initcall_end ,
} ;
2012-06-15 02:00:59 +04:00
/* Keep these in sync with initcalls in include/linux/init.h */
2012-03-26 06:20:51 +04:00
static char * initcall_level_names [ ] __initdata = {
params: add 3rd arg to option handler callback signature
Add a 3rd arg, named "doing", to unknown-options callbacks invoked
from parse_args(). The arg is passed as:
"Booting kernel" from start_kernel(),
initcall_level_names[i] from do_initcall_level(),
mod->name from load_module(), via parse_args(), parse_one()
parse_args() already has the "name" parameter, which is renamed to
"doing" to better reflect current uses 1,2 above. parse_args() passes
it to an altered parse_one(), which now passes it down into the
unknown option handler callbacks.
The mod->name will be needed to handle dyndbg for loadable modules,
since params passed by modprobe are not qualified (they do not have a
"$modname." prefix), and by the time the unknown-param callback is
called, the module name is not otherwise available.
Minor tweaks:
Add param-name to parse_one's pr_debug(), current message doesnt
identify the param being handled, add it.
Add a pr_info to print current level and level_name of the initcall,
and number of registered initcalls at that level. This adds 7 lines
to dmesg output, like:
initlevel:6=device, 172 registered initcalls
Drop "parameters" from initcall_level_names[], its unhelpful in the
pr_info() added above. This array is passed into parse_args() by
do_initcall_level().
CC: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-28 00:30:34 +04:00
" early " ,
" core " ,
" postcore " ,
" arch " ,
" subsys " ,
" fs " ,
" device " ,
" late " ,
2012-03-26 06:20:51 +04:00
} ;
static void __init do_initcall_level ( int level )
2008-05-16 05:14:01 +04:00
{
2009-12-15 05:00:18 +03:00
initcall_t * fn ;
2008-05-16 05:14:01 +04:00
2013-10-31 07:15:39 +04:00
strcpy ( initcall_command_line , saved_command_line ) ;
2012-03-26 06:20:51 +04:00
parse_args ( initcall_level_names [ level ] ,
2013-10-31 07:15:39 +04:00
initcall_command_line , __start___param ,
2012-03-26 06:20:51 +04:00
__stop___param - __start___param ,
level , level ,
2012-05-03 01:33:37 +04:00
& repair_env_string ) ;
2012-03-26 06:20:51 +04:00
for ( fn = initcall_levels [ level ] ; fn < initcall_levels [ level + 1 ] ; fn + + )
2009-12-15 05:00:18 +03:00
do_one_initcall ( * fn ) ;
2005-04-17 02:20:36 +04:00
}
2012-03-26 06:20:51 +04:00
static void __init do_initcalls ( void )
{
int level ;
2012-06-01 20:56:00 +04:00
for ( level = 0 ; level < ARRAY_SIZE ( initcall_levels ) - 1 ; level + + )
2012-03-26 06:20:51 +04:00
do_initcall_level ( level ) ;
}
2005-04-17 02:20:36 +04:00
/*
* Ok , the machine is now initialized . None of the devices
* have been touched yet , but the CPU subsystem is up and
* running , and memory and process management works .
*
* Now we can finally start doing some real work . .
*/
static void __init do_basic_setup ( void )
{
2009-03-25 12:06:30 +03:00
cpuset_init_smp ( ) ;
2005-04-17 02:20:36 +04:00
usermodehelper_init ( ) ;
2011-08-04 03:21:21 +04:00
shmem_init ( ) ;
2005-04-17 02:20:36 +04:00
driver_init ( ) ;
2007-02-14 11:33:57 +03:00
init_irq_proc ( ) ;
2009-06-18 03:28:03 +04:00
do_ctors ( ) ;
bootup: move 'usermodehelper_enable()' to the end of do_basic_setup()
Doing it just before starting to call into cpu_idle() made a sick kind
of sense only because the original bug we fixed (see commit
288d5abec831: "Boot up with usermodehelper disabled") was about problems
with some scheduler data structures not being initialized, and they had
better be initialized at that point.
But it really didn't make any other conceptual sense, and doing it after
the initial "schedule()" call for the idle thread actually opened up a
race: what if the main initialization thread did everything without
needing to sleep, and got all the way into user land too? Without
actually having scheduled back to the idle thread?
Now, in normal circumstances that doesn't ever happen, but it looks like
Richard Cochran triggered exactly that on his ARM IXP4xx machines:
"I have some ARM IXP4xx based machines that use the two on chip MAC
ports (aka NPEs). The NPE needs a firmware in order to function.
Ever since the following commit [that 288d5abec831 one], it is no
longer possible to bring up the interfaces during the init scripts."
with a call trace showing an ioctl coming from user space. Richard says:
"The init is busybox, and the startup script does mount, syslogd, and
then ifup, so that all can go by quickly."
The fix is to move the usermodehelper_enable() into the main 'init'
thread, and just put it after we've done all our initcalls. By then,
everything really should be up, but we've obviously not actually started
the user-mode portion of init yet.
Reported-and-tested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-28 21:23:44 +04:00
usermodehelper_enable ( ) ;
2011-09-29 11:09:40 +04:00
do_initcalls ( ) ;
2013-09-10 18:52:35 +04:00
random_int_secret_init ( ) ;
2005-04-17 02:20:36 +04:00
}
2008-07-26 06:45:11 +04:00
static void __init do_pre_smp_initcalls ( void )
2008-07-26 06:45:11 +04:00
{
2009-12-15 05:00:18 +03:00
initcall_t * fn ;
2008-07-26 06:45:11 +04:00
2012-03-26 06:20:51 +04:00
for ( fn = __initcall_start ; fn < __initcall0_start ; fn + + )
2009-12-15 05:00:18 +03:00
do_one_initcall ( * fn ) ;
2008-07-26 06:45:11 +04:00
}
2013-01-19 02:05:56 +04:00
/*
* This function requests modules which should be loaded by default and is
* called twice right after initrd is mounted and right before init is
* exec ' d . If such modules are on either initrd or rootfs , they will be
* loaded before control is passed to userland .
*/
void __init load_default_modules ( void )
{
load_default_elevator_module ( ) ;
}
2012-10-11 05:28:25 +04:00
static int run_init_process ( const char * init_filename )
2005-04-17 02:20:36 +04:00
{
argv_init [ 0 ] = init_filename ;
2014-02-06 00:54:53 +04:00
return do_execve ( getname_kernel ( init_filename ) ,
2012-12-14 21:44:11 +04:00
( const char __user * const __user * ) argv_init ,
( const char __user * const __user * ) envp_init ) ;
2005-04-17 02:20:36 +04:00
}
2013-11-13 03:10:21 +04:00
static int try_to_run_init_process ( const char * init_filename )
{
int ret ;
ret = run_init_process ( init_filename ) ;
if ( ret & & ret ! = - ENOENT ) {
pr_err ( " Starting init: %s exists but couldn't execute it (error %d) \n " ,
init_filename , ret ) ;
}
return ret ;
}
2012-12-21 10:55:44 +04:00
static noinline void __init kernel_init_freeable ( void ) ;
2012-10-11 03:57:26 +04:00
static int __ref kernel_init ( void * unused )
2007-02-13 15:26:22 +03:00
{
2013-11-13 03:10:21 +04:00
int ret ;
2012-10-11 03:57:26 +04:00
kernel_init_freeable ( ) ;
2009-01-07 19:45:46 +03:00
/* need to finish all async __init code before freeing the memory */
async_synchronize_full ( ) ;
2007-02-13 15:26:22 +03:00
free_initmem ( ) ;
mark_rodata_ro ( ) ;
system_state = SYSTEM_RUNNING ;
numa_default_policy ( ) ;
2012-06-24 09:56:45 +04:00
flush_delayed_fput ( ) ;
2008-04-30 11:53:03 +04:00
2007-02-13 15:26:22 +03:00
if ( ramdisk_execute_command ) {
2013-11-13 03:10:21 +04:00
ret = run_init_process ( ramdisk_execute_command ) ;
if ( ! ret )
2012-10-11 05:28:25 +04:00
return 0 ;
2013-11-13 03:10:21 +04:00
pr_err ( " Failed to execute %s (error %d) \n " ,
ramdisk_execute_command , ret ) ;
2007-02-13 15:26:22 +03:00
}
/*
* We try each of these until one succeeds .
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine .
*/
if ( execute_command ) {
2013-11-13 03:10:21 +04:00
ret = run_init_process ( execute_command ) ;
if ( ! ret )
2012-10-11 05:28:25 +04:00
return 0 ;
2014-12-11 02:52:19 +03:00
panic ( " Requested init %s failed (error %d). " ,
execute_command , ret ) ;
2007-02-13 15:26:22 +03:00
}
2013-11-13 03:10:21 +04:00
if ( ! try_to_run_init_process ( " /sbin/init " ) | |
! try_to_run_init_process ( " /etc/init " ) | |
! try_to_run_init_process ( " /bin/init " ) | |
! try_to_run_init_process ( " /bin/sh " ) )
2012-10-11 05:28:25 +04:00
return 0 ;
2007-02-13 15:26:22 +03:00
2013-11-13 03:10:21 +04:00
panic ( " No working init found. Try passing init= option to kernel. "
2010-03-06 00:42:39 +03:00
" See Linux Documentation/init.txt for guidance. " ) ;
2007-02-13 15:26:22 +03:00
}
2012-12-21 10:55:44 +04:00
static noinline void __init kernel_init_freeable ( void )
2005-04-17 02:20:36 +04:00
{
2010-06-28 18:51:01 +04:00
/*
* Wait until kthreadd is all set - up .
*/
wait_for_completion ( & kthreadd_done ) ;
2012-05-21 23:52:42 +04:00
/* Now the scheduler is fully set up and can do blocking allocations */
gfp_allowed_mask = __GFP_BITS_MASK ;
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the code of
memory policy. Because the memory policy is applied in the process's
context originally. After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.
But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 02:31:49 +04:00
/*
* init can allocate pages on any node
*/
2012-12-13 01:51:40 +04:00
set_mems_allowed ( node_states [ N_MEMORY ] ) ;
2005-04-17 02:20:36 +04:00
/*
* init can run on any cpu .
*/
2009-03-31 08:05:10 +04:00
set_cpus_allowed_ptr ( current , cpu_all_mask ) ;
2005-04-17 02:20:36 +04:00
2006-10-02 13:19:00 +04:00
cad_pid = task_pid ( current ) ;
2008-01-30 15:33:17 +03:00
smp_prepare_cpus ( setup_max_cpus ) ;
2005-04-17 02:20:36 +04:00
do_pre_smp_initcalls ( ) ;
2010-11-25 20:38:29 +03:00
lockup_detector_init ( ) ;
2005-04-17 02:20:36 +04:00
smp_init ( ) ;
sched_init_smp ( ) ;
do_basic_setup ( ) ;
2010-03-03 10:53:19 +03:00
/* Open the /dev/console on the rootfs, this should never fail */
if ( sys_open ( ( const char __user * ) " /dev/console " , O_RDWR , 0 ) < 0 )
2013-04-30 03:18:20 +04:00
pr_err ( " Warning: unable to open an initial console. \n " ) ;
2010-03-03 10:53:19 +03:00
( void ) sys_dup ( 0 ) ;
( void ) sys_dup ( 0 ) ;
2005-04-17 02:20:36 +04:00
/*
* check if there is an early userspace init . If yes , let it do all
* the work
*/
2005-09-07 02:17:19 +04:00
if ( ! ramdisk_execute_command )
ramdisk_execute_command = " /init " ;
if ( sys_access ( ( const char __user * ) ramdisk_execute_command , 0 ) ! = 0 ) {
ramdisk_execute_command = NULL ;
2005-04-17 02:20:36 +04:00
prepare_namespace ( ) ;
2005-09-07 02:17:19 +04:00
}
2005-04-17 02:20:36 +04:00
/*
* Ok , we have completed the initial bootup , and
* we ' re essentially up and running . Get rid of the
* initmem segments and start the user - mode stuff . .
2014-11-05 18:01:15 +03:00
*
* rootfs is available now , try loading the public keys
* and default modules
2005-04-17 02:20:36 +04:00
*/
2013-01-19 02:05:56 +04:00
2014-11-05 18:01:15 +03:00
integrity_load_keys ( ) ;
2013-01-19 02:05:56 +04:00
load_default_modules ( ) ;
2005-04-17 02:20:36 +04:00
}