2005-04-16 15:20:36 -07:00
/*
 *  linux/init/main.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
 *  GK 2/5/95  -  Changed to support mounting root fs via NFS
 *  Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
 *  Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
 *  Simplified starting of init:  Michael A. Griffith <grif@acm.org>
 */
# include <linux/types.h>
# include <linux/module.h>
# include <linux/proc_fs.h>
# include <linux/kernel.h>
# include <linux/syscalls.h>
2008-02-14 09:41:09 +01:00
# include <linux/stackprotector.h>
2005-04-16 15:20:36 -07:00
# include <linux/string.h>
# include <linux/ctype.h>
# include <linux/delay.h>
# include <linux/ioport.h>
# include <linux/init.h>
# include <linux/initrd.h>
# include <linux/bootmem.h>
2009-06-12 20:42:08 -04:00
# include <linux/acpi.h>
2005-04-16 15:20:36 -07:00
# include <linux/tty.h>
# include <linux/percpu.h>
# include <linux/kmod.h>
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
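As a rough illustration of the lazy unmapping idea described above, the following user-space C sketch batches unmapped addresses and retires them with one simulated flush. Every name in it (lazy_vmap, lazy_vunmap, lazy_purge, lazy_can_reuse, LAZY_MAX) is hypothetical, and the "flush" is only a counter. It is not the code from this patch, which tracks address ranges and flushes real TLBs.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX 4	/* hypothetical batch threshold, chosen for the example */

struct lazy_vmap {
	unsigned long pending[LAZY_MAX];	/* unmapped but not yet flushed */
	size_t nr_pending;
	unsigned long nr_flushes;		/* counts simulated global flushes */
};

/* One simulated global flush retires every pending address at once. */
void lazy_purge(struct lazy_vmap *lv)
{
	if (lv->nr_pending == 0)
		return;
	lv->nr_flushes++;
	lv->nr_pending = 0;
}

/* Defer the flush: record the address, purge only when the batch is full. */
void lazy_vunmap(struct lazy_vmap *lv, unsigned long addr)
{
	assert(lv->nr_pending < LAZY_MAX);
	lv->pending[lv->nr_pending++] = addr;
	if (lv->nr_pending == LAZY_MAX)
		lazy_purge(lv);
}

/* An address may be handed out again only if it is not awaiting a flush. */
int lazy_can_reuse(const struct lazy_vmap *lv, unsigned long addr)
{
	for (size_t i = 0; i < lv->nr_pending; i++)
		if (lv->pending[i] == addr)
			return 0;
	return 1;
}
```

With LAZY_MAX set to 4, eight calls to lazy_vunmap() cost two simulated flushes instead of eight, and an address stays unavailable for reuse until the flush that covers it has happened.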
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on a 4 core, 2 socket Opteron. Results are
in nanoseconds per map+touch+unmap.
threads  vanilla  vmap rewrite
1        14700    2900
2        33600    3000
4        49500    2800
8        70631    2900
So with 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-18 20:27:03 -07:00
# include <linux/vmalloc.h>
2005-04-16 15:20:36 -07:00
# include <linux/kernel_stat.h>
2006-12-07 02:14:08 +01:00
# include <linux/start_kernel.h>
2005-04-16 15:20:36 -07:00
# include <linux/security.h>
2008-06-26 11:21:34 +02:00
# include <linux/smp.h>
2005-04-16 15:20:36 -07:00
# include <linux/profile.h>
# include <linux/rcupdate.h>
# include <linux/moduleparam.h>
# include <linux/kallsyms.h>
# include <linux/writeback.h>
# include <linux/cpu.h>
# include <linux/cpuset.h>
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
# include <linux/cgroup.h>
2005-04-16 15:20:36 -07:00
# include <linux/efi.h>
2007-02-16 01:28:01 -08:00
# include <linux/tick.h>
2007-02-17 21:22:39 -08:00
# include <linux/interrupt.h>
2006-07-14 00:24:40 -07:00
# include <linux/taskstats_kern.h>
2006-07-14 00:24:36 -07:00
# include <linux/delayacct.h>
2005-04-16 15:20:36 -07:00
# include <linux/unistd.h>
# include <linux/rmap.h>
# include <linux/mempolicy.h>
# include <linux/key.h>
2006-06-27 02:53:54 -07:00
# include <linux/buffer_head.h>
2008-10-22 14:15:05 -07:00
# include <linux/page_cgroup.h>
2006-07-03 00:24:33 -07:00
# include <linux/debug_locks.h>
2008-04-30 00:55:01 -07:00
# include <linux/debugobjects.h>
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persists during the lifetime of the kernel.
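To make the "validate each new rule against the existing rules" idea concrete, here is a minimal user-space C sketch that keeps one dependency graph per lock-class and rejects a rule that would close a cycle. The names (lock_graph, lg_add_rule, lg_reachable, MAX_CLASSES) are invented for this illustration. The real validator is far more elaborate (it also tracks irq contexts, for instance), so treat this only as a picture of the core check.

```c
#include <assert.h>
#include <string.h>

#define MAX_CLASSES 8	/* hypothetical limit for the sketch */

/* dep[a][b] != 0 records the rule "class a was held while taking class b". */
struct lock_graph {
	unsigned char dep[MAX_CLASSES][MAX_CLASSES];
};

/* Depth-first search: is 'to' reachable from 'from' via recorded rules? */
int lg_reachable(const struct lock_graph *g, int from, int to,
		 unsigned char *seen)
{
	if (from == to)
		return 1;
	seen[from] = 1;
	for (int next = 0; next < MAX_CLASSES; next++)
		if (g->dep[from][next] && !seen[next] &&
		    lg_reachable(g, next, to, seen))
			return 1;
	return 0;
}

/*
 * Record "held -> taken".  Returns 1 if the rule is consistent with the
 * existing rules and was added, 0 if it would close a cycle (a potential
 * deadlock) and was rejected.
 */
int lg_add_rule(struct lock_graph *g, int held, int taken)
{
	unsigned char seen[MAX_CLASSES];

	assert(held >= 0 && held < MAX_CLASSES);
	assert(taken >= 0 && taken < MAX_CLASSES);
	memset(seen, 0, sizeof(seen));
	if (lg_reachable(g, taken, held, seen))
		return 0;
	g->dep[held][taken] = 1;
	return 1;
}
```

Because the graph is keyed by class rather than by lock instance, the rules A -> B and B -> C observed on unrelated objects are enough to reject a later C -> A, even though no single task ever held all three locks.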
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 00:24:50 -07:00
# include <linux/lockdep.h>
2009-06-11 13:22:39 +01:00
# include <linux/kmemleak.h>
2006-12-08 02:38:01 -08:00
# include <linux/pid_namespace.h>
2006-12-19 13:01:28 -08:00
# include <linux/device.h>
2007-05-09 02:34:32 -07:00
# include <linux/kthread.h>
2007-11-09 22:39:39 +01:00
# include <linux/sched.h>
2008-02-06 01:36:44 -08:00
# include <linux/signal.h>
2008-04-29 01:03:13 -07:00
# include <linux/idr.h>
2010-05-20 21:04:29 -05:00
# include <linux/kgdb.h>
2008-08-14 15:45:08 -04:00
# include <linux/ftrace.h>
2009-01-07 08:45:46 -08:00
# include <linux/async.h>
2008-04-04 00:51:41 +02:00
# include <linux/kmemcheck.h>
2009-08-14 15:13:46 -04:00
# include <linux/sfi.h>
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
very early at kernel initialization, before any driver-core device
is registered. Every device with a major/minor will provide a
device node in devtmpfs.
Devtmpfs can be changed and altered by userspace at any time,
and in any way needed - just like today's udev-mounted tmpfs.
Unmodified udev versions will run just fine on top of it, and will
recognize an already existing kernel-created device node and use it.
The default node permissions are root:root 0600. Proper permissions
and user/group ownership, meaningful symlinks, all other policy still
needs to be applied by userspace.
If a node is created by devtmpfs, devtmpfs will remove the device node
when the device goes away. If the device node was created by
userspace, or the devtmpfs created node was replaced by userspace, it
will no longer be removed by devtmpfs.
If it is requested to auto-mount it, it makes init=/bin/sh work
without any further userspace support. /dev will be fully populated
and dynamic, and always reflect the current device state of the kernel.
With the commonly used dynamic device numbers, it solves the problem
where static device nodes may point to the wrong devices.
It is intended to make the initial bootup logic simpler and more robust,
by de-coupling the creation of the initial environment, to reliably run
userspace processes, from a complex userspace bootstrap logic to provide
a working /dev.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Tested-By: Harald Hoyer <harald@redhat.com>
Tested-By: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 15:23:42 +02:00
# include <linux/shmem_fs.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2010-11-17 23:17:33 +01:00
# include <linux/perf_event.h>
2005-04-16 15:20:36 -07:00
# include <asm/io.h>
# include <asm/bugs.h>
# include <asm/setup.h>
2005-07-28 21:15:30 -07:00
# include <asm/sections.h>
2006-01-06 00:12:01 -08:00
# include <asm/cacheflush.h>
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_X86_LOCAL_APIC
# include <asm/smp.h>
# endif
2007-02-26 16:45:41 +01:00
static int kernel_init ( void * ) ;
2005-04-16 15:20:36 -07:00
extern void init_IRQ ( void ) ;
extern void fork_init ( unsigned long ) ;
extern void mca_init ( void ) ;
extern void sbus_init ( void ) ;
extern void prio_tree_init ( void ) ;
extern void radix_tree_init ( void ) ;
extern void free_initmem ( void ) ;
2006-01-06 00:12:01 -08:00
# ifndef CONFIG_DEBUG_RODATA
static inline void mark_rodata_ro ( void ) { }
# endif
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_TC
extern void tc_init ( void ) ;
# endif
2011-01-20 12:06:35 +01:00
/*
 * Debug helper: via this flag we know that we are in 'early bootup code'
 * where only the boot processor is running with IRQ disabled.  This means
 * two things - IRQ must not be enabled before the flag is cleared and some
 * operations which are not allowed with IRQ disabled are allowed while the
 * flag is set.
 */
bool early_boot_irqs_disabled __read_mostly ;
rcu: Teach RCU that idle task is not quiescent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. And cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
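The rule in the v2 change list above can be pictured with a small C sketch. This is an assumption-laden illustration, not the kernel's rcu_blocking_is_gp(): the enum, the function name blocking_is_gp_sketch and its parameters are invented here, and "still in early boot" is modelled as "the scheduler is not running yet".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tags for the three RCU implementations named above. */
enum rcu_flavor { RCU_CLASSIC, RCU_TREE, RCU_PREEMPT };

/*
 * Returns true when a blocking call such as synchronize_rcu() can be
 * treated as a grace period by itself, so it may return immediately.
 */
bool blocking_is_gp_sketch(enum rcu_flavor flavor, int online_cpus,
			   bool scheduler_running)
{
	bool single_cpu;

	assert(online_cpus >= 1);
	single_cpu = (online_cpus == 1);

	/*
	 * A preemptible read-side critical section can be preempted once
	 * the scheduler runs, so the shortcut is limited to early boot.
	 */
	if (flavor == RCU_PREEMPT)
		return single_cpu && !scheduler_running;

	/* Classic and tree RCU only need a single online CPU. */
	return single_cpu;
}
```

The sketch shows why the preemptible variant needs the extra condition: with one CPU and a running scheduler, a reader may still be preempted inside its critical section, so blocking alone does not prove that all readers have finished.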
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:03:42 -08:00
enum system_states system_state __read_mostly ;
2005-04-16 15:20:36 -07:00
EXPORT_SYMBOL ( system_state ) ;
/*
 * Boot command-line arguments
 */
# define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
# define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
extern void time_init ( void ) ;
/* Default late time init is NULL. archs can override this later. */
2009-01-06 14:41:10 -08:00
void ( * __initdata late_time_init ) ( void ) ;
2005-04-16 15:20:36 -07:00
extern void softirq_init ( void ) ;
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architectures is much too small for
module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value allocates larger
static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as a dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer, since components may hold
references to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamically allocated saved_command_line.
3. Add dynamically allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
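Steps 2 to 5 above can be sketched in user-space C as follows. This is only an approximation under stated assumptions: malloc() stands in for the kernel's boot-time allocator, COMMAND_LINE_SIZE is given an arbitrary value, and the function name setup_command_line_sketch is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COMMAND_LINE_SIZE 256	/* illustrative; the real value is per-arch */

/* Stands in for the init-disposable buffer filled by arch code. */
char boot_command_line[COMMAND_LINE_SIZE];
/* Untouched copy that outlives init (e.g. for /proc). */
char *saved_command_line;
/* Copy that the parameter parser is allowed to modify. */
char *static_command_line;

/* Returns 0 on success, -1 if either allocation fails. */
int setup_command_line_sketch(const char *arch_command_line)
{
	assert(arch_command_line != NULL);

	/* Size each copy to its content instead of COMMAND_LINE_SIZE. */
	saved_command_line = malloc(strlen(boot_command_line) + 1);
	static_command_line = malloc(strlen(arch_command_line) + 1);
	if (!saved_command_line || !static_command_line) {
		free(saved_command_line);
		free(static_command_line);
		saved_command_line = static_command_line = NULL;
		return -1;
	}

	strcpy(saved_command_line, boot_command_line);
	strcpy(static_command_line, arch_command_line);
	return 0;
}
```

After the copy, the parser can write NUL terminators into static_command_line while saved_command_line and the arch buffer stay intact, which is what allows the arch buffer to be discarded with the rest of the init data.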
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line [ COMMAND_LINE_SIZE ] ;
/* Untouched saved command line (eg. for /proc) */
char * saved_command_line ;
/* Command line for parameter parsing */
static char * static_command_line ;
2005-04-16 15:20:36 -07:00
static char * execute_command ;
2005-09-06 15:17:19 -07:00
static char * ramdisk_execute_command ;
2005-04-16 15:20:36 -07:00
2007-07-15 23:41:07 -07:00
/*
 * If set, this tells drivers to reset the underlying device before going
 * ahead with initialization; otherwise a driver might rely on the BIOS
 * and skip the reset operation.
 *
 * This is useful if the kernel is booting in an unreliable environment.
 * For example, in a kdump situation the previous kernel has crashed, the
 * BIOS has been skipped and devices will be in an unknown state.
 */
unsigned int reset_devices ;
EXPORT_SYMBOL ( reset_devices ) ;
2005-04-16 15:20:36 -07:00
2006-09-27 01:50:44 -07:00
static int __init set_reset_devices(char *str)
{
	reset_devices = 1;
	return 1;
}
__setup("reset_devices", set_reset_devices);
2010-08-17 23:52:56 +01:00
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
const char *envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
2005-04-16 15:20:36 -07:00
static const char * panic_later , * panic_param ;
2010-08-11 23:04:18 -06:00
extern const struct obs_kernel_param __setup_start [ ] , __setup_end [ ] ;
2005-04-16 15:20:36 -07:00
static int __init obsolete_checksetup(char *line)
{
2010-08-11 23:04:18 -06:00
	const struct obs_kernel_param *p;
2006-09-26 10:52:32 +02:00
	int had_early_param = 0;
2005-04-16 15:20:36 -07:00
	p = __setup_start;
	do {
		int n = strlen(p->str);
2011-10-10 00:03:37 +02:00
		if (parameqn(line, p->str, n)) {
2005-04-16 15:20:36 -07:00
			if (p->early) {
2006-09-26 10:52:32 +02:00
				/* Already done in parse_early_param?
				 * (Needs exact match on param part).
				 * Keep iterating, as we can have early
				 * params and __setups of same names 8( */
2005-04-16 15:20:36 -07:00
				if (line[n] == '\0' || line[n] == '=')
2006-09-26 10:52:32 +02:00
					had_early_param = 1;
2005-04-16 15:20:36 -07:00
			} else if (!p->setup_func) {
				printk(KERN_WARNING "Parameter %s is obsolete,"
				       " ignored\n", p->str);
				return 1;
			} else if (p->setup_func(line + n))
				return 1;
		}
		p++;
	} while (p < __setup_end);
2006-09-26 10:52:32 +02:00
	return had_early_param;
2005-04-16 15:20:36 -07:00
}
/*
 * This should be approx 2 Bo*oMips to start (note initial shift), and will
 * still work even if initially too large, it will just take slightly longer
 */
unsigned long loops_per_jiffy = (1<<12);
EXPORT_SYMBOL(loops_per_jiffy);
static int __init debug_kernel(char *str)
{
	console_loglevel = 10;
2008-02-08 04:21:58 -08:00
	return 0;
2005-04-16 15:20:36 -07:00
}
static int __init quiet_kernel(char *str)
{
	console_loglevel = 4;
2008-02-08 04:21:58 -08:00
	return 0;
2005-04-16 15:20:36 -07:00
}
2008-02-08 04:21:58 -08:00
early_param("debug", debug_kernel);
early_param("quiet", quiet_kernel);
2005-04-16 15:20:36 -07:00
static int __init loglevel ( char * str )
{
2011-09-21 09:51:40 +02:00
int newlevel ;
/*
* Only update loglevel value when a correct setting was passed ,
* to prevent blind crashes ( when loglevel being set to 0 ) that
* are quite hard to debug
*/
if ( get_option ( & str , & newlevel ) ) {
console_loglevel = newlevel ;
return 0 ;
}
return - EINVAL ;
2005-04-16 15:20:36 -07:00
}
2008-02-08 04:21:58 -08:00
early_param ( " loglevel " , loglevel ) ;
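The loglevel handler only commits a new value once parsing has succeeded. A minimal user-space sketch of that validate-then-commit pattern follows; demo_set_loglevel and demo_console_loglevel are hypothetical names, and strtol stands in for the kernel's get_option, whose accepted syntax may differ.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

static int demo_console_loglevel = 7; /* hypothetical starting value */

/* Update the level only when str is a complete, non-negative integer. */
static int demo_set_loglevel(const char *str)
{
	char *end;
	long v;

	if (!str || *str == '\0')
		return -EINVAL;

	errno = 0;
	v = strtol(str, &end, 0);
	if (end == str || *end != '\0' || errno == ERANGE)
		return -EINVAL;
	if (v < 0 || v > INT_MAX)
		return -EINVAL;

	/* Parsing succeeded, so it is now safe to commit the new value. */
	demo_console_loglevel = (int)v;
	return 0;
}
```

A malformed string such as "abc" leaves demo_console_loglevel untouched, which is the property the comment in loglevel asks for.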
2005-04-16 15:20:36 -07:00
/*
* Unknown boot options get handed to init , unless they look like
2009-12-01 14:56:44 +10:30
* unused parameters ( modprobe will find them in / proc / cmdline ) .
2005-04-16 15:20:36 -07:00
*/
static int __init unknown_bootoption ( char * param , char * val )
{
/* Change NUL term back to "=", to make "param" the whole string. */
if ( val ) {
/* param=val or param="val"? */
if ( val = = param + strlen ( param ) + 1 )
val [ - 1 ] = ' = ' ;
else if ( val = = param + strlen ( param ) + 2 ) {
val [ - 2 ] = ' = ' ;
memmove ( val - 1 , val , strlen ( val ) + 1 ) ;
val - - ;
} else
BUG ( ) ;
}
/* Handle obsolete-style parameters */
if ( obsolete_checksetup ( param ) )
return 0 ;
2009-12-01 14:56:44 +10:30
/* Unused module parameter. */
if ( strchr ( param , ' . ' ) & & ( ! val | | strchr ( param , ' . ' ) < val ) )
2005-04-16 15:20:36 -07:00
return 0 ;
if ( panic_later )
return 0 ;
if ( val ) {
/* Environment option */
unsigned int i ;
for ( i = 0 ; envp_init [ i ] ; i + + ) {
if ( i = = MAX_INIT_ENVS ) {
panic_later = " Too many boot env vars at `%s' " ;
panic_param = param ;
}
if ( ! strncmp ( param , envp_init [ i ] , val - param ) )
break ;
}
envp_init [ i ] = param ;
} else {
/* Command line option */
unsigned int i ;
for ( i = 0 ; argv_init [ i ] ; i + + ) {
if ( i = = MAX_INIT_ARGS ) {
panic_later = " Too many boot init vars at `%s' " ;
panic_param = param ;
}
}
argv_init [ i ] = param ;
}
return 0 ;
}
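The routing done by unknown_bootoption can be illustrated with a small user-space sketch. It is a simplification under stated assumptions: demo_unknown_option, demo_envp and demo_argv are hypothetical names, the option arrives as one "name=value" string, and a full table is reported with -1 instead of a deferred panic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_ENVS 8
#define DEMO_MAX_ARGS 8

/* Hypothetical counterparts of envp_init and argv_init. */
static const char *demo_envp[DEMO_MAX_ENVS + 2] = { "HOME=/", "TERM=linux", NULL };
static const char *demo_argv[DEMO_MAX_ARGS + 2] = { "init", NULL };

/* Returns 0 when the option was stored or skipped, -1 when a table is full. */
static int demo_unknown_option(const char *opt)
{
	const char *eq = strchr(opt, '=');
	const char *dot = strchr(opt, '.');
	size_t i;

	/* "module.param[=val]" is left for the module loader. */
	if (dot && (!eq || dot < eq))
		return 0;

	if (eq) {
		/* Environment option: compare the name including '='. */
		size_t keylen = (size_t)(eq - opt) + 1;

		for (i = 0; demo_envp[i]; i++)
			if (!strncmp(opt, demo_envp[i], keylen))
				break; /* same variable: overwrite it */
		if (!demo_envp[i] && i >= DEMO_MAX_ENVS)
			return -1;
		demo_envp[i] = opt;
		return 0;
	}

	/* Plain word: append to the argument vector passed to init. */
	for (i = 0; demo_argv[i]; i++)
		;
	if (i >= DEMO_MAX_ARGS)
		return -1;
	demo_argv[i] = opt;
	return 0;
}
```

With these tables, "TERM=vt100" replaces the default TERM entry, "single" becomes the first argument after "init", and "usbcore.autosuspend=5" is stored nowhere.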
2008-01-30 13:33:58 +01:00
# ifdef CONFIG_DEBUG_PAGEALLOC
int __read_mostly debug_pagealloc_enabled = 0 ;
# endif
2005-04-16 15:20:36 -07:00
static int __init init_setup ( char * str )
{
unsigned int i ;
execute_command = str ;
/*
* In case LILO is going to boot us with default command line ,
* it prepends " auto " before the whole cmdline which makes
* the shell think it should execute a script with such name .
* So we ignore all arguments entered _before_ init = . . . [ MJ ]
*/
for ( i = 1 ; i < MAX_INIT_ARGS ; i + + )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( " init= " , init_setup ) ;
2005-09-06 15:17:19 -07:00
static int __init rdinit_setup ( char * str )
{
unsigned int i ;
ramdisk_execute_command = str ;
/* See "auto" comment in init_setup */
for ( i = 1 ; i < MAX_INIT_ARGS ; i + + )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( " rdinit= " , rdinit_setup ) ;
2005-04-16 15:20:36 -07:00
# ifndef CONFIG_SMP
2011-03-22 16:34:06 -07:00
static const unsigned int setup_max_cpus = NR_CPUS ;
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_X86_LOCAL_APIC
static void __init smp_init ( void )
{
APIC_init_uniprocessor ( ) ;
}
# else
# define smp_init() do { } while (0)
# endif
2008-03-26 14:23:48 -07:00
static inline void setup_nr_cpu_ids ( void ) { }
2005-04-16 15:20:36 -07:00
static inline void smp_prepare_cpus ( unsigned int maxcpus ) { }
# endif
[PATCH] Dynamic kernel command-line: common
The current implementation stores a static command-line buffer of
COMMAND_LINE_SIZE bytes. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
The current kernel command-line size for most architectures is much too small
for module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value enlarges these
static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer, so that components may hold
references to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
/*
* We need to store the untouched command line for future reference .
* We also need to store the touched command line since the parameter
* parsing is performed in place , and we should allow a component to
* store reference of name / value for future reference .
*/
static void __init setup_command_line ( char * command_line )
{
saved_command_line = alloc_bootmem ( strlen ( boot_command_line ) + 1 ) ;
static_command_line = alloc_bootmem ( strlen ( command_line ) + 1 ) ;
strcpy ( saved_command_line , boot_command_line ) ;
strcpy ( static_command_line , command_line ) ;
}
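Why two copies are kept can be shown with a short user-space sketch: one buffer stays untouched for later reference, the other is tokenized in place. The names demo_setup_command_line, demo_split, demo_saved and demo_parsed are hypothetical, malloc stands in for alloc_bootmem, and splitting on spaces is a simplification of the real parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static char *demo_saved;  /* untouched copy, kept for reference */
static char *demo_parsed; /* working copy, modified by parsing */

/* Duplicate the boot line twice; returns 0 on success, -1 on failure. */
static int demo_setup_command_line(const char *boot_line)
{
	size_t len = strlen(boot_line) + 1;

	demo_saved = malloc(len);
	demo_parsed = malloc(len);
	if (!demo_saved || !demo_parsed) {
		free(demo_saved);
		free(demo_parsed);
		demo_saved = demo_parsed = NULL;
		return -1;
	}
	memcpy(demo_saved, boot_line, len);
	memcpy(demo_parsed, boot_line, len);
	return 0;
}

/* Split line at spaces in place; tok[] points into line itself. */
static size_t demo_split(char *line, char **tok, size_t max)
{
	size_t n = 0;
	char *p = line;

	while (*p && n < max) {
		while (*p == ' ')
			p++;
		if (!*p)
			break;
		tok[n++] = p;
		while (*p && *p != ' ')
			p++;
		if (*p)
			*p++ = '\0'; /* this write is why a second copy is needed */
	}
	return n;
}
```

After demo_split runs on demo_parsed, the tokens remain valid pointers into that buffer, while demo_saved still holds the original line with its spaces intact.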
2005-04-16 15:20:36 -07:00
/*
* We need to finalize in a non - __init function or else race conditions
* between the root thread and the init thread may cause start_kernel to
* be reaped by free_initmem before the root thread has proceeded to
* cpu_idle .
*
* gcc - 3.4 accidentally inlines this function , so use noinline .
*/
2010-06-28 16:51:01 +02:00
static __initdata DECLARE_COMPLETION ( kthreadd_done ) ;
2009-01-06 14:40:38 -08:00
static noinline void __init_refok rest_init ( void )
2005-04-16 15:20:36 -07:00
{
2007-05-09 02:34:32 -07:00
int pid ;
2009-09-02 14:01:24 -07:00
rcu_scheduler_starting ( ) ;
2010-06-28 16:51:01 +02:00
/*
2010-06-30 10:37:11 +02:00
* We need to spawn init first so that it obtains pid 1 , however
2010-06-28 16:51:01 +02:00
* the init task will end up wanting to create kthreads , which , if
* we schedule it before we create kthreadd , will OOPS .
*/
2007-02-26 16:45:41 +01:00
kernel_thread ( kernel_init , NULL , CLONE_FS | CLONE_SIGHAND ) ;
2005-04-16 15:20:36 -07:00
numa_default_policy ( ) ;
2007-05-09 02:34:32 -07:00
pid = kernel_thread ( kthreadd , NULL , CLONE_FS | CLONE_FILES ) ;
2010-02-22 17:04:50 -08:00
rcu_read_lock ( ) ;
2008-04-30 00:54:24 -07:00
kthreadd_task = find_task_by_pid_ns ( pid , & init_pid_ns ) ;
2010-02-22 17:04:50 -08:00
rcu_read_unlock ( ) ;
2010-06-28 16:51:01 +02:00
complete ( & kthreadd_done ) ;
2005-06-28 16:40:42 +02:00
/*
* The boot idle thread must execute schedule ( )
2007-07-09 18:51:58 +02:00
* at least once to get things moving :
2005-06-28 16:40:42 +02:00
*/
2007-07-09 18:51:58 +02:00
init_idle_bootup_task ( current ) ;
2005-11-08 21:39:01 -08:00
preempt_enable_no_resched ( ) ;
2005-06-28 16:40:42 +02:00
schedule ( ) ;
2011-08-03 22:03:29 -10:00
2005-11-08 21:39:01 -08:00
/* Call into cpu_idle with preempt disabled */
2011-08-03 22:03:29 -10:00
preempt_disable ( ) ;
2005-04-16 15:20:36 -07:00
cpu_idle ( ) ;
2007-07-09 18:51:58 +02:00
}
2005-04-16 15:20:36 -07:00
/* Check for early params. */
static int __init do_early_param ( char * param , char * val )
{
2010-08-11 23:04:18 -06:00
const struct obs_kernel_param * p ;
2005-04-16 15:20:36 -07:00
for ( p = __setup_start ; p < __setup_end ; p + + ) {
2011-10-10 00:03:37 +02:00
if ( ( p - > early & & parameq ( param , p - > str ) ) | |
serial: convert early_uart to earlycon for 8250
Because SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h, the serial8250_ports need to be probed late in
the serial initialization stage. The call chain console_init =>
serial8250_console_init => register_console => serial8250_console_setup
returns -ENODEV, so console ttyS0 cannot be enabled at that time. It has to
wait until uart_add_one_port in drivers/serial/serial_core.c calls
register_console to get console ttyS0, which is too late.
Make early_uart use early_param, so the uart console can be used earlier. Make
it a boot console with the CON_BOOT flag, so it can use the console handover
feature and switch to the corresponding normal serial console automatically.
new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
it will print in very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
later for console it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-15 23:37:59 -07:00
( strcmp ( param , " console " ) = = 0 & &
strcmp ( p - > str , " earlycon " ) = = 0 )
) {
2005-04-16 15:20:36 -07:00
if ( p - > setup_func ( val ) ! = 0 )
printk ( KERN_WARNING
" Malformed early option '%s' \n " , param ) ;
}
}
/* We accept everything at this stage. */
return 0 ;
}
2009-03-30 14:37:25 -07:00
void __init parse_early_options ( char * cmdline )
{
parse_args ( " early options " , cmdline , NULL , 0 , do_early_param ) ;
}
2005-04-16 15:20:36 -07:00
/* Arch code calls this early on, or if not, just before other parsing. */
void __init parse_early_param ( void )
{
static __initdata int done = 0 ;
static __initdata char tmp_cmdline [ COMMAND_LINE_SIZE ] ;
if ( done )
return ;
/* All fall through to do_early_param. */
2007-02-12 00:53:52 -08:00
strlcpy ( tmp_cmdline , boot_command_line , COMMAND_LINE_SIZE ) ;
2009-03-30 14:37:25 -07:00
parse_early_options ( tmp_cmdline ) ;
2005-04-16 15:20:36 -07:00
done = 1 ;
}
/*
* Activate the first processor .
*/
2006-03-23 02:59:44 -08:00
static void __init boot_cpu_init ( void )
{
int cpu = smp_processor_id ( ) ;
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
2009-01-01 10:12:15 +10:30
set_cpu_online ( cpu , true ) ;
2009-12-16 18:04:31 +01:00
set_cpu_active ( cpu , true ) ;
2009-01-01 10:12:15 +10:30
set_cpu_present ( cpu , true ) ;
set_cpu_possible ( cpu , true ) ;
2006-03-23 02:59:44 -08:00
}
2008-04-18 16:56:18 +10:00
void __init __weak smp_setup_processor_id ( void )
2006-06-30 01:55:50 -07:00
{
}
2008-04-18 16:56:15 +10:00
void __init __weak thread_info_cache_init ( void )
{
}
2009-06-11 18:29:06 +03:00
/*
* Set up kernel memory allocators
*/
static void __init mm_init ( void )
{
2009-06-12 10:33:53 +03:00
/*
* page_cgroup requires contiguous pages as memmap
* and it ' s bigger than MAX_ORDER unless SPARSEMEM .
*/
page_cgroup_init_flatmem ( ) ;
2009-06-11 18:29:06 +03:00
mem_init ( ) ;
kmem_cache_init ( ) ;
2010-06-27 18:50:00 +02:00
percpu_init_late ( ) ;
2009-06-17 13:48:39 +10:00
pgtable_cache_init ( ) ;
2009-06-11 18:29:06 +03:00
vmalloc_init ( ) ;
}
2005-04-16 15:20:36 -07:00
asmlinkage void __init start_kernel ( void )
{
char * command_line ;
2010-08-11 23:04:18 -06:00
extern const struct kernel_param __start___param [ ] , __stop___param [ ] ;
2006-06-30 01:55:50 -07:00
smp_setup_processor_id ( ) ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 00:24:50 -07:00
/*
* Need to run as early as possible , to initialize the
* lockdep hash :
*/
lockdep_init ( ) ;
2008-04-30 00:55:01 -07:00
debug_objects_early_init ( ) ;
2008-02-14 09:44:08 +01:00
/*
* Set up the initial canary ASAP :
*/
boot_init_stack_canary ( ) ;
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
cgroup_init_early ( ) ;
2006-07-03 00:24:50 -07:00
local_irq_disable ( ) ;
2011-01-20 12:06:35 +01:00
early_boot_irqs_disabled = true ;
2006-07-03 00:24:50 -07:00
2005-04-16 15:20:36 -07:00
/*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 */
2007-02-16 01:28:01 -08:00
tick_init ( ) ;
2006-03-23 02:59:44 -08:00
boot_cpu_init ( ) ;
2005-04-16 15:20:36 -07:00
page_address_init ( ) ;
2009-05-24 15:30:48 +02:00
printk(KERN_NOTICE "%s", linux_banner);
2005-04-16 15:20:36 -07:00
setup_arch ( & command_line ) ;
cgroups: add an owner to the mm_struct
Remove the mem_cgroup member from mm_struct and instead add an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_set by
cgroup_exit(), thus all future charges from running threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 01:00:16 -07:00
mm_init_owner ( & init_mm , & init_task ) ;
2011-05-29 11:32:28 -07:00
mm_init_cpumask ( & init_mm ) ;
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architectures is much too small for
module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value allocates
correspondingly larger static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer, because components may hold
references to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
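The five steps above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: malloc() stands in for the kernel's boot-time allocator, the buffer size and sample string are made up, and dup_line() and setup_command_line_sketch() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COMMAND_LINE_SIZE 256	/* illustrative value only */

/* Stand-in for the __initdata buffer that is discarded after init. */
static char boot_command_line[COMMAND_LINE_SIZE] = "root=/dev/sda1 quiet";

static char *saved_command_line;	/* untouched copy, kept for later reference */
static char *static_command_line;	/* copy that parameter parsing may modify */

/* Hypothetical helper: heap copy sized to the string, not to the maximum. */
static char *dup_line(const char *src)
{
	char *copy = malloc(strlen(src) + 1);

	if (copy)
		strcpy(copy, src);
	return copy;
}

/* Copy both init-time buffers so the originals can be released afterwards. */
static int setup_command_line_sketch(const char *arch_command_line)
{
	assert(arch_command_line != NULL);

	saved_command_line = dup_line(boot_command_line);
	static_command_line = dup_line(arch_command_line);
	if (!saved_command_line || !static_command_line) {
		free(saved_command_line);
		free(static_command_line);
		saved_command_line = NULL;
		static_command_line = NULL;
		return -1;
	}
	return 0;
}
```

Because each copy is sized to the actual string, raising COMMAND_LINE_SIZE in this sketch only grows the disposable init-time buffer, not the long-lived copies.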
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
setup_command_line ( command_line ) ;
2008-03-26 14:23:48 -07:00
setup_nr_cpu_ids ( ) ;
2009-07-21 17:11:50 +09:00
setup_per_cpu_areas ( ) ;
2006-03-23 02:59:44 -08:00
smp_prepare_boot_cpu ( ) ; /* arch-specific boot-cpu hooks */
2005-04-16 15:20:36 -07:00
2010-05-24 14:32:51 -07:00
build_all_zonelists ( NULL ) ;
2009-06-10 19:40:04 +03:00
page_alloc_init ( ) ;
printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
parse_early_param();
parse_args("Booting kernel", static_command_line, __start___param,
	   __stop___param - __start___param,
	   &unknown_bootoption);
2011-10-12 16:17:54 -07:00
jump_label_init ( ) ;
2009-06-10 19:40:04 +03:00
/*
 * These use large bootmem allocations and must precede
 * kmem_cache_init()
 */
2011-05-24 17:13:20 -07:00
setup_log_buf ( 0 ) ;
2009-06-10 19:40:04 +03:00
pidhash_init ( ) ;
vfs_caches_init_early ( ) ;
sort_main_extable ( ) ;
trap_init ( ) ;
2009-06-11 18:29:06 +03:00
mm_init ( ) ;
2011-05-24 17:12:15 -07:00
2005-04-16 15:20:36 -07:00
/*
 * Set up the scheduler prior to starting any interrupts (such as the
 * timer interrupt). Full topology setup happens at smp_init()
 * time - but meanwhile we still have a functioning scheduler.
 */
sched_init ( ) ;
/*
 * Disable preemption - early bootup scheduling is extremely
 * fragile until we cpu_idle() for the first time.
 */
preempt_disable ( ) ;
2007-01-05 16:36:19 -08:00
if (!irqs_disabled()) {
	printk(KERN_WARNING "start_kernel(): bug: interrupts were "
			    "enabled *very* early, fixing it\n");
	local_irq_disable();
}
2010-11-17 23:17:35 +01:00
idr_init_cache ( ) ;
2010-11-17 23:17:33 +01:00
perf_event_init ( ) ;
2005-04-16 15:20:36 -07:00
rcu_init ( ) ;
2010-02-10 01:20:33 -08:00
radix_tree_init ( ) ;
2008-12-05 18:58:31 -08:00
/* init some links before init_ISA_irqs() */
early_irq_init ( ) ;
2005-04-16 15:20:36 -07:00
init_IRQ ( ) ;
2009-06-11 13:22:39 +01:00
prio_tree_init ( ) ;
2005-04-16 15:20:36 -07:00
init_timers ( ) ;
2006-01-09 20:52:32 -08:00
hrtimers_init ( ) ;
2005-04-16 15:20:36 -07:00
softirq_init ( ) ;
2006-06-26 00:25:06 -07:00
timekeeping_init ( ) ;
2006-07-03 00:24:04 -07:00
time_init ( ) ;
2006-07-03 00:24:24 -07:00
profile_init ( ) ;
2011-03-29 12:35:04 -04:00
call_function_init ( ) ;
2006-07-03 00:24:24 -07:00
if ( ! irqs_disabled ( ) )
2008-11-27 02:31:57 +10:30
	printk(KERN_CRIT "start_kernel(): bug: interrupts were "
			 "enabled early\n");
2011-01-20 12:06:35 +01:00
early_boot_irqs_disabled = false ;
2006-07-03 00:24:24 -07:00
local_irq_enable ( ) ;
2009-06-18 13:24:12 +10:00
/* Interrupts are enabled now so all GFP allocations are safe. */
2010-03-05 13:42:13 -08:00
gfp_allowed_mask = __GFP_BITS_MASK ;
2009-06-18 13:24:12 +10:00
2009-06-12 14:03:06 +03:00
kmem_cache_init_late ( ) ;
2005-04-16 15:20:36 -07:00
/*
 * HACK ALERT! This is early. We're enabling the console before
 * we've done PCI setups etc, and console_init() must be aware of
 * this. But we do want output early, in case something goes wrong.
 */
console_init ( ) ;
if ( panic_later )
panic ( panic_later , panic_param ) ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persists during the lifetime of the kernel.
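As a rough illustration of the idea described above, validating a new ordering rule between lock-classes can be thought of as a cycle check on a dependency graph. The sketch below is an assumption-laden toy, not the validator's implementation: the fixed-size adjacency matrix, MAX_CLASSES, and the names lock_graph, reachable() and add_dependency() are all hypothetical, and irq contexts are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8	/* hypothetical limit for this sketch */

/* dep[a][b] is true once "class a held while taking class b" was observed. */
struct lock_graph {
	bool dep[MAX_CLASSES][MAX_CLASSES];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' via recorded rules? */
static bool reachable(const struct lock_graph *g, int from, int to,
		      bool *visited)
{
	if (from == to)
		return true;
	visited[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++) {
		if (g->dep[from][next] && !visited[next] &&
		    reachable(g, next, to, visited))
			return true;
	}
	return false;
}

/*
 * Record the rule "held -> acquired". Returns false (and records nothing)
 * if the new rule would close a cycle, i.e. a possible deadlock.
 */
static bool add_dependency(struct lock_graph *g, int held, int acquired)
{
	bool visited[MAX_CLASSES] = { false };

	assert(held >= 0 && held < MAX_CLASSES);
	assert(acquired >= 0 && acquired < MAX_CLASSES);

	if (reachable(g, acquired, held, visited))
		return false;
	g->dep[held][acquired] = true;
	return true;
}
```

In this toy model, observing A->B and B->C on any task is enough to reject a later C->A, even if the three orderings never ran at the same time. Since the nodes are classes rather than instances, 10,000 inodes would still contribute a single node for ->inotify_mutex.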
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 00:24:50 -07:00
lockdep_info ( ) ;
2006-07-03 00:24:33 -07:00
/*
 * Need to run this when irqs are enabled, because it wants
 * to self-test [hard/soft]-irqs on/off lock inversion bugs
 * too:
 */
locking_selftest ( ) ;
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
2008-07-29 22:33:36 -07:00
    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
2005-04-16 15:20:36 -07:00
	printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
2008-07-17 21:16:36 +02:00
	       "disabling it.\n",
2008-07-29 22:33:36 -07:00
	       page_to_pfn(virt_to_page((void *)initrd_start)),
	       min_low_pfn);
2005-04-16 15:20:36 -07:00
initrd_start = 0 ;
}
# endif
2008-10-22 14:15:05 -07:00
page_cgroup_init ( ) ;
2008-02-09 23:24:09 +01:00
enable_debug_pagealloc ( ) ;
2008-04-30 00:55:01 -07:00
debug_objects_mem_init ( ) ;
2011-05-19 16:25:30 +01:00
kmemleak_init ( ) ;
2005-06-21 17:14:47 -07:00
setup_per_cpu_pageset ( ) ;
2005-04-16 15:20:36 -07:00
numa_policy_init ( ) ;
if ( late_time_init )
late_time_init ( ) ;
2009-08-21 22:01:12 +02:00
sched_clock_init ( ) ;
2005-04-16 15:20:36 -07:00
calibrate_delay ( ) ;
pidmap_init ( ) ;
anon_vma_init ( ) ;
# ifdef CONFIG_X86
if ( efi_enabled )
efi_enter_virtual_mode ( ) ;
# endif
2008-04-18 16:56:15 +10:00
thread_info_cache_init ( ) ;
CRED: Inaugurate COW credentials
Inaugurate copy-on-write credentials management. This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.
A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().
With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:
	struct cred *new = prepare_creds();
	int ret = blah(new);
	if (ret < 0) {
		abort_creds(new);
		return ret;
	}
	return commit_creds(new);
There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.
To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const. The purpose of this is compile-time
discouragement of altering credentials through those pointers. Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:
(1) Its reference count may be incremented and decremented.
(2) The keyrings to which it points may be modified, but not replaced.
The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).
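To make the prepare/modify/commit discipline and the role of the const pointer more concrete, here is a minimal user-space analogue. Everything in it is an assumption for illustration: the *_sketch names, the two-field credential structure and the INVALID_UID_SKETCH sentinel are hypothetical, and the old credentials are simply handed back to the caller rather than being released via RCU as in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

#define INVALID_UID_SKETCH ((unsigned int)-1)	/* hypothetical sentinel */

/* Simplified stand-in for struct cred; the real one has many more fields. */
struct cred_sketch {
	unsigned int uid;
	unsigned int gid;
};

struct task_sketch {
	const struct cred_sketch *cred;	/* const: no in-place modification */
};

/* Copy the task's current credentials into a private, writable object. */
static struct cred_sketch *prepare_creds_sketch(const struct task_sketch *task)
{
	struct cred_sketch *new = malloc(sizeof(*new));

	if (new)
		*new = *task->cred;
	return new;
}

/* Discard a prepared copy that was never published. */
static void abort_creds_sketch(struct cred_sketch *new)
{
	free(new);
}

/* Publish the new credentials and hand the old set back to the caller. */
static const struct cred_sketch *commit_creds_sketch(struct task_sketch *task,
						     struct cred_sketch *new)
{
	const struct cred_sketch *old = task->cred;

	task->cred = new;
	return old;
}

/* Change the uid via prepare/modify/commit; 0 on success, -1 on error. */
static int set_uid_sketch(struct task_sketch *task, unsigned int uid,
			  const struct cred_sketch **old)
{
	struct cred_sketch *new;

	assert(task != NULL && old != NULL);

	new = prepare_creds_sketch(task);
	if (!new)
		return -1;
	if (uid == INVALID_UID_SKETCH) {
		abort_creds_sketch(new);
		return -1;
	}
	new->uid = uid;
	*old = commit_creds_sketch(task, new);
	return 0;
}
```

Writing through task->cred directly would fail to compile here, which is the kind of compile-time discouragement the const declarations are meant to provide.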
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
This now prepares and commits credentials in various places in the
security code rather than altering the current creds directly.
(2) Temporary credential overrides.
do_coredump() and sys_faccessat() now prepare their own credentials and
temporarily override the ones currently on the acting thread, whilst
preventing interference from other threads by holding cred_replace_mutex
on the thread being dumped.
This will be replaced in a future patch by something that hands down the
credentials directly to the functions being called, rather than altering
the task's objective credentials.
(3) LSM interface.
A number of functions have been changed, added or removed:
(*) security_capset_check(), ->capset_check()
(*) security_capset_set(), ->capset_set()
Removed in favour of security_capset().
(*) security_capset(), ->capset()
New. This is passed a pointer to the new creds, a pointer to the old
creds and the proposed capability sets. It should fill in the new
creds or return an error. All pointers, barring the pointer to the
new creds, are now const.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
Changed; now returns a value, which will cause the process to be
killed if it's an error.
(*) security_task_alloc(), ->task_alloc_security()
Removed in favour of security_prepare_creds().
(*) security_cred_free(), ->cred_free()
New. Free security data attached to cred->security.
(*) security_prepare_creds(), ->cred_prepare()
New. Duplicate any security data attached to cred->security.
(*) security_commit_creds(), ->cred_commit()
New. Apply any security effects for the upcoming installation of new
security by commit_creds().
(*) security_task_post_setuid(), ->task_post_setuid()
Removed in favour of security_task_fix_setuid().
(*) security_task_fix_setuid(), ->task_fix_setuid()
Fix up the proposed new credentials for setuid(). This is used by
cap_set_fix_setuid() to implicitly adjust capabilities in line with
setuid() changes. Changes are made to the new credentials, rather
than the task itself as in security_task_post_setuid().
(*) security_task_reparent_to_init(), ->task_reparent_to_init()
Removed. Instead the task being reparented to init is referred
directly to init's credentials.
NOTE! This results in the loss of some state: SELinux's osid no
longer records the sid of the thread that forked it.
(*) security_key_alloc(), ->key_alloc()
(*) security_key_permission(), ->key_permission()
Changed. These now take cred pointers rather than task pointers to
refer to the security context.
(4) sys_capset().
This has been simplified and uses less locking. The LSM functions it
calls have been merged.
(5) reparent_to_kthreadd().
This gives the current thread the same credentials as init by simply using
commit_thread() to point that way.
(6) __sigqueue_alloc() and switch_uid()
__sigqueue_alloc() can't stop the target task from changing its creds
beneath it, so this function gets a reference to the currently applicable
user_struct which it then passes into the sigqueue struct it returns if
successful.
switch_uid() is now called from commit_creds(), and possibly should be
folded into that. commit_creds() should take care of protecting
__sigqueue_alloc().
(7) [sg]et[ug]id() and co and [sg]et_current_groups.
The set functions now all use prepare_creds(), commit_creds() and
abort_creds() to build and check a new set of credentials before applying
it.
security_task_set[ug]id() is called inside the prepared section. This
guarantees that nothing else will affect the creds until we've finished.
The calling of set_dumpable() has been moved into commit_creds().
Much of the functionality of set_user() has been moved into
commit_creds().
The get functions all simply access the data directly.
(8) security_task_prctl() and cap_task_prctl().
security_task_prctl() has been modified to return -ENOSYS if it doesn't
want to handle a function, or otherwise return the return value directly
rather than through an argument.
Additionally, cap_task_prctl() now prepares a new set of credentials, even
if it doesn't end up using it.
(9) Keyrings.
A number of changes have been made to the keyrings code:
(a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
all been dropped and built in to the credentials functions directly.
They may want separating out again later.
(b) key_alloc() and search_process_keyrings() now take a cred pointer
rather than a task pointer to specify the security context.
(c) copy_creds() gives a new thread within the same thread group a new
thread keyring if its parent had one, otherwise it discards the thread
keyring.
(d) The authorisation key now points directly to the credentials to extend
the search into rather than pointing to the task that carries them.
(e) Installing thread, process or session keyrings causes a new set of
credentials to be created, even though it's not strictly necessary for
process or session keyrings (they're shared).
(10) Usermode helper.
The usermode helper code now carries a cred struct pointer in its
subprocess_info struct instead of a new session keyring pointer. This set
of credentials is derived from init_cred and installed on the new process
after it has been cloned.
call_usermodehelper_setup() allocates the new credentials and
call_usermodehelper_freeinfo() discards them if they haven't been used. A
special cred function (prepare_usermodeinfo_creds()) is provided
specifically for call_usermodehelper_setup() to call.
call_usermodehelper_setkeys() adjusts the credentials to sport the
supplied keyring as the new session keyring.
(11) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) selinux_setprocattr() no longer does its check for whether the
current ptracer can access processes with the new SID inside the lock
that covers getting the ptracer's SID. Whilst this lock ensures that
the check is done with the ptracer pinned, the result is only valid
until the lock is released, so there's no point doing it inside the
lock.
(12) is_single_threaded().
This function has been extracted from selinux_setprocattr() and put into
a file of its own in the lib/ directory as join_session_keyring() now
wants to use it too.
The code in SELinux just checked to see whether a task shared mm_structs
with other tasks (CLONE_VM), but that isn't good enough. We really want
to know if they're part of the same thread group (CLONE_THREAD).
(13) nfsd.
The NFS server daemon now has to use the COW credentials to set the
credentials it is going to use. It really needs to pass the credentials
down to the functions it calls, but it can't do that until other patches
in this series have been applied.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:23 +11:00
cred_init ( ) ;
2009-09-21 17:03:05 -07:00
fork_init ( totalram_pages ) ;
2005-04-16 15:20:36 -07:00
proc_caches_init ( ) ;
buffer_init ( ) ;
key_init ( ) ;
security_init ( ) ;
2010-05-20 21:04:29 -05:00
dbg_late_init ( ) ;
2009-09-21 17:03:05 -07:00
vfs_caches_init ( totalram_pages ) ;
2005-04-16 15:20:36 -07:00
signals_init ( ) ;
/* rootfs populating might need page-writeback */
page_writeback_init ( ) ;
# ifdef CONFIG_PROC_FS
proc_root_init ( ) ;
# endif
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
cgroup_init ( ) ;
2005-04-16 15:20:36 -07:00
cpuset_init ( ) ;
2006-07-14 00:24:40 -07:00
taskstats_init_early ( ) ;
2006-07-14 00:24:36 -07:00
delayacct_init ( ) ;
2005-04-16 15:20:36 -07:00
check_bugs ( ) ;
acpi_early_init ( ) ; /* before LAPIC and SMP init */
2009-08-14 15:13:46 -04:00
sfi_init_late ( ) ;
2005-04-16 15:20:36 -07:00
2008-08-14 15:45:08 -04:00
ftrace_init ( ) ;
2005-04-16 15:20:36 -07:00
/* Do the rest non-__init'ed, we're now alive */
rest_init ( ) ;
}
2009-06-17 16:28:03 -07:00
/* Call all constructor functions linked into the kernel. */
static void __init do_ctors ( void )
{
# ifdef CONFIG_CONSTRUCTORS
2009-12-14 18:00:18 -08:00
ctor_fn_t * fn = ( ctor_fn_t * ) __ctors_start ;
2009-06-17 16:28:03 -07:00
2009-12-14 18:00:18 -08:00
for (; fn < (ctor_fn_t *) __ctors_end; fn++)
	(*fn)();
2009-06-17 16:28:03 -07:00
# endif
}
2009-01-07 08:45:46 -08:00
int initcall_debug ;
2008-10-22 10:00:23 -05:00
core_param ( initcall_debug , initcall_debug , bool , 0644 ) ;
2005-04-16 15:20:36 -07:00
tracing: Fix too large stack usage in do_one_initcall()
One of my testboxes triggered this nasty stack overflow crash
during SCSI probing:
[ 5.874004] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5.875004] device: 'sda': device_add
[ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
[ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
[ 5.878004] *pde = 00000000
[ 5.878004] Thread overran stack, or stack corrupted
[ 5.878004] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 5.878004] last sysfs file:
[ 5.878004]
[ 5.878004] Pid: 1, comm: swapper Not tainted (2.6.31-rc6-tip-01272-g9919e28-dirty #5685)
[ 5.878004] EIP: 0060:[<b1008321>] EFLAGS: 00010083 CPU: 0
[ 5.878004] EIP is at print_context_stack+0x81/0x110
[ 5.878004] EAX: cf8a3000 EBX: cf8a3fe4 ECX: 00000049 EDX: 00000000
[ 5.878004] ESI: b1cfce84 EDI: 00000000 EBP: cf8a3018 ESP: cf8a2ff4
[ 5.878004] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 5.878004] Process swapper (pid: 1, ti=cf8a2000 task=cf8a8000 task.ti=cf8a3000)
[ 5.878004] Stack:
[ 5.878004] b1004867 fffff000 cf8a3ffc
[ 5.878004] Call Trace:
[ 5.878004] [<b1004867>] ? kernel_thread_helper+0x7/0x10
[ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
[ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
[ 5.878004] *pde = 00000000
[ 5.878004] Thread overran stack, or stack corrupted
[ 5.878004] Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
The oops did not reveal any more details about the real stack
that we have and the system got into an infinite loop of
recursive pagefaults.
So I booted with CONFIG_STACK_TRACER=y and the 'stacktrace' boot
parameter. The box did not crash (timings/conditions probably
changed a tiny bit to trigger the catastrophic crash), but the
/debug/tracing/stack_trace file was rather revealing:
Depth Size Location (72 entries)
----- ---- --------
0) 3704 52 __change_page_attr+0xb8/0x290
1) 3652 24 __change_page_attr_set_clr+0x43/0x90
2) 3628 60 kernel_map_pages+0x108/0x120
3) 3568 40 prep_new_page+0x7d/0x130
4) 3528 84 get_page_from_freelist+0x106/0x420
5) 3444 116 __alloc_pages_nodemask+0xd7/0x550
6) 3328 36 allocate_slab+0xb1/0x100
7) 3292 36 new_slab+0x1c/0x160
8) 3256 36 __slab_alloc+0x133/0x2b0
9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0
10) 3216 108 create_object+0x28/0x250
11) 3108 40 kmemleak_alloc+0x81/0xc0
12) 3068 24 kmem_cache_alloc+0x162/0x1d0
13) 3044 52 scsi_pool_alloc_command+0x29/0x70
14) 2992 20 scsi_host_alloc_command+0x22/0x70
15) 2972 24 __scsi_get_command+0x1b/0x90
16) 2948 28 scsi_get_command+0x35/0x90
17) 2920 24 scsi_setup_blk_pc_cmnd+0xd4/0x100
18) 2896 128 sd_prep_fn+0x332/0xa70
19) 2768 36 blk_peek_request+0xe7/0x1d0
20) 2732 56 scsi_request_fn+0x54/0x520
21) 2676 12 __generic_unplug_device+0x2b/0x40
22) 2664 24 blk_execute_rq_nowait+0x59/0x80
23) 2640 172 blk_execute_rq+0x6b/0xb0
24) 2468 32 scsi_execute+0xe0/0x140
25) 2436 64 scsi_execute_req+0x152/0x160
26) 2372 60 scsi_vpd_inquiry+0x6c/0x90
27) 2312 44 scsi_get_vpd_page+0x112/0x160
28) 2268 52 sd_revalidate_disk+0x1df/0x320
29) 2216 92 rescan_partitions+0x98/0x330
30) 2124 52 __blkdev_get+0x309/0x350
31) 2072 8 blkdev_get+0xf/0x20
32) 2064 44 register_disk+0xff/0x120
33) 2020 36 add_disk+0x6e/0xb0
34) 1984 44 sd_probe_async+0xfb/0x1d0
35) 1940 44 __async_schedule+0xf4/0x1b0
36) 1896 8 async_schedule+0x12/0x20
37) 1888 60 sd_probe+0x305/0x360
38) 1828 44 really_probe+0x63/0x170
39) 1784 36 driver_probe_device+0x5d/0x60
40) 1748 16 __device_attach+0x49/0x50
41) 1732 32 bus_for_each_drv+0x5b/0x80
42) 1700 24 device_attach+0x6b/0x70
43) 1676 16 bus_attach_device+0x47/0x60
44) 1660 76 device_add+0x33d/0x400
45) 1584 52 scsi_sysfs_add_sdev+0x6a/0x2c0
46) 1532 108 scsi_add_lun+0x44b/0x460
47) 1424 116 scsi_probe_and_add_lun+0x182/0x4e0
48) 1308 36 __scsi_add_device+0xd9/0xe0
49) 1272 44 ata_scsi_scan_host+0x10b/0x190
50) 1228 24 async_port_probe+0x96/0xd0
51) 1204 44 __async_schedule+0xf4/0x1b0
52) 1160 8 async_schedule+0x12/0x20
53) 1152 48 ata_host_register+0x171/0x1d0
54) 1104 60 ata_pci_sff_activate_host+0xf3/0x230
55) 1044 44 ata_pci_sff_init_one+0xea/0x100
56) 1000 48 amd_init_one+0xb2/0x190
57) 952 8 local_pci_probe+0x13/0x20
58) 944 32 pci_device_probe+0x68/0x90
59) 912 44 really_probe+0x63/0x170
60) 868 36 driver_probe_device+0x5d/0x60
61) 832 20 __driver_attach+0x89/0xa0
62) 812 32 bus_for_each_dev+0x5b/0x80
63) 780 12 driver_attach+0x1e/0x20
64) 768 72 bus_add_driver+0x14b/0x2d0
65) 696 36 driver_register+0x6e/0x150
66) 660 20 __pci_register_driver+0x53/0xc0
67) 640 8 amd_init+0x14/0x16
68) 632 572 do_one_initcall+0x2b/0x1d0
69) 60 12 do_basic_setup+0x56/0x6a
70) 48 20 kernel_init+0x84/0xce
71) 28 28 kernel_thread_helper+0x7/0x10
There are a lot of fat functions on that stack trace, but
the largest of all is do_one_initcall(). This is due to
the boot trace entry variables being on the stack.
Fixing this is relatively easy, initcalls are fundamentally
serialized, so we can move the local variables to file scope.
Note that this large stack footprint was present for a
couple of months already - what pushed my system over
the edge was the addition of kmemleak to the call-chain:
6) 3328 36 allocate_slab+0xb1/0x100
7) 3292 36 new_slab+0x1c/0x160
8) 3256 36 __slab_alloc+0x133/0x2b0
9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0
10) 3216 108 create_object+0x28/0x250
11) 3108 40 kmemleak_alloc+0x81/0xc0
12) 3068 24 kmem_cache_alloc+0x162/0x1d0
13) 3044 52 scsi_pool_alloc_command+0x29/0x70
This pushes the total to ~3800 bytes; only a tiny bit
more was needed to corrupt the on-kernel-stack thread_info.
The fix reduces the stack footprint from 572 bytes
to 28 bytes.
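The idea behind the fix can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kernel patch itself: `report_on_stack()`, `report_file_scope()` and `shared_msg` are hypothetical names, and the 64-byte size simply mirrors `msgbuf`. A file-scope buffer is only safe here because the callers are serialized; it would not be safe with concurrent callers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MSG_LEN 64

/* Before: the buffer occupies MSG_LEN bytes in every frame of this function. */
static size_t report_on_stack(int ret)
{
	char msg[MSG_LEN];
	int n = snprintf(msg, sizeof(msg), "error code %d", ret);

	/* The message must have been formatted without truncation. */
	assert(n >= 0 && (size_t)n < sizeof(msg));
	return strlen(msg);
}

/* After: one shared buffer at file scope, so the frame no longer holds it. */
static char shared_msg[MSG_LEN];

static size_t report_file_scope(int ret)
{
	int n = snprintf(shared_msg, sizeof(shared_msg), "error code %d", ret);

	assert(n >= 0 && (size_t)n < sizeof(shared_msg));
	return strlen(shared_msg);
}
```

Both functions produce the same message; the only difference is where the storage lives, which is what shrinks the per-call stack frame.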
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-21 12:53:36 +02:00
static char msgbuf [ 64 ] ;
2010-08-09 17:20:32 -07:00
static int __init_or_module do_one_initcall_debug ( initcall_t fn )
2005-04-16 15:20:36 -07:00
{
2008-11-11 23:24:42 +01:00
ktime_t calltime , delta , rettime ;
2010-05-26 18:57:53 +08:00
unsigned long long duration ;
int ret ;
2005-04-16 15:20:36 -07:00
2010-08-09 17:20:32 -07:00
printk ( KERN_DEBUG " calling %pF @ %i \n " , fn , task_pid_nr ( current ) ) ;
calltime = ktime_get ( ) ;
2010-05-26 18:57:53 +08:00
ret = fn ( ) ;
2010-08-09 17:20:32 -07:00
rettime = ktime_get ( ) ;
delta = ktime_sub ( rettime , calltime ) ;
duration = ( unsigned long long ) ktime_to_ns ( delta ) >> 10 ;
printk ( KERN_DEBUG " initcall %pF returned %d after %lld usecs \n " , fn ,
ret , duration ) ;
2005-04-16 15:20:36 -07:00
2010-08-09 17:20:32 -07:00
return ret ;
}
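Note that `>> 10` divides by 1024 rather than 1000, so the "usecs" figure printed above is a cheap approximation that reads roughly 2.3% low. A minimal sketch of the difference, using hypothetical helper names that are not part of the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Approximation in the style of the debug path: shift instead of divide. */
static uint64_t ns_to_usecs_approx(uint64_t ns)
{
	return ns >> 10;	/* divides by 1024 */
}

/* Exact conversion, for comparison. */
static uint64_t ns_to_usecs_exact(uint64_t ns)
{
	return ns / 1000;
}

/* How many microseconds the shifted value under-reports. */
static uint64_t usecs_underreport(uint64_t ns)
{
	uint64_t approx = ns_to_usecs_approx(ns);
	uint64_t exact = ns_to_usecs_exact(ns);

	/* Dividing by the larger constant can never give the larger result. */
	assert(approx <= exact);
	return exact - approx;
}
```

For debug timing output the small error is usually acceptable in exchange for avoiding a 64-bit division.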
2010-08-09 17:20:32 -07:00
int __init_or_module do_one_initcall ( initcall_t fn )
2010-08-09 17:20:32 -07:00
{
int count = preempt_count ( ) ;
int ret ;
if ( initcall_debug )
ret = do_one_initcall_debug ( fn ) ;
else
ret = fn ( ) ;
2007-05-08 00:28:26 -07:00
2008-05-15 18:14:01 -07:00
msgbuf [ 0 ] = 0 ;
2008-05-12 14:02:22 -07:00
2010-05-26 18:57:53 +08:00
if ( ret && ret != - ENODEV && initcall_debug )
sprintf ( msgbuf , " error code %d " , ret ) ;
2008-05-12 14:02:22 -07:00
2008-05-15 18:14:01 -07:00
if ( preempt_count ( ) != count ) {
2008-05-15 13:52:41 -07:00
strlcat ( msgbuf , " preemption imbalance " , sizeof ( msgbuf ) ) ;
2008-05-15 18:14:01 -07:00
preempt_count ( ) = count ;
2005-04-16 15:20:36 -07:00
}
2008-05-15 18:14:01 -07:00
if ( irqs_disabled ( ) ) {
2008-05-15 13:52:41 -07:00
strlcat ( msgbuf , " disabled interrupts " , sizeof ( msgbuf ) ) ;
2008-05-15 18:14:01 -07:00
local_irq_enable ( ) ;
}
if ( msgbuf [ 0 ] ) {
2008-10-03 13:38:07 -07:00
printk ( " initcall %pF returned with %s \n " , fn , msgbuf ) ;
2008-05-15 18:14:01 -07:00
}
2008-07-30 12:49:02 -07:00
2010-05-26 18:57:53 +08:00
return ret ;
2008-05-15 18:14:01 -07:00
}
2008-07-25 19:45:11 -07:00
extern initcall_t __initcall_start [ ] , __initcall_end [ ] , __early_initcall_end [ ] ;
2008-05-15 18:14:01 -07:00
static void __init do_initcalls ( void )
{
2009-12-14 18:00:18 -08:00
initcall_t * fn ;
2008-05-15 18:14:01 -07:00
2009-12-14 18:00:18 -08:00
for ( fn = __early_initcall_end ; fn < __initcall_end ; fn ++ )
do_one_initcall ( * fn ) ;
2005-04-16 15:20:36 -07:00
}
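The loop above walks a table of function pointers delimited by start and end symbols. In the kernel those symbols come from the linker; the sketch below is a user-space approximation that uses an ordinary array instead, and `init_a`, `init_b`, `table` and `run_initcalls` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

static int calls_made;

/* Two stand-in initcalls: one succeeds, one reports an error. */
static int init_a(void) { calls_made++; return 0; }
static int init_b(void) { calls_made++; return -1; }

/* Stand-in for the linker-provided table of initcalls. */
static initcall_t table[] = { init_a, init_b };

/* Run every entry in [start, end) in order; return how many failed. */
static int run_initcalls(initcall_t *start, initcall_t *end)
{
	int failures = 0;

	assert(start <= end);
	for (initcall_t *fn = start; fn < end; fn++)
		if ((*fn)() != 0)
			failures++;
	return failures;
}
```

Splitting the same table at an intermediate pointer, as `__early_initcall_end` does, lets one range run before SMP bring-up and the rest afterwards.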
/*
* Ok , the machine is now initialized . None of the devices
* have been touched yet , but the CPU subsystem is up and
* running , and memory and process management works .
*
* Now we can finally start doing some real work . .
*/
static void __init do_basic_setup ( void )
{
2009-03-25 17:06:30 +08:00
cpuset_init_smp ( ) ;
2005-04-16 15:20:36 -07:00
usermodehelper_init ( ) ;
2011-08-03 16:21:21 -07:00
shmem_init ( ) ;
2005-04-16 15:20:36 -07:00
driver_init ( ) ;
2007-02-14 00:33:57 -08:00
init_irq_proc ( ) ;
2009-06-17 16:28:03 -07:00
do_ctors ( ) ;
bootup: move 'usermodehelper_enable()' to the end of do_basic_setup()
Doing it just before starting to call into cpu_idle() made a sick kind
of sense only because the original bug we fixed (see commit
288d5abec831: "Boot up with usermodehelper disabled") was about problems
with some scheduler data structures not being initialized, and they had
better be initialized at that point.
But it really didn't make any other conceptual sense, and doing it after
the initial "schedule()" call for the idle thread actually opened up a
race: what if the main initialization thread did everything without
needing to sleep, and got all the way into user land too? Without
actually having scheduled back to the idle thread?
Now, in normal circumstances that doesn't ever happen, but it looks like
Richard Cochran triggered exactly that on his ARM IXP4xx machines:
"I have some ARM IXP4xx based machines that use the two on chip MAC
ports (aka NPEs). The NPE needs a firmware in order to function.
Ever since the following commit [that 288d5abec831 one], it is no
longer possible to bring up the interfaces during the init scripts."
with a call trace showing an ioctl coming from user space. Richard says:
"The init is busybox, and the startup script does mount, syslogd, and
then ifup, so that all can go by quickly."
The fix is to move the usermodehelper_enable() into the main 'init'
thread, and just put it after we've done all our initcalls. By then,
everything really should be up, but we've obviously not actually started
the user-mode portion of init yet.
Reported-and-tested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-28 10:23:44 -07:00
usermodehelper_enable ( ) ;
2011-09-29 15:09:40 +08:00
do_initcalls ( ) ;
2005-04-16 15:20:36 -07:00
}
2008-07-25 19:45:11 -07:00
static void __init do_pre_smp_initcalls ( void )
2008-07-25 19:45:11 -07:00
{
2009-12-14 18:00:18 -08:00
initcall_t * fn ;
2008-07-25 19:45:11 -07:00
2009-12-14 18:00:18 -08:00
for ( fn = __initcall_start ; fn < __early_initcall_end ; fn ++ )
do_one_initcall ( * fn ) ;
2008-07-25 19:45:11 -07:00
}
2010-08-17 23:52:56 +01:00
static void run_init_process ( const char * init_filename )
2005-04-16 15:20:36 -07:00
{
argv_init [ 0 ] = init_filename ;
2006-10-02 02:18:26 -07:00
kernel_execve ( init_filename , argv_init , envp_init ) ;
2005-04-16 15:20:36 -07:00
}
2007-02-13 13:26:22 +01:00
/* This is a non __init function. Force it to be noinline otherwise gcc
* makes it inline to init ( ) and it becomes part of init.text section
*/
2009-01-06 14:40:38 -08:00
static noinline int init_post ( void )
2007-02-13 13:26:22 +01:00
{
2009-01-07 08:45:46 -08:00
/* need to finish all async __init code before freeing the memory */
async_synchronize_full ( ) ;
2007-02-13 13:26:22 +01:00
free_initmem ( ) ;
mark_rodata_ro ( ) ;
system_state = SYSTEM_RUNNING ;
numa_default_policy ( ) ;
2008-04-30 00:53:03 -07:00
current -> signal -> flags |= SIGNAL_UNKILLABLE ;
2007-02-13 13:26:22 +01:00
if ( ramdisk_execute_command ) {
run_init_process ( ramdisk_execute_command ) ;
printk ( KERN_WARNING " Failed to execute %s \n " ,
ramdisk_execute_command ) ;
}
/*
* We try each of these until one succeeds .
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine .
*/
if ( execute_command ) {
run_init_process ( execute_command ) ;
printk ( KERN_WARNING " Failed to execute %s. Attempting "
" defaults... \n " , execute_command ) ;
}
run_init_process ( " /sbin/init " ) ;
run_init_process ( " /etc/init " ) ;
run_init_process ( " /bin/init " ) ;
run_init_process ( " /bin/sh " ) ;
2010-03-05 13:42:39 -08:00
panic ( " No init found. Try passing init= option to kernel. "
" See Linux Documentation/init.txt for guidance. " ) ;
2007-02-13 13:26:22 +01:00
}
2007-02-26 16:45:41 +01:00
static int __init kernel_init ( void * unused )
2005-04-16 15:20:36 -07:00
{
2010-06-28 16:51:01 +02:00
/*
* Wait until kthreadd is all set-up .
*/
wait_for_completion ( & kthreadd_done ) ;
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the code of
memory policy, because the memory policy was originally applied in the
process's own context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.
In the fast path, however, we don't use a lock to protect them, because
adding a lock may lead to performance regression. But if we don't add a
lock, the task might see no nodes when changing cpuset's mems_allowed to
some non-overlapping set. In order to avoid it, we set all new allowed
nodes, then clear newly disallowed ones.
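The two-step ordering described above can be sketched as follows. This is a single-threaded illustration of the ordering only, assuming a one-word mask; it uses no atomics or memory barriers and is not the kernel's nodemask code. `mask_add_new`, `mask_drop_old` and `mask_update` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: add the newly allowed nodes while keeping the old ones. */
static uint32_t mask_add_new(uint32_t cur, uint32_t new_mask)
{
	return cur | new_mask;
}

/* Step 2: clear every node that is not in the new set. */
static uint32_t mask_drop_old(uint32_t cur, uint32_t new_mask)
{
	return cur & new_mask;
}

/*
 * Update *mask to new_mask in two steps and return the intermediate
 * value a lockless reader could observe between them.
 */
static uint32_t mask_update(uint32_t *mask, uint32_t new_mask)
{
	uint32_t intermediate;

	assert(new_mask != 0);	/* an empty target set makes no sense here */
	*mask = mask_add_new(*mask, new_mask);
	intermediate = *mask;	/* old | new, so never empty */
	*mask = mask_drop_old(*mask, new_mask);
	return intermediate;
}
```

Clearing first and setting afterwards would briefly leave the mask empty when the old and new sets do not overlap, which is exactly the window the commit avoids.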
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 15:31:49 -07:00
/*
* init can allocate pages on any node
*/
2010-03-23 13:35:34 -07:00
set_mems_allowed ( node_states [ N_HIGH_MEMORY ] ) ;
2005-04-16 15:20:36 -07:00
/*
* init can run on any cpu .
*/
2009-03-30 22:05:10 -06:00
set_cpus_allowed_ptr ( current , cpu_all_mask ) ;
2005-04-16 15:20:36 -07:00
2006-10-02 02:19:00 -07:00
cad_pid = task_pid ( current ) ;
2008-01-30 13:33:17 +01:00
smp_prepare_cpus ( setup_max_cpus ) ;
2005-04-16 15:20:36 -07:00
do_pre_smp_initcalls ( ) ;
2010-11-25 18:38:29 +01:00
lockup_detector_init ( ) ;
2005-04-16 15:20:36 -07:00
smp_init ( ) ;
sched_init_smp ( ) ;
do_basic_setup ( ) ;
2010-03-02 23:53:19 -08:00
/* Open the /dev/console on the rootfs, this should never fail */
if ( sys_open ( ( const char __user * ) " /dev/console " , O_RDWR , 0 ) < 0 )
printk ( KERN_WARNING " Warning: unable to open an initial console. \n " ) ;
( void ) sys_dup ( 0 ) ;
( void ) sys_dup ( 0 ) ;
2005-04-16 15:20:36 -07:00
/*
* check if there is an early userspace init . If yes , let it do all
* the work
*/
2005-09-06 15:17:19 -07:00
if ( ! ramdisk_execute_command )
ramdisk_execute_command = " /init " ;
if ( sys_access ( ( const char __user * ) ramdisk_execute_command , 0 ) != 0 ) {
ramdisk_execute_command = NULL ;
2005-04-16 15:20:36 -07:00
prepare_namespace ( ) ;
2005-09-06 15:17:19 -07:00
}
2005-04-16 15:20:36 -07:00
/*
* Ok , we have completed the initial bootup , and
* we're essentially up and running . Get rid of the
* initmem segments and start the user-mode stuff . .
*/
2008-10-31 12:57:20 +01:00
2007-02-13 13:26:22 +01:00
init_post ( ) ;
return 0 ;
2005-04-16 15:20:36 -07:00
}