2005-04-16 15:20:36 -07:00
/*
* linux/init/main.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* GK 2/5/95 - Changed to support mounting root fs via NFS
* Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
* Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
* Simplified starting of init: Michael A. Griffith <grif@acm.org>
*/
# include <linux/types.h>
# include <linux/module.h>
# include <linux/proc_fs.h>
# include <linux/kernel.h>
# include <linux/syscalls.h>
2008-02-14 09:41:09 +01:00
# include <linux/stackprotector.h>
2005-04-16 15:20:36 -07:00
# include <linux/string.h>
# include <linux/ctype.h>
# include <linux/delay.h>
# include <linux/utsname.h>
# include <linux/ioport.h>
# include <linux/init.h>
# include <linux/smp_lock.h>
# include <linux/initrd.h>
# include <linux/bootmem.h>
# include <linux/tty.h>
# include <linux/gfp.h>
# include <linux/percpu.h>
# include <linux/kmod.h>
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
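The lazy scheme described above can be modelled in userspace. The following is a minimal sketch in C, not the kernel implementation: the names lazy_vmap, lazy_map, lazy_unmap and lazy_flush are hypothetical, and a small fixed array of slots stands in for the vmap address space. It shows only the property the paragraph relies on: an unmapped address is not handed out again until a flush has happened, so one flush can retire many unmaps.

```c
#include <assert.h>

/* Hypothetical names throughout; a userspace model, not kernel code. */
#define NR_SLOTS 8

enum slot_state { SLOT_FREE, SLOT_MAPPED, SLOT_LAZY };

struct lazy_vmap {
    enum slot_state slot[NR_SLOTS];
    unsigned int nr_lazy;     /* unmapped but not yet flushed */
    unsigned int nr_flushes;  /* number of (expensive) flushes issued */
};

/* One flush retires every lazily unmapped slot at once. */
static void lazy_flush(struct lazy_vmap *v)
{
    for (int i = 0; i < NR_SLOTS; i++)
        if (v->slot[i] == SLOT_LAZY)
            v->slot[i] = SLOT_FREE;
    v->nr_lazy = 0;
    v->nr_flushes++;
}

/* Returns a slot index, or -1 if the address space is exhausted. */
static int lazy_map(struct lazy_vmap *v)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < NR_SLOTS; i++) {
            if (v->slot[i] == SLOT_FREE) {
                v->slot[i] = SLOT_MAPPED;
                return i;
            }
        }
        /* Nothing free: flush once to reclaim lazy slots, then retry. */
        if (v->nr_lazy == 0)
            break;
        lazy_flush(v);
    }
    return -1;
}

/* Unmapping only marks the slot as lazy; no flush happens here. */
static void lazy_unmap(struct lazy_vmap *v, int i)
{
    assert(i >= 0 && i < NR_SLOTS && v->slot[i] == SLOT_MAPPED);
    v->slot[i] = SLOT_LAZY;
    v->nr_lazy++;
}
```

In this model, unmapping several slots costs no flush at all; the single flush is paid only when a later lazy_map() finds no free slot.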
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on a 4 core, 2 socket Opteron. Results are
in nanoseconds per map+touch+unmap.
threads    vanilla    vmap rewrite
1          14700      2900
2          33600      3000
4          49500      2800
8          70631      2900
So with 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-18 20:27:03 -07:00
# include <linux/vmalloc.h>
2005-04-16 15:20:36 -07:00
# include <linux/kernel_stat.h>
2006-12-07 02:14:08 +01:00
# include <linux/start_kernel.h>
2005-04-16 15:20:36 -07:00
# include <linux/security.h>
2008-06-26 11:21:34 +02:00
# include <linux/smp.h>
2005-04-16 15:20:36 -07:00
# include <linux/workqueue.h>
# include <linux/profile.h>
# include <linux/rcupdate.h>
# include <linux/moduleparam.h>
# include <linux/kallsyms.h>
# include <linux/writeback.h>
# include <linux/cpu.h>
# include <linux/cpuset.h>
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
# include <linux/cgroup.h>
2005-04-16 15:20:36 -07:00
# include <linux/efi.h>
2007-02-16 01:28:01 -08:00
# include <linux/tick.h>
2007-02-17 21:22:39 -08:00
# include <linux/interrupt.h>
2006-07-14 00:24:40 -07:00
# include <linux/taskstats_kern.h>
2006-07-14 00:24:36 -07:00
# include <linux/delayacct.h>
2005-04-16 15:20:36 -07:00
# include <linux/unistd.h>
# include <linux/rmap.h>
# include <linux/mempolicy.h>
# include <linux/key.h>
2006-06-27 02:53:54 -07:00
# include <linux/buffer_head.h>
2008-10-22 14:15:05 -07:00
# include <linux/page_cgroup.h>
2006-07-03 00:24:33 -07:00
# include <linux/debug_locks.h>
2008-04-30 00:55:01 -07:00
# include <linux/debugobjects.h>
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persists for the lifetime of the kernel.
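The lock-class idea can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the lockdep implementation: lock classes are plain integer ids, the names lock_graph, reachable and add_dependency are hypothetical, and only ordering rules are modelled (no irq contexts). Each observed "B taken while A held" becomes an edge between classes, and a new edge is rejected if it would close a cycle, which is how an AB-BA inversion is caught without the deadlock ever occurring.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CLASSES 16

/* dep[a][b] is true once "b was acquired while a was held" was observed. */
struct lock_graph {
    bool dep[MAX_CLASSES][MAX_CLASSES];
};

/* Depth-first search: is 'to' reachable from 'from' via recorded rules? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_CLASSES; next++)
        if (g->dep[from][next] && !seen[next] &&
            reachable(g, next, to, seen))
            return true;
    return false;
}

/*
 * Record "class 'next' taken while class 'held' is held".
 * Returns false, and records nothing, if the rule would close a cycle.
 */
static bool add_dependency(struct lock_graph *g, int held, int next)
{
    bool seen[MAX_CLASSES] = { false };

    assert(held >= 0 && held < MAX_CLASSES);
    assert(next >= 0 && next < MAX_CLASSES);
    if (reachable(g, next, held, seen))
        return false;
    g->dep[held][next] = true;
    return true;
}
```

Because the graph is keyed by class rather than by instance, 10,000 inodes contribute one node, and a rule learned on one inode is checked against every other.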
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes:                  694 [max: 2048]
direct dependencies:          1598 [max: 8192]
indirect dependencies:       17896
all direct dependencies:     16206
dependency chains:            1910 [max: 8192]
in-hardirq chains:              17
in-softirq chains:             105
in-process chains:            1065
stack-trace entries:         38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks:             24
hardirq-unsafe locks:          176
softirq-safe locks:             53
softirq-unsafe locks:          137
irq-safe locks:                 59
irq-unsafe locks:              176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 00:24:50 -07:00
# include <linux/lockdep.h>
2006-12-08 02:38:01 -08:00
# include <linux/pid_namespace.h>
2006-12-19 13:01:28 -08:00
# include <linux/device.h>
2007-05-09 02:34:32 -07:00
# include <linux/kthread.h>
2007-11-09 22:39:39 +01:00
# include <linux/sched.h>
2008-02-06 01:36:44 -08:00
# include <linux/signal.h>
2008-04-29 01:03:13 -07:00
# include <linux/idr.h>
2008-08-14 15:45:08 -04:00
# include <linux/ftrace.h>
2009-01-07 08:45:46 -08:00
# include <linux/async.h>
2008-11-11 23:21:31 +01:00
# include <trace/boot.h>
2005-04-16 15:20:36 -07:00
# include <asm/io.h>
# include <asm/bugs.h>
# include <asm/setup.h>
2005-07-28 21:15:30 -07:00
# include <asm/sections.h>
2006-01-06 00:12:01 -08:00
# include <asm/cacheflush.h>
2008-12-29 13:42:23 -08:00
# include <trace/kmemtrace.h>
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_X86_LOCAL_APIC
# include <asm/smp.h>
# endif
2007-02-26 16:45:41 +01:00
static int kernel_init ( void * ) ;
2005-04-16 15:20:36 -07:00
extern void init_IRQ ( void ) ;
extern void fork_init ( unsigned long ) ;
extern void mca_init ( void ) ;
extern void sbus_init ( void ) ;
extern void prio_tree_init ( void ) ;
extern void radix_tree_init ( void ) ;
extern void free_initmem ( void ) ;
# ifdef CONFIG_ACPI
extern void acpi_early_init ( void ) ;
# else
static inline void acpi_early_init ( void ) { }
# endif
2006-01-06 00:12:01 -08:00
# ifndef CONFIG_DEBUG_RODATA
static inline void mark_rodata_ro ( void ) { }
# endif
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_TC
extern void tc_init ( void ) ;
# endif
rcu: Teach RCU that idle task is not quiescent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. It also cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callbacks() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
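The two definitions of rcu_blocking_is_gp() described in the list above differ only in one extra condition. The following userspace sketch makes that difference concrete; it is an illustration, not the kernel code. The struct boot_state and the two function names are hypothetical stand-ins: online_cpus models num_online_cpus(), and scheduler_running models whether the system has left early boot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
struct boot_state {
    unsigned int online_cpus;  /* models num_online_cpus() */
    bool scheduler_running;    /* false while still in early boot */
};

/* rcuclassic/rcutree flavour: blocking is a grace period on a single CPU. */
static bool blocking_is_gp_classic(const struct boot_state *s)
{
    assert(s->online_cpus >= 1);
    return s->online_cpus == 1;
}

/* rcupreempt flavour: additionally requires that we are still in early boot,
 * because a preemptible read-side critical section may be preempted later. */
static bool blocking_is_gp_preempt(const struct boot_state *s)
{
    assert(s->online_cpus >= 1);
    return s->online_cpus == 1 && !s->scheduler_running;
}
```

On a single CPU after boot, the classic variant still treats a blocking call as a grace period while the preemptible variant falls back to the normal path.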
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:03:42 -08:00
enum system_states system_state __read_mostly ;
2005-04-16 15:20:36 -07:00
EXPORT_SYMBOL ( system_state ) ;
/*
* Boot command-line arguments
*/
# define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
# define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
extern void time_init ( void ) ;
/* Default late time init is NULL. archs can override this later. */
2009-01-06 14:41:10 -08:00
void ( * __initdata late_time_init ) ( void ) ;
2005-04-16 15:20:36 -07:00
extern void softirq_init ( void ) ;
[PATCH] Dynamic kernel command-line: common
The current implementation stores a static command-line buffer of
COMMAND_LINE_SIZE bytes. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
The current kernel command-line size for most architectures is much too small
for module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value enlarges these
static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer, since components may hold a
reference to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamically allocated saved_command_line.
3. Add dynamically allocated static_command_line.
4. During startup copy boot_command_line into saved_command_line, and arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
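Steps 2 to 4 above can be sketched in userspace as follows. This is a minimal illustration under stated assumptions, not the patch itself: malloc() stands in for whatever boot-time allocator the kernel uses, COMMAND_LINE_SIZE is given an arbitrary value, and dup_line and setup_command_line_sketch are hypothetical names. The point it shows is that both copies are sized to the actual string, so the fixed-size boot buffer can be discarded after init.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COMMAND_LINE_SIZE 256 /* illustrative value only */

static char boot_command_line[COMMAND_LINE_SIZE]; /* init-disposable in the kernel */
static char *saved_command_line;  /* untouched copy, e.g. for /proc */
static char *static_command_line; /* copy that parameter parsing may modify */

/* Duplicate a NUL-terminated string into exactly-sized heap storage. */
static char *dup_line(const char *src)
{
    size_t len = strlen(src) + 1;
    char *dst = malloc(len);

    if (dst)
        memcpy(dst, src, len);
    return dst;
}

/* Returns 0 on success, -1 if either allocation failed. */
static int setup_command_line_sketch(const char *arch_command_line)
{
    assert(arch_command_line != NULL);
    saved_command_line = dup_line(boot_command_line);
    static_command_line = dup_line(arch_command_line);
    if (!saved_command_line || !static_command_line)
        return -1;
    return 0;
}
```

Parsing then operates on static_command_line only, so in-place edits made by the parser (such as writing NUL terminators) never reach saved_command_line or the architecture's own buffer.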
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line [ COMMAND_LINE_SIZE ] ;
/* Untouched saved command line (eg. for /proc) */
char * saved_command_line ;
/* Command line for parameter parsing */
static char * static_command_line ;
2005-04-16 15:20:36 -07:00
static char * execute_command ;
2005-09-06 15:17:19 -07:00
static char * ramdisk_execute_command ;
2005-04-16 15:20:36 -07:00
2007-07-15 23:41:07 -07:00
# ifdef CONFIG_SMP
2005-04-16 15:20:36 -07:00
/* Setup configured maximum number of CPUs to activate */
2008-01-30 13:33:17 +01:00
unsigned int __initdata setup_max_cpus = NR_CPUS ;
2006-09-27 01:50:44 -07:00
2005-04-16 15:20:36 -07:00
/*
* Setup routine for controlling SMP activation
*
* Command-line option of "nosmp" or "maxcpus=0" will disable SMP
* activation entirely (the MPS table probe still happens, though).
*
* Command-line option of "maxcpus=<NUM>", where <NUM> is an integer
* greater than 0, limits the maximum number of CPUs activated in
* SMP mode to <NUM>.
*/
2009-01-31 03:36:17 +01:00
void __weak arch_disable_smp_support ( void ) { }
2007-08-16 03:34:22 -04:00
2005-04-16 15:20:36 -07:00
static int __init nosmp ( char * str )
{
2008-01-30 13:33:17 +01:00
setup_max_cpus = 0 ;
2009-01-31 03:36:17 +01:00
arch_disable_smp_support ( ) ;
2007-07-15 23:41:07 -07:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2007-07-15 23:41:07 -07:00
early_param ( "nosmp" , nosmp ) ;
2005-04-16 15:20:36 -07:00
static int __init maxcpus ( char * str )
{
2008-01-30 13:33:17 +01:00
get_option ( & str , & setup_max_cpus ) ;
if ( setup_max_cpus == 0 )
2009-01-31 03:36:17 +01:00
arch_disable_smp_support ( ) ;
2007-08-16 03:34:22 -04:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2007-08-27 16:02:12 +01:00
early_param ( "maxcpus" , maxcpus ) ;
2007-07-15 23:41:07 -07:00
# else
2009-01-31 03:36:17 +01:00
const unsigned int setup_max_cpus = NR_CPUS ;
2007-07-15 23:41:07 -07:00
# endif
/*
* If set, this is an indication to the drivers to reset the underlying
* device before going ahead with the initialization; otherwise a driver might
* rely on the BIOS and skip the reset operation.
*
* This is useful if the kernel is booting in an unreliable environment,
* e.g. a kdump situation where the previous kernel has crashed, the BIOS has
* been skipped and devices will be in an unknown state.
*/
unsigned int reset_devices ;
EXPORT_SYMBOL ( reset_devices ) ;
2005-04-16 15:20:36 -07:00
2006-09-27 01:50:44 -07:00
static int __init set_reset_devices ( char * str )
{
reset_devices = 1 ;
return 1 ;
}
__setup ( "reset_devices" , set_reset_devices ) ;
2005-04-16 15:20:36 -07:00
static char * argv_init [ MAX_INIT_ARGS + 2 ] = { "init" , NULL , } ;
char * envp_init [ MAX_INIT_ENVS + 2 ] = { "HOME=/" , "TERM=linux" , NULL , } ;
static const char * panic_later , * panic_param ;
extern struct obs_kernel_param __setup_start [ ] , __setup_end [ ] ;
static int __init obsolete_checksetup ( char * line )
{
struct obs_kernel_param * p ;
2006-09-26 10:52:32 +02:00
int had_early_param = 0 ;
2005-04-16 15:20:36 -07:00
p = __setup_start ;
do {
int n = strlen ( p->str ) ;
if ( ! strncmp ( line , p->str , n ) ) {
if ( p->early ) {
2006-09-26 10:52:32 +02:00
/* Already done in parse_early_param?
* (Needs exact match on param part).
* Keep iterating, as we can have early
* params and __setups of same names 8( */
2005-04-16 15:20:36 -07:00
if ( line [ n ] == '\0' || line [ n ] == '=' )
2006-09-26 10:52:32 +02:00
had_early_param = 1 ;
2005-04-16 15:20:36 -07:00
} else if ( ! p->setup_func ) {
printk ( KERN_WARNING "Parameter %s is obsolete,"
" ignored\n" , p->str ) ;
return 1 ;
} else if ( p->setup_func ( line + n ) )
return 1 ;
}
p++ ;
} while ( p < __setup_end ) ;
2006-09-26 10:52:32 +02:00
return had_early_param ;
2005-04-16 15:20:36 -07:00
}
/*
* This should be approx 2 Bo*oMips to start (note initial shift), and will
* still work even if initially too large, it will just take slightly longer
*/
unsigned long loops_per_jiffy = ( 1 << 12 ) ;
EXPORT_SYMBOL ( loops_per_jiffy ) ;
static int __init debug_kernel ( char * str )
{
console_loglevel = 10 ;
2008-02-08 04:21:58 -08:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
static int __init quiet_kernel ( char * str )
{
console_loglevel = 4 ;
2008-02-08 04:21:58 -08:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2008-02-08 04:21:58 -08:00
early_param ( "debug" , debug_kernel ) ;
early_param ( "quiet" , quiet_kernel ) ;
2005-04-16 15:20:36 -07:00
static int __init loglevel ( char * str )
{
get_option ( & str , & console_loglevel ) ;
2008-03-04 14:28:31 -08:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
2008-02-08 04:21:58 -08:00
early_param ( "loglevel" , loglevel ) ;
2005-04-16 15:20:36 -07:00
/*
* Unknown boot options get handed to init, unless they look like
* failed parameters
*/
static int __init unknown_bootoption ( char * param , char * val )
{
/* Change NUL term back to "=", to make "param" the whole string. */
if ( val ) {
/* param=val or param="val"? */
if ( val == param + strlen ( param ) + 1 )
val [ -1 ] = '=' ;
else if ( val == param + strlen ( param ) + 2 ) {
val [ -2 ] = '=' ;
memmove ( val - 1 , val , strlen ( val ) + 1 ) ;
val-- ;
} else
BUG ( ) ;
}
/* Handle obsolete-style parameters */
if ( obsolete_checksetup ( param ) )
return 0 ;
/*
2007-10-20 01:28:29 +02:00
* Preemptive maintenance for "why didn't my misspelled command
2005-04-16 15:20:36 -07:00
* line work?"
*/
if ( strchr ( param , '.' ) && ( ! val || strchr ( param , '.' ) < val ) ) {
printk ( KERN_ERR "Unknown boot option `%s': ignoring\n" , param ) ;
return 0 ;
}
if ( panic_later )
return 0 ;
if ( val ) {
/* Environment option */
unsigned int i ;
for ( i = 0 ; envp_init [ i ] ; i++ ) {
if ( i == MAX_INIT_ENVS ) {
panic_later = "Too many boot env vars at `%s'" ;
panic_param = param ;
}
if ( ! strncmp ( param , envp_init [ i ] , val - param ) )
break ;
}
envp_init [ i ] = param ;
} else {
/* Command line option */
unsigned int i ;
for ( i = 0 ; argv_init [ i ] ; i++ ) {
if ( i == MAX_INIT_ARGS ) {
panic_later = "Too many boot init vars at `%s'" ;
panic_param = param ;
}
}
argv_init [ i ] = param ;
}
return 0 ;
}
2008-01-30 13:33:58 +01:00
# ifdef CONFIG_DEBUG_PAGEALLOC
int __read_mostly debug_pagealloc_enabled = 0 ;
# endif
2005-04-16 15:20:36 -07:00
static int __init init_setup ( char * str )
{
unsigned int i ;
execute_command = str ;
/*
* In case LILO is going to boot us with default command line,
* it prepends "auto" before the whole cmdline which makes
* the shell think it should execute a script with such name.
* So we ignore all arguments entered _before_ init=... [MJ]
*/
for ( i = 1 ; i < MAX_INIT_ARGS ; i++ )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( "init=" , init_setup ) ;
2005-09-06 15:17:19 -07:00
static int __init rdinit_setup ( char * str )
{
unsigned int i ;
ramdisk_execute_command = str ;
/* See "auto" comment in init_setup */
for ( i = 1 ; i < MAX_INIT_ARGS ; i++ )
argv_init [ i ] = NULL ;
return 1 ;
}
__setup ( "rdinit=" , rdinit_setup ) ;
2005-04-16 15:20:36 -07:00
# ifndef CONFIG_SMP
# ifdef CONFIG_X86_LOCAL_APIC
static void __init smp_init ( void )
{
APIC_init_uniprocessor ( ) ;
}
# else
# define smp_init() do { } while (0)
# endif
static inline void setup_per_cpu_areas ( void ) { }
2008-03-26 14:23:48 -07:00
static inline void setup_nr_cpu_ids ( void ) { }
2005-04-16 15:20:36 -07:00
static inline void smp_prepare_cpus ( unsigned int maxcpus ) { }
# else
2008-04-04 18:11:02 -07:00
# if NR_CPUS > BITS_PER_LONG
cpumask_t cpu_mask_all __read_mostly = CPU_MASK_ALL ;
EXPORT_SYMBOL ( cpu_mask_all ) ;
# endif
2008-03-26 14:23:48 -07:00
/* Setup number of possible processor ids */
int nr_cpu_ids __read_mostly = NR_CPUS ;
EXPORT_SYMBOL ( nr_cpu_ids ) ;
/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
static void __init setup_nr_cpu_ids ( void )
{
2009-01-01 10:12:19 +10:30
nr_cpu_ids = find_last_bit ( cpumask_bits ( cpu_possible_mask ) , NR_CPUS ) + 1 ;
2008-03-26 14:23:48 -07:00
}
2008-01-30 13:33:32 +01:00
# ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
2006-03-23 03:01:07 -08:00
unsigned long __per_cpu_offset [ NR_CPUS ] __read_mostly ;
2005-04-16 15:20:36 -07:00
EXPORT_SYMBOL ( __per_cpu_offset ) ;
static void __init setup_per_cpu_areas ( void )
{
unsigned long size , i ;
char * ptr ;
2006-03-23 03:01:04 -08:00
unsigned long nr_possible_cpus = num_possible_cpus ( ) ;
2005-04-16 15:20:36 -07:00
/* Copy section for each CPU (we discard the original) */
2007-05-02 19:27:12 +02:00
size = ALIGN ( PERCPU_ENOUGH_ROOM , PAGE_SIZE ) ;
ptr = alloc_bootmem_pages ( size * nr_possible_cpus ) ;
2005-04-16 15:20:36 -07:00
2006-03-28 01:56:37 -08:00
for_each_possible_cpu ( i ) {
2005-04-16 15:20:36 -07:00
__per_cpu_offset [ i ] = ptr - __per_cpu_start ;
memcpy ( ptr , __per_cpu_start , __per_cpu_end - __per_cpu_start ) ;
2006-03-23 03:01:04 -08:00
ptr += size ;
2005-04-16 15:20:36 -07:00
}
}
2008-01-30 13:33:32 +01:00
# endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
2005-04-16 15:20:36 -07:00
/* Called by boot processor to activate the rest. */
static void __init smp_init ( void )
{
2007-02-20 13:57:51 -08:00
unsigned int cpu ;
2005-04-16 15:20:36 -07:00
2008-07-15 04:43:49 -07:00
/*
* Set up the current CPU as possible to migrate to.
* The other ones will be done by cpu_up/cpu_down()
*/
2009-03-30 22:05:12 -06:00
set_cpu_active ( smp_processor_id ( ) , true ) ;
2008-07-15 04:43:49 -07:00
2005-04-16 15:20:36 -07:00
/* FIXME: This should be done in userspace --RR */
2007-02-20 13:57:51 -08:00
for_each_present_cpu ( cpu ) {
2008-01-30 13:33:17 +01:00
if (num_online_cpus() >= setup_max_cpus)
2005-04-16 15:20:36 -07:00
break ;
2007-02-20 13:57:51 -08:00
if ( ! cpu_online ( cpu ) )
cpu_up ( cpu ) ;
2005-04-16 15:20:36 -07:00
}
/* Any cleanup work */
printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus());
2008-01-30 13:33:17 +01:00
smp_cpus_done ( setup_max_cpus ) ;
2005-04-16 15:20:36 -07:00
}
# endif
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
The current kernel command-line size on most architectures is much too small for
module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value allocates larger
static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer so that components may hold
references to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
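Steps 2 to 5 above can be sketched in userspace as follows. This is only an illustration of the copy-then-parse-in-place idea: struct cmdline_copies, dup_exact() and cmdline_copies_init() are hypothetical names, and malloc() stands in for the boot-time allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pair of buffers mirroring saved/static command lines. */
struct cmdline_copies {
	char *saved;	/* untouched copy, kept for future reference */
	char *parsed;	/* copy that the parser may modify in place */
};

/* Duplicate s into a buffer sized exactly strlen(s) + 1. */
static char *dup_exact(const char *s)
{
	size_t len = strlen(s) + 1;
	char *p = malloc(len);

	if (p != NULL)
		memcpy(p, s, len);
	return p;
}

/* Returns 0 on success, -1 if either allocation fails. */
static int cmdline_copies_init(struct cmdline_copies *c,
			       const char *boot, const char *arch)
{
	assert(c != NULL && boot != NULL && arch != NULL);

	c->saved = dup_exact(boot);
	c->parsed = dup_exact(arch);
	if (c->saved == NULL || c->parsed == NULL) {
		free(c->saved);
		free(c->parsed);
		c->saved = c->parsed = NULL;
		return -1;
	}
	return 0;
}
```

Because the two copies are independent, a parser that writes NUL terminators into the parsed copy leaves the saved copy intact, and the original fixed-size source buffers can be discarded afterwards.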
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
/*
* We need to store the untouched command line for future reference.
* We also need to store the touched command line since the parameter
* parsing is performed in place, and we should allow a component to
* store reference of name/value for future reference.
*/
static void __init setup_command_line ( char * command_line )
{
saved_command_line = alloc_bootmem ( strlen ( boot_command_line ) + 1 ) ;
static_command_line = alloc_bootmem ( strlen ( command_line ) + 1 ) ;
strcpy ( saved_command_line , boot_command_line ) ;
strcpy ( static_command_line , command_line ) ;
}
2005-04-16 15:20:36 -07:00
/*
* We need to finalize in a non-__init function or else race conditions
* between the root thread and the init thread may cause start_kernel to
* be reaped by free_initmem before the root thread has proceeded to
* cpu_idle.
*
* gcc-3.4 accidentally inlines this function, so use noinline.
*/
2009-01-06 14:40:38 -08:00
static noinline void __init_refok rest_init ( void )
2005-04-16 15:20:36 -07:00
__releases ( kernel_lock )
{
2007-05-09 02:34:32 -07:00
int pid ;
2007-02-26 16:45:41 +01:00
kernel_thread ( kernel_init , NULL , CLONE_FS | CLONE_SIGHAND ) ;
2005-04-16 15:20:36 -07:00
numa_default_policy ( ) ;
2007-05-09 02:34:32 -07:00
pid = kernel_thread ( kthreadd , NULL , CLONE_FS | CLONE_FILES ) ;
2008-04-30 00:54:24 -07:00
kthreadd_task = find_task_by_pid_ns ( pid , & init_pid_ns ) ;
2005-04-16 15:20:36 -07:00
unlock_kernel ( ) ;
2005-06-28 16:40:42 +02:00
/*
* The boot idle thread must execute schedule()
2007-07-09 18:51:58 +02:00
* at least once to get things moving:
2005-06-28 16:40:42 +02:00
*/
2007-07-09 18:51:58 +02:00
init_idle_bootup_task ( current ) ;
rcu: Teach RCU that idle task is not quiescent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. It also cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:03:42 -08:00
rcu_scheduler_starting ( ) ;
2005-11-08 21:39:01 -08:00
preempt_enable_no_resched ( ) ;
2005-06-28 16:40:42 +02:00
schedule ( ) ;
2005-11-08 21:39:01 -08:00
preempt_disable ( ) ;
2005-06-28 16:40:42 +02:00
2005-11-08 21:39:01 -08:00
/* Call into cpu_idle with preempt disabled */
2005-04-16 15:20:36 -07:00
cpu_idle ( ) ;
2007-07-09 18:51:58 +02:00
}
2005-04-16 15:20:36 -07:00
/* Check for early params. */
static int __init do_early_param ( char * param , char * val )
{
struct obs_kernel_param * p ;
for (p = __setup_start; p < __setup_end; p++) {
serial: convert early_uart to earlycon for 8250
Because SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h, the serial8250_ports need to be probed late in
the serial initialization stage. The console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup chain will return -ENODEV, and console
ttyS0 cannot be enabled at that time. It has to wait until uart_add_one_port in
drivers/serial/serial_core.c calls register_console to get console ttyS0,
which is too late.
Make early_uart use early_param, so the uart console can be used earlier. Make
it a boot console with the CON_BOOT flag, so it can use the console handover
feature and switch to the corresponding normal serial console automatically.
The new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
At a very early stage it will print:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
Later, for the console, it will print:
console handover: boot [uart0] -> real [ttyS0]
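The option strings shown above have the shape uart8250,<io|mmio>,<address>,<options>. A minimal userspace sketch of splitting such a string follows. It is not the 8250 driver's parser: struct earlycon_opts and parse_uart8250() are hypothetical names, and the field sizes are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing "uart8250,<io|mmio>,<addr>[,<options>]". */
struct earlycon_opts {
	int is_mmio;		/* 0 for I/O port space, 1 for MMIO */
	unsigned long addr;	/* port number or physical address */
	char options[16];	/* e.g. "9600n8"; empty if absent */
};

/* Returns 0 on success, -1 if the string does not match the expected shape. */
static int parse_uart8250(const char *arg, struct earlycon_opts *out)
{
	const char *p = arg;
	char *end;

	assert(arg != NULL && out != NULL);

	if (strncmp(p, "uart8250,", 9) != 0)
		return -1;
	p += 9;

	if (strncmp(p, "io,", 3) == 0) {
		out->is_mmio = 0;
		p += 3;
	} else if (strncmp(p, "mmio,", 5) == 0) {
		out->is_mmio = 1;
		p += 5;
	} else {
		return -1;
	}

	/* Base 0 lets strtoul accept the 0x prefix used in the examples. */
	out->addr = strtoul(p, &end, 0);
	if (end == p)
		return -1;

	out->options[0] = '\0';
	if (*end == ',')
		snprintf(out->options, sizeof(out->options), "%s", end + 1);
	else if (*end != '\0')
		return -1;
	return 0;
}
```

The same routine would serve both the console= and earlycon= spellings, since only the value after the equals sign differs in who receives it.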
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-15 23:37:59 -07:00
if ((p->early && strcmp(param, p->str) == 0) ||
(strcmp(param, "console") == 0 &&
strcmp(p->str, "earlycon") == 0)
) {
2005-04-16 15:20:36 -07:00
if (p->setup_func(val) != 0)
printk(KERN_WARNING
"Malformed early option '%s'\n", param);
}
}
/* We accept everything at this stage. */
return 0 ;
}
2009-03-30 14:37:25 -07:00
void __init parse_early_options ( char * cmdline )
{
parse_args("early options", cmdline, NULL, 0, do_early_param);
}
2005-04-16 15:20:36 -07:00
/* Arch code calls this early on, or if not, just before other parsing. */
void __init parse_early_param ( void )
{
static __initdata int done = 0 ;
static __initdata char tmp_cmdline [ COMMAND_LINE_SIZE ] ;
if ( done )
return ;
/* All fall through to do_early_param. */
2007-02-12 00:53:52 -08:00
strlcpy ( tmp_cmdline , boot_command_line , COMMAND_LINE_SIZE ) ;
2009-03-30 14:37:25 -07:00
parse_early_options ( tmp_cmdline ) ;
2005-04-16 15:20:36 -07:00
done = 1 ;
}
/*
* Activate the first processor.
*/
2006-03-23 02:59:44 -08:00
static void __init boot_cpu_init ( void )
{
int cpu = smp_processor_id ( ) ;
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
2009-01-01 10:12:15 +10:30
set_cpu_online ( cpu , true ) ;
set_cpu_present ( cpu , true ) ;
set_cpu_possible ( cpu , true ) ;
2006-03-23 02:59:44 -08:00
}
2008-04-18 16:56:18 +10:00
void __init __weak smp_setup_processor_id ( void )
2006-06-30 01:55:50 -07:00
{
}
2008-04-18 16:56:15 +10:00
void __init __weak thread_info_cache_init ( void )
{
}
2005-04-16 15:20:36 -07:00
asmlinkage void __init start_kernel ( void )
{
char * command_line ;
extern struct kernel_param __start___param [ ] , __stop___param [ ] ;
2006-06-30 01:55:50 -07:00
smp_setup_processor_id ( ) ;
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelihood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemeal-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
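The rule validation described above can be illustrated with a toy dependency graph over lock classes. This sketch is far simpler than lockdep itself (no irq contexts, no hashing, a fixed class count), and struct lock_graph, lock_graph_add() and MAX_CLASSES are hypothetical names used only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8

/* edge[a][b] is true once "class a held while taking class b" was seen. */
struct lock_graph {
	bool edge[MAX_CLASSES][MAX_CLASSES];
};

static void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: can class 'to' be reached from class 'from'? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
	int next;

	if (from == to)
		return true;
	seen[from] = true;
	for (next = 0; next < MAX_CLASSES; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record the rule "held is taken before wanted". If the opposite order is
 * already implied by existing rules, adding it would close a cycle (a
 * possible deadlock), so the rule is rejected and false is returned.
 * Taking a class while already holding it is rejected the same way.
 */
static bool lock_graph_add(struct lock_graph *g, int held, int wanted)
{
	bool seen[MAX_CLASSES] = { false };

	assert(g != NULL);
	assert(held >= 0 && held < MAX_CLASSES);
	assert(wanted >= 0 && wanted < MAX_CLASSES);

	if (reachable(g, wanted, held, seen))
		return false;
	g->edge[held][wanted] = true;
	return true;
}
```

Because the graph is keyed by class rather than by lock instance, the three rules 0 before 1, 1 before 2 and 2 before 0 are flagged as inconsistent even if they were each observed on different tasks at different times, which mirrors the "piecemeal" detection the message describes.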
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also be found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 00:24:50 -07:00
/*
* Need to run as early as possible, to initialize the
* lockdep hash:
*/
lockdep_init ( ) ;
2008-04-30 00:55:01 -07:00
debug_objects_early_init ( ) ;
2008-02-14 09:44:08 +01:00
/*
* Set up the initial canary ASAP:
*/
boot_init_stack_canary ( ) ;
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
cgroup_init_early ( ) ;
2006-07-03 00:24:50 -07:00
local_irq_disable ( ) ;
early_boot_irqs_off ( ) ;
2006-07-03 00:25:06 -07:00
early_init_irq_lock_class ( ) ;
2006-07-03 00:24:50 -07:00
2005-04-16 15:20:36 -07:00
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
lock_kernel ( ) ;
2007-02-16 01:28:01 -08:00
tick_init ( ) ;
2006-03-23 02:59:44 -08:00
boot_cpu_init ( ) ;
2005-04-16 15:20:36 -07:00
page_address_init ( ) ;
printk ( KERN_NOTICE ) ;
2006-12-11 09:28:46 -08:00
printk ( linux_banner ) ;
2005-04-16 15:20:36 -07:00
setup_arch ( & command_line ) ;
cgroups: add an owner to the mm_struct
Remove the mem_cgroup member from mm_struct and instead add an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_set by
cgroup_exit(), thus all future charges from running threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 01:00:16 -07:00
mm_init_owner(&init_mm, &init_task);
[PATCH] Dynamic kernel command-line: common
The current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures store two copies of this buffer,
one for future reference and one for parameter parsing.
The current kernel command-line size for most architectures is much too small
for module parameters, video settings, initramfs parameters and much more. The
problem is that setting COMMAND_LINE_SIZE to a greater value enlarges these
static buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as a dynamically allocated buffer to match
the data in boot_command_line.
It also marks the secondary command-line buffer as __initdata, and copies it to
a dynamically allocated static_command_line buffer, since components may hold
references to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this on any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple: use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamically allocated saved_command_line.
3. Add dynamically allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
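The copy step in items 2 to 4 can be modelled in user space roughly as follows. This is a sketch under assumptions, not the kernel code: malloc() stands in for the boot-time allocator, the buffer size and its contents are made up, and dup_line() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's __initdata buffer (size and contents made up). */
static char boot_command_line[256] = "root=/dev/sda1 ro quiet";

/* Dynamically allocated copies that outlive initialization. */
static char *saved_command_line;	/* untouched copy kept for later reference */
static char *static_command_line;	/* copy handed to the parameter parser */

/* Hypothetical helper: duplicate 'src' into fresh memory, NULL on failure. */
static char *dup_line(const char *src)
{
	size_t len = strlen(src) + 1;
	char *dst = malloc(len);

	if (dst)
		memcpy(dst, src, len);
	return dst;
}

/* Returns 0 on success, -1 if either allocation failed. */
static int setup_command_line(const char *arch_command_line)
{
	assert(arch_command_line != NULL);

	saved_command_line = dup_line(boot_command_line);
	static_command_line = dup_line(arch_command_line);
	if (!saved_command_line || !static_command_line) {
		free(saved_command_line);
		free(static_command_line);
		saved_command_line = static_command_line = NULL;
		return -1;
	}
	return 0;
}
```

Each copy is sized to the actual string rather than to COMMAND_LINE_SIZE, which is what lets the init-time buffers be discarded afterwards.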
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 00:53:52 -08:00
setup_command_line(command_line);
2005-04-16 15:20:36 -07:00
setup_per_cpu_areas();
2008-03-26 14:23:48 -07:00
setup_nr_cpu_ids();
2006-03-23 02:59:44 -08:00
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
2005-04-16 15:20:36 -07:00
/*
 * Set up the scheduler prior to starting any interrupts (such as the
 * timer interrupt). Full topology setup happens at smp_init()
 * time - but meanwhile we still have a functioning scheduler.
 */
sched_init();
/*
 * Disable preemption - early bootup scheduling is extremely
 * fragile until we cpu_idle() for the first time.
 */
preempt_disable();
build_all_zonelists();
page_alloc_init();
2007-02-12 00:53:52 -08:00
printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
2005-04-16 15:20:36 -07:00
parse_early_param();
2007-02-12 00:53:52 -08:00
parse_args("Booting kernel", static_command_line, __start___param,
2005-04-16 15:20:36 -07:00
	   __stop___param - __start___param,
	   &unknown_bootoption);
2007-01-05 16:36:19 -08:00
if (!irqs_disabled()) {
	printk(KERN_WARNING "start_kernel(): bug: interrupts were "
	       "enabled *very* early, fixing it\n");
	local_irq_disable();
}
2005-04-16 15:20:36 -07:00
sort_main_extable();
trap_init();
rcu_init();
2008-12-05 18:58:31 -08:00
/* init some links before init_ISA_irqs() */
early_irq_init();
2005-04-16 15:20:36 -07:00
init_IRQ();
pidhash_init();
init_timers();
2006-01-09 20:52:32 -08:00
hrtimers_init();
2005-04-16 15:20:36 -07:00
softirq_init();
2006-06-26 00:25:06 -07:00
timekeeping_init();
2006-07-03 00:24:04 -07:00
time_init();
2008-05-03 18:29:28 +02:00
sched_clock_init();
2006-07-03 00:24:24 -07:00
profile_init();
if (!irqs_disabled())
2008-11-27 02:31:57 +10:30
	printk(KERN_CRIT "start_kernel(): bug: interrupts were "
	       "enabled early\n");
2006-07-03 00:24:50 -07:00
early_boot_irqs_on();
2006-07-03 00:24:24 -07:00
local_irq_enable();
2005-04-16 15:20:36 -07:00
/*
 * HACK ALERT! This is early. We're enabling the console before
 * we've done PCI setups etc, and console_init() must be aware of
 * this. But we do want output early, in case something goes wrong.
 */
console_init();
if (panic_later)
	panic(panic_later, panic_param);
2006-07-03 00:24:50 -07:00
lockdep_info();
2006-07-03 00:24:33 -07:00
/*
 * Need to run this when irqs are enabled, because it wants
 * to self-test [hard/soft]-irqs on/off lock inversion bugs
 * too:
 */
locking_selftest();
2005-04-16 15:20:36 -07:00
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
2008-07-29 22:33:36 -07:00
    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
2005-04-16 15:20:36 -07:00
	printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
2008-07-17 21:16:36 +02:00
	       "disabling it.\n",
2008-07-29 22:33:36 -07:00
	       page_to_pfn(virt_to_page((void *)initrd_start)),
	       min_low_pfn);
2005-04-16 15:20:36 -07:00
	initrd_start = 0;
}
#endif
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of the vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
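The batching effect of lazy unmapping can be illustrated with a toy user-space model. Everything here is an assumption made for illustration: the struct lazy_vmap type, the LAZY_MAX threshold and the counters are invented, and a "flush" is only simulated by incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX 4	/* hypothetical batch threshold */

struct lazy_vmap {
	unsigned long pending[LAZY_MAX];	/* unmapped but not yet flushed */
	size_t nr_pending;
	unsigned long nr_flushes;		/* simulated global TLB flushes */
	unsigned long nr_freed;			/* addresses made reusable again */
};

/* Simulated global flush: every pending address becomes reusable at once. */
static void lazy_flush(struct lazy_vmap *lv)
{
	if (lv->nr_pending == 0)
		return;
	lv->nr_flushes++;
	lv->nr_freed += lv->nr_pending;
	lv->nr_pending = 0;
}

/*
 * Unmap an address without flushing. The address is only parked; it is
 * not handed out again until a later flush, so a stale TLB entry for it
 * cannot be hit by a new mapping.
 */
static void lazy_vunmap(struct lazy_vmap *lv, unsigned long addr)
{
	assert(lv->nr_pending < LAZY_MAX);
	lv->pending[lv->nr_pending++] = addr;
	if (lv->nr_pending == LAZY_MAX)
		lazy_flush(lv);
}
```

In this model ten unmaps cost two flushes plus one final drain instead of ten, which is the amortization the paragraph above describes.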
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on a 4-core, 2-socket Opteron. Results are
in nanoseconds per map+touch+unmap.
threads    vanilla    vmap rewrite
1          14700      2900
2          33600      3000
4          49500      2800
8          70631      2900
So with 8 cores, the rewritten version is already 25x faster.
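The speedup per thread count follows directly from dividing the two columns. The small helper below is only a worked calculation over the figures quoted above; the struct and function names are invented for the example.

```c
#include <assert.h>

struct bench_row {
	int threads;
	double vanilla_ns;	/* ns per map+touch+unmap, old code */
	double rewrite_ns;	/* ns per map+touch+unmap, rewritten code */
};

/* Figures copied from the benchmark table in the commit message. */
static const struct bench_row bench_rows[] = {
	{ 1, 14700.0, 2900.0 },
	{ 2, 33600.0, 3000.0 },
	{ 4, 49500.0, 2800.0 },
	{ 8, 70631.0, 2900.0 },
};

#define NR_ROWS (sizeof(bench_rows) / sizeof(bench_rows[0]))

/* Ratio of old to new latency for one row of the table. */
static double speedup(const struct bench_row *row)
{
	assert(row->rewrite_ns > 0.0);
	return row->vanilla_ns / row->rewrite_ns;
}
```

The ratios come out at roughly 5x, 11x, 18x and 24x for 1, 2, 4 and 8 threads: the vanilla latency grows with the thread count while the rewritten path stays near 2900 ns.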
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-18 20:27:03 -07:00
vmalloc_init();
2005-04-16 15:20:36 -07:00
vfs_caches_init_early();
2006-01-08 01:02:01 -08:00
cpuset_init_early();
2008-10-22 14:15:05 -07:00
page_cgroup_init();
2005-04-16 15:20:36 -07:00
mem_init();
2008-02-09 23:24:09 +01:00
enable_debug_pagealloc();
2008-01-25 21:08:01 +01:00
cpu_hotplug_init();
2005-04-16 15:20:36 -07:00
kmem_cache_init();
2008-08-10 20:14:03 +03:00
kmemtrace_init();
2008-04-30 00:55:01 -07:00
debug_objects_mem_init();
2008-04-29 01:03:13 -07:00
idr_init_cache();
2005-06-21 17:14:47 -07:00
setup_per_cpu_pageset();
2005-04-16 15:20:36 -07:00
numa_policy_init();
if (late_time_init)
	late_time_init();
calibrate_delay();
pidmap_init();
pgtable_cache_init();
prio_tree_init();
anon_vma_init();
#ifdef CONFIG_X86
if (efi_enabled)
	efi_enter_virtual_mode();
#endif
2008-04-18 16:56:15 +10:00
thread_info_cache_init();
CRED: Inaugurate COW credentials
Inaugurate copy-on-write credentials management. This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.
A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().
With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:
	struct cred *new = prepare_creds();
	int ret = blah(new);
	if (ret < 0) {
		abort_creds(new);
		return ret;
	}
	return commit_creds(new);
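The same prepare/modify/commit shape can be tried out in user space with a reduced model. This is a sketch under assumptions rather than the kernel API: the *_model names and the two-field struct are invented, malloc()/free() replace the kernel's reference counting and RCU, and the rejected uid value is arbitrary.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced credential set. */
struct cred_model {
	unsigned int uid;
	unsigned int gid;
};

/* Published credentials: const, so they cannot be altered in place. */
static const struct cred_model *current_cred_model;

/* Allocate a private, writable copy of the current credentials. */
static struct cred_model *prepare_creds_model(void)
{
	struct cred_model *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	if (current_cred_model) {
		*new = *current_cred_model;
	} else {
		new->uid = 0;
		new->gid = 0;
	}
	return new;
}

/* Throw away a prepared copy that was never published. */
static void abort_creds_model(struct cred_model *new)
{
	free(new);
}

/* Publish the new copy; the model frees the old one immediately (no RCU). */
static int commit_creds_model(struct cred_model *new)
{
	const struct cred_model *old = current_cred_model;

	assert(new != NULL);
	current_cred_model = new;
	free((void *)old);
	return 0;
}

/* Change the uid via prepare/modify/commit; (unsigned int)-1 is rejected. */
static int set_uid_model(unsigned int uid)
{
	struct cred_model *new = prepare_creds_model();

	if (!new)
		return -1;
	if (uid == (unsigned int)-1) {
		abort_creds_model(new);
		return -1;
	}
	new->uid = uid;
	return commit_creds_model(new);
}
```

A failed update leaves the published pointer untouched, so readers only ever see a complete old set or a complete new set.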
There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.
To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const. The purpose of this is compile-time
discouragement of altering credentials through those pointers. Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:
(1) Its reference count may be incremented and decremented.
(2) The keyrings to which it points may be modified, but not replaced.
The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
This now prepares and commits credentials in various places in the
security code rather than altering the current creds directly.
(2) Temporary credential overrides.
do_coredump() and sys_faccessat() now prepare their own credentials and
temporarily override the ones currently on the acting thread, whilst
preventing interference from other threads by holding cred_replace_mutex
on the thread being dumped.
This will be replaced in a future patch by something that hands down the
credentials directly to the functions being called, rather than altering
the task's objective credentials.
(3) LSM interface.
A number of functions have been changed, added or removed:
(*) security_capset_check(), ->capset_check()
(*) security_capset_set(), ->capset_set()
Removed in favour of security_capset().
(*) security_capset(), ->capset()
New. This is passed a pointer to the new creds, a pointer to the old
creds and the proposed capability sets. It should fill in the new
creds or return an error. All pointers, barring the pointer to the
new creds, are now const.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
Changed; now returns a value, which will cause the process to be
killed if it's an error.
(*) security_task_alloc(), ->task_alloc_security()
Removed in favour of security_prepare_creds().
(*) security_cred_free(), ->cred_free()
New. Free security data attached to cred->security.
(*) security_prepare_creds(), ->cred_prepare()
New. Duplicate any security data attached to cred->security.
(*) security_commit_creds(), ->cred_commit()
New. Apply any security effects for the upcoming installation of new
security by commit_creds().
(*) security_task_post_setuid(), ->task_post_setuid()
Removed in favour of security_task_fix_setuid().
(*) security_task_fix_setuid(), ->task_fix_setuid()
Fix up the proposed new credentials for setuid(). This is used by
cap_set_fix_setuid() to implicitly adjust capabilities in line with
setuid() changes. Changes are made to the new credentials, rather
than the task itself as in security_task_post_setuid().
(*) security_task_reparent_to_init(), ->task_reparent_to_init()
Removed. Instead the task being reparented to init is referred
directly to init's credentials.
NOTE! This results in the loss of some state: SELinux's osid no
longer records the sid of the thread that forked it.
(*) security_key_alloc(), ->key_alloc()
(*) security_key_permission(), ->key_permission()
Changed. These now take cred pointers rather than task pointers to
refer to the security context.
(4) sys_capset().
This has been simplified and uses less locking. The LSM functions it
calls have been merged.
(5) reparent_to_kthreadd().
This gives the current thread the same credentials as init by simply using
commit_creds() to point that way.
(6) __sigqueue_alloc() and switch_uid()
__sigqueue_alloc() can't stop the target task from changing its creds
beneath it, so this function gets a reference to the currently applicable
user_struct which it then passes into the sigqueue struct it returns if
successful.
switch_uid() is now called from commit_creds(), and possibly should be
folded into that. commit_creds() should take care of protecting
__sigqueue_alloc().
(7) [sg]et[ug]id() and co and [sg]et_current_groups.
The set functions now all use prepare_creds(), commit_creds() and
abort_creds() to build and check a new set of credentials before applying
it.
security_task_set[ug]id() is called inside the prepared section. This
guarantees that nothing else will affect the creds until we've finished.
The calling of set_dumpable() has been moved into commit_creds().
Much of the functionality of set_user() has been moved into
commit_creds().
The get functions all simply access the data directly.
(8) security_task_prctl() and cap_task_prctl().
security_task_prctl() has been modified to return -ENOSYS if it doesn't
want to handle a function, or otherwise return the return value directly
rather than through an argument.
Additionally, cap_task_prctl() now prepares a new set of credentials, even
if it doesn't end up using it.
(9) Keyrings.
A number of changes have been made to the keyrings code:
(a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
all been dropped and built in to the credentials functions directly.
They may want separating out again later.
(b) key_alloc() and search_process_keyrings() now take a cred pointer
rather than a task pointer to specify the security context.
(c) copy_creds() gives a new thread within the same thread group a new
thread keyring if its parent had one, otherwise it discards the thread
keyring.
(d) The authorisation key now points directly to the credentials to extend
the search into rather than pointing to the task that carries them.
(e) Installing thread, process or session keyrings causes a new set of
credentials to be created, even though it's not strictly necessary for
process or session keyrings (they're shared).
(10) Usermode helper.
The usermode helper code now carries a cred struct pointer in its
subprocess_info struct instead of a new session keyring pointer. This set
of credentials is derived from init_cred and installed on the new process
after it has been cloned.
call_usermodehelper_setup() allocates the new credentials and
call_usermodehelper_freeinfo() discards them if they haven't been used. A
special cred function (prepare_usermodeinfo_creds()) is provided
specifically for call_usermodehelper_setup() to call.
call_usermodehelper_setkeys() adjusts the credentials to sport the
supplied keyring as the new session keyring.
(11) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) selinux_setprocattr() no longer does its check for whether the
current ptracer can access processes with the new SID inside the lock
that covers getting the ptracer's SID. Whilst this lock ensures that
the check is done with the ptracer pinned, the result is only valid
until the lock is released, so there's no point doing it inside the
lock.
(12) is_single_threaded().
This function has been extracted from selinux_setprocattr() and put into
a file of its own in the lib/ directory as join_session_keyring() now
wants to use it too.
The code in SELinux just checked to see whether a task shared mm_structs
with other tasks (CLONE_VM), but that isn't good enough. We really want
to know if they're part of the same thread group (CLONE_THREAD).
(13) nfsd.
The NFS server daemon now has to use the COW credentials to set the
credentials it is going to use. It really needs to pass the credentials
down to the functions it calls, but it can't do that until other patches
in this series have been applied.
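The return convention from item (8), where a hook returns -ENOSYS for options it does not want to handle and otherwise returns its result directly, might look roughly like this in isolation. The option numbers and the demo_* names are hypothetical; this is not the kernel's prctl() code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical option numbers, for illustration only. */
#define DEMO_OPT_GET_FLAG	1
#define DEMO_OPT_SET_FLAG	2
#define DEMO_OPT_GENERIC	3

static int demo_flag;

/*
 * Security-style hook: handles the options it knows about and returns
 * the result directly.  -ENOSYS means "not mine, try the generic code".
 */
static int demo_hook_prctl(int option, unsigned long arg)
{
	switch (option) {
	case DEMO_OPT_GET_FLAG:
		return demo_flag;
	case DEMO_OPT_SET_FLAG:
		if (arg > 1)
			return -EINVAL;
		demo_flag = (int)arg;
		return 0;
	default:
		return -ENOSYS;
	}
}

/* Caller: give the hook first refusal, then fall back to generic handling. */
static int demo_prctl(int option, unsigned long arg)
{
	int ret = demo_hook_prctl(option, arg);

	if (ret != -ENOSYS)
		return ret;	/* handled (successfully or not) by the hook */

	switch (option) {
	case DEMO_OPT_GENERIC:
		return 42;	/* arbitrary value for the demo */
	default:
		return -EINVAL;
	}
}
```

The point of the convention is that an error from the hook (such as -EINVAL) is final, while -ENOSYS alone lets the generic switch run.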
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:23 +11:00
cred_init ( ) ;
2005-04-16 15:20:36 -07:00
fork_init ( num_physpages ) ;
proc_caches_init ( ) ;
buffer_init ( ) ;
key_init ( ) ;
security_init ( ) ;
vfs_caches_init ( num_physpages ) ;
radix_tree_init ( ) ;
signals_init ( ) ;
/* rootfs populating might need page-writeback */
page_writeback_init ( ) ;
# ifdef CONFIG_PROC_FS
proc_root_init ( ) ;
# endif
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 23:39:30 -07:00
cgroup_init ( ) ;
2005-04-16 15:20:36 -07:00
cpuset_init ( ) ;
2006-07-14 00:24:40 -07:00
taskstats_init_early ( ) ;
2006-07-14 00:24:36 -07:00
delayacct_init ( ) ;
2005-04-16 15:20:36 -07:00
check_bugs ( ) ;
acpi_early_init ( ) ; /* before LAPIC and SMP init */
2008-08-14 15:45:08 -04:00
ftrace_init ( ) ;
2005-04-16 15:20:36 -07:00
/* Do the rest non-__init'ed, we're now alive */
rest_init ( ) ;
}
2009-01-07 08:45:46 -08:00
int initcall_debug ;
2008-10-22 10:00:23 -05:00
core_param ( initcall_debug , initcall_debug , bool , 0644 ) ;
2005-04-16 15:20:36 -07:00
2008-07-30 12:49:02 -07:00
int do_one_initcall ( initcall_t fn )
2005-04-16 15:20:36 -07:00
{
int count = preempt_count ( ) ;
2008-11-11 23:24:42 +01:00
ktime_t calltime , delta , rettime ;
2008-05-15 13:52:41 -07:00
char msgbuf [ 64 ] ;
2008-11-11 23:24:42 +01:00
struct boot_trace_call call ;
struct boot_trace_ret ret ;
2005-04-16 15:20:36 -07:00
2008-05-15 18:14:01 -07:00
if ( initcall_debug ) {
2008-11-11 23:24:42 +01:00
call . caller = task_pid_nr ( current ) ;
printk ( " calling %pF @ %i \n " , fn , call . caller ) ;
calltime = ktime_get ( ) ;
trace_boot_call ( & call , fn ) ;
2008-10-31 12:57:20 +01:00
enable_boot_trace ( ) ;
2008-05-15 18:14:01 -07:00
}
2005-04-16 15:20:36 -07:00
2008-11-11 23:24:42 +01:00
ret . result = fn ( ) ;
2005-04-16 15:20:36 -07:00
2008-05-15 18:14:01 -07:00
if ( initcall_debug ) {
2008-10-31 12:57:20 +01:00
disable_boot_trace ( ) ;
2008-11-11 23:24:42 +01:00
rettime = ktime_get ( ) ;
delta = ktime_sub ( rettime , calltime ) ;
2008-11-21 14:08:59 -08:00
ret . duration = ( unsigned long long ) ktime_to_ns ( delta ) >> 10 ;
2008-11-11 23:24:42 +01:00
trace_boot_ret ( & ret , fn ) ;
2008-10-09 15:23:05 -07:00
printk ( " initcall %pF returned %d after %Ld usecs \n " , fn ,
2008-11-11 23:24:42 +01:00
ret . result , ret . duration ) ;
2008-05-15 18:14:01 -07:00
}
2007-05-08 00:28:26 -07:00
2008-05-15 18:14:01 -07:00
msgbuf [ 0 ] = 0 ;
2008-05-12 14:02:22 -07:00
2008-11-11 23:24:42 +01:00
if ( ret . result && ret . result != - ENODEV && initcall_debug )
sprintf ( msgbuf , " error code %d " , ret . result ) ;
2008-05-12 14:02:22 -07:00
2008-05-15 18:14:01 -07:00
if ( preempt_count ( ) != count ) {
2008-05-15 13:52:41 -07:00
strlcat ( msgbuf , " preemption imbalance " , sizeof ( msgbuf ) ) ;
2008-05-15 18:14:01 -07:00
preempt_count ( ) = count ;
2005-04-16 15:20:36 -07:00
}
2008-05-15 18:14:01 -07:00
if ( irqs_disabled ( ) ) {
2008-05-15 13:52:41 -07:00
strlcat ( msgbuf , " disabled interrupts " , sizeof ( msgbuf ) ) ;
2008-05-15 18:14:01 -07:00
local_irq_enable ( ) ;
}
if ( msgbuf [ 0 ] ) {
2008-10-03 13:38:07 -07:00
printk ( " initcall %pF returned with %s \n " , fn , msgbuf ) ;
2008-05-15 18:14:01 -07:00
}
2008-07-30 12:49:02 -07:00
2008-11-11 23:24:42 +01:00
return ret . result ;
2008-05-15 18:14:01 -07:00
}
2008-07-25 19:45:11 -07:00
extern initcall_t __initcall_start [ ] , __initcall_end [ ] , __early_initcall_end [ ] ;
2008-05-15 18:14:01 -07:00
static void __init do_initcalls ( void )
{
initcall_t * call ;
2008-07-25 19:45:11 -07:00
for ( call = __early_initcall_end ; call < __initcall_end ; call ++ )
2008-05-15 18:14:01 -07:00
do_one_initcall ( * call ) ;
2005-04-16 15:20:36 -07:00
/* Make sure there is no pending stuff from the initcall sequence */
flush_scheduled_work ( ) ;
}
/*
* Ok , the machine is now initialized . None of the devices
* have been touched yet , but the CPU subsystem is up and
* running , and memory and process management works .
*
* Now we can finally start doing some real work . .
*/
static void __init do_basic_setup ( void )
{
rcu: add call_rcu_sched()
Fourth cut of the patch to provide call_rcu_sched(). call_rcu_sched() is to
synchronize_sched() as call_rcu() is to synchronize_rcu().
Should be fine for experimental and -rt use, but not ready for inclusion.
With some luck, I will be able to tell Andrew to come out of hiding on
the next round.
Passes multi-day rcutorture sessions with concurrent CPU hotplugging.
Fixes since the first version include a bug that could result in
indefinite blocking (spotted by Gautham Shenoy), better resiliency
against CPU-hotplug operations, and other minor fixes.
Fixes since the second version include reworking grace-period detection
to avoid deadlocks that could happen when running concurrently with
CPU hotplug, adding Mathieu's fix to avoid the softlockup messages,
as well as Mathieu's fix to allow use earlier in boot.
Fixes since the third version include a wrong-CPU bug spotted by
Andrew, getting rid of the obsolete synchronize_kernel API that somehow
snuck back in, merging spin_unlock() and local_irq_restore() in a
few places, commenting the code that checks for quiescent states based
on interrupting from user-mode execution or the idle loop, removing
some inline attributes, and some code-style changes.
Known/suspected shortcomings:
o I still do not entirely trust the sleep/wakeup logic. Next step
will be to use a private snapshot of the CPU online mask in
rcu_sched_grace_period() -- if the CPU wasn't there at the start
of the grace period, we don't need to hear from it. And the
bit about accounting for changes in online CPUs inside of
rcu_sched_grace_period() is ugly anyway.
o It might be good for rcu_sched_grace_period() to invoke
resched_cpu() when a given CPU wasn't responding quickly,
but resched_cpu() is declared static...
This patch also fixes a long-standing bug in the earlier preemptable-RCU
implementation of synchronize_rcu() that could result in loss of
concurrent external changes to a task's CPU affinity mask. I still cannot
remember who reported this...
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-12 21:21:05 +02:00
rcu_init_sched ( ) ; /* needed by module_init stage. */
2008-10-25 19:53:38 -07:00
init_workqueues ( ) ;
2009-03-25 17:06:30 +08:00
cpuset_init_smp ( ) ;
2005-04-16 15:20:36 -07:00
usermodehelper_init ( ) ;
driver_init ( ) ;
2007-02-14 00:33:57 -08:00
init_irq_proc ( ) ;
2005-04-16 15:20:36 -07:00
do_initcalls ( ) ;
}
2008-07-25 19:45:11 -07:00
static void __init do_pre_smp_initcalls ( void )
2008-07-25 19:45:11 -07:00
{
initcall_t * call ;
for ( call = __initcall_start ; call < __early_initcall_end ; call ++ )
do_one_initcall ( * call ) ;
}
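do_pre_smp_initcalls() and do_initcalls() walk two halves of one table of function pointers, split at __early_initcall_end. In the kernel the three boundary symbols come from the linker script; the sketch below fakes them with an ordinary array, so the demo_* names and the recording helper are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_initcall_t)(void);

/* Record the order in which the demo initcalls run. */
static int demo_log[8];
static size_t demo_log_len;

static int demo_record(int id)
{
	assert(demo_log_len < sizeof(demo_log) / sizeof(demo_log[0]));
	demo_log[demo_log_len++] = id;
	return 0;
}

static int demo_early_a(void) { return demo_record(1); }
static int demo_early_b(void) { return demo_record(2); }
static int demo_late_a(void)  { return demo_record(3); }
static int demo_late_b(void)  { return demo_record(4); }

/* One table; the first DEMO_EARLY_COUNT entries are the early calls. */
static demo_initcall_t demo_initcalls[] = {
	demo_early_a, demo_early_b, demo_late_a, demo_late_b,
};

#define DEMO_EARLY_COUNT 2
#define DEMO_TOTAL (sizeof(demo_initcalls) / sizeof(demo_initcalls[0]))

/* Stand-ins for __initcall_start, __early_initcall_end, __initcall_end. */
static demo_initcall_t *const demo_start = demo_initcalls;
static demo_initcall_t *const demo_early_end = demo_initcalls + DEMO_EARLY_COUNT;
static demo_initcall_t *const demo_end = demo_initcalls + DEMO_TOTAL;

/* Call every entry in the half-open range [from, to). */
static void demo_run_range(demo_initcall_t *from, demo_initcall_t *to)
{
	demo_initcall_t *call;

	for (call = from; call < to; call++)
		(*call)();
}

/* Early calls: [start, early_end), run before SMP bring-up. */
static void demo_do_pre_smp_initcalls(void)
{
	demo_run_range(demo_start, demo_early_end);
}

/* Remaining calls: [early_end, end), run from the basic setup stage. */
static void demo_do_initcalls(void)
{
	demo_run_range(demo_early_end, demo_end);
}
```

Because both ranges are half-open and share the middle boundary, every entry runs exactly once and in table order.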
2005-04-16 15:20:36 -07:00
static void run_init_process ( char * init_filename )
{
argv_init [ 0 ] = init_filename ;
2006-10-02 02:18:26 -07:00
kernel_execve ( init_filename , argv_init , envp_init ) ;
2005-04-16 15:20:36 -07:00
}
2007-02-13 13:26:22 +01:00
/* This is a non __init function. Force it to be noinline otherwise gcc
* makes it inline to init ( ) and it becomes part of init . text section
*/
2009-01-06 14:40:38 -08:00
static noinline int init_post ( void )
2009-03-31 15:23:50 -07:00
__releases ( kernel_lock )
2007-02-13 13:26:22 +01:00
{
2009-01-07 08:45:46 -08:00
/* need to finish all async __init code before freeing the memory */
async_synchronize_full ( ) ;
2007-02-13 13:26:22 +01:00
free_initmem ( ) ;
unlock_kernel ( ) ;
mark_rodata_ro ( ) ;
system_state = SYSTEM_RUNNING ;
numa_default_policy ( ) ;
if ( sys_open ( ( const char __user * ) " /dev/console " , O_RDWR , 0 ) < 0 )
printk ( KERN_WARNING " Warning: unable to open an initial console. \n " ) ;
( void ) sys_dup ( 0 ) ;
( void ) sys_dup ( 0 ) ;
2008-04-30 00:53:03 -07:00
current -> signal -> flags |= SIGNAL_UNKILLABLE ;
2007-02-13 13:26:22 +01:00
if ( ramdisk_execute_command ) {
run_init_process ( ramdisk_execute_command ) ;
printk ( KERN_WARNING " Failed to execute %s \n " ,
ramdisk_execute_command ) ;
}
/*
* We try each of these until one succeeds .
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine .
*/
if ( execute_command ) {
run_init_process ( execute_command ) ;
printk ( KERN_WARNING " Failed to execute %s. Attempting "
" defaults... \n " , execute_command ) ;
}
run_init_process ( " /sbin/init " ) ;
run_init_process ( " /etc/init " ) ;
run_init_process ( " /bin/init " ) ;
run_init_process ( " /bin/sh " ) ;
panic ( " No init found. Try passing init= option to kernel. " ) ;
}
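The tail of init_post() relies on run_init_process() returning only when the exec failed, so each call is effectively "try this, fall through on failure". A user-space sketch of that fallback order follows. The demo_* names, the exec callback and the example path used in the usage note are hypothetical; the four default paths are the ones listed in the source above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical exec hook.  In the kernel a successful exec never
 * returns; here 0 means "would have started" and non-zero means failure.
 */
typedef int (*demo_exec_fn)(const char *path);

/*
 * Return the first candidate that can be started, in the same order as
 * init_post(): ramdisk command, then the init= command, then defaults.
 * NULL corresponds to the point where the kernel panics.
 */
static const char *demo_pick_init(const char *ramdisk_cmd,
				  const char *user_cmd,
				  demo_exec_fn try_exec)
{
	static const char *const defaults[] = {
		"/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
	};
	size_t i;

	assert(try_exec != NULL);

	if (ramdisk_cmd && try_exec(ramdisk_cmd) == 0)
		return ramdisk_cmd;
	if (user_cmd && try_exec(user_cmd) == 0)
		return user_cmd;
	for (i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++)
		if (try_exec(defaults[i]) == 0)
			return defaults[i];
	return NULL;
}

/* Demo hook: pretend only the rescue shell exists. */
static int demo_only_sh_exists(const char *path)
{
	return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

/* Demo hook: pretend nothing can be executed. */
static int demo_nothing_exists(const char *path)
{
	(void)path;
	return -1;
}
```

For example, demo_pick_init(NULL, "/custom/init", demo_only_sh_exists) falls all the way through to "/bin/sh", which mirrors the recovery case mentioned in the comment about the Bourne shell.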
2007-02-26 16:45:41 +01:00
static int __init kernel_init ( void * unused )
2005-04-16 15:20:36 -07:00
{
lock_kernel ( ) ;
/*
* init can run on any cpu .
*/
2009-03-30 22:05:10 -06:00
set_cpus_allowed_ptr ( current , cpu_all_mask ) ;
2005-04-16 15:20:36 -07:00
/*
* Tell the world that we ' re going to be the grim
* reaper of innocent orphaned children .
*
* We don ' t want people to have to make incorrect
* assumptions about where in the task array this
* can be found .
*/
2006-12-08 02:38:01 -08:00
init_pid_ns . child_reaper = current ;
2005-04-16 15:20:36 -07:00
2006-10-02 02:19:00 -07:00
cad_pid = task_pid ( current ) ;
2008-01-30 13:33:17 +01:00
smp_prepare_cpus ( setup_max_cpus ) ;
2005-04-16 15:20:36 -07:00
do_pre_smp_initcalls ( ) ;
2008-09-23 11:38:18 +01:00
start_boot_trace ( ) ;
2005-04-16 15:20:36 -07:00
smp_init ( ) ;
sched_init_smp ( ) ;
do_basic_setup ( ) ;
/*
* check if there is an early userspace init . If yes , let it do all
* the work
*/
2005-09-06 15:17:19 -07:00
if ( ! ramdisk_execute_command )
ramdisk_execute_command = " /init " ;
if ( sys_access ( ( const char __user * ) ramdisk_execute_command , 0 ) != 0 ) {
ramdisk_execute_command = NULL ;
2005-04-16 15:20:36 -07:00
prepare_namespace ( ) ;
2005-09-06 15:17:19 -07:00
}
2005-04-16 15:20:36 -07:00
/*
* Ok , we have completed the initial bootup , and
* we ' re essentially up and running . Get rid of the
* initmem segments and start the user - mode stuff . .
*/
2008-10-31 12:57:20 +01:00
2007-02-13 13:26:22 +01:00
init_post ( ) ;
return 0 ;
2005-04-16 15:20:36 -07:00
}