linux

iv/linux

Author	SHA1	Message	Date
Tim Bird	a202355640	ring-buffer: fix bug in ring_buffer_discard_commit There's a bug in ring_buffer_discard_commit. The wrong pointer is being compared in order to check if the event can be freed from the buffer rather than discarded (i.e. marked as PAD). I noticed this when I was working on duration filtering. The bug is not deadly - it just results in lots of wasted space in the buffer. All filtered events are left in the buffer and marked as discarded, rather than being removed from the buffer to make space for other events. Unfortunately, when I fixed this bug, I got errors doing a filtered function trace. Multiple TIME_EXTEND events pile up in the buffer, and trigger the following loop overage warning in rb_iter_peek(): again: ... if (RB_WARN_ON(cpu_buffer, ++nr_loops > 10)) return NULL; I'm not sure what the best way is to fix this. I don't know if I should extend the loop threshhold, or if I should make the test more complex (ignore TIME_EXTEND events), or just get rid of this loop check completely. Note that if I implement a workaround for this, then I see another problem from rb_advance_iter(). I haven't tracked that one down yet. In general, it seems like the case of removing filtered events has not been working properly, and so some assumptions about buffer invariant conditions need to be revisited. Here's the patch for the simple fix: Compare correct pointer for checking if an event can be freed rather than left as discarded in the buffer. Signed-off-by: Tim Bird <tim.bird@am.sony.com> LKML-Reference: <4A25BE9E.5090909@am.sony.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-03 10:15:06 -04:00
Peter Zijlstra	a96bbc1641	perf_counter: Fix race in counter initialization We need the PID namespace and counter ID available when the counter overflows and we need to generate a sample event. [ Impact: fix kernel crash with high-frequency sampling ] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> [ fixed a further crash and cleaned up the initialization a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-03 14:57:03 +02:00
Peter Zijlstra	226f62fdd5	perf_counter: Add a comm hook for pure fork()s I noticed missing COMM events and found that we missed reporting them for pure forks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-03 14:14:30 +02:00
Steven Rostedt	84047e360a	function-graph: always initialize task ret_stack On creating a new task while running the function graph tracer, if we fail to allocate the ret_stack, and then fail the fork, the code will free the parent ret_stack. This is because the child duplicated the parent and currently points to the parent's ret_stack. This patch always initializes the task's ret_stack to NULL. [ Impact: prevent crash of parent on low memory during fork ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-02 16:51:55 -04:00
Steven Rostedt	f7e8b616ed	function-graph: move initialization of new tasks up in fork When the function graph tracer is enabled, all new tasks must allocate a ret_stack to place the return address of functions. This is because the function graph tracer will replace the real return address with a call to the tracing of the exit function. This initialization happens in fork, but it happens too late. If fork fails, then it will call free_task and that calls the freeing of this ret_stack. But before initialization happens, the new (failed) task points to its parents ret_stack. If a fork failure happens during the function trace, it would be catastrophic for the parent. Also, there's no need to call ftrace_graph_exit_task from fork, since it is called by free_task which fork calls on failure. [ Impact: prevent crash during failed fork running function graph tracer ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-02 16:49:57 -04:00
Peter Zijlstra	0d48696f87	perf_counter: Rename perf_counter_hw_event => perf_counter_attr The structure isn't hw only and when I read event, I think about those things that fall out the other end. Rename the thing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Cc: Stephane Eranian <eranian@googlemail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:33 +02:00
Peter Zijlstra	08247e31ca	perf_counter: Add ioctl for changing the sample period/frequency Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:32 +02:00
Peter Zijlstra	8e3747c13c	perf_counter: Change data head from u32 to u64 Since some people worried that 4G might not be a large enough as an mmap data window, extend it to 64 bit for capable platforms. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:32 +02:00
Peter Zijlstra	b23f3325ed	perf_counter: Rename various fields A few renames: s/irq_period/sample_period/ s/irq_freq/sample_freq/ s/PERF_RECORD_/PERF_SAMPLE_/ s/record_type/sample_type/ And change both the new sample_type and read_format to u64. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:30 +02:00
Peter Zijlstra	8e5799b1ad	perf_counter: Add unique counter id Stephan raised the issue that we currently cannot distinguish between similar counters within a group (PERF_RECORD_GROUP uses the config value as identifier). Therefore, generate a new ID for each counter using a global u64 sequence counter. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:29 +02:00
Steven Rostedt	26c01624a2	function-graph: add memory barriers for accessing task's ret_stack The code that handles the tasks ret_stack allocation for every task assumes that only an interrupt can cause issues (even though interrupts are disabled). In reality, the code is allocating the ret_stack for tasks that may be running on other CPUs and there are not efficient memory barriers to handle this case. [ Impact: prevent crash due to using of uninitialized ret_stack variables ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-02 14:42:17 -04:00
Steven Rostedt	82310a3272	function-graph: enable the stack after initialization of other variables The function graph tracer checks if the task_struct has ret_stack defined to know if it is OK or not to use it. The initialization is done for all tasks by one process, but the idle tasks use the same initialization used by new tasks. If an interrupt happens on an idle task that just had the ret_stack created, but before the rest of the initialization took place, then we can corrupt the return address of the functions. This patch moves the setting of the task_struct's ret_stack to after the other variables have been initialized. [ Impact: prevent kernel panic on idle task when starting function graph ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-02 14:41:50 -04:00
Steven Rostedt	179c498ae2	function-graph: only allocate init tasks if it was not already done When the function graph tracer is enabled, it calls the initialization needed for the init tasks that would be called on all created tasks. The problem is that this is called every time the function graph tracer is enabled, and the ret_stack is allocated for the idle tasks each time. Thus, the old ret_stack is lost and a memory leak is created. This is also dangerous because if an interrupt happened on another CPU with the init task and the ret_stack is replaced, we then lose all the return pointers for the interrupt, and a crash would take place. [ Impact: fix memory leak and possible crash due to race ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-02 12:03:19 -04:00
Peter Zijlstra	709e50cf87	perf_counter: Use PID namespaces properly Stop using task_struct::pid and start using PID namespaces. PIDs will be reported in the PID namespace of the monitoring task at the moment of counter creation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 16:16:25 +02:00
Paul Mackerras	bf4e0ed3d0	perf_counter: Remove unused prev_state field This removes the prev_state field of struct perf_counter since it is now unused. It was only used by the cpu migration counter, which doesn't use it any more. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <18979.35052.915728.626374@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 13:10:55 +02:00
Paul Mackerras	3f731ca60a	perf_counter: Fix cpu migration counter This fixes the cpu migration software counter to count correctly even when contexts get swapped from one task to another. Previously the cpu migration counts reported by perf stat were bogus, ranging from negative to several thousand for a single "lat_ctx 2 8 32" run. With this patch the cpu migration count reported for "lat_ctx 2 8 32" is almost always between 35 and 44. This fixes the problem by adding a call into the perf_counter code from set_task_cpu when tasks are migrated. This enables us to use the generic swcounter code (with some modifications) for the cpu migration counter. This modifies the swcounter code to allow a NULL regs pointer to be passed in to perf_swcounter_ctx_event() etc. The cpu migration counter does this because there isn't necessarily a pt_regs struct for the task available. In this case, the counter will not have interrupt capability - but the migration counter didn't have interrupt capability before, so this is no loss. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <18979.35006.819769.416327@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 13:10:54 +02:00
Paul Mackerras	f38b082081	perf_counter: Initialize per-cpu context earlier on cpu up This arranges for perf_counter's notifier for cpu hotplug operations to be called earlier than the migration notifier in sched.c by increasing its priority to 20, compared to the 10 for the migration notifier. The reason for doing this is that a subsequent commit to convert the cpu migration counter to use the generic swcounter infrastructure will add a call into the perf_counter subsystem when tasks get migrated. Therefore the perf_counter subsystem needs a chance to initialize its per-cpu data for the new cpu before it can get called from the migration code. This also adds a comment to the migration notifier noting that its priority needs to be lower than that of the perf_counter notifier. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <18981.1900.792795.836858@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 13:10:54 +02:00
Steven Rostedt	0f6ce3de4e	ftrace: do not profile functions when disabled A race was found that if one were to enable and disable the function profiler repeatedly, then the system can panic. This was because a profiled function may be preempted just before disabling interrupts. While the profiler is disabled and then reenabled, the preempted function could start again, and access the hash as it is being initialized. This just adds a check in the irq disabled part to check if the profiler is enabled, and if it is not then it will just exit. When the system is disabled, the profile_enabled variable is cleared before calling the unregistering of the function profiler. This unregistering calls stop machine which also acts as a synchronize schedule. [ Impact: fix panic in enabling/disabling function profiler ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:26:23 -04:00
Steven Rostedt	112f38a7e3	tracing: make trace pipe recognize latency format flag The trace_pipe did not recognize the latency format flag and would produce different output than the trace file. The problem was partly due that the trace flags in the iterator was not set as well as the trace_pipe zeros out part of the iterator (including the flags) to be able to use the same routines as the trace file. trace_flags of the iterator should not cause any problems when not zeroed out by for trace_pipe. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:26:02 -04:00
Steven Whitehouse	ec081ddc3d	tracing: add exports to use __print_symbolic and __print_flags from a module A patch to allow the use of __print_symbolic and __print_flags from a module. This allows the current GFS2 tracing patch to build. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> LKML-Reference: <1243868015.29604.542.camel@localhost.localdomain> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:25:29 -04:00
Li Zefan	7fcb7c472f	tracing/events: introduce __dynamic_array() __string() is limited: - it's a char array, but we may want to define array with other types - a source string should be available, but we may just know the string size We introduce __dynamic_array() to break those limitations, and __string() becomes a wrapper of it. As a side effect, now __get_str() can be used in TP_fast_assign but not only TP_print. Take XFS for example, we have the string length in the dirent, but the string itself is not NULL-terminated, so __dynamic_array() can be used: TRACE_EVENT(xfs_dir2, TP_PROTO(struct xfs_da_args args), TP_ARGS(args), TP_STRUCT__entry( __field(int, namelen) __dynamic_array(char, name, args->namelen + 1) ... ), TP_fast_assign( char name = __get_str(name); if (args->namelen) memcpy(name, args->name, args->namelen); name[args->namelen] = '\0'; __entry->namelen = args->namelen; ), TP_printk("name %.*s namelen %d", __entry->namelen ? __get_str(name) : NULL __entry->namelen) ); [ Impact: allow defining dynamic size arrays ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A2384D2.3080403@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:25:15 -04:00
Steven Rostedt	897f17a653	tracing: combine the default tracers into one config Both event tracer and sched switch plugin are selected by default by all generic tracers. But if no generic tracer is enabled, their options appear. But ether one of them will select the other, thus it only makes sense to have the default tracers be selected by one option. [ Impact: clean up kconfig menu ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:23:55 -04:00
Steven Rostedt	5e0a093910	tracing: fix config options to not show when automatically selected There are two options that are selected by all tracers, but we want to have those options available when no tracer is selected. These are The event tracer and sched switch tracer. The are enabled by all tracers, but if a tracer is not selected we want the options to appear. All tracers including them select TRACING. Thus what we would like to do is: config EVENT_TRACER bool "prompt" depends on TRACING select TRACING But that gives us a bug in the kbuild system since we just created a circular dependency. We only want the prompt to show when TRACING is off. This patch adds GENERIC_TRACER that all tracers will select instead of TRACING. The two options (sched switch and event tracer) will select TRACING directly and depend on !GENERIC_TRACER. This solves the cicular dependency. [ Impact: hide options that are selected by default ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:23:30 -04:00
Steven Rostedt	2af15d6a44	ftrace: add kernel command line function filtering When using ftrace=function on the command line to trace functions on boot up, one can not filter out functions that are commonly called. This patch adds two new ftrace command line commands. ftrace_notrace=function-list ftrace_filter=function-list Where function-list is a comma separated list of functions to filter. The ftrace_notrace will make the functions listed not be included in the function tracing, and ftrace_filter will only trace the functions listed. These two act the same as the debugfs/tracing/set_ftrace_notrace and debugfs/tracing/set_ftrace_filter respectively. The simple glob expressions that are allowed by the filter files can also be used by the command line interface. ftrace_notrace=rcu,lock,spin Will not trace any function that starts with rcu, ends with lock, or has the word spin in it. Note, if the self tests are enabled, they may interfere with the filtering set by the command lines. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-01 23:23:10 -04:00
Frederic Weisbecker	43bd123623	tracing/stat: remove unappropriate safe walk on list register_stat_tracer() uses list_for_each_entry_safe to check whether a tracer is already present in the list. But we don't delete anything from the list here, so we don't need the safe version [ Impact: cleanup list use is stat tracing ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:18:33 +02:00
Li Zefan	dbd3fbdfee	tracing/stat: do some cleanups - remove duplicate code in stat_seq_init() - update comments to reflect the change from stat list to stat rbtree [ Impact: clean up ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:18:18 +02:00
Li Zefan	e162280690	tracing/stat: remember to free root node When closing a trace_stat file, we destroy the rbtree constructed during file open, but there is memory leak that the root node is not freed. [ Impact: fix memory leak when closing a trace_stat file ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:18:03 +02:00
Li Zefan	b3dd7ba7d8	tracing/stat: change dummpy_cmp() to return -1 Currently the output of trace_stat/workqueues is totally reversed: # cat /debug/tracing/trace_stat/workqueues ... 1 17 17 210 37 `-blk_unplug_work+0x0/0x57 1 3779 3779 181 11 \|-cfq_kick_queue+0x0/0x2f 1 3796 3796 kblockd/1:120 ... The correct output should be: 1 3796 3796 kblockd/1:120 1 3779 3779 181 11 \|-cfq_kick_queue+0x0/0x2f 1 17 17 210 37 `-blk_unplug_work+0x0/0x57 It's caused by "tracing/stat: replace linked list by an rbtree for sorting" (53059c9b67a62a3dc8c80204d3da42b9267ea5a0). dummpy_cmp() should return -1, so rb_node will always be inserted as right-most node in the rbtree, thus we sort the output in ascending order. [ Impact: fix the output of trace_stat/workqueues ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:17:49 +02:00
Frederic Weisbecker	8f184f2730	tracing/stat: replace linked list by an rbtree for sorting When the stat tracing framework prepares the entries from a tracer to output them to the user, it starts by computing a linear sort through a linked list to give the entries ordered by relevance to the user. This is quite ugly and causes a small latency when we begin to read the file. This patch changes that by turning the linked list into a red-black tree. Athough the whole iteration using the start and next tracer callbacks while opening the file remain the same, it is now much more fast and scalable. The rbtree guarantees O(log(n)) insertions whereas a linked list with linear sorting brought us a O(n) despair. Now the (visible) latency has disapeared. [ Impact: kill the latency while starting to read a stat tracer file ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:17:35 +02:00
Frederic Weisbecker	0d64f8342d	tracing/stat: replace trace_stat_session by stat_session The "trace" prefix in struct trace_stat_session type is annoying while reading the trace_stat.c file. It makes the lines longer, and is not that much useful to explain the sense of this type. Just keep "struct stat_session" for this type. [ Impact: make the code a bit more readable ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:17:17 +02:00
Zhaolei	f3c4ae26e9	trace_workqueue: remove blank line between each cpu The blankline between each cpu's workqueue stat is not necessary, because the cpu number is enough to part them by eye. Old style also caused a blankline below headline, and made code complex by using lock, disableirq and get cpu var. Old style: # CPU INSERTED EXECUTED NAME # \| \| \| \| 0 8644 8644 events/0 0 0 0 cpuset ... 0 1 1 kdmflush 1 35365 35365 events/1 ... New style: # CPU INSERTED EXECUTED NAME # \| \| \| \| 0 8644 8644 events/0 0 0 0 cpuset ... 0 1 1 kdmflush 1 35365 35365 events/1 ... [ Impact: provide more readable code ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:14:26 +02:00
Zhaolei	b8867164f0	trace_workqueue: remove cpu_workqueue_stats->first_entry cpu_workqueue_stats->first_entry is useless because we can retrieve the header of a cpu workqueue using: if (&cpu_workqueue_stats->list == workqueue_cpu_stat(cpu)->list.next) [ Impact: cleanup ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:13:46 +02:00
Zhaolei	1fdfca9c57	trace_workqueue: use list_for_each_entry() instead of list_for_each_entry_safe() No need to use list_for_each_entry_safe() in iteration without deleting any node, we can use list_for_each_entry() instead. [ Impact: cleanup ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:13:05 +02:00
Zhaolei	fb39125fd7	ftrace, workqueuetrace: make workqueue tracepoints use TRACE_EVENT macro v3: zhaolei@cn.fujitsu.com: Change TRACE_EVENT definition to new format introduced by Steven Rostedt: consolidate trace and trace_event headers v2: kosaki@jp.fujitsu.com: print the function names instead of addr, and zap the work addr v1: zhaolei@cn.fujitsu.com: Make workqueue tracepoints use TRACE_EVENT macro TRACE_EVENT is a more generic way to define tracepoints. Doing so adds these new capabilities to the tracepoints: - zero-copy and per-cpu splice() tracing - binary tracing without printf overhead - structured logging records exposed under /debug/tracing/events - trace events embedded in function tracer output and other plugins - user-defined, per tracepoint filter expressions Then, this patch converts DEFINE_TRACE to TRACE_EVENT in workqueue related tracepoints. [ Impact: expand workqueue tracer to events tracing ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-06-02 01:10:40 +02:00
H. Peter Anvin	48b1fddbb1	Merge branch 'irq/numa' into x86/mce3 Merge reason: arch/x86/kernel/irqinit_{32,64}.c unified in irq/numa and modified in x86/mce3; this merge resolves the conflict. Conflicts: arch/x86/kernel/irqinit.c Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-01 15:25:31 -07:00
Ingo Molnar	3d58f48ba0	Merge branch 'linus' into irq/numa Conflicts: arch/mips/sibyte/bcm1480/irq.c arch/mips/sibyte/sb1250/irq.c Merge reason: we gathered a few conflicts plus update to latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 21:06:21 +02:00
Ingo Molnar	22a4f650d6	perf_counter: Tidy up style details - whitespace fixlets - make local variable definitions more consistent [ Impact: cleanup ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 19:55:32 +02:00
Paul Mackerras	880ca15adf	perf_counter: Allow software counters to count while task is not running This changes perf_swcounter_match() so that per-task software counters can count events that occur while their associated task is not running. This will allow us to use the generic software counter code for counting task migrations, which can occur while the task is not scheduled in. To do this, we have to distinguish between the situations where the counter is inactive because its task has been scheduled out, and those where the counter is inactive because it is part of a group that was not able to go on the PMU. In the former case we want the counter to count, but not in the latter case. If the context is active, we have the latter case. If the context is inactive then we need to know whether the counter was counting when the context was last active, which we can determine by comparing its ->tstamp_stopped timestamp with the context's timestamp. This also folds three checks in perf_swcounter_match, checking perf_event_raw(), perf_event_type() and perf_event_id() individually, into a single 64-bit comparison on counter->hw_event.config, as an optimization. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <18979.34810.259718.955621@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 10:04:06 +02:00
Paul Mackerras	25346b93ca	perf_counter: Provide functions for locking and pinning the context for a task This abstracts out the code for locking the context associated with a task. Because the context might get transferred from one task to another concurrently, we have to check after locking the context that it is still the right context for the task and retry if not. This was open-coded in find_get_context() and perf_counter_init_task(). This adds a further function for pinning the context for a task, i.e. marking it so it can't be transferred to another task. This adds a 'pin_count' field to struct perf_counter_context to indicate that a context is pinned, instead of the previous method of setting the parent_gen count to all 1s. Pinning the context with a pin_count is easier to undo and doesn't require saving the parent_gen value. This also adds a perf_unpin_context() to undo the effect of perf_pin_task_context() and changes perf_counter_init_task to use it. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <18979.34748.755674.596386@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 10:04:05 +02:00
Ingo Molnar	23db9f430b	Merge branch 'linus' into perfcounters/core Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6 based - to pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 10:01:39 +02:00
Peter Zijlstra	bbbee90829	perf_counter: Ammend cleanup in fork() fail When fork() fails we cannot use perf_counter_exit_task() since that assumes to operate on current. Write a new helper that cleans up unused/clean contexts. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 16:21:52 +02:00
Peter Zijlstra	665c2142a9	perf_counter: Clean up task_ctx vs interrupts Remove the local_irq_save() etc.. in routines that are smp function calls, or have IRQs disabled by other means. Then change the COMM, MMAP, and swcounter context iteration to current->perf_counter_ctxp and RCU, since it really doesn't matter which context they iterate, they're all folded. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 16:21:51 +02:00
Peter Zijlstra	efb3d17240	perf_counter: Fix COMM and MMAP events for cpu wide counters Commit `a63eaf34ae` ("perf_counter: Dynamically allocate tasks' perf_counter_context struct") broke COMM and MMAP notification for cpu wide counters by dropping out early if there was no task context, thereby also not iterating the cpu context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 16:21:51 +02:00
Ingo Molnar	012b84dae1	perf_counter: Robustify counter-free logic This fixes a nasty crash and highlights a bug that we were freeing failed-fork() counters incorrectly. (the fix for that will come separately) [ Impact: fix crashes/lockups with inherited counters ] Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 14:28:37 +02:00
Ingo Molnar	3f4dee2273	perf_counter: Fix cpuctx->task_ctx races Peter noticed that we are sometimes reading cpuctx->task_ctx with interrupts enabled. Noticed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 14:28:36 +02:00
Paul Mackerras	ad3a37de81	perf_counter: Don't swap contexts containing locked mutex Peter Zijlstra pointed out that under some circumstances, we can take the mutex in a context or a counter and then swap that context or counter to another task, potentially leading to lock order inversions or the mutexes not protecting what they are supposed to protect. This fixes the problem by making sure that we never take a mutex in a context or counter which could get swapped to another task. Most of the cases where we take a mutex is on a top-level counter or context, i.e. a counter which has an fd associated with it or a context that contains such a counter. This adds WARN_ON_ONCE statements to verify that. The two cases where we need to take the mutex on a context that is a clone of another are in perf_counter_exit_task and perf_counter_init_task. The perf_counter_exit_task case is solved by uncloning the context before starting to remove the counters from it. The perf_counter_init_task is a little trickier; we temporarily disable context swapping for the parent (forking) task by setting its ctx->parent_gen to the all-1s value after locking the context, if it is a cloned context, and restore the ctx->parent_gen value at the end if the context didn't get uncloned in the meantime. This also moves the increment of the context generation count to be within the same critical section, protected by the context mutex, that adds the new counter to the context. That way, taking the mutex is sufficient to ensure that both the counter list and the generation count are stable. [ Impact: fix hangs, races with inherited and PID counters ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <18975.31580.520676.619896@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 11:02:46 +02:00
Andi Kleen	a9862e0560	Export add_timer_on for modules Needed in followon patch. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:13 -07:00
Paul Mackerras	c93f766909	perf_counter: Fix race in attaching counters to tasks and exiting Commit `564c2b21` ("perf_counter: Optimize context switch between identical inherited contexts") introduced a race where it is possible that a counter being attached to a task could get attached to the wrong task, if the task is one that has inherited its context from another task via fork. This happens because the optimized context switch could switch the context to another task after find_get_context has read task->perf_counter_ctxp. In fact, it's possible that the context could then get freed, if the other task then exits. This fixes the problem by protecting both the context switch and the critical code in find_get_context with spinlocks. The context switch locks the cxt->lock of both the outgoing and incoming contexts before swapping them. That means that once code such as find_get_context has obtained the spinlock for the context associated with a task, the context can't get swapped to another task. However, the context may have been swapped in the interval between reading task->perf_counter_ctxp and getting the lock, so it is necessary to check and retry. To make sure that none of the contexts being looked at in find_get_context can get freed, this changes the context freeing code to use RCU. Thus an rcu_read_lock() is sufficient to ensure that no contexts can get freed. This part of the patch is lifted from a patch posted by Peter Zijlstra. This also adds a check to make sure that we can't add a counter to a task that is exiting. There is also a race between perf_counter_exit_task and find_get_context; this solves the race by moving the get_ctx that was in perf_counter_alloc into the locked region in find_get_context, so that once find_get_context has got the context for a task, it won't get freed even if the task calls perf_counter_exit_task. It doesn't matter if new top-level (non-inherited) counters get attached to the context after perf_counter_exit_task has detached the context from the task. They will just stay there and never get scheduled in until the counters' fds get closed, and then perf_release will remove them from the context and eventually free the context. With this, we are now doing the unclone in find_get_context rather than when a counter was added to or removed from a context (actually, we were missing the unclone_ctx() call when adding a counter to a context). We don't need to unclone when removing a counter from a context because we have no way to remove a counter from a cloned context. This also takes out the smp_wmb() in find_get_context, which Peter Zijlstra pointed out was unnecessary because the cmpxchg implies a full barrier anyway. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <18974.33033.667187.273886@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-28 15:03:50 +02:00
Heiko Carstens	5b6045a906	trace: disable preemption before taking raw spinlocks s390 code uses smp_processor_id() in __raw_spin_lock() code which reveals that a (raw) spinlock is taken without preemption disabled. This can potentially deadlock. To fix this explicitly disable and enable preemption. BUG: using smp_processor_id() in preemptible [00000000] code: cat/2278 caller is trace_find_cmdline+0x40/0xfc CPU: 0 Not tainted 2.6.30-rc7-dirty #39 Process cat (pid: 2278, task: 000000003faedb68, ksp: 000000003b33b988) 000000003b33b988 000000003b33bae0 0000000000000002 0000000000000000 000000003b33bb80 000000003b33baf8 000000003b33baf8 00000000000175d6 0000000000000001 000000003b33b988 000000003f9b0000 000000000000000b 000000000000000c 000000003b33bb40 000000003b33bae0 0000000000000000 0000000000000000 00000000000175d6 000000003b33bae0 000000003b33bb28 Call Trace: ([<00000000000174b2>] show_trace+0x112/0x170) [<0000000000017582>] show_stack+0x72/0x100 [<0000000000441538>] dump_stack+0xc8/0xd8 [<000000000025c350>] debug_smp_processor_id+0x114/0x130 [<00000000000bf0e4>] trace_find_cmdline+0x40/0xfc [<00000000000c35d4>] trace_print_context+0x58/0xac [<00000000000bb676>] print_trace_line+0x416/0x470 [<00000000000bc8fe>] s_show+0x4e/0x428 [<000000000013834e>] seq_read+0x36a/0x5d4 [<0000000000112a78>] vfs_read+0xc8/0x174 [<0000000000112c58>] SyS_read+0x74/0xc4 [<000000000002c7ae>] sysc_noemu+0x10/0x16 [<000002000012436c>] 0x2000012436c 1 lock held by cat/2278: #0: (&p->lock){+.+.+.}, at: [<0000000000138056>] seq_read+0x72/0x5d4 [ Impact: fix preempt-unsafe raw spinlock ] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-28 01:21:03 +02:00
Eero Nurkkala	f2e21c9610	NOHZ: Properly feed cpufreq ondemand governor A call from irq_exit() may occasionally pause the timing info for cpufreq ondemand governor. This results in the cpufreq ondemand governor to fail to calculate the system load properly. Thus, relocate the checks for this particular case to keep the governor always functional. Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Reported-by: Tero Kristo <tero.kristo@nokia.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-27 15:33:43 +02:00
Tetsuo Handa	ab2b7ebaad	kmod: Release sub_info on cred allocation failure. call_usermodehelper_setup() forgot to kfree(sub_info) when prepare_usermodehelper_creds() failed. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-26 12:11:19 -07:00
Steven Rostedt	0f4fc29dd6	tracing: add __print_symbolic to trace events This patch adds __print_symbolic which is similar to __print_flags but works for an enumeration type instead. That is, there is only a one to one mapping between the values and the symbols. When a match is made, then it is printed, otherwise the hex value is outputed. [ Impact: add interface for showing symbol names in events ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-26 20:31:50 +02:00
Steven Rostedt	be74b73a57	tracing: add __print_flags for events Developers have been asking for the ability in the ftrace event tracer to display names of bits in a flags variable. Instead of printing out c2, it would be easier to read FOO\|BAR\|GOO, assuming that FOO is bit 1, BAR is bit 6 and GOO is bit 7. Some examples where this would be useful are the state flags in a context switch, kmalloc flags, and even permision flags in accessing files. [ v2 changes include: Frederic Weisbecker's idea of using a mask instead of bits, thus we can output GFP_KERNEL instead of GPF_WAIT\|GFP_IO\|GFP_FS. Li Zefan's idea of allowing the caller of __print_flags to add their own delimiter (or no delimiter) where we can get for file permissions rwx instead of r\|w\|x. ] [ v3 changes: Christoph Hellwig's idea of using an array instead of va_args. ] [ Impact: better displaying of flags in trace output ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-26 20:25:22 +02:00
Ingo Molnar	329d876d6f	perf_counter: Initialize ->oncpu properly This shouldnt matter normally (and i have not seen any misbehavior), because active counters always have a proper ->oncpu value - but nevertheless initialize the field properly to -1. [ Impact: cleanup ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-26 09:54:13 +02:00
Zhaolei	0e907c9939	ftrace: clean up of using ftrace_event_enable_disable() Always use ftrace_event_enable_disable() to enable/disable an event so that we can factorize out the event toggling code. [ Impact: factorize and cleanup event tracing code ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4A14FDFE.2080402@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-26 03:30:31 +02:00
Zhaolei	b11c53e12f	ftrace: Add task_comm support for trace_event If we enable a trace event alone without any tracer running (such as function tracer, sched switch tracer, etc...) it can't output enough task command information. We need to use the tracing_{start/stop}_cmdline_record() helpers which are designed to keep track of cmdlines for any tasks that were scheduled during the tracing. Before this patch: # echo 1 > debugfs/tracing/events/sched/sched_switch/enable # cat debugfs/tracing/trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| <...>-2289 [000] 526276.724790: sched_switch: task bash:2289 [120] ==> sshd:2287 [120] <...>-2287 [000] 526276.725231: sched_switch: task sshd:2287 [120] ==> bash:2289 [120] <...>-2289 [000] 526276.725452: sched_switch: task bash:2289 [120] ==> sshd:2287 [120] <...>-2287 [000] 526276.727181: sched_switch: task sshd:2287 [120] ==> swapper:0 [140] <idle>-0 [000] 526277.032734: sched_switch: task swapper:0 [140] ==> events/0:5 [115] <...>-5 [000] 526277.032782: sched_switch: task events/0:5 [115] ==> swapper:0 [140] ... After this patch: # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| bash-2269 [000] 527347.989229: sched_switch: task bash:2269 [120] ==> sshd:2267 [120] sshd-2267 [000] 527347.990960: sched_switch: task sshd:2267 [120] ==> bash:2269 [120] bash-2269 [000] 527347.991143: sched_switch: task bash:2269 [120] ==> sshd:2267 [120] sshd-2267 [000] 527347.992959: sched_switch: task sshd:2267 [120] ==> swapper:0 [140] <idle>-0 [000] 527348.531989: sched_switch: task swapper:0 [140] ==> events/0:5 [115] events/0-5 [000] 527348.532115: sched_switch: task events/0:5 [115] ==> swapper:0 [140] ... Changelog: v1->v2: Update Kconfig to select CONTEXT_SWITCH_TRACER in ENABLE_EVENT_TRACING v2->v3: v2 can solve problem that was caused by config EVENT_TRACING alone, but when CONFIG_FTRACE is off and CONFIG_TRACING is selected by other config, compile fail happened again. This version solves it. [ Impact: fix incomplete output of event tracing ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4A14FDFE.2080402@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-26 03:03:21 +02:00
Lai Jiangshan	4f5359685a	tracing: add trace_event_read_lock() I found that there is nothing to protect event_hash in ftrace_find_event(). Rcu protects the event hashlist but not the event itself while we use it after its extraction through ftrace_find_event(). This lack of a proper locking in this spot opens a race window between any event dereferencing and module removal. Eg: --Task A-- print_trace_line(trace) { event = find_ftrace_event(trace) --Task B-- trace_module_remove_events(mod) { list_trace_events_module(ev, mod) { unregister_ftrace_event(ev->event) { hlist_del(ev->event->node) list_del(....) } } } \|--> module removed, the event has been dropped --Task A-- event->print(trace); // Dereferencing freed memory If the event retrieved belongs to a module and this module is concurrently removed, we may end up dereferencing a data from a freed module. RCU could solve this, but it would add latency to the kernel and forbid tracers output callbacks to call any sleepable code. So this fix converts 'trace_event_mutex' to a read/write semaphore, and adds trace_event_read_lock() to protect ftrace_find_event(). [ Impact: fix possible freed memory dereference in ftrace ] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4A114806.7090302@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-25 23:53:41 +02:00
Ingo Molnar	0127c3ea08	perf_counter: fix warning & lockup - remove bogus warning - fix wakeup from NMI path lockup - also fix up whitespace noise in perf_counter.h Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.703093461@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 22:02:23 +02:00
Peter Zijlstra	a78ac32587	perf_counter: Generic per counter interrupt throttle Introduce a generic per counter interrupt throttle. This uses the perf_counter_overflow() quick disable to throttle a specific counter when its going too fast when a pmu->unthrottle() method is provided which can undo the quick disable. Power needs to implement both the quick disable and the unthrottle method. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.703093461@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 21:41:12 +02:00
Peter Zijlstra	10989fb245	perf_counter: Fix PERF_COUNTER_CONTEXT_SWITCHES for cpu counters Ingo noticed that cpu counters had 0 context switches, even though there was plenty scheduling on the cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525124600.419025548@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 14:55:01 +02:00
Peter Zijlstra	6ab423e0ea	perf_counter: Propagate inheritance failures down the fork() path Fail fork() when we fail inheritance for some reason (-ENOMEM most likely). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525124600.324656474@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 14:55:01 +02:00
Peter Zijlstra	771d7cde14	perf_counter: Make pctrl() affect inherited counters too Paul noted that the new ptcrl() didn't work on child counters. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525124600.203151469@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 14:55:00 +02:00
Ingo Molnar	e4cbb4e3ac	perf_counter: Move child perfcounter init to after scheduler init Initialize a task's perfcounters (inherit from parent, etc.) after the child task's scheduler fields have been initialized already. [ Impact: cleanup ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 13:05:06 +02:00
Linus Torvalds	93c3248380	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs	2009-05-24 19:38:25 -07:00
James Bottomley	d5a877e8dd	async: make sure independent async domains can't accidentally entangle The problem occurs when async_synchronize_full_domain() is called when the async_pending list is not empty. This will cause lowest_running() to return the cookie of the first entry on the async_pending list, which might be nothing at all to do with the domain being asked for and thus cause the domain synchronization to wait for an unrelated domain. This can cause a deadlock if domain synchronization is used from one domain to wait for another. Fix by running over the async_pending list to see if any pending items actually belong to our domain (and return their cookies if they do). Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-24 13:38:41 -07:00
Rafael J. Wysocki	32bdfac546	PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs We shouldn't hold dpm_list_mtx while executing [disable\|enable]_nonboot_cpus(), because theoretically this may lead to a deadlock as shown by the following example (provided by Johannes Berg): CPU 3 CPU 2 CPU 1 suspend/hibernate something: rtnl_lock() device_pm_lock() -> mutex_lock(&dpm_list_mtx) mutex_lock(&dpm_list_mtx) linkwatch_work -> rtnl_lock() disable_nonboot_cpus() -> flush CPU 3 workqueue Fortunately, device drivers are supposed to stop any activities that might lead to the registration of new device objects way before disable_nonboot_cpus() is called, so it shouldn't be necessary to hold dpm_list_mtx over the entire late part of device suspend and early part of device resume. Thus, during the late suspend and the early resume of devices acquire dpm_list_mtx only when dpm_list is going to be traversed and release it right after that. This patch is reported to fix the regressions tracked as http://bugzilla.kernel.org/show_bug.cgi?id=13245. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Miles Lane <miles.lane@gmail.com> Tested-by: Ming Lei <tom.leiming@gmail.com>	2009-05-24 21:15:07 +02:00
Ingo Molnar	a3862d3f81	perf_counter: Increase mmap limit In a default 'perf top' run the tool will create a counter for each online CPU. With enough CPUs this will eventually exhaust the default limit. So scale it up with the number of online CPUs. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-24 09:02:37 +02:00
Peter Zijlstra	475c557973	perf_counter: Remove perf_counter_context::nr_enabled now that pctrl() no longer disables other people's counters, remove the PMU cache code that deals with that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163013.032998331@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-24 08:24:30 +02:00
Peter Zijlstra	082ff5a276	perf_counter: Change pctrl() behaviour Instead of en/dis-abling all counters acting on a particular task, en/dis- able all counters we created. [ v2: fix crash on first counter enable ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163012.916937244@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-24 08:24:08 +02:00
Peter Zijlstra	aa9c67f53d	perf_counter: Simplify context cleanup Use perf_counter_remove_from_context() to remove counters from the context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163012.796275849@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-23 19:37:47 +02:00
Peter Zijlstra	682076ae1d	perf_counter: Sanitize context locking Ensure we're consistent with the context locks. context->mutex context->lock list_{add,del}_counter(); so that either lock is sufficient to stabilize the context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163012.618790733@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-23 19:37:46 +02:00
Peter Zijlstra	fccc714b31	perf_counter: Sanitize counter->mutex s/counter->mutex/counter->child_mutex/ and make sure its only used to protect child_list. The usage in __perf_counter_exit_task() doesn't appear to be problematic since ctx->mutex also covers anything related to fd tear-down. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163012.533186528@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-23 19:37:45 +02:00
Peter Zijlstra	e220d2dcb9	perf_counter: Fix dynamic irq_period logging We call perf_adjust_freq() from perf_counter_task_tick() which is is called under the rq->lock causing lock recursion. However, it's no longer required to be called under the rq->lock, so remove it from under it. Also, fix up some related comments. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090523163012.476197912@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-23 19:37:44 +02:00
Paul Mundt	948cd52906	sparseirq: Allow early irq_desc allocation Presently non-legacy IRQs have their irq_desc allocated with kzalloc_node(). This assumes that all callers of irq_to_desc_node_alloc() will be sufficiently late in the boot process that kmalloc is available. While porting sparseirq support to sh this blew up immediately, as at the time that we register the CPU's interrupt vector map only bootmem is available. Check slab_is_available() to work out which path to use. [ Impact: fix SH early boot crash with sparseirq enabled ] Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> LKML-Reference: <20090522014008.GA2806@linux-sh.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-23 14:55:24 +02:00
Jens Axboe	9bd7de51ee	Merge branch 'master' into for-2.6.31 Conflicts: drivers/ide/ide-io.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 20:28:35 +02:00
Jens Axboe	e4b636366c	Merge branch 'master' into for-2.6.31 Conflicts: drivers/block/hd.c drivers/block/mg_disk.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 20:25:34 +02:00
Paul Mackerras	564c2b210a	perf_counter: Optimize context switch between identical inherited contexts When monitoring a process and its descendants with a set of inherited counters, we can often get the situation in a context switch where both the old (outgoing) and new (incoming) process have the same set of counters, and their values are ultimately going to be added together. In that situation it doesn't matter which set of counters are used to count the activity for the new process, so there is really no need to go through the process of reading the hardware counters and updating the old task's counters and then setting up the PMU for the new task. This optimizes the context switch in this situation. Instead of scheduling out the perf_counter_context for the old task and scheduling in the new context, we simply transfer the old context to the new task and keep using it without interruption. The new context gets transferred to the old task. This means that both tasks still have a valid perf_counter_context, so no special case is introduced when the old task gets scheduled in again, either on this CPU or another CPU. The equivalence of contexts is detected by keeping a pointer in each cloned context pointing to the context it was cloned from. To cope with the situation where a context is changed by adding or removing counters after it has been cloned, we also keep a generation number on each context which is incremented every time a context is changed. When a context is cloned we take a copy of the parent's generation number, and two cloned contexts are equivalent only if they have the same parent and the same generation number. In order that the parent context pointer remains valid (and is not reused), we increment the parent context's reference count for each context cloned from it. Since we don't have individual fds for the counters in a cloned context, the only thing that can make two clones of a given parent different after they have been cloned is enabling or disabling all counters with prctl. To account for this, we keep a count of the number of enabled counters in each context. Two contexts must have the same number of enabled counters to be considered equivalent. Here are some measurements of the context switch time as measured with the lat_ctx benchmark from lmbench, comparing the times obtained with and without this patch series: -----Unmodified----- With this patch series Counters: none 2 HW 4H+4S none 2 HW 4H+4S 2 processes: Average 3.44 6.45 11.24 3.12 3.39 3.60 St dev 0.04 0.04 0.13 0.05 0.17 0.19 8 processes: Average 6.45 8.79 14.00 5.57 6.23 7.57 St dev 1.27 1.04 0.88 1.42 1.46 1.42 32 processes: Average 5.56 8.43 13.78 5.28 5.55 7.15 St dev 0.41 0.47 0.53 0.54 0.57 0.81 The numbers are the mean and standard deviation of 20 runs of lat_ctx. The "none" columns are lat_ctx run directly without any counters. The "2 HW" columns are with lat_ctx run under perfstat, counting cycles and instructions. The "4H+4S" columns are lat_ctx run under perfstat with 4 hardware counters and 4 software counters (cycles, instructions, cache references, cache misses, task clock, context switch, cpu migrations, and page faults). [ Impact: performance optimization of counter context-switches ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-22 12:18:20 +02:00
Paul Mackerras	a63eaf34ae	perf_counter: Dynamically allocate tasks' perf_counter_context struct This replaces the struct perf_counter_context in the task_struct with a pointer to a dynamically allocated perf_counter_context struct. The main reason for doing is this is to allow us to transfer a perf_counter_context from one task to another when we do lazy PMU switching in a later patch. This has a few side-benefits: the task_struct becomes a little smaller, we save some memory because only tasks that have perf_counters attached get a perf_counter_context allocated for them, and we can remove the inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't end up recompiling nearly everything whenever perf_counter.h changes. The perf_counter_context structures are reference-counted and freed when the last reference is dropped. A context can have references from its task and the counters on its task. Counters can outlive the task so it is possible that a context will be freed well after its task has exited. Contexts are allocated on fork if the parent had a context, or otherwise the first time that a per-task counter is created on a task. In the latter case, we set the context pointer in the task struct locklessly using an atomic compare-and-exchange operation in case we raced with some other task in creating a context for the subject task. This also removes the task pointer from the perf_counter struct. The task pointer was not used anywhere and would make it harder to move a context from one task to another. Anything that needed to know which task a counter was attached to was already using counter->ctx->task. The __perf_counter_init_context function moves up in perf_counter.c so that it can be called from find_get_context, and now initializes the refcount, but is otherwise unchanged. We were potentially calling list_del_counter twice: once from __perf_counter_exit_task when the task exits and once from __perf_counter_remove_from_context when the counter's fd gets closed. This adds a check in list_del_counter so it doesn't do anything if the counter has already been removed from the lists. Since perf_counter_task_sched_in doesn't do anything if the task doesn't have a context, and leaves cpuctx->task_ctx = NULL, this adds code to __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in the case where the current task adds the first counter to itself and thus creates a context for itself. This also adds similar code to __perf_counter_enable to handle a similar situation which can arise when the counters have been disabled using prctl; that also leaves cpuctx->task_ctx = NULL. [ Impact: refactor counter context management to prepare for new feature ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-22 12:18:19 +02:00
James Morris	2c9e703c61	Merge branch 'master' into next Conflicts: fs/exec.c Removed IMA changes (the IMA checks are now performed via may_open()). Signed-off-by: James Morris <jmorris@namei.org>	2009-05-22 18:40:59 +10:00
Paul Mundt	5f8371cec9	Merge branches 'sh/stable-updates' and 'sh/sparseirq'	2009-05-22 13:29:37 +09:00
Ingo Molnar	34adc80622	perf_counter: Fix context removal deadlock Disable the PMU globally before removing a counter from a context. This fixes the following lockup: [22081.741922] ------------[ cut here ]------------ [22081.746668] WARNING: at arch/x86/kernel/cpu/perf_counter.c:803 intel_pmu_handle_irq+0x9b/0x24e() [22081.755624] Hardware name: X8DTN [22081.758903] perfcounters: irq loop stuck! [22081.762985] Modules linked in: [22081.766136] Pid: 11082, comm: perf Not tainted 2.6.30-rc6-tip #226 [22081.772432] Call Trace: [22081.774940] <NMI> [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.781993] [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.788368] [<ffffffff8104505c>] ? warn_slowpath_common+0x77/0xa3 [22081.794649] [<ffffffff810450d3>] ? warn_slowpath_fmt+0x40/0x45 [22081.800696] [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.807080] [<ffffffff814d1a72>] ? perf_counter_nmi_handler+0x3f/0x4a [22081.813751] [<ffffffff814d2d09>] ? notifier_call_chain+0x58/0x86 [22081.819951] [<ffffffff8105b250>] ? notify_die+0x2d/0x32 [22081.825392] [<ffffffff814d1414>] ? do_nmi+0x8e/0x242 [22081.830538] [<ffffffff814d0f0a>] ? nmi+0x1a/0x20 [22081.835342] [<ffffffff8117e102>] ? selinux_file_free_security+0x0/0x1a [22081.842105] [<ffffffff81018793>] ? x86_pmu_disable_counter+0x15/0x41 [22081.848673] <<EOE>> [<ffffffff81018f3d>] ? x86_pmu_disable+0x86/0x103 [22081.855512] [<ffffffff8108fedd>] ? __perf_counter_remove_from_context+0x0/0xfe [22081.862926] [<ffffffff8108fcbc>] ? counter_sched_out+0x30/0xce [22081.868909] [<ffffffff8108ff36>] ? __perf_counter_remove_from_context+0x59/0xfe [22081.876382] [<ffffffff8106808a>] ? smp_call_function_single+0x6c/0xe6 [22081.882955] [<ffffffff81091b96>] ? perf_release+0x86/0x14c [22081.888600] [<ffffffff810c4c84>] ? __fput+0xe7/0x195 [22081.893718] [<ffffffff810c213e>] ? filp_close+0x5b/0x62 [22081.899107] [<ffffffff81046a70>] ? put_files_struct+0x64/0xc2 [22081.905031] [<ffffffff8104841a>] ? do_exit+0x1e2/0x6ef [22081.910360] [<ffffffff814d0a60>] ? _spin_lock_irqsave+0x9/0xe [22081.916292] [<ffffffff8104898e>] ? do_group_exit+0x67/0x93 [22081.921953] [<ffffffff810489cc>] ? sys_exit_group+0x12/0x16 [22081.927759] [<ffffffff8100baab>] ? system_call_fastpath+0x16/0x1b [22081.934076] ---[ end trace 3a3936ce3e1b4505 ]--- And could potentially also fix the lockup reported by Marcelo Tosatti. Also, print more debug info in case of a detected lockup. [ Impact: fix lockup ] Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 20:12:54 +02:00
Ming Lei	5537937696	ftrace: fix check for return value of register_module_notifier in event_trace_init register_module_notifier() returns zero in the success case. So fix the inverted fail case check in trace events modules handler. [ Impact: fix spurious warning on ftrace initialization] Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-20 19:23:11 +02:00
Peter Zijlstra	afedadf23a	perf_counter: Optimize sched in/out of counters Avoid a function call for !group counters by directly calling the counter function. [ Impact: micro-optimize the code ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090520102553.511933670@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 12:43:34 +02:00
Peter Zijlstra	b986d7ec0f	perf_counter: Optimize disable of time based sw counters Currently we call hrtimer_cancel() unconditionally on disable of time based software counters. Avoid when possible. [ Impact: micro-optimize the code ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090520102553.388185031@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 12:43:33 +02:00
Peter Zijlstra	26b119bc81	perf_counter: Log irq_period changes For the dynamic irq_period code, log whenever we change the period so that analyzing code can normalize the event flow. [ Impact: add new feature to allow more precise profiling ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090520102553.298769743@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 12:43:33 +02:00
Peter Zijlstra	d7b629a34f	perf_counter: Solve the rotate_ctx vs inherit race differently Instead of disabling RR scheduling of the counters, use a different list that does not get rotated to iterate the counters on inheritance. [ Impact: cleanup, optimization ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090520102553.237504544@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 12:43:32 +02:00
Thomas Gleixner	2070887fde	futex: fix restart in wait_requeue_pi If the waiter has been requeued to the outer PI futex and is interrupted by a signal and the thread handles the signal then ERESTART_RESTARTBLOCK is changed to EINTR and the restart block is discarded. That way we return an unexcpected EINTR to user space instead of ending up in futex_lock_pi_restart. But we do not need to restart the syscall because we know that the condition has changed since we have been requeued. If we would simply restart the syscall then we would drop out via the comparison of the user space value with EWOULDBLOCK. The user space side needs to handle EWOULDBLOCK anyway as the enqueueing on the inner futex can race with a requeue/wake. So we can simply return EWOULDBLOCK to user space which also signals that we did not take the outer futex and let user space handle it in the same way it has to handle the requeue/wake race. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-20 10:34:32 +02:00
Thomas Gleixner	1c840c1490	futex: fix restart for early wakeup in futex_wait_requeue_pi() The futex_wait_requeue_pi op should restart unconditionally like futex_lock_pi. The user of that function e.g. pthread_cond_wait can not be interrupted so we do not care about the SA_RESTART flag of the signal. Clean up the FIXMEs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-20 10:28:45 +02:00
Thomas Gleixner	c8b15a706d	futex: cleanup error exit Reuse the put_key_ref(key2) call in the exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-20 10:28:45 +02:00
Thomas Gleixner	521c180874	Merge branch 'core/urgent' into core/futexes Merge reason: this branch was on an pre -rc1 base, merge it up to -rc6+ to get the latest upstream fixes. Conflicts: kernel/futex.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-20 09:02:28 +02:00
Ingo Molnar	c44d70a340	perf_counter: fix counter inheritance race Context rotation should not occur when we are in the middle of walking the counter list when inheriting counters ... [ Impact: fix occasionally incorrect perf stat results ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 00:22:30 +02:00
Ingo Molnar	33b2fb303f	perf_counter: fix counter freeing logic Fix counter lifetime bugs which explain the crashes reported by Marcelo Tosatti and Arnaldo Carvalho de Melo. The new rule is: flushing + freeing is only done for a task's own counters, never for other tasks. [ Impact: fix crashes/lockups with inherited counters ] Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 00:22:24 +02:00
Thomas Gleixner	64d1304a64	futex: setup writeable mapping for futex ops which modify user space data The futex code installs a read only mapping via get_user_pages_fast() even if the futex op function has to modify user space data. The eventual fault was fixed up by futex_handle_fault() which walked the VMA with mmap_sem held. After the cleanup patches which removed the mmap_sem dependency of the futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex: clean up fault logic) removed the private VMA walk logic from the futex code. This change results in a stale RO mapping which is not fixed up. Instead of reintroducing the previous fault logic we set up the mapping in get_user_pages_fast() read/write for all operations which modify user space data. Also handle private futexes in the same way and make the current unconditional access_ok(VERIFY_WRITE) depend on the futex op. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org	2009-05-19 23:36:52 +02:00
Stefan Raspl	fd51d251e4	blktrace: remove debugfs entries on bad path debugfs directory entries for devices are not removed on some of the failure pathes in do_blk_trace_setup(). One way to reproduce is to start blktrace on multiple devices with insufficient Vmalloc space: Devices will fail with a message like this: BLKTRACESETUP(2) /dev/sdu failed: 5/Input/output error If so, the respective entries in debugfs (e.g. /sys/kernel/debug/block/sdu) will remain and subsequent attempts to start blktrace on the respective devices will not succeed due to existing directories. [ Impact: fix /debug/tracing file cleanup corner case ] Signed-off-by: Stefan Raspl <stefan.raspl@linux.vnet.ibm.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com LKML-Reference: <4A1266CC.5040801@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-19 10:29:21 +02:00
Ingo Molnar	4200efd9ac	sched: properly define the sched_group::cpumask and sched_domain::span fields Properly document the variable-size structure tricks we are doing wrt. struct sched_group and sched_domain, and use the field[0] GCC extension instead of defining a vla array. Dont use unions for this, as pointed out by Linus. [ Impact: cleanup, un-confuse Sparse and LLVM ] Reported-by: Jeff Garzik <jeff@garzik.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <alpine.LFD.2.01.0905180850110.3301@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-19 09:22:19 +02:00
Linus Torvalds	ee3af6ee77	Merge branches 'sched-fixes-for-linus-2' and 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix fallback sched_clock()'s offset when using jiffies * 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS	2009-05-18 10:11:06 -07:00
Linus Torvalds	0130b2d701	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Append prompt in /debug/tracing/README file x86/function-graph: fix constraint for recording old return value	2009-05-18 09:15:41 -07:00
Ming Lei	24ed0c4bfc	tracing: fix check for return value of register_module_notifier return zero should be correct, so fix it. [ Impact: eliminate incorrect syslog message ] Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: rostedt@goodmis.org LKML-Reference: <1242545498-7285-1-git-send-email-tom.leiming@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 10:24:13 +02:00
Ingo Molnar	1079cac0f4	Merge commit 'v2.6.30-rc6' into tracing/core Merge reason: we were on an -rc4 base, sync up to -rc6 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 10:15:35 +02:00
Ingo Molnar	dc3f81b129	Merge commit 'v2.6.30-rc6' into perfcounters/core Merge reason: this branch was on an -rc4 base, merge it up to -rc6 to get the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 07:37:49 +02:00
Linus Torvalds	86460103c4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: check sysdev_suspend(PMSG_FREEZE) return value	2009-05-17 11:46:22 -07:00
Ingo Molnar	0203026b58	perf_counter: fix threaded task exit Flushing counters in __exit_signal() with irqs disabled is not a good idea as perf_counter_exit_task() acquires mutexes. So flush it before acquiring the tasklist lock. (Note, we still need a fix for when the PID has been unhashed.) [ Impact: fix crash with inherited counters ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-17 11:26:57 +02:00
Peter Zijlstra	856d56b9e5	perf_counter: Fix counter inheritance Srivatsa Vaddagiri reported that a Java workload triggers this warning in kernel/exit.c: WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list)); Add the inherited counter propagation on self-detach, this could cause counter leaks and incomplete stats in threaded code like the below: #include <pthread.h> #include <unistd.h> void thread(void arg) { sleep(5); return NULL; } void main(void) { pthread_t thr; pthread_create(&thr, NULL, thread, NULL); } Reported-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-17 07:52:24 +02:00
Peter Zijlstra	8bc2095951	perf_counter: Fix inheritance cleanup code Clean up code that open-coded the list_{add,del}_counter() code in __perf_counter_exit_task() which consequently diverged. This could lead to software counter crashes. Also, fold the ctx->nr_counter inc/dec into those functions and clean up some of the related code. [ Impact: fix potential sw counter crash, cleanup ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-17 07:52:23 +02:00
Linus Torvalds	0f6f49a8cd	Fix caller information for warn_slowpath_null Ian Campbell noticed that since "Eliminate thousands of warnings with gcc 3.2 build" (commit `57adc4d2db`) all WARN_ON()'s currently appear to come from warn_slowpath_null(), eg: WARNING: at kernel/softirq.c:143 warn_slowpath_null+0x1c/0x20() because now that warn_slowpath_null() is in the call path, the __builtin_return_address(0) returns that, rather than the place that caused the warning. Fix this by splitting up the warn_slowpath_null/fmt cases differently, using a common helper function, and getting the return address in the right place. This also happens to avoid the unnecessary stack usage for the non-stdargs case, and just generally cleans things up. Make the function name printout use %pS while at it. Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-16 13:41:28 -07:00
Bjorn Helgaas	4484079d51	PM: check sysdev_suspend(PMSG_FREEZE) return value Check the return value of sysdev_suspend(). I think this was a typo. Without this change, the following "if" check is always false. I also changed the error message so it's distinguishable from the similar message a few lines above. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-05-15 23:30:50 +02:00
GeunSik Lim	88fc86c283	tracing: Append prompt in /debug/tracing/README file append prompt in /debug/tracing/README file. This is trivial issue. Fix typo Mini Howto file(README) for ftrace. [ Impact: cleanup ] Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: williams <williams@redhat.com> LKML-Reference: <1242289418.31161.45.camel@centos51> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 19:43:22 +02:00
Linus Torvalds	ade385e4d1	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: gdb documentation fix kgdb,i386: use address that SP register points to in the exception frame sysrq, intel_fb: fix sysrq g collision	2009-05-15 08:06:45 -07:00
Paul Mackerras	9d23a90a67	perf_counter: allow arch to supply event misc flags and instruction pointer At present the values we put in overflow events for the misc flags indicating processor mode and the instruction pointer are obtained using the standard user_mode() and instruction_pointer() functions. Those functions tell you where the performance monitor interrupt was taken, which might not be exactly where the counter overflow occurred, for example because interrupts were disabled at the point where the overflow occurred, or because the processor had many instructions in flight and chose to complete some more instructions beyond the one that caused the counter overflow. Some architectures (e.g. powerpc) can supply more precise information about where the counter overflow occurred and the processor mode at that point. This introduces new functions, perf_misc_flags() and perf_instruction_pointer(), which arch code can override to provide more precise information if available. They have default implementations which are identical to the existing code. This also adds a new misc flag value, PERF_EVENT_MISC_HYPERVISOR, for the case where a counter overflow occurred in the hypervisor. We encode the processor mode in the 2 bits previously used to indicate user or kernel mode; the values for user and kernel mode are unchanged and hypervisor mode is indicated by both bits being set. [ Impact: generalize perfcounter core facilities ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18956.1272.818511.561835@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 16:38:56 +02:00
Peter Zijlstra	2e569d3672	perf_counter: frequency based adaptive irq_period, 32-bit fix fix: kernel/built-in.o: In function `perf_counter_alloc': perf_counter.c:(.text+0x7ddc7): undefined reference to `__udivdi3' [ Impact: build fix on 32-bit systems ] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1242394667.6642.1887.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 15:40:25 +02:00
Thomas Gleixner	2d02494f5a	sched, timers: cleanup avenrun users avenrun is an rough estimate so we don't have to worry about consistency of the three avenrun values. Remove the xtime lock dependency and provide a function to scale the values. Cleanup the users. [ Impact: cleanup ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org>	2009-05-15 15:32:45 +02:00
Thomas Gleixner	dce48a84ad	sched, timers: move calc_load() to scheduler Dimitri Sivanich noticed that xtime_lock is held write locked across calc_load() which iterates over all online CPUs. That can cause long latencies for xtime_lock readers on large SMP systems. The load average calculation is an rough estimate anyway so there is no real need to protect the readers vs. the update. It's not a problem when the avenrun array is updated while a reader copies the values. Instead of iterating over all online CPUs let the scheduler_tick code update the number of active tasks shortly before the avenrun update happens. The avenrun update itself is handled by the CPU which calls do_timer(). [ Impact: reduce xtime_lock write locked section ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org>	2009-05-15 15:32:45 +02:00
Peter Zijlstra	60db5e09c1	perf_counter: frequency based adaptive irq_period Instead of specifying the irq_period for a counter, provide a target interrupt frequency and dynamically adapt the irq_period to match this frequency. [ Impact: new perf-counter attribute/feature ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090515132018.646195868@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 15:26:56 +02:00
Peter Zijlstra	789f90fcf6	perf_counter: per user mlock gift Instead of a per-process mlock gift for perf-counters, use a per-user gift so that there is less of a DoS potential. [ Impact: allow less worst-case unprivileged memory consumption ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090515132018.496182835@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 15:26:56 +02:00
Peter Zijlstra	548e1ddf25	perf_counter: remove perf_disable/enable exports Now that ACPI idle doesn't use it anymore, remove the exports. [ Impact: remove dead code/data ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090515132018.429826617@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 15:26:55 +02:00
Thomas Gleixner	f1a11e0576	futex: remove the wait queue The waitqueue which is used in struct futex_q is a leftover from the futexfd implementation. There is no need to use a waitqueue at all, as the waiting task is the only user of it. The waitqueue just adds additional locking and a loop in the wake up path which both can be avoided. We have already a task reference in struct futex_q which is used for PI futexes. Use it for normal futexes as well and just wake up the task directly. The logic of signalling the futex wakeup via setting q->lock_ptr to NULL is kept with the difference that we set it NULL before doing the wakeup. This opens an exit race window vs. a non futex wake up of the to be woken up task, which we prevent with get_task_struct / put_task_struct on the waiter. [ Impact: simplification ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-15 15:24:18 +02:00
Jason Wessel	364b5b7b1d	sysrq, intel_fb: fix sysrq g collision Commit `79e539453b` introduced a regression where you cannot use sysrq 'g' to enter kgdb. The solution is to move the intel fb sysrq over to V for video instead of G for graphics. The SMP VOYAGER code to register for the sysrq-v is not anywhere to be found in the mainline kernel, so the comments in the code were cleaned up as well. This patch also cleans up the sysrq definitions for kgdb to make it generic for the kernel debugger, such that the sysrq 'g' can be used in the future to enter a gdbstub or another kernel debugger. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2009-05-15 07:56:24 -05:00
Jens Axboe	cd17cbfda0	Revert "mm: add /proc controls for pdflush threads" This reverts commit `fafd688e4c`. Work is progressing to switch away from pdflush as the process backing for flushing out dirty data. So it seems pointless to add more knobs to control pdflush threads. The original author of the patch did not have any specific use cases for adding the knobs, so we can easily revert this before 2.6.30 to avoid having to maintain this API forever. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-15 11:32:24 +02:00
Peter Zijlstra	9e35ad388b	perf_counter: Rework the perf counter disable/enable The current disable/enable mechanism is: token = hw_perf_save_disable(); ... /* do bits */ ... hw_perf_restore(token); This works well, provided that the use nests properly. Except we don't. x86 NMI/INT throttling has non-nested use of this, breaking things. Therefore provide a reference counter disable/enable interface, where the first disable disables the hardware, and the last enable enables the hardware again. [ Impact: refactor, simplify the PMU disable/enable logic ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:02 +02:00
Peter Zijlstra	53020fe81e	perf_counter: Fix perf_output_copy() WARN to account for overflow The simple reservation test in perf_output_copy() failed to take unsigned int overflow into account, fix this. [ Impact: fix false positive warning with more than 4GB of profiling data ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:46:59 +02:00
Li Zefan	5872144f64	tracing/filters: fix off-by-one bug We should leave the last slot for the ending '\0'. [ Impact: fix possible crash when the length of an operand is 128 ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A0CDC8C.30602@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-14 23:55:12 -04:00
Li Zefan	8cd995b6de	tracing/filters: add missing unlock in a failure path [ Impact: fix deadlock in a rare case we fail to allocate memory ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A0CDC6F.7070200@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-14 23:55:10 -04:00
Steven Rostedt	1ec7c4849c	tracing: stop stack trace on first empty entry The stack tracer stores eight entries in the ring buffer when an event traces the stack. The output outputs all eight entries regardless of how many entries were recorded. This patch breaks out of the loop when a null entry is discovered. [ Impact: only print the stack that is recorded ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-14 23:40:06 -04:00
Arun R Bharadwaj	eea08f32ad	timers: Logic to move non pinned timers * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch migrates all non pinned timers and hrtimers to the current idle load balancer, from all the idle CPUs. Timers firing on busy CPUs are not migrated. While migrating hrtimers, care should be taken to check if migrating a hrtimer would result in a latency or not. So we compare the expiry of the hrtimer with the next timer interrupt on the target cpu and migrate the hrtimer only if it expires after the next interrupt on the target cpu. So, added a clockevents_get_next_event() helper function to return the next_event on the target cpu's clock_event_device. [ tglx: cleanups and simplifications ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-13 16:52:42 +02:00
Arun R Bharadwaj	cd1bb94b4a	timers: /proc/sys sysctl hook to enable timer migration * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch creates the /proc/sys sysctl interface at /proc/sys/kernel/timer_migration Timer migration is enabled by default. To disable timer migration, when CONFIG_SCHED_DEBUG = y, echo 0 > /proc/sys/kernel/timer_migration Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-13 16:52:42 +02:00
Arun R Bharadwaj	5c333864a6	timers: Identifying the existing pinned timers * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: The following pinned hrtimers have been identified and marked: 1)sched_rt_period_timer 2)tick_sched_timer 3)stack_trace_timer_fn [ tglx: fixup the hrtimer pinned mode ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-13 16:52:42 +02:00
Arun R Bharadwaj	597d027573	timers: Framework for identifying pinned timers * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: This patch creates a new framework for identifying cpu-pinned timers and hrtimers. This framework is needed because pinned timers are expected to fire on the same CPU on which they are queued. So it is essential to identify these and not migrate them, in case there are any. For regular timers, the currently existing add_timer_on() can be used queue pinned timers and subsequently mod_timer_pinned() can be used to modify the 'expires' field. For hrtimers, new modes HRTIMER_ABS_PINNED and HRTIMER_REL_PINNED are added to queue cpu-pinned hrtimer. [ tglx: use .._PINNED mode argument instead of creating tons of new functions ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-13 16:52:42 +02:00
Steven Rostedt	39a2eddb9b	genirq: fix comment to say IRQ_WAKE_THREAD Trying to implement a driver to use threaded irqs, I was confused when the return value to use that was described in the comment above request_threaded_irq was not defined. Turns out that the enum is IRQ_WAKE_THREAD where as the comment said IRQ_THREAD_WAKE. [Impact: do not confuse developers with wrong comments ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <alpine.DEB.2.00.0905121431020.13338@gandalf.stny.rr.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-12 22:11:29 +02:00
Ingo Molnar	d80c19df5f	lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS Now that lockdep coverage has increased it has become easier to run out of entries: [ 21.401387] BUG: MAX_LOCKDEP_ENTRIES too low! [ 21.402007] turning off the locking correctness validator. [ 21.402007] Pid: 1555, comm: S99local Not tainted 2.6.30-rc5-tip #2 [ 21.402007] Call Trace: [ 21.402007] [<ffffffff81069789>] add_lock_to_list+0x53/0xba [ 21.402007] [<ffffffff810eb615>] ? lookup_mnt+0x19/0x53 [ 21.402007] [<ffffffff8106be14>] check_prev_add+0x14b/0x1c7 [ 21.402007] [<ffffffff8106c304>] validate_chain+0x474/0x52a [ 21.402007] [<ffffffff8106c6fc>] __lock_acquire+0x342/0x3c7 [ 21.402007] [<ffffffff8106c842>] lock_acquire+0xc1/0xe5 [ 21.402007] [<ffffffff810eb615>] ? lookup_mnt+0x19/0x53 [ 21.402007] [<ffffffff8153aedc>] _spin_lock+0x31/0x66 Double the size - as we've done in the past. [ Impact: allow lockdep to cover more locks ] Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-12 19:59:52 +02:00
Paul Mackerras	e758a33d6f	perf_counter: call hw_perf_save_disable/restore around group_sched_in I noticed that when enabling a group via the PERF_COUNTER_IOC_ENABLE ioctl on the group leader, the counters weren't enabled and counting immediately on return from the ioctl, but did start counting a little while later (presumably after a context switch). The reason was that __perf_counter_enable calls group_sched_in which calls hw_perf_group_sched_in, which on powerpc assumes that the caller has called hw_perf_save_disable already. Until commit 46d686c6 ("perf_counter: put whole group on when enabling group leader") it was true that all callers of group_sched_in had called hw_perf_save_disable first, and the powerpc hw_perf_group_sched_in relies on that (there isn't an x86 version). This fixes the problem by putting calls to hw_perf_save_disable / hw_perf_restore around the calls to group_sched_in and counter_sched_in in __perf_counter_enable. Having the calls to hw_perf_save_disable/restore around the counter_sched_in call is harmless and makes this call consistent with the other call sites of counter_sched_in, which have all called hw_perf_save_disable first. [ Impact: more precise counter group disable/enable functionality ] Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <18953.25733.53359.147452@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-12 15:31:06 +02:00
Ingo Molnar	6cda3eb62e	Merge branch 'x86/apic' into irq/numa Merge reason: both topics modify the APIC code but were able to do it in parallel so far. An upcoming patch generates a conflict so merge them to avoid the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-12 12:17:36 +02:00
Steven Rostedt	168b6b1d05	ring-buffer: move code around to remove some branches This is a bit of micro-optimizations. But since the ring buffer is used in tracing every function call, it is an extreme hot path. Every nanosecond counts. This change shows over 5% improvement in the ring-buffer-benchmark. [ Impact: more efficient code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 23:33:06 -04:00
Steven Rostedt	88eb012536	ring-buffer: use internal time stamp function The ring_buffer_time_stamp that is exported adds a little more overhead than is needed for using it internally. This patch adds an internal timestamp function that can be inlined (a single line function) and used internally for the ring buffer. [ Impact: a little less overhead to the ring buffer ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 23:14:03 -04:00
Steven Rostedt	0f0c85fc80	ring-buffer: small optimizations Doing some small changes in the fast path of the ring buffer recording saves over 3% in the ring-buffer-benchmark test. [ Impact: a little faster ring buffer recording ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 23:12:34 -04:00
H. Peter Anvin	5031296c57	x86: add extension fields for bootloader type and version A long ago, in days of yore, it all began with a god named Thor. There were vikings and boats and some plans for a Linux kernel header. Unfortunately, a single 8-bit field was used for bootloader type and version. This has generally worked without too much pain, but we're getting close to flat running out of ID fields. Add extension fields for both type and version. The type will be extended if it the old field is 0xE; the version is a simple MSB extension. Keep /proc/sys/kernel/bootloader_type containing (type << 4) + (ver & 0xf) for backwards compatiblity, but also add /proc/sys/kernel/bootloader_version which contains the full version number. [ Impact: new feature to support more bootloaders ] Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-11 17:45:06 -07:00
Steven Rostedt	be957c447f	ring-buffer: move calculation of event length The event length is calculated and passed in to rb_reserve_next_event in two different locations. Having rb_reserve_next_event do the calculations directly makes only one location to do the change and causes the calculation to be inlined by gcc. Before: text data bss dec hex filename 16538 24 12 16574 40be kernel/trace/ring_buffer.o After: text data bss dec hex filename 16490 24 12 16526 408e kernel/trace/ring_buffer.o [ Impact: smaller more efficient code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 14:42:53 -04:00
Steven Rostedt	1cd8d73589	ring-buffer: remove type parameter from rb_reserve_next_event The rb_reserve_next_event is only called for the data type (type = 0). There is no reason to pass in the type to the function. Before: text data bss dec hex filename 16554 24 12 16590 40ce kernel/trace/ring_buffer.o After: text data bss dec hex filename 16538 24 12 16574 40be kernel/trace/ring_buffer.o [ Impact: cleaner, smaller and slightly more efficient code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 14:19:00 -04:00
Steven Rostedt	d988ff94c1	ring-buffer: check for divide by zero in ring-buffer-benchmark Although we check if "missed" is not zero, we divide by hit + missed, and the addition can possible overflow and become a divide by zero. This patch checks for this case, and will report it when it happens then modify "hit" to make the calculation be non zero. [ Impact: prevent possible divide by zero in ring-buffer-benchmark ] Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 13:22:26 -04:00
Steven Rostedt	5a772b2b3c	ring-buffer: replace constants with time macros in ring-buffer-benchmark The use of numeric constants is discouraged. It is cleaner and more descriptive to use macros for constant time conversions. This patch also removes an extra new line. [ Impact: more descriptive time conversions ] Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-11 13:22:26 -04:00
Ingo Molnar	7961386fe9	Merge commit 'v2.6.30-rc5' into sched/core Merge reason: sched/core was on .30-rc1 before, update to latest fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-11 12:59:37 +02:00
Li Zefan	0498625793	blktrace: pdu_buf of pc events should be unsigned I got this: 8,0 1 305.417782332 2037 I R 32 (ffffff9e 10 00 ...) [bash] It should be: 8,0 1 305.417782332 2037 I R 32 (9e 10 00 ...) [bash] [ Impact: fix output of pc events ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4A07C6B3.9080802@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-11 12:25:50 +02:00
Paul Mackerras	615a3f1e05	perf_counter: call atomic64_set for counter->count A compile warning triggered because we are calling atomic_set(&counter->count). But since counter->count is an atomic64_t, we have to use atomic64_set. So the count can be set short, resulting in the reset ioctl only resetting the low word. [ Impact: clear counter properly during the reset ioctl ] Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <18951.48285.270311.981806@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-11 12:10:54 +02:00
Paul Mackerras	a08b159fc2	perf_counter: don't count scheduler ticks as context switches The context-switch software counter gives inflated values at present because each scheduler tick and each process-wide counter enable/disable prctl gets counted as a context switch. This happens because perf_counter_task_tick, perf_counter_task_disable and perf_counter_task_enable all call perf_counter_task_sched_out, which calls perf_swcounter_event to record a context switch event. This fixes it by introducing a variant of perf_counter_task_sched_out with two underscores in front for internal use within the perf_counter code, and makes perf_counter_task_{tick,disable,enable} call it. This variant doesn't record a context switch event, and takes a struct perf_counter_context *. This adds the new variant rather than changing the behaviour or interface of perf_counter_task_sched_out because that is called from other code. [ Impact: fix inflated context-switch event counts ] Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <18951.48034.485580.498953@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-11 12:10:53 +02:00
Paul Mackerras	6751b71ea2	perf_counter: Put whole group on when enabling group leader Currently, if you have a group where the leader is disabled and there are siblings that are enabled, and then you enable the leader, we only put the leader on the PMU, and not its enabled siblings. This is incorrect, since the enabled group members should be all on or all off at any given point. This fixes it by adding a call to group_sched_in in __perf_counter_enable in the case where we're enabling a group leader. To avoid the need for a forward declaration this also moves group_sched_in up before __perf_counter_enable. The actual content of group_sched_in is unchanged by this patch. [ Impact: fix bug in counter enable code ] Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <18951.34946.451546.691693@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-11 12:10:53 +02:00
Tejun Heo	2e46e8b27a	block: drop request->hard_* and nr_sectors struct request has had a few different ways to represent some properties of a request. ->hard_ represent block layer's view of the request progress (completion cursor) and the ones without the prefix are supposed to represent the issue cursor and allowed to be updated as necessary by the low level drivers. The thing is that as block layer supports partial completion, the two cursors really aren't necessary and only cause confusion. In addition, manual management of request detail from low level drivers is cumbersome and error-prone at the very least. Another interesting duplicate fields are rq->[hard_]nr_sectors and rq->{hard_cur\|current}_nr_sectors against rq->data_len and rq->bio->bi_size. This is more convoluted than the hard_ case. rq->[hard_]nr_sectors are initialized for requests with bio but blk_rq_bytes() uses it only for !pc requests. rq->data_len is initialized for all request but blk_rq_bytes() uses it only for pc requests. This causes good amount of confusion throughout block layer and its drivers and determining the request length has been a bit of black magic which may or may not work depending on circumstances and what the specific LLD is actually doing. rq->{hard_cur\|current}_nr_sectors represent the number of sectors in the contiguous data area at the front. This is mainly used by drivers which transfers data by walking request segment-by-segment. This value always equals rq->bio->bi_size >> 9. However, data length for pc requests may not be multiple of 512 bytes and using this field becomes a bit confusing. In general, having multiple fields to represent the same property leads only to confusion and subtle bugs. With recent block low level driver cleanups, no driver is accessing or manipulating these duplicate fields directly. Drop all the duplicates. Now rq->sector means the current sector, rq->data_len the current total length and rq->bio->bi_size the current segment length. Everything else is defined in terms of these three and available only through accessors. * blk_recalc_rq_sectors() is collapsed into blk_update_request() and now handles pc and fs requests equally other than rq->sector update. This means that now pc requests can use partial completion too (no in-kernel user yet tho). * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer now uses byte count as the primary data length. * blk_rq_pos() is now guranteed to be always correct. In-block users converted. * blk_rq_bytes() is now guaranteed to be always valid as is blk_rq_sectors(). In-block users converted. * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9. More convenient one is used. * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const pointer to request. [ Impact: API cleanup, single way to represent one property of a request ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:50:54 +02:00
Tejun Heo	5b93629b45	block: implement blk_rq_pos/[cur_]sectors() and convert obvious ones Implement accessors - blk_rq_pos(), blk_rq_sectors() and blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors and rq->hard_cur_sectors respectively and convert direct references of the said fields to the accessors. This is in preparation of request data length handling cleanup. Geert : suggested adding const to struct request * parameter to accessors Sergei : spotted error in patch description [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Grant Likely <grant.likely@secretlab.ca> Ackec-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:50:53 +02:00
David Howells	5e751e992f	CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace Rename cred_exec_mutex to reflect that it's a guard against foreign intervention on a process's credential state, such as is made by ptrace(). The attachment of a debugger to a process affects execve()'s calculation of the new credential state - _and_ also setprocattr()'s calculation of that state. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-11 08:15:36 +10:00
Al Viro	6f5bbff9a1	Convert obvious places to deactivate_locked_super() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-05-09 10:49:40 -04:00
Ron	92d23f703c	sched: Fix fallback sched_clock()'s offset when using jiffies Account for the initial offset to the jiffy count. [ Impact: fix printk timestamps on architectures using fallback sched_clock() ] Signed-off-by: Ron Lee <ron@debian.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-09 10:08:19 +02:00
Masami Hiramatsu	201517a7f3	kprobes: fix to use text_mutex around arm/disarm kprobe Fix kprobes to lock text_mutex around some arch_arm/disarm_kprobe() which are newly added by commit `de5bd88d5a`. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-08 16:23:48 -07:00
Steven Rostedt	4671c79408	tracing: add trace_set_clr_event to export event enabling function Other parts of the kernel may need to be able to enable or disable specific events. Especially parts that create trace events. [ Impact: allow enabling of trace events by those that create the event ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-08 16:30:26 -04:00
Steven Rostedt	29f93943d1	tracing: initialize return value for __ftrace_set_clr_event Commit `8f31bfe538` tracing/events: clean up for ftrace_set_clr_event() Moved out the code for ftrace_set_clr_event into a helper funciton but did not initialize the return value. As a result, we do not warn about a typo in the echoing of events in set_event. This patch restores the old warning: # echo foobar > set_event -bash: echo: write error: Invalid argument [ Impact: restore warning of invalid entries to set_event ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-08 16:06:47 -04:00
Peter Zijlstra	f370e1e2f1	perf_counter: add PERF_RECORD_CPU Allow recording the CPU number the event was generated on. RFC: this leaves a u32 as reserved, should we fill in the node_id() there, or leave this open for future extention, as userspace can already easily do the cpu->node mapping if needed. [ Impact: extend perfcounter output record format ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090508170029.008627711@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 20:36:59 +02:00
Peter Zijlstra	a85f61abe1	perf_counter: add PERF_RECORD_CONFIG Much like CONFIG_RECORD_GROUP records the hw_event.config to identify the values, allow to record this for all counters. [ Impact: extend perfcounter output record format ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090508170028.923228280@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 20:36:58 +02:00
Peter Zijlstra	3df5edad87	perf_counter: rework ioctl()s Corey noticed that ioctl()s on grouped counters didn't work on the whole group. This extends the ioctl() interface to take a second argument that is interpreted as a flags field. We then provide PERF_IOC_FLAG_GROUP to toggle the behaviour. Having this flag gives the greatest flexibility, allowing you to individually enable/disable/reset counters in a group, or all together. [ Impact: fix group counter enable/disable semantics ] Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20090508170028.837558214@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 20:36:58 +02:00
Peter Zijlstra	7fc23a5380	perf_counter: optimize perf_counter_task_tick() perf_counter_task_tick() does way too much work to find out there's nothing to do. Provide an easy short-circuit for the normal case where there are no counters on the system. [ Impact: micro-optimization ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090508170028.750619201@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 20:36:57 +02:00
Li Zefan	c142b15dc5	tracing/events: simplify system_enable_read() A smarter way to figure out the output of an enable file. [ Impact: clean up ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4A0399A5.2080603@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 14:00:36 +02:00
Li Zefan	8f31bfe538	tracing/events: clean up for ftrace_set_clr_event() Add a helper function __ftrace_set_clr_event(), and replace some ftrace_set_clr_event() calls with this helper, thus we don't need any kstrdup() or kmalloc(). As a side effect, this patch fixes an issue in self tests code, which is similar to the one fixed in commit `d6bf81ef0f` ("tracing: append ":*" to internal setting of system events") It's a small issue and won't cause any bug in fact, but we should do things right anyway. [ Impact: prevent spurious event-enabling in tracing self-tests ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4A03998E.3020503@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 14:00:35 +02:00
Ingo Molnar	f066a15533	Merge branch 'x86/urgent' into x86/xen Conflicts: arch/frv/include/asm/pgtable.h arch/x86/include/asm/required-features.h arch/x86/xen/mmu.c Merge reason: x86/xen was on a .29 base still, move it to a fresher branch and pick up Xen fixes as well, plus resolve conflicts Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-08 10:50:00 +02:00
James Morris	d254117099	Merge branch 'master' into next	2009-05-08 17:56:47 +10:00
Steven Rostedt	74f4fd2166	ring-buffer: change WARN_ON from checking preempt_count to preemptible There's a WARN_ON in the ring buffer code that makes sure preemption is disabled. It checks "!preempt_count()". But when CONFIG_PREEMPT is not enabled, preempt_count() is always zero, and this will trigger the warning. [ Impact: prevent false warning on non preemptible kernels ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 20:01:11 -04:00
Steven Rostedt	7da3046d6c	ring-buffer: add total count in ring-buffer-benchmark It is nice to see the overhead of the benchmark test when tracing is disabled. That is, we turn off the ring buffer just to see what the cost of running the loop that calls into the ring buffer is. Currently, if no entries wer made, we get 0. This is not informative. This patch changes it to check if we had any "missed" (non recorded) events. If so, a total count is also reported. [ Impact: evaluate the over head of the ring buffer benchmark test ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 19:52:20 -04:00
Steven Rostedt	0574ea421b	ring-buffer: only periodically call cond_resched to ring-buffer-benchmark Calling cond_resched at every iteration of the loop adds a bit of overhead to the benchmark. This patch does two things. 1) only calls cond-resched when CONFIG_PREEMPT is not enabled 2) only calls cond-resched after so many traces has been performed. [ Impact: less overhead to the ring-buffer-benchmark ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 14:20:28 -04:00
Steven Rostedt	65b7724204	tracing: have menu default enabled when kernel debug is configured Tracing can be very helpful to debug the kernel. When DEBUG_KERNEL is enabled it is nice to enable the trace menu as well. This patch only make the tracing menu enabled by default, it does not make any of the tracers enabled. And the menu is only enabled by default if DEBUG_KERNEL is enabled. [ Impact: show tracing options to those debugging the kernel ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 12:49:27 -04:00
Steven Rostedt	d6bf81ef0f	tracing: append ":" to internal setting of system events The system enabling of events uses the same code as the set_event file. It passes in the name of the system to the parser and that will enable all the events that has that system as a name. The problem is that it will also enable events with the same name as the system. If you have system name foo, and system name bar, but within the system bar, there exists an event called foo. By setting the system name foo, you will also be enabling the event foo in the system bar. This is not an expected result. The solution is to pass in "foo:", which will only enable the system foo and not events called foo. [ Impact: prevent accidental enabling of events with same name as a system ] Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 11:49:35 -04:00
Steven Rostedt	29c8000ee7	ring-buffer: remove complex calculations in ring-buffer-test Ingo Molnar thought that the code to calculate the time in cond_resched is a bit too ugly and is not needed. This patch removes it and replaces it with a simple call to cond_resched. I kept the comment that explains the reason for the cond_resched. [ Impact: remove ugly code ] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-07 11:16:18 -04:00
Ingo Molnar	0ad5d703c6	Merge branch 'tracing/hw-branch-tracing' into tracing/core Merge reason: this topic is ready for upstream now. It passed Oleg's review and Andrew had no further mm/* objections/observations either. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-07 13:36:22 +02:00
Ingo Molnar	44347d947f	Merge branch 'linus' into tracing/core Merge reason: tracing/core was on a .30-rc1 base and was missing out on on a handful of tracing fixes present in .30-rc5-almost. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-07 11:17:34 +02:00
Li Zefan	d94fc523f3	tracing/events: fix concurrent access to ftrace_events list, fix In filter_add_subsystem_pred() we should release event_mutex before calling filter_free_subsystem_preds(), since both functions hold event_mutex. [ Impact: fix deadlock when writing invalid pred into subsystem filter ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: tzanussi@gmail.com Cc: a.p.zijlstra@chello.nl Cc: fweisbec@gmail.com Cc: rostedt@goodmis.org LKML-Reference: <4A028993.7020509@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-07 10:07:28 +02:00
Frederic Weisbecker	5928c3cc0f	tracing/filters: support for operator reserved characters in strings When we set a filter for an event, such as: echo "name == my_lock_name" > \ /debug/tracing/events/lockdep/lock_acquired/filter then the following order of token type is parsed: - space - operator - parentheses - operand Because the operators and parentheses have a higher precedence than the operand characters, which is normal, then we can't use any string containing such special characters: ()=<>!&\| To get this support and also avoid ambiguous intepretation from the parser or the human, we can do it using double quotes so that we keep the usual languages habits. Then after this patch you can still declare string condition like before: echo name == myname But if you want to compare against a string containing an operator character, you can use double quotes: echo 'name == "&myname"' Don't forget to include the whole expression into single quotes or the double ones will be eaten by echo. [ Impact: support strings with special characters for tracing filters ] Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Zhaolei <zhaolei@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-07 10:05:57 +02:00
Frederic Weisbecker	e8808c1019	tracing/filters: support for filters of dynamic sized arrays Currently the filtering infrastructure supports well the numeric types and fixed sized array types. But the recently added __string() field uses a specific indirect offset mechanism which requires a specific predicate. Until now it wasn't supported. This patch adds this support and implies very few changes, only a new predicate is needed, the management of this specific field can be done through the usual string helpers in the filtering infrastructure. [ Impact: support all kinds of strings in the tracing filters ] Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Zhaolei <zhaolei@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-05-07 10:05:57 +02:00
David Rientjes	aa47b7e0f8	sched: emit thread info flags with stack trace When a thread is oom killed and fails to exit, it's helpful to know which threads have access to memory reserves if the machine livelocks. This is done by testing for the TIF_MEMDIE thread info flag and should be displayed alongside stack traces to identify tasks that have access to such reserves but are still stuck allocating pages, for instance. It would probably be helpful in other cases as well, so all thread info flags are emitted when showing a task. ( v2: fix warning reported by Stephen Rothwell ) [ Impact: extend debug printout info ] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> LKML-Reference: <alpine.DEB.2.00.0905040136390.15831@chino.kir.corp.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-07 09:36:28 +02:00
Steven Rostedt	8ae79a138e	tracing: add hierarchical enabling of events With the current event directory, you can only enable individual events. The file debugfs/tracing/set_event is used to be able to enable or disable several events at once. But that can still be awkward. This patch adds hierarchical enabling of events. That is, each directory in debugfs/tracing/events has an "enable" file. This file can enable or disable all events within the directory and below. # echo 1 > /debugfs/tracing/events/enable will enable all events. # echo 1 > /debugfs/tracing/events/sched/enable will enable all events in the sched subsystem. # echo 1 > /debugfs/tracing/events/enable # echo 0 > /debugfs/tracing/events/irq/enable will enable all events, but then disable just the irq subsystem events. When reading one of these enable files, there are four results: 0 - all events this file affects are disabled 1 - all events this file affects are enabled X - there is a mixture of events enabled and disabled ? - this file does not affect any event Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 23:11:42 -04:00
Steven Rostedt	9456f0fa6d	tracing: reset ring buffer when removing modules with events Li Zefan found that there's a race using the event ids of events and modules. When a module is loaded, an event id is incremented. We only have 16 bits for event ids (65536) and there is a possible (but highly unlikely) race that we could load and unload a module that registers events so many times that the event id counter overflows. When it overflows, it then restarts and goes looking for available ids. An id is available if it was added by a module and released. The race is if you have one module add an id, and then is removed. Another module loaded can use that same event id. But if the old module still had events in the ring buffer, the new module's call back would get bogus data. At best (and most likely) the output would just be garbage. But if the module for some reason used pointers (not recommended) then this could potentially crash. The safest thing to do is just reset the ring buffer if a module that registered events is removed. [ Impact: prevent unpredictable results of event id overflows ] Reported-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <49FEAFD0.30106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 23:11:41 -04:00
Andi Kleen	57adc4d2db	Eliminate thousands of warnings with gcc 3.2 build When building with gcc 3.2 I get thousands of warnings such as include/linux/gfp.h: In function `allocflags_to_migratetype': include/linux/gfp.h:105: warning: null format string due to passing a NULL format string to warn_slowpath() in #define __WARN() warn_slowpath(__FILE__, __LINE__, NULL) Split this case out into a separate call. This also shrinks the kernel slightly: text data bss dec hex filename 4802274 707668 712704 6222646 5ef336 vmlinux text data bss dec hex filename 4799027 703572 712704 6215303 5ed687 vmlinux due to removeing one argument from the commonly-called __WARN(). [akpm@linux-foundation.org: reduce scope of `empty'] Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-06 16:36:09 -07:00
Wu Fengguang	381a80e6df	inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive There is what we believe to be a false positive reported by lockdep. inotify_inode_queue_event() => take inotify_mutex => kernel_event() => kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim => dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock The plan is to fix this via lockdep annotation, but that is proving to be quite involved. The patch flips the allocation over to GFP_NFS to shut the warning up, for the 2.6.30 release. Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm to switch it back to GFP_KERNEL so we don't forget. ================================= [ INFO: inconsistent lock state ] 2.6.30-rc2-next-20090417 #203 --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes: (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0 {RECLAIM_FS-ON-W} state was registered at: [<ffffffff81079188>] mark_held_locks+0x68/0x90 [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100 [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0 [<ffffffff81130652>] kernel_event+0xe2/0x190 [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230 [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110 [<ffffffff8110444d>] vfs_create+0xcd/0x140 [<ffffffff8110825d>] do_filp_open+0x88d/0xa20 [<ffffffff810f6b68>] do_sys_open+0x98/0x140 [<ffffffff810f6c50>] sys_open+0x20/0x30 [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 690455 hardirqs last enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80 hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0 softirqs last enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220 softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50 other info that might help us debug this: 2 locks held by kswapd0/380: #0: (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180 #1: (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0 stack backtrace: Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203 Call Trace: [<ffffffff810789ef>] print_usage_bug+0x19f/0x200 [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50 [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0 [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0 [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0 [<ffffffff810f478c>] ? slob_free+0x10c/0x370 [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0 [<ffffffff81012fe9>] ? sched_clock+0x9/0x10 [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0 [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0 [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0 [<ffffffff8110cb23>] d_kill+0x33/0x60 [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350 [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0 [<ffffffff810d0cc5>] shrink_slab+0x125/0x180 [<ffffffff810d1540>] kswapd+0x560/0x7a0 [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0 [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0 [<ffffffff8106555b>] kthread+0x5b/0xa0 [<ffffffff8100d40a>] child_rip+0xa/0x20 [<ffffffff8100cdd0>] ? restore_args+0x0/0x30 [<ffffffff81065500>] ? kthread+0x0/0xa0 [<ffffffff8100d400>] ? child_rip+0x0/0x20 [eparis@redhat.com: fix audit too] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matt Mackall <mpm@selenic.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Eric Paris <eparis@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-06 16:36:09 -07:00
Steven Rostedt	3e07a4f680	ring-buffer: change test to be more latency friendly The ring buffer benchmark/test runs a producer for 10 seconds. This is done with preemption and interrupts enabled. But if the kernel is not compiled with CONFIG_PREEMPT, it basically stops everything but interrupts for 10 seconds. Although this is just a test and is not for production, this attribute can be quite annoying. It can also spawn badness elsewhere. This patch solves the issues by calling "cond_resched" when the system is not compiled with CONFIG_PREEMPT. It also keeps track of the time spent to call cond_resched such that it does not go against the time calculations. That is, if the task schedules away, the time scheduled out is removed from the test data. Note, this only works for non PREEMPT because we do not know when the task is scheduled out if we have PREEMPT enabled. [ Impact: prevent test from stopping the world for 10 seconds ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 18:36:59 -04:00
Steven Rostedt	6634ff26cc	ring-buffer: make moving the tail page a separate function Ingo Molnar thought the code would be cleaner if we used a function call instead of a goto for moving the tail page. After implementing this, it seems that gcc still inlines the result and the output is pretty much the same. Since this is considered a cleaner approach, might as well implement it. [ Impact: code clean up ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 15:30:07 -04:00
Steven Rostedt	00c81a58c5	ring-buffer: check for failed allocation in ring buffer benchmark The result of the allocation of the ring buffer read page in the ring buffer bench mark does not check the return to see if a page was actually allocated. This patch fixes that. [ Impact: avoid NULL dereference ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 12:49:20 -04:00
Steven Rostedt	8e7abf1c62	ring-buffer: remove unneeded conditional in rb_reserve_next The code in __rb_reserve_next checks on page overflow if it is the original commiter and then resets the page back to the original setting. Although this is fine, and the code is correct, it is a bit fragil. Some experimental work I did breaks it easily. The better and more robust solution is to have all commiters that overflow the page, simply subtract what they added. [ Impact: more robust ring buffer account management ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 12:49:19 -04:00
Jaswinder Singh Rajput	48dd0fed90	tracing: trace_output.c, fix false positive compiler warning This compiler warning: CC kernel/trace/trace_output.o kernel/trace/trace_output.c: In function ‘register_ftrace_event’: kernel/trace/trace_output.c:544: warning: ‘list’ may be used uninitialized in this function Is wrong as 'list' is always initialized - but GCC (4.3.2) does not recognize this relationship properly. Work around the warning by initializing the variable to NULL. [ Impact: fix false positive compiler warning ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 14:19:16 +02:00
Alan D. Brunelle	22a7c31a96	blktrace: from-sector redundant in trace_block_remap Remove redundant from-sector parameter: it's /always/ the bio's sector passed in. [ Impact: cleanup ] Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49FF517C.7000503@hp.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 14:13:01 +02:00
Alan D. Brunelle	a42aaa3bbc	blktrace: correct remap names This attempts to clarify names utilized during block I/O remap operations (partition, volume manager). It correctly matches up the /from/ information for both device & sector. This takes in the concept from Kosaki Motohiro and extends it to include better naming for the "device_from" field. [ Impact: cleanup ] Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49FF4FAE.3000301@hp.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 14:13:00 +02:00
Mathieu Desnoyers	de1d728606	tracepoint: trace_sched_migrate_task(): remove parameter The orig_cpu parameter in trace_sched_migrate_task() is not necessary, it can be got by using task_cpu(p) in the probe. [ Impact: micro-optimization ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> [ modified from Mathieu's patch. The original patch is at: http://marc.info/?l=linux-kernel&m=123791201716239&w=2 ] Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: fweisbec@gmail.com Cc: rostedt@goodmis.org Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: zhaolei@cn.fujitsu.com Cc: laijs@cn.fujitsu.com LKML-Reference: <49FFFDB7.1050402@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 12:15:51 +02:00
Li Zefan	20c8928abe	tracing/events: fix concurrent access to ftrace_events list A module will add/remove its trace events when it gets loaded/unloaded, so the ftrace_events list is not "const", and concurrent access needs to be protected. This patch thus fixes races between loading/unloding modules and read 'available_events' or read/write 'set_event', etc. Below shows how to reproduce the race: # for ((; ;)) { cat /mnt/tracing/available_events; } > /dev/null & # for ((; ;)) { insmod trace-events-sample.ko; rmmod sample; } & After a while: BUG: unable to handle kernel paging request at 0010011c IP: [<c1080f27>] t_next+0x1b/0x2d ... Call Trace: [<c10c90e6>] ? seq_read+0x217/0x30d [<c10c8ecf>] ? seq_read+0x0/0x30d [<c10b4c19>] ? vfs_read+0x8f/0x136 [<c10b4fc3>] ? sys_read+0x40/0x65 [<c1002a68>] ? sysenter_do_call+0x12/0x36 [ Impact: fix races when concurrent accessing ftrace_events list ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4A00F709.3080800@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 10:38:19 +02:00
Li Zefan	2df75e4157	tracing/events: fix memory leak when unloading module When unloading a module, memory allocated by init_preds() and trace_define_field() is not freed. [ Impact: fix memory leak ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4A00F6E0.3040503@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 10:38:19 +02:00
Ingo Molnar	3611dfb8ed	Merge branch 'core/locking' into perfcounters/core Merge reason: we moved a mutex.h commit that originated from the perfcounters tree into core/locking - but now merge back that branch to solve a merge artifact and to pick up cleanups of this commit that happened in core/locking. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 08:47:26 +02:00
Steven Rostedt	5092dbc96f	ring-buffer: add benchmark and tester This patch adds code that can benchmark the ring buffer as well as test it. This code can be compiled into the kernel (not recommended) or as a module. A separate ring buffer is used to not interfer with other users, like ftrace. It creates a producer and a consumer (option to disable creation of the consumer) and will run for 10 seconds, then sleep for 10 seconds and then repeat. While running, the producer will write 10 byte loads into the ring buffer with just putting in the current CPU number. The reader will continually try to read the buffer. The reader will alternate from reading the buffer via event by event, or by full pages. The output is a pr_info, thus it will fill up the syslogs. Starting ring buffer hammer End ring buffer hammer Time: 9000349 (usecs) Overruns: 12578640 Read: 5358440 (by events) Entries: 0 Total: 17937080 Missed: 0 Hit: 17937080 Entries per millisec: 1993 501 ns per entry Sleeping for 10 secs Starting ring buffer hammer End ring buffer hammer Time: 9936350 (usecs) Overruns: 0 Read: 28146644 (by pages) Entries: 74 Total: 28146718 Missed: 0 Hit: 28146718 Entries per millisec: 2832 353 ns per entry Sleeping for 10 secs Time: is the time the test ran Overruns: the number of events that were overwritten and not read Read: the number of events read (either by pages or events) Entries: the number of entries left in the buffer (the by pages will only read full pages) Total: Entries + Read + Overruns Missed: the number of entries that failed to write Hit: the number of entries that were written The above example shows that it takes ~353 nanosecs per entry when there is a reader, reading by pages (and no overruns) The event by event reader slowed the producer down to 501 nanosecs. [ Impact: see how changes to the ring buffer affect stability and performance ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-06 00:08:50 -04:00
Steven Rostedt	aa20ae8444	ring-buffer: move big if statement down In the hot path of the ring buffer "__rb_reserve_next" there's a big if statement that does not even return back to the work flow. code; if (cross to next page) { [ lots of code ] return; } more code; The condition is even the unlikely path, although we do not denote it with an unlikely because gcc is fine with it. The condition is true when the write crosses a page boundary, and we need to start at a new page. Having this if statement makes it hard to read, but calling another function to do the work is also not appropriate, because we are using a lot of variables that were set before the if statement, and we do not want to send them as parameters. This patch changes it to a goto: code; if (cross to next page) goto next_page; more code; return; next_page: [ lots of code] This makes the code easier to understand, and a bit more obvious. The output from gcc is practically identical. For some reason, gcc decided to use different registers when I switched it to a goto. But other than that, the logic is the same. [ Impact: easier to read code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 21:16:11 -04:00
Steven Rostedt	94487d6d53	tracing: use proper export symbol for tracing api When adding the EXPORT_SYMBOL to some of the tracing API, I accidently used EXPORT_SYMBOL instead of EXPORT_SYMBOL_GPL. This patch fixes that mistake. [ Impact: export the tracing code only for GPL modules ] Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 19:22:53 -04:00
Steven Rostedt	41ede23ede	ring-buffer: disable writers when resetting buffers As a precaution, it is best to disable writing to the ring buffers when reseting them. [ Impact: prevent weird things if write happens during reset ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 17:22:02 -04:00
Steven Rostedt	afbab76a62	ring-buffer: have read page swap increment counter with page entries In the swap page ring buffer code that is used by the ftrace splice code, we scan the page to increment the counter of entries read. With the number of entries already in the page we simply need to add it. [ Impact: speed up reading page from ring buffer ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 16:58:24 -04:00
Linus Torvalds	99ee12973e	Merge branch 'timers/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevents: prevent endless loop in tick_handle_periodic()	2009-05-05 12:09:38 -07:00
Linus Torvalds	bcb1656827	Merge branch 'irq/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Revert "genirq: assert that irq handlers are indeed running in hardirq context"	2009-05-05 12:09:27 -07:00
Linus Torvalds	e858e8b076	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: account system time properly	2009-05-05 12:08:40 -07:00
Linus Torvalds	da87bbd142	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/posix-cpu-timers.c: fix sparse warning dma-debug: remove broken dma memory leak detection for 2.6.30 locking: Documentation: lockdep-design.txt, fix note of state bits	2009-05-05 12:08:20 -07:00
Linus Torvalds	e91b3b2681	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: x86, mmiotrace: fix range test tracing: fix ref count in splice pages	2009-05-05 12:08:02 -07:00
Steven Rostedt	778c55d44e	ring-buffer: record page entries in buffer page descriptor Currently, when the ring buffer writer overflows the buffer and must write over non consumed data, we increment the overrun counter by reading the entries on the page we are about to overwrite. This reads the entries one by one. This is not very effecient. This patch adds another entry counter into each buffer page descriptor that keeps track of the number of entries on the page. Now on overwrite, the overrun counter simply needs to add the number of entries that is on the page it is about to overwrite. [ Impact: speed up of ring buffer in overwrite mode ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 14:28:08 -04:00
Oleg Nesterov	41c51c98f5	rcu: rcu_sched_grace_period(): kill the bogus flush_signals() As a kernel thread, rcu_sched_grace_period() runs with all signals ignored. It can never receive a signal even if it sleeps in TASK_INTERRUPTIBLE, it needs the explicit allow_signal() to be visible for signals. [ Impact: reduce kernel size, remove dead code ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20090503211118.GA22973@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:28:05 +02:00
Steven Rostedt	e4906eff9e	ring-buffer: convert cpu buffer entries to local_t The entries counter in cpu buffer is not atomic. It can be updated by other interrupts or from another CPU (readers). But making entries into "atomic_t" causes an atomic operation that can hurt performance. Instead we convert it to a local_t that will increment a counter with a local CPU atomic operation (if the arch supports it). Instead of fighting with readers and overwrites that decrement the counter, I added a "read" counter. Every time a reader reads an entry it is incremented. We already have a overrun counter and with that, the entries counter and the read counter, we can calculate the total number of entries in the buffer with: (entries - overrun) - read As long as the total number of entries in the ring buffer is less than the word size, this will work. But since the entries counter was previously a long, this is no different than what we had before. Thanks to Andrew Morton for pointing out in the first version that atomic_t does not replace unsigned long. I switched to atomic_long_t even though it is signed. A negative count is most likely a bug. [ Impact: keep accurate count of cpu buffer entries ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 14:25:44 -04:00
Peter Zijlstra	2023b35921	perf_counter: inheritable sample counters Redirect the output to the parent counter and put in some sanity checks. [ Impact: new perfcounter feature - inherited sampling counters ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090505155437.331556171@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:18:33 +02:00
Peter Zijlstra	22c1558e51	perf_counter: fix the output lock Use -1 instead of 0 as unlocked, since 0 is a valid cpu number. ( This is not an issue right now but will be once we allow multiple counters to output to the same mmap area. ) [ Impact: prepare code for multi-counter profile output ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090505155437.232686598@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:18:32 +02:00
Peter Zijlstra	c5078f78b4	perf_counter: provide an mlock threshold Provide a threshold to relax the mlock accounting, increasing usability. Each counter gets perf_counter_mlock_kb for free. [ Impact: allow more mmap buffering ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090505155437.112113632@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:18:32 +02:00
Peter Zijlstra	6de6a7b957	perf_counter: add ioctl(PERF_COUNTER_IOC_RESET) Provide a way to reset an existing counter - this eases PAPI libraries around perfcounters. Similar to read() it doesn't collapse pending child counters. [ Impact: new perfcounter fd ioctl method to reset counters ] Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20090505155437.022272933@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:18:31 +02:00
Peter Zijlstra	c66de4a5be	perf_counter: uncouple data_head updates from wakeups Keep data_head up-to-date irrespective of notifications. This fixes the case where you disable a counter and don't get a notification for the last few pending events, and it also allows polling usage. [ Impact: increase precision of perfcounter mmap-ed fields ] Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20090505155436.925084300@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:18:30 +02:00
Peter Zijlstra	60aa605dfc	sched: rt: document the risk of small values in the bandwidth settings Thomas noted that we should disallow sysctl_sched_rt_runtime == 0 for (!RT_GROUP) since the root group always has some RT tasks in it. Further, update the documentation to inspire clue. [ Impact: exclude corner-case sysctl_sched_rt_runtime value ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090505155436.863098054@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-05 20:07:57 +02:00
Steven Rostedt	c8d771835e	tracing: export stats of ring buffers to userspace This patch adds stats to the ftrace ring buffers: # cat /debugfs/tracing/per_cpu/cpu0/stats entries: 42360 overrun: 30509326 commit overrun: 0 nmi dropped: 0 Where entries are the total number of data entries in the buffer. overrun is the number of entries not consumed and were overwritten by the writer. commit overrun is the number of entries dropped due to nested writers wrapping the buffer before the initial writer finished the commit. nmi dropped is the number of entries dropped due to the ring buffer lock being held when an nmi was going to write to the ring buffer. Note, this field will be meaningless and will go away when the ring buffer becomes lockless. [ Impact: let userspace know what is happening in the ring buffers ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 13:52:02 -04:00
Steven Rostedt	f0d2c681ac	ring-buffer: add counters for commit overrun and nmi dropped entries The WARN_ON in the ring buffer when a commit is preempted and the buffer is filled by preceding writes can happen in normal operations. The WARN_ON makes it look like a bug, not to mention, because it does not stop tracing and calls printk which can also recurse, this is prone to deadlock (the WARN_ON is not in a position to recurse). This patch removes the WARN_ON and replaces it with a counter that can be retrieved by a tracer. This counter is called commit_overrun. While at it, I added a nmi_dropped counter to count any time an NMI entry is dropped because the NMI could not take the spinlock. [ Impact: prevent deadlock by printing normal case warning ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 13:51:02 -04:00
Steven Rostedt	d6ce96dabe	ring-buffer: export symbols I'm adding a module to do a series of tests on the ring buffer as well as benchmarks. This module needs to have more of the ring buffer API exported. There's nothing wrong with reading the ring buffer from a module. [ Impact: allow modules to read pages from the ring buffer ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-05 13:46:33 -04:00
Ingo Molnar	1dce8d99b8	perf_counter: convert perf_resource_mutex to a spinlock Now percpu counters can be initialized very early. But the init sequence uses mutex_lock(). Fortunately, perf_resource_mutex should be a spinlock anyway, so convert it. [ Impact: fix crash due to early init mutex use ] LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-04 19:30:42 +02:00
Ingo Molnar	0d905bca23	perf_counter: initialize the per-cpu context earlier percpu scheduling for perfcounters wants to take the context lock, but that lock first needs to be initialized. Currently it is an early_initcall() - but that is too late, the task tick runs much sooner than that. Call it explicitly from the scheduler init sequence instead. [ Impact: fix access-before-init crash ] LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-04 19:30:32 +02:00
Ingo Molnar	b82914ce33	perf_counter: round-robin per-CPU counters too This used to be unstable when we had the rq->lock dependencies, but now that they are that of the past we can turn on percpu counter RR too. [ Impact: handle counter over-commit for per-CPU counters too ] LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-04 19:29:57 +02:00
Andrea Righi	9e4a5bda89	mm: prevent divide error for small values of vm_dirty_bytes Avoid setting less than two pages for vm_dirty_bytes: this is necessary to avoid potential division by 0 (like the following) in get_dirty_limits(). [ 49.951610] divide error: 0000 [#1] PREEMPT SMP [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent [ 49.952195] CPU 1 [ 49.952195] Modules linked in: pcspkr [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1 [ 49.952195] RIP: 0010:[<ffffffff802d39a9>] [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0 [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202 [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3 [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000 [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000 [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001 [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78 [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000 [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0 [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250) [ 49.952195] Stack: [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000 [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218 [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7 [ 49.952195] Call Trace: [ 49.952195] [<ffffffff802d3ed7>] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0 [ 49.952195] [<ffffffff80368f8e>] ? ext3_writeback_write_end+0x9e/0x120 [ 49.952195] [<ffffffff802cc7df>] generic_file_buffered_write+0x12f/0x330 [ 49.952195] [<ffffffff802cce8d>] __generic_file_aio_write_nolock+0x26d/0x460 [ 49.952195] [<ffffffff802cda32>] ? generic_file_aio_write+0x52/0xd0 [ 49.952195] [<ffffffff802cda49>] generic_file_aio_write+0x69/0xd0 [ 49.952195] [<ffffffff80365fa6>] ext3_file_write+0x26/0xc0 [ 49.952195] [<ffffffff803034d1>] do_sync_write+0xf1/0x140 [ 49.952195] [<ffffffff80290d1a>] ? get_lock_stats+0x2a/0x60 [ 49.952195] [<ffffffff80280730>] ? autoremove_wake_function+0x0/0x40 [ 49.952195] [<ffffffff8030411b>] vfs_write+0xcb/0x190 [ 49.952195] [<ffffffff803042d0>] sys_write+0x50/0x90 [ 49.952195] [<ffffffff8022ff6b>] system_call_fastpath+0x16/0x1b [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 <48> f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02 [ 49.952195] RIP [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0 [ 49.952195] RSP <ffff88001de03a98> [ 50.096523] ---[ end trace 008d7aa02f244d7b ]--- Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-02 15:36:10 -07:00
Magnus Damm	c81fc2c331	clockevent: export register_device and delta2ns Export the following symbols using EXPORT_SYMBOL_GPL: - clockevent_delta2ns - clockevents_register_device This allows us to build SuperH clockevent and clocksource drivers as modules, see drivers/clocksource/sh_*.c [ Impact: allow modular build of clockevent drivers ] Signed-off-by: Magnus Damm <damm@igel.co.jp> LKML-Reference: <20090501055247.8286.64067.sendpatchset@rx1.opensource.se> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-02 11:51:07 +02:00
john stultz	7d27558c41	timekeeping: create arch_gettimeoffset infrastructure Some arches don't supply their own clocksource. This is mainly the case in architectures that get their inter-tick times by reading the counter on their interval timer. Since these timers wrap every tick, they're not really useful as clocksources. Wrapping them to act like one is possible but not very efficient. So we provide a callout these arches can implement for use with the jiffies clocksource to provide finer then tick granular time. [ Impact: ease the migration to generic time keeping ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-02 11:45:15 +02:00
Magnus Damm	a25cbd045a	clocksource: setup mult_orig in clocksource_enable() Setup clocksource mult_orig in clocksource_enable(). Clocksource drivers can save power by using keeping the device clock disabled while the clocksource is unused. In practice this means that the enable() and disable() callbacks perform clk_enable() and clk_disable(). The enable() callback may also use clk_get_rate() to get the clock rate from the clock framework. This information can then be used to calculate the shift and mult variables. Currently the mult_orig variable is setup from mult at registration time only. This is conflicting with the above case since the clock is disabled and the mult variable is not yet calculated at the time of registration. Moving the mult_orig setup code to clocksource_enable() allows us to both handle the common case with no enable() callback and the mult-changed-after-enable() case. [ Impact: allow dynamic clock source usage ] Signed-off-by: Magnus Damm <damm@igel.co.jp> LKML-Reference: <20090501054546.8193.10688.sendpatchset@rx1.opensource.se> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-02 11:45:15 +02:00
Jon Hunter	a041988876	timers: allow deferrable timers for intervals tv2-tv5 to be deferred In the current kernel implementation only kernel timers for time interval tv1 are being deferred. This patch allows any timer that is configured as deferrable to be defer regardless of time interval. This patch was previously discussed in http://marc.info/?l=linux-kernel&m=123196343531966&w=2 and was acked by Venki Pallipadi, the author of the original deferrable timer patch. Signed-off-by: Jon Hunter <jon-hunter@ti.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-02 10:36:06 +02:00
Dmitri Vorobiev	a52f5c5620	clockevents: tick_broadcast_device can become static The variable tick_broadcast_device is not used outside of the file where it is defined, so let's make it static. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-02 10:31:14 +02:00
john stultz	74a03b69d1	clockevents: prevent endless loop in tick_handle_periodic() tick_handle_periodic() can lock up hard when a one shot clock event device is used in combination with jiffies clocksource. Avoid an endless loop issue by requiring that a highres valid clocksource be installed before we call tick_periodic() in a loop when using ONESHOT mode. The result is we will only increment jiffies once per interrupt until a continuous hardware clocksource is available. Without this, we can run into a endless loop, where each cycle through the loop, jiffies is updated which increments time by tick_period or more (due to clock steering), which can cause the event programming to think the next event was before the newly incremented time and fail causing tick_periodic() to be called again and the whole process loops forever. [ Impact: prevent hard lock up ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2009-05-02 10:22:27 +02:00
Yinghai Lu	15e957d08d	x86/irq: use move_irq_desc() in create_irq_nr() move_irq_desc() will try to move irq_desc to the home node if the allocated one is not correct, in create_irq_nr(). ( This can happen on devices that are on different nodes that are using MSI, when drivers are loaded and unloaded randomly. ) v2: fix non-smp build v3: add NUMA_IRQ_DESC to eliminate #ifdefs [ Impact: improve irq descriptor locality on NUMA systems ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <49F95EAE.2050903@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-01 19:01:12 +02:00
Thomas Gleixner	d7226fb6ec	Revert "genirq: assert that irq handlers are indeed running in hardirq context" This reverts commit `044d408409`. The commit added a warning when handle_IRQ_event() is called outside of hard interrupt context. This breaks the generic tasklet based interrupt resend mechanism which is used when the hardware has no way to retrigger the interrupt. So we get a warning for a use case which is correct and worked for years. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-01 15:16:04 +02:00
Peter Zijlstra	c33a0bc4e4	perf_counter: fix race in perf_output_* When two (or more) contexts output to the same buffer, it is possible to observe half written output. Suppose we have CPU0 doing perf_counter_mmap(), CPU1 doing perf_counter_overflow(). If CPU1 does a wakeup and exposes head to user-space, then CPU2 can observe the data CPU0 is still writing. [ Impact: fix occasionally corrupted profiling records ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090501102533.007821627@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-01 13:23:43 +02:00
Oleg Nesterov	78a3d9d565	do_wait: do take security_task_wait() into account I was never able to understand what should we actually do when security_task_wait() fails, but the current code doesn't look right. If ->task_wait() returns the error, we update notask_error correctly. But then we either reap the child (despite the fact this was forbidden) or clear notask_error (and hide the securiy policy problems). This patch assumes that "stolen by ptrace" doesn't matter. If selinux denies the child we should ignore it but make sure we report -EACCESS instead of -ECHLD if there are no other eligible children. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-01 08:49:29 +10:00
Thomas Gleixner	3c56999eec	Merge branch 'core/signal' into perfcounters/core This is necessary to avoid the conflict of syscall numbers. Conflicts: arch/x86/ia32/ia32entry.S arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h Fixes up the borked syscall numbers of perfcounters versus preadv/pwritev as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-04-30 21:16:49 +02:00
Thomas Gleixner	62ab4505e3	signals: implement sys_rt_tgsigqueueinfo sys_kill has the per thread counterpart sys_tgkill. sigqueueinfo is missing a thread directed counterpart. Such an interface is important for migrating applications from other OSes which have the per thread delivery implemented. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Ulrich Drepper <drepper@redhat.com>	2009-04-30 19:24:24 +02:00
Thomas Gleixner	30b4ae8a44	signals: split do_tkill Split out the code from do_tkill to make it reusable by the follow up patch which implements sys_rt_tgsigqueueinfo Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com>	2009-04-30 19:24:23 +02:00
Darren Hart	ba9c22f2c0	futex: remove FUTEX_REQUEUE_PI (non CMP) The new requeue PI futex op codes were modeled after the existing FUTEX_REQUEUE and FUTEX_CMP_REQUEUE calls. I was unaware at the time that FUTEX_REQUEUE was only around for compatibility reasons and shouldn't be used in new code. Ulrich Drepper elaborates on this in his Futexes are Tricky paper: http://people.redhat.com/drepper/futex.pdf. The deprecated call doesn't catch changes to the futex corresponding to the destination futex which can lead to deadlock. Therefor, I feel it best to remove FUTEX_REQUEUE_PI and leave only FUTEX_CMP_REQUEUE_PI as there are not yet any existing users of the API. This patch does change the OP code value of FUTEX_CMP_REQUEUE_PI to 12 from 13. Since my test case is the only known user of this API, I felt this was the right thing to do, rather than leave a hole in the enumeration. I chose to continue using the _CMP_ modifier in the OP code to make it explicit to the user that the test is being done. Builds, boots, and ran several hundred iterations requeue_pi.c. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <49ED580E.1050502@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-04-30 11:41:35 +02:00
Andrew Morton	a511e3f968	mutex: add atomic_dec_and_mutex_lock(), fix include/linux/mutex.h:136: warning: 'mutex_lock' declared inline after being called include/linux/mutex.h:136: warning: previous declaration of 'mutex_lock' was here uninline it. [ Impact: clean up and uninline, address compiler warning ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <200904292318.n3TNIsi6028340@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-30 09:01:34 +02:00
Paul Mackerras	c5dd016cdf	perf_counter: update copyright notice This adds my name to the list of copyright holders on the core perf_counter.c, since I have contributed a significant amount of the code in there. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> LKML-Reference: <18936.59200.888049.746658@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-30 08:23:11 +02:00
H Hartley Sweeten	6e85c5ba73	kernel/posix-cpu-timers.c: fix sparse warning Sparse reports the following in kernel/posix-cpu-timers.c: warning: symbol 'firing' shadows an earlier one Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Subrata Modak <subrata@linux.vnet.ibm.com> LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE1909016C1AFE@mi8nycmail19.Mi8.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-30 08:08:31 +02:00
David Howells	3bcac0263f	SELinux: Don't flush inherited SIGKILL during execve() Don't flush inherited SIGKILL during execve() in SELinux's post cred commit hook. This isn't really a security problem: if the SIGKILL came before the credentials were changed, then we were right to receive it at the time, and should honour it; if it came after the creds were changed, then we definitely should honour it; and in any case, all that will happen is that the process will be scrapped before it ever returns to userspace. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-04-30 09:07:13 +10:00
Luis Henriques	23b94b967f	locking, rtmutex.c: Documentation cleanup Two minor updates on functions documentation: - Updated documentation for function rt_mutex_unlock(), which contained an incorrect name - Removed extra '*' from comment in function rt_mutex_destroy() [ Impact: cleanup ] Signed-off-by: Luis Henriques <henrix@sapo.pt> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20090429205451.GA23154@hades.domain.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 23:20:17 +02:00
Eric Dumazet	f5f293a4e3	sched: account system time properly Andrew Gallatin reported that IRQ and SOFTIRQ times were sometime not reported correctly on recent kernels, and even bisected to commit `457533a7d3` ([PATCH] fix scaled & unscaled cputime accounting) as the first bad commit. Further analysis pointed that commit `79741dd357` ([PATCH] idle cputime accounting) was the real cause of the problem. account_process_tick() was not taking into account timer IRQ interrupting the idle task servicing a hard or soft irq. On mostly idle cpu, irqs were thus not accounted and top or mpstat could tell user/admin that cpu was 100 % idle, 0.00 % irq, 0.00 % softirq, while it was not. [ Impact: fix occasionally incorrect CPU statistics in top/mpstat ] Reported-by: Andrew Gallatin <gallatin@myri.com> Re-reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: rick.jones2@hp.com Cc: brice@myri.com Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> LKML-Reference: <49F84BC1.7080602@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 15:02:28 +02:00
Ingo Molnar	9814451142	perf_counter: add/update copyrights Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:52:50 +02:00
Robert Richter	4aeb0b4239	perfcounters: rename struct hw_perf_counter_ops into struct pmu This patch renames struct hw_perf_counter_ops into struct pmu. It introduces a structure to describe a cpu specific pmu (performance monitoring unit). It may contain ops and data. The new name of the structure fits better, is shorter, and thus better to handle. Where it was appropriate, names of function and variable have been changed too. [ Impact: cleanup ] Signed-off-by: Robert Richter <robert.richter@amd.com> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1241002046-8832-7-git-send-email-robert.richter@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:51:03 +02:00
Ingo Molnar	e7fd5d4b3d	Merge branch 'linus' into perfcounters/core Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:47:05 +02:00
David Howells	50fa610a3b	sched: Document memory barriers implied by sleep/wake-up primitives Add a section to the memory barriers document to note the implied memory barriers of sleep primitives (set_current_state() and wrappers) and wake-up primitives (wake_up() and co.). Also extend the in-code comments on the wake_up() functions to note these implied barriers. [ Impact: add documentation ] Signed-off-by: David Howells <dhowells@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20090428140138.1192.94723.stgit@warthog.procyon.org.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:15:55 +02:00
Heiko Carstens	a0e39ed378	tracing: fix build failure on s390 "tracing: create automated trace defines" causes this compile error on s390, as reported by Sachin Sant against linux-next: kernel/built-in.o: In function `__do_softirq': (.text+0x1c680): undefined reference to `__tracepoint_softirq_entry' This happens because the definitions of the softirq tracepoints were moved from kernel/softirq.c to kernel/irq/handle.c. Since s390 doesn't support generic hardirqs handle.c doesn't get compiled and the definitions are missing. So move the tracepoints to softirq.c again. [ Impact: fix build failure on s390 ] Reported-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com LKML-Reference: <20090429135139.5fac79b8@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:06:21 +02:00
Tom Zanussi	8b37256210	tracing/filters: a better event parser Replace the current event parser hack with a better one. Filters are no longer specified predicate by predicate, but all at once and can use parens and any of the following operators: numeric fields: ==, !=, <, <=, >, >= string fields: ==, != predicates can be combined with the logical operators: &&, \|\| examples: "common_preempt_count > 4" > filter "((sig >= 10 && sig < 15) \|\| sig == 17) && comm != bash" > filter If there was an error, the erroneous string along with an error message can be seen by looking at the filter e.g.: ((sig >= 10 && sig < 15) \|\| dsig == 17) && comm != bash ^ parse_error: Field not found Currently the caret for an error always appears at the beginning of the filter; a real position should be used, but the error message should be useful even without it. To clear a filter, '0' can be written to the filter file. Filters can also be set or cleared for a complete subsystem by writing the same filter as would be written to an individual event to the filter file at the root of the subsytem. Note however, that if any event in the subsystem lacks a field specified in the filter being set, the set will fail and all filters in the subsytem are automatically cleared. This change from the previous version was made because using only the fields that happen to exist for a given event would most likely result in a meaningless filter. Because the logical operators are now implemented as predicates, the maximum number of predicates in a filter was increased from 8 to 16. [ Impact: add new, extended trace-filter implementation ] Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1240905899.6416.121.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:06:11 +02:00
Tom Zanussi	a118e4d140	tracing/filters: distinguish between signed and unsigned fields The new filter comparison ops need to be able to distinguish between signed and unsigned field types, so add an is_signed flag/param to the event field struct/trace_define_fields(). Also define a simple macro, is_signed_type() to determine the signedness at compile time, used in the trace macros. If the is_signed_type() macro won't work with a specific type, a new slightly modified version of TRACE_FIELD() called TRACE_FIELD_SIGN(), allows the signedness to be set explicitly. [ Impact: extend trace-filter code for new feature ] Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1240905893.6416.120.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:06:03 +02:00
Tom Zanussi	30e673b230	tracing/filters: move preds into event_filter object Create a new event_filter object, and move the pred-related members out of the call and subsystem objects and into the filter object - the details of the filter implementation don't need to be exposed in the call and subsystem in any case, and it will also help make the new parser implementation a little cleaner. [ Impact: refactor trace-filter code to prepare for new features ] Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1240905887.6416.119.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 14:05:54 +02:00
Steven Rostedt	7267fa6819	tracing: fix ref count in splice pages The pages allocated for the splice binary buffer did not initialize the ref count correctly. This caused pages not to be freed and causes a drastic memory leak. Thanks to logdev I was able to trace the tracer to find where the leak was. [ Impact: stop memory leak when using splice ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-29 08:02:44 +02:00
Steven Rostedt	7d7d2b8031	ring-buffer: fix printk output The warning output in trace_recursive_lock uses %d for a long when it should be %ld. [ Impact: fix compile warning ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-29 00:42:01 -04:00
Steven Rostedt	f2957f1f19	tracing: have splice only copy full pages Splice works with pages, it is much more effecient to use an entire page than to copy bits over several pages. Using logdev to trace the internals of the splice mechanism, I was able to see that splice can be very aggressive. When tracing is occurring, and the reader caught up to the writer, and the writer is on the reader page, the reader will copy what is there into the splice page. Splice may iterate over several pages and if the writer is still writing to the page, the reader will keep copying bits to new pages to pass to userspace. This patch changes it to only pass data to userspace if the page is full (the writer has left the page). This has a small side effect that splice can not read a partial page, and must wait for the page to fill. This should not be an issue. If tracing has stopped, then a use of "read" will still read all of the page. [ Impact: better performance for ring buffer splice code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-29 00:26:30 -04:00
Steven Rostedt	93459c6cb9	tracing: only add splice page if entries exist The splice code allocates a page even when the ring buffer is empty. It detects the ring buffer being empty when it it fails to copy anything from the ring buffer into the page. This patch adds a check to see if there is anything in the ring buffer before allocating a page. Thanks to logdev for letting me trace the tracer to find this. [ Impact: speed up due to removing unnecessary allocation ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-29 00:23:13 -04:00
Steven Rostedt	5beae6efd1	tracing: fix ref count in splice pages The pages allocated for the splice binary buffer did not initialize the ref count correctly. This caused pages not to be freed and causes a drastic memory leak. Thanks to logdev I was able to trace the tracer to find where the leak was. [ Impact: stop memory leak when using splice ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-29 00:16:21 -04:00
Steven Rostedt	cd891ae030	tracing: convert ftrace_dump spinlocks to raw ftrace_dump is used for printing out the contents of the ftrace ring buffer to the console on failure. Currently it uses a spinlock to synchronize the output from multiple failures on different CPUs. This spin lock currently is a normal spinlock and can cause issues with lockdep and lock tracing. This patch converts it to raw since it is for error handling only. The lock is local to the ftrace_dump and is not used by any other infrastructure. [ Impact: prevent ftrace_dump from locking up by internal tracing ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-28 11:39:34 -04:00
Yinghai Lu	85ac16d033	x86/irq: change irq_desc_alloc() to take node instead of cpu This simplifies the node awareness of the code. All our allocators only deal with a NUMA node ID locality not with CPU ids anyway - so there's no need to maintain (and transform) a CPU id all across the IRq layer. v2: keep move_irq_desc related [ Impact: cleanup, prepare IRQ code to be NUMA-aware ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@goop.org> LKML-Reference: <49F65536.2020300@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-28 12:21:17 +02:00
Yinghai Lu	57b150cce8	irq: only update affinity if ->set_affinity() is sucessfull irq_set_affinity() and move_masked_irq() try to assign affinity before calling chip set_affinity(). Some archs are assigning it in ->set_affinity() again. We do something like: cpumask_cpy(desc->affinity, mask); desc->chip->set_affinity(mask); But in the failure path, affinity should not be touched - otherwise we'll end up with a different affinity mask despite the failure to migrate the IRQ. So try to update the afffinity only if set_affinity returns with 0. Also call irq_set_thread_affinity accordingly. v2: update after "irq, x86: Remove IRQ_DISABLED check in process context IRQ move" v3: according to Ingo, change set_affinity() in irq_chip should return int. v4: update comments by removing moving irq_desc code. [ Impact: fix /proc/irq/*/smp_affinity setting corner case bug ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <49F65509.60307@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-28 12:21:16 +02:00
Yinghai Lu	fcef5911c7	x86/irq: remove leftover code from NUMA_MIGRATE_IRQ_DESC The original feature of migrating irq_desc dynamic was too fragile and was causing problems: it caused crashes on systems with lots of cards with MSI-X when user-space irq-balancer was enabled. We now have new patches that create irq_desc according to device numa node. This patch removes the leftover bits of the dynamic balancer. [ Impact: remove dead code ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <49F654AF.8000808@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-28 12:21:15 +02:00

... 3 4 5 6 7 ...

7541 Commits