// SPDX-License-Identifier: GPL-2.0
/*
 * This file defines the trace event structures that go into the ring
 * buffer directly. They are created via macros so that changes for them
 * appear in the format file. Using macros will automate this process.
 *
 * The macro used to create an ftrace data structure is:
 *
 * FTRACE_ENTRY(name, struct_name, id, structure, print)
 *
 * @name: the name used as the event name, as well as the name of
 *   the directory that holds the format file.
 *
 * @struct_name: the name of the structure that is created.
 *
 * @id: The event identifier that is used to detect what event
 *    this is from the ring buffer.
 *
 * @structure: the structure layout
 *
 *  - __field(type, item)
 *	  This is equivalent to declaring
 *		type item;
 *	  in the structure.
 *  - __array(type, item, size)
 *	  This is equivalent to declaring
 *		type item[size];
 *	  in the structure.
 *
 *   * for structures within structures, the format of the internal
 *	structure is laid out. This allows the internal structure
 *	to be deciphered for the format file. Although these macros
 *	may become out of sync with the internal structure, they
 *	will create a compile error if it happens. Since the
 *	internal structures are just tracing helpers, this is not
 *	an issue.
 *
 *	When an internal structure is used, it should use:
 *
 *	__field_struct(type, item)
 *
 *	instead of __field. This will prevent it from being shown in
 *	the output file. The fields in the structure should use:
 *
 *	__field_desc(type, container, item)
 *	__array_desc(type, container, item, len)
 *
 *	type, item and len are the same as __field and __array, but
 *	container is added. This is the name of the item in
 *	__field_struct that this is describing.
 *
 *
 * @print: the print format shown to users in the format file.
 */
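
/*
 * As a rough sketch of what the macros described above produce (the
 * event "foo", the structure "foo_entry" and the id TRACE_FOO are
 * hypothetical and only used for illustration), an entry declared as
 *
 *	FTRACE_ENTRY(foo, foo_entry,
 *		TRACE_FOO,
 *		F_STRUCT(
 *			__field(unsigned long, ip)
 *			__array(char, comm, 16)
 *		),
 *		F_printk("%ps %s", (void *)__entry->ip, __entry->comm)
 *	);
 *
 * is expected to result in a structure along the lines of
 *
 *	struct foo_entry {
 *		struct trace_entry	ent;
 *		unsigned long		ip;
 *		char			comm[16];
 *	};
 *
 * This assumes the usual definition that prepends the common
 * struct trace_entry header. The exact expansion depends on how the
 * including file defines FTRACE_ENTRY at that point.
 */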

/*
 * Function trace entry - function address and parent function address:
 */
FTRACE_ENTRY_REG(function, ftrace_entry,

	TRACE_FN,

	F_STRUCT(
		__field_fn(unsigned long, ip)
		__field_fn(unsigned long, parent_ip)
	),

	F_printk(" %ps <-- %ps ",
		 (void *)__entry->ip, (void *)__entry->parent_ip),

	perf_ftrace_event_register
);

/* Function call entry */
FTRACE_ENTRY_PACKED(funcgraph_entry, ftrace_graph_ent_entry,

	TRACE_GRAPH_ENT,

	F_STRUCT(
		__field_struct(struct ftrace_graph_ent, graph_ent)
		__field_packed(unsigned long, graph_ent, func)
		__field_packed(int, graph_ent, depth)
	),

	F_printk(" --> %ps (%d) ", (void *)__entry->func, __entry->depth)
);

/* Function return entry */
#ifdef CONFIG_FUNCTION_GRAPH_RETVAL

FTRACE_ENTRY_PACKED(funcgraph_exit, ftrace_graph_ret_entry,

	TRACE_GRAPH_RET,

	F_STRUCT(
		__field_struct(struct ftrace_graph_ret, ret)
		__field_packed(unsigned long, ret, func)
		__field_packed(unsigned long, ret, retval)
		__field_packed(int, ret, depth)
		__field_packed(unsigned int, ret, overrun)
		__field_packed(unsigned long long, ret, calltime)
		__field_packed(unsigned long long, ret, rettime)
	),

	F_printk(" <-- %ps (%d) (start: %llx end: %llx) over: %d retval: %lx ",
		 (void *)__entry->func, __entry->depth,
		 __entry->calltime, __entry->rettime,
		 __entry->overrun, __entry->retval)
);

#else

FTRACE_ENTRY_PACKED(funcgraph_exit, ftrace_graph_ret_entry,

	TRACE_GRAPH_RET,

	F_STRUCT(
		__field_struct(struct ftrace_graph_ret, ret)
		__field_packed(unsigned long, ret, func)
		__field_packed(int, ret, depth)
		__field_packed(unsigned int, ret, overrun)
		__field_packed(unsigned long long, ret, calltime)
		__field_packed(unsigned long long, ret, rettime)
	),

	F_printk(" <-- %ps (%d) (start: %llx end: %llx) over: %d ",
		 (void *)__entry->func, __entry->depth,
		 __entry->calltime, __entry->rettime,
		 __entry->overrun)
);

#endif

/*
 * Context switch trace entry - which task (and prio) we switched from/to:
 *
 * This is used for both wakeup and context switches. We only want
 * to create one structure, but we need two outputs for it.
 */
#define FTRACE_CTX_FIELDS			\
	__field(unsigned int, prev_pid)		\
	__field(unsigned int, next_pid)		\
	__field(unsigned int, next_cpu)		\
	__field(unsigned char, prev_prio)	\
	__field(unsigned char, prev_state)	\
	__field(unsigned char, next_prio)	\
	__field(unsigned char, next_state)

FTRACE_ENTRY(context_switch, ctx_switch_entry,

	TRACE_CTX,

	F_STRUCT(
		FTRACE_CTX_FIELDS
	),

	F_printk(" %u:%u:%u ==> %u:%u:%u [%03u] ",
		 __entry->prev_pid, __entry->prev_prio, __entry->prev_state,
		 __entry->next_pid, __entry->next_prio, __entry->next_state,
		 __entry->next_cpu)
);

/*
 * FTRACE_ENTRY_DUP only creates the format file; it will not
 * create another structure.
 */

FTRACE_ENTRY_DUP(wakeup, ctx_switch_entry,

	TRACE_WAKE,

	F_STRUCT(
		FTRACE_CTX_FIELDS
	),

	F_printk(" %u:%u:%u ==+ %u:%u:%u [%03u] ",
		 __entry->prev_pid, __entry->prev_prio, __entry->prev_state,
		 __entry->next_pid, __entry->next_prio, __entry->next_state,
		 __entry->next_cpu)
);

/*
 * Stack-trace entry:
 */

#define FTRACE_STACK_ENTRIES	8
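
/*
 * __stack_array(type, item, len, field) is used below instead of
 * __array(). For the format file exposed to user space it behaves like
 * __array(type, item, len), so "caller" still appears as a fixed array
 * of FTRACE_STACK_ENTRIES addresses. For the kernel structure it is
 * expected to declare a flexible array bounded by another field:
 *
 *	type item[] __counted_by(field);
 *
 * which for the kernel_stack event would be:
 *
 *	unsigned long caller[] __counted_by(size);
 *
 * The "size" field states the real number of entries, which may be
 * larger than FTRACE_STACK_ENTRIES.
 */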

FTRACE_ENTRY(kernel_stack, stack_entry,

	TRACE_STACK,

	F_STRUCT(
		__field(int, size)
		__stack_array(unsigned long, caller, FTRACE_STACK_ENTRIES, size)
	),

	F_printk(" \t => %ps \n \t => %ps \n \t => %ps \n "
		 " \t => %ps \n \t => %ps \n \t => %ps \n "
		 " \t => %ps \n \t => %ps \n ",
		 (void *)__entry->caller[0], (void *)__entry->caller[1],
		 (void *)__entry->caller[2], (void *)__entry->caller[3],
		 (void *)__entry->caller[4], (void *)__entry->caller[5],
		 (void *)__entry->caller[6], (void *)__entry->caller[7])
);

FTRACE_ENTRY(user_stack, userstack_entry,

	TRACE_USER_STACK,

	F_STRUCT(
		__field(unsigned int, tgid)
		__array(unsigned long, caller, FTRACE_STACK_ENTRIES)
	),

	F_printk(" \t => %ps \n \t => %ps \n \t => %ps \n "
		 " \t => %ps \n \t => %ps \n \t => %ps \n "
		 " \t => %ps \n \t => %ps \n ",
		 (void *)__entry->caller[0], (void *)__entry->caller[1],
		 (void *)__entry->caller[2], (void *)__entry->caller[3],
		 (void *)__entry->caller[4], (void *)__entry->caller[5],
		 (void *)__entry->caller[6], (void *)__entry->caller[7])
);

/*
 * trace_printk entry:
 */
FTRACE_ENTRY(bprint, bprint_entry,

	TRACE_BPRINT,

	F_STRUCT(
		__field(unsigned long, ip)
		__field(const char *, fmt)
		__dynamic_array(u32, buf)
	),

	F_printk(" %ps: %s ",
		 (void *)__entry->ip, __entry->fmt)
);

FTRACE_ENTRY_REG(print, print_entry,

	TRACE_PRINT,

	F_STRUCT(
		__field(unsigned long, ip)
		__dynamic_array(char, buf)
	),

	F_printk(" %ps: %s ",
		 (void *)__entry->ip, __entry->buf),

	ftrace_event_register
);

FTRACE_ENTRY(raw_data, raw_data_entry,

	TRACE_RAW_DATA,

	F_STRUCT(
		__field(unsigned int, id)
		__dynamic_array(char, buf)
	),

	F_printk(" id:%04x %08x ",
		 __entry->id, (int)__entry->buf[0])
);

FTRACE_ENTRY(bputs, bputs_entry,

	TRACE_BPUTS,

	F_STRUCT(
		__field(unsigned long, ip)
		__field(const char *, str)
	),

	F_printk(" %ps: %s ",
		 (void *)__entry->ip, __entry->str)
);

FTRACE_ENTRY(mmiotrace_rw, trace_mmiotrace_rw,

	TRACE_MMIO_RW,

	F_STRUCT(
		__field_struct(struct mmiotrace_rw, rw)
		__field_desc(resource_size_t, rw, phys)
		__field_desc(unsigned long, rw, value)
		__field_desc(unsigned long, rw, pc)
		__field_desc(int, rw, map_id)
		__field_desc(unsigned char, rw, opcode)
		__field_desc(unsigned char, rw, width)
	),

	F_printk(" %lx %lx %lx %d %x %x ",
		 (unsigned long)__entry->phys, __entry->value, __entry->pc,
		 __entry->map_id, __entry->opcode, __entry->width)
);

FTRACE_ENTRY(mmiotrace_map, trace_mmiotrace_map,

	TRACE_MMIO_MAP,

	F_STRUCT(
		__field_struct(struct mmiotrace_map, map)
		__field_desc(resource_size_t, map, phys)
		__field_desc(unsigned long, map, virt)
		__field_desc(unsigned long, map, len)
		__field_desc(int, map, map_id)
		__field_desc(unsigned char, map, opcode)
	),

	F_printk(" %lx %lx %lx %d %x ",
		 (unsigned long)__entry->phys, __entry->virt, __entry->len,
		 __entry->map_id, __entry->opcode)
);

#define TRACE_FUNC_SIZE 30
#define TRACE_FILE_SIZE 20

FTRACE_ENTRY(branch, trace_branch,

	TRACE_BRANCH,

	F_STRUCT(
		__field(unsigned int, line)
		__array(char, func, TRACE_FUNC_SIZE + 1)
		__array(char, file, TRACE_FILE_SIZE + 1)
		__field(char, correct)
		__field(char, constant)
	),

	F_printk(" %u:%s:%s (%u)%s ",
		 __entry->line,
		 __entry->func, __entry->file, __entry->correct,
		 __entry->constant ? " CONSTANT " : " ")
);

FTRACE_ENTRY(hwlat, hwlat_entry,

	TRACE_HWLAT,

	F_STRUCT(
		__field(u64, duration)
		__field(u64, outer_duration)
		__field(u64, nmi_total_ts)
		__field_struct(struct timespec64, timestamp)
		__field_desc(s64, timestamp, tv_sec)
		__field_desc(long, timestamp, tv_nsec)
		__field(unsigned int, nmi_count)
		__field(unsigned int, seqnum)
		__field(unsigned int, count)
	),

	F_printk(" cnt:%u \t ts:%010llu.%010lu \t inner:%llu \t outer:%llu \t count:%d \t nmi-ts:%llu \t nmi-count:%u \n ",
		 __entry->seqnum,
		 __entry->tv_sec,
		 __entry->tv_nsec,
		 __entry->duration,
		 __entry->outer_duration,
		 __entry->count,
		 __entry->nmi_total_ts,
		 __entry->nmi_count)
);

#define FUNC_REPEATS_GET_DELTA_TS(entry)				\
	(((u64)(entry)->top_delta_ts << 32) | (entry)->bottom_delta_ts)

FTRACE_ENTRY(func_repeats, func_repeats_entry,

	TRACE_FUNC_REPEATS,

	F_STRUCT(
		__field(unsigned long, ip)
		__field(unsigned long, parent_ip)
		__field(u16, count)
		__field(u16, top_delta_ts)
		__field(u32, bottom_delta_ts)
	),

	F_printk(" %ps <-%ps \t (repeats:%u delta: -%llu) ",
		 (void *)__entry->ip,
		 (void *)__entry->parent_ip,
		 __entry->count,
		 FUNC_REPEATS_GET_DELTA_TS(__entry))
);
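
/*
 * A worked example of FUNC_REPEATS_GET_DELTA_TS(), using made-up values
 * purely for illustration. The delta is stored as a 16-bit upper part
 * (top_delta_ts) and a 32-bit lower part (bottom_delta_ts), so it is at
 * most 48 bits wide. With
 *
 *	top_delta_ts    = 0x0001
 *	bottom_delta_ts = 0x00000002
 *
 * the macro evaluates to
 *
 *	((u64)0x0001 << 32) | 0x00000002 = 0x0000000100000002
 */
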
FTRACE_ENTRY(osnoise, osnoise_entry,

	TRACE_OSNOISE,

	F_STRUCT(
		__field(u64, noise)
		__field(u64, runtime)
		__field(u64, max_sample)
		__field(unsigned int, hw_count)
		__field(unsigned int, nmi_count)
		__field(unsigned int, irq_count)
		__field(unsigned int, softirq_count)
		__field(unsigned int, thread_count)
	),

	F_printk(" noise:%llu \t max_sample:%llu \t hw:%u \t nmi:%u \t irq:%u \t softirq:%u \t thread:%u \n ",
		 __entry->noise,
		 __entry->max_sample,
		 __entry->hw_count,
		 __entry->nmi_count,
		 __entry->irq_count,
		 __entry->softirq_count,
		 __entry->thread_count)
);

FTRACE_ENTRY(timerlat, timerlat_entry,

	TRACE_TIMERLAT,

	F_STRUCT(
		__field(unsigned int, seqnum)
		__field(int, context)
		__field(u64, timer_latency)
	),

	F_printk(" seq:%u \t context:%d \t timer_latency:%llu \n ",
		 __entry->seqnum,
		 __entry->context,
		 __entry->timer_latency)
);