.. SPDX-License-Identifier: GPL-2.0

================
Perf ring buffer
================

.. CONTENTS

    1. Introduction

    2. Ring buffer implementation
    2.1 Basic algorithm
    2.2 Ring buffer for different tracing modes
    2.2.1 Default mode
    2.2.2 Per-thread mode
    2.2.3 Per-CPU mode
    2.2.4 System wide mode
    2.3 Accessing buffer
    2.3.1 Producer-consumer model
    2.3.2 Properties of the ring buffers
    2.3.3 Writing samples into buffer
    2.3.4 Reading samples from buffer
    2.3.5 Memory synchronization

    3. The mechanism of AUX ring buffer
    3.1 The relationship between AUX and regular ring buffers
    3.2 AUX events
    3.3 Snapshot mode


1. Introduction
===============
|
The ring buffer is a fundamental mechanism for data transfer. perf uses
ring buffers to transfer event data from the kernel to user space. Another
kind of ring buffer, the so-called auxiliary (AUX) ring buffer, also
plays an important role in hardware tracing with Intel PT, Arm
CoreSight, etc.

Implementing the ring buffer is critical, but it is also challenging
work. On the one hand, the kernel and the perf tool in user space use
the ring buffer to exchange data and store it into the data file, so the
ring buffer needs to transfer data with high throughput; on the other
hand, ring buffer management should avoid significant overhead that
would distort the profiling results.

This document dives into the details of the perf ring buffer in two
parts: the first part explains the perf ring buffer implementation, and
the second part discusses the AUX ring buffer mechanism.
|
2. Ring buffer implementation
=============================

2.1 Basic algorithm
-------------------

A typical ring buffer is managed by a head pointer and a tail pointer;
the head pointer is manipulated by the writer and the tail pointer is
updated by the reader.

::

        +---------------------------+
        |   |   |***|***|***|   |   |
        +---------------------------+
                `-> Tail    `-> Head

        * : the data is filled by the writer.

                Figure 1. Ring buffer
|
Perf manages its ring buffer in the same way. The implementation keeps
two key data structures together in a set of consecutive pages: the
control structure, followed by the ring buffer itself. The page that
holds the control structure is known as the "user page". Because the
pages are contiguous in virtual address space, locating the ring buffer
is simple: it starts in the page immediately after the user page.

The control structure is named ``perf_event_mmap_page``; it contains a
head pointer ``data_head`` and a tail pointer ``data_tail``. When the
kernel starts to fill records into the ring buffer, it updates the head
pointer to reserve the memory so that it can later safely store events
into the buffer. On the other side, when the user page is mapped
writable, the perf tool is permitted to update the tail pointer after
consuming data from the ring buffer. The other case, a read-only mapping
of the user page, is addressed in the section
:ref:`writing_samples_into_buffer`.

::

          user page                        ring buffer
    +---------+---------+   +---------------------------------------+
    |data_head|data_tail|...|   |   |***|***|***|***|***|   |   |   |
    +---------+---------+   +---------------------------------------+
        `          `----------------^                   ^
         `----------------------------------------------|

              * : the data is filled by the writer.

                        Figure 2. Perf ring buffer
|
When using the ``perf record`` tool, the ring buffer size can be
specified with the option ``-m`` or ``--mmap-pages=``; the given size is
rounded up to a power of two that is a multiple of the page size.
Although the kernel allocates all the memory pages at once, mapping the
pages into the VMA is deferred until the perf tool accesses the buffer
from user space. In other words, the first time the perf tool accesses
a page of the buffer from user space, a page fault (a data abort
exception) is taken and the kernel uses this occasion to map the page
into the process VMA (see ``perf_mmap_fault()``), so the perf tool can
continue to access the page after returning from the exception.
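
As a rough illustration of the layout described in this section, the
sketch below rounds a requested number of data pages up to a power of
two and derives where the ring buffer starts within a mapping that
begins with the user page. The names ``struct rb_layout``,
``round_up_pow2()`` and ``rb_layout_init()`` are hypothetical and exist
only for this example; they are not perf tool or kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapping: user page + data pages. */
struct rb_layout {
	size_t page_size;	/* size of one page in bytes */
	size_t data_pages;	/* number of data pages, a power of two */
	size_t mmap_len;	/* total length: (1 + data_pages) pages */
	size_t data_offset;	/* ring buffer starts right after the user page */
	uint64_t mask;		/* data size - 1, used to wrap positions */
};

/* Round n up to the next power of two (n >= 1). */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Derive the mapping layout from the requested number of data pages. */
static struct rb_layout rb_layout_init(size_t requested_pages, size_t page_size)
{
	struct rb_layout l;

	l.page_size = page_size;
	l.data_pages = round_up_pow2(requested_pages);
	l.mmap_len = (1 + l.data_pages) * page_size;
	l.data_offset = page_size;
	l.mask = (uint64_t)l.data_pages * page_size - 1;
	return l;
}
```

For example, a request for 5 pages with a 4 KiB page size yields 8 data
pages and a 36 KiB mapping, with the ring buffer starting at offset 4096.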
|
2.2 Ring buffer for different tracing modes
-------------------------------------------

The perf tool profiles programs in different modes: default mode,
per-thread mode, per-CPU mode, and system wide mode. This section
describes these modes and how the ring buffer meets their requirements.
Finally, it reviews the race conditions caused by these modes.

2.2.1 Default mode
^^^^^^^^^^^^^^^^^^

Usually we execute the ``perf record`` command followed by the name of
the program to profile, as in the command below::

        perf record test_program

This command doesn't specify any options for CPU and thread modes, so
the perf tool applies the default mode to the perf event. It maps all
the CPUs in the system and the profiled program's PID onto the perf
event, and it enables inheritance mode on the event so that child tasks
inherit the events. As a result, the perf event is attributed as::

        evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
        evsel::threads::map[] = { pid }
        evsel::attr::inherit  = 1

These attributes are finally reflected in the deployment of ring
buffers. As shown below, the perf tool allocates an individual ring
buffer for each CPU, but it only enables events for the profiled program
rather than for all threads in the system. The *T1* thread represents
the thread context of 'test_program', whereas *T2* and *T3* are
unrelated threads in the system. The perf samples are collected
exclusively for the *T1* thread and stored in the ring buffer associated
with the CPU on which the *T1* thread is running.
|
::

           T1                     T2                  T1
         +----+              +-----------+          +----+
    CPU0 |xxxx|              |xxxxxxxxxxx|          |xxxx|
         +----+--------------+-----------+----------+----+-------->
           |                                           |
           v                                           v
         +-----------------------------------------------------+
         |                    Ring buffer 0                    |
         +-----------------------------------------------------+

                T1
              +-----+
    CPU1      |xxxxx|
         -----+-----+--------------------------------------------->
                 |
                 v
         +-----------------------------------------------------+
         |                    Ring buffer 1                    |
         +-----------------------------------------------------+

                                     T1             T3
                                   +----+        +-------+
    CPU2                           |xxxx|        |xxxxxxx|
         --------------------------+----+--------+-------+-------->
                                     |
                                     v
         +-----------------------------------------------------+
         |                    Ring buffer 2                    |
         +-----------------------------------------------------+

                           T1
                    +--------------+
    CPU3            |xxxxxxxxxxxxxx|
         -----------+--------------+------------------------------>
                           |
                           v
         +-----------------------------------------------------+
         |                    Ring buffer 3                    |
         +-----------------------------------------------------+

         T1: Thread 1; T2: Thread 2; T3: Thread 3
         x: Thread is in running state

                    Figure 3. Ring buffer for default mode
|
2.2.2 Per-thread mode
^^^^^^^^^^^^^^^^^^^^^

Per-thread mode is selected by specifying the option ``--per-thread``
in the perf command, e.g.

::

        perf record --per-thread test_program

The perf event doesn't map to any CPUs and is only bound to the
profiled process; thus, the perf event's attributes are::

        evsel::cpus::map[0]   = { -1 }
        evsel::threads::map[] = { pid }
        evsel::attr::inherit  = 0

In this mode, a single ring buffer is allocated for the profiled thread.
When the thread is scheduled on a CPU, the events on that CPU are
enabled; when the thread is scheduled out from the CPU, the events on
that CPU are disabled. When the thread migrates from one CPU to
another, the events are disabled on the previous CPU and enabled on the
next CPU correspondingly.
|
::

           T1                     T2                  T1
         +----+              +-----------+          +----+
    CPU0 |xxxx|              |xxxxxxxxxxx|          |xxxx|
         +----+--------------+-----------+----------+----+-------->
           |                                           |
           |    T1                                     |
           |  +-----+                                  |
    CPU1   |  |xxxxx|                                  |
         --|--+-----+----------------------------------|---------->
           |     |                                     |
           |     |                   T1           T3   |
           |     |                 +----+        +---+ |
    CPU2   |     |                 |xxxx|        |xxx| |
         --|-----|-----------------+----+--------+---+-|---------->
           |     |                   |                 |
           |     |         T1        |                 |
           |     |  +--------------+ |                 |
    CPU3   |     |  |xxxxxxxxxxxxxx| |                 |
         --|-----|--+--------------+-|-----------------|---------->
           |     |         |         |                 |
           v     v         v         v                 v
         +-----------------------------------------------------+
         |                     Ring buffer                     |
         +-----------------------------------------------------+

         T1: Thread 1
         x: Thread is in running state

                    Figure 4. Ring buffer for per-thread mode
|
When perf runs in per-thread mode, a ring buffer is allocated for the
profiled thread *T1*. The ring buffer is dedicated to thread *T1*: while
*T1* is running, the perf events are recorded into the ring buffer; when
the thread is sleeping, all associated events are disabled, so no trace
data is recorded into the ring buffer.
|
2.2.3 Per-CPU mode
^^^^^^^^^^^^^^^^^^

The option ``-C`` is used to collect samples on a list of CPUs. For
example, the perf command below receives the option ``-C 0,2``::

        perf record -C 0,2 test_program

It maps the perf event to CPUs 0 and 2, and the event is not associated
with any PID. Thus the perf event attributes are set as::

        evsel::cpus::map[0]   = { 0, 2 }
        evsel::threads::map[] = { -1 }
        evsel::attr::inherit  = 0

As a result, the ``perf record`` session samples all threads on CPU0 and
CPU2, and it terminates when test_program exits. Even if there are
tasks running on CPU1 and CPU3, any activity on these two CPUs is
ignored because no ring buffer exists for them. One use case is to
combine the options for per-thread mode and per-CPU mode: when the
options ``-C 0,2`` and ``--per-thread`` are specified together, the
samples are recorded only when the profiled thread is scheduled on any
of the listed CPUs.
|
::

           T1                     T2                  T1
         +----+              +-----------+          +----+
    CPU0 |xxxx|              |xxxxxxxxxxx|          |xxxx|
         +----+--------------+-----------+----------+----+-------->
           |                       |                   |
           v                       v                   v
         +-----------------------------------------------------+
         |                    Ring buffer 0                    |
         +-----------------------------------------------------+

                T1
              +-----+
    CPU1      |xxxxx|
         -----+-----+--------------------------------------------->

                                     T1             T3
                                   +----+        +-------+
    CPU2                           |xxxx|        |xxxxxxx|
         --------------------------+----+--------+-------+-------->
                                     |               |
                                     v               v
         +-----------------------------------------------------+
         |                    Ring buffer 1                    |
         +-----------------------------------------------------+

                           T1
                    +--------------+
    CPU3            |xxxxxxxxxxxxxx|
         -----------+--------------+------------------------------>

         T1: Thread 1; T2: Thread 2; T3: Thread 3
         x: Thread is in running state

                    Figure 5. Ring buffer for per-CPU mode
|
2.2.4 System wide mode
^^^^^^^^^^^^^^^^^^^^^^

With the option ``-a`` or ``--all-cpus``, perf collects samples on all
CPUs for all tasks; this is called the system wide mode. The command
is::

        perf record -a test_program

Similar to the per-CPU mode, the perf event doesn't bind to any PID, and
it maps to all CPUs in the system::

        evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
        evsel::threads::map[] = { -1 }
        evsel::attr::inherit  = 0

In the system wide mode, every CPU has its own ring buffer; all threads
are monitored while they are running and the samples are recorded into
the ring buffer belonging to the CPU on which the events occurred.
|
::

           T1                     T2                  T1
         +----+              +-----------+          +----+
    CPU0 |xxxx|              |xxxxxxxxxxx|          |xxxx|
         +----+--------------+-----------+----------+----+-------->
           |                       |                   |
           v                       v                   v
         +-----------------------------------------------------+
         |                    Ring buffer 0                    |
         +-----------------------------------------------------+

                T1
              +-----+
    CPU1      |xxxxx|
         -----+-----+--------------------------------------------->
                 |
                 v
         +-----------------------------------------------------+
         |                    Ring buffer 1                    |
         +-----------------------------------------------------+

                                     T1             T3
                                   +----+        +-------+
    CPU2                           |xxxx|        |xxxxxxx|
         --------------------------+----+--------+-------+-------->
                                     |               |
                                     v               v
         +-----------------------------------------------------+
         |                    Ring buffer 2                    |
         +-----------------------------------------------------+

                           T1
                    +--------------+
    CPU3            |xxxxxxxxxxxxxx|
         -----------+--------------+------------------------------>
                           |
                           v
         +-----------------------------------------------------+
         |                    Ring buffer 3                    |
         +-----------------------------------------------------+

         T1: Thread 1; T2: Thread 2; T3: Thread 3
         x: Thread is in running state

                    Figure 6. Ring buffer for system wide mode
|
2.3 Accessing buffer
--------------------

Based on the understanding of how the ring buffer is allocated in the
various modes, this section explains how the ring buffer is accessed.

2.3.1 Producer-consumer model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the Linux kernel, the PMU events can produce samples which are stored
into the ring buffer; the perf command in user space consumes the
samples by reading out data from the ring buffer and finally saves the
data into a file for post analysis. It's a typical producer-consumer
model for using the ring buffer.

The perf process polls on the PMU events and sleeps when no events are
incoming. To prevent frequent exchanges between the kernel and user
space, the kernel event core layer introduces a watermark, which is
stored in ``perf_buffer::watermark``. When a sample is recorded into
the ring buffer and the used buffer exceeds the watermark, the kernel
wakes up the perf process to read samples from the ring buffer.
|
::

                      Perf
                      / | Read samples
            Polling  /  `--------------|              Ring buffer
                    v                  v       ;---------------------v
    +----------------+     +---------+---------+   +-------------------+
    |Event wait queue|     |data_head|data_tail|   |***|***|   |   |***|
    +----------------+     +---------+---------+   +-------------------+
             ^                    ^ `------------------------^
             | Wake up tasks      | Store samples
         +-----------------------------+
         |   Kernel event core layer   |
         +-----------------------------+

              * : the data is filled by the writer.

                  Figure 7. Writing and reading the ring buffer
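
The watermark check described before Figure 7 can be pictured with the
following simplified model. It is only a sketch: ``struct rb_model``,
``rb_used()`` and ``rb_should_wakeup()`` are hypothetical names invented
for this example, and the kernel's actual bookkeeping is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the producer-side state. */
struct rb_model {
	uint64_t head;		/* total bytes written by the producer */
	uint64_t tail;		/* total bytes consumed by the reader */
	uint64_t watermark;	/* wake the reader above this many bytes */
};

/* Bytes stored in the buffer that the reader has not consumed yet. */
static uint64_t rb_used(const struct rb_model *rb)
{
	return rb->head - rb->tail;
}

/*
 * Record 'size' bytes and report whether the reader should be woken up,
 * i.e. whether the used space now exceeds the watermark.
 */
static bool rb_should_wakeup(struct rb_model *rb, uint64_t size)
{
	rb->head += size;
	return rb_used(rb) > rb->watermark;
}
```

With a watermark of 100 bytes, a first 60-byte record leaves the reader
asleep, while a second one pushes the used space to 120 bytes and
triggers the wakeup.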
|
When the kernel event core layer notifies user space, multiple events
might share the same ring buffer for recording samples, so the core
layer iterates over every event associated with the ring buffer and
wakes up the tasks waiting on the event. This is done by the kernel
function ``ring_buffer_wakeup()``.

After the perf process is woken up, it checks the ring buffers one by
one; if it finds a ring buffer containing samples, it reads out the
samples for statistics or for saving into the data file. Given that the
perf process is able to run on any CPU, the ring buffer can potentially
be accessed from multiple CPUs simultaneously, which causes race
conditions. The handling of these race conditions is described in the
section :ref:`memory_synchronization`.
|
2.3.2 Properties of the ring buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Linux kernel supports two write directions for the ring buffer:
forward and backward. Forward writing saves samples from the beginning
of the ring buffer; backward writing stores data from the end of the
ring buffer in the reverse direction. The perf tool determines the
writing direction.

Additionally, the tool can map buffers to user space in either
read-write mode or read-only mode.

The ring buffer in the read-write mode is mapped with the property
``PROT_READ | PROT_WRITE``. With the write permission, the perf tool
updates ``data_tail`` to indicate the data start position. Combined
with the head pointer ``data_head``, which works as the end position of
the current data, the perf tool can easily know where to read the data
from.

Alternatively, in the read-only mode, only the kernel keeps updating
``data_head`` while user space cannot update ``data_tail`` due to the
mapping property ``PROT_READ``.

The matrix below illustrates the combinations of direction and mapping
characteristics. The perf tool employs two of these combinations to
support two buffer types: the non-overwrite buffer and the overwritable
buffer.
|
.. list-table::
   :widths: 1 1 1
   :header-rows: 1

   * - Mapping mode
     - Forward
     - Backward
   * - read-write
     - Non-overwrite ring buffer
     - Not used
   * - read-only
     - Not used
     - Overwritable ring buffer
|
The non-overwrite ring buffer uses the read-write mapping with forward
writing. It saves data from the beginning of the ring buffer and wraps
around when it reaches the end; this is the read-write mode used by the
normal ring buffer. When the consumer doesn't keep up with the
producer, some data is lost; the kernel counts how many records it has
lost and generates a ``PERF_RECORD_LOST`` record the next time it finds
space in the ring buffer.

The overwritable ring buffer uses backward writing with the read-only
mode. It saves the data from the end of the ring buffer and
``data_head`` keeps the position of the current data, so the perf tool
always knows where to start reading, and it reads until the end of the
ring buffer; thus it doesn't need ``data_tail``. In this mode, the
kernel does not generate ``PERF_RECORD_LOST`` records.
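
The behaviour of the non-overwrite buffer can be modelled as below.
This is a minimal sketch under simplifying assumptions, not the kernel
implementation: ``struct nonoverwrite_rb`` and ``rb_reserve()`` are
hypothetical names, positions are free-running byte counters, and a
rejected record is merely counted rather than reported through a
``PERF_RECORD_LOST`` record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a non-overwrite (forward, read-write) buffer. */
struct nonoverwrite_rb {
	uint64_t size;	/* data area size in bytes, a power of two */
	uint64_t head;	/* free-running write position (producer) */
	uint64_t tail;	/* free-running read position (consumer) */
	uint64_t lost;	/* records dropped because the buffer was full */
};

/*
 * Try to reserve 'len' bytes for a record. On success the head moves
 * forward and the start offset inside the data area is returned through
 * 'offset'. If the unread data leaves too little room, the record is
 * dropped and only the lost counter is incremented.
 */
static bool rb_reserve(struct nonoverwrite_rb *rb, uint64_t len,
		       uint64_t *offset)
{
	uint64_t used = rb->head - rb->tail;

	if (len > rb->size - used) {
		rb->lost++;
		return false;
	}

	*offset = rb->head & (rb->size - 1);	/* wrap with the mask */
	rb->head += len;
	return true;
}
```

Once the consumer advances ``tail``, the same call succeeds again, which
corresponds to the point where the kernel "finds space" in the text
above.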
|
.. _writing_samples_into_buffer:

2.3.3 Writing samples into buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When a sample is taken and saved into the ring buffer, the kernel
prepares the sample fields based on the sample type; then it prepares
the information for writing to the ring buffer, which is stored in the
structure ``perf_output_handle``. In the end, the kernel outputs the
sample into the ring buffer and updates the head pointer in the user
page so the perf tool can see the latest value.

The structure ``perf_output_handle`` serves as a temporary context for
tracking the information related to the buffer. Its advantage is that
it enables concurrent writing to the buffer by different events. For
example, when a software event and a hardware PMU event are both
enabled for profiling, two instances of ``perf_output_handle`` serve as
separate contexts for the software event and the hardware event
respectively. This allows each event to reserve its own memory space
for populating the record data.
|
2.3.4 Reading samples from buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In user space, the perf tool uses the ``perf_event_mmap_page``
structure to handle the head and tail of the buffer. It also uses the
``perf_mmap`` structure to keep track of a context for the ring buffer;
this context includes information about the buffer's starting and
ending addresses. Additionally, the mask value can be used to compute
the position within the circular buffer even after the pointers have
wrapped around.

Similar to the kernel, the perf tool in user space first reads out the
recorded data from the ring buffer, and then updates the buffer's tail
pointer ``perf_event_mmap_page::data_tail``.
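
A reader built on these ideas might look like the sketch below. It
assumes free-running head and tail positions and a power-of-two buffer
size, so that ``pos & mask`` yields the offset inside the data area;
``struct reader_ctx`` and ``rb_read()`` are hypothetical names for this
example and do not mirror the perf tool's ``perf_mmap`` code. Memory
barriers are deliberately left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader context; not the perf tool's struct perf_mmap. */
struct reader_ctx {
	const unsigned char *base;	/* start of the data area */
	uint64_t mask;			/* data area size - 1 */
	uint64_t head;			/* snapshot of data_head */
	uint64_t tail;			/* current data_tail */
};

/*
 * Copy up to 'len' unread bytes into 'dst', handling the wrap-around at
 * the end of the data area, then advance the tail. Returns the number
 * of bytes copied.
 */
static uint64_t rb_read(struct reader_ctx *ctx, unsigned char *dst,
			uint64_t len)
{
	uint64_t avail = ctx->head - ctx->tail;
	uint64_t size = ctx->mask + 1;
	uint64_t off = ctx->tail & ctx->mask;
	uint64_t first;

	if (len > avail)
		len = avail;

	/* The first chunk runs up to the end of the data area at most. */
	first = len < size - off ? len : size - off;
	memcpy(dst, ctx->base + off, first);
	/* The remainder, if any, continues from the beginning. */
	memcpy(dst + first, ctx->base, len - first);

	ctx->tail += len;	/* tell the producer the space is free */
	return len;
}
```

The two ``memcpy()`` calls reflect the circular layout: data that
crosses the end of the data area is copied in two pieces, the second one
starting again at the base address.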
|
.. _memory_synchronization:

2.3.5 Memory synchronization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Modern CPUs with a relaxed memory model cannot guarantee memory
ordering, which means the ring buffer and the ``perf_event_mmap_page``
structure may be accessed out of order. To ensure a specific sequence
of memory accesses to the perf ring buffer, memory barriers are used to
enforce the data dependency. The rationale for the memory
synchronization is as below::

  Kernel                          User space

  if (LOAD ->data_tail) {         LOAD ->data_head
                   (A)            smp_rmb()        (C)
    STORE $data                   LOAD $data
    smp_wmb()      (B)            smp_mb()         (D)
    STORE ->data_head             STORE ->data_tail
  }
|
The comments in tools/include/linux/ring_buffer.h give a nice
description of why and how to use memory barriers; here we just provide
an alternative explanation:

(A) is a control dependency so that the CPU ensures the order between
checking the pointer ``perf_event_mmap_page::data_tail`` and filling a
sample into the ring buffer;

(D) pairs with (A). (D) separates reading the ring buffer data from
writing the pointer ``data_tail``: the perf tool first consumes samples
and then tells the kernel that the data chunk has been released. Since
a read operation is followed by a write operation, (D) is a full memory
barrier.

(B) is a write barrier between two write operations, which makes sure
that recording a sample happens before updating the head pointer.

(C) pairs with (B). (C) is a read memory barrier that ensures the head
pointer is fetched before reading samples.

To implement the above algorithm, the ``perf_output_put_handle()``
function in the kernel and the two helpers ``ring_buffer_read_head()``
and ``ring_buffer_write_tail()`` in user space are introduced; they rely
on memory barriers as described above to ensure the data dependency.

Some architectures support one-way permeable barriers with load-acquire
and store-release operations. These barriers are more relaxed and carry
a smaller performance penalty, so (C) and (D) can be optimized to use
``smp_load_acquire()`` and ``smp_store_release()`` respectively.

If an architecture doesn't support load-acquire and store-release in its
memory model, it falls back to the old-fashioned memory barrier
operations. In this case, ``smp_load_acquire()`` encapsulates
``READ_ONCE()`` + ``smp_mb()``; since ``smp_mb()`` is costly,
``ring_buffer_read_head()`` doesn't invoke ``smp_load_acquire()`` and
uses ``READ_ONCE()`` + ``smp_rmb()`` instead.
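
For readers more familiar with C11 atomics than with kernel barriers,
the user-space side of the scheme can be approximated as below. This is
an analogy under stated assumptions rather than the perf tool's code:
``struct ctrl_page``, ``read_head()`` and ``write_tail()`` are
hypothetical stand-ins for the user page and for the helpers named
above, and the acquire/release operations play the roles of (C) and (D).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the head/tail fields of the user page. */
struct ctrl_page {
	_Atomic uint64_t data_head;	/* written by the producer */
	_Atomic uint64_t data_tail;	/* written by the consumer */
};

/*
 * (C): load the head with acquire semantics so that the sample data is
 * read only after the head value has been observed.
 */
static uint64_t read_head(struct ctrl_page *pg)
{
	return atomic_load_explicit(&pg->data_head, memory_order_acquire);
}

/*
 * (D): store the tail with release semantics so that all reads of the
 * sample data complete before the producer can see the space as free.
 */
static void write_tail(struct ctrl_page *pg, uint64_t tail)
{
	atomic_store_explicit(&pg->data_tail, tail, memory_order_release);
}
```

A consumer would call ``read_head()``, process the records between the
old tail and the returned head, and only then call ``write_tail()``.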
|
3. The mechanism of AUX ring buffer
===================================

This chapter explains the implementation of the AUX ring buffer. The
first part discusses the connection between the AUX ring buffer and the
regular ring buffer; the second part examines how the AUX ring buffer
works together with the regular ring buffer, as well as the additional
features that the AUX ring buffer introduces for the sampling mechanism.

3.1 The relationship between AUX and regular ring buffers
---------------------------------------------------------

Generally, the AUX ring buffer is an auxiliary for the regular ring
buffer. The regular ring buffer is primarily used to store the event
samples, and every event format complies with the definition in the
union ``perf_event``; the AUX ring buffer is for recording the hardware
trace data, and the trace data format is hardware IP dependent.

The general use and advantage of the AUX ring buffer is that it is
written directly by hardware rather than by the kernel. For example,
regular profile samples that are written to the regular ring buffer
cause an interrupt. Tracing execution requires a high number of samples
and using interrupts would be overwhelming for the regular ring buffer
mechanism. Having an AUX buffer provides a region of memory that is
more decoupled from the kernel and is written to directly by the
hardware tracer.

The AUX ring buffer reuses the same buffer management algorithm as the
regular ring buffer. The control structure ``perf_event_mmap_page`` is
extended with the new fields ``aux_head`` and ``aux_tail`` for the head
and tail pointers of the AUX ring buffer.

During the initialisation phase, besides the mmap()-ed regular ring
buffer, the perf tool invokes a second syscall in the
``auxtrace_mmap__mmap()`` function to mmap the AUX buffer with a
non-zero file offset; ``rb_alloc_aux()`` in the kernel allocates the
pages correspondingly. Mapping these pages into the VMA is deferred
until the page fault is handled, which is the same lazy mechanism as for
the regular ring buffer.

AUX events and AUX trace data are two different things. Let's see an
example::

        perf record -a -e cycles -e cs_etm/@tmc_etr0/ -- sleep 2

The above command enables two events: one is the event *cycles* from the
PMU and the other is the AUX event *cs_etm* from Arm CoreSight. Both are
saved into the regular ring buffer, while CoreSight's AUX trace data is
stored in the AUX ring buffer.

As a result, the regular ring buffer and the AUX ring buffer are
allocated in pairs. In the default mode, perf allocates the regular
ring buffer and the AUX ring buffer per CPU, which is the same as in the
system wide mode; however, the default mode records samples only for
the profiled program, whereas the latter mode profiles all programs in
the system. For the per-thread mode, the perf tool allocates only one
regular ring buffer and one AUX ring buffer for the whole session. For
the per-CPU mode, perf allocates both kinds of ring buffers for the
CPUs selected with the option ``-C``.

The figure below demonstrates the buffers' layout in the system wide
mode; if there is any activity on one CPU, the AUX event samples and
the hardware trace data are recorded into the dedicated buffers for
that CPU.
|
::

           T1                     T2                  T1
         +----+              +-----------+          +----+
    CPU0 |xxxx|              |xxxxxxxxxxx|          |xxxx|
         +----+--------------+-----------+----------+----+-------->
           |                       |                   |
           v                       v                   v
         +-----------------------------------------------------+
         |                    Ring buffer 0                    |
         +-----------------------------------------------------+
           |                       |                   |
           v                       v                   v
         +-----------------------------------------------------+
         |                  AUX Ring buffer 0                  |
         +-----------------------------------------------------+

                T1
              +-----+
    CPU1      |xxxxx|
         -----+-----+--------------------------------------------->
                 |
                 v
         +-----------------------------------------------------+
         |                    Ring buffer 1                    |
         +-----------------------------------------------------+
                 |
                 v
         +-----------------------------------------------------+
         |                  AUX Ring buffer 1                  |
         +-----------------------------------------------------+

                                     T1             T3
                                   +----+        +-------+
    CPU2                           |xxxx|        |xxxxxxx|
         --------------------------+----+--------+-------+-------->
                                     |               |
                                     v               v
         +-----------------------------------------------------+
         |                    Ring buffer 2                    |
         +-----------------------------------------------------+
                                     |               |
                                     v               v
         +-----------------------------------------------------+
         |                  AUX Ring buffer 2                  |
         +-----------------------------------------------------+

                           T1
                    +--------------+
    CPU3            |xxxxxxxxxxxxxx|
         -----------+--------------+------------------------------>
                           |
                           v
         +-----------------------------------------------------+
         |                    Ring buffer 3                    |
         +-----------------------------------------------------+
                           |
                           v
         +-----------------------------------------------------+
         |                  AUX Ring buffer 3                  |
         +-----------------------------------------------------+

         T1: Thread 1; T2: Thread 2; T3: Thread 3
         x: Thread is in running state

                    Figure 8. AUX ring buffer for system wide mode
|
|||
|
3.2 AUX events
|
|||
|
--------------
|
|||
|
|
|||
|
Similar to ``perf_output_begin()`` and ``perf_output_end()``'s working for the
|
|||
|
regular ring buffer, ``perf_aux_output_begin()`` and ``perf_aux_output_end()``
|
|||
|
serve for the AUX ring buffer for processing the hardware trace data.

Once the hardware trace data is stored into the AUX ring buffer, the PMU
driver stops hardware tracing by calling the ``pmu::stop()`` callback.
Like the regular ring buffer, the AUX ring buffer needs to apply the
memory synchronization mechanism discussed in the section
:ref:`memory_synchronization`. Since the AUX ring buffer is managed by
the PMU driver, barrier (B), a write barrier which ensures that the
trace data is externally visible before the head pointer is updated,
has to be implemented in the PMU driver.

Then ``pmu::stop()`` can safely call the ``perf_aux_output_end()``
function to do two things:

- It writes a ``PERF_RECORD_AUX`` event into the regular ring buffer.
  This event carries the start address and data size of the chunk of
  hardware trace data that has been stored into the AUX ring buffer;

- Since the hardware trace driver has stored new trace data into the AUX
  ring buffer, the argument *size* indicates how many bytes have been
  consumed by the hardware tracing, so ``perf_aux_output_end()`` updates
  the head pointer ``perf_buffer::aux_head`` to reflect the latest
  buffer usage.

Finally, the PMU driver restarts hardware tracing. While tracing is
temporarily suspended, hardware trace data is lost, which introduces a
discontinuity in the decoding phase.
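
The two effects of ``perf_aux_output_end()`` described above can be
modelled with a small C sketch. The types ``aux_record`` and
``aux_buffer`` and the function ``aux_output_end_sketch()`` are
hypothetical names made up for illustration, not the kernel's
definitions. The sketch assumes a free-running head and ignores
overwrite mode, flags and memory barriers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the payload of a PERF_RECORD_AUX event. */
struct aux_record {
	uint64_t aux_offset;	/* where the new chunk of trace data starts */
	uint64_t aux_size;	/* how many bytes the chunk contains */
};

/* Hypothetical stand-in for the AUX part of the ring buffer state. */
struct aux_buffer {
	uint64_t aux_head;	/* free-running write position */
};

/*
 * Model of the bookkeeping done at the end of an AUX output session:
 * describe the chunk that was just written, then advance the head by
 * the number of bytes the hardware tracer has consumed.
 */
static struct aux_record aux_output_end_sketch(struct aux_buffer *rb,
					       uint64_t size)
{
	struct aux_record rec = {
		.aux_offset = rb->aux_head,
		.aux_size = size,
	};

	rb->aux_head += size;
	return rec;
}
```

Starting from ``aux_head == 0``, a call with *size* 4096 reports a chunk
at offset 0 and moves ``aux_head`` to 4096; a following call with *size*
512 reports a chunk at offset 4096 and moves ``aux_head`` to 4608.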

The ``PERF_RECORD_AUX`` event represents an AUX event which is handled in
the kernel, but it lacks the information needed for saving the AUX trace
data in the perf file. When the perf tool copies the trace data from the
AUX ring buffer to the perf data file, it synthesizes a
``PERF_RECORD_AUXTRACE`` event. This event is not a kernel ABI; it is
defined by the perf tool to describe which portion of data in the AUX
ring buffer is saved. Afterwards, the perf tool reads the AUX trace data
out of the perf file based on the ``PERF_RECORD_AUXTRACE`` events, and
the ``PERF_RECORD_AUX`` events are used to decode each chunk of data by
correlating it with time order.
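
Copying a chunk out of a circular buffer has to cope with the chunk
crossing the end of the buffer. The following C sketch illustrates the
idea only. ``aux_copy_chunk()`` is a hypothetical helper, not a perf
tool function, and it assumes that *offset* is a free-running position
and that *size* does not exceed the buffer length:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'size' bytes that start at the free-running position 'offset'
 * out of a circular buffer of 'buf_len' bytes into 'out'.
 * Assumes size <= buf_len.
 */
static void aux_copy_chunk(const uint8_t *buf, size_t buf_len,
			   uint64_t offset, size_t size, uint8_t *out)
{
	size_t start = (size_t)(offset % buf_len);
	size_t first = buf_len - start;	/* bytes available before the end */

	if (first > size)
		first = size;

	/* Part up to the end of the buffer. */
	memcpy(out, buf + start, first);
	/* Wrapped remainder from the beginning; zero bytes if no wrap. */
	memcpy(out + first, buf, size - first);
}
```

With an 8-byte buffer, a chunk of 4 bytes at offset 14 starts at index 6
and is assembled from the last two bytes followed by the first two bytes
of the buffer.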

3.3 Snapshot mode
-----------------

Perf supports a snapshot mode for the AUX ring buffer. In this mode,
users record AUX trace data only at the specific points in time that
they are interested in. The example below takes snapshots at 1 second
intervals with Arm CoreSight::

  perf record -e cs_etm/@tmc_etr0/u -S -a program &
  PERFPID=$!
  while true; do
      kill -USR2 $PERFPID
      sleep 1
  done

The main flow for snapshot mode is:

- Before a snapshot is taken, the AUX ring buffer runs in free run mode.
  In free run mode, perf records neither AUX events nor trace data;

- Once the perf tool receives the *USR2* signal, it triggers the callback
  function ``auxtrace_record::snapshot_start()`` to deactivate hardware
  tracing. The kernel driver then populates the AUX ring buffer with the
  hardware trace data, and the event ``PERF_RECORD_AUX`` is stored in the
  regular ring buffer;

- The perf tool then takes a snapshot:
  ``record__read_auxtrace_snapshot()`` reads the hardware trace data out
  of the AUX ring buffer and saves it into the perf data file;

- After the snapshot is finished, ``auxtrace_record::snapshot_finish()``
  restarts the PMU event for AUX tracing.

In snapshot mode, perf only accesses the head pointer
``perf_event_mmap_page::aux_head`` and doesn't touch the tail pointer
``aux_tail``. This is because the AUX ring buffer can overflow in free
run mode, which makes the tail pointer useless. Instead, the callback
``auxtrace_record::find_snapshot()`` decides whether the AUX ring buffer
has wrapped around, and then fixes up the AUX buffer's head, which is
used to calculate the trace data size.
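
The effect of such a fix-up on the size calculation can be sketched in
C. ``snapshot_fixup()`` is a hypothetical helper and not the code of any
``find_snapshot()`` implementation. It assumes that *head* is an offset
inside the buffer, that no data from an earlier snapshot remains to be
skipped, and that the caller already knows whether the buffer has
wrapped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fix up 'head' and 'old' for one snapshot of an AUX buffer of
 * 'buf_len' bytes and return the number of valid trace bytes,
 * which is head - old after the fix-up.
 */
static uint64_t snapshot_fixup(uint64_t buf_len, bool wrapped,
			       uint64_t *head, uint64_t *old)
{
	if (wrapped) {
		/*
		 * The whole buffer holds valid trace data. The oldest
		 * byte sits at the current head position, so the data
		 * spans one full buffer length starting from there.
		 */
		*old = *head;
		*head += buf_len;
	} else {
		/* Only the range [0, head) has been written so far. */
		*old = 0;
	}

	return *head - *old;
}
```

For a 4096-byte buffer with the head at offset 100, the sketch yields a
size of 100 bytes when the buffer has not wrapped, and the full 4096
bytes when it has.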

As described earlier, buffers can be deployed in per-thread mode,
per-CPU mode, or system wide mode, and a snapshot can be taken in any of
these modes. Below is an example of taking a snapshot in system wide
mode.

::

                                         Snapshot is taken
                                                 |
                                                 v
                        +------------------------+
                        |  AUX Ring buffer 0     | <- aux_head
                        +------------------------+
                                                 v
                +--------------------------------+
                |          AUX Ring buffer 1     | <- aux_head
                +--------------------------------+
                                                 v
    +--------------------------------------------+
    |                      AUX Ring buffer 2     | <- aux_head
    +--------------------------------------------+
                                                 v
         +---------------------------------------+
         |                 AUX Ring buffer 3     | <- aux_head
         +---------------------------------------+

                Figure 9. Snapshot with system wide mode