IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Currently all the event parsing fails end up
in the event_pmu rule, and display misleading
help like:
$ perf stat -e inst kill
event syntax error: 'inst'
\___ Cannot find PMU `inst'. Missing kernel support?
...
The reason is that the event_pmu is too strong
and match also single string. Changing it to
force the '/' separators to be part of the rule,
and getting the proper error now:
$ perf stat -e inst kill
event syntax error: 'inst'
\___ parser error
Run 'perf list' for a list of valid events
...
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180605121416.31645-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding the support to read rusage data once the workload is finished and
display the system/user time values:
$ perf stat --null perf bench sched pipe
...
Performance counter stats for 'perf bench sched pipe':
5.342599256 seconds time elapsed
2.544434000 seconds user
4.549691000 seconds sys
It works only in non -r mode and only for workload target.
So as of now, for workload targets, we display 3 types of timings. The
time we meassure in perf stat from enable to disable+period:
5.342599256 seconds time elapsed
The time spent in user and system lands, displayed only for workload
session/target:
2.544434000 seconds user
4.549691000 seconds sys
Those times are the very same displayed by 'time' tool. They are
returned by wait4 call via the getrusage struct interface.
Committer notes:
Had to rename some variables to avoid this on older systems such as
centos:6:
builtin-stat.c: In function 'print_footer':
builtin-stat.c:1831: warning: declaration of 'stime' shadows a global declaration
/usr/include/time.h:297: warning: shadowed declaration is here
Committer testing:
# perf stat --null time perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 5.526 [sec]
5.526534 usecs/op
180945 ops/sec
1.00user 6.25system 0:05.52elapsed 131%CPU (0avgtext+0avgdata 8056maxresident)k
0inputs+0outputs (0major+606minor)pagefaults 0swaps
Performance counter stats for 'time perf bench sched pipe':
5.530978744 seconds time elapsed
1.004037000 seconds user
6.259937000 seconds sys
#
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180605121313.31337-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Enable complex event names containing [.:=,] symbols to be encoded into Perf
trace using name= modifier e.g. like this:
perf record -e cpu/name=\'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM\',\
period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
Below is how it looks like in the report output. Please note explicit escaped
quoting at cmdline string in the header so that thestring can be directly reused
for another collection in shell:
perf report --header
# ========
...
# cmdline : /root/abudanko/kernel/tip/tools/perf/perf record -v -e cpu/name=\'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM\',period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
# event : name = OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM, , type = 4, size = 112, config = 0x100003c, { sample_period, sample_freq } = 3500000, sample_type = IP|TID|TIME, disabled = 1, inh
...
# ========
#
#
# Total Lost Samples: 0
#
# Samples: 24K of event 'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM'
# Event count (approx.): 86492000000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ..............................................
#
14.75% futex [kernel.vmlinux] [k] __entry_trampoline_start
...
perf stat -e cpu/name=\'CPU_CLK_UNHALTED.THREAD:cmask=0x1\',period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
10000000 process context switches in 16678890291ns (1667.9ns/ctxsw)
Performance counter stats for './futex':
88,095,770,571 CPU_CLK_UNHALTED.THREAD:cmask=0x1
16.679542407 seconds time elapsed
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/c194b060-761d-0d50-3b21-bb4ed680002d@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fix __kmod_path__parse() so that perf tools does not treat vdso32 and
vdsox32 as kernel modules and fail to find the object.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: stable@vger.kernel.org
Fixes: 1f121b03d058 ("perf tools: Deal with kernel module names in '[]' correctly")
Link: http://lkml.kernel.org/r/1528117014-30032-3-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add tests for vdso32 and vdsox32. This will cause the overall test to
fail because __kmod_path__parse() does not handle vdso32 or vdsox32.
Fixes: 1f121b03d058 ("perf tools: Deal with kernel module names in '[]' correctly")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1528117014-30032-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So far if we use 'perf record -g' this will make
symbol_conf.use_callchain 'true' and logic will assume that all events
have callchains enabled, but ever since we added the possibility of
setting up callchains for some events (e.g.: -e
cycles/call-graph=dwarf/) while not for others, we limit usage scenarios
by looking at that symbol_conf.use_callchain global boolean, we better
look at each event attributes.
On the road to that we need to look if a hist_entry has callchains, that
is, to go from hist_entry->hists to the evsel that contains it, to then
look at evsel->sample_type for PERF_SAMPLE_CALLCHAIN.
The next step is to add a symbol_conf.ignore_callchains global, to use
in the places where what we really want to know is if callchains should
be ignored, even if present.
Then -g will mean just to select a callchain mode to be applied to all
events not explicitely setting some other callchain mode, i.e. a default
callchain mode, and --no-call-graph will set
symbol_conf.ignore_callchains with that clear intention.
That too will at some point become a per evsel thing, that tools can set
for all or just a few of its evsels.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-0sas5cm4dsw2obn75g7ruz69@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We'll use this helper more frequently when reworking
symbol_conf.use_callchain logic, where knowing if a hist_entry has
callchains is the important bit, so make going from hist_entry to hists
to evsel easier, compact.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-p6gioxkzpkpz71dtt4wcs36o@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbFzauAAoJEAx081l5xIa+jzwP/AwfTreH3XBmlLeMO8kwkAMW
AiH1u3KHrlitN1U85x90dC8hVuuV2Kv9pEXQo1me/TAL/QCb9VZYCSf2dHHkkMwD
1UdwxAQW7soBbTRo9k0zuVJpRGSYIiYfqDh3L6KmTC3UF0tUJm53VOWCLG/DNrtn
XhEnGnlCUNbABZMMpavEvlwxPtvaYFlp6M+MmwRPIx32SrPJ3vSaBzxwWxOaFmjU
xiRdK+GpAbCV8yGeCCSkDgEe/TdWGUhZxoC9dLb0H9ex7ip8uZ0W4D+VTHPFrhQX
6nCpqUbp7BQTsbVSd1pAVsAv45scmSgWbKcqfC0NKSVcsHcJZBR0tQOF9OvnGZcf
o1Hv/beTqJ++IcG2rEIwyJTGxAGfZ0YSb0evTC9VcszaYo+b3+G283bdztIjzDeS
0QCTeLHYbZRHPITWVULNpMWy3TkJv32IdFhQfYSnD8/OGQIxLNhh4FFOtHnOmxSF
N8dnzOLKXhXMo/NgOL+UMNnbgLqIyOtEXCPDLuOQJNv/SOp8662m/A0yRjQNR6M2
gsPmR7dxQIwwJMyqrkLDOF411ABZohulquYgwLgG938MRPmTpPWOR72PtpGF4hAW
HLg+3HHBd1N/A1mlJUMAbUn2eMUACZBUIycE9u+U/geRgve/OQnzJH/FKGP2EJ4R
pf6CruEva+6GRR5GVzuM
=twst
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"This starts to support NVIDIA volta hardware with nouveau, and adds
amdgpu support for the GPU in the Kabylake-G (the intel + radeon
single package chip), along with some initial Intel icelake enabling.
Summary:
New Drivers:
- v3d - driver for broadcom V3D V3.x+ hardware
- xen-front - XEN PV display frontend
core:
- handle zpos normalization in the core
- stop looking at legacy pointers in atomic paths
- improved scheduler documentation
- improved aspect ratio validation
- aspect ratio support for 64:27 and 256:135
- drop unused control node code.
i915:
- Icelake (ICL) enabling
- GuC/HuC refactoring
- PSR/PSR2 enabling and fixes
- DPLL management refactoring
- DP MST fixes
- NV12 enabling
- HDCP improvements
- GEM/Execlist/reset improvements
- GVT improvements
- stolen memory first 4k fix
amdgpu:
- Vega 20 support
- VEGAM support (Kabylake-G)
- preOS scanout buffer reservation
- power management gfxoff support for raven
- SR-IOV fixes
- Vega10 power profiles and clock voltage control
- scatter/gather display support on CZ/ST
amdkfd:
- GFX9 dGPU support
- userptr memory mapping
nouveau:
- major refactoring for Volta GV100 support
tda998x:
- HDMI i2c CEC support
etnaviv:
- removed unused logging code
- license text cleanups
- MMU handling improvements
- timeout fence fix for 50 days uptime
tegra:
- IOMMU support in gr2d/gr3d drivers
- zpos support
vc4:
- syncobj support
- CTM, plane alpha and async cursor support
analogix_dp:
- HPD and aux chan fixes
sun4i:
- MIPI DSI support
tilcdc:
- clock divider fixes for OMAP-l138 LCDK board
rcar-du:
- R8A77965 support
- dma-buf fences fixes
- hardware indexed crtc/du group handling
- generic zplane property support
atmel-hclcdc:
- generic zplane property support
mediatek:
- use generic video mode function
exynos:
- S5PV210 FIMD variant support
- IPP v2 framework
- more HW overlays support"
* tag 'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm: (1286 commits)
drm/amdgpu: fix 32-bit build warning
drm/exynos: fimc: signedness bug in fimc_setup_clocks()
drm/exynos: scaler: fix static checker warning
drm/amdgpu: Use dev_info() to report amdkfd is not supported for this ASIC
drm/amd/display: Remove use of division operator for long longs
drm/amdgpu: Update GFX info structure to match what vega20 used
drm/amdgpu/pp: remove duplicate assignment
drm/sched: add rcu_barrier after entity fini
drm/amdgpu: move VM BOs on LRU again
drm/amdgpu: consistenly use VM moved flag
drm/amdgpu: kmap PDs/PTs in amdgpu_vm_update_directories
drm/amdgpu: further optimize amdgpu_vm_handle_moved
drm/amdgpu: cleanup amdgpu_vm_validate_pt_bos v2
drm/amdgpu: rework VM state machine lock handling v2
drm/amdgpu: Add runtime VCN PG support
drm/amdgpu: Enable VCN static PG by default on RV
drm/amdgpu: Add VCN static PG support on RV
drm/amdgpu: Enable VCN CG by default on RV
drm/amdgpu: Add static CG control for VCN on RV
drm/exynos: Fix default value for zpos plane property
...
If flush requests are being sent to the device we need to inherit the
failfast and driver-specific flags, too, otherwise I/O will fail.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extend the debugability of the vector management by adding the state bits
to the debugfs output.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.908136099@linutronix.de
The case that interrupt affinity setting fails with -EBUSY can be handled
in the kernel completely by using the already available generic pending
infrastructure.
If a irq_chip::set_affinity() fails with -EBUSY, handle it like the
interrupts for which irq_chip::set_affinity() can only be invoked from
interrupt context. Copy the new affinity mask to irq_desc::pending_mask and
set the affinity pending bit. The next raised interrupt for the affected
irq will check the pending bit and try to set the new affinity from the
handler. This avoids that -EBUSY is returned when an affinity change is
requested from user space and the previous change has not been cleaned
up. The new affinity will take effect when the next interrupt is raised
from the device.
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.819273597@linutronix.de
To address the EBUSY fail of interrupt affinity settings in case that the
previous setting has not been cleaned up yet, use the new apic_ack_irq()
function instead of the special uv_ack_apic() implementation which is
merily a wrapper around ack_APIC_irq().
Preparatory change for the real fix
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.721691398@linutronix.de
To address the EBUSY fail of interrupt affinity settings in case that the
previous setting has not been cleaned up yet, use the new apic_ack_irq()
function instead of directly invoking ack_APIC_irq().
Preparatory change for the real fix
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.639011135@linutronix.de
To address the EBUSY fail of interrupt affinity settings in case that the
previous setting has not been cleaned up yet, use the new apic_ack_irq()
function instead of the special ir_ack_apic_edge() implementation which is
merily a wrapper around ack_APIC_irq().
Preparatory change for the real fix
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.555716895@linutronix.de
apic_ack_edge() is explicitely for handling interrupt affinity cleanup when
interrupt remapping is not available or disable.
Remapped interrupts and also some of the platform specific special
interrupts, e.g. UV, invoke ack_APIC_irq() directly.
To address the issue of failing an affinity update with -EBUSY the delayed
affinity mechanism can be reused, but ack_APIC_irq() does not handle
that. Adding this to ack_APIC_irq() is not possible, because that function
is also used for exceptions and directly handled interrupts like IPIs.
Create a new function, which just contains the conditional invocation of
irq_move_irq() and the final ack_APIC_irq().
Reuse the new function in apic_ack_edge().
Preparatory change for the real fix.
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de
The upcoming fix for the -EBUSY return from affinity settings requires to
use the irq_move_irq() functionality even on irq remapped interrupts. To
avoid the out of line call, move the check for the pending bit into an
inline helper.
Preparatory change for the real fix. No functional change.
Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de
The generic pending interrupt mechanism moves interrupts from the interrupt
handler on the original target CPU to the new destination CPU. This is
required for x86 and ia64 due to the way the interrupt delivery and
acknowledge works if the interrupts are not remapped.
However that update can fail for various reasons. Some of them are valid
reasons to discard the pending update, but the case, when the previous move
has not been fully cleaned up is not a legit reason to fail.
Check the return value of irq_do_set_affinity() for -EBUSY, which indicates
a pending cleanup, and rearm the pending move in the irq dexcriptor so it's
tried again when the next interrupt arrives.
Fixes: 996c591227d9 ("x86/irq: Plug vector cleanup race")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.386544292@linutronix.de
Several people observed the WARN_ON() in irq_matrix_free() which triggers
when the caller tries to free an vector which is not in the allocation
range. Song provided the trace information which allowed to decode the root
cause.
The rework of the vector allocation mechanism failed to preserve a sanity
check, which prevents setting a new target vector/CPU when the previous
affinity change has not fully completed.
As a result a half finished affinity change can be overwritten, which can
cause the leak of a irq descriptor pointer on the previous target CPU and
double enqueue of the hlist head into the cleanup lists of two or more
CPUs. After one CPU cleaned up its vector the next CPU will invoke the
cleanup handler with vector 0, which triggers the out of range warning in
the matrix allocator.
Prevent this by checking the apic_data of the interrupt whether the
move_in_progress flag is false and the hlist node is not hashed. Return
-EBUSY if not.
This prevents the damage and restores the behaviour before the vector
allocation rework, but due to other changes in that area it also widens the
chance that user space can observe -EBUSY. In theory this should be fine,
but actually not all user space tools handle -EBUSY correctly. Addressing
that is not part of this fix, but will be addressed in follow up patches.
Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment")
Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180604162224.303870257@linutronix.de
In commit 48398b6e7065 ("enic: set UDP rss flag") driver needed to set a
single bit to enable UDP rss. This is changed to two bit. One for UDP
IPv4 and other bit for UDP IPv6. The hardware which supports this is not
released yet. When released, driver should set 2 bit to enable UDP rss for
both IPv4 and IPv6.
Also add spinlock around vnic_dev_capable_rss_hash_type().
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kbuild test robot reported the following issue:
kernel/time/posix-stubs.o: warning: objtool: sys_ni_posix_timers.cold.1()+0x0: unreachable instruction
This file creates symbol aliases for the sys_ni_posix_timers() function.
So there are multiple ELF function symbols for the same function:
23: 0000000000000150 26 FUNC GLOBAL DEFAULT 1 __x64_sys_timer_create
24: 0000000000000150 26 FUNC GLOBAL DEFAULT 1 sys_ni_posix_timers
25: 0000000000000150 26 FUNC GLOBAL DEFAULT 1 __ia32_sys_timer_create
26: 0000000000000150 26 FUNC GLOBAL DEFAULT 1 __x64_sys_timer_gettime
Here's the corresponding cold subfunction:
11: 0000000000000000 45 FUNC LOCAL DEFAULT 6 sys_ni_posix_timers.cold.1
When analyzing overlapping functions, objtool only looks at the first
one in the symbol list. The rest of the functions are basically ignored
because they point to instructions which have already been analyzed.
So in this case it analyzes the __x64_sys_timer_create() function, but
then it fails to recognize that its cold subfunction is
sys_ni_posix_timers.cold.1(), because the names are different.
Make the subfunction detection a little smarter by associating each
subfunction with the first function which jumps to it, since that's the
one which will be analyzed.
Unfortunately we still have to leave the original subfunction detection
code in place, thanks to GCC switch tables. (See the comment for more
details.)
Fixes: 13810435b9a7 ("objtool: Support GCC 8's cold subfunctions")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/d3ba52662cbc8e3a64a3b64d44b4efc5674fd9ab.1527855808.git.jpoimboe@redhat.com
Both AMD and Intel can have SPEC_CTRL_MSR for SSBD.
However AMD also has two more other ways of doing it - which
are !SPEC_CTRL MSR ways.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: kvm@vger.kernel.org
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: andrew.cooper3@citrix.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180601145921.9500-4-konrad.wilk@oracle.com
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that if CPUID 8000_0008.EBX[24] is set we should be using
the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
for speculative store bypass disable.
This in effect means we should clear the X86_FEATURE_VIRT_SSBD
flag so that we would prefer the SPEC_CTRL MSR.
See the document titled:
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199889
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: andrew.cooper3@citrix.com
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that the CPUID 8000_0008.EBX[26] will mean that the
speculative store bypass disable is no longer needed.
A copy of this document is available at:
https://bugzilla.kernel.org/show_bug.cgi?id=199889
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: andrew.cooper3@citrix.com
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180601145921.9500-2-konrad.wilk@oracle.com
The idt_setup_apic_and_irq_gates() sets the gates from
FIRST_EXTERNAL_VECTOR up to FIRST_SYSTEM_VECTOR first. then secondly, from
FIRST_SYSTEM_VECTOR to NR_VECTORS, it takes both APIC=y and APIC=n into
account.
But for APIC=n, the FIRST_SYSTEM_VECTOR is equal to NR_VECTORS, all
vectors has been set at the first step.
Simplify the second step, make it just work for APIC=y.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180523023555.2933-1-douly.fnst@cn.fujitsu.com
Remove extra parentheses to fix the extraneous parentheses clang
warning.
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180520080012.8215-1-rvarsha016@gmail.com
AMD SME claims one bit from physical address to indicate whether the
page is encrypted or not. To achieve that we clear out the bit from
__PHYSICAL_MASK.
The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
For instance for upcoming Intel Multi-Key Total Memory Encryption.
Factor it out into a separate feature with own Kconfig handle.
It also helps with overhead of AMD SME. It saves more than 3k in .text
on defconfig + AMD_MEM_ENCRYPT:
add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)
We would need to return to this once we have infrastructure to patch
constants in code. That's good candidate for it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180518113028.79825-1-kirill.shutemov@linux.intel.com
When CONFIG_OPTIMIZE_INLINING is enabled, the function native_set_p4d()
may not be fully inlined into the caller, resulting in a false-positive
warning about an access to the __pgtable_l5_enabled variable from a
non-__init function, despite the original caller being an __init function:
WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_set_p4d() to the variable .init.data:__pgtable_l5_enabled
WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_p4d_clear() to the variable .init.data:__pgtable_l5_enabled
The function native_set_p4d() references the variable __initdata
__pgtable_l5_enabled. This is often because native_set_p4d lacks a
__initdata annotation or the annotation of __pgtable_l5_enabled is wrong.
Marking the native_set_p4d function and its caller native_p4d_clear()
avoids this problem.
I did not bisect the original cause, but I assume this is related to the
recent rework that turned pgtable_l5_enabled() into an inline function,
which in turn caused the compiler to make different inlining decisions.
Fixes: ad3fe525b950 ("x86/mm: Unify pgtable_l5_enabled usage in early boot code")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: https://lkml.kernel.org/r/20180605113715.1133726-1-arnd@arndb.de
A CONFIG_SMP=n build emits a harmless compile-time warning:
drivers/irqchip/irq-stm32-exti.c:495:12: error: 'stm32_exti_h_set_affinity' defined but not used [-Werror=unused-function]
The #ifdef is inconsistent here, and it's better to use an IS_ENABLED() check
that lets the compiler silently drop that function.
Fixes: 927abfc4461e ("irqchip/stm32: Add stm32mp1 support with hierarchy domain")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ludovic Barre <ludovic.barre@st.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Cc: Radoslaw Pietrzyk <radoslaw.pietrzyk@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Link: https://lkml.kernel.org/r/20180605114347.1347128-1-arnd@arndb.de
"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.
"param_test_benchmark" is the same as "param_test", but it removes
testing book-keeping code to allow accurate benchmarks.
"param_test_compare_twice" is the same as "param_test", but it performs
each comparison within rseq critical section twice, thus validating
invariants. If any of the second comparisons fails, an error message
is printed and the test aborts.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-16-mathieu.desnoyers@efficios.com
"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-15-mathieu.desnoyers@efficios.com
This rseq helper library provides a user-space API to the rseq()
system call.
The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section allows
debuggers may know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-13-mathieu.desnoyers@efficios.com
Wire up the rseq system call on powerpc.
This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-11-mathieu.desnoyers@efficios.com
Syscalls are not allowed inside restartable sequences, so add a call to
rseq_syscall() at the very beginning of system call exiting path for
CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
is a syscall issued inside restartable sequences.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-10-mathieu.desnoyers@efficios.com
Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.
Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-9-mathieu.desnoyers@efficios.com
Wire up the rseq system call on x86 32/64.
This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-8-mathieu.desnoyers@efficios.com
Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.
Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.
Check that system calls are not invoked from within rseq critical
sections by invoking rseq_signal() from syscall_return_slowpath().
With CONFIG_DEBUG_RSEQ, such behavior results in termination of the
process with SIGSEGV.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-7-mathieu.desnoyers@efficios.com
Wire up the rseq system call on 32-bit ARM.
This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-6-mathieu.desnoyers@efficios.com
Syscalls are not allowed inside restartable sequences, so add a call to
rseq_syscall() at the very beginning of system call exiting path for
CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
is a syscall issued inside restartable sequences.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-5-mathieu.desnoyers@efficios.com
Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.
Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-4-mathieu.desnoyers@efficios.com
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
Provide helper macros for fields which represent pointers in
kernel-userspace ABI. This facilitates handling of 32-bit
user-space by 64-bit kernels by defining those fields as
32-bit 0-padding and 32-bit integer on 32-bit architectures,
which allows the kernel to treat those as 64-bit integers.
The order of padding and 32-bit integer depends on the
endianness.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180602124408.8430-2-mathieu.desnoyers@efficios.com
There is a typo in f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after
set_pte and ptep_set_access_flags") config ifdef, which results in the
necessary ptesync not being issued after vmalloc.
This causes random kernel faults in module load, bpf load, anywhere
that vmalloc mappings are used.
After correcting the code, this survives a guest kernel booting
hundreds of times where previously there would be a crash every few
boots (I haven't noticed the crash on host, perhaps due to different
TLB and page table walking behaviour in hardware).
A memory clobber is also added to the flush, just to be sure it won't
be reordered with the pte set or the subsequent mapping access.
Fixes: f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull workqueue updates from Tejun Heo:
- make kworkers report the workqueue it is executing or has executed
most recently in /proc/PID/comm (so they show up in ps/top)
- CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds
inside CONFIG_SMP.
* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: move function definitions within CONFIG_SMP block
workqueue: Make sure struct worker is accessible for wq_worker_comm()
workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
proc: Consolidate task->comm formatting into proc_task_name()
workqueue: Set worker->desc to workqueue name by default
workqueue: Make worker_attach/detach_pool() update worker->pool
workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
Pull cgroup updates from Tejun Heo:
- For cpustat, cgroup has a percpu hierarchical stat mechanism which
propagates up the hierarchy lazily.
This contains commits to factor out and generalize the mechanism so
that it can be used for other cgroup stats too.
The original intention was to update memcg stats to use it but memcg
went for a different approach, so still the only user is cpustat. The
factoring out and generalization still make sense and it's likely
that this can be used for other purposes in the future.
- cgroup uses kernfs_notify() (which uses fsnotify()) to inform user
space of certain events. A rate limiting mechanism is added.
- Other misc changes.
* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: css_set_lock should nest inside tasklist_lock
rdmacg: Convert to use match_string() helper
cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
cgroup: Add cgroup_subsys->css_rstat_flush()
cgroup: Replace cgroup_rstat_mutex with a spinlock
cgroup: Factor out and expose cgroup_rstat_*() interface functions
cgroup: Reorganize kernel/cgroup/rstat.c
cgroup: Distinguish base resource stat implementation from rstat
cgroup: Rename stat to rstat
cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
cgroup: Limit event generation frequency
cgroup: Explicitly remove core interface files
Pull libata updates from Tejun Heo:
- libata has always been limiting the maximum queue depth to 31, with
one entry set aside mostly for historical reasons. This didn't use to
make much difference but Jens found out that modern hard drives can
actually perform measurably better with the extra one queue depth.
Jens updated libata core so that it can make use of full 32 queue
depth
- Damien updated command retry logic in error handling so that it
doesn't unnecessarily retry when upper layer (SCSI) is gonna handle
them
- A couple misc changes
* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
sata_fsl: use the right type for tag bitshift
ahci: enable full queue depth of 32
libata: don't clamp queue depth to ATA_MAX_QUEUE - 1
libata: add extra internal command
sata_nv: set host can_queue count appropriately
libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
libata: use ata_tag_internal() consistently
libata: bump ->qc_active to a 64-bit type
libata: convert core and drivers to ->hw_tag usage
libata: introduce notion of separate hardware tags
libata: Fix command retry decision
libata: Honor RQF_QUIET flag
libata: Make ata_dev_set_mode() less verbose
libata: Fix ata_err_string()
libata: Fix comment typo in ata_eh_analyze_tf()
sata_nv: don't use block layer bounce buffer
ata: hpt37x: Convert to use match_string() helper