/*
 * Helper macros to support writing architecture specific
 * linker scripts.
 *
 * A minimal linker script has the following content:
 * [This is a sample, architectures may have special requirements]
 *
 * OUTPUT_FORMAT(...)
 * OUTPUT_ARCH(...)
 * ENTRY(...)
 * SECTIONS
 * {
 *	. = START;
 *	__init_begin = .;
 *	HEAD_TEXT_SECTION
 *	INIT_TEXT_SECTION(PAGE_SIZE)
 *	INIT_DATA_SECTION(...)
 *	PERCPU_SECTION(CACHELINE_SIZE)
 *	__init_end = .;
 *
 *	_stext = .;
 *	TEXT_SECTION = 0
 *	_etext = .;
 *
 *	_sdata = .;
 *	RO_DATA(PAGE_SIZE)
 *	RW_DATA(...)
 *	_edata = .;
 *
 *	EXCEPTION_TABLE(...)
 *
 *	BSS_SECTION(0, 0, 0)
 *	_end = .;
 *
 *	STABS_DEBUG
 *	DWARF_DEBUG
 *	ELF_DETAILS
 *
 *	DISCARDS		// must be the last
 * }
 *
 * [__init_begin, __init_end] is the init section that may be freed after init
 *	// __init_begin and __init_end should be page aligned, so that we can
 *	// free the whole .init memory
 * [_stext, _etext] is the text section
 * [_sdata, _edata] is the data section
 *
 * Some of the included output sections have their own set of constants.
 * Examples are: [__initramfs_start, __initramfs_end] for initramfs and
 *               [__nosave_begin, __nosave_end] for the nosave data
 */
# ifndef LOAD_OFFSET
# define LOAD_OFFSET 0
# endif
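/*
 * Usage sketch (illustrative only, not taken from any particular
 * architecture): an architecture whose kernel is linked at a virtual
 * address that differs from its load address would typically define
 * LOAD_OFFSET before including this header. Output sections can then
 * derive their load address with AT(). The value below is a hypothetical
 * placeholder.
 *
 *	#define LOAD_OFFSET 0xc0000000
 *	#include <asm-generic/vmlinux.lds.h>
 *
 *	.text : AT(ADDR(.text) - LOAD_OFFSET) { ... }
 */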
/*
 * Only some architectures want to have the .notes segment visible in
 * a separate PT_NOTE ELF Program Header. When this happens, it needs
 * to be visible in both the kernel text's PT_LOAD and the PT_NOTE
 * Program Headers. In this case, though, the PT_LOAD needs to be made
 * the default again so that all the following sections don't also end
 * up in the PT_NOTE Program Header.
 */
# ifdef EMITS_PT_NOTE
# define NOTES_HEADERS :text :note
# define NOTES_HEADERS_RESTORE __restore_ph : { *(.__restore_ph) } :text
# else
# define NOTES_HEADERS
# define NOTES_HEADERS_RESTORE
# endif
/*
 * Some architectures have non-executable read-only exception tables.
 * They can be added to the RO_DATA segment by specifying their desired
 * alignment.
 */
# ifdef RO_EXCEPTION_TABLE_ALIGN
# define RO_EXCEPTION_TABLE EXCEPTION_TABLE(RO_EXCEPTION_TABLE_ALIGN)
# else
# define RO_EXCEPTION_TABLE
# endif
/* Align . to the function alignment. */
# define ALIGN_FUNCTION() . = ALIGN(CONFIG_FUNCTION_ALIGNMENT)
/*
 * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections, which
 * generates .data.identifier sections, which need to be pulled in with
 * .data. We don't want to pull in .data..other sections, which Linux
 * has defined. Same for text and bss.
 *
 * With LTO_CLANG, the linker also splits sections by default, so we need
 * these macros to combine the sections during the final link.
 *
 * RODATA_MAIN is not used because existing code already defines .rodata.x
 * sections to be brought in with rodata.
 */
# if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG)
# define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
# define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$L*
# define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
# define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L*
# define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral*
# define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
# else
# define TEXT_MAIN .text
# define DATA_MAIN .data
# define SDATA_MAIN .sdata
# define RODATA_MAIN .rodata
# define BSS_MAIN .bss
# define SBSS_MAIN .sbss
# endif
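/*
 * Matching sketch, using made-up section names: when the wildcard forms
 * of the *_MAIN macros are in effect, a rule such as
 *
 *	.text : {
 *		*(TEXT_MAIN)
 *	}
 *
 * picks up a compiler-generated input section like .text.my_func, because
 * the character following ".text." is in [0-9a-zA-Z_]. A kernel-defined
 * section like .text..example is not picked up, since its second dot is
 * outside that character class, so it has to be placed by its own rule.
 */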
/*
 * GCC 4.5 and later have a 32-byte section alignment for structures.
 * Except GCC 4.9, that feels the need to align on 64 bytes.
 */
# define STRUCT_ALIGNMENT 32
# define STRUCT_ALIGN() . = ALIGN(STRUCT_ALIGNMENT)

/*
 * The order of the sched class addresses is important, as they are
 * used to determine the order of the priority of each sched class in
 * relation to each other.
 */
# define SCHED_DATA \
	STRUCT_ALIGN(); \
	__sched_class_highest = .; \
	*(__stop_sched_class) \
	*(__dl_sched_class) \
	*(__rt_sched_class) \
	*(__fair_sched_class) \
	*(__idle_sched_class) \
__sched_class_lowest = . ;
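
/*
 * Comparison sketch (class_outranks() is a hypothetical helper name used
 * only for illustration): because SCHED_DATA lays the classes out from
 * highest to lowest priority, a class at a lower address outranks a class
 * at a higher address.
 *
 *	static inline int class_outranks(const void *a, const void *b)
 *	{
 *		return (unsigned long)a < (unsigned long)b;
 *	}
 */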

/*
 * The actual configuration determines if the init/exit sections
 * are handled as text/data or they can be discarded (which
 * often happens at runtime)
 */
# ifdef CONFIG_HOTPLUG_CPU
# define CPU_KEEP(sec) *(.cpu##sec)
# define CPU_DISCARD(sec)
# else
# define CPU_KEEP(sec)
# define CPU_DISCARD(sec) *(.cpu##sec)
# endif
# if defined(CONFIG_MEMORY_HOTPLUG)
# define MEM_KEEP(sec) *(.mem##sec)
# define MEM_DISCARD(sec)
# else
# define MEM_KEEP(sec)
# define MEM_DISCARD(sec) *(.mem##sec)
# endif
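/*
 * Expansion sketch: with CONFIG_HOTPLUG_CPU set, CPU_KEEP(init.data)
 * expands to *(.cpuinit.data) and CPU_DISCARD(init.data) expands to
 * nothing. Without it the roles are swapped, so the same input sections
 * end up in whichever list uses CPU_DISCARD(). MEM_KEEP() and
 * MEM_DISCARD() follow the same pattern with the .mem prefix, keyed on
 * CONFIG_MEMORY_HOTPLUG.
 */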
# ifndef CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
# define KEEP_PATCHABLE KEEP(*(__patchable_function_entries))
# define PATCHABLE_DISCARDS
# else
# define KEEP_PATCHABLE
# define PATCHABLE_DISCARDS *(__patchable_function_entries)
# endif
# ifndef CONFIG_ARCH_SUPPORTS_CFI_CLANG
/*
 * Simply points to ftrace_stub, but with the proper protocol.
 * Defined by the linker script in linux/vmlinux.lds.h
 */
# define FTRACE_STUB_HACK ftrace_stub_graph = ftrace_stub;
# else
# define FTRACE_STUB_HACK
# endif
# ifdef CONFIG_FTRACE_MCOUNT_RECORD
/*
 * The ftrace call sites are logged to a section whose name depends on the
 * compiler option used. A given kernel image will only use one, AKA
 * FTRACE_CALLSITE_SECTION. We capture all of them here to avoid header
 * dependencies for FTRACE_CALLSITE_SECTION's definition.
 *
 * ftrace_ops_list_func will be defined as arch_ftrace_ops_list_func
 * as some archs will have a different prototype for that function
 * but ftrace_ops_list_func() will have a single prototype.
 */
# define MCOUNT_REC() . = ALIGN(8); \
	__start_mcount_loc = .; \
	KEEP(*(__mcount_loc)) \
	KEEP_PATCHABLE \
	__stop_mcount_loc = .; \
	FTRACE_STUB_HACK \
	ftrace_ops_list_func = arch_ftrace_ops_list_func;
# else
# ifdef CONFIG_FUNCTION_TRACER
# define MCOUNT_REC() FTRACE_STUB_HACK \
	ftrace_ops_list_func = arch_ftrace_ops_list_func;
# else
# define MCOUNT_REC()
# endif
ftrace: create __mcount_loc section
This patch creates a section in the kernel called "__mcount_loc".
This will hold a list of pointers to the mcount relocation for
each call site of mcount.
For example:
objdump -dr init/main.o
[...]
Disassembly of section .text:
0000000000000000 <do_one_initcall>:
0: 55 push %rbp
[...]
000000000000017b <init_post>:
17b: 55 push %rbp
17c: 48 89 e5 mov %rsp,%rbp
17f: 53 push %rbx
180: 48 83 ec 08 sub $0x8,%rsp
184: e8 00 00 00 00 callq 189 <init_post+0xe>
185: R_X86_64_PC32 mcount+0xfffffffffffffffc
[...]
We will add a section to point to each function call.
.section __mcount_loc,"a",@progbits
[...]
.quad .text + 0x185
[...]
The offset to of the mcount call site in init_post is an offset from
the start of the section, and not the start of the function init_post.
The mcount relocation is at the call site 0x185 from the start of the
.text section.
.text + 0x185 == init_post + 0xa
We need a way to add this __mcount_loc section in a way that we do not
lose the relocations after final link. The .text section here will
be attached to all other .text sections after final link and the
offsets will be meaningless. We need to keep track of where these
.text sections are.
To do this, we use the start of the first function in the section.
do_one_initcall. We can make a tmp.s file with this function as a reference
to the start of the .text section.
.section __mcount_loc,"a",@progbits
[...]
.quad do_one_initcall + 0x185
[...]
Then we can compile the tmp.s into a tmp.o
gcc -c tmp.s -o tmp.o
And link it into back into main.o.
ld -r main.o tmp.o -o tmp_main.o
mv tmp_main.o main.o
But we have a problem. What happens if the first function in a section
is not exported, and is a static function. The linker will not let
the tmp.o use it. This case exists in main.o as well.
Disassembly of section .init.text:
0000000000000000 <set_reset_devices>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9>
5: R_X86_64_PC32 mcount+0xfffffffffffffffc
The first function in .init.text is a static function.
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
The lowercase 't' means that set_reset_devices is local and is not exported.
If we simply try to link the tmp.o with the set_reset_devices we end
up with two symbols: one local and one global.
.section __mcount_loc,"a",@progbits
.quad set_reset_devices + 0x10
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
U set_reset_devices
We still have an undefined reference to set_reset_devices, and if we try
to compile the kernel, we will end up with an undefined reference to
set_reset_devices, or even worst, it could be exported someplace else,
and then we will have a reference to the wrong location.
To handle this case, we make an intermediate step using objcopy.
We convert set_reset_devices into a global exported symbol before linking
it with tmp.o and set it back afterwards.
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 T set_reset_devices
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 T set_reset_devices
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
Now we have a section in main.o called __mcount_loc that we can place
somewhere in the kernel using vmlinux.lds.S and access it to convert
all these locations that call mcount into nops before starting SMP,
thus eliminating the need to do this with kstop_machine.
Note: a well-documented Perl script (scripts/recordmcount.pl) is used
to do all this in one location.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14 15:45:07 -04:00
#endif
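The commit message above describes recording every mcount call site in a __mcount_loc table so that the calls can be rewritten to nops early in boot. A minimal user-space sketch of that idea, under stated assumptions: it patches a plain byte buffer rather than live kernel text, the table holds byte offsets instead of absolute addresses, and the names patch_calls_to_nops and mcount_patch_demo are hypothetical and do not come from the kernel. The nop bytes are one common 5-byte x86 encoding, chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CALL_OPCODE 0xe8	/* x86 "call rel32": 1 opcode byte + 4 offset bytes */
#define CALL_SIZE   5

/* One common 5-byte x86 nop encoding (nopl 0x0(%rax,%rax,1)). */
static const unsigned char nop5[CALL_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Walk a table of byte offsets (standing in for the __mcount_loc entries)
 * and overwrite each recorded call with a nop. Returns how many sites were
 * patched. Entries that would run past the end of the buffer, or that do
 * not point at a call opcode, are skipped.
 */
size_t patch_calls_to_nops(unsigned char *text, size_t text_len,
			   const size_t *call_sites, size_t nr_sites)
{
	size_t patched = 0;

	for (size_t i = 0; i < nr_sites; i++) {
		size_t off = call_sites[i];

		if (off > text_len || text_len - off < CALL_SIZE)
			continue;
		if (text[off] != CALL_OPCODE)
			continue;
		memcpy(text + off, nop5, CALL_SIZE);
		patched++;
	}
	return patched;
}

/* Self-check modelled on the set_reset_devices listing: the call sits at offset 4. */
void mcount_patch_demo(void)
{
	unsigned char text[] = {
		0x55,				/* push %rbp */
		0x48, 0x89, 0xe5,		/* mov %rsp,%rbp */
		0xe8, 0x00, 0x00, 0x00, 0x00,	/* callq mcount */
		0xc3,				/* ret */
	};
	const size_t call_sites[] = { 4 };

	assert(patch_calls_to_nops(text, sizeof(text), call_sites, 1) == 1);
	assert(memcmp(text + 4, nop5, CALL_SIZE) == 0);
	assert(text[0] == 0x55 && text[9] == 0xc3);
}
```

Keeping the sites in one table means the patcher never has to scan or disassemble the text; it only visits the recorded locations.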
2008-01-20 20:07:28 +01:00
vmlinux.lds.h: fix BOUNDED_SECTION_(PRE|POST)_LABEL macros
Commit 2f465b921bb8 ("vmlinux.lds.h: place optional header space in BOUNDED_SECTION")
added BOUNDED_SECTION_(PRE|POST)_LABEL macros, encapsulating the basic
boilerplate to KEEP/pack records into a section, and to mark the begin
and end of the section with linker-symbols.
But it tried to do more, adding KEEP(*(.gnu.linkonce.##_sec_)) to
optionally reserve a header record in front of the data. It wrongly
placed the KEEP after the linker-symbol starting the section,
so if a header was added, it would wind up in the data.
Moving the KEEP to the "correct" place proved brittle, and too clever
by half. The obvious safe fix is to remove the KEEP and restore the
plain old boilerplate. The header can be added later, with separate
macros.
Also, the macro variable names _s_ and _e_ are nearly invisible; change
them to the more obvious names _BEGIN_ and _END_.
Fixes: 2f465b921bb8 ("vmlinux.lds.h: place optional header space in BOUNDED_SECTION")
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20221117171633.923628-2-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-17 10:16:32 -07:00
#define BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_) \
	_BEGIN_##_label_ = .; \
2022-10-22 16:56:36 -06:00
	KEEP(*(_sec_)) \
2022-11-17 10:16:32 -07:00
	_END_##_label_ = .;
2022-10-22 16:56:36 -06:00
2022-11-17 10:16:32 -07:00
#define BOUNDED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_) \
	_label_##_BEGIN_ = .; \
2022-10-22 16:56:36 -06:00
	KEEP(*(_sec_)) \
2022-11-17 10:16:32 -07:00
	_label_##_END_ = .;
2022-10-22 16:56:36 -06:00
#define BOUNDED_SECTION_BY(_sec_, _label_) \
	BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, __start, __stop)
#define BOUNDED_SECTION(_sec) BOUNDED_SECTION_BY(_sec, _sec)
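The PRE and POST variants differ only in which side of the label the begin/end marker is pasted onto. A small sketch of the resulting symbol names, using the C preprocessor to print what the token pasting produces. STR, PRE_LABEL_NAME and POST_LABEL_NAME are hypothetical helpers written for this illustration; they mirror the pasting order of the macros above but are not part of the header.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification so that the argument is macro-expanded first. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Same pasting order as BOUNDED_SECTION_PRE_LABEL: marker ## label. */
#define PRE_LABEL_NAME(_label_, _MARK_)  _MARK_##_label_
/* Same pasting order as BOUNDED_SECTION_POST_LABEL: label ## marker. */
#define POST_LABEL_NAME(_label_, _MARK_) _label_##_MARK_

void bounded_section_names_demo(void)
{
	/* BOUNDED_SECTION(_kprobe_blacklist) uses __start and __stop as markers. */
	assert(strcmp(STR(PRE_LABEL_NAME(_kprobe_blacklist, __start)),
		      "__start_kprobe_blacklist") == 0);
	assert(strcmp(STR(PRE_LABEL_NAME(_kprobe_blacklist, __stop)),
		      "__stop_kprobe_blacklist") == 0);

	/* A POST_LABEL use with an empty begin marker and _end as end marker. */
	assert(strcmp(STR(POST_LABEL_NAME(__earlycon_table, )),
		      "__earlycon_table") == 0);
	assert(strcmp(STR(POST_LABEL_NAME(__earlycon_table, _end)),
		      "__earlycon_table_end") == 0);
}
```

Passing an empty marker, as in the second pair, leaves the label itself as the start symbol, which is how a table can be addressed directly by its own name.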
vmlinux.lds.h: add HEADERED_SECTION_* macros
These macros elaborate on BOUNDED_SECTION_(PRE|POST)_LABEL macros,
prepending an optional KEEP(.gnu.linkonce##_sec_) reservation, and a
linker-symbol to address it.
This allows a developer to define a header struct (which must fit with
the section's base struct-type), and could contain:
1- fields whose value is common to the entire set of data-records.
This allows the header & data structs to specialize, complement
each other, and shrink.
2- an uplink pointer to an organizing struct
which refs other related/sub data-tables
The header record is addressable via the extern'd header linker-symbol.
Once the linker-symbols created by the macro are ref'd extern in code,
that code can compute a record's index (ptr - start) in the "primary"
table, then use it to index into the related/sub tables. Adding a
primary.map_* field foreach sub-table would then allow deduplication
and remapping of that sub-table.
This is aimed at dyndbg's struct _ddebug __dyndbg[] section, whose 3
columns: function, file, module are 50%, 90%, 100% redundant. The
module column is fully recoverable after dynamic_debug_init() saves it
to each ddebug_table.module as the builtin __dyndbg[] table is parsed.
Given that those 3 columns use 24/56 of a _ddebug record, a dyndbg=y
kernel with ~5k callsites could reduce kernel memory substantially.
Returning that memory to the kernel buddy-allocator? is then possible.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20221117171633.923628-3-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-17 10:16:33 -07:00
#define HEADERED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_, _HDR_) \
	_HDR_##_label_ = .; \
	KEEP(*(.gnu.linkonce.##_sec_)) \
	BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_)
#define HEADERED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_, _HDR_) \
	_label_##_HDR_ = .; \
	KEEP(*(.gnu.linkonce.##_sec_)) \
	BOUNDED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_)
#define HEADERED_SECTION_BY(_sec_, _label_) \
	HEADERED_SECTION_PRE_LABEL(_sec_, _label_, __start, __stop)
#define HEADERED_SECTION(_sec) HEADERED_SECTION_BY(_sec, _sec)
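The commit message for these macros describes computing a record's index as (ptr - start) in the primary table and using that index to reach related sub-tables. A minimal sketch of that lookup, assuming ordinary arrays in place of linker-placed sections. struct record, rec_start, rec_stop and rec_function are hypothetical names for illustration; rec_start and rec_stop stand in for the start and stop linker symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical primary record, slimmed down because one column moved out. */
struct record {
	int lineno;
};

/* Primary table; the two pointers play the role of the section's bounds. */
static struct record records[] = { { 10 }, { 20 }, { 30 } };
static struct record *const rec_start = records;
static struct record *const rec_stop = records + 3;

/* Related sub-table holding the column that was moved out of struct record. */
static const char *const rec_function[] = { "alpha", "beta", "gamma" };

/* Number of records between the two bounds. */
size_t record_count(void)
{
	return (size_t)(rec_stop - rec_start);
}

/* Index of a record in the primary table: (ptr - start). */
ptrdiff_t record_index(const struct record *r)
{
	return r - rec_start;
}

/* Sub-table entry for a record; r must point into the primary table. */
const char *record_function(const struct record *r)
{
	return rec_function[record_index(r)];
}
```

Because the index is derived from the record's address, the record itself needs no back-pointer or stored index, which is where the space saving described in the commit message would come from.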
2008-11-12 15:24:24 -05:00
#ifdef CONFIG_TRACE_BRANCH_PROFILING
2022-10-22 16:56:36 -06:00
#define LIKELY_PROFILE() \
	BOUNDED_SECTION_BY(_ftrace_annotated_branch, _annotated_branch_profile)
2008-11-12 00:14:39 -05:00
#else
#define LIKELY_PROFILE()
#endif
2008-11-21 01:30:54 -05:00
#ifdef CONFIG_PROFILE_ALL_BRANCHES
2022-10-22 16:56:36 -06:00
#define BRANCH_PROFILE() \
	BOUNDED_SECTION_BY(_ftrace_branch, _branch_profile)
2008-11-21 01:30:54 -05:00
#else
#define BRANCH_PROFILE()
#endif
2014-04-17 17:17:05 +09:00
#ifdef CONFIG_KPROBES
2022-10-22 16:56:36 -06:00
#define KPROBE_BLACKLIST() \
	. = ALIGN(8); \
	BOUNDED_SECTION(_kprobe_blacklist)
2014-04-17 17:17:05 +09:00
#else
#define KPROBE_BLACKLIST()
#endif
2018-01-13 02:55:03 +09:00
#ifdef CONFIG_FUNCTION_ERROR_INJECTION
2022-10-22 16:56:36 -06:00
#define ERROR_INJECT_WHITELIST() \
	STRUCT_ALIGN(); \
	BOUNDED_SECTION(_error_injection_whitelist)
2017-12-11 11:36:46 -05:00
#else
2018-01-13 02:55:03 +09:00
#define ERROR_INJECT_WHITELIST()
2017-12-11 11:36:46 -05:00
#endif
2009-04-08 03:14:01 -05:00
#ifdef CONFIG_EVENT_TRACING
2022-10-22 16:56:36 -06:00
#define FTRACE_EVENTS() \
	. = ALIGN(8); \
	BOUNDED_SECTION(_ftrace_events) \
	BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps)
2009-02-24 10:21:36 -05:00
#else
#define FTRACE_EVENTS()
#endif
2009-03-06 17:21:48 +01:00
#ifdef CONFIG_TRACING
2022-10-22 16:56:36 -06:00
#define TRACE_PRINTKS() BOUNDED_SECTION_BY(__trace_printk_fmt, ___trace_bprintk_fmt)
#define TRACEPOINT_STR() BOUNDED_SECTION_BY(__tracepoint_str, ___tracepoint_str)
2009-03-06 17:21:48 +01:00
#else
#define TRACE_PRINTKS()
2013-07-12 17:07:27 -04:00
#define TRACEPOINT_STR()
2009-03-06 17:21:48 +01:00
#endif
2009-03-13 15:42:11 +01:00
#ifdef CONFIG_FTRACE_SYSCALLS
2022-10-22 16:56:36 -06:00
#define TRACE_SYSCALLS() \
	. = ALIGN(8); \
	BOUNDED_SECTION_BY(__syscalls_metadata, _syscalls_metadata)
2009-03-13 15:42:11 +01:00
#else
#define TRACE_SYSCALLS()
#endif
2018-03-28 12:05:37 -07:00
#ifdef CONFIG_BPF_EVENTS
2022-10-22 16:56:36 -06:00
#define BPF_RAW_TP() STRUCT_ALIGN(); \
	BOUNDED_SECTION_BY(__bpf_raw_tp_map, __bpf_raw_tp)
2018-03-28 12:05:37 -07:00
#else
#define BPF_RAW_TP()
#endif
2015-03-09 16:27:21 -04:00
#ifdef CONFIG_SERIAL_EARLYCON
2022-10-22 16:56:36 -06:00
#define EARLYCON_TABLE() \
	. = ALIGN(8); \
	BOUNDED_SECTION_POST_LABEL(__earlycon_table, __earlycon_table, , _end)
2015-03-09 16:27:21 -04:00
#else
#define EARLYCON_TABLE()
#endif
2010-12-22 11:57:26 -08:00
2018-10-10 17:18:22 -07:00
#ifdef CONFIG_SECURITY
2022-10-22 16:56:36 -06:00
#define LSM_TABLE() \
	. = ALIGN(8); \
	BOUNDED_SECTION_PRE_LABEL(.lsm_info.init, _lsm_info, __start, __end)
#define EARLY_LSM_TABLE() \
	. = ALIGN(8); \
	BOUNDED_SECTION_PRE_LABEL(.early_lsm_info.init, _early_lsm_info, __start, __end)
2018-10-10 17:18:22 -07:00
#else
#define LSM_TABLE()
2019-08-19 17:17:37 -07:00
#define EARLY_LSM_TABLE()
2018-10-10 17:18:22 -07:00
#endif
2014-03-24 16:59:20 -05:00
#define ___OF_TABLE(cfg, name) _OF_TABLE_##cfg(name)
#define __OF_TABLE(cfg, name) ___OF_TABLE(cfg, name)
2016-06-14 14:58:58 +09:00
#define OF_TABLE(cfg, name) __OF_TABLE(IS_ENABLED(cfg), name)
2014-03-24 16:59:20 -05:00
#define _OF_TABLE_0(name)
#define _OF_TABLE_1(name) \
irqchip: add basic infrastructure
With the recent creation of the drivers/irqchip/ directory, it is
desirable to move irq controller drivers here. At the moment, the only
driver here is irq-bcm2835, the driver for the irq controller found in
the ARM BCM2835 SoC, present in Raspberry Pi systems. This irq
controller driver was exporting its initialization function and its
irq handling function through a header file in
<linux/irqchip/bcm2835.h>.
When proposing to also move another irq controller driver in
drivers/irqchip, Rob Herring raised the very valid point that moving
things to drivers/irqchip was good in order to remove more stuff from
arch/arm, but if it means adding gazillions of header files in
include/linux/irqchip/, it would not be very nice.
So, upon the suggestion of Rob Herring and Arnd Bergmann, this commit
introduces a small infrastructure that defines a central
irqchip_init() function in drivers/irqchip/irqchip.c, which is meant
to be called as the ->init_irq() callback of ARM platforms. This
function calls of_irq_init() with an array of match strings and init
functions generated from a special linker section.
Note that the irq controller driver initialization function is
responsible for setting the global handle_arch_irq() variable, so that
ARM platforms no longer have to define the ->handle_irq field in their
DT_MACHINE structure.
A global header, <linux/irqchip.h> is also added to expose the single
irqchip_init() function to the rest of the kernel.
A further commit moves the BCM2835 irq controller driver to this new
small infrastructure, therefore removing the include/linux/irqchip/
directory.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
[rob.herring: reword commit message to reflect use of linker sections.]
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
2012-11-20 23:00:52 +01:00
	. = ALIGN(8); \
2018-05-09 16:23:51 +09:00
	__##name##_of_table = .; \
2016-11-24 03:41:41 +11:00
	KEEP(*(__##name##_of_table)) \
	KEEP(*(__##name##_of_table_end))
2014-03-24 16:59:20 -05:00
2017-05-26 19:34:11 +02:00
#define TIMER_OF_TABLES() OF_TABLE(CONFIG_TIMER_OF, timer)
2014-03-24 16:59:20 -05:00
#define IRQCHIP_OF_MATCH_TABLE() OF_TABLE(CONFIG_IRQCHIP, irqchip)
#define CLK_OF_TABLES() OF_TABLE(CONFIG_COMMON_CLK, clk)
#define RESERVEDMEM_OF_TABLES() OF_TABLE(CONFIG_OF_RESERVED_MEM, reservedmem)
#define CPU_METHOD_OF_TABLES() OF_TABLE(CONFIG_SMP, cpu_method)
2015-02-02 16:32:45 +01:00
#define CPUIDLE_METHOD_OF_TABLES() OF_TABLE(CONFIG_CPU_IDLE, cpuidle_method)
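OF_TABLE selects between an empty and a populated variant by pasting a 0 or 1 onto a macro name, and the extra macro layers exist so that the switch is expanded to that digit before the paste happens. A minimal sketch of the same dispatch in plain C, assuming switches that are already defined as a literal 0 or 1. FEATURE_TIMER, FEATURE_CLK and the OF_DEMO names are hypothetical, and the kernel's IS_ENABLED() is not reproduced here.

```c
#include <assert.h>

/* Hypothetical feature switches; each must expand to a literal 0 or 1. */
#define FEATURE_TIMER 1
#define FEATURE_CLK   0

/* Variant selected when the switch is 0: contributes nothing. */
#define OF_DEMO_0(name) 0
/* Variant selected when the switch is 1: contributes one table. */
#define OF_DEMO_1(name) 1

/*
 * OF_DEMO_PASTE glues cfg onto the macro name. Operands of ## are not
 * expanded, so pasting FEATURE_TIMER directly would give
 * OF_DEMO_FEATURE_TIMER. Going through OF_DEMO first expands the switch
 * to 0 or 1, and only then is it pasted.
 */
#define OF_DEMO_PASTE(cfg, name) OF_DEMO_##cfg(name)
#define OF_DEMO(cfg, name)       OF_DEMO_PASTE(cfg, name)

/* Count how many of the two demo tables would be emitted. */
int enabled_table_count(void)
{
	return OF_DEMO(FEATURE_TIMER, timer) + OF_DEMO(FEATURE_CLK, clk);
}
```

With these switch values, only the timer table is selected, so the clk variant expands to nothing of substance, just as a disabled OF_TABLE emits no linker script text.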
2013-10-30 18:21:09 -07:00
2015-09-28 15:49:12 +01:00
#ifdef CONFIG_ACPI
#define ACPI_PROBE_TABLE(name) \
	. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
	BOUNDED_SECTION_POST_LABEL(__##name##_acpi_probe_table, \
				   __##name##_acpi_probe_table, , _end)
2015-09-28 15:49:12 +01:00
#else
#define ACPI_PROBE_TABLE(name)
#endif
2019-06-12 22:13:24 +02:00
#ifdef CONFIG_THERMAL
#define THERMAL_TABLE(name) \
	. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
	BOUNDED_SECTION_POST_LABEL(__##name##_thermal_table, \
				   __##name##_thermal_table, , _end)
2019-06-12 22:13:24 +02:00
#else
#define THERMAL_TABLE(name)
#endif
2010-12-22 11:57:26 -08:00
#define KERNEL_DTB() \
	STRUCT_ALIGN(); \
2018-05-09 16:23:51 +09:00
	__dtb_start = .; \
2016-11-24 03:41:41 +11:00
	KEEP(*(.dtb.init.rodata)) \
2018-05-09 16:23:51 +09:00
	__dtb_end = .;
2010-12-22 11:57:26 -08:00
kbuild: allow archs to select link dead code/data elimination
Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to
select to build with -ffunction-sections, -fdata-sections, and link
with --gc-sections. It requires some work (documented) to ensure all
unreferenced entrypoints are live, and requires toolchain and build
verification, so it is made a per-arch option for now.
On a random powerpc64le build, this yields a significant size saving,
it boots and runs fine, but there is a lot I haven't tested as yet, so
these savings may be reduced if there are bugs in the link.
    text    data     bss      dec filename
11169741 1180744 1923176 14273661 vmlinux
10445269 1004127 1919707 13369103 vmlinux.dce
~700K text, ~170K data, 6% removed from kernel image size.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-08-24 22:29:20 +10:00
/*
 * .data section
 */
2007-05-17 13:38:44 +02:00
#define DATA_DATA \
mtd: only use __xipram annotation when XIP_KERNEL is set
When XIP_KERNEL is enabled, some functions are defined in the .data
ELF section because we require them to be in RAM whenever we communicate
with the flash chip. However this causes problems when FTRACE is
enabled and gcc emits calls to __gnu_mcount_nc in the function
prolog:
drivers/built-in.o: In function `cfi_chip_setup':
:(.data+0x272fc): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o
drivers/built-in.o: In function `cfi_probe_chip':
:(.data+0x27de8): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o
/tmp/ccY172rP.s: Assembler messages:
/tmp/ccY172rP.s:70: Warning: ignoring changed section attributes for .data
/tmp/ccY172rP.s: Error: 1 warning, treating warnings as errors
make[5]: *** [drivers/mtd/chips/cfi_probe.o] Error 1
/tmp/ccK4rjeO.s: Assembler messages:
/tmp/ccK4rjeO.s:421: Warning: ignoring changed section attributes for .data
/tmp/ccK4rjeO.s: Error: 1 warning, treating warnings as errors
make[5]: *** [drivers/mtd/chips/cfi_util.o] Error 1
/tmp/ccUvhCYR.s: Assembler messages:
/tmp/ccUvhCYR.s:1895: Warning: ignoring changed section attributes for .data
/tmp/ccUvhCYR.s: Error: 1 warning, treating warnings as errors
Specifically, this does not work because the .data section is not
marked executable, which leads LD to not generate trampolines for
long calls.
This moves the __xipram functions into their own .xiptext section instead.
The section is still placed next to .data and located in RAM but is marked
executable, which avoids the build errors.
Also, we only need to place the XIP functions into a separate section
if both CONFIG_XIP_KERNEL and CONFIG_MTD_XIP are set: When only MTD_XIP
is used, the whole kernel is still in RAM and we do not need to worry
about pulling out the rug under it. When only XIP_KERNEL but not MTD_XIP
is set, the kernel is in some form of ROM, but we never write to it.
Note that MTD_XIP has been broken on ARM since around 2011 or 2012. I
have sent another patch[2] to fix compilation, which I plan to merge
through arm-soc unless there are objections. The obvious alternative
to that would be to completely rip out the MTD_XIP support from the
kernel, since obviously nobody has been using it in a long while.
Link: [1] https://patchwork.kernel.org/patch/8109771/
Link: [2] https://patchwork.kernel.org/patch/9855225/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2017-07-21 22:26:25 +02:00
	*(.xiptext) \
2017-07-26 22:46:27 +10:00
	*(DATA_MAIN) \
2022-11-08 10:49:34 -07:00
	*(.data..decrypted) \
2008-01-28 20:21:15 +01:00
	*(.ref.data) \
2010-10-26 14:22:29 -07:00
	*(.data..shared_aligned) /* percpu related */ \
2018-05-09 22:59:58 +10:00
	MEM_KEEP(init.data*) \
	MEM_KEEP(exit.data*) \
2012-03-23 15:01:52 -07:00
	*(.data.unlikely) \
2018-05-09 16:23:51 +09:00
	__start_once = .; \
2017-11-17 15:27:03 -08:00
	*(.data.once) \
2018-05-09 16:23:51 +09:00
	__end_once = .; \
2011-01-26 17:26:22 -05:00
	STRUCT_ALIGN(); \
tracing: Kernel Tracepoints
Implementation of kernel tracepoints. Inspired from the Linux Kernel
Markers. Allows complete typing verification by declaring both tracing
statement inline functions and probe registration/unregistration static
inline functions within the same macro "DEFINE_TRACE". No format string
is required. See the tracepoint Documentation and Samples patches for
usage examples.
Taken from the documentation patch :
"A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adds a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, which
prototypes are described in a tracepoint declaration placed in a header
file."
Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".
We make sure the previous array containing probes, which has been
scheduled for deletion by the rcu callback, is indeed freed before we
proceed to the next update. It therefore limits the rate of modification
of a single tracepoint to one update per RCU period. The objective here
is to permit fast batch add/removal of probes on _different_
tracepoints.
Changelog :
- Use #name ":" #proto as string to identify the tracepoint in the
tracepoint table. This will make sure no type mismatch happens due to
connection of a probe with the wrong type to a tracepoint declared with
the same name in a different header.
- Add tracepoint_entry_free_old.
- Change __TO_TRACE to get rid of the 'i' iterator.
Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.
Performance impact of a tracepoint : same as markers, except that it
adds about 70 bytes of instructions in an unlikely branch of each
instrumented function (the for loop, the stack setup and the function
call). It currently adds a memory read, a test and a conditional branch
at the instrumentation site (in the hot path). Immediate values will
eventually change this into a load immediate, test and branch, which
removes the memory read which will make the i-cache impact smaller
(changing the memory read for a load immediate removes 3-4 bytes per
site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
also saves the d-cache hit).
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in core
scheduler code) was added.
Quoting Hideo Aoki about Markers :
I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
tree, which includes several markers for LTTng, using an ia64 server.
While the immediate trace mark feature isn't implemented on ia64, there
is no major performance regression. So, I think that we don't have any
issues to propose merging marker point patches into Linus's tree from
the viewpoint of performance impact.
I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS. The second one was compiled with CONFIG_MARKERS enabled.
I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
I ran hackbench 5 times in each condition and calculated the average and
difference between the kernels.
The parameter of hackbench: every 50 from 50 to 800
The number of CPUs of the server: 2, 4, and 8
Below are the results. As you can see, no major performance regression
was found in any case. Even if the number of processes increases, the
differences between the marker-enabled kernel and the marker-disabled
kernel don't increase. Moreover, if the number of CPUs increases, the
differences don't increase either.
Curiously, the marker-enabled kernel is better than the marker-disabled
kernel in more than half of the cases, although I guess it comes from the
difference in memory access pattern.
* 2 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 4.811 | 4.872 | +0.061 | +1.27 |
100 | 9.854 | 10.309 | +0.454 | +4.61 |
150 | 15.602 | 15.040 | -0.562 | -3.6 |
200 | 20.489 | 20.380 | -0.109 | -0.53 |
250 | 25.798 | 25.652 | -0.146 | -0.56 |
300 | 31.260 | 30.797 | -0.463 | -1.48 |
350 | 36.121 | 35.770 | -0.351 | -0.97 |
400 | 42.288 | 42.102 | -0.186 | -0.44 |
450 | 47.778 | 47.253 | -0.526 | -1.1 |
500 | 51.953 | 52.278 | +0.325 | +0.63 |
550 | 58.401 | 57.700 | -0.701 | -1.2 |
600 | 63.334 | 63.222 | -0.112 | -0.18 |
650 | 68.816 | 68.511 | -0.306 | -0.44 |
700 | 74.667 | 74.088 | -0.579 | -0.78 |
750 | 78.612 | 79.582 | +0.970 | +1.23 |
800 | 85.431 | 85.263 | -0.168 | -0.2 |
--------------------------------------------------------------
* 4 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.586 | 2.584 | -0.003 | -0.1 |
100 | 5.254 | 5.283 | +0.030 | +0.56 |
150 | 8.012 | 8.074 | +0.061 | +0.76 |
200 | 11.172 | 11.000 | -0.172 | -1.54 |
250 | 13.917 | 14.036 | +0.119 | +0.86 |
300 | 16.905 | 16.543 | -0.362 | -2.14 |
350 | 19.901 | 20.036 | +0.135 | +0.68 |
400 | 22.908 | 23.094 | +0.186 | +0.81 |
450 | 26.273 | 26.101 | -0.172 | -0.66 |
500 | 29.554 | 29.092 | -0.461 | -1.56 |
550 | 32.377 | 32.274 | -0.103 | -0.32 |
600 | 35.855 | 35.322 | -0.533 | -1.49 |
650 | 39.192 | 38.388 | -0.804 | -2.05 |
700 | 41.744 | 41.719 | -0.025 | -0.06 |
750 | 45.016 | 44.496 | -0.520 | -1.16 |
800 | 48.212 | 47.603 | -0.609 | -1.26 |
--------------------------------------------------------------
* 8 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.094 | 2.072 | -0.022 | -1.07 |
100 | 4.162 | 4.273 | +0.111 | +2.66 |
150 | 6.485 | 6.540 | +0.055 | +0.84 |
200 | 8.556 | 8.478 | -0.078 | -0.91 |
250 | 10.458 | 10.258 | -0.200 | -1.91 |
300 | 12.425 | 12.750 | +0.325 | +2.62 |
350 | 14.807 | 14.839 | +0.032 | +0.22 |
400 | 16.801 | 16.959 | +0.158 | +0.94 |
450 | 19.478 | 19.009 | -0.470 | -2.41 |
500 | 21.296 | 21.504 | +0.208 | +0.98 |
550 | 23.842 | 23.979 | +0.137 | +0.57 |
600 | 26.309 | 26.111 | -0.198 | -0.75 |
650 | 28.705 | 28.446 | -0.259 | -0.9 |
700 | 31.233 | 31.394 | +0.161 | +0.52 |
750 | 34.064 | 33.720 | -0.344 | -1.01 |
800 | 36.320 | 36.114 | -0.206 | -0.57 |
--------------------------------------------------------------
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 12:16:16 -04:00
	*(__tracepoints) \
2009-02-05 11:51:38 -05:00
	/* implement dynamic printk debug */ \
	. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
	BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
	BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
2008-11-21 01:30:54 -05:00
	LIKELY_PROFILE() \
2009-02-24 10:21:36 -05:00
	BRANCH_PROFILE() \
2013-07-12 17:07:27 -04:00
	TRACE_PRINTKS() \
2018-03-28 12:05:37 -07:00
	BPF_RAW_TP() \
2013-07-12 17:07:27 -04:00
	TRACEPOINT_STR()
2007-05-17 13:38:44 +02:00
2009-06-07 20:46:37 +02:00
/*
 * Data section helpers
 */
#define NOSAVE_DATA \
	. = ALIGN(PAGE_SIZE); \
2018-05-09 16:23:51 +09:00
	__nosave_begin = .; \
2010-02-20 01:03:52 +01:00
	*(.data..nosave) \
2009-06-07 20:46:37 +02:00
	. = ALIGN(PAGE_SIZE); \
2018-05-09 16:23:51 +09:00
	__nosave_end = .;
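NOSAVE_DATA aligns the location counter both before and after the payload, so the region between __nosave_begin and __nosave_end always covers whole pages. A minimal sketch of that arithmetic, using the rounding formula the GNU ld manual gives for ALIGN. align_up and nosave_span are hypothetical helper names, and the page size of 4096 used in the usage note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align; align must be a power of two. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Size of a region laid out like NOSAVE_DATA: align the start, place the
 * payload, then align the end again.
 */
uint64_t nosave_span(uint64_t location, uint64_t payload, uint64_t page_size)
{
	uint64_t begin = align_up(location, page_size);
	uint64_t end = align_up(begin + payload, page_size);

	return end - begin;
}
```

For example, with a 4096-byte page a 100-byte payload occupies one full page and a 4097-byte payload occupies two, while an empty payload occupies none.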
2009-06-07 20:46:37 +02:00
#define PAGE_ALIGNED_DATA(page_align) \
	. = ALIGN(page_align); \
2020-07-21 11:34:48 +02:00
	*(.data..page_aligned) \
	. = ALIGN(page_align);
2009-06-07 20:46:37 +02:00
#define READ_MOSTLY_DATA(align) \
	. = ALIGN(align); \
2011-01-12 16:59:38 -08:00
	*(.data..read_mostly) \
	. = ALIGN(align);
2009-06-07 20:46:37 +02:00
#define CACHELINE_ALIGNED_DATA(align) \
	. = ALIGN(align); \
2010-02-20 01:03:34 +01:00
	*(.data..cacheline_aligned)
2009-06-07 20:46:37 +02:00
2009-06-23 18:53:15 -04:00
#define INIT_TASK_DATA(align) \
2009-06-07 20:46:37 +02:00
	. = ALIGN(align); \
2018-05-09 16:23:51 +09:00
	__start_init_task = .; \
	init_thread_union = .; \
	init_stack = .; \
2018-05-09 22:59:58 +10:00
	KEEP(*(.data..init_task)) \
	KEEP(*(.data..init_thread_info)) \
2018-05-09 16:23:51 +09:00
	. = __start_init_task + THREAD_SIZE; \
	__end_init_task = .;
2009-06-07 20:46:37 +02:00
2018-09-18 23:51:43 -07:00
#define JUMP_TABLE_DATA \
	. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
	BOUNDED_SECTION_BY(__jump_table, ___jump_table)
2018-09-18 23:51:43 -07:00
2022-03-08 16:30:12 +01:00
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
2020-08-18 15:57:42 +02:00
#define STATIC_CALL_DATA \
	. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
	BOUNDED_SECTION_BY(.static_call_sites, _static_call_sites) \
	BOUNDED_SECTION_BY(.static_call_tramp_key, _static_call_tramp_key)
2022-03-08 16:30:12 +01:00
#else
#define STATIC_CALL_DATA
#endif
2020-08-18 15:57:42 +02:00
2016-06-07 12:20:51 +02:00
/*
 * Allow architectures to handle ro_after_init data on their
 * own by defining an empty RO_AFTER_INIT_DATA.
 */
#ifndef RO_AFTER_INIT_DATA
2016-11-10 10:46:44 -08:00
#define RO_AFTER_INIT_DATA \
2020-08-14 17:31:57 -07:00
	. = ALIGN(8); \
2018-05-09 16:23:51 +09:00
	__start_ro_after_init = .; \
2016-11-10 10:46:44 -08:00
	*(.data..ro_after_init) \
2018-09-18 23:51:43 -07:00
	JUMP_TABLE_DATA \
2020-08-18 15:57:42 +02:00
	STATIC_CALL_DATA \
2018-05-09 16:23:51 +09:00
	__end_ro_after_init = .;
2016-06-07 12:20:51 +02:00
#endif
2022-09-08 14:54:47 -07:00
/*
 * .kcfi_traps contains a list of KCFI trap locations.
 */
#ifndef KCFI_TRAPS
#ifdef CONFIG_ARCH_USES_CFI_TRAPS
#define KCFI_TRAPS \
	__kcfi_traps : AT(ADDR(__kcfi_traps) - LOAD_OFFSET) { \
2022-10-22 16:56:36 -06:00
		BOUNDED_SECTION_BY(.kcfi_traps, ___kcfi_traps) \
2022-09-08 14:54:47 -07:00
	}
#else
#define KCFI_TRAPS
#endif
#endif
2009-06-07 20:46:37 +02:00
/*
 * Read only Data
 */
2019-10-29 14:13:34 -07:00
#define RO_DATA(align) \
2007-05-29 21:29:00 +02:00
	. = ALIGN((align)); \
2005-04-16 15:20:36 -07:00
	.rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \
2018-05-09 16:23:51 +09:00
		__start_rodata = .; \
2005-04-16 15:20:36 -07:00
		*(.rodata) *(.rodata.*) \
2019-12-19 16:44:52 -05:00
		SCHED_DATA \
2016-06-07 12:20:51 +02:00
		RO_AFTER_INIT_DATA /* Read only after init */ \
2011-01-26 17:26:22 -05:00
		. = ALIGN(8); \
2022-10-22 16:56:36 -06:00
		BOUNDED_SECTION_BY(__tracepoints_ptrs, ___tracepoints_ptrs) \
tracing: Kernel Tracepoints
Implementation of kernel tracepoints. Inspired from the Linux Kernel
Markers. Allows complete typing verification by declaring both tracing
statement inline functions and probe registration/unregistration static
inline functions within the same macro "DEFINE_TRACE". No format string
is required. See the tracepoint Documentation and Samples patches for
usage examples.
Taken from the documentation patch :
"A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adds a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, which
prototypes are described in a tracepoint declaration placed in a header
file."
Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".
We make sure the previous array containing probes, which has been
scheduled for deletion by the rcu callback, is indeed freed before we
proceed to the next update. It therefore limits the rate of modification
of a single tracepoint to one update per RCU period. The objective here
is to permit fast batch add/removal of probes on _different_
tracepoints.
Changelog :
- Use #name ":" #proto as string to identify the tracepoint in the
tracepoint table. This will make sure no type mismatch happens due to
connexion of a probe with the wrong type to a tracepoint declared with
the same name in a different header.
- Add tracepoint_entry_free_old.
- Change __TO_TRACE to get rid of the 'i' iterator.
Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.
Performance impact of a tracepoint : same as markers, except that it
adds about 70 bytes of instructions in an unlikely branch of each
instrumented function (the for loop, the stack setup and the function
call). It currently adds a memory read, a test and a conditional branch
at the instrumentation site (in the hot path). Immediate values will
eventually change this into a load immediate, test and branch, which
removes the memory read which will make the i-cache impact smaller
(changing the memory read for a load immediate removes 3-4 bytes per
site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
also saves the d-cache hit).
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in
scheduler code) was added.
Quoting Hideo Aoki about Markers :
I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
tree, which includes several markers for LTTng, using an ia64 server.
While the immediate trace mark feature isn't implemented on ia64, there
is no major performance regression. So, I think that we don't have any
issues to propose merging marker point patches into Linus's tree from
the viewpoint of performance impact.
I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS. The second one was compiled with CONFIG_MARKERS enabled.
I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
I ran hackbench 5 times in each condition and calculated the average and
difference between the kernels.
The parameter of hackbench: every 50 from 50 to 800
The number of CPUs of the server: 2, 4, and 8
Below are the results. As you can see, no major performance regression
was found in any case. Even if the number of processes increases, the
differences between the marker-enabled kernel and the marker-disabled
kernel don't increase. Moreover, if the number of CPUs increases, the
differences don't increase either.
Curiously, marker-enabled kernel is better than marker-disabled kernel
in more than half cases, although I guess it comes from the difference
of memory access pattern.
* 2 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 4.811 | 4.872 | +0.061 | +1.27 |
100 | 9.854 | 10.309 | +0.454 | +4.61 |
150 | 15.602 | 15.040 | -0.562 | -3.6 |
200 | 20.489 | 20.380 | -0.109 | -0.53 |
250 | 25.798 | 25.652 | -0.146 | -0.56 |
300 | 31.260 | 30.797 | -0.463 | -1.48 |
350 | 36.121 | 35.770 | -0.351 | -0.97 |
400 | 42.288 | 42.102 | -0.186 | -0.44 |
450 | 47.778 | 47.253 | -0.526 | -1.1 |
500 | 51.953 | 52.278 | +0.325 | +0.63 |
550 | 58.401 | 57.700 | -0.701 | -1.2 |
600 | 63.334 | 63.222 | -0.112 | -0.18 |
650 | 68.816 | 68.511 | -0.306 | -0.44 |
700 | 74.667 | 74.088 | -0.579 | -0.78 |
750 | 78.612 | 79.582 | +0.970 | +1.23 |
800 | 85.431 | 85.263 | -0.168 | -0.2 |
--------------------------------------------------------------
* 4 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.586 | 2.584 | -0.003 | -0.1 |
100 | 5.254 | 5.283 | +0.030 | +0.56 |
150 | 8.012 | 8.074 | +0.061 | +0.76 |
200 | 11.172 | 11.000 | -0.172 | -1.54 |
250 | 13.917 | 14.036 | +0.119 | +0.86 |
300 | 16.905 | 16.543 | -0.362 | -2.14 |
350 | 19.901 | 20.036 | +0.135 | +0.68 |
400 | 22.908 | 23.094 | +0.186 | +0.81 |
450 | 26.273 | 26.101 | -0.172 | -0.66 |
500 | 29.554 | 29.092 | -0.461 | -1.56 |
550 | 32.377 | 32.274 | -0.103 | -0.32 |
600 | 35.855 | 35.322 | -0.533 | -1.49 |
650 | 39.192 | 38.388 | -0.804 | -2.05 |
700 | 41.744 | 41.719 | -0.025 | -0.06 |
750 | 45.016 | 44.496 | -0.520 | -1.16 |
800 | 48.212 | 47.603 | -0.609 | -1.26 |
--------------------------------------------------------------
* 8 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.094 | 2.072 | -0.022 | -1.07 |
100 | 4.162 | 4.273 | +0.111 | +2.66 |
150 | 6.485 | 6.540 | +0.055 | +0.84 |
200 | 8.556 | 8.478 | -0.078 | -0.91 |
250 | 10.458 | 10.258 | -0.200 | -1.91 |
300 | 12.425 | 12.750 | +0.325 | +2.62 |
350 | 14.807 | 14.839 | +0.032 | +0.22 |
400 | 16.801 | 16.959 | +0.158 | +0.94 |
450 | 19.478 | 19.009 | -0.470 | -2.41 |
500 | 21.296 | 21.504 | +0.208 | +0.98 |
550 | 23.842 | 23.979 | +0.137 | +0.57 |
600 | 26.309 | 26.111 | -0.198 | -0.75 |
650 | 28.705 | 28.446 | -0.259 | -0.9 |
700 | 31.233 | 31.394 | +0.161 | +0.52 |
750 | 34.064 | 33.720 | -0.344 | -1.01 |
800 | 36.320 | 36.114 | -0.206 | -0.57 |
--------------------------------------------------------------
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 12:16:16 -04:00
* ( __tracepoints_strings ) /* Tracepoints: strings */ \
2005-04-16 15:20:36 -07:00
} \
\
. rodata1 : AT ( ADDR ( . rodata1 ) - LOAD_OFFSET ) { \
* ( . rodata1 ) \
} \
\
/* PCI quirks */ \
. pci_fixup : AT ( ADDR ( . pci_fixup ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_early , _pci_fixups_early , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_header , _pci_fixups_header , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_final , _pci_fixups_final , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_enable , _pci_fixups_enable , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_resume , _pci_fixups_resume , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_suspend , _pci_fixups_suspend , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_resume_early , _pci_fixups_resume_early , __start , __end ) \
BOUNDED_SECTION_PRE_LABEL ( . pci_fixup_suspend_late , _pci_fixups_suspend_late , __start , __end ) \
2005-04-16 15:20:36 -07:00
} \
\
2021-10-21 08:58:38 -07:00
FW_LOADER_BUILT_IN_DATA \
2008-05-12 15:44:41 +02:00
TRACEDATA \
\
printk: Userspace format indexing support
We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:
$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-15 17:52:53 +01:00
PRINTK_INDEX \
\
2005-04-16 15:20:36 -07:00
/* Kernel symbol table: Normal symbols */ \
__ksymtab : AT ( ADDR ( __ksymtab ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
__start___ksymtab = . ; \
kbuild: allow archs to select link dead code/data elimination
Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to
select to build with -ffunction-sections, -fdata-sections, and link
with --gc-sections. It requires some work (documented) to ensure all
unreferenced entrypoints are live, and requires toolchain and build
verification, so it is made a per-arch option for now.
On a random powerpc64le build, this yields a significant size saving,
it boots and runs fine, but there is a lot I haven't tested as yet, so
these savings may be reduced if there are bugs in the link.
text data bss dec filename
11169741 1180744 1923176 14273661 vmlinux
10445269 1004127 1919707 13369103 vmlinux.dce
~700K text, ~170K data, 6% removed from kernel image size.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-08-24 22:29:20 +10:00
KEEP ( * ( SORT ( ___ksymtab + * ) ) ) \
2018-05-09 16:23:51 +09:00
__stop___ksymtab = . ; \
2005-04-16 15:20:36 -07:00
} \
\
/* Kernel symbol table: GPL-only symbols */ \
__ksymtab_gpl : AT ( ADDR ( __ksymtab_gpl ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
__start___ksymtab_gpl = . ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( SORT ( ___ksymtab_gpl + * ) ) ) \
2018-05-09 16:23:51 +09:00
__stop___ksymtab_gpl = . ; \
2005-04-16 15:20:36 -07:00
} \
\
/* Kernel symbol CRC table: Normal symbols */ \
__kcrctab : AT ( ADDR ( __kcrctab ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
__start___kcrctab = . ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( SORT ( ___kcrctab + * ) ) ) \
2018-05-09 16:23:51 +09:00
__stop___kcrctab = . ; \
2005-04-16 15:20:36 -07:00
} \
\
/* Kernel symbol CRC table: GPL-only symbols */ \
__kcrctab_gpl : AT ( ADDR ( __kcrctab_gpl ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
__start___kcrctab_gpl = . ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( SORT ( ___kcrctab_gpl + * ) ) ) \
2018-05-09 16:23:51 +09:00
__stop___kcrctab_gpl = . ; \
2005-04-16 15:20:36 -07:00
} \
\
/* Kernel symbol table: strings */ \
__ksymtab_strings : AT ( ADDR ( __ksymtab_strings ) - LOAD_OFFSET ) { \
2016-11-24 03:41:41 +11:00
* ( __ksymtab_strings ) \
2005-04-16 15:20:36 -07:00
} \
\
2008-01-20 20:07:28 +01:00
/* __*init sections */ \
__init_rodata : AT ( ADDR ( __init_rodata ) - LOAD_OFFSET ) { \
2008-01-28 20:21:15 +01:00
* ( . ref . rodata ) \
2008-01-20 20:07:28 +01:00
MEM_KEEP ( init . rodata ) \
MEM_KEEP ( exit . rodata ) \
} \
\
2005-04-16 15:20:36 -07:00
/* Built-in module parameters. */ \
__param : AT ( ADDR ( __param ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( __param , ___param ) \
2010-12-15 14:00:19 -08:00
} \
\
/* Built-in module versions. */ \
__modver : AT ( ADDR ( __modver ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( __modver , ___modver ) \
2006-09-27 01:51:02 -07:00
} \
2019-10-29 14:13:32 -07:00
\
2022-09-08 14:54:47 -07:00
KCFI_TRAPS \
\
2019-10-29 14:13:36 -07:00
RO_EXCEPTION_TABLE \
2019-10-29 14:13:32 -07:00
NOTES \
2020-03-18 15:27:46 -07:00
BTF \
2019-10-29 14:13:32 -07:00
\
. = ALIGN ( ( align ) ) ; \
__end_rodata = . ;
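RO_DATA(align) begins and ends with `. = ALIGN((align))`, which rounds the location counter up to the next multiple of `align`. For the power-of-two alignments used in practice this is the usual mask arithmetic, which the GNU ld manual describes as equivalent to `(. + align - 1) & ~(align - 1)`. A worked sketch; the helper name `align_up` is made up for illustration:

```c
#include <stdint.h>

/* Round addr up to the next multiple of align; align must be a power of two. */
static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}
```

With a 4 KiB page size, `align_up(0x1234, 0x1000)` gives `0x2000`, while an already aligned `0x2000` is left unchanged.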
2005-04-16 15:20:36 -07:00
add support for Clang CFI
This change adds support for Clang’s forward-edge Control Flow
Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
injects a runtime check before each indirect function call to ensure
the target is a valid function with the correct static type. This
restricts possible call targets and makes it more difficult for
an attacker to exploit bugs that allow the modification of stored
function pointers. For more details, see:
https://clang.llvm.org/docs/ControlFlowIntegrity.html
Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
visibility to possible call targets. Kernel modules are supported
with Clang’s cross-DSO CFI mode, which allows checking between
independently compiled components.
With CFI enabled, the compiler injects a __cfi_check() function into
the kernel and each module for validating local call targets. For
cross-module calls that cannot be validated locally, the compiler
calls the global __cfi_slowpath_diag() function, which determines
the target module and calls the correct __cfi_check() function. This
patch includes a slowpath implementation that uses __module_address()
to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
shadow map that speeds up module look-ups by ~3x.
Clang implements indirect call checking using jump tables and
offers two methods of generating them. With canonical jump tables,
the compiler renames each address-taken function to <function>.cfi
and points the original symbol to a jump table entry, which passes
__cfi_check() validation. This isn’t compatible with stand-alone
assembly code, which the compiler doesn’t instrument, and would
cause indirect calls to assembly code to fail. Therefore, we
default to using non-canonical jump tables instead, where the compiler
generates a local jump table entry <function>.cfi_jt for each
address-taken function, and replaces all references to the function
with the address of the jump table entry.
Note that because non-canonical jump table addresses are local
to each component, they break cross-module function address
equality. Specifically, the address of a global function will be
different in each module, as it's replaced with the address of a local
jump table entry. If this address is passed to a different module,
it won’t match the address of the same function taken there. This
may break code that relies on comparing addresses passed from other
components.
CFI checking can be disabled in a function with the __nocfi attribute.
Additionally, CFI can be disabled for an entire compilation unit by
filtering out CC_FLAGS_CFI.
By default, CFI failures result in a kernel panic to stop a potential
exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
kernel prints out a rate-limited warning instead, and allows execution
to continue. This option is helpful for locating type mismatches, but
should only be enabled during development.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
2021-04-08 11:28:26 -07:00
2020-03-09 22:47:17 +01:00
/*
* Non-instrumentable text section
*/
# define NOINSTR_TEXT \
ALIGN_FUNCTION ( ) ; \
__noinstr_text_start = . ; \
* ( . noinstr . text ) \
2023-01-12 20:43:31 +01:00
__cpuidle_text_start = . ; \
* ( . cpuidle . text ) \
__cpuidle_text_end = . ; \
2020-03-09 22:47:17 +01:00
__noinstr_text_end = . ;
2017-07-26 22:46:27 +10:00
/*
* . text section . Map to function alignment to avoid address changes
2016-09-14 12:24:03 +10:00
* during the second ld pass when generating System . map
2017-07-26 22:46:27 +10:00
*
* TEXT_MAIN here will match . text . fixup and . text . unlikely if dead
* code elimination is enabled , so these sections should be converted
* to use " .. " first .
*/
2007-05-13 00:31:33 +02:00
# define TEXT_TEXT \
ALIGN_FUNCTION ( ) ; \
vmlinux.lds.h: Add PGO and AutoFDO input sections
Basically, consider .text.{hot|unlikely|unknown}.* part of .text, too.
When compiling with profiling information (collected via PGO
instrumentations or AutoFDO sampling), Clang will separate code into
.text.hot, .text.unlikely, or .text.unknown sections based on profiling
information. After D79600 (clang-11), these sections will have a
trailing `.` suffix, ie. .text.hot., .text.unlikely., .text.unknown..
When using -ffunction-sections together with profiling information,
either explicitly (FGKASLR) or implicitly (LTO), code may be placed in
sections following the convention:
.text.hot.<foo>, .text.unlikely.<bar>, .text.unknown.<baz>
where <foo>, <bar>, and <baz> are functions. (This produces one section
per function; we generally try to merge these all back via linker script
so that we don't have 50k sections).
For the above cases, we need to teach our linker scripts that such
sections might exist and that we'd explicitly like them grouped
together, otherwise we can wind up with code outside of the
_stext/_etext boundaries that might not be mapped properly for some
architectures, resulting in boot failures.
If the linker script is not told about possible input sections, then
where the section is placed as output is a heuristic-laden mess that's
non-portable between linkers (ie. BFD and LLD), and has resulted in many
hard to debug bugs. Kees Cook is working on cleaning this up by adding
--orphan-handling=warn linker flag used in ARCH=powerpc to additional
architectures. In the case of linker scripts, borrowing from the Zen of
Python: explicit is better than implicit.
Also, ld.bfd's internal linker script considers .text.hot AND
.text.hot.* to be part of .text, as well as .text.unlikely and
.text.unlikely.*. I didn't see support for .text.unknown.*, and didn't
see Clang producing such code in our kernel builds, but I see code in
LLVM that can produce such section names if profiling information is
missing. That may point to a larger issue with generating or collecting
profiles, but I would much rather be safe and explicit than have to
debug yet another issue related to orphan section placement.
Reported-by: Jian Cai <jiancai@google.com>
Suggested-by: Fāng-ruì Sòng <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Luis Lozano <llozano@google.com>
Tested-by: Manoj Gupta <manojgupta@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=add44f8d5c5c05e08b11e033127a744d61c26aee
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=1de778ed23ce7492c523d5850c6c6dbb34152655
Link: https://reviews.llvm.org/D79600
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1084760
Link: https://lore.kernel.org/r/20200821194310.3089815-7-keescook@chromium.org
Debugged-by: Luis Lozano <llozano@google.com>
2020-08-21 12:42:47 -07:00
* ( . text . hot . text . hot . * ) \
* ( TEXT_MAIN . text . fixup ) \
* ( . text . unlikely . text . unlikely . * ) \
* ( . text . unknown . text . unknown . * ) \
2020-03-09 22:47:17 +01:00
NOINSTR_TEXT \
locking/refcounts, x86/asm: Use unique .text section for refcount exceptions
Using .text.unlikely for refcount exceptions isn't safe because gcc may
move entire functions into .text.unlikely (e.g. in6_dev_dev()), which
would cause any uses of a protected refcount_t function to stay inline
with the function, triggering the protection unconditionally:
.section .text.unlikely,"ax",@progbits
.type in6_dev_get, @function
in6_dev_getx:
.LFB4673:
.loc 2 4128 0
.cfi_startproc
...
lock; incl 480(%rbx)
js 111f
.pushsection .text.unlikely
111: lea 480(%rbx), %rcx
112: .byte 0x0f, 0xff
.popsection
113:
This creates a unique .text..refcount section and adds an additional
test to the exception handler to WARN in the case of having none of OF,
SF, nor ZF set so we can see things like this more easily in the future.
The double dot for the section name keeps it out of the TEXT_MAIN macro
namespace, to avoid collisions and so it can be put at the end with
text.unlikely to keep the cold code together.
See commit:
cb87481ee89db ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured")
... which matches C names: [a-zA-Z0-9_] but not ".".
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch <linux-arch@vger.kernel.org>
Fixes: 7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")
Link: http://lkml.kernel.org/r/1504382986-49301-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-02 13:09:45 -07:00
* ( . text . . refcount ) \
2008-01-28 20:21:15 +01:00
* ( . ref . text ) \
2021-07-30 19:31:08 -07:00
* ( . text . asan . * . text . tsan . * ) \
2018-05-09 22:59:58 +10:00
MEM_KEEP ( init . text * ) \
MEM_KEEP ( exit . text * ) \
2008-01-20 20:07:28 +01:00
2007-05-13 00:31:33 +02:00
2005-07-14 20:15:44 +00:00
/* sched.text is aligned to function alignment to ensure we have the same
* address even at the second ld pass when generating System.map */
2005-04-16 15:20:36 -07:00
# define SCHED_TEXT \
2005-07-14 20:15:44 +00:00
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__sched_text_start = . ; \
2005-04-16 15:20:36 -07:00
* ( . sched . text ) \
2018-05-09 16:23:51 +09:00
__sched_text_end = . ;
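Symbol pairs such as __sched_text_start and __sched_text_end let C code ask whether an address lies inside the bracketed section with two comparisons. The user-space sketch below imitates that check, assuming GCC or Clang on an ELF target where the linker provides `__start_<name>` and `__stop_<name>` for sections named like C identifiers. The section name `demo_sched_text` and all function names are hypothetical:

```c
#include <stdint.h>

/* Place one function in a dedicated section and keep it out of line. */
__attribute__((noinline, used, section("demo_sched_text")))
int demo_in_section(int x)
{
	return x + 1;
}

/* An ordinary function that stays in the default text section. */
__attribute__((noinline))
int demo_outside(int x)
{
	return x - 1;
}

/* Section bounds synthesized by the linker. */
extern char __start_demo_sched_text[];
extern char __stop_demo_sched_text[];

/* Return nonzero if addr falls inside [start, stop) of the demo section. */
int in_demo_sched_text(uintptr_t addr)
{
	return addr >= (uintptr_t)__start_demo_sched_text &&
	       addr <  (uintptr_t)__stop_demo_sched_text;
}
```

Passing `(uintptr_t)&demo_in_section` yields a nonzero result, while `(uintptr_t)&demo_outside` yields zero, because only the first function was emitted into the bracketed section.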
2005-04-16 15:20:36 -07:00
2005-07-14 20:15:44 +00:00
/* spinlock.text is aligned to function alignment to ensure we have the same
* address even at the second ld pass when generating System.map */
2005-04-16 15:20:36 -07:00
# define LOCK_TEXT \
2005-07-14 20:15:44 +00:00
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__lock_text_start = . ; \
2005-04-16 15:20:36 -07:00
* ( . spinlock . text ) \
2018-05-09 16:23:51 +09:00
__lock_text_end = . ;
2005-09-06 15:19:26 -07:00
# define KPROBES_TEXT \
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__kprobes_text_start = . ; \
2005-09-06 15:19:26 -07:00
* ( . kprobes . text ) \
2018-05-09 16:23:51 +09:00
__kprobes_text_end = . ;
2005-09-10 19:44:54 +02:00
2011-03-07 19:10:39 +01:00
# define ENTRY_TEXT \
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__entry_text_start = . ; \
2011-03-07 19:10:39 +01:00
* ( . entry . text ) \
2018-05-09 16:23:51 +09:00
__entry_text_end = . ;
2011-03-07 19:10:39 +01:00
2008-12-09 23:53:16 +01:00
# define IRQENTRY_TEXT \
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__irqentry_text_start = . ; \
2008-12-09 23:53:16 +01:00
* ( . irqentry . text ) \
2018-05-09 16:23:51 +09:00
__irqentry_text_end = . ;
2008-12-09 23:53:16 +01:00
2016-03-25 14:22:05 -07:00
# define SOFTIRQENTRY_TEXT \
ALIGN_FUNCTION ( ) ; \
2018-05-09 16:23:51 +09:00
__softirqentry_text_start = . ; \
2016-03-25 14:22:05 -07:00
* ( . softirqentry . text ) \
2018-05-09 16:23:51 +09:00
__softirqentry_text_end = . ;
2016-03-25 14:22:05 -07:00
2020-08-18 15:57:45 +02:00
# define STATIC_CALL_TEXT \
ALIGN_FUNCTION ( ) ; \
__static_call_text_start = . ; \
* ( . static_call . text ) \
__static_call_text_end = . ;
2008-02-19 21:00:18 +01:00
/* Section used for early init (in .S files) */
2018-05-09 22:59:58 +10:00
# define HEAD_TEXT KEEP(*(.head.text))
2008-02-19 21:00:18 +01:00
2009-06-14 22:10:41 +02:00
# define HEAD_TEXT_SECTION \
2009-06-07 20:46:37 +02:00
. head . text : AT ( ADDR ( . head . text ) - LOAD_OFFSET ) { \
HEAD_TEXT \
}
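Nearly every output section in this header is written as `name : AT ( ADDR ( name ) - LOAD_OFFSET )`. ADDR() yields the section's virtual address (VMA), and AT() sets the address at which the section is loaded (LMA), so the load address is simply the virtual address minus LOAD_OFFSET. A small arithmetic sketch; the helper name and the sample addresses are assumptions chosen for illustration, not values taken from any architecture:

```c
#include <stdint.h>

/* Load address (LMA) of a section, given its virtual address (VMA). */
static inline uint64_t section_lma(uint64_t vma, uint64_t load_offset)
{
	return vma - load_offset;
}
```

For a hypothetical LOAD_OFFSET of 0xffffffff80000000 and a section linked at 0xffffffff81000000, the section is loaded at 0x1000000. When LOAD_OFFSET is 0, the load address and the virtual address coincide.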
/*
* Exception table
*/
# define EXCEPTION_TABLE(align) \
. = ALIGN ( align ) ; \
__ex_table : AT ( ADDR ( __ex_table ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( __ex_table , ___ex_table ) \
2009-06-07 20:46:37 +02:00
}
2020-03-18 15:27:46 -07:00
/*
* . BTF
*/
# ifdef CONFIG_DEBUG_INFO_BTF
# define BTF \
. BTF : AT ( ADDR ( . BTF ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( . BTF , _BTF ) \
2020-07-11 23:53:23 +02:00
} \
. = ALIGN ( 4 ) ; \
. BTF_ids : AT ( ADDR ( . BTF_ids ) - LOAD_OFFSET ) { \
* ( . BTF_ids ) \
2020-03-18 15:27:46 -07:00
}
# else
# define BTF
# endif
2009-06-07 20:46:37 +02:00
/*
* Init task
*/
2009-06-23 18:53:15 -04:00
# define INIT_TASK_DATA_SECTION(align) \
2009-06-07 20:46:37 +02:00
. = ALIGN ( align ) ; \
2010-07-13 11:39:42 +02:00
. data . . init_task : AT ( ADDR ( . data . . init_task ) - LOAD_OFFSET ) { \
2009-06-23 18:53:15 -04:00
INIT_TASK_DATA ( align ) \
2009-06-07 20:46:37 +02:00
}
2008-02-19 21:00:18 +01:00
2009-06-17 16:28:03 -07:00
# ifdef CONFIG_CONSTRUCTORS
2009-06-30 11:41:13 -07:00
# define KERNEL_CTORS() . = ALIGN(8); \
2018-05-09 16:23:51 +09:00
__ctors_start = . ; \
2020-10-04 19:57:20 -07:00
KEEP ( * ( SORT ( . ctors . * ) ) ) \
2016-11-24 03:41:41 +11:00
KEEP ( * ( . ctors ) ) \
KEEP ( * ( SORT ( . init_array . * ) ) ) \
KEEP ( * ( . init_array ) ) \
2018-05-09 16:23:51 +09:00
__ctors_end = . ;
2009-06-17 16:28:03 -07:00
# else
# define KERNEL_CTORS()
# endif
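`KERNEL_CTORS()` gathers constructor function pointers between `__ctors_start` and `__ctors_end`. As a rough sketch of how such a region can be consumed, the code below walks a table of function pointers in order. The `demo_*` names are hypothetical, and a plain array stands in for the linker-provided region.

```c
#include <stddef.h>

typedef void (*demo_ctor_fn)(void);

/* Call every constructor in [start, end) in table order; return the count. */
static size_t demo_run_ctors(const demo_ctor_fn *start, const demo_ctor_fn *end)
{
    size_t n = 0;

    for (const demo_ctor_fn *fn = start; fn < end; fn++, n++)
        (*fn)();
    return n;
}

/* Two toy constructors that record the order in which they ran. */
static int demo_ctor_log[4];
static int demo_ctor_count;

static void demo_ctor_a(void) { demo_ctor_log[demo_ctor_count++] = 1; }
static void demo_ctor_b(void) { demo_ctor_log[demo_ctor_count++] = 2; }

/* Stand-in for the region between __ctors_start and __ctors_end. */
static const demo_ctor_fn demo_ctors[] = { demo_ctor_a, demo_ctor_b };
```

Because the macro emits the `SORT`ed inputs before the unsorted ones, the order seen by such a loop is fixed by the linker script rather than by link order alone.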
2008-01-20 14:15:03 +01:00
/* init and exit section handling */
2008-01-20 20:07:28 +01:00
# define INIT_DATA \
kbuild: allow archs to select link dead code/data elimination
Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to
select to build with -ffunction-sections, -fdata-sections, and link
with --gc-sections. It requires some work (documented) to ensure all
unreferenced entrypoints are live, and requires toolchain and build
verification, so it is made a per-arch option for now.
On a random powerpc64le build, this yields a significant size saving,
it boots and runs fine, but there is a lot I haven't tested as yet, so
these savings may be reduced if there are bugs in the link.
text data bss dec filename
11169741 1180744 1923176 14273661 vmlinux
10445269 1004127 1919707 13369103 vmlinux.dce
~700K text, ~170K data, 6% removed from kernel image size.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-08-24 22:29:20 +10:00
KEEP ( * ( SORT ( ___kentry + * ) ) ) \
2023-05-24 00:55:01 +08:00
*(.init.data .init.data.*) \
2018-05-09 22:59:58 +10:00
MEM_DISCARD ( init . data * ) \
2009-06-17 16:28:03 -07:00
KERNEL_CTORS ( ) \
2009-07-27 11:23:50 -07:00
MCOUNT_REC ( ) \
2018-05-09 22:59:58 +10:00
*(.init.rodata .init.rodata.*) \
tracing: Replace trace_event struct array with pointer array
Currently the trace_event structures are placed in the _ftrace_events
section, and at link time, the linker makes one large array of all
the trace_event structures. On boot up, this array is read (much like
the initcall sections) and the events are processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are supposed to be in an array.
A hack was used previously to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the _ftrace_event section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The _ftrace_event section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-27 09:15:30 -05:00
FTRACE_EVENTS ( ) \
tracing: Replace syscall_meta_data struct array with pointer array
Currently the syscall_meta structures for the syscall tracepoints are
placed in the __syscall_metadata section, and at link time, the linker
makes one large array of all these syscall metadata structures. On boot
up, this array is read (much like the initcall sections) and the syscall
data is processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are supposed to be in an array.
A hack was used previously to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the __syscall_metadata section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The __syscall_metadata section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 17:06:09 -05:00
TRACE_SYSCALLS ( ) \
2014-04-17 17:17:05 +09:00
KPROBE_BLACKLIST ( ) \
2018-01-13 02:55:03 +09:00
ERROR_INJECT_WHITELIST ( ) \
2010-12-22 11:57:26 -08:00
MEM_DISCARD ( init . rodata ) \
2013-01-04 12:30:52 +05:30
CLK_OF_TABLES ( ) \
2014-02-28 14:42:49 +01:00
RESERVEDMEM_OF_TABLES ( ) \
2017-05-26 18:33:27 +02:00
TIMER_OF_TABLES ( ) \
2013-10-30 18:21:09 -07:00
CPU_METHOD_OF_TABLES ( ) \
2015-02-02 16:32:45 +01:00
CPUIDLE_METHOD_OF_TABLES ( ) \
irqchip: add basic infrastructure
With the recent creation of the drivers/irqchip/ directory, it is
desirable to move irq controller drivers here. At the moment, the only
driver here is irq-bcm2835, the driver for the irq controller found in
the ARM BCM2835 SoC, present in Raspberry Pi systems. This irq
controller driver was exporting its initialization function and its
irq handling function through a header file in
<linux/irqchip/bcm2835.h>.
When proposing to also move another irq controller driver in
drivers/irqchip, Rob Herring raised the very valid point that moving
things to drivers/irqchip was good in order to remove more stuff from
arch/arm, but if it means adding gazillions of header files in
include/linux/irqchip/, it would not be very nice.
So, upon the suggestion of Rob Herring and Arnd Bergmann, this commit
introduces a small infrastructure that defines a central
irqchip_init() function in drivers/irqchip/irqchip.c, which is meant
to be called as the ->init_irq() callback of ARM platforms. This
function calls of_irq_init() with an array of match strings and init
functions generated from a special linker section.
Note that the irq controller driver initialization function is
responsible for setting the global handle_arch_irq() variable, so that
ARM platforms no longer have to define the ->handle_irq field in their
DT_MACHINE structure.
A global header, <linux/irqchip.h> is also added to expose the single
irqchip_init() function to the rest of the kernel.
A further commit moves the BCM2835 irq controller driver to this new
small infrastructure, therefore removing the include/linux/irqchip/
directory.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
[rob.herring: reword commit message to reflect use of linker sections.]
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
2012-11-20 23:00:52 +01:00
KERNEL_DTB ( ) \
2014-03-27 08:06:16 -05:00
IRQCHIP_OF_MATCH_TABLE ( ) \
2015-09-28 15:49:13 +01:00
ACPI_PROBE_TABLE ( irqchip ) \
2017-05-26 18:33:27 +02:00
ACPI_PROBE_TABLE ( timer ) \
2019-06-12 22:13:24 +02:00
THERMAL_TABLE ( governor ) \
2018-10-10 17:18:22 -07:00
EARLYCON_TABLE ( ) \
2019-08-19 17:17:37 -07:00
LSM_TABLE ( ) \
2020-08-04 13:47:41 -07:00
EARLY_LSM_TABLE ( ) \
KUNIT_TABLE ( )
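The two tracing commit messages above replace arrays of structures with arrays of pointers, because pointers have natural alignment and therefore pack tightly in a section. A minimal user-space sketch of that idea follows. It assumes an ELF target linked with GNU ld or LLD, which synthesize `__start_<name>` and `__stop_<name>` for sections whose name is a valid C identifier; the section name `demo_events` and all `demo_*` identifiers are hypothetical.

```c
#include <stddef.h>

/* A deliberately awkward structure; only its address goes in the section. */
struct demo_event {
    const char *name;
    int id;
    char pad[40];
};

/* Define an event and place a pointer to it in the "demo_events" section. */
#define DEMO_EVENT(sym, ident)                                      \
    static const struct demo_event sym = { #sym, ident, { 0 } };    \
    static const struct demo_event *const demo_ptr_##sym            \
        __attribute__((used, section("demo_events"))) = &sym

DEMO_EVENT(ev_open, 1);
DEMO_EVENT(ev_close, 2);
DEMO_EVENT(ev_read, 3);

/* Assumption: provided by the linker (GNU ld / LLD on ELF). */
extern const struct demo_event *const __start_demo_events[];
extern const struct demo_event *const __stop_demo_events[];

/* Number of pointers the linker collected into the section. */
static size_t demo_event_count(void)
{
    return (size_t)(__stop_demo_events - __start_demo_events);
}

/* Walk the pointer array and sum the ids of the events it refers to. */
static int demo_event_id_sum(void)
{
    int sum = 0;

    for (const struct demo_event *const *p = __start_demo_events;
         p < __stop_demo_events; p++)
        sum += (*p)->id;
    return sum;
}
```

The structures themselves can live anywhere with any padding; only the pointer-sized slots need to be contiguous, which is the property the commit messages rely on.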
2008-01-20 20:07:28 +01:00
# define INIT_TEXT \
2018-05-09 22:59:58 +10:00
*(.init.text .init.text.*) \
2016-07-14 12:07:29 -07:00
* ( . text . startup ) \
2018-05-09 22:59:58 +10:00
MEM_DISCARD ( init . text * )
2008-01-20 20:07:28 +01:00
# define EXIT_DATA \
2018-05-09 22:59:58 +10:00
*(.exit.data .exit.data.*) \
2018-09-13 12:59:59 +02:00
*(.fini_array .fini_array.*) \
*(.dtors .dtors.*) \
2018-05-09 22:59:58 +10:00
MEM_DISCARD ( exit . data * ) \
MEM_DISCARD ( exit . rodata * )
2008-01-20 14:15:03 +01:00
2008-01-20 20:07:28 +01:00
# define EXIT_TEXT \
* ( . exit . text ) \
2016-07-14 12:07:29 -07:00
* ( . text . exit ) \
2008-01-20 20:07:28 +01:00
MEM_DISCARD ( exit . text )
2008-01-20 14:15:03 +01:00
2009-06-14 22:10:41 +02:00
# define EXIT_CALL \
* ( . exitcall . exit )
2009-06-07 20:46:37 +02:00
/*
* bss ( Block Started by Symbol ) - uninitialized data
* zeroed during startup
*/
2009-07-12 18:23:33 -04:00
# define SBSS(sbss_align) \
. = ALIGN ( sbss_align ) ; \
2009-06-07 20:46:37 +02:00
. sbss : AT ( ADDR ( . sbss ) - LOAD_OFFSET ) { \
2017-05-12 03:40:40 +10:00
* ( . dynsbss ) \
2018-05-09 22:59:58 +10:00
* ( SBSS_MAIN ) \
2009-06-07 20:46:37 +02:00
* ( . scommon ) \
}
2012-08-14 11:08:00 -07:00
/*
* Allow architectures to redefine BSS_FIRST_SECTIONS to add extra
* sections to the front of bss .
*/
# ifndef BSS_FIRST_SECTIONS
# define BSS_FIRST_SECTIONS
# endif
2009-06-07 20:46:37 +02:00
# define BSS(bss_align) \
. = ALIGN ( bss_align ) ; \
. bss : AT ( ADDR ( . bss ) - LOAD_OFFSET ) { \
2012-08-14 11:08:00 -07:00
BSS_FIRST_SECTIONS \
2020-07-21 11:34:48 +02:00
. = ALIGN ( PAGE_SIZE ) ; \
2010-02-20 01:03:38 +01:00
* ( . bss . . page_aligned ) \
2020-07-21 11:34:48 +02:00
. = ALIGN ( PAGE_SIZE ) ; \
2009-06-07 20:46:37 +02:00
* ( . dynbss ) \
2017-07-26 22:46:27 +10:00
* ( BSS_MAIN ) \
2009-06-07 20:46:37 +02:00
* ( COMMON ) \
}
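The `. = ALIGN(...)` statements in `SBSS` and `BSS` advance the location counter to the next multiple of the given alignment. The arithmetic can be sketched in C as below; `demo_align_up` is a hypothetical helper, and it assumes the alignment is a power of two, as page and cache-line sizes are.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t demo_align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

For example, with a 4 KiB page size an address of 0x1001 moves to 0x2000, while 0x2000 is already aligned and stays where it is. This is why the second `ALIGN(PAGE_SIZE)` after `.bss..page_aligned` costs nothing when that input section is empty or ends on a page boundary.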
/*
* DWARF debug sections .
* Symbols in the DWARF debugging sections are relative to
* the beginning of the section so we begin them at 0.
*/
2005-09-10 19:44:54 +02:00
# define DWARF_DEBUG \
/* DWARF 1 */ \
. debug 0 : { * ( . debug ) } \
. line 0 : { * ( . line ) } \
/* GNU DWARF 1 extensions */ \
. debug_srcinfo 0 : { * ( . debug_srcinfo ) } \
. debug_sfnames 0 : { * ( . debug_sfnames ) } \
/* DWARF 1.1 and DWARF 2 */ \
. debug_aranges 0 : { * ( . debug_aranges ) } \
. debug_pubnames 0 : { * ( . debug_pubnames ) } \
/* DWARF 2 */ \
. debug_info 0 : { * ( . debug_info \
. gnu . linkonce . wi . * ) } \
. debug_abbrev 0 : { * ( . debug_abbrev ) } \
. debug_line 0 : { * ( . debug_line ) } \
. debug_frame 0 : { * ( . debug_frame ) } \
. debug_str 0 : { * ( . debug_str ) } \
. debug_loc 0 : { * ( . debug_loc ) } \
. debug_macinfo 0 : { * ( . debug_macinfo ) } \
2017-05-12 03:40:40 +10:00
. debug_pubtypes 0 : { * ( . debug_pubtypes ) } \
/* DWARF 3 */ \
. debug_ranges 0 : { * ( . debug_ranges ) } \
2005-09-10 19:44:54 +02:00
/* SGI/MIPS DWARF 2 extensions */ \
. debug_weaknames 0 : { * ( . debug_weaknames ) } \
. debug_funcnames 0 : { * ( . debug_funcnames ) } \
. debug_typenames 0 : { * ( . debug_typenames ) } \
. debug_varnames 0 : { * ( . debug_varnames ) } \
2017-05-12 03:40:40 +10:00
/* GNU DWARF 2 extensions */ \
. debug_gnu_pubnames 0 : { * ( . debug_gnu_pubnames ) } \
. debug_gnu_pubtypes 0 : { * ( . debug_gnu_pubtypes ) } \
/* DWARF 4 */ \
. debug_types 0 : { * ( . debug_types ) } \
/* DWARF 5 */ \
2021-02-05 12:22:18 -08:00
. debug_addr 0 : { * ( . debug_addr ) } \
. debug_line_str 0 : { * ( . debug_line_str ) } \
. debug_loclists 0 : { * ( . debug_loclists ) } \
2017-05-12 03:40:40 +10:00
. debug_macro 0 : { * ( . debug_macro ) } \
2021-02-05 12:22:18 -08:00
. debug_names 0 : { * ( . debug_names ) } \
. debug_rnglists 0 : { * ( . debug_rnglists ) } \
. debug_str_offsets 0 : { * ( . debug_str_offsets ) }
2005-09-10 19:44:54 +02:00
2020-08-21 12:42:45 -07:00
/* Stabs debugging sections. */
2005-09-10 19:44:54 +02:00
# define STABS_DEBUG \
. stab 0 : { * ( . stab ) } \
. stabstr 0 : { * ( . stabstr ) } \
. stab . excl 0 : { * ( . stab . excl ) } \
. stab . exclstr 0 : { * ( . stab . exclstr ) } \
. stab . index 0 : { * ( . stab . index ) } \
2020-08-21 12:42:45 -07:00
. stab . indexstr 0 : { * ( . stab . indexstr ) }
/* Required sections not related to debugging. */
# define ELF_DETAILS \
2020-08-21 12:42:46 -07:00
. comment 0 : { * ( . comment ) } \
. symtab 0 : { * ( . symtab ) } \
. strtab 0 : { * ( . strtab ) } \
. shstrtab 0 : { * ( . shstrtab ) }
2006-09-25 23:32:26 -07:00
2008-05-12 15:44:41 +02:00
# ifdef CONFIG_GENERIC_BUG
[PATCH] Generic BUG implementation
This patch adds common handling for kernel BUGs, for use by architectures as
they wish. The code is derived from arch/powerpc.
The advantages of having common BUG handling are:
- consistent BUG reporting across architectures
- shared implementation of out-of-line file/line data
- implement CONFIG_DEBUG_BUGVERBOSE consistently
This means that the inline impact of BUG is just the illegal instruction
itself, which is an improvement for i386 and x86-64.
A BUG is represented in the instruction stream as an illegal instruction,
which has file/line information associated with it. This extra information is
stored in the __bug_table section in the ELF file.
When the kernel gets an illegal instruction, it first confirms it might
possibly be from a BUG (ie, in kernel mode, the right illegal instruction).
It then calls report_bug(). This searches __bug_table for a matching
instruction pointer, and if found, prints the corresponding file/line
information. If report_bug() determines that it wasn't a BUG which caused the
trap, it returns BUG_TRAP_TYPE_NONE.
Some architectures (powerpc) implement WARN using the same mechanism; if the
illegal instruction was the result of a WARN, then report_bug() returns
BUG_TRAP_TYPE_WARN; otherwise it returns BUG_TRAP_TYPE_BUG.
lib/bug.c keeps a list of loaded modules which can be searched for __bug_table
entries. The architecture must call
module_bug_finalize()/module_bug_cleanup() from its corresponding
module_finalize/cleanup functions.
Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount.
At the very least, filename and line information will not be recorded for each
BUG, but architectures may decide to store no extra information per BUG at
all.
Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so
architectures will generally have to include an infinite loop (or similar) in
the BUG code, so that gcc knows execution won't continue beyond that point.
gcc does have a __builtin_trap() operator which may be useful to achieve the
same effect, unfortunately it cannot be used to actually implement the BUG
itself, because there's no way to get the instruction's address for use in
generating the __bug_table entry.
[randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors]
[bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h>]
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 02:36:19 -08:00
# define BUG_TABLE \
. = ALIGN ( 8 ) ; \
__bug_table : AT ( ADDR ( __bug_table ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( __bug_table , ___bug_table ) \
2006-12-08 02:36:19 -08:00
}
2008-05-12 15:44:41 +02:00
# else
# define BUG_TABLE
# endif
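The commit message for the generic BUG implementation describes a lookup: on a trap, search `__bug_table` for the faulting instruction pointer and classify the result. A simplified sketch of that flow is shown below. The `demo_*` types are hypothetical stand-ins; the kernel's own entry layout depends on architecture and configuration.

```c
#include <stddef.h>

enum demo_trap_type { DEMO_TRAP_NONE, DEMO_TRAP_WARN, DEMO_TRAP_BUG };

/* Hypothetical, simplified entry; the real layout is arch/config dependent. */
struct demo_bug_entry {
    unsigned long bug_addr;     /* address of the trapping instruction */
    const char *file;           /* out-of-line file name */
    unsigned short line;        /* out-of-line line number */
    unsigned short is_warning;  /* nonzero when the site is a WARN */
};

/* Linear scan of [start, stop) for an entry matching addr. */
static const struct demo_bug_entry *
demo_find_bug(const struct demo_bug_entry *start,
              const struct demo_bug_entry *stop, unsigned long addr)
{
    for (const struct demo_bug_entry *b = start; b < stop; b++)
        if (b->bug_addr == addr)
            return b;
    return NULL;
}

/* Classify a trap: not ours, a WARN, or a BUG. */
static enum demo_trap_type
demo_report_bug(const struct demo_bug_entry *start,
                const struct demo_bug_entry *stop, unsigned long addr)
{
    const struct demo_bug_entry *b = demo_find_bug(start, stop, addr);

    if (!b)
        return DEMO_TRAP_NONE;
    return b->is_warning ? DEMO_TRAP_WARN : DEMO_TRAP_BUG;
}
```

Keeping the file and line data in the table rather than at the call site is what limits the inline cost of a BUG to the trapping instruction itself.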
2006-12-08 02:36:19 -08:00
2017-10-13 15:02:00 -05:00
# ifdef CONFIG_UNWINDER_ORC
2017-07-24 18:36:57 -05:00
# define ORC_UNWIND_TABLE \
x86/unwind/orc: Add ELF section with ORC version identifier
Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC
metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in
two") changed the ORC format. Although ORC is internal to the kernel,
it's the only way for external tools to get reliable kernel stack traces
on x86-64. In particular, the drgn debugger [1] uses ORC for stack
unwinding, and these format changes broke it [2]. As the drgn
maintainer, I don't care how often or how much the kernel changes the
ORC format as long as I have a way to detect the change.
It suffices to store a version identifier in the vmlinux and kernel
module ELF files (to use when parsing ORC sections from ELF), and in
kernel memory (to use when parsing ORC from a core dump+symbol table).
Rather than hard-coding a version number that needs to be manually
bumped, Peterz suggested hashing the definitions from orc_types.h. If
there is a format change that isn't caught by this, the hashing script
can be updated.
This patch adds an .orc_header allocated ELF section containing the
20-byte hash to vmlinux and kernel modules, along with the corresponding
__start_orc_header and __stop_orc_header symbols in vmlinux.
1: https://github.com/osandov/drgn
2: https://github.com/osandov/drgn/issues/303
Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata")
Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
2023-06-13 14:14:56 -07:00
. orc_header : AT ( ADDR ( . orc_header ) - LOAD_OFFSET ) { \
BOUNDED_SECTION_BY ( . orc_header , _orc_header ) \
} \
2017-07-24 18:36:57 -05:00
. = ALIGN ( 4 ) ; \
. orc_unwind_ip : AT ( ADDR ( . orc_unwind_ip ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( . orc_unwind_ip , _orc_unwind_ip ) \
2017-07-24 18:36:57 -05:00
} \
2019-03-06 11:07:24 -06:00
. = ALIGN ( 2 ) ; \
2017-07-24 18:36:57 -05:00
. orc_unwind : AT ( ADDR ( . orc_unwind ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( . orc_unwind , _orc_unwind ) \
2017-07-24 18:36:57 -05:00
} \
2021-10-13 10:57:42 -07:00
text_size = _etext - _stext ; \
2017-07-24 18:36:57 -05:00
. = ALIGN ( 4 ) ; \
. orc_lookup : AT ( ADDR ( . orc_lookup ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
orc_lookup = . ; \
2021-10-13 10:57:42 -07:00
. += (((text_size + LOOKUP_BLOCK_SIZE - 1) / \
2017-07-24 18:36:57 -05:00
LOOKUP_BLOCK_SIZE ) + 1 ) * 4 ; \
2018-05-09 16:23:51 +09:00
orc_lookup_end = . ; \
2017-07-24 18:36:57 -05:00
}
# else
# define ORC_UNWIND_TABLE
# endif
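The `.orc_lookup` section is sized from the text size: one 4-byte slot per `LOOKUP_BLOCK_SIZE` bytes of text, rounded up, plus one extra slot. The sketch below reproduces that arithmetic in C. The helper name is hypothetical, and the block size of 256 used in the example values is an assumption for illustration rather than something stated in this file.

```c
#include <stddef.h>

/*
 * Bytes reserved for the lookup table:
 *   (ceil(text_size / block_size) + 1) * 4
 * The "+ 1" leaves room for one slot past the last block.
 */
static size_t demo_orc_lookup_bytes(size_t text_size, size_t block_size)
{
    size_t blocks = (text_size + block_size - 1) / block_size;

    return (blocks + 1) * 4;
}
```

With an assumed block size of 256, a 1 MiB text region needs 4096 blocks and therefore (4096 + 1) * 4 = 16388 bytes of lookup table.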
2021-10-21 08:58:38 -07:00
/* Built-in firmware blobs */
# ifdef CONFIG_FW_LOADER
# define FW_LOADER_BUILT_IN_DATA \
. builtin_fw : AT ( ADDR ( . builtin_fw ) - LOAD_OFFSET ) ALIGN ( 8 ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_PRE_LABEL ( . builtin_fw , _builtin_fw , __start , __end ) \
2021-10-21 08:58:38 -07:00
}
# else
# define FW_LOADER_BUILT_IN_DATA
# endif
2008-05-12 15:44:41 +02:00
# ifdef CONFIG_PM_TRACE
# define TRACEDATA \
. = ALIGN ( 4 ) ; \
. tracedata : AT ( ADDR ( . tracedata ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_POST_LABEL ( . tracedata , __tracedata , _start , _end ) \
2008-05-12 15:44:41 +02:00
}
# else
# define TRACEDATA
# endif
printk: Userspace format indexing support
We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:
$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-15 17:52:53 +01:00
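The index format quoted in the commit message above, `<level[,flags]> filename:line function "format"`, is meant to be machine readable. A small parser for one such line might look like the sketch below. The `demo_*` names and buffer sizes are hypothetical, and the code only handles numeric levels and file names without colons.

```c
#include <stdio.h>
#include <string.h>

struct demo_index_entry {
    int level;
    char file[128];
    int line;
    char func[64];
    char format[256];
};

/* Parse one index line into *out. Returns 0 on success, -1 on failure. */
static int demo_parse_index_line(const char *s, struct demo_index_entry *out)
{
    int consumed = 0;

    /* "<level", then skip any optional ",flags" up to the closing '>'. */
    if (sscanf(s, "<%d%n", &out->level, &consumed) != 1)
        return -1;
    const char *p = strchr(s + consumed, '>');
    if (!p)
        return -1;
    p++;

    /* "filename:line function", then whitespace before the format. */
    if (sscanf(p, " %127[^:]:%d %63s %n",
               out->file, &out->line, out->func, &consumed) != 3)
        return -1;
    p += consumed;

    /* The rest of the line is the quoted format; drop trailing newlines. */
    size_t len = strlen(p);
    while (len > 0 && p[len - 1] == '\n')
        len--;
    if (len < 2 || p[0] != '"' || p[len - 1] != '"')
        return -1;
    if (len - 2 >= sizeof(out->format))
        return -1;
    memcpy(out->format, p + 1, len - 2);
    out->format[len - 2] = '\0';
    return 0;
}
```

The header line that starts with `#` fails the first `sscanf` and is rejected, so a caller can feed every line of the file through the same function and skip the failures.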
# ifdef CONFIG_PRINTK_INDEX
# define PRINTK_INDEX \
. printk_index : AT ( ADDR ( . printk_index ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( . printk_index , _printk_index ) \
2021-06-15 17:52:53 +01:00
}
# else
# define PRINTK_INDEX
# endif
2022-12-27 03:45:37 +09:00
/*
* Discard . note . GNU - stack , which is emitted as PROGBITS by the compiler .
* Otherwise , the type of . notes section would become PROGBITS instead of NOTES .
vmlinux.lds.h: Discard .note.gnu.property section
When tooling reads ELF notes, it assumes each note entry is aligned to
the value listed in the .note section header's sh_addralign field.
The kernel-created ELF notes in the .note.Linux and .note.Xen sections
are aligned to 4 bytes. This causes the toolchain to set those
sections' sh_addralign values to 4.
On the other hand, the GCC-created .note.gnu.property section has an
sh_addralign value of 8 for some reason, despite being based on struct
Elf32_Nhdr which only needs 4-byte alignment.
When the mismatched input sections get linked together into the vmlinux
.notes output section, the higher alignment "wins", resulting in an
sh_addralign of 8, which confuses tooling. For example:
$ readelf -n .tmp_vmlinux.btf
...
readelf: .tmp_vmlinux.btf: Warning: note with invalid namesz and/or descsz found at offset 0x170
readelf: .tmp_vmlinux.btf: Warning: type: 0x4, namesize: 0x006e6558, descsize: 0x00008801, alignment: 8
In this case readelf thinks there's alignment padding where there is
none, so it starts reading an ELF note in the middle.
With newer toolchains (e.g., latest Fedora Rawhide), a similar mismatch
triggers a build failure when combined with CONFIG_X86_KERNEL_IBT:
btf_encoder__encode: btf__dedup failed!
Failed to encode BTF
libbpf: failed to find '.BTF' ELF section in vmlinux
FAILED: load BTF from vmlinux: No data available
make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 255
This latter error was caused by pahole crashing when it encountered the
corrupt .notes section. This crash has been fixed in dwarves version
1.25. As Tianyi Liu describes:
"Pahole reads .notes to look for LINUX_ELFNOTE_BUILD_LTO. When LTO is
enabled, pahole needs to call cus__merge_and_process_cu to merge
compile units, at which point there should only be one unspecified
type (used to represent some compilation information) in the global
context.
However, when the kernel is compiled without LTO, if pahole calls
cus__merge_and_process_cu due to alignment issues with notes,
multiple unspecified types may appear after merging the cus, and
older versions of pahole only support up to one. This is why pahole
1.24 crashes, while newer versions support multiple. However, the
latest version of pahole still does not solve the problem of
incorrect LTO recognition, so compiling the kernel may be slower
than normal."
Even with the newer pahole, the note section misalignment issue still
exists and pahole is misinterpreting the LTO note. Fix it by discarding
the .note.gnu.property section. While GNU properties are important for
user space (and VDSO), they don't seem to have any use for vmlinux.
(In fact, they're already getting (inadvertently) stripped from vmlinux
when CONFIG_DEBUG_INFO_BTF is enabled. The BTF data is extracted from
vmlinux.o with "objcopy --only-section=.BTF" into .btf.vmlinux.bin.o.
That file doesn't have .note.gnu.property, so when it gets modified and
linked back into the main object, the linker automatically strips it
(see "How GNU properties are merged" in the ld man page).)
Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lkml.kernel.org/bpf/57830c30-cd77-40cf-9cd1-3bb608aa602e@app.fastmail.com
Debugged-by: Tianyi Liu <i.pear@outlook.com>
Suggested-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Link: https://lore.kernel.org/r/20230418214925.ay3jpf2zhw75kgmd@treble
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-04-18 14:49:25 -07:00
*
* Also , discard . note . gnu . property , otherwise it forces the notes section to
* be 8 - byte aligned which causes alignment mismatches with the kernel ' s custom
* 4 - byte aligned notes .
2022-12-27 03:45:37 +09:00
*/
2006-09-25 23:32:26 -07:00
# define NOTES \
2023-04-18 14:49:25 -07:00
/ DISCARD / : { \
* ( . note . GNU - stack ) \
* ( . note . gnu . property ) \
} \
2007-07-19 01:48:36 -07:00
. notes : AT ( ADDR ( . notes ) - LOAD_OFFSET ) { \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_BY ( . note . * , _notes ) \
2019-10-29 14:13:31 -07:00
} NOTES_HEADERS \
NOTES_HEADERS_RESTORE
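The alignment mismatch described in the .note.gnu.property commit message can be illustrated with a little offset arithmetic. The sketch below assumes the usual ELF note layout (a 12-byte header of three 32-bit words, followed by the name and the descriptor, each padded to the note alignment). The helper names demo_align_up, demo_note_desc_offset and demo_note_next_offset are hypothetical and are not taken from readelf, pahole, or the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the note header: namesz, descsz and type, 4 bytes each. */
#define DEMO_NOTE_HDR_SIZE 12u

/* Round pos up to the next multiple of align (a power of two). */
size_t demo_align_up(size_t pos, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (pos + align - 1) & ~(align - 1);
}

/* Offset of the descriptor from the start of the note (hypothetical helper). */
size_t demo_note_desc_offset(size_t namesz, size_t align)
{
	return demo_align_up(DEMO_NOTE_HDR_SIZE + namesz, align);
}

/* Offset of the next note from the start of this one (hypothetical helper). */
size_t demo_note_next_offset(size_t namesz, size_t descsz, size_t align)
{
	return demo_align_up(demo_note_desc_offset(namesz, align) + descsz, align);
}
```

For an example note with namesz = 6 and descsz = 4 that was emitted with 4-byte alignment, the next note begins at offset 24. A reader that trusts an sh_addralign of 8 computes offset 32 instead and starts parsing in the middle of the following note, which is the kind of failure the readelf warnings in the commit message show.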
2006-10-27 11:41:44 -07:00
2009-06-07 20:46:37 +02:00
# define INIT_SETUP(initsetup_align) \
. = ALIGN ( initsetup_align ) ; \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_POST_LABEL ( . init . setup , __setup , _start , _end )
2009-06-07 20:46:37 +02:00
2012-03-26 12:50:51 +10:30
# define INIT_CALLS_LEVEL(level) \
2018-05-09 16:23:51 +09:00
__initcall # # level # # _start = . ; \
kbuild: allow archs to select link dead code/data elimination
Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to
select to build with -ffunction-sections, -fdata-sections, and link
with --gc-sections. It requires some work (documented) to ensure all
unreferenced entrypoints are live, and requires toolchain and build
verification, so it is made a per-arch option for now.
On a random powerpc64le build, this yields a significant size saving,
it boots and runs fine, but there is a lot I haven't tested as yet, so
these savings may be reduced if there are bugs in the link.
text data bss dec filename
11169741 1180744 1923176 14273661 vmlinux
10445269 1004127 1919707 13369103 vmlinux.dce
~700K text, ~170K data, 6% removed from kernel image size.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-08-24 22:29:20 +10:00
KEEP ( * ( . initcall # # level # # . init ) ) \
KEEP ( * ( . initcall # # level # # s . init ) ) \
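INIT_CALLS_LEVEL builds its symbol names by token pasting. The sketch below shows the same prefix ## level ## suffix pattern in plain C and turns the result into a string so it can be inspected. All DEMO_ and demo_ names are hypothetical. Only the symbol-name paste is reproduced here: pasting that produces a section name such as .initcall3.init is left out, because it does not form a single valid C preprocessing token.

```c
#include <assert.h>
#include <string.h>

/* Two-step stringification so the argument is macro-expanded first. */
#define DEMO_STR(x)  #x
#define DEMO_XSTR(x) DEMO_STR(x)

/* Same pasting pattern as INIT_CALLS_LEVEL: prefix ## level ## suffix. */
#define DEMO_START_SYM(level) __initcall##level##_start

/* Name generated for a numeric level. */
const char *demo_start_name_3(void)
{
	return DEMO_XSTR(DEMO_START_SYM(3));
}

/* Name generated for a non-numeric level such as rootfs. */
const char *demo_start_name_rootfs(void)
{
	return DEMO_XSTR(DEMO_START_SYM(rootfs));
}

/* Returns 1 when both generated names have the expected spelling. */
int demo_check_names(void)
{
	assert(strcmp(demo_start_name_3(), "__initcall3_start") == 0);
	assert(strcmp(demo_start_name_rootfs(), "__initcallrootfs_start") == 0);
	return 1;
}
```

Because the level is pasted as a token rather than evaluated, non-numeric levels such as rootfs work the same way as 0 to 7.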
2006-10-27 11:41:44 -07:00
2009-06-07 20:46:37 +02:00
# define INIT_CALLS \
2018-05-09 16:23:51 +09:00
__initcall_start = . ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( . initcallearly . init ) ) \
2012-03-26 12:50:51 +10:30
INIT_CALLS_LEVEL ( 0 ) \
INIT_CALLS_LEVEL ( 1 ) \
INIT_CALLS_LEVEL ( 2 ) \
INIT_CALLS_LEVEL ( 3 ) \
INIT_CALLS_LEVEL ( 4 ) \
INIT_CALLS_LEVEL ( 5 ) \
INIT_CALLS_LEVEL ( rootfs ) \
INIT_CALLS_LEVEL ( 6 ) \
INIT_CALLS_LEVEL ( 7 ) \
2018-05-09 16:23:51 +09:00
__initcall_end = . ;
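The start and end symbols that INIT_CALLS places around the initcall sections let C code treat the section contents as an array. The user-space sketch below imitates that pattern. It assumes an ELF target linked with GNU ld or lld, which provide __start_<name> and __stop_<name> symbols for sections whose names are valid C identifiers. The section name demo_initcalls and every demo_ identifier are hypothetical, and this is not the kernel's own initcall machinery.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_initcall_t)(void);

/* Place a pointer to fn into the demo_initcalls section. */
#define DEMO_INITCALL(fn) \
	static demo_initcall_t demo_initcall_##fn \
	__attribute__((used, section("demo_initcalls"))) = fn

int demo_calls; /* records which callbacks ran */

static int demo_a(void) { demo_calls += 1; return 0; }
static int demo_b(void) { demo_calls += 10; return 0; }

DEMO_INITCALL(demo_a);
DEMO_INITCALL(demo_b);

/* Section bounds provided by the linker (assumption: GNU ld or lld, ELF). */
extern demo_initcall_t __start_demo_initcalls[];
extern demo_initcall_t __stop_demo_initcalls[];

/* Call every registered function and return how many were called. */
int demo_run_initcalls(void)
{
	int n = 0;

	for (demo_initcall_t *fn = __start_demo_initcalls;
	     fn < __stop_demo_initcalls; fn++) {
		assert(*fn != NULL);
		(*fn)();
		n++;
	}
	return n;
}
```

The sketch makes no promise about the order of entries inside a single section. INIT_CALLS gets its ordering from listing one input section per level in sequence.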
2009-06-07 20:46:37 +02:00
# define CON_INITCALL \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_POST_LABEL ( . con_initcall . init , __con_initcall , _start , _end )
2009-06-07 20:46:37 +02:00
2020-08-04 13:47:41 -07:00
/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
# define KUNIT_TABLE() \
. = ALIGN ( 8 ) ; \
2022-10-22 16:56:36 -06:00
BOUNDED_SECTION_POST_LABEL ( . kunit_test_suites , __kunit_suites , _start , _end )
2020-08-04 13:47:41 -07:00
2009-06-07 20:46:37 +02:00
# ifdef CONFIG_BLK_DEV_INITRD
# define INIT_RAM_FS \
2010-10-26 14:22:30 -07:00
. = ALIGN ( 4 ) ; \
2018-05-09 16:23:51 +09:00
__initramfs_start = . ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( . init . ramfs ) ) \
initramfs: fix initramfs size calculation
The size of a built-in initramfs is calculated in init/initramfs.c by
"__initramfs_end - __initramfs_start". Those symbols are defined in the
linker script include/asm-generic/vmlinux.lds.h:
#define INIT_RAM_FS \
. = ALIGN(PAGE_SIZE); \
VMLINUX_SYMBOL(__initramfs_start) = .; \
*(.init.ramfs) \
VMLINUX_SYMBOL(__initramfs_end) = .;
If the initramfs file has an odd number of bytes, the "__initramfs_end"
symbol points to an odd address, for example, the symbols in the
System.map might look like:
0000000000572000 T __initramfs_start
00000000005bcd05 T __initramfs_end <-- odd address
At least on s390 this causes a problem:
Certain s390 instructions, especially instructions for loading addresses
(larl) or branch addresses must be on even addresses. The compiler loads
the symbol addresses with the "larl" instruction. This instruction sets
the last bit to 0 and, therefore, for odd size files, the calculated size
is one byte less than it should be:
0000000000540a9c <populate_rootfs>:
540a9c: eb cf f0 78 00 24 stmg %r12,%r15,120(%r15),
540aa2: c0 10 00 01 8a af larl %r1,572000 <__initramfs_start>
540aa8: c0 c0 00 03 e1 2e larl %r12,5bcd04 <initramfs_end>
(Instead of 5bcd05)
...
540abe: 1b c1 sr %r12,%r1
To fix the problem, this patch introduces the global variable
__initramfs_size, which is calculated in the "usr/initramfs_data.S" file.
The populate_rootfs() function can then use the start marker of the
.init.ramfs section and the value of __initramfs_size for loading the
initramfs. Because the start marker and size is sufficient, the
__initramfs_end symbol is no longer needed and is removed.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Michal Marek <mmarek@suse.cz>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2010-09-17 15:24:11 -07:00
. = ALIGN ( 8 ) ; \
2016-08-24 22:29:20 +10:00
KEEP ( * ( . init . ramfs . info ) )
2009-06-07 20:46:37 +02:00
# else
2009-06-22 15:32:31 +01:00
# define INIT_RAM_FS
2009-06-07 20:46:37 +02:00
# endif
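The off-by-one described in the initramfs commit message above can be reproduced with integer arithmetic. The sketch below models an address load that clears the lowest bit, which is how the message describes the s390 larl instruction, and then computes the size from the two symbol addresses quoted there. The function names are hypothetical, and the model only covers the bit-clearing effect.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an address load that forces the result to be even. */
uint64_t demo_load_even(uint64_t addr)
{
	return addr & ~(uint64_t)1;
}

/* Size as computed from start/end symbols that were loaded this way. */
uint64_t demo_size_from_symbols(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return demo_load_even(end) - demo_load_even(start);
}
```

With __initramfs_start at 0x572000 and __initramfs_end at 0x5bcd05, the real size is 0x4ad05 bytes, while demo_size_from_symbols() yields 0x4ad04. Storing the size in a separate __initramfs_size variable, as the commit does, avoids computing it from an odd end address.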
2017-10-20 09:30:57 -05:00
/*
* Memory encryption operates on a page basis . Since we need to clear
* the memory encryption mask for this section , it needs to be aligned
* on a page boundary and be a page - size multiple in length .
*
* Note : We use a separate section so that only this section gets
* decrypted to avoid exposing more than we wish .
*/
# ifdef CONFIG_AMD_MEM_ENCRYPT
# define PERCPU_DECRYPTED_SECTION \
. = ALIGN ( PAGE_SIZE ) ; \
* ( . data . . percpu . . decrypted ) \
. = ALIGN ( PAGE_SIZE ) ;
# else
# define PERCPU_DECRYPTED_SECTION
# endif
linker script: unify usage of discard definition
Discarded sections in different archs share some commonality but have
considerable differences. This led to linker script for each arch
implementing its own /DISCARD/ definition, which makes maintaining
tedious and adding new entries error-prone.
This patch makes all linker scripts move discard definitions to the
end of the linker script and use the common DISCARDS macro. As ld
uses the first matching section definition, archs can include default
discarded sections by including them earlier in the linker script.
ia64 is notable because it first throws away some ia64 specific
subsections and then includes the rest of the sections in the final
image, so those sections must be discarded before the inclusion.
defconfig compile tested for x86, x86-64, powerpc, powerpc64, ia64,
alpha, sparc, sparc64 and s390. Michal Simek tested microblaze.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: linux-arch@vger.kernel.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: microblaze-uclinux@itee.uq.edu.au
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Tony Luck <tony.luck@intel.com>
2009-07-09 11:27:40 +09:00
/*
* Default discarded sections .
*
* Some archs want to discard exit text / data at runtime rather than
* link time due to cross - section references such as alt instructions ,
* bug table , eh_frame , etc . DISCARDS must be the last of output
* section definitions so that such archs put those in earlier section
* definitions .
*/
2020-03-26 12:30:20 -07:00
# ifdef RUNTIME_DISCARD_EXIT
# define EXIT_DISCARDS
# else
# define EXIT_DISCARDS \
EXIT_TEXT \
EXIT_DATA
# endif
2020-08-21 12:42:44 -07:00
/*
2021-01-29 17:46:51 -07:00
* Clang ' s - fprofile - arcs , - fsanitize = kernel - address , and
* - fsanitize = thread produce unwanted sections ( . eh_frame
* and . init_array . * ) , but CONFIG_CONSTRUCTORS wants to
* keep any . init_array . * sections .
2020-08-21 12:42:44 -07:00
* https : //bugs.llvm.org/show_bug.cgi?id=46478
*/
2022-10-27 17:59:06 +02:00
# ifdef CONFIG_UNWIND_TABLES
# define DISCARD_EH_FRAME
# else
# define DISCARD_EH_FRAME *(.eh_frame)
# endif
2022-09-08 14:54:47 -07:00
# if defined(CONFIG_GCOV_KERNEL) || defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KCSAN)
2020-08-21 12:42:44 -07:00
# ifdef CONFIG_CONSTRUCTORS
# define SANITIZER_DISCARDS \
2022-10-27 17:59:06 +02:00
DISCARD_EH_FRAME
2020-08-21 12:42:44 -07:00
# else
# define SANITIZER_DISCARDS \
* ( . init_array ) * ( . init_array . * ) \
2022-10-27 17:59:06 +02:00
DISCARD_EH_FRAME
2020-08-21 12:42:44 -07:00
# endif
# else
# define SANITIZER_DISCARDS
# endif
2020-08-21 12:42:42 -07:00
# define COMMON_DISCARDS \
2020-08-21 12:42:44 -07:00
SANITIZER_DISCARDS \
2022-09-03 15:11:53 +02:00
PATCHABLE_DISCARDS \
2020-08-21 12:42:42 -07:00
* ( . discard ) \
* ( . discard . * ) \
kbuild: generate KSYMTAB entries by modpost
Commit 7b4537199a4a ("kbuild: link symbol CRCs at final link, removing
CONFIG_MODULE_REL_CRCS") made modpost output CRCs in the same way
whether the EXPORT_SYMBOL() is placed in *.c or *.S.
For further cleanups, this commit applies a similar approach to the
entire data structure of EXPORT_SYMBOL().
The EXPORT_SYMBOL() compilation is split into two stages.
When a source file is compiled, EXPORT_SYMBOL() will be converted into
a dummy symbol in the .export_symbol section.
For example,
EXPORT_SYMBOL(foo);
EXPORT_SYMBOL_NS_GPL(bar, BAR_NAMESPACE);
will be encoded into the following assembly code:
.section ".export_symbol","a"
__export_symbol_foo:
.asciz "" /* license */
.asciz "" /* name space */
.balign 8
.quad foo /* symbol reference */
.previous
.section ".export_symbol","a"
__export_symbol_bar:
.asciz "GPL" /* license */
.asciz "BAR_NAMESPACE" /* name space */
.balign 8
.quad bar /* symbol reference */
.previous
They are mere markers to tell modpost the name, license, and namespace
of the symbols. They will be dropped from the final vmlinux and modules
because the *(.export_symbol) will go into /DISCARD/ in the linker script.
Then, modpost extracts all the information about EXPORT_SYMBOL() from the
.export_symbol section, and generates the final C code:
KSYMTAB_FUNC(foo, "", "");
KSYMTAB_FUNC(bar, "_gpl", "BAR_NAMESPACE");
KSYMTAB_FUNC() (or KSYMTAB_DATA() if it is data) is expanded to struct
kernel_symbol that will be linked to the vmlinux or a module.
With this change, EXPORT_SYMBOL() works in the same way for *.c and *.S
files, providing the following benefits.
[1] Deprecate EXPORT_DATA_SYMBOL()
In the old days, EXPORT_SYMBOL() was only available in C files. To export
a symbol in *.S, EXPORT_SYMBOL() was placed in a separate *.c file.
arch/arm/kernel/armksyms.c is one example written in the classic manner.
Commit 22823ab419d8 ("EXPORT_SYMBOL() for asm") removed this limitation.
Since then, EXPORT_SYMBOL() can be placed close to the symbol definition
in *.S files. It was a nice improvement.
However, as that commit mentioned, you need to use EXPORT_DATA_SYMBOL()
for data objects on some architectures.
In the new approach, modpost checks symbol's type (STT_FUNC or not),
and outputs KSYMTAB_FUNC() or KSYMTAB_DATA() accordingly.
There are only two users of EXPORT_DATA_SYMBOL:
EXPORT_DATA_SYMBOL_GPL(empty_zero_page) (arch/ia64/kernel/head.S)
EXPORT_DATA_SYMBOL(ia64_ivt) (arch/ia64/kernel/ivt.S)
They are transformed as follows and output into .vmlinux.export.c
KSYMTAB_DATA(empty_zero_page, "_gpl", "");
KSYMTAB_DATA(ia64_ivt, "", "");
The other EXPORT_SYMBOL users in ia64 assembly are output as
KSYMTAB_FUNC().
EXPORT_DATA_SYMBOL() is now deprecated.
[2] merge <linux/export.h> and <asm-generic/export.h>
There are two similar header implementations:
include/linux/export.h for .c files
include/asm-generic/export.h for .S files
Ideally, the functionality should be consistent between them, but they
tend to diverge.
Commit 8651ec01daed ("module: add support for symbol namespaces.") did
not support the namespace for *.S files.
This commit shifts the essential implementation part to C, which supports
EXPORT_SYMBOL_NS() for *.S files.
<asm/export.h> and <asm-generic/export.h> will remain as a wrapper of
<linux/export.h> for a while.
They will be removed after #include <asm/export.h> directives are all
replaced with #include <linux/export.h>.
[3] Implement CONFIG_TRIM_UNUSED_KSYMS in one-pass algorithm (by a later commit)
When CONFIG_TRIM_UNUSED_KSYMS is enabled, Kbuild recursively traverses
the directory tree to determine which EXPORT_SYMBOL to trim. If an
EXPORT_SYMBOL turns out to be unused by anyone, Kbuild begins the
second traverse, where some source files are recompiled with their
EXPORT_SYMBOL() tuned into a no-op.
We can do this better now; modpost can selectively emit KSYMTAB entries
that are really used by modules.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-06-12 00:50:52 +09:00
* ( . export_symbol ) \
2020-08-21 12:42:43 -07:00
* ( . modinfo ) \
/* ld.bfd warns about .gnu.version* even when not emitted */ \
* ( . gnu . version * ) \
2020-08-21 12:42:42 -07:00
2009-06-24 15:13:38 +09:00
# define DISCARDS \
/ DISCARD / : { \
2020-03-26 12:30:20 -07:00
EXIT_DISCARDS \
2009-07-09 11:27:40 +09:00
EXIT_CALL \
2020-08-21 12:42:42 -07:00
COMMON_DISCARDS \
2009-06-24 15:13:38 +09:00
}
2011-04-04 01:41:32 +02:00
/**
* PERCPU_INPUT - the percpu input sections
* @ cacheline : cacheline size
*
* The core percpu section names and core symbols which do not rely
* directly upon load addresses .
*
* @ cacheline is used to align subsections to avoid false cacheline
* sharing between subsections for different purposes .
*/
# define PERCPU_INPUT(cacheline) \
2018-05-09 16:23:51 +09:00
__per_cpu_start = . ; \
2011-04-04 01:41:32 +02:00
* ( . data . . percpu . . first ) \
. = ALIGN ( PAGE_SIZE ) ; \
* ( . data . . percpu . . page_aligned ) \
. = ALIGN ( cacheline ) ; \
2014-07-01 12:11:47 -07:00
* ( . data . . percpu . . read_mostly ) \
2011-04-04 01:41:32 +02:00
. = ALIGN ( cacheline ) ; \
* ( . data . . percpu ) \
* ( . data . . percpu . . shared_aligned ) \
2017-10-20 09:30:57 -05:00
PERCPU_DECRYPTED_SECTION \
2018-05-09 16:23:51 +09:00
__per_cpu_end = . ;
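The effect of the ALIGN() steps in PERCPU_INPUT can be followed by advancing a location counter by hand. The sketch below is a simplified model: it covers the first, page_aligned, read_mostly and plain subsections, and leaves out shared_aligned and the decrypted section. The struct, the function names and the sizes in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offsets (relative to __per_cpu_start) where each modelled subsection begins. */
struct demo_percpu_layout {
	size_t page_aligned;
	size_t read_mostly;
	size_t plain;
	size_t end;
};

/* Equivalent of ". = ALIGN(align)" for a power-of-two align. */
size_t demo_align(size_t pos, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (pos + align - 1) & ~(align - 1);
}

/* Walk the subsections in the order PERCPU_INPUT lists them. */
struct demo_percpu_layout demo_percpu_walk(size_t first_sz, size_t page_aligned_sz,
					   size_t read_mostly_sz, size_t plain_sz,
					   size_t page_size, size_t cacheline)
{
	struct demo_percpu_layout l;
	size_t pos = first_sz;			/* .data..percpu..first */

	pos = demo_align(pos, page_size);
	l.page_aligned = pos;
	pos += page_aligned_sz;			/* .data..percpu..page_aligned */

	pos = demo_align(pos, cacheline);
	l.read_mostly = pos;
	pos += read_mostly_sz;			/* .data..percpu..read_mostly */

	pos = demo_align(pos, cacheline);
	l.plain = pos;
	pos += plain_sz;			/* .data..percpu */

	l.end = pos;
	return l;
}
```

With made-up sizes of 8, 4096, 10 and 100 bytes, a 4096-byte page and a 64-byte cacheline, the read_mostly data starts at offset 8192 and the plain percpu data at 8256. The two groups therefore never share a cacheline, which is the stated purpose of the @cacheline alignment.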
2011-04-04 01:41:32 +02:00
2009-01-13 20:41:35 +09:00
/**
2009-01-19 12:21:28 +09:00
* PERCPU_VADDR - define output section for percpu area
2011-01-25 14:26:50 +01:00
* @ cacheline : cacheline size
2009-01-13 20:41:35 +09:00
* @ vaddr : explicit base address ( optional )
* @ phdr : destination PHDR ( optional )
*
2011-01-25 14:26:50 +01:00
* Macro which expands to output section for percpu area .
*
* @ cacheline is used to align subsections to avoid false cacheline
* sharing between subsections for different purposes .
*
* If @ vaddr is not blank , it specifies explicit base address and all
* percpu symbols will be offset from the given address . If blank ,
* @ vaddr always equals @ laddr + LOAD_OFFSET .
2009-01-13 20:41:35 +09:00
*
* @ phdr defines the output PHDR to use if not blank . Be warned that
* output PHDR is sticky . If @ phdr is specified , the next output
* section in the linker script will go there too . @ phdr should have
* a leading colon .
*
2009-01-30 16:32:22 +09:00
* Note that this macro defines __per_cpu_load as an absolute symbol .
* If there is no need to put the percpu section at a predetermined
2011-03-24 18:50:09 +01:00
* address , use PERCPU_SECTION .
2009-01-13 20:41:35 +09:00
*/
2011-01-25 14:26:50 +01:00
# define PERCPU_VADDR(cacheline, vaddr, phdr) \
2018-05-09 16:23:51 +09:00
__per_cpu_load = . ; \
. data . . percpu vaddr : AT ( __per_cpu_load - LOAD_OFFSET ) { \
2011-04-04 01:41:32 +02:00
PERCPU_INPUT ( cacheline ) \
2009-01-19 12:21:28 +09:00
} phdr \
2018-05-09 16:23:51 +09:00
. = __per_cpu_load + SIZEOF ( . data . . percpu ) ;
2009-01-13 20:41:35 +09:00
/**
2011-03-24 18:50:09 +01:00
* PERCPU_SECTION - define output section for percpu area , simple version
2011-01-25 14:26:50 +01:00
* @ cacheline : cacheline size
2009-01-13 20:41:35 +09:00
*
2011-03-24 18:50:09 +01:00
* Aligns to PAGE_SIZE and emits the output section for the percpu area . This
* macro doesn ' t manipulate @ vaddr or @ phdr and __per_cpu_load and
2009-01-13 20:41:35 +09:00
* __per_cpu_start will be identical .
2009-01-30 16:32:22 +09:00
*
2011-03-24 18:50:09 +01:00
* This macro is equivalent to ALIGN ( PAGE_SIZE ) ; PERCPU_VADDR ( @ cacheline , , )
2011-01-25 14:26:50 +01:00
* except that __per_cpu_load is defined as a relative symbol against
* . data . . percpu which is required for relocatable x86_32 configuration .
2009-01-13 20:41:35 +09:00
*/
2011-03-24 18:50:09 +01:00
# define PERCPU_SECTION(cacheline) \
. = ALIGN ( PAGE_SIZE ) ; \
2010-02-20 01:03:43 +01:00
. data . . percpu : AT ( ADDR ( . data . . percpu ) - LOAD_OFFSET ) { \
2018-05-09 16:23:51 +09:00
__per_cpu_load = . ; \
2011-04-04 01:41:32 +02:00
PERCPU_INPUT ( cacheline ) \
2009-01-30 16:32:22 +09:00
}
2009-06-07 20:46:37 +02:00
/*
* Definition of the high-level *_SECTION macros.
* They fit only a subset of the architectures.
*/
/*
* Writeable data.
* All sections are combined in a single .data section.
* The sections following CONSTRUCTORS are arranged so their
* typical alignment matches.
* A cacheline is typically/always smaller than PAGE_SIZE, so
* the sections that have this restriction (or similar)
* are located before the ones requiring PAGE_SIZE alignment.
* NOSAVE_DATA starts and ends with a PAGE_SIZE alignment which
2011-03-30 22:57:33 -03:00
* matches the requirement of PAGE_ALIGNED_DATA.
2009-06-07 20:46:37 +02:00
*
2009-06-14 22:10:41 +02:00
* Use 0 as @pagealigned if page-aligned data is not used. */
2019-10-29 14:13:35 -07:00
# define RW_DATA(cacheline, pagealigned, inittask) \
2009-06-07 20:46:37 +02:00
. = ALIGN(PAGE_SIZE); \
.data : AT(ADDR(.data) - LOAD_OFFSET) { \
2009-06-23 18:53:15 -04:00
INIT_TASK_DATA ( inittask ) \
2009-09-24 10:36:16 -04:00
NOSAVE_DATA \
PAGE_ALIGNED_DATA ( pagealigned ) \
2009-06-07 20:46:37 +02:00
CACHELINE_ALIGNED_DATA ( cacheline ) \
READ_MOSTLY_DATA ( cacheline ) \
DATA_DATA \
CONSTRUCTORS \
2017-02-25 08:56:53 +01:00
} \
2017-07-24 18:36:57 -05:00
BUG_TABLE
2009-06-07 20:46:37 +02:00
# define INIT_TEXT_SECTION(inittext_align) \
. = ALIGN(inittext_align); \
.init.text : AT(ADDR(.init.text) - LOAD_OFFSET) { \
2018-05-09 16:23:51 +09:00
_sinittext = . ; \
2009-06-07 20:46:37 +02:00
INIT_TEXT \
2018-05-09 16:23:51 +09:00
_einittext = . ; \
2009-06-07 20:46:37 +02:00
}
# define INIT_DATA_SECTION(initsetup_align) \
.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { \
INIT_DATA \
INIT_SETUP ( initsetup_align ) \
INIT_CALLS \
CON_INITCALL \
INIT_RAM_FS \
}
2009-07-12 18:23:33 -04:00
# define BSS_SECTION(sbss_align, bss_align, stop_align) \
. = ALIGN ( sbss_align ) ; \
2018-05-09 16:23:51 +09:00
__bss_start = . ; \
2009-07-12 18:23:33 -04:00
SBSS ( sbss_align ) \
2009-06-07 20:46:37 +02:00
BSS ( bss_align ) \
2009-07-12 18:23:33 -04:00
. = ALIGN ( stop_align ) ; \
2018-05-09 16:23:51 +09:00
__bss_stop = . ;
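/*
* Usage sketch for BSS_SECTION. This is illustrative only: the zero
* alignments repeat the sample at the top of this file, and the C
* fragment is a hypothetical consumer, not code from this header.
*
* In the linker script:
*
*	BSS_SECTION(0, 0, 0)
*	_end = .;
*
* Early C code can then clear the region delimited by the two symbols:
*
*	extern char __bss_start[], __bss_stop[];
*
*	memset(__bss_start, 0, __bss_stop - __bss_start);
*/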