diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef82d575..6367bba32d22 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@
- Paul
- Rusty
+ Rusty
Russell
@@ -20,7 +19,7 @@
- 2001
+ 2005
Rusty Russell
@@ -64,7 +63,7 @@
Introduction
- Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
+ Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and
general requirements for kernel code: its goal is to serve as a
primer for Linux kernel development for experienced C
@@ -96,13 +95,13 @@
- not associated with any process, serving a softirq, tasklet or bh;
+ not associated with any process, serving a softirq or tasklet;
- running in kernel space, associated with a process;
+ running in kernel space, associated with a process (user context);
@@ -114,11 +113,12 @@
- There is a strict ordering between these: other than the last
- category (userspace) each can only be pre-empted by those above.
- For example, while a softirq is running on a CPU, no other
- softirq will pre-empt it, but a hardware interrupt can. However,
- any other CPUs in the system execute independently.
+ There is an ordering between these. The bottom two can preempt
+ each other, but above that is a strict hierarchy: each can only be
+ preempted by the ones above it. For example, while a softirq is
+ running on a CPU, no other softirq will preempt it, but a hardware
+ interrupt can. However, any other CPUs in the system execute
+ independently.
@@ -130,10 +130,10 @@
User Context
- User context is when you are coming in from a system call or
- other trap: you can sleep, and you own the CPU (except for
- interrupts) until you call schedule().
- In other words, user context (unlike userspace) is not pre-emptable.
+ User context is when you are coming in from a system call or other
+ trap: like userspace, you can be preempted by more important tasks
+ and by interrupts. You can sleep, by calling
+ schedule().
@@ -153,7 +153,7 @@
- Beware that if you have interrupts or bottom halves disabled
+ Beware that if you have preemption or softirqs disabled
(see below), in_interrupt() will return a
false positive.
@@ -168,10 +168,10 @@
keyboard are examples of real
hardware which produce interrupts at any time. The kernel runs
interrupt handlers, which services the hardware. The kernel
- guarantees that this handler is never re-entered: if another
+ guarantees that this handler is never re-entered: if the same
interrupt arrives, it is queued (or dropped). Because it
disables interrupts, this handler has to be fast: frequently it
- simply acknowledges the interrupt, marks a `software interrupt'
+ simply acknowledges the interrupt, marks a 'software interrupt'
for execution and exits.
@@ -188,60 +188,52 @@
- Software Interrupt Context: Bottom Halves, Tasklets, softirqs
+ Software Interrupt Context: Softirqs and Tasklets
Whenever a system call is about to return to userspace, or a
- hardware interrupt handler exits, any `software interrupts'
+ hardware interrupt handler exits, any 'software interrupts'
which are marked pending (usually by hardware interrupts) are
run (kernel/softirq.c).
Much of the real interrupt handling work is done here. Early in
- the transition to SMP, there were only `bottom
+ the transition to SMP, there were only 'bottom
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
after we switched from wind-up computers made of match-sticks and snot,
- we abandoned this limitation.
+ we abandoned this limitation and switched to 'softirqs'.
lists the
- different BH's. No matter how many CPUs you have, no two BHs will run at
- the same time. This made the transition to SMP simpler, but sucks hard for
- scalable performance. A very important bottom half is the timer
- BH (): you
- can register to have it call functions for you in a given length of time.
+ different softirqs. A very important softirq is the
+ timer softirq (): you can
+ register to have it call functions for you in a given length of
+ time.
- 2.3.43 introduced softirqs, and re-implemented the (now
- deprecated) BHs underneath them. Softirqs are fully-SMP
- versions of BHs: they can run on as many CPUs at once as
- required. This means they need to deal with any races in shared
- data using their own locks. A bitmask is used to keep track of
- which are enabled, so the 32 available softirqs should not be
- used up lightly. (Yes, people will
- notice).
-
-
-
- tasklets ()
- are like softirqs, except they are dynamically-registrable (meaning you
- can have as many as you want), and they also guarantee that any tasklet
- will only run on one CPU at any time, although different tasklets can
- run simultaneously (unlike different BHs).
+ Softirqs are often a pain to deal with, since the same softirq
+ can run simultaneously on more than one CPU. For this reason,
+ tasklets () are more
+ often used: they are dynamically-registrable (meaning you can have
+ as many as you want), and they also guarantee that any tasklet
+ will only run on one CPU at any time, although different tasklets
+ can run simultaneously.
- The name `tasklet' is misleading: they have nothing to do with `tasks',
+ The name 'tasklet' is misleading: they have nothing to do with 'tasks',
and probably more to do with some bad vodka Alexey Kuznetsov had at the
time.
- You can tell you are in a softirq (or bottom half, or tasklet)
+ You can tell you are in a softirq (or tasklet)
using the in_softirq() macro
().
@@ -288,11 +280,10 @@
A rigid stack limit
- The kernel stack is about 6K in 2.2 (for most
- architectures: it's about 14K on the Alpha), and shared
- with interrupts so you can't use it all. Avoid deep
- recursion and huge local arrays on the stack (allocate
- them dynamically instead).
+ Depending on configuration options, the kernel stack is about 3K to 6K for most 32-bit architectures: it's
+ about 14K on most 64-bit archs, and often shared with interrupts
+ so you can't use it all. Avoid deep recursion and huge local
+ arrays on the stack (allocate them dynamically instead).
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
If all your routine does is read or write some parameter, consider
- implementing a sysctl interface instead.
+ implementing a sysfs interface instead.
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
- You will eventually lock up your box if you break these rules.
+ You should always compile your kernel with
+ CONFIG_DEBUG_SPINLOCK_SLEEP on, and it will warn
+ you if you break these rules. If you do break
+ the rules, you will eventually lock up your box.
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
success).
- [Yes, this moronic interface makes me cringe. Please submit a
- patch and become my hero --RR.]
+ [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
The functions may sleep implicitly. This should never be called
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- If you see a kmem_grow: Called nonatomically from int
- warning message you called a memory allocation function
- from interrupt context without GFP_ATOMIC.
- You should really fix that. Run, don't walk.
+ If you see a sleeping function called from invalid
+ context warning message, then maybe you called a
+ sleeping allocation function from interrupt context without
+ GFP_ATOMIC. You should really fix that.
+ Run, don't walk.
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- udelay()/mdelay()
+ mdelay()/udelay()
- The udelay() function can be used for small pauses.
- Do not use large values with udelay() as you risk
+ The udelay() and ndelay() functions can be used for small pauses.
+ Do not use large values with them as you risk
overflow - the helper function mdelay() is useful
- here, or even consider schedule_timeout().
+ here, or consider msleep().
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
These routines disable soft interrupts on the local CPU, and
restore them. They are reentrant; if soft interrupts were
disabled before, they will still be disabled after this pair
- of functions has been called. They prevent softirqs, tasklets
- and bottom halves from running on the current CPU.
+ of functions has been called. They prevent softirqs and tasklets
+ from running on the current CPU.
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- smp_processor_id() returns the current
- processor number, between 0 and NR_CPUS (the
- maximum number of CPUs supported by Linux, currently 32). These
- values are not necessarily continuous.
+ get_cpu() disables preemption (so you won't
+ suddenly get moved to another CPU) and returns the current
+ processor number, between 0 and NR_CPUS. Note
+ that the CPU numbers are not necessarily contiguous. You return
+ it again with put_cpu() when you are done.
+
+
+ If you know you cannot be preempted by another task (i.e. you are
+ in interrupt context, or have preemption disabled) you can use
+ smp_processor_id().
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
After boot, the kernel frees up a special section; functions
marked with __init and data structures marked with
- __initdata are dropped after boot is complete (within
- modules this directive is currently ignored). __exit
+ __initdata are dropped after boot is complete: similarly
+ modules discard this memory after initialization. __exit
is used to declare a function which is only required on exit: the
function will be dropped if this file is not compiled as a module.
See the header file for use. Note that it makes no sense for a function
marked with __init to be exported to modules with
EXPORT_SYMBOL() - this will break.
-
- Static data structures marked as __initdata must be initialised
- (as opposed to ordinary static data which is zeroed BSS) and cannot be
- const.
-
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
The function can return a negative error number to cause
module loading to fail (unfortunately, this has no effect if
- the module is compiled into the kernel). For modules, this is
- called in user context, with interrupts enabled, and the
- kernel lock held, so it can sleep.
+ the module is compiled into the kernel). This function is
+ called in user context with interrupts enabled, so it can sleep.
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
reached zero. This function can also sleep, but cannot fail:
everything must be cleaned up by the time it returns.
+
+
+ Note that this macro is optional: if it is not present, your
+ module will not be removable (except for 'rmmod -f').
+
+
+
+
+ try_module_get()/module_put()
+
+
+
+ These manipulate the module usage count, to protect against
+ removal (a module also can't be removed if another module uses one
+ of its exported symbols: see below). Before calling into module
+ code, you should call try_module_get() on
+ that module: if it fails, then the module is being removed and you
+ should act as if it wasn't there. Otherwise, you can safely enter
+ the module, and call module_put() when you're
+ finished.
+
+
+
+ Most registrable structures have an
+ owner field, such as in the
+ file_operations structure. Set this field
+ to the macro THIS_MODULE.
+
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
There is a macro to do this:
wait_event_interruptible()
- The
+ The
first argument is the wait queue head, and the second is an
expression which is evaluated; the macro returns
0 when this expression is true, or
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
Call wake_up()
- ;,
+ ;,
which will wake up every process in the queue. The exception is
if one has TASK_EXCLUSIVE set, in which case
- the remainder of the queue will not be woken.
+ the remainder of the queue will not be woken. There are other variants
+ of this basic function available in the same header.
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
first class of operations work on atomic_t
; this
- contains a signed integer (at least 24 bits long), and you must use
+ contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read atomic_t variables.
atomic_read() and
atomic_set() get and set the counter,
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
Note that these functions are slower than normal arithmetic, and
- so should not be used unnecessarily. On some platforms they
- are much slower, like 32-bit Sparc where they use a spinlock.
+ so should not be used unnecessarily.
- The second class of atomic operations is atomic bit operations on a
- long, defined in
+ The second class of atomic operations is atomic bit operations on an
+ unsigned long, defined in
. These
operations generally take a pointer to the bit pattern, and a bit
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
test_and_clear_bit() and
test_and_change_bit() do the same thing,
except return true if the bit was previously set; these are
- particularly useful for very simple locking.
+ particularly useful for atomically setting flags.
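The flag-setting use mentioned in this hunk can be illustrated outside the kernel. The following is only a userspace sketch: the `sketch_` names are hypothetical, and GCC/Clang `__atomic` builtins stand in for the kernel's architecture-specific implementations. It shows the "returns true if the bit was previously set" contract and how that lets exactly one caller claim a flag.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for BITS_PER_LONG. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Atomically set bit nr in the bitmap at addr; return whether it was set before. */
static bool sketch_test_and_set_bit(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long old;

	addr += nr / SKETCH_BITS_PER_LONG;
	old = __atomic_fetch_or(addr, mask, __ATOMIC_SEQ_CST);
	return (old & mask) != 0;
}

/* Atomically clear bit nr in the bitmap at addr; return whether it was set before. */
static bool sketch_test_and_clear_bit(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long old;

	addr += nr / SKETCH_BITS_PER_LONG;
	old = __atomic_fetch_and(addr, ~mask, __ATOMIC_SEQ_CST);
	return (old & mask) != 0;
}

/* Only the first caller sees "previously clear", so only it claims the flag. */
static void sketch_flag_demo(void)
{
	unsigned long flags = 0;

	assert(!sketch_test_and_set_bit(3, &flags)); /* first caller wins */
	assert(sketch_test_and_set_bit(3, &flags));  /* later callers see it taken */
	assert(sketch_test_and_clear_bit(3, &flags));
	assert(flags == 0);
}
```

In real kernel code you would call test_and_set_bit() and friends directly on an unsigned long rather than reimplementing them.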
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
than BITS_PER_LONG. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.
-
-
- Note that the order of bits depends on the architecture, and in
- particular, the bitfield passed to these operations must be at
- least as large as a long.
-
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- This is the classic method of exporting a symbol, and it works
- for both modules and non-modules. In the kernel all these
- declarations are often bundled into a single file to help
- genksyms (which searches source files for these declarations).
- See the comment on genksyms and Makefiles below.
+ This is the classic method of exporting a symbol: dynamically
+ loaded modules will be able to use the symbol as normal.
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
symbols exported by EXPORT_SYMBOL_GPL() can
only be seen by modules with a
MODULE_LICENSE() that specifies a GPL
- compatible license.
+ compatible license. It implies that the function is considered
+ an internal implementation issue, and not really an interface.
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- There are three sets of linked-list routines in the kernel
- headers, but this one seems to be winning out (and Linus has
- used it). If you don't have some particular pressing need for
- a single list, it's a good choice. In fact, I don't care
- whether it's a good choice or not, just use it so we can get
- rid of the others.
+ There used to be three sets of linked-list routines in the kernel
+ headers, but this one is the winner. If you don't have some
+ particular pressing need for a singly-linked list, it's a good choice.
+
+
+
+ In particular, list_for_each_entry is useful.
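To show why list_for_each_entry is convenient, here is a userspace sketch of the idea. Every `sketch_` name is hypothetical and the macros are simplified re-creations, not the kernel's definitions; it assumes a compiler that supports `__typeof__` (GCC or Clang). The point is that the list node is embedded in your structure and the iterator hands back the containing structure directly.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly-linked list node, modelled on the kernel's. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

#define SKETCH_LIST_HEAD_INIT(name) { &(name), &(name) }

/* Insert new_node just before head, i.e. at the tail of the list. */
static void sketch_list_add_tail(struct sketch_list_head *new_node,
				 struct sketch_list_head *head)
{
	new_node->prev = head->prev;
	new_node->next = head;
	head->prev->next = new_node;
	head->prev = new_node;
}

/* Recover the enclosing structure from a pointer to its embedded member. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk the list, yielding each enclosing structure in turn. */
#define sketch_list_for_each_entry(pos, head, member)				\
	for (pos = sketch_container_of((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);						\
	     pos = sketch_container_of(pos->member.next, __typeof__(*pos), member))

/* Hypothetical payload structure with an embedded list node. */
struct sketch_item {
	int value;
	struct sketch_list_head node;
};

/* Sum the values of all items on the list. */
static int sketch_sum(struct sketch_list_head *head)
{
	struct sketch_item *it;
	int sum = 0;

	sketch_list_for_each_entry(it, head, node)
		sum += it->value;
	return sum;
}

static void sketch_list_demo(void)
{
	struct sketch_list_head head = SKETCH_LIST_HEAD_INIT(head);
	struct sketch_item a = { .value = 1 }, b = { .value = 2 };

	sketch_list_add_tail(&a.node, &head);
	sketch_list_add_tail(&b.node, &head);
	assert(sketch_sum(&head) == 3);
}
```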
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
convention, and return 0 for success,
and a negative error number
(eg. -EFAULT) for failure. This can be
- unintuitive at first, but it's fairly widespread in the networking
- code, for example.
+ unintuitive at first, but it's fairly widespread in the kernel.
- The filesystem code uses ERR_PTR()
+ Using ERR_PTR()
- ; to
+ ; to
encode a negative error number into a pointer, and
IS_ERR() and PTR_ERR()
to get it back out again: avoids a separate pointer parameter for
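The pointer-encoding convention in this hunk can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel's header: the `sketch_` helpers, the 4095 limit, and `sketch_alloc_buffer()` are all hypothetical names chosen for the example. The idea is that the top few thousand addresses are never valid pointers, so a small negative error number can travel in the return value itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed upper bound on error numbers for this sketch. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative error number as a pointer value. */
static void *sketch_err_ptr(intptr_t error)
{
	return (void *)error;
}

/* Decode the error number from an encoded pointer. */
static intptr_t sketch_ptr_err(const void *ptr)
{
	return (intptr_t)ptr;
}

/* True if ptr lies in the range reserved for encoded errors. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical allocator: returns a buffer, or an encoded error. */
static void *sketch_alloc_buffer(size_t len)
{
	void *buf;

	if (len == 0)
		return sketch_err_ptr(-EINVAL);
	buf = malloc(len);
	if (!buf)
		return sketch_err_ptr(-ENOMEM);
	return buf;
}

static void sketch_err_demo(void)
{
	void *p = sketch_alloc_buffer(0);

	/* The caller checks for an error without a separate out-parameter. */
	assert(sketch_is_err(p));
	assert(sketch_ptr_err(p) == -EINVAL);
}
```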
@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
supported, due to lack of general use, but the following are
considered standard (see the GCC info page section "C
Extensions" for more details - Yes, really the info page, the
- man page is only a short summary of the stuff in info):
+ man page is only a short summary of the stuff in info).
@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
- Function names as strings (__FUNCTION__)
+ Function names as strings (__func__).
@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
Usually you want a configuration option for your kernel hack.
- Edit Config.in in the appropriate directory
- (but under arch/ it's called
- config.in). The Config Language used is not
- bash, even though it looks like bash; the safe way is to use only
- the constructs that you already see in
- Config.in files (see
- Documentation/kbuild/kconfig-language.txt).
- It's good to run "make xconfig" at least once to test (because
- it's the only one with a static parser).
-
-
-
- Variables which can be Y or N use bool followed by a
- tagline and the config define name (which must start with
- CONFIG_). The tristate function is the same, but
- allows the answer M (which defines
- CONFIG_foo_MODULE in your source, instead of
- CONFIG_FOO) if CONFIG_MODULES
- is enabled.
+ Edit Kconfig in the appropriate directory.
+ The Config language is simple to use by cut and paste, and there's
+ complete documentation in
+ Documentation/kbuild/kconfig-language.txt.
You may well want to make your CONFIG option only visible if
CONFIG_EXPERIMENTAL is enabled: this serves as a
warning to users. There many other fancy things you can do: see
- the various Config.in files for ideas.
+ the various Kconfig files for ideas.
+
+
+
+ In your description of the option, make sure you address both the
+ expert user and the user who knows nothing about your feature. Mention
+ incompatibilities and issues here. Definitely
+ end your description with if in doubt, say N
+ (or, occasionally, `Y'); this is for people who have no
+ idea what you are talking about.
Edit the Makefile: the CONFIG variables are
- exported here so you can conditionalize compilation with `ifeq'.
- If your file exports symbols then add the names to
- export-objs so that genksyms will find them.
-
-
- There is a restriction on the kernel build system that objects
- which export symbols must have globally unique names.
- If your object does not have a globally unique name then the
- standard fix is to move the
- EXPORT_SYMBOL() statements to their own
- object with a unique name.
- This is why several systems have separate exporting objects,
- usually suffixed with ksyms.
-
-
-
-
-
-
-
- Document your option in Documentation/Configure.help. Mention
- incompatibilities and issues here. Definitely
- end your description with if in doubt, say N
- (or, occasionally, `Y'); this is for people who have no
- idea what you are talking about.
+ exported here so you can usually just add an "obj-$(CONFIG_xxx) +=
+ xxx.o" line. The syntax is documented in
+ Documentation/kbuild/makefiles.txt.
@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
- include/linux/brlock.h:
+ include/asm-i386/delay.h:
-extern inline void br_read_lock (enum brlock_indices idx)
-{
- /*
- * This causes a link-time bug message if an
- * invalid index is used:
- */
- if (idx >= __BR_END)
- __br_lock_usage_bug();
-
- read_lock(&__brlock_array[smp_processor_id()][idx]);
-}
+#define ndelay(n) (__builtin_constant_p(n) ? \
+ ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
+ __ndelay(n))
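The new ndelay() example relies on `__builtin_constant_p` to pick a code path at compile time. Here is a userspace sketch of that dispatch, assuming GCC or Clang. All `sketch_` names are hypothetical, and unlike the kernel (which leaves `__bad_ndelay()` undefined so an oversized constant fails at link time) every function is defined here so the sketch links at any optimisation level. Counters record which path was taken.

```c
#include <assert.h>

static int sketch_const_calls;   /* constant-argument path taken */
static int sketch_runtime_calls; /* run-time path taken */
static int sketch_bad_calls;     /* oversized constant detected */

/* Stand-in for the kernel's deliberately undefined __bad_ndelay(). */
static void sketch_bad_ndelay(void)
{
	sketch_bad_calls++;
}

/* Stand-in for the constant-argument delay routine. */
static void sketch_const_delay(unsigned long scaled)
{
	(void)scaled;
	sketch_const_calls++;
}

/* Stand-in for the general run-time delay routine. */
static void sketch_runtime_delay(unsigned long n)
{
	(void)n;
	sketch_runtime_calls++;
}

/* Same shape as the ndelay() macro in the hunk above. */
#define sketch_ndelay(n) (__builtin_constant_p(n) ?				\
	((n) > 20000 ? sketch_bad_ndelay() : sketch_const_delay((n) * 5ul)) :	\
	sketch_runtime_delay(n))

static void sketch_ndelay_demo(void)
{
	volatile unsigned long v = 100; /* volatile: never a compile-time constant */

	sketch_ndelay(100); /* literal: takes the constant path */
	sketch_ndelay(v);   /* variable: takes the run-time path */
	assert(sketch_bad_calls == 0);
}
```

The kernel version turns the oversized-constant case into a build failure instead of a counter, which is the real benefit of the trick.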