Commit Graph

365 Commits

Author SHA1 Message Date
Yoichi Yuasa
c0589f1ece [MIPS] Remove unused definitions from addrspace.h.
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:19 +01:00
Thiemo Seufer
c583122c26 [MIPS] Qemu system shutdown support
Signed-off-by: Thiemo Seufer <ths@networkno.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:19 +01:00
Atsushi Nemoto
eae89076e6 [MIPS] Unify mips_fpu_soft_struct and mips_fpu_hard_structs.
The struct mips_fpu_soft_struct and mips_fpu_hard_struct are
completely same now and the kernel fpu emulator assumes that.  This
patch unifies them to mips_fpu_struct and get rid of mips_fpu_union.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:18 +01:00
Mark.Zhan
a240a46964 [MIPS] Wind River 4KC PPMC Eval Board Support
Support for the GT-64120-based Wind River 4KC PPMC Evaluation board.

Signed-off-by: Rongkai.Zhan <Rongkai.zhan@windriver.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:18 +01:00
Atsushi Nemoto
0307e8d024 [MIPS] Fix futex_atomic_op_inuser.
I found that NPTL's pthread_cond_signal() does not work properly on
kernels compiled by gcc 4.1.x.  I suppose inline asm for
__futex_atomic_op() was wrong.  I suppose:

1. "=&r" constraint should be used for oldval.
2. Instead of "r" (uaddr), "=R" (*uaddr) for output and "R" (*uaddr)
   for input should be used.
3. "memory" should be added to the clobber list.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:17 +01:00
Atsushi Nemoto
c138e12f3a [MIPS] Fix fpu_save_double on 64-bit.
> Without this fix, _save_fp() in 64-bit kernel is seriously broken.
>
> ffffffff8010bec0 <_save_fp>:
> ffffffff8010bec0:       400d6000        mfc0    t1,c0_status
> ffffffff8010bec4:       000c7140        sll     t2,t0,0x5
> ffffffff8010bec8:       05c10011        bgez    t2,ffffffff8010bf10 <_save_fp+0x50>
> ffffffff8010becc:       00000000        nop
> ffffffff8010bed0:       f4810328        sdc1    $f1,808(a0)
> ...

Fix register usage in fpu_save_double() and make fpu_restore_double()
more symmetric with fpu_save_double().

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-19 17:39:13 +01:00
Chad Reese
b1c231f5a5 [MIPS] Fix sparsemem support.
Move memory_present() in arch/mips/kernel/setup.c. When using sparsemem
extreme, this function does an allocate for bootmem. This would always
fail since init_bootmem hasn't been called yet.
    
Move memory_present after free_bootmem. This only marks actual memory
ranges as present instead of the entire address space.
    
Signed-off-by: Chad Reese  <creese@caviumnetworks.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:20 +01:00
Ralf Baechle
e32b699335 [MIPS] Fix 64-bit build for RM7000.
RM7000 has 40-bit virtual / 36-bit physical address space.
    
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:19 +01:00
Sergei Shtylyov
7cb710c9a6 [MIPS] Fix non-linear memory mapping on MIPS
Fix the non-linear memory mapping done via remap_file_pages() -- it
didn't work on any MIPS CPU because the page offset clashing with
_PAGE_FILE and some other page protection bits which should have been left
zeros for this kind of pages.
    
Signed-off-by: Konstantin Baydarov <kbaidarov@ru.mvista.com>
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:18 +01:00
Sergei Shtylyov
6ebba0e2f5 [MIPS] Fix swap entry for MIPS32 36-bit physical address
With 64-bit physical address enabled, 'swapon' was causing kernel oops on
Alchemy CPUs (MIPS32) because of the swap entry type field corrupting the
_PAGE_FILE bit in 'pte_low' field. So, switch to storing the swap entry in
'pte_high' field using all its bits except _PAGE_GLOBAL and _PAGE_VALID which
gives 25 bits for the swap entry offset.
    
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:16 +01:00
Sergei Shtylyov
79e0bc3725 [MIPS] Fix mprotect() syscall for MIPS32 w/36-bit physical address support
Fix mprotect() syscall for MIPS32 CPUs with 36-bit physical address
support: pte_modify() macro didn't clear the hardware page protection bits
before modifying...
    
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:15 +01:00
Ralf Baechle
722ace9dfb [MIPS] Fix declaration of smp_prepare_cpus() platform hook.
A while ago prom_prepare_cpus was replaced by plat_prepare_cpus but
the declaration has stayed unchanged.
    
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:12 +01:00
Ralf Baechle
5ee823507b [MIPS] Fix instable BogoMIPS on multi-issue processors.
Increase alignment of BogoMIPS loop to 8 bytes.  Having the delay loop
overlap cache line boundaries may cause instable delays.
    
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:10 +01:00
Ralf Baechle
acf518cbba [MIPS] Remove duplicate declaration of cpu_online_map.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-06 00:15:09 +01:00
Linus Torvalds
48e49ead3e Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Fix D-cache corruption in mremap
  [SPARC64]: Make smp_processor_id() functional before start_kernel()
2006-06-02 16:02:22 -07:00
David S. Miller
0b0968a3e6 [SPARC64]: Fix D-cache corruption in mremap
If we move a mapping from one virtual address to another,
and this changes the virtual color of the mapping to those
pages, we can see corrupt data due to D-cache aliasing.

Check for and deal with this by overriding the move_pte()
macro.  Set things up so that other platforms can cleanly
override the move_pte() macro too.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-01 17:47:25 -07:00
Kumba
44d921b246 [MIPS] Treat R14000 like R10000.
Signed-off-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:35 +01:00
Thiemo Seufer
ca30225e9e [MIPS] Update/Fix instruction definitions
A small bugfix for up to now unused instruction definitions, and a
somewhat larger update to cover MIPS32R2 instructions.
    
Signed-off-by: Thiemo Seufer <ths@networkno.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:34 +01:00
Thiemo Seufer
3301edcbd7 [MIPS] DSP and MDMX share the same config flag bit.
Clarify comment.
    
Signed-off-by: Thiemo Seufer <ths@networkno.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:33 +01:00
Daniel Jacobowitz
1c0c1ae4f3 [MIPS] Update struct sigcontext member names
Rename the 64-bit sc_hi and sc_lo arrays to use the same names
as the 32-bit struct sigcontext (sc_mdhi, sc_hi1, et cetera).
    
Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:31 +01:00
Ralf Baechle
6ee1da94c5 [MIPS] Update/fix futex assembly
o Implement futex_atomic_op_inuser() operation
 o Don't use the R10000-ll/sc bug workaround version for every processor.
   branch likely is deprecated and some historic ll/sc processors don't
   implement it.  In any case it's slow.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:31 +01:00
Chris Dearman
c620953c32 [MIPS] Fix detection and handling of the 74K processor.
Nothing exciting; Linux just didn't know it yet so this is most adding
a value to a case statement.
    
Signed-off-by: Chris Dearman <chris@mips.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:30 +01:00
Sergei Shtylyov
6e9538917c [MIPS] Fix marking buddy of pte global for MIPS32 w/36-bit physical address
In case of CONFIG_64BIT_PHYS_ADDR, set_pte() and pte_clear() functions
only set _PAGE_GLOBAL bit in the pte_low field of the buddy PTEs,
forgetting to propagate ito to pte_high. Thus, the both pages might not
really be made global for the CPU (since it AND's the G-bit of the
odd / even PTEs together to decide whether they're global or not). Thus,
if only a single page is allocated via vmalloc() or ioremap(), it's not
really global for CPU (and it must be, since this is kernel mapping),
and thus its ASID is compared against the current process' one -- so,
we'll get into trouble sooner or later...  Also, pte_none() will fail
on global pages because _PAGE_GLOBAL bit is set in both pte_low and
pte_high, and pte_val() will return u64 value consisting of those fields
concateneted.
    
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-06-01 00:28:30 +01:00
Chris Dearman
7a8341969f [MIPS] 24K LV: Add core card id.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-27 15:13:50 +01:00
Atsushi Nemoto
bc81824720 [MIPS] Fix bitops for MIPS32/MIPS64 CPUs.
With recent rewrite for generic bitops, fls() for 32bit kernel with
MIPS64_CPU is broken.  Also, ffs(), fls() should be defined the same
way as the libc and compiler built-in routines (returns int instead of
unsigned long).
    
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-27 15:13:49 +01:00
Ralf Baechle
7e3bfc7cfc [MIPS] Handle IDE PIO cache aliases on SMP.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:29 +02:00
Ralf Baechle
b4ade4bf88 [MIPS] MIPS boards: Set HZ to 100.
1000Hz will bring an FPGA CPU down on it's knees and it's even worse on
multithreaded cores.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:29 +02:00
Ralf Baechle
f088fc84f9 [MIPS] FPU affinity for MT ASE.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:28 +02:00
Ralf Baechle
41c594ab65 [MIPS] MT: Improved multithreading support.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:28 +02:00
Ralf Baechle
2600990e64 [MIPS] kpsd and other AP/SP improvements.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:27 +02:00
Ralf Baechle
a682a24170 [MIPS] Fix genrtc compilation.
Signed-off-by: Ralf Roesch <ralf.roesch@rw-gmbh.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:22 +02:00
Ralf Baechle
675055bfb5 [MIPS] Use "R" constraint for cache_op.
Gcc might emit an absolute address for the the "m" constraint which
gas unfortunately does not permit.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:21 +02:00
Ralf Baechle
d35d473c25 [MIPS] Fix the crime against humanity that mipsIRQ.S is.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:21 +02:00
Ralf Baechle
ba8990f2ae [MIPS] JMR3927 build fixes for the RTC code.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:20 +02:00
Ralf Baechle
088cf96a69 [MIPS] MV6434x: Add prototype of interrupt dispatch function.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:20 +02:00
Ralf Baechle
ac2384a855 [MIPS] it8172: Fix build of serial driver.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:19 +02:00
Ralf Baechle
93373ed4d8 [MIPS] Rewrite spurious_interrupt from assembler to C.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:18 +02:00
Ralf Baechle
a8d587a71b [MIPS] Wire up sync_file_range(2).
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:14 +02:00
Ralf Baechle
f115da9cd6 [MIPS] Wire splice syscall.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:13 +02:00
Ralf Baechle
84ada9f856 [MIPS] More SHT_* and SHF_* ELF definitions.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:13 +02:00
Ralf Baechle
91b05e6776 [MIPS] Fix vectored interrupt support in TLB exception handler generator.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:13 +02:00
Ralf Baechle
15c4f67ab8 [MIPS] Provide access functions for c0_badvaddr.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:13 +02:00
Ralf Baechle
b4d05cb9cb [MIPS] Make set_vi_srs_handler static.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-04-19 04:14:12 +02:00
Yasunori Goto
c80d79d746 [PATCH] Configurable NODES_SHIFT
Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for
each arch.  Its definition is sometimes configurable.  Indeed, ia64 defines 5
NODES_SHIFT values in the current git tree.  But it looks a bit messy.

SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has
been changeable by config.  Suitable node's number may be changed in the
future even if it is other architecture.  So, I wrote configurable node's
number.

This patch set defines just default value for each arch which needs multi
nodes except ia64.  But, it is easy to change to configurable if necessary.

On ia64 the number of nodes can be already configured in generic ia64 and SN2
config.  But, NODES_SHIFT is defined for DIG64 and HP'S machine too.  So, I
changed it so that all platforms can be configured via CONFIG_NODES_SHIFT.  It
would be simpler.

See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:39 -07:00
Linus Torvalds
d4965b3e2f Merge master.kernel.org:/home/rmk/linux-2.6-serial
* master.kernel.org:/home/rmk/linux-2.6-serial:
  [SERIAL] Provide Cirrus EP93xx AMBA PL010 serial support.
  [SERIAL] amba-pl010: allow platforms to specify modem control method
  [SERIAL] Remove obsoleted au1x00_uart driver
  [SERIAL] Small time UART configuration fix for AU1100 processor
2006-03-28 13:52:37 -08:00
Matt Mackall
da2468b6a8 [PATCH] RTC: Remove RTC UIP synchronization on MIPS MC146818
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:00 -08:00
Yoichi Yuasa
d23ee8fe6e [PATCH] mips: fixed collision of rtc function name
Fix the collision of rtc function name.

Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar
8f17d3a504 [PATCH] lightweight robust futexes updates
- fix: initialize the robust list(s) to NULL in copy_process.

- doc update

- cleanup: rename _inuser to _inatomic

- __user cleanups and other small cleanups

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
e9056f13bf [PATCH] lightweight robust futexes: arch defaults
This patchset provides a new (written from scratch) implementation of robust
futexes, called "lightweight robust futexes".  We believe this new
implementation is faster and simpler than the vma-based robust futex solutions
presented before, and we'd like this patchset to be adopted in the upstream
kernel.  This is version 1 of the patchset.

  Background
  ----------

What are robust futexes?  To answer that, we first need to understand what
futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having to
enter the kernel.

A futex is in essence a user-space address, e.g.  a 32-bit lock variable
field.  If userspace notices contention (the lock is already owned and someone
else wants to grab it too) then the lock is marked with a value that says
"there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to
wait for the other guy to release it.  The kernel creates a 'futex queue'
internally, so that it can later on match up the waiter with the waker -
without them having to know about each other.  When the owner thread releases
the futex, it notices (via the variable value) that there were waiter(s)
pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up.  Once all
waiters have taken and released the lock, the futex is again back to
'uncontended' state, and there's no in-kernel state associated with it.  The
kernel completely forgets that there ever was a futex at that address.  This
method makes futexes very lightweight and scalable.

"Robustness" is about dealing with crashes while holding a lock: if a process
exits prematurely while holding a pthread_mutex_t lock that is also shared
with some other process (e.g.  yum segfaults while holding a pthread_mutex_t,
or yum is kill -9-ed), then waiters for that lock need to be notified that the
last owner of the lock exited in some irregular way.

To solve such types of problems, "robust mutex" userspace APIs were created:
pthread_mutex_lock() returns an error value if the owner exits prematurely -
and the new owner can decide whether the data protected by the lock can be
recovered safely.

There is a big conceptual problem with futex based mutexes though: it is the
kernel that destroys the owner task (e.g.  due to a SEGFAULT), but the kernel
cannot help with the cleanup: if there is no 'futex queue' (and in most cases
there is none, futexes being fast lightweight locks) then the kernel has no
information to clean up after the held lock!  Userspace has no chance to clean
up after the lock either - userspace is the one that crashes, so it has no
opportunity to clean up.  Catch-22.

In practice, when e.g.  yum is kill -9-ed (or segfaults), a system reboot is
needed to release that futex based lock.  This is one of the leading
bugreports against yum.

To solve this problem, 'Robust Futex' patches were created and presented on
lkml: the one written by Todd Kneisel and David Singleton is the most advanced
at the moment.  These patches all tried to extend the futex abstraction by
registering futex-based locks in the kernel - and thus give the kernel a
chance to clean up.

E.g.  in David Singleton's robust-futex-6.patch, there are 3 new syscall
variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER.
The kernel attaches such robust futexes to vmas (via
vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are
searched to see whether they have a robust_head set.

Lots of work went into the vma-based robust-futex patch, and recently it has
improved significantly, but unfortunately it still has two fundamental
problems left:

 - they have quite complex locking and race scenarios.  The vma-based
   patches had been pending for years, but they are still not completely
   reliable.

 - they have to scan _every_ vma at sys_exit() time, per thread!

The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas every
pthread_exit() takes a millisecond or more, also totally destroying the CPU's
L1 and L2 caches!

This is very much noticeable even for normal process sys_exit_group() calls:
the kernel has to do the vma scanning unconditionally!  (this is because the
kernel has no knowledge about how many robust futexes there are to be cleaned
up, because a robust futex might have been registered in another task, and the
futex variable might have been simply mmap()-ed into this process's address
space).

This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than
that: the overhead makes robust futexes impractical for any type of generic
Linux distribution.

So it became clear to us, something had to be done.  Last week, when Thomas
Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he
found a handful of new races and we were talking about it and were analyzing
the situation.  At that point a fundamentally different solution occured to
me.  This patchset (written in the past couple of days) implements that new
solution.  Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing.  Parental advice
recommended ;-)

  New approach to robust futexes
  ------------------------------

At the heart of this new approach there is a per-thread private list of robust
locks that userspace is holding (maintained by glibc) - which userspace list
is registered with the kernel via a new syscall [this registration happens at
most once per thread lifetime].  At do_exit() time, the kernel checks this
user-space list: are there any robust futex locks to be cleaned up?

In the common case, at do_exit() time, there is no list registered, so the
cost of robust futexes is just a simple current->robust_list != NULL
comparison.  If the thread has registered a list, then normally the list is
empty.  If the thread/process crashed or terminated in some incorrect way then
the list might be non-empty: in this case the kernel carefully walks the list
[not trusting it], and marks all locks that are owned by this thread with the
FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any).

The list is guaranteed to be private and per-thread, so it's lockless.  There
is one race possible though: since adding to and removing from the list is
done after the futex is acquired by glibc, there is a few instructions window
for the thread (or process) to die there, leaving the futex hung.  To protect
against this possibility, userspace (glibc) also maintains a simple per-thread
'list_op_pending' field, to allow the kernel to clean up if the thread dies
after acquiring the lock, but just before it could have added itself to the
list.  Glibc sets this list_op_pending field before it tries to acquire the
futex, and clears it after the list-add (or list-remove) has finished.

That's all that is needed - all the rest of robust-futex cleanup is done in
userspace [just like with the previous patches].

Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes.  (Ulrich plans to commit these
changes to glibc-HEAD later today.)

Key differences of this userspace-list based approach, compared to the vma
based method:

 - it's much, much faster: at thread exit time, there's no need to loop
   over every vma (!), which the VM-based method has to do.  Only a very
   simple 'is the list empty' op is done.

 - no VM changes are needed - 'struct address_space' is left alone.

 - no registration of individual locks is needed: robust mutexes dont need
   any extra per-lock syscalls.  Robust mutexes thus become a very lightweight
   primitive - so they dont force the application designer to do a hard choice
   between performance and robustness - robust mutexes are just as fast.

 - no per-lock kernel allocation happens.

 - no resource limits are needed.

 - no kernel-space recovery call (FUTEX_RECOVER) is needed.

 - the implementation and the locking is "obvious", and there are no
   interactions with the VM.

  Performance
  -----------

I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:

 - with FUTEX_WAIT set [contended mutex]: 130 msecs
 - without FUTEX_WAIT set [uncontended mutex]: 30 msecs

I have also measured an approach where glibc does the lock notification [which
it currently does for !pshared robust mutexes], and that took 256 msecs -
clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do.

(1 million held locks are unheard of - we expect at most a handful of locks to
be held at a time.  Nevertheless it's nice to know that this approach scales
nicely.)

  Implementation details
  ----------------------

The patch adds two new syscalls: one to register the userspace list, and one
to query the registered list pointer:

 asmlinkage long
 sys_set_robust_list(struct robust_list_head __user *head,
                     size_t len);

 asmlinkage long
 sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                     size_t __user *len_ptr);

List registration is very fast: the pointer is simply stored in
current->robust_list.  [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head for new
threads, without the need of another syscall.]

So there is virtually zero overhead for tasks not using robust futexes, and
even for robust futex users, there is only one extra syscall per thread
lifetime, and the cleanup operation, if it happens, is fast and
straightforward.  The kernel doesnt have any internal distinction between
robust and normal futexes.

If a futex is found to be held at exit time, the kernel sets the highest bit
of the futex word:

	#define FUTEX_OWNER_DIED        0x40000000

and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.

Otherwise, robust futexes are acquired by glibc by putting the TID into the
futex field atomically.  Waiters set the FUTEX_WAITERS bit:

	#define FUTEX_WAITERS           0x80000000

and the remaining bits are for the TID.

  Testing, architecture support
  -----------------------------

I've tested the new syscalls on x86 and x86_64, and have made sure the parsing
of the userspace list is robust [ ;-) ] even if the list is deliberately
corrupted.

i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the
new glibc code (on x86_64 and i386), and it works for his robust-mutex
testcases.

All other architectures should build just fine too - but they wont have the
new syscalls yet.

Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline
function before writing up the syscalls (that function returns -ENOSYS right
now).

This patch:

Add placeholder futex_atomic_cmpxchg_inuser() implementations to every
architecture that supports futexes.  It returns -ENOSYS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
62ac285f3c [PATCH] mips: add ptr_to_compat()
Add ptr_to_compat() - needed by the new robust futex code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00