linux/mm
Nick Piggin 557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
..
allocpercpu.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
backing-dev.c remove mm/backing-dev.c:congestion_wait_interruptible() 2007-07-16 09:05:52 -07:00
bootmem.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
bounce.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
fadvise.c [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path 2006-12-08 08:28:43 -08:00
filemap_xip.c mm: fault feedback #2 2007-07-19 10:04:41 -07:00
filemap.c filemap: convert some unsigned long to pgoff_t 2007-10-16 09:42:53 -07:00
filemap.h Remove all inclusions of <linux/config.h> 2006-10-04 03:38:54 -04:00
fremap.c fix VM_CAN_NONLINEAR check in sys_remap_file_pages 2007-10-08 12:58:14 -07:00
highmem.c Create the ZONE_MOVABLE zone 2007-07-17 10:22:59 -07:00
hugetlb.c hugetlb: fix clear_user_highpage arguments 2007-10-01 07:52:23 -07:00
internal.h Make page->private usable in compound pages 2007-05-07 12:12:53 -07:00
Kconfig vmemmap: generify initialisation via helpers 2007-10-16 09:42:51 -07:00
madvise.c speed up madvise_need_mmap_write() usage 2007-07-16 09:05:36 -07:00
Makefile Generic Virtual Memmap support for SPARSEMEM 2007-10-16 09:42:51 -07:00
memory_hotplug.c memory hotplug: fix unnecessary calling of init_currenty_empty_zone() 2007-06-01 08:18:29 -07:00
memory.c remove ZERO_PAGE 2007-10-16 09:42:53 -07:00
mempolicy.c Clean up duplicate includes in mm/ 2007-10-16 09:42:52 -07:00
mempool.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
migrate.c mm/migrate.c __user annotation 2007-10-14 12:41:51 -07:00
mincore.c [PATCH] mincore: vma crossing fix 2007-02-15 09:57:03 -08:00
mlock.c do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITY 2007-07-16 09:05:37 -07:00
mmap.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
mmzone.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
mprotect.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
mremap.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
msync.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
nommu.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
oom_kill.c oom: print points as unsigned long 2007-07-31 15:39:36 -07:00
page_alloc.c process_zones(): fix recovery code 2007-08-31 01:42:22 -07:00
page_io.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
page-writeback.c mm: set_page_dirty_balance() vs ->page_mkwrite() 2007-10-08 12:58:14 -07:00
pdflush.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
prio_tree.c
quicklist.c Quicklists for page table pages 2007-05-07 12:12:54 -07:00
readahead.c readahead: remove several readahead macros 2007-10-16 09:42:52 -07:00
rmap.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
shmem_acl.c [PATCH] Fix typos in mm/shmem_acl.c 2006-10-11 11:14:23 -07:00
shmem.c Clean up duplicate includes in mm/ 2007-10-16 09:42:52 -07:00
slab.c slab: skip calling cache_free_alien() when the platform is not numa capable 2007-08-22 19:52:46 -07:00
slob.c slob: reduce list scanning 2007-07-21 17:49:16 -07:00
slub.c SLUB: direct pass through of page size or higher kmalloc requests 2007-10-16 09:42:53 -07:00
sparse-vmemmap.c vmemmap: generify initialisation via helpers 2007-10-16 09:42:51 -07:00
sparse.c Generic Virtual Memmap support for SPARSEMEM 2007-10-16 09:42:51 -07:00
swap_state.c Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated 2007-07-17 10:22:59 -07:00
swap.c Clean up duplicate includes in mm/ 2007-10-16 09:42:52 -07:00
swapfile.c Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION 2007-07-29 16:45:38 -07:00
thrash.c Bug in mm/thrash.c function grab_swap_token() 2007-05-11 08:29:32 -07:00
tiny-shmem.c [PATCH] mm/{,tiny-}shmem.c cleanups 2007-03-01 14:53:35 -08:00
truncate.c mm: merge populate and nopage into fault (fixes nonlinear) 2007-07-19 10:04:41 -07:00
util.c add kstrndup 2007-07-18 08:47:39 -07:00
vmalloc.c lguest: export symbols for lguest as a module 2007-07-19 10:04:52 -07:00
vmscan.c synchronous lumpy reclaim: wait for page writeback when directly reclaiming contiguous areas 2007-08-22 19:52:45 -07:00
vmstat.c Remove fs.h from mm.h 2007-07-29 17:09:29 -07:00