IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The function blktrans_dev_release and blktrans_dev_put are only used
locally in this file and should be static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The three backing_dev_info symbols are only used in this file and
should be static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
clk_get() returns a struct clk cookie to the driver and some platforms
may return NULL if they only support a single clock. clk_get() has only
failed if it returns a ERR_PTR() encoded pointer.
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Reviewed-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
As there are no more dependencies on MTD_CONCAT or CONFIG_MTD_CONCAT,
drop this entry from Kconfig entirely.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Acked-by: Stefan Roese <sr@denx.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
As MTD_CONCAT is becoming a part of mtd core, it's now meaningless
to to check for it in ifdefs. Drop such references from MTD code.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Acked-by: Stefan Roese <sr@denx.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
As MTD_CONCAT is becoming a part of mtd core, it's no longer necessary
to depend on it in Kconfig scripts. Drop such references.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Acked-by: Stefan Roese <sr@denx.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Move mtdconcat to be an integral part of the mtd core. It's a tiny bit
of code, which bears 'say Y if you don't know what to do' note in the
Kconfig. OTOH there are several ugly ifdefs depending on the MTD_CONCAT.
So, making MTD_CONCAT support mandatory will allow us to clean up code a
lot.
Kconfig entry is changed to be a bool defaulting to Y, so all code
pieces depending on it, will have MTD_CONCAT Kconfig symbol and
CONFIG_MTD_CONCAT define. This will be removed in one of next patches.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Acked-by: Stefan Roese <sr@denx.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Since 43cc71eed1250755986da4c0f9898f9a635cb3bf (platform: prefix MODALIAS
with "platform:"), the platform modalias is prefixed with "platform:".
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
This one liner patch fixes double free that will occur if add_mtd_blktrans_dev
fails. On failure it frees the input argument, but all its users also free it
on error which is natural thing to do. Thus don't free it.
All credit for finding that bug belongs to reporters of the bug in the android bugzilla
http://code.google.com/p/android/issues/detail?id=13761
Commit message tweaked by Artem.
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
For some unknown reasons resources needed by amd76xrom driver can be
unavailable. And instead of returning an error, the driver keeps going
and crash the kernel. This patch fixes the problem by making the driver
return -EBUSY if the resources are not available.
Commit messages tweaked by Artem.
Reported-by: Russell Whitaker <russ@ashlandhome.net>
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
As inval_cache_and_wait_for_operation() drop and reclaim the lock
to invalidate the cache, some other thread may suspend the operation
before reaching the for(;;) loop. Therefore the loop must start with
checking the chip->state before reading status from the chip.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Acked-by: Michael Cashwell <mboards@prograde.net>
Acked-by: Stefan Bigler <stefan.bigler@keymile.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
In the commit 08968041bef437ec363623cd3218c2b083537ada
(mtd: cfi_cmdset_0002: make sector erase command variable)
introdused a field sector_erase_cmd. In the same commit initialisation
of cfi->sector_erase_cmd made in cfi_chip_setup()
(file drivers/mtd/chips/cfi_probe.c), so the CFI chip has no problem:
...
cfi->cfi_mode = CFI_MODE_CFI;
cfi->sector_erase_cmd = CMD(0x30);
...
But for the JEDEC chips this initialisation is not carried out,
so the JEDEC chips have sector_erase_cmd == 0.
This patch adds the missing initialisation.
Signed-off-by: Antony Pavlov <antony@niisi.msk.ru>
Acked-by: Guillaume LECERF <glecerf@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
CC: stable@kernel.org
In the following commit, we'll need to use the CMD() macro in order to
fix the initialisation of the sector_erase_cmd field. That requires the
local variable to be called 'cfi', so change it first in a simple patch.
Signed-off-by: Antony Pavlov <antony@niisi.msk.ru>
Acked-by: Guillaume LECERF <glecerf@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
CC: stable@kernel.org
Provide the LEB offset information in the UBI device information data
structure. This piece of information is required by UBIFS to find out
what are the LEB offsets which are aligned to the max. write size.
If LEB offset not aligned to max. write size, then UBIFS has to take
this into account to write more optimally.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Incorporate MTD write buffer size into UBI device information
because UBIFS needs this field. UBI does not use it ATM, just
provides to upper layers in 'struct ubi_device_info'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Move to SOC_SOC_IMX5X. Leave only places which prevent multi-soc
using ARCH_MX5X.
Signed-off-by: Richard Zhao <richard.zhao@freescale.com>
Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Final step to eliminate of_platform_bus_type. They're all just
platform drivers now.
v2: fix type in pasemi_nand.c (thanks to Stephen Rothwell)
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
This patch updates EXYNOS4 OneNAND support according to the change of
ARCH name, EXYNOS4.
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
The current multiblock erase verify read timeout 100us is the maximum
for none-error case. If errors happen during multibock erase then
the specification recommends to run multiblock erase verify command
with maximum timeout 10ms (see specs. for KFM4G16Q2A and KFN8G16Q2A).
For the most common non-error case we wait 100us in udelay polling
loop. In case of timeout the interrupt mode is used to wait for the
command end.
Signed-off-by: Roman Tereshonkov <roman.tereshonkov@nokia.com>
Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
OneNAND frequency is determined when calculating
GPMC timings. Return that value instead of determining it
again in the OMAP OneNAND driver.
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
This patch overrides nand ecc layout and bad block descriptor (for 8-bit
device) to support hw ecc in romcode layout. So as to have in sync with ecc
layout throughout; i.e. x-loader, u-boot and kernel.
This enables to flash x-loader, u-boot, kernel, FS images from kernel itself
and compatiable with other tools.
This patch does not enables this feature by default and need to pass from
board file to enable for any board.
Signed-off-by: Vimal Singh <vimalsingh@ti.com>
Signed-off-by: Sukumar Ghorai <s-ghorai@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
This patch makes it possible to select sw or hw (different layout options)
ecc scheme supported by omap nand driver.
Signed-off-by: Vimal Singh <vimalsingh@ti.com>
Signed-off-by: Sukumar Ghorai <s-ghorai@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Configure the FIFO THREASHOLD value different for read and write to keep busy
both filling and to drain out of FIFO at reading and writing.
Signed-off-by: Vimal Singh <vimalsingh@ti.com>
Signed-off-by: Sukumar Ghorai <s-ghorai@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
nand transfer type (sDMA, Polled, prefetch) can be select from board file,
enabling all transfer type in driver, by default.
this helps in multi-omap build and to select different transfer type for
different board.
Signed-off-by: Sukumar Ghorai <s-ghorai@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
zoom3 and 3630-sdp having the x16 nand device.
This patch configure gpmc as x16 and select the currect function in driver
for polled mode (without prefetch enable) transfer.
Signed-off-by: Sukumar Ghorai <s-ghorai@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
There are two spellings in use for 'freeze' + 'able' - 'freezable' and
'freezeable'. The former is the more prominent one. The latter is
mostly used by workqueue and in a few other odd places. Unify the
spelling to 'freezable'.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Steven Whitehouse <swhiteho@redhat.com>
When the read operation fails, UBI tries to re-read several times in
a hope that one of the subsequent reads may succeed. However, currently
UBI re-reads only if MTD failed to read all data, but does not re-reads
if all the data were read, but with an integrity error (-EBADMSB). This
patch makes UBI to always re-try reading.
This should be useful for reading NAND pages with unstable bits -
re-reading may help to get correct data.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
During scanning UBI allocates one struct ubi_scan_leb object for each PEB,
so it can end up allocating thousands of them. Use slab cache to reduce
memory consumption for these 48-byte objects, because currently used
'kmalloc()' ends up allocating 64 bytes per object, instead of 48.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This change affects only the debugging code. Namely, use mtd->read()
function instead of ubi_io_read() to avoid bit-flips injection
(ubi_dbg_is_bitflip()) which we do not want on the debugging path.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When reading data from the flash, corrupt the buffer we are about to
read to. The idea is to fix the following possible situation:
1. The buffer contains data from previous operation, e.g., read from
another PEB previously. The data looks like expected, e.g., if we
just do not read anything and return - the caller would not
notice this. E.g., if we are reading a VID header, the buffer may
contain a valid VID header from another PEB.
2. The driver is buggy and returns use success or -EBADMSG or
-EUCLEAN, but it does not actually put any data to the buffer.
This may confuse UBI or upper layers - they may think the buffer
contains valid data while in fact it is just old data.
Thus, try to reveal such buggy MTD drivers with simple debugging
code which fills the read buffer with 0x12 constant.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This reverts commit a121f643993474548fe98144514c50dd4f3dbe76.
Unfortunately, this commit breaks UBIFS backward compatibility and
makes new UBIFS refuse older UBIFS-formatted media:
UBIFS error: validate_sb: min. I/O unit mismatch: 8 in superblock, 64 real
Thus, we have to revert this patch and work on a better solution.
Reported-by: Holger Brunck <holger.brunck@keymile.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Wrong macro was used in calculating the data offset: UBI_EC_HDR_SIZE instead of
UBI_VID_HDR_SIZE. The data offset should be VID header offset + VID header size
(aligned to the minimum I/O unit).
This was not a bug only because currently UBI_EC_HDR_SIZE and UBI_VID_HDR_SIZE
have the same value of 64 bytes.
Commit message was amended by Artem.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In 'nor_erase_prepare()' we want to make sure the UBI headers are
corrupted. But it is possible that one of the headers just contains
all 0xFFs, which is also OK, because UBI will erase it in case of
a power cut.
Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* git://git.infradead.org/mtd-2.6: (59 commits)
mtd: mtdpart: disallow reading OOB past the end of the partition
mtd: pxa3xx_nand: NULL dereference in pxa3xx_nand_probe
UBI: use mtd->writebufsize to set minimal I/O unit size
mtd: initialize writebufsize in the MTD object of a partition
mtd: onenand: add mtd->writebufsize initialization
mtd: nand: add mtd->writebufsize initialization
mtd: cfi: add writebufsize initialization
mtd: add writebufsize field to mtd_info struct
mtd: OneNAND: OMAP2/3: prevent regulator sleeping while OneNAND is in use
mtd: OneNAND: add enable / disable methods to onenand_chip
mtd: m25p80: Fix JEDEC ID for AT26DF321
mtd: txx9ndfmc: limit transfer bytes to 512 (ECC provides 6 bytes max)
mtd: cfi_cmdset_0002: add support for Samsung K8D3x16UxC NOR chips
mtd: cfi_cmdset_0002: add support for Samsung K8D6x16UxM NOR chips
mtd: nand: ams-delta: drop omap_read/write, use ioremap
mtd: m25p80: add debugging trace in sst_write
mtd: nand: ams-delta: select for built-in by default
mtd: OneNAND: lighten scary initial bad block messages
mtd: OneNAND: OMAP2/3: add support for command line partitioning
mtd: nand: rearrange ONFI revision checking, add ONFI 2.3
...
Fix up trivial conflict in drivers/mtd/Kconfig as per DavidW.
This patch fixes the mtdpart bug which allows users reading OOB past the
end of the partition. This happens because 'part_read_oob()' allows reading
multiple OOB areas in one go, and mtdparts does not validate the OOB
length in the request.
Although there is such check in 'nand_do_read_oob()' in nand_base.c, but
it checks that we do not read past the flash chip, not the partition,
because in nand_base.c we work with the whole chip (e.g., mtd->size
in nand_base.c is the size of the whole chip). So this check cannot
be done correctly in nand_base.c and should be instead done in mtdparts.c.
This problem was reported by Jason Liu <r64343@freescale.com> and reproduced
with nandsim:
$ modprobe nandsim first_id_byte=0x20 second_id_byte=0xaa third_id_byte=0x00 \
fourth_id_byte=0x15 parts=0x400,0x400
$ modprobe nandsim mtd_oobtest.ko dev=0
$ dmesg
= snip =
mtd_oobtest: attempting to read past end of device
mtd_oobtest: an error is expected...
mtd_oobtest: error: read past end of device
= snip =
mtd_oobtest: finished with 2 errors
Reported-by: Jason Liu <liu.h.jason@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.
Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive
That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.
Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
This series aims to develop logging facility for enterprise use.
It is important to save kernel messages reliably on enterprise system
because they are helpful for diagnosing system.
This series add kmsg_dump() to the paths loosing kernel messages. The use
case is the following.
[Use case of reboot/poweroff/halt/emergency_restart]
My company has often experienced the followings in our support service.
- Customer's system suddenly reboots.
- Customers ask us to investigate the reason of the reboot.
We recognize the fact itself because boot messages remain in
/var/log/messages. However, we can't investigate the reason why the
system rebooted, because the last messages don't remain. And off course
we can't explain the reason.
We can solve above problem with this patch as follows.
Case1: reboot with command
- We can see "Restarting system with command:" or ""Restarting system.".
Case2: halt with command
- We can see "System halted.".
Case3: poweroff with command
- We can see " Power down.".
Case4: emergency_restart with sysrq.
- We can see "Sysrq:" outputted in __handle_sysrq().
Case5: emergency_restart with softdog.
- We can see "Initiating system reboot" in watchdog_fire().
So, we can distinguish the reason of reboot, poweroff, halt and emergency_restart.
If customer executed reboot command, you may think the customer should
know the fact. However, they often claim they don't execute the command
when they rebooted system by mistake.
No message remains on the current Linux kernel, so we can't show the proof
to the customer. This patch improves this situation.
This patch:
Alters mtdoops and ramoops to perform their actions only for
KMSG_DUMP_PANIC, KMSG_DUMP_OOPS and KMSG_DUMP_KEXEC because they would
like to log crashes only.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (29 commits)
of/flattree: forward declare struct device_node in of_fdt.h
ipmi: explicitly include of_address.h and of_irq.h
sparc: explicitly cast negative phandle checks to s32
powerpc/405: Fix missing #{address,size}-cells in i2c node
powerpc/5200: dts: refactor dts files
powerpc/5200: dts: Change combatible strings on localbus
powerpc/5200: dts: remove unused properties
powerpc/5200: dts: rename nodes to prepare for refactoring dts files
of/flattree: Update dtc to current mainline.
of/device: Don't register disabled devices
powerpc/dts: fix syntax bugs in bluestone.dts
of: Fixes for OF probing on little endian systems
of: make drivers depend on CONFIG_OF instead of CONFIG_PPC_OF
of/flattree: Add of_flat_dt_match() helper function
of_serial: explicitly include of_irq.h
of/flattree: Refactor unflatten_device_tree and add fdt_unflatten_tree
of/flattree: Reorder unflatten_dt_node
of/flattree: Refactor unflatten_dt_node
of/flattree: Add non-boottime device tree functions
of/flattree: Add Kconfig for EARLY_FLATTREE
...
Fix up trivial conflict in arch/sparc/prom/tree_32.c as per Grant.
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>