IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Pull x86 fixes from Peter Anvin:
- Three EFI-related fixes
- Two early memory initialization fixes
- build fix for older binutils
- fix for an eager FPU performance regression -- currently we don't
allow the use of the FPU at interrupt time *at all* in eager mode,
which is clearly wrong.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Allow FPU to be used at interrupt time even with eagerfpu
x86, crc32-pclmul: Fix build with older binutils
x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
x86, range: fix missing merge during add range
x86, efi: initial the local variable of DataSize to zero
efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
efivarfs: Never return ENOENT from firmware again
When the directory freespace index grows to a second block (2017
4k data blocks in the directory), the initialisation of the second
new block header goes wrong. The write verifier fires a corruption
error indicating that the block number in the header is zero. This
was being tripped by xfs/110.
The problem is that the initialisation of the new block is done just
fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2
structure to zero on-disk header fields that xfs_dir3_free_get_buf()
has already zeroed. These lined up with the block number in the dir
v3 header format.
While looking at this, I noticed that the struct xfs_dir3_free_hdr()
had 4 bytes of padding in it that wasn't defined as padding or being
zeroed by the initialisation. Add a pad field declaration and fully
zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
that this is never an issue in the future. Note that this doesn't
change the on-disk layout, just makes the 32 bits of padding in the
layout explicit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 5ae6e6a401)
Currently, swapping extents from one inode to another is a simple
act of switching data and attribute forks from one inode to another.
This, unfortunately in no longer so simple with CRC enabled
filesystems as there is owner information embedded into the BMBT
blocks that are swapped between inodes. Hence swapping the forks
between inodes results in the inodes having mapping blocks that
point to the wrong owner and hence are considered corrupt.
To fix this we need an extent tree block or record based swap
algorithm so that the BMBT block owner information can be updated
atomically in the swap transaction. This is a significant piece of
new work, so for the moment simply don't allow swap extent
operations to succeed on CRC enabled filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 02f75405a7)
Currently userspace has no way of determining that a filesystem is
CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
indicate that the filesystem has v5 superblock support enabled.
This will allow xfs_info to correctly report the state of the
filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 74137fff06)
When CRCs are enabled, the number of blocks needed to hold a remote
symlink on a 1k block size filesystem may be 2 instead of 1. The
transaction reservation for the allocated blocks was not taking this
into account and only allocating one block. Hence when trying to
read or invalidate such symlinks, we are mapping a hole where there
should be a block and things go bad at that point.
Fix the reservation to use the correct block count, clean up the
block count calculation similar to the remote attribute calculation,
and add a debug guard to detect when we don't write the entire
symlink to disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 321a95839e)
A long time ago in a galaxy far away....
.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.
And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.
xfstest xfs/085 is failing with this assert:
XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:
struct foo_log_format {
__uint16_t type;
__uint16_t size;
....
As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items. So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.
But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:
Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none
BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000
Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2 *
**********************************************************************
That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.
That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:
(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
blf_data_map = {0xffffffff, 0xffffffff, .... }}
Two regions, and a signle dirty contiguous region of 64 bits. 64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.
But we know that the next region has a header of:
(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
oh_flags = 0x0, oh_res2 = 0x0}
and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.
So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.
First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:
vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
vecp->i_len = nbits * XFS_BLF_CHUNK;
vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
* You would think we need to bump the nvecs here too, but we do not
* this number is used by recovery, and it gets confused by the boundary
* split here
* nvecs++;
*/
vecp++;
And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.
What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.
The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.
Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.
So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....
So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....
.... and there's the problemin log recovery it is apparently working
around:
XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.
Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 709da6a61a)
XFS has failed to kill suid/sgid bits correctly when truncating
files of non-zero size since commit c4ed4243 ("xfs: split
xfs_setattr") introduced in the 3.1 kernel. Fix it.
Fix it.
cc: stable kernel <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 56c19e89b3)
Lockdep reports:
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #3 Not tainted
---------------------------------------------
setquota/28368 is trying to acquire lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
but task is already holding lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
allocated.
xfs_qm_scall_setqlim() is starting a transaction and then not
passing it into xfs_qm_dqet() and so it starts it's own transaction
when allocating the dquot. Splat!
Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
inside the setqlim transaction. This requires getting the dquot
first (and allocating it if necessary) then dropping and relocking
the dquot before joining it to the setqlim transaction.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f648167f3a)
Pull CIFS fixes from Steve French:
"Fixes for a couple of DFS problems, a problem with extended security
negotiation and two other small cifs fixes"
* 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix composing of mount options for DFS referrals
cifs: stop printing the unc= option in /proc/mounts
cifs: fix error handling when calling cifs_parse_devname
cifs: allow sec=none mounts to work against servers that don't support extended security
cifs: fix potential buffer overrun when composing a new options string
cifs: only set ops for inodes in I_NEW state
- Stable fix to prevent an rpc_task wakeup race
- Fix a NFSv4.1 session drain deadlock
- Fix a NFSv4/v4.1 mount regression when not running rpc.gssd
- Ensure auth_gss pipe detection works in namespaces
- Fix SETCLIENTID fallback if rpcsec_gss is not available
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRomGNAAoJEGcL54qWCgDyzUIQALj4hsOtdcmCCAWu0m+Pr+Gz
/6rpO2xJ5cLWyTYS2J9wmPkU+mb/g/OB/OhANyQQsDbfoW3e27TGaXRB8hUs29vk
LHTkFcrBTi4ohlw2Gyb6LWDF6cyHkC7cpH7tfIjLDff77V/qK1uqo41MtYpqUvy3
41tMoFOhvpwsy7HuKqVyYtNlwxnXrdJ5XvK0ycaQYQ9t8om9Hxu/scCgLfl/qwRr
eVhIesC2/Y1eC46UK7/TWCC7aTLkvi85UQY+fywMkajt/gPzjjh6nuhc45aOT9d7
pc0sXgIs8fLEjC+CRa0ojrziJZCHZY93U1kzA+CotRqPq76nPeFeG/1VhViKKb4C
1AU9IA4ntViUqNDQiGrCxE6ZeMewTOKksrCErwSS16Hv9Gqh4SHq84R/bF6MFgBc
dqQsexUbf/vaWHmuosARgbyc/QcBXDwWwbkq0hXSWbHA8vpc8/Lw1wkp49eyKz0V
0zwJ9xInakXr5/DIC72IKEF0Dg26L+GTXWLmDZVcdpVBp5A403trxUKckcsNa/Xf
FXaGMhyv2ZfV4AohW1Z8klkLiMHt4dr+YB7SQAgR/gb81p3odgKgMz51PfmHxFqs
4nxwRQqWplBTGiLrOFSgaRC50jLqEloUweWC2qf56KYEb9N6M6XDIXhllLIMyWLD
94cvihpKj+/ecKH5kQWF
=B8go
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix to prevent an rpc_task wakeup race
- Fix a NFSv4.1 session drain deadlock
- Fix a NFSv4/v4.1 mount regression when not running rpc.gssd
- Ensure auth_gss pipe detection works in namespaces
- Fix SETCLIENTID fallback if rpcsec_gss is not available
* tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix SETCLIENTID fallback if GSS is not available
SUNRPC: Prevent an rpc_task wakeup race
NFSv4.1 Fix a pNFS session draining deadlock
SUNRPC: Convert auth_gss pipe detection to work in namespaces
SUNRPC: Faster detection if gssd is actually running
SUNRPC: Fix a bug in gss_create_upcall
- Fix for corruption with FSX on 512 byte blocksize filesystems
- Fix rounding error in xfs_free_file_space
- Fix use-after-free with extent free intents
- Add several missing KM_NOFS flags to fix lockdep reports
- Several fixes for CRC related code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJRn93mAAoJENaLyazVq6ZOPw8P/1ab7Gr4K7ccpuRvC+JiBPx9
/Eb1l2A68r8rWJq3bEvLM/XgN7F0IESastS+iA8lrNuLWJAHCb+Y37mdbyWymhWO
J6LFg/Mic+n5uGmg6nXrqLJ8epTd3L+V+cCEHLpa/20zEwa5yd8sSTV0P1pfcMRi
S36Uof7boe+s4H/s2z4YX8Y+dkaVn1tN7j4PDs5Zd/wiUq0EKA5le4AkbqIJlCDn
E9mJL1S2t3QwfgQWL+pe8f5jzReGTAxJUQWX2ErajX4vYI7PnR1WTYf7YJB291f0
XPfkjqvaK0WCdipSkgb58t/Rj2IxoHItdryvBHa/rnL4e95aBQMpNyY8btNX28ou
NgcIQnqQB+xlJeHpSyb1xNusamGzWw0CEYsXPcDQbXj1uJjd7/GfqTwtVbB4zOlW
Hua7/6N22vaY1dfQcQFwgi1XcKF9jSVKBVOvy7yHEmdNuT7mFejaG1gVO7NpTIgd
s7R91pQVlYpLTmZHK6ZqZap5OC6kS/YJr9HxVxS3FQhaJmFwGqvf6UUjSWOK1MnJ
obTbbygo2Orvh10lgzDNsd+xRJZ9aFn+UfywSeQGTfm91HvTnEQje3xqctv3zmBy
adrWSXXg/eN07nciNk0zTgHnQZo1v/ZeAFlz0AcSslMMQ7Pq4+Pv4LPtdxMIdrxU
1IZ32GunGTUXqm6GRW6v
=hQCM
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs
Pull xfs fixes from Ben Myers:
"Here are fixes for corruption on 512 byte filesystems, a rounding
error, a use-after-free, some flags to fix lockdep reports, and
several fixes related to CRCs. We have a somewhat larger post -rc1
queue than usual due to fixes related to the CRC feature we merged for
3.10:
- Fix for corruption with FSX on 512 byte blocksize filesystems
- Fix rounding error in xfs_free_file_space
- Fix use-after-free with extent free intents
- Add several missing KM_NOFS flags to fix lockdep reports
- Several fixes for CRC related code"
* tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs:
xfs: remote attribute lookups require the value length
xfs: xfs_attr_shortform_allfit() does not handle attr3 format.
xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
xfs: fix missing KM_NOFS tags to keep lockdep happy
xfs: Don't reference the EFI after it is freed
xfs: fix rounding in xfs_free_file_space
xfs: fix sub-page blocksize data integrity writes
The recent changes overhauling fs/aio.c introduced a bug that results in
the kioctx not being freed when outstanding kiocbs are cancelled at
exit_aio() time. Specifically, a kiocb that is cancelled has its
completion events discarded by batch_complete_aio(), which then fails to
wake up the process stuck in free_ioctx(). Fix this by modifying the
wait_event() condition in free_ioctx() appropriately.
This patch was tested with the cancel operation in the thread based code
posted yesterday.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last time we found there is lock/unlock bug in ocfs2_file_aio_write, and
then we did a thorough search for all lock resources in
ocfs2_inode_info, including rw, inode and open lockres and found this
bug. My kernel version is 3.0.13, and it is also in the lastest version
3.9. In ocfs2_fiemap, once ocfs2_get_clusters_nocache failed, it should
goto out_unlock instead of out, because we need release buffer head, up
read alloc sem and unlock inode.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary
DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.
The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."
SYMPTOMS:
(1) System log contains error messages likewise:
[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only
(2) The NILFS2 file system is remounted in RO mode.
REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
----------------[BEGIN SCRIPT]--------------------
VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------
REPRODUCIBILITY: 100%
INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:
open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)
The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.
The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.
When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().
The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.
As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.
FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.
Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In reviewing man pages, I noticed that io_getevents is documented to
update the timeout that gets passed into the library call. This doesn't
happen in kernel space or in the library (even though it's documented to
do so in both places). Unless there is objection, I'd like to fix the
comments/docs to match the code (I will also update the man page upon
consensus).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 634725a929 ("hfs: cleanup HFS+ prints") removed the BUG_ON in
hfs_bnode_create in hfsplus. This patch removes it from the hfs version
and avoids an fsfuzzer crash.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_file_aio_write(), it does ocfs2_rw_lock() first and then
ocfs2_inode_lock().
But if ocfs2_inode_lock() failed, it goes to out_sems without unlocking
rw lock. This will cause a bug in ocfs2_lock_res_free() when testing
res->l_ex_holders, which is increased in __ocfs2_cluster_lock() and
decreased in __ocfs2_cluster_unlock().
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: "Duyongfeng (B)" <du.duyongfeng@huawei.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Intermediate value of fat_clusters can be overflowed on 32bits arch.
Reported-by: Krzysztof Strasburger <strasbur@chkw386.ch.pwr.wroc.pl>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading a remote attribute, to correctly calculate the length
of the data buffer for CRC enable filesystems, we need to know the
length of the attribute data. We get this information when we look
up the attribute, but we don't store it in the args structure along
with the other remote attr information we get from the lookup. Add
this information to the args structure so we can use it
appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit e461fcb194)
xfstests generic/117 fails with:
XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC)
indicating a function that does not handle the attr3 format
correctly. Fix it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit b38958d715)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 72916fb8cb)
There are several places where we use KM_SLEEP allocation contexts
and use the fact that they are called from transaction context to
add KM_NOFS where appropriate. Unfortunately, there are several
places where the code makes this assumption but can be called from
outside transaction context but with filesystem locks held. These
places need explicit KM_NOFS annotations to avoid lockdep
complaining about reclaim contexts.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit ac14876cf9)
Checking the EFI for whether it is being released from recovery
after we've already released the known active reference is a mistake
worthy of a brown paper bag. Fix the (now) obvious use after free
that it can cause.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 52c24ad39f)
The offset passed into xfs_free_file_space() needs to be rounded
down to a certain size, but the rounding mask is built by a 32 bit
variable. Hence the mask will always mask off the upper 32 bits of
the offset and lead to incorrect writeback and invalidation ranges.
This is not actually exposed as a bug because we writeback and
invalidate from the rounded offset to the end of the file, and hence
the offset we are actually punching a hole out of will always be
covered by the code. This needs fixing, however, if we ever want to
use exact ranges for writeback/invalidation here...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 28ca489c63)
FSX on 512 byte block size filesystems has been failing for some
time with corrupted data. The fault dates back to the change in
the writeback data integrity algorithm that uses a mark-and-sweep
approach to avoid data writeback livelocks.
Unfortunately, a side effect of this mark-and-sweep approach is that
each page will only be written once for a data integrity sync, and
there is a condition in writeback in XFS where a page may require
two writeback attempts to be fully written. As a result of the high
level change, we now only get a partial page writeback during the
integrity sync because the first pass through writeback clears the
mark left on the page index to tell writeback that the page needs
writeback....
The cause is writing a partial page in the clustering code. This can
happen when a mapping boundary falls in the middle of a page - we
end up writing back the first part of the page that the mapping
covers, but then never revisit the page to have the remainder mapped
and written.
The fix is simple - if the mapping boundary falls inside a page,
then simple abort clustering without touching the page. This means
that the next ->writepage entry that write_cache_pages() will make
is the page we aborted on, and xfs_vm_writepage() will map all
sections of the page correctly. This behaviour is also optimal for
non-data integrity writes, as it results in contiguous sequential
writeback of the file rather than missing small holes and having to
write them a "random" writes in a future pass.
With this fix, all the fsx tests in xfstests now pass on a 512 byte
block size filesystem on a 4k page machine.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 49b137cbbc)
With the change to ignore the unc= and prefixpath= mount options, there
is no longer any need to add them to the options string when mounting.
By the same token, we now need to build a device name that includes the
prefixpath when mounting.
To make things neater, the delimiters on the devicename are changed
to '/' since that's preferred when mounting anyway.
v2: fix some comments and don't bother looking at whether there is
a prepath in the ref->node_name when deciding whether to pass
a prepath to cifs_build_devname.
v3: rebase on top of potential buffer overrun fix for stable
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Since we no longer recognize that option, stop printing it out. The
devicename is now the canonical source for this info.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When we allowed separate unc= and prefixpath= mount options, we could
ignore EINVAL errors from cifs_parse_devname. Now that they are
deprecated, we need to check for that as well and fail the mount if it's
malformed.
Also fix a later error message that refers to the unc= option.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In the case of sec=none, we're not sending a username or password, so
there's little benefit to mandating NTLMSSP auth. Allow it to use
unencapsulated auth in that case.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Consider the case where we have a very short ip= string in the original
mount options, and when we chase a referral we end up with a very long
IPv6 address. Be sure to allow for that possibility when estimating the
size of the string to allocate.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's generally not safe to reset the inode ops once they've been set. In
the case where the inode was originally thought to be a directory and
then later found to be a DFS referral, this can lead to an oops when we
try to trigger an inode op on it after changing the ops to the blank
referral operations.
Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Pull CIFS fix from Steve French:
"One cifs fix to merge now - fixes possible DFS oops (I expect to
request a merge of 4 additional cifs fixes next week)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: only set ops for inodes in I_NEW state
Fix build errors by correcting DLM dependencies in GFS2.
Build errors happen when CONFIG_GFS2_FS_LOCKING_DLM=y and CONFIG_DLM=m:
fs/built-in.o: In function `gfs2_lock':
file.c:(.text+0xc7abd): undefined reference to `dlm_posix_get'
file.c:(.text+0xc7ad0): undefined reference to `dlm_posix_unlock'
file.c:(.text+0xc7ad9): undefined reference to `dlm_posix_lock'
fs/built-in.o: In function `gdlm_unmount':
lock_dlm.c:(.text+0xd6e5b): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `sync_unlock':
lock_dlm.c:(.text+0xd6e9e): undefined reference to `dlm_unlock'
fs/built-in.o: In function `sync_lock':
lock_dlm.c:(.text+0xd6fb6): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_put_lock':
lock_dlm.c:(.text+0xd7238): undefined reference to `dlm_unlock'
fs/built-in.o: In function `gdlm_mount':
lock_dlm.c:(.text+0xd753e): undefined reference to `dlm_new_lockspace'
lock_dlm.c:(.text+0xd79d3): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `gdlm_lock':
lock_dlm.c:(.text+0xd8179): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_cancel':
lock_dlm.c:(.text+0xd6b22): undefined reference to `dlm_unlock'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the multi-block allocation code, such that
directory inodes only get a single block reserved in the bitmap.
That way, the bitmaps are more tightly packed together, and there
are fewer spans of free blocks for in-use block reservations.
This means it takes less time to find a free span of blocks in the
bitmap, which speeds things up. This increases the performance of
some workloads by almost 2X. In Nate's mockup.py script (which does
(1) create dir, (2) create dir in dir, (3) create file in that dir)
the test executes in 23 steps rather than 43 steps, a 47%
performance improvement.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes two regression problems that Abhi found in the
GFS2 quota code.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Commit 79d852bf "NFS: Retry SETCLIENTID with AUTH_SYS instead of
AUTH_NONE" did not take into account commit 23631227 "NFSv4: Fix the
fallback to AUTH_NULL if krb5i is not available".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
efivarfs if an EFI variable gets deleted from under us and return EOF
when reading from a zero-length file - Lingzhu Xiang
* Fix an oops in efivar_update_sysfs_entries() caused by reusing (and
therefore corrupting) a kzalloc() allocation - Seiji Aguchi
* Initialise the DataSize argument to GetVariable() otherwise it will
not be updated with the actual size of the variable on return.
Discovered on a Acer Aspire V3 BIOS - Lee, Chun-Yi
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRlgJ5AAoJEC84WcCNIz1VcRgQAJWnQpuI+Y1RjPYuLb8ISZ92
jKnVSAQPaWDccf/KKx7qtk2iK6G1mExomOggp6PpV4/uHWqkQ64qoGL0HmRbmdH7
//gNXm+ITnSVojoTBBis2LYEYDUQ1aDp3vTevFT2OOMKzKRsgwFD5KbqLnPJCxnl
kFHlxQSPa7ZYk6GXOVdBJ5ya7spU+Lk4ODiL4xM8EBkMBedUStedB5zNg5rEMLJq
5FZfs2Ryg7riOkhBnKWCvFjAj3PO/yHLVdrVrGZx+KpATHr/rbsbzDEp3+Pk8zEu
3aE1p/XLaB4Qk9WvwhWonWF7PSZP41SjrcEt77mTmYivk0cJHBORo/wJsom683BU
4elWR588Pi92nKEx4MSLREhkzP5RdVFIb4Yrg+1xKLEfI5Kfj1nEQrFVk53VM/rL
L1m1bwhKLaNUVFJmxJYHaZ6VTjcB6lupz9/AMQr1N1UAavlmr4xEveNTfto+Ys7u
/J3Z04wouhOGQonVDrZxxOzWFewq1kR/hRfwBAf/V0qemAeS61ZHGpfmmq00y5L2
jmXncQ5EspFIJfk7+KV2sPQfqC8U8b2jyfO36Ed9u3d/qQ6eiQ1rTZMLKGg1iSdW
Z+94w7+huEdsNZShSUE8+HXNvzPCS6704PBa9YT292ZibWM9onk6AoBQy/snrEaO
K+PKxkZyPBML77Vs4TSI
=8+a3
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent' into x86/urgent
* Avoid confusing the user by returning -EIO instead of -ENOENT in
efivarfs if an EFI variable gets deleted from under us and return EOF
when reading from a zero-length file - Lingzhu Xiang
* Fix an oops in efivar_update_sysfs_entries() caused by reusing (and
therefore corrupting) a kzalloc() allocation - Seiji Aguchi
* Initialise the DataSize argument to GetVariable() otherwise it will
not be updated with the actual size of the variable on return.
Discovered on a Acer Aspire V3 BIOS - Lee, Chun-Yi
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On a CB_RECALL the callback service thread flushes the inode using
filemap_flush prior to scheduling the state manager thread to return the
delegation. When pNFS is used and I/O has not yet gone to the data server
servicing the inode, a LAYOUTGET can preceed the I/O. Unlike the async
filemap_flush call, the LAYOUTGET must proceed to completion.
If the state manager starts to recover data while the inode flush is sending
the LAYOUTGET, a deadlock occurs as the callback service thread holds the
single callback session slot until the flushing is done which blocks the state
manager thread, and the state manager thread has set the session draining bit
which puts the inode flush LAYOUTGET RPC to sleep on the forechannel slot
table waitq.
Separate the draining of the back channel from the draining of the fore channel
by moving the NFS4_SESSION_DRAINING bit from session scope into the fore
and back slot tables. Drain the back channel first allowing the LAYOUTGET
call to proceed (and fail) so the callback service thread frees the callback
slot. Then proceed with draining the forechannel.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull btrfs fixes from Chris Mason:
"Miao Xie has been very busy, fixing races and enospc problems and many
other small but important pieces.
Alexandre Oliva discovered some problems with how our error handling
was interacting with the block layer and for now has disabled our
partial handling of sub-page writes. The real sub-page work is in a
series of patches from IBM that we still need to integrate and test.
The code Alexandre has turned off was really incomplete.
Josef has more error handling fixes and an important fix for the new
skinny extent format.
This also has my fix for the tracepoint crash from late in 3.9. It's
the first stage in a larger clean up to get rid of btrfs_bio and make
a proper bioset for all the items we need to tack into the bio. For
now the bioset only holds our mirror_num and stripe_index, but for the
next merge window I'll shuffle more in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs: make sure roots are assigned before freeing their nodes
Btrfs: explicitly use global_block_rsv for quota_tree
btrfs: do away with non-whole_page extent I/O
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Btrfs: pause the space balance when remounting to R/O
Btrfs: fix unprotected root node of the subvolume's inode rb-tree
Btrfs: fix accessing a freed tree root
Btrfs: return errno if possible when we fail to allocate memory
Btrfs: update the global reserve if it is empty
Btrfs: don't steal the reserved space from the global reserve if their space type is different
Btrfs: optimize the error handle of use_block_rsv()
Btrfs: don't use global block reservation for inode cache truncation
Btrfs: don't abort the current transaction if there is no enough space for inode cache
Correct allowed raid levels on balance.
Btrfs: fix possible memory leak in replace_path()
Btrfs: fix possible memory leak in the find_parent_nodes()
Btrfs: don't allow device replace on RAID5/RAID6
Btrfs: handle running extent ops with skinny metadata
...
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.
As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.
Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.
This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point. Just add checks to make sure these
roots are set to keep us from panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.
It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>