39556 Commits

Author SHA1 Message Date
Weston Andros Adamson
309a1d65b1 nfs: handle overlapping reqs in lock_and_join
This is needed for mirrored DS support, where multuple requests
cover the same range.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03 11:06:44 -08:00
Weston Andros Adamson
180bb5ec06 pnfs: release lseg in pnfs_generic_pg_cleanup
This is needed to support mirrored writes - the first write can't just
trash the lseg, we need to keep it around until all mirrors have
written.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03 11:06:44 -08:00
Weston Andros Adamson
2176bf4269 nfs: introduce pg_cleanup op for pgio descriptors
Add a new operation to nfs_pageio_ops that is called on nfs_pageio_complete.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03 11:06:43 -08:00
Peng Tao
c220106fb4 nfs/filelayout: use pnfs_error_mark_layout_for_return
Instead of calling layoutreturn directly, call pnfs_error_mark_layout_for_return
to mark layouts for return and let generic code return layout when
layout segments are freed.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>

Conflicts:
	fs/nfs/filelayout/filelayout.c
2015-02-03 11:06:43 -08:00
Peng Tao
e736a5b98c nfs41: clear NFS_LAYOUT_RETURN if layoutreturn is sent or failed to send
So that pnfs path is not disabled for ever.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:42 -08:00
Peng Tao
aa1e0e3a8e nfs41: send layoutreturn in last put_lseg
If current lseg is the last lseg marked with NFS_LSEG_LAYOUTRETURN,
send layoutreturn.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:42 -08:00
Peng Tao
ce6ab4f238 nfs41: don't use a layout if it is marked for returning
And if we are to return the same type of layouts, don't bother
sending more layoutgets.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:41 -08:00
Peng Tao
016256df3a nfs41: add a helper to mark layout for return
It marks all matching layout segments as NFS_LSEG_LAYOUTRETURN,
which is an indicator for pnfs_put_lseg() to send layoutreturn,
and also prevents pnfs_update_layout() from using the returning
segments. Once it is set, it never gets cleared.

It also sets proper io failure bit so that pnfs path can be retried
after PNFS_LAYOUTGET_RETRY_TIMEOUT second.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:41 -08:00
Peng Tao
f40eb5d044 nfs41: make a helper function to send layoutreturn
It allows to specify different iomode to return.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:41 -08:00
Peng Tao
4579d6b897 nfs41: pass iomode through layoutreturn args
So that it is possible to return a specific iomode layouts.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:40 -08:00
Peng Tao
aabff4ddca nfs: save server READ/WRITE/COMMIT status
Flexfiles layout would want to use them to report DS IO status.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:40 -08:00
Peng Tao
9bf87482dd nfs41: serialize first layoutget of a file
Per RFC 5661 Errata 3208:
| A client MAY always forget its layout state and associated
| layout stateid at any time (See also section 12.5.5.1).
| In such case, the client MUST use a non-layout stateid for the next
| LAYOUTGET operation. This will signal the server that the client has
| no more layouts on the file and its respective layout state can be
| released before issuing a new layout in response to LAYOUTGET.

In order to make such a signal unique to server, client needs to serialize
all layoutgets using non-layout stateid. We implement this by serializing
layoutgets when client has no layout segments at hand.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:39 -08:00
Peng Tao
abb9a0079c nfs41: close a small race window when adding new layout to global list
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:39 -08:00
Peng Tao
72cff4494e nfs/flexclient: export pnfs_layoutcommit_inode
flexfiles needs to start layoutcommit when necessary

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
2015-02-03 11:06:38 -08:00
Peng Tao
36d3e3dcc9 nfs: set hostname when creating nfsv3 ds connection
lockd assumes hostname exists otherwise kernel oops.
It can be reproduced by following steps:
1. mount flexfile MDS
2. write some files
3. mount DS via nfsv3

BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8134f332>] strlen+0x2/0x20
 PGD 0
 Oops: 0000 [#1] SMP
 Modules linked in: nfsd(F) nfs_layout_flexfiles(F) rpcsec_gss_krb5(F) auth_rpcgss(F) nfsv4(F) dns_resolver(F) nfsv3(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) ebtable_nat(F) nf_conntrack_netbios_ns(F) nf_conntrack_broadcast(F) ipt_MASQUERADE(F) ip6table_nat(F) nf_nat_ipv6(F) ip6table_mangle(F) ip6t_REJECT(F) nf_conntrack_ipv6(F) nf_defrag_ipv6(F) iptable_nat(F) nf_nat_ipv4(F) nf_nat(F) iptable_mangle(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F) ebtable_filter(F) ebtables(F) ip6table_filter(F) ip6_tables(F) bnep(F) snd_ens1371(F) snd_rawmidi(F) snd_ac97_codec(F) btusb(F) ac97_bus(F) snd_seq(F) snd_seq_device(F) snd_pcm(F) ppdev(F) bluetooth(F) 6lowpan_iphc(F) rfkill(F) vmw_balloon(F) snd_timer(F) snd(F) soundcore(F) gameport(F) i2c_piix4(F) e1000(F) vmw_vmci(F) parport_pc(F) parport(F) shpchp(F) uinput(F) xfs(F) libcrc32c(F) vmwgfx(F) ttm(F) drm(F) mptspi(F) scsi_transport_spi(F) mptscsih(F) mptbase(F) i2c_core(F)
 CPU: 0 PID: 10397 Comm: mount.nfs Tainted: GF            3.14.7-100.pd_client.001.fc16.x86_64 #1
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
 task: ffff880008942600 ti: ffff880007990000 task.ti: ffff880007990000
 RIP: 0010:[<ffffffff8134f332>]  [<ffffffff8134f332>] strlen+0x2/0x20
 RSP: 0018:ffff880007991aa0  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff880038d39c20 RCX: 0000000000000004
 RDX: 0000000000000006 RSI: 0000000000000010 RDI: 0000000000000000
 RBP: ffff880007991b38 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000014600 R11: 0000000000000400 R12: ffffffff81cc8580
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000004
 FS:  00007f90cd2ef880(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000001710000 CR4: 00000000001407f0
 Stack:
  ffffffffa045f52c ffff880001782230 ffff880004141e28 0006880007991ac8
  ffffffff816dc14b ffff880000000000 ffff880038d39c20 0000000000000010
  0000000481cc0006 0000000000000000 ffffffffa0410be8 000000000000c014
 Call Trace:
  [<ffffffffa045f52c>] ? nlmclnt_lookup_host+0x4c/0x2c0 [lockd]
  [<ffffffff816dc14b>] ? _raw_spin_unlock_bh+0x1b/0x20
  [<ffffffffa0410be8>] ? svc_destroy+0xb8/0x140 [sunrpc]
  [<ffffffffa045c323>] nlmclnt_init+0x53/0xc0 [lockd]
  [<ffffffffa047d2dc>] ? nfs_get_client+0x1cc/0x340 [nfs]
  [<ffffffffa047c2e7>] nfs_start_lockd+0xa7/0xd0 [nfs]
  [<ffffffffa047df71>] nfs_create_server+0x181/0x5c0 [nfs]
  [<ffffffffa04460f3>] nfs3_create_server+0x13/0x30 [nfsv3]
  [<ffffffffa048a0bc>] nfs_try_mount+0x21c/0x300 [nfs]
  [<ffffffff811ca32d>] ? __kmalloc_track_caller+0x1ad/0x240
  [<ffffffffa048b677>] ? nfs_fs_mount+0xc37/0xd80 [nfs]
  [<ffffffffa048ad05>] nfs_fs_mount+0x2c5/0xd80 [nfs]
  [<ffffffffa048a830>] ? nfs_clone_super+0x140/0x140 [nfs]
  [<ffffffffa048a240>] ? nfs_clone_sb_security+0x40/0x40 [nfs]
  [<ffffffff811e7e43>] mount_fs+0x43/0x1b0
  [<ffffffff81193100>] ? __alloc_percpu+0x10/0x20
  [<ffffffff812026e6>] vfs_kern_mount+0x76/0x120
  [<ffffffff81204917>] do_mount+0x237/0xa80

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-03 11:06:38 -08:00
Trond Myklebust
16cecdf620 NFSv4.1/NFSv3: Add pNFS callbacks for nfs3_(read|write|commit)_done()
Enable pNFS callbacks to allow flex files to work correctly with a
NFSv3-enabled data server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-03 11:06:37 -08:00
Peng Tao
46a5ab4754 nfs: allow to specify cred in nfs_initiate_pgio
so that flexfile layout client can pass in DS credential instead of
using user cred, which will be done in the next patch.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:37 -08:00
Peng Tao
2c4b131dea nfs4: export nfs4_sequence_done
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:36 -08:00
Peng Tao
cb04ad2a2b nfs4: pass slot table to nfs40_setup_sequence
flexclient needs this as there is no nfs_server to DS connection.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:36 -08:00
Peng Tao
c36aae9ad9 nfs: allow different protocol in nfs_initiate_commit
pnfs flexfile layout client may want to use NFSv3 ops rather
than the default MDS v4 ops.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:35 -08:00
Tom Haynes
abde71f4d3 pnfs: Add nfs_rpc_ops in calls to nfs_initiate_pgio
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:35 -08:00
Peng Tao
5f01d95394 nfs41: create NFSv3 DS connection if specified
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:35 -08:00
Peng Tao
30626f9c32 nfs41: allow LD to choose DS connection version/minor_version
flexfile layout may need to set such when making DS connections.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:34 -08:00
Peng Tao
1a04c6e1a2 nfsv3: introduce nfs3_set_ds_client
The flexfiles layout wants to create DS connection over NFSv3.
Add nfs3_set_ds_client to allow that to happen.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:34 -08:00
Peng Tao
39280a5ae8 nfs41: move file layout macros to generic pnfs
They can be reused by flexfile layout as well.

Also add a code such that if read fails on one DS and
there are other DSes available to use, don't resend
through MDS but through pNFS so that client can read
from other DSes.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:33 -08:00
Peng Tao
064172f345 nfs41: allow LD to choose DS connection auth flavor
flexfile layout may use different auth flavor as specified by MDS.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:33 -08:00
Peng Tao
7405f9e195 nfs41: pull nfs4_ds_connect from file layout to generic pnfs
It can be reused by flexfiles layout client.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Peng Tao
6b7f3cf963 nfs41: pull decode_ds_addr from file layout to generic pnfs
It can be reused by flexfile layout.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Peng Tao
875ae0694b nfs41: pull data server cache from file layout to generic pnfs
Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs.

They can all be reused by flexfile layout as well.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Tom Haynes
085d1e33a6 pnfs: Do not grab the commit_info lock twice when rescheduling writes
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Tom Haynes
f54bcf2ece pnfs: Prepare for flexfiles by pulling out common code
The flexfilelayout driver will share some common code
with the filelayout driver. This set of changes refactors
that common code out to avoid any module depenencies.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Gui Hecheng
b76808fc26 btrfs: cleanup init for list in free-space-cache
o removed an unecessary INIT_LIST_HEAD after LIST_HEAD

o merge a declare & INIT_LIST_HEAD pair into one LIST_HEAD

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:50 -08:00
Shaohua Li
2f0810880f btrfs: delete chunk allocation attemp when setting block group ro
Below test will fail currently:
      mkfs.ext4 -F /dev/sda
      btrfs-convert /dev/sda
      mount /dev/sda /mnt
      btrfs device add -f /dev/sdb /mnt
      btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt

The reason is there are some block groups with usage 0, but the whole
disk hasn't free space to allocate new chunk, so we even can't set such
block group readonly. This patch deletes the chunk allocation when
setting block group ro. For META, we already have reserve. But for
SYSTEM, we don't have, so the check_system_chunk is still required.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:20 -08:00
Naohiro Aota
289454ad26 btrfs: clear bio reference after submit_one_bio()
After submit_one_bio(), `bio' can go away. However submit_extent_page()
leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
It will cause invalid paging request when submit_extent_page() is called
next time.

I reproduced ENOMEM case with the following script (need
CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).

  #!/bin/bash

  dmesgout=dmesg.txt
  start=100000
  end=300000
  step=1000

  # btrfs options
  device=/dev/vdb1
  directory=/mnt/btrfs

  # fault-injection options
  percent=100
  times=3

  mkdir -p $directory || exit 1
  mount -o compress $device $directory || exit 1

  rm -f $directory/file || exit 1
  dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1

  for interval in `seq $start $step $end`; do
          dmesg -C
          echo 1 > /proc/sys/vm/drop_caches
          sync
          export FAILCMD_TYPE=fail_page_alloc
          ./failcmd.sh -p $percent -t $times -i $interval \
                  --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                  -- \
                  cat $directory/file > /dev/null
          dmesg > ${dmesgout}
          if grep -q BUG: ${dmesgout}; then
                  cat ${dmesgout}
                  exit 1
          fi
  done

  umount $directory
  exit 0

Signed-off-by: Naohiro Aota <naota@elisp.net>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:51 -08:00
Filipe Manana
de554a4fa6 Btrfs: fix scrub race leading to use-after-free
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a race that can happen in a very tiny time window and is
illustrated by the following sequence diagram:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       atomic_dec(&sctx->bios_in_flight)
                                                                   wait sctx->bios_in_flight == 0
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
       wake_up(&sctx->list_wait)
          __wake_up()
              spin_lock_irqsave(&sctx->list_wait->lock, flags)

Another variation of this scenario that results in the same use-after-free
issue is:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
                                                                   wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)

Fix this by holding the scrub lock while doing the wakeup.

This isn't a recent regression, the issue as been around since the scrub
feature was added (2011, commit a2de733c78fa7af51ba9670482fa7d392aa67c57).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:50 -08:00
Filipe Manana
001a648df4 Btrfs: add missing cleanup on sysfs init failure
If we failed during initialization of sysfs, we weren't unregistering the
top level btrfs sysfs entry nor the debugfs stuff.
Not unregistering the top level sysfs entry makes future attempts to reload
the btrfs module impossible and the following is reported in dmesg:

[ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0()
[ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs'
[ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs]
[ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G        W    3.13.0-fdm-btrfs-next-24+ #7
[ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2246.451312]  0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410
[ 2246.451314]  ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290
[ 2246.451316]  ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80
[ 2246.451318] Call Trace:
[ 2246.451322]  [<ffffffff816f1da6>] dump_stack+0x4e/0x68
[ 2246.451324]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
[ 2246.451325]  [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50
[ 2246.451328]  [<ffffffff81367dc5>] ? strlcat+0x65/0x90
(....)

This fixes the following change:

    btrfs: add simple debugfs interface
    commit 1bae30982bc86ab66d61ccb6e22792593b45d44d

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:49 -08:00
Filipe Manana
d4b450cd4b Btrfs: fix race between transaction commit and empty block group removal
Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

           CPU 1                                       CPU 2

btrfs_commit_transaction()
  fs_info->running_transaction = NULL
  btrfs_finish_extent_commit()
    find_first_extent_bit()
      -> found range for block group X
         in fs_info->freed_extents[]

                                               btrfs_delete_unused_bgs()
                                                 -> found block group X

                                                 Removed block group X's range
                                                 from fs_info->freed_extents[]

                                                 btrfs_remove_chunk()
                                                    btrfs_remove_block_group(bg X)

    unpin_extent_range(bg X range)
       btrfs_lookup_block_group(bg X)
          -> returns NULL
            -> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388]  RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpining the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs()

This race got introduced with the change:

    Btrfs: remove empty block groups automatically
    commit 47ab2a6c689913db23ccae38349714edf8365e0a

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:48 -08:00
David Sterba
e3540eab29 btrfs: add more checks to btrfs_read_sys_array
Verify that the sys_array has enough bytes to read the next item.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:39 -08:00
David Sterba
1ffb22cf8c btrfs: cleanup, rename a few variables in btrfs_read_sys_array
There's a pointer to buffer, integer offset and offset passed as
pointer, try to find matching names for them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:28 -08:00
David Sterba
ce7fca5f57 btrfs: add checks for sys_chunk_array sizes
Verify that possible minimum and maximum size is set, validity of
contents is checked in btrfs_read_sys_array.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:23:43 -08:00
David Sterba
75d6ad382b btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
I received a few crafted images from Jiri, all got through the recently
added superblock checks. The lower bounds checks for num_devices and
sector/node -sizes were missing and caused a crash during mount.

Tools for symbolic code execution were used to prepare the images
contents.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:20:39 -08:00
chandan r
9cc97d6462 Btrfs: Add code to support file creation time
This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.

Signed-off-by: chandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: David Sterba <dsterba@suse.cz>

Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:16 -08:00
David Sterba
a937b9791e btrfs: kill btrfs_inode_*time helpers
They just opencode taking address of the timespec member.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:07 -08:00
Markus Elfring
648695c748 jfs: Deletion of an unnecessary check before the function call "unload_nls"
The unload_nls() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2015-02-02 15:02:34 -06:00
Christoph Hellwig
31ef83dc05 nfsd: add trace events
For now just a few simple events to trace the layout stateid lifetime, but
these already were enough to find several bugs in the Linux client layout
stateid handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:44 +01:00
Christoph Hellwig
c5c707f96f nfsd: implement pNFS layout recalls
Add support to issue layout recalls to clients.  For now we only support
full-file recalls to get a simple and stable implementation.  This allows
to embedd a nfsd4_callback structure in the layout_state and thus avoid
any memory allocations under spinlocks during a recall.  For normal
use cases that do not intent to share a single file between multiple
clients this implementation is fully sufficient.

To ensure layouts are recalled on local filesystem access each layout
state registers a new FL_LAYOUT lease with the kernel file locking code,
which filesystems that support pNFS exports that require recalls need
to break on conflicting access patterns.

The XDR code is based on the old pNFS server implementation by
Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
Marc Eshel, Mike Sager and Ricardo Labiaga.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:43 +01:00
Christoph Hellwig
9cf514ccfa nfsd: implement pNFS operations
Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
outstanding layouts and devices.

Layout management is very straight forward, with a nfs4_layout_stateid
structure that extends nfs4_stid to manage layout stateids as the
top-level structure.  It is linked into the nfs4_file and nfs4_client
structures like the other stateids, and contains a linked list of
layouts that hang of the stateid.  The actual layout operations are
implemented in layout drivers that are not part of this commit, but
will be added later.

The worst part of this commit is the management of the pNFS device IDs,
which suffers from a specification that is not sanely implementable due
to the fact that the device-IDs are global and not bound to an export,
and have a small enough size so that we can't store the fsid portion of
a file handle, and must never be reused.  As we still do need perform all
export authentication and validation checks on a device ID passed to
GETDEVICEINFO we are caught between a rock and a hard place.  To work
around this issue we add a new hash that maps from a 64-bit integer to a
fsid so that we can look up the export to authenticate against it,
a 32-bit integer as a generation that we can bump when changing the device,
and a currently unused 32-bit integer that could be used in the future
to handle more than a single device per export.  Entries in this hash
table are never deleted as we can't reuse the ids anyway, and would have
a severe lifetime problem anyway as Linux export structures are temporary
structures that can go away under load.

Parts of the XDR data, structures and marshaling/unmarshaling code, as
well as many concepts are derived from the old pNFS server implementation
from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
Mike Sager, Ricardo Labiaga and many others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:42 +01:00
Christoph Hellwig
4d227fca1b nfsd: make find_any_file available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:41 +01:00
Christoph Hellwig
e6ba76e194 nfsd: make find/get/put file available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:41 +01:00
Christoph Hellwig
cd61c52231 nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:40 +01:00