53526 Commits

Author SHA1 Message Date
Andrei Vagin
d0f0223122 proc: replace seq_printf by seq_put_smth to speed up /proc/pid/status
seq_printf() works slower than seq_puts, seq_puts, etc.

== test_proc.c
int main(int argc, char **argv)
{
	int n, i, fd;
	char buf[16384];

	n = atoi(argv[1]);
	for (i = 0; i < n; i++) {
		fd = open(argv[2], O_RDONLY);
		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) <= 0)
			return 1;
		close(fd);
	}

	return 0;
}
==

$ time ./test_proc  1000000 /proc/1/status

== Before path ==
real	0m5.171s
user	0m0.328s
sys	0m4.783s

== After patch ==
real	0m4.761s
user	0m0.334s
sys	0m4.366s

Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Andrei Vagin
48dffbf82d proc: optimize single-symbol delimiters to spead up seq_put_decimal_ull
A delimiter is a string which is printed before a number.  A
syngle-symbol delimiters can be printed by set_putc() and this works
faster than printing by set_puts().

== test_proc.c

int main(int argc, char **argv)
{
	int n, i, fd;
	char buf[16384];

	n = atoi(argv[1]);
	for (i = 0; i < n; i++) {
		fd = open(argv[2], O_RDONLY);
		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) <= 0)
			return 1;
		close(fd);
	}

	return 0;
}
==

$ time ./test_proc  1000000 /proc/1/stat

== Before patch ==
  real	0m3.820s
  user	0m0.337s
  sys	0m3.394s

== After patch ==
  real	0m3.110s
  user	0m0.324s
  sys	0m2.700s

Link: http://lkml.kernel.org/r/20180212074931.7227-3-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Andrei Vagin
f66406638f proc: replace seq_printf on seq_putc to speed up /proc/pid/smaps
seq_putc() works much faster than seq_printf()

== Before patch ==
  $ time python test_smaps.py
  real    0m3.828s
  user    0m0.413s
  sys     0m3.408s

== After patch ==
  $ time python test_smaps.py
  real	0m3.405s
  user	0m0.401s
  sys	0m3.003s

== Before patch ==
-   75.51%     4.62%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_aligned
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

== After patch ==
-   69.16%     5.70%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 63.46% show_smap.isra.33
      + 25.98% seq_put_decimal_ull_aligned
      + 20.90% __walk_page_range
      + 12.60% show_map_vma.isra.23
        1.56% seq_putc
      + 1.55% seq_puts

Link: http://lkml.kernel.org/r/20180212074931.7227-2-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Andrei Vagin
d1be35cb6f proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.

It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it
works much faster.

== test_smaps.py
  num = 0
  with open("/proc/1/smaps") as f:
          for x in xrange(10000):
                  data = f.read()
                  f.seek(0, 0)
==

== Before patch ==
  $ time python test_smaps.py
  real    0m4.593s
  user    0m0.398s
  sys     0m4.158s

== After patch ==
  $ time python test_smaps.py
  real    0m3.828s
  user    0m0.413s
  sys     0m3.408s

$ perf -g record python test_smaps.py
== Before patch ==
-   79.01%     3.36%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 75.65% show_smap.isra.33
      + 48.85% seq_printf
      + 15.75% __walk_page_range
      + 9.70% show_map_vma.isra.23
        0.61% seq_puts

== After patch ==
-   75.51%     4.62%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_w
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
2acddbe816 proc: account "struct pde_opener"
The allocation is persistent in fact as any fool can open a file in
/proc and sit on it.

Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
195b8cf068 proc: move "struct pde_opener" to kmem cache
"struct pde_opener" is fixed size and we can have more granular approach
to debugging.

For those who don't know, per cache SLUB poisoning and red zoning don't
work if there is at least one object allocated which is hopeless in case
of kmalloc-64 but not in case of standalone cache.  Although systemd
opens 2 files from the get go, so it is hopeless after all.

Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
a9fabc3df4 proc: randomize "struct pde_opener"
The more the merrier.

Link: http://lkml.kernel.org/r/20180214081935.GA17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
e7a6e291e3 proc: faster open/close of files without ->release hook
The whole point of code in fs/proc/inode.c is to make sure ->release
hook is called either at close() or at rmmod time.

All if it is unnecessary if there is no ->release hook.

Save allocation+list manipulations under spinlock in that case.

Link: http://lkml.kernel.org/r/20180214063033.GA15579@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
e74a0effff proc: move /proc/sysvipc creation to where it belongs
Move the proc_mkdir() call within the sysvipc subsystem such that we
avoid polluting proc_root_init() with petty cpp.

[dave@stgolabs.net: contributed changelog]
Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Alexey Dobriyan
2f89742435 proc: do less stuff under ->pde_unload_lock
Commit ca469f35a8e9ef ("deal with races between remove_proc_entry() and
proc_reg_release()") moved too much stuff under ->pde_unload_lock making
a problem described at series "[PATCH v5] procfs: Improve Scaling in
proc" worse.

While RCU is being figured out, move kfree() out of ->pde_unload_lock.

On my potato, difference is only 0.5% speedup with concurrent
open+read+close of /proc/cmdline, but the effect should be more
noticeable on more capable machines.

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     130569.502377      task-clock (msec)         #   15.872 CPUs utilized            ( +-  0.05% )
            19,169      context-switches          #    0.147 K/sec                    ( +-  0.18% )
                15      cpu-migrations            #    0.000 K/sec                    ( +-  3.27% )
               437      page-faults               #    0.003 K/sec                    ( +-  1.25% )
   300,172,097,675      cycles                    #    2.299 GHz                      ( +-  0.05% )
    96,793,267,308      instructions              #    0.32  insn per cycle           ( +-  0.04% )
    22,798,342,298      branches                  #  174.607 M/sec                    ( +-  0.04% )
       111,764,687      branch-misses             #    0.49% of all branches          ( +-  0.47% )

       8.226574400 seconds time elapsed                                          ( +-  0.05% )
       ^^^^^^^^^^^

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     129866.777392      task-clock (msec)         #   15.869 CPUs utilized            ( +-  0.04% )
            19,154      context-switches          #    0.147 K/sec                    ( +-  0.66% )
                14      cpu-migrations            #    0.000 K/sec                    ( +-  1.73% )
               431      page-faults               #    0.003 K/sec                    ( +-  1.09% )
   298,556,520,546      cycles                    #    2.299 GHz                      ( +-  0.04% )
    96,525,366,833      instructions              #    0.32  insn per cycle           ( +-  0.04% )
    22,730,194,043      branches                  #  175.027 M/sec                    ( +-  0.04% )
       111,506,074      branch-misses             #    0.49% of all branches          ( +-  0.18% )

       8.183629778 seconds time elapsed                                          ( +-  0.04% )
       ^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20180213132911.GA24298@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Mateusz Guzik
68c3411ff4 proc: get rid of task lock/unlock pair to read umask for the "status" file
get_task_umask locks/unlocks the task on its own.  The only caller does
the same thing immediately after.

Utilize the fact the task has to be locked anyway and just do it once.
Since there are no other users and the code is short, fold it in.

Link: http://lkml.kernel.org/r/1517995608-23683-1-git-send-email-mguzik@redhat.com
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Andrei Vagin
8cfa67b4d9 procfs: optimize seq_pad() to speed up /proc/pid/maps
seq_printf() is slow and it can be replaced by memset() in this case.

== test.py
  num = 0
  with open("/proc/1/maps") as f:
          while num < 10000 :
                  data = f.read()
                  f.seek(0, 0)
                  num = num + 1
==

== Before patch ==
  $  time python test.py
  real	0m0.986s
  user	0m0.279s
  sys	0m0.707s

== After patch ==
  $ time python test.py
  real	0m0.932s
  user	0m0.261s
  sys	0m0.669s

$ perf record -g python test.py
== Before patch ==
-   47.35%     3.38%  python   [kernel.kallsyms] [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

== After patch ==
-   44.01%     0.34%  python   [kernel.kallsyms] [k] show_pid_map
   - 43.67% show_pid_map
      - 42.91% show_map_vma.isra.23
         + 21.55% seq_path
         - 15.68% show_vma_header_prefix
         + 2.08% seq_pad
        0.55% seq_putc

Link: http://lkml.kernel.org/r/20180112185812.7710-2-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Andrei Vagin
0e3dc01914 procfs: add seq_put_hex_ll to speed up /proc/pid/maps
seq_put_hex_ll() prints a number in hexadecimal notation and works
faster than seq_printf().

== test.py
  num = 0
  with open("/proc/1/maps") as f:
          while num < 10000 :
                  data = f.read()
                  f.seek(0, 0)
                 num = num + 1
==

== Before patch ==
  $  time python test.py

  real	0m1.561s
  user	0m0.257s
  sys	0m1.302s

== After patch ==
  $ time python test.py

  real	0m0.986s
  user	0m0.279s
  sys	0m0.707s

$ perf -g record python test.py:

== Before patch ==
-   67.42%     2.82%  python   [kernel.kallsyms] [k] show_map_vma.isra.22
   - 64.60% show_map_vma.isra.22
      - 44.98% seq_printf
         - seq_vprintf
            - vsnprintf
               + 14.85% number
               + 12.22% format_decode
                 5.56% memcpy_erms
      + 15.06% seq_path
      + 4.42% seq_pad
   + 2.45% __GI___libc_read

== After patch ==
-   47.35%     3.38%  python   [kernel.kallsyms] [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
           10.55% seq_put_hex_ll
         + 2.65% seq_put_decimal_ull
           0.95% seq_putc
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

[avagin@openvz.org: use unsigned int instead of int where it is suitable]
  Link: http://lkml.kernel.org/r/20180214025619.4005-1-avagin@openvz.org
[avagin@openvz.org: v2]
  Link: http://lkml.kernel.org/r/20180117082050.25406-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20180112185812.7710-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Roman Gushchin
f1782c9bc5 dcache: account external names as indirectly reclaimable memory
I received a report about suspicious growth of unreclaimable slabs on
some machines.  I've found that it happens on machines with low memory
pressure, and these unreclaimable slabs are external names attached to
dentries.

External names are allocated using generic kmalloc() function, so they
are accounted as unreclaimable.  But they are held by dentries, which
are reclaimable, and they will be reclaimed under the memory pressure.

In particular, this breaks MemAvailable calculation, as it doesn't take
unreclaimable slabs into account.  This leads to a silly situation, when
a machine is almost idle, has no memory pressure and therefore has a big
dentry cache.  And the resulting MemAvailable is too low to start a new
workload.

To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
used to track the amount of memory, consumed by external names.  The
counter is increased in the dentry allocation path, if an external name
structure is allocated; and it's decreased in the dentry freeing path.

To reproduce the problem I've used the following Python script:

  import os

  for iter in range (0, 10000000):
      try:
          name = ("/some_long_name_%d" % iter) + "_" * 220
          os.stat(name)
      except Exception:
          pass

Without this patch:
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7811688 kB
  $ python indirect.py
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    2753052 kB

With the patch:
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7809516 kB
  $ python indirect.py
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7749144 kB

[guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
  Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
[guro@fb.com: fix indirectly reclaimable memory accounting]
  Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Kyle Spiers
5ac7c2fd6e isofs compress: Remove VLA usage
As part of the effort to remove VLAs from the kernel[1], this changes
the allocation of the bhs and pages arrays from being on the stack to being
kcalloc()ed. This also allows for the removal of the explicit zeroing
of bhs.

https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Kyle Spiers <ksspiers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-11 09:55:40 +02:00
Carlos Maiolino
8c81dd46ef Force log to disk before reading the AGF during a fstrim
Forcing the log to disk after reading the agf is wrong, we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.

This can cause a deadlock when racing a fstrim with a filesystem
shutdown.

The deadlock has been identified due a miscalculation bug in device-mapper
dm-thin, which returns lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.

The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.

If such file is removed, and fstrim executed before the XFS finishes the
shut down process, the fstrim process will end up holding the buffer
lock, and going to sleep on the cil wait queue.

At this point, the shut down process will try to wake up all the threads
waiting on the cil wait queue, but for this, it will try to hold the
same buffer log already held my the fstrim, locking up the filesystem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-10 22:39:04 -07:00
Matthew Wilcox
fbbb450904 Export __set_page_dirty
XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-10 22:39:01 -07:00
Frank Sorenson
98de9ce6f6 NFS: advance nfs_entry cookie only after decoding completes successfully
In nfs[34]_decode_dirent, the cookie is advanced as soon as it is
read, but decoding may still fail later in the function, returning
an error.  Because the cookie has been advanced, the failing entry
is not re-requested from the server, resulting in a missing directory
entry.

In addition, nfs v3 and v4 read the cookie at different locations
in the xdr_stream, so the behavior of the two can be inconsistent.

Fix these by reading the cookie into a temporary variable, and
only advancing the cookie once the entire entry has been decoded
from the xdr_stream successfully.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
chendt
dbc898ae10 NFSv3/acl: forget acl cache after setattr
Sync of ACL with std permissions fail,We need to forget the ACL cache after setattr.

Reproduction:
#!/bin/bash
touch testfile
cat <<EOF >testfile
#!/bin/bash
echo "Test was executed"
EOF
chmod u=rwx testfile
chmod g=rw- testfile
chmod o=r-- testfile

chacl u::r--,g::rwx,o:rw- testfile
chmod u+w testfile
ls -l testfile
chacl -l testfile

Output:
-rw-rwxrw- 1 root root 0 Mar 28 05:29 testfile
testfile [u::r--,g::rwx,o::rw-]

Signed-off-by: chendt.fnst <chendt.fnst@cn.fujitsu.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Kinglong Mee <Kinglong Mee>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
609339c123 NFSv4.1: Fix exclusive create
When we use EXCLUSIVE4_1 mode, the server returns an attribute mask where
all the bits indicate which attributes were set, and where the verifier
was stored. In order to figure out which attribute we have to resend,
we need to clear out the attributes that are set in exclcreat_bitmask.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[Anna: Fixed typo NFS4_CREATE_EXCLUSIVE4 -> NFS4_CREATE_EXCLUSIVE]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
f6cdfa6dd6 NFSv4: Declare the size up to date after it was set.
When we've changed the file size, then ensure we declare it to be
up to date in the inode attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Matthew Wilcox
aae5730e2d nfs: Use ida_simple API
Allocate the owner_id when we allocate the state and free it when we free
the state.  That lets us get rid of a gnarly ida_pre_get() / ida_get_new()
loop.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
35156bfff3 NFSv4: Fix the nfs_inode_set_delegation() arguments
Neither nfs_inode_set_delegation() nor nfs_inode_reclaim_delegation() are
generic code. They have no business delving into NFSv4 OPEN xdr structures,
so let's replace the "struct nfs_openres" parameter.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
8b06494624 NFSv4: Clean up CB_GETATTR encoding
Replace the open coded bitmap implementation with a generic one.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
8bcbe7d98c NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
If we hold a delegation, then the results of the ACCESS call are protected
anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
36b3743fef NFSv4: Add a helper to encode/decode struct timespec
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
40a3426c75 NFSv4: Clean up encode_attrs
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
37c88763de NFSv4; Clean up XDR encoding of type bitmap4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
e8d8aa46be NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
d943f2dd8d NFSv4: Ignore change attribute invalidations if we hold a delegation
Don't bother even recording an invalid change attribute if we hold a
delegation since we already know the state of our attribute cache.
We can rely on the fact that we will pick up a copy from the server
when we return the delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
16e1437517 NFS: More fine grained attribute tracking
Currently, if the NFS_INO_INVALID_ATTR flag is set, for instance by
a call to nfs_post_op_update_inode_locked(), then it will not be cleared
until all the attributes have been revalidated. This means, for instance,
that NFSv4 writes will always force a full attribute revalidation.

Track the ctime, mtime, size and change attribute separately from the
other attributes so that we can have nfs_post_op_update_inode_locked()
set them correctly, and later have the cache consistency bitmask be
able to clear them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
cac88f942d NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
If we managed to revalidate all the attributes, then there is no reason
to mark them as invalid again. We do, however want to ensure that we
set nfsi->attrtimeo correctly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
783b194c6e NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
If we received weak cache consistency data from the server, then those
attributes are up to date, and there is no reason to mark them as
dirty in the attribute cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
8619ddd07b NFS: Don't force a revalidation of all attributes if change is missing
Even if the change attribute is missing, it is still OK to mark the other
attributes as being up to date.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
c01d36457d NFSv4: Don't return the delegation when not needed by NFSv4.x (x>0)
Starting with NFSv4.1, the server is able to deduce the client id from
the SEQUENCE op which means it can always figure out whether or not
the client is holding a delegation on a file that is being changed.
For that reason, RFC5661 does not require a delegation to be unconditionally
recalled on operations such as SETATTR, RENAME, or REMOVE.

Note that for now, we continue to return READ delegations since that is
still expected by the Linux knfsd server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
c135cb39a9 NFS: Remove the unused return_delegation() callback
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
199366f017 NFS: Move the delegation return down into _nfs4_do_setattr()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
977fcc2b0b NFS: Add a delegation return into nfs4_proc_unlink_setup()
Ensure that when we do finally delete the file, then we return the
delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
f2c2c552f1 NFS: Move delegation recall into the NFSv4 callback for rename_setup()
Move the delegation recall out of the generic code, and into the NFSv4
specific callback.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
912678dbc5 NFS: Move the delegation return down into nfs4_proc_remove()
Move the delegation return out of generic code and down into the
NFSv4 specific unlink code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
9f76827287 NFS: Move the delegation return down into nfs4_proc_link()
Move the delegation return out of generic code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust
f50862423f NFSv4: Fix nfs4_return_incompatible_delegation
The 'fmode' argument can take an FMODE_EXEC value, which we want to
filter out before comparing to the delegation type.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton
571745935b nfs4: wake any lock waiters on successful RECLAIM_COMPLETE
If we have a RECLAIM_COMPLETE with a populated cl_lock_waitq, then
that implies that a reconnect has occurred. Since we can't expect a
CB_NOTIFY_LOCK callback at that point, just wake up the entire queue
so that all the tasks can re-poll for their locks.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton
5656610325 nfs4: don't compare clientid in nfs4_wake_lock_waiter
The task is expected to sleep for a while here, and it's possible that
a new EXCHANGE_ID has occurred in the interim, and we were assigned a
new clientid. Since this is a per-client list, there isn't a lot of
value in vetting the clientid on the incoming request.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton
41a7462018 nfs4: always reset notified flag to false before repolling for lock
We may get a notification and lose the race to another client. Ensure
that we wait again for a notification in that case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Linus Torvalds
b284d4d5a6 The big ticket items are:
- support for rbd "fancy" striping (myself).  The striping feature bit
   is now fully implemented, allowing mapping v2 images with non-default
   striping patterns.  This completes support for --image-format 2.
 
 - CephFS quota support (Luis Henriques and Zheng Yan).  This set is
   based on the new SnapRealm code in the upcoming v13.y.z ("Mimic")
   release.  Quota handling will be rejected on older filesystems.
 
 - memory usage improvements in CephFS (Chengguang Xu).  Directory
   specific bits have been split out of ceph_file_info and some effort
   went into improving cap reservation code to avoid OOM crashes.
 
 Also included a bunch of assorted fixes all over the place from
 Chengguang and others.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJazOI/AAoJEEp/3jgCEfOLOu0IAKGFkcCo0UdQDGHHJZHn2rAm
 CSWMMwyYGAhoWI6Gva0jx1A2omZLFSeq/MC8dWLL/MNAKt8i/qo8bTsTrwCHMR2Q
 D0FsvMWIhkWRS1/FcD1uVDhn0a/DFm5Kfy8kzz3v695TDCt+BYWrCqyHTB/wSdRR
 VpO3KdpHQ9h3ojNBRgIniOCNPeQP+QzLXy+P0h0oKbP2Y03mwJlsWG4L6zakkkwT
 e2I+RVdlOMUDJ7rZxiXESBr6BuLI4oOkPe8roQGmZPy1Xe17xa9M5iWVNuM6RUhO
 Z9bS2aLMhbDyeCPqvzgAnsUtFT0PAQjB5NYw2yqisbHs/wrU5kMOOpcLqz/Ls/s=
 =v1I9
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The big ticket items are:

   - support for rbd "fancy" striping (myself).

     The striping feature bit is now fully implemented, allowing mapping
     v2 images with non-default striping patterns. This completes
     support for --image-format 2.

   - CephFS quota support (Luis Henriques and Zheng Yan).

     This set is based on the new SnapRealm code in the upcoming v13.y.z
     ("Mimic") release. Quota handling will be rejected on older
     filesystems.

   - memory usage improvements in CephFS (Chengguang Xu).

     Directory specific bits have been split out of ceph_file_info and
     some effort went into improving cap reservation code to avoid OOM
     crashes.

  Also included a bunch of assorted fixes all over the place from
  Chengguang and others"

* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
  ceph: quota: report root dir quota usage in statfs
  ceph: quota: add counter for snaprealms with quota
  ceph: quota: cache inode pointer in ceph_snap_realm
  ceph: fix root quota realm check
  ceph: don't check quota for snap inode
  ceph: quota: update MDS when max_bytes is approaching
  ceph: quota: support for ceph.quota.max_bytes
  ceph: quota: don't allow cross-quota renames
  ceph: quota: support for ceph.quota.max_files
  ceph: quota: add initial infrastructure to support cephfs quotas
  rbd: remove VLA usage
  rbd: fix spelling mistake: "reregisteration" -> "reregistration"
  ceph: rename function drop_leases() to a more descriptive name
  ceph: fix invalid point dereference for error case in mdsc destroy
  ceph: return proper bool type to caller instead of pointer
  ceph: optimize memory usage
  ceph: optimize mds session register
  libceph, ceph: add __init attribution to init funcitons
  ceph: filter out used flags when printing unused open flags
  ceph: don't wait on writeback when there is no more dirty pages
  ...
2018-04-10 12:25:30 -07:00
Linus Torvalds
9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f145 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Darrick J. Wong
4919d42ab6 xfs: only cancel cow blocks when truncating the data fork
In xfs_itruncate_extents, only cancel cow blocks and clear the reflink
flag if we were asked to truncate the data fork.  Attr fork blocks
cannot be shared, so this makes no sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-04-10 08:28:33 -07:00
David Howells
5a81327616 afs: Do better accretion of small writes on newly created content
Processes like ld that do lots of small writes that aren't necessarily
contiguous result in a lot of small StoreData operations to the server, the
idea being that if someone else changes the data on the server, we only
write our changes over that and not the space between.  Further, we don't
want to write back empty space if we can avoid it to make it easier for the
server to do sparse files.

However, making lots of tiny RPC ops is a lot less efficient for the server
than one big one because each op requires allocation of resources and the
taking of locks, so we want to compromise a bit.

Reduce the load by the following:

 (1) If a file is just created locally or has just been truncated with
     O_TRUNC locally, allow subsequent writes to the file to be merged with
     intervening space if that space doesn't cross an entire intervening
     page.

 (2) Don't flush the file on ->flush() but rather on ->release() if the
     file was open for writing.

Just linking vmlinux.o, without this patch, looking in /proc/fs/afs/stats:

	file-wr : n=441 nb=513581204

and after the patch:

	file-wr : n=62 nb=513668555

there were 379 fewer StoreData RPC operations at the expense of an extra
87K being written.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 21:54:48 +01:00
David Howells
76a5cb6fc1 afs: Add stats for data transfer operations
Add statistics to /proc/fs/afs/stats for data transfer RPC operations.  New
lines are added that look like:

	file-rd : n=55794 nb=10252282150
	file-wr : n=9789 nb=3247763645

where n= indicates the number of ops completed and nb= indicates the number
of bytes successfully transferred.  file-rd is the counts for read/fetch
operations and file-wr the counts for write/store operations.

Note that directory and symlink downloading are included in the file-rd
stats at the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 21:54:48 +01:00