Go to file
Vivek Goyal 9d769e6aa2 fuse: support SB_NOSEC flag to improve write performance
Virtiofs can be slow with small writes if xattr are enabled and we are
doing cached writes (No direct I/O).  Ganesh Mahalingam noticed this.

Some debugging showed that file_remove_privs() is called in cached write
path on every write.  And everytime it calls security_inode_need_killpriv()
which results in call to __vfs_getxattr(XATTR_NAME_CAPS).  And this goes to
file server to fetch xattr.  This extra round trip for every write slows
down writes tremendously.

Normally to avoid paying this penalty on every write, vfs has the notion of
caching this information in inode (S_NOSEC).  So vfs sets S_NOSEC, if
filesystem opted for it using super block flag SB_NOSEC.  And S_NOSEC is
cleared when setuid/setgid bit is set or when security xattr is set on
inode so that next time a write happens, we check inode again for clearing
setuid/setgid bits as well clear any security.capability xattr.

This seems to work well for local file systems but for remote file systems
it is possible that VFS does not have full picture and a different client
sets setuid/setgid bit or security.capability xattr on file and that means
VFS information about S_NOSEC on another client will be stale.  So for
remote filesystems SB_NOSEC was disabled by default.

Commit 9e1f1de02c ("more conservative S_NOSEC handling") mentioned that
these filesystems can still make use of SB_NOSEC as long as they clear
S_NOSEC when they are refreshing inode attriutes from server.

So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as
virtiofs).  And clear SB_NOSEC when we are refreshing inode attributes.

This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2.  This says
that server will clear setuid/setgid/security.capability on
chown/truncate/write as apporpriate.

This should provide tighter coherency because now suid/sgid/
security.capability will be cleared even if fuse client cache has not seen
these attrs.

Basic idea is that fuse client will trigger suid/sgid/security.capability
clearing based on its attr cache.  But even if cache has gone stale, it is
fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear
suid/sgid/security.capability.

We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2.  This
should make sure that existing filesystems which might be relying on
seucurity.capability always being queried from server are not impacted.

This tighter coherency relies on WRITE showing up on server (and not being
cached in guest).  So writeback_cache mode will not provide that tight
coherency and it is not recommended to use two together.  Having said that
it might work reasonably well for lot of use cases.

This change improves random write performance very significantly.  Running
virtiofsd with cache=auto and following fio command:

fio --ioengine=libaio --direct=1  --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

Bandwidth increases from around 50MB/s to around 250MB/s as a result of
applying this patch.  So improvement is very significant.

Link: https://github.com/kata-containers/runtime/issues/2815
Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11 17:22:33 +01:00
arch treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
block block-5.10-2020-10-24 2020-10-24 12:46:42 -07:00
certs .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
crypto drivers-5.10-2020-10-12 2020-10-13 13:04:41 -07:00
Documentation xen: branch for v5.10-rc1c 2020-10-25 10:55:35 -07:00
drivers treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
fs fuse: support SB_NOSEC flag to improve write performance 2020-11-11 17:22:33 +01:00
include fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() request 2020-11-11 17:22:33 +01:00
init linux-kselftest-kunit-5.10-rc1 2020-10-18 14:45:59 -07:00
ipc ipc: adjust proc_ipc_sem_dointvec definition to match prototype 2020-09-05 12:14:29 -07:00
kernel treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
lib random32: make prandom_u32() less predictable 2020-10-25 10:40:08 -07:00
LICENSES LICENSES/deprecated: add Zlib license text 2020-09-16 14:33:49 +02:00
mm Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
net mm: remove kzfree() compatibility definition 2020-10-25 11:39:02 -07:00
samples bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static 2020-10-22 01:46:52 +02:00
scripts treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
security SafeSetID changes for v5.10 2020-10-25 10:45:26 -07:00
sound ARM: SoC platform updates 2020-10-24 10:33:08 -07:00
tools treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
usr Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-08-07 13:29:39 -07:00
virt kvm: x86/mmu: Support dirty logging for the TDP MMU 2020-10-23 03:42:13 -04:00
.clang-format RDMA 5.10 pull request 2020-10-17 11:18:18 -07:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: docs: ignore sphinx_*/ directories 2020-09-10 10:44:31 -06:00
.mailmap MAINTAINERS: jarkko.sakkinen@linux.intel.com -> jarkko@kernel.org 2020-10-16 11:11:19 -07:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS networking changes for the 5.10 merge window 2020-10-15 18:42:13 -07:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS ARM: Devicetree updates 2020-10-24 10:44:18 -07:00
Makefile Linux 5.10-rc1 2020-10-25 15:14:11 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.