linux

iv/linux

Author	SHA1	Message	Date
Linus Torvalds	7f198ba7ae	affs-for-6.1-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmM71hEACgkQxWXV+ddt WDsiGg/9EGardWzoQs/YCexFxjDTnIQxaTmoxN2igfsNJ7PwQQDb2R2hUSSQgarp vepmjkO0YdFI4Hlym81x5t/cRmdGYz1fjt/0nBibhrhxJBUi3S6CcSOABuOdP699 E9Iuzyx6YRa4eJN5+rmyDj/51sOgZNGZ+cMKY8ukkXk1Y/VBfOzU/yIoLVjNKPMJ UZwXN8C/Tvh/wbE4uBYu5G4dbXfC+qx3ywz/+KdccmSm1iPgA1NzmbAJDFjyCVKI R0qub1qXzOTqM0Y1uAu1+LjS2mIv1uASCt2MXt1ragZJsClFK8k+pQwHU9/m4S4G lOtUobsfrAPN8laH9B6aIWJMTSY2hglQxbKLFXAy12znlHoPXa5VANNh7KIurBNH xtV9eFjcc1b3CJ+nF1NaekamFhWFDSL1kHQpu16gyTZc1p9wkbAh8mYSl0LVHOCp n7ZiSJ1Gq5+WdbTMvjVAlgObtzpjOkvl0MpsDRqBn7fva/CNYI+D1NWbf2uP9NyY Pe9XO8w7bNXRLyV8vKtISrCnxULGuXxsXPG23K8nd78h5ZPrjCyU3RQaAANFfgDe YEzCXF+WKLT8UflJOB1l9rYnY6VrwlktVYOHgHRT94CHPn7vJkCf2HVlOg28sBnM 39sspyIyO9b55IsWyY0RxRwdQ6mMTXnJqxG5GgorMQ+ZrBg9BPk= =fNXH -----END PGP SIGNATURE----- Merge tag 'affs-for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs update from David Sterba: "One minor update for AFFS, switching away from strlcpy" * tag 'affs-for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: move from strlcpy with unused retval to strscpy	2022-10-06 17:42:29 -07:00
Linus Torvalds	76e4503534	for-6.1-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmM6zNkACgkQxWXV+ddt WDsNMg/+LTuwf6Js+mAl1AgtSpLOl2gLfNBJAUXhzwPbc3nF9bwONE/EUYEXTo5h kTf1cQRj0NCIZ7iHDwXuWNm77diNl+SChEDIoc7k0d6P7Qmmn2AWbTLM4dleyg5S 6jxPpOMbegycQfL9tSJNaiT9zlZxj9Z+0yPibR99otrgtuv6zuvRxcdh34rEFIyf xoabO3/18lAKHzYzAZxNXMpbUSBmqLPVoZEOcfBAXvcuIJkzKRP6Y9gwlYs+kn+D J8BPa3LoSNxXrpCvWzlu7vO3gwNp7H7pQQqZKjjEcOZ+dj2UYQeTyJvl1vdzaNyk EoFYlkaKkYi7RaonuHjNaTeD/igJf8Eo6DTiXzACECssbKutlvNG4HXuFApsWy7M T7KZ5jTAQ98ZMYjgZ27UbEpFZd8lYHzV952Njjo9zbRVbqwaPEZTTdkjpz+3X6t4 Z0A951ixOYKiOVdu3Uj1fHaBv0n/p0wrXIGt3ZIdjufM9TctV3oJwOZOiM2H0ccb XJVwsQG92+ja9XLZrw8H62PCKBYo3LL52r9b9NVodY9aTsQWTfiV5OP84RRlncCp hzPkHmO1YIyVcLoijagiO7cW21pQbKfqsRX/P1F7DXyjosHppmDS7IHDWA7Adf3W QA6eBnoWqVwBh7P+IyxJuRG0CrnxkPZeAZIhohDwk5Mt4NGATkA= =NlUz -----END PGP SIGNATURE----- Merge tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There's a bunch of performance improvements, most notably the FIEMAP speedup, the new block group tree to speed up mount on large filesystems, more io_uring integration, some sysfs exports and the usual fixes and core updates. Summary: Performance: - outstanding FIEMAP speed improvement - algorithmic change how extents are enumerated leads to orders of magnitude speed boost (uncached and cached) - extent sharing check speedup (2.2x uncached, 3x cached) - add more cancellation points, allowing to interrupt seeking in files with large number of extents - more efficient hole and data seeking (4x uncached, 1.3x cached) - sample results: 256M, 32K extents: 4s -> 29ms (~150x) 512M, 64K extents: 30s -> 59ms (~550x) 1G, 128K extents: 225s -> 120ms (~1800x) - improved inode logging, especially for directories (on dbench workload throughput +25%, max latency -21%) - improved buffered IO, remove redundant extent state tracking, lowering memory consumption and avoiding rb tree traversal - add sysfs tunable to let qgroup temporarily skip exact accounting when deleting snapshot, leading to a speedup but requiring a rescan after that, will be used by snapper - support io_uring and buffered writes, until now it was just for direct IO, with the no-wait semantics implemented in the buffered write path it now works and leads to speed improvement in IOPS (2x), throughput (2.2x), latency (depends, 2x to 150x) - small performance improvements when dropping and searching for extent maps as well as when flushing delalloc in COW mode (throughput +5MB/s) User visible changes: - new incompatible feature block-group-tree adding a dedicated tree for tracking block groups, this allows a much faster load during mount and avoids seeking unlike when it's scattered in the extent tree items - this reduces mount time for many-terabyte sized filesystems - conversion tool will be provided so existing filesystem can also be updated in place - to reduce test matrix and feature combinations requires no-holes and free-space-tree (mkfs defaults since 5.15) - improved reporting of super block corruption detected by scrub - scrub also tries to repair super block and does not wait until next commit - discard stats and tunables are exported in sysfs (/sys/fs/btrfs/FSID/discard) - qgroup status is exported in sysfs (/sys/sys/fs/btrfs/FSID/qgroups/) - verify that super block was not modified when thawing filesystem Fixes: - FIEMAP fixes - fix extent sharing status, does not depend on the cached status where merged - flush delalloc so compressed extents are reported correctly - fix alignment of VMA for memory mapped files on THP - send: fix failures when processing inodes with no links (orphan files and directories) - fix race between quota enable and quota rescan ioctl - handle more corner cases for read-only compat feature verification - fix missed extent on fsync after dropping extent maps Core: - lockdep annotations to validate various transactions states and state transitions - preliminary support for fs-verity in send - more effective memory use in scrub for subpage where sector is smaller than page - block group caching progress logic has been removed, load is now synchronous - simplify end IO callbacks and bio handling, use chained bios instead of own tracking - add no-wait semantics to several functions (tree search, nocow, flushing, buffered write - cleanups and refactoring MM changes: - export balance_dirty_pages_ratelimited_flags" * tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (177 commits) btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer btrfs: drop extent map range more efficiently btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: remove unnecessary next extent map search btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary extent map initializations btrfs: remove the refcount warning/check at free_extent_map() btrfs: add helper to replace extent map range with a new extent map btrfs: move open coded extent map tree deletion out of inode eviction btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: fix missed extent on fsync after dropping extent maps btrfs: remove stale prototype of btrfs_write_inode btrfs: enable nowait async buffered writes btrfs: assert nowait mode is not used for some btree search functions btrfs: make btrfs_buffered_write nowait compatible btrfs: plumb NOWAIT through the write path btrfs: make lock_and_cleanup_extent_if_need nowait compatible ...	2022-10-06 17:36:48 -07:00
Linus Torvalds	4c0ed7d8d6	whack-a-mole: constifying struct path * -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxmRQAKCRBZ7Krx/gZQ 6+/kAQD2xyf+i4zOYVBr1NB3qBbhVS1zrni1NbC/kT3dJPgTvwEA7z7eqwnrN4zg scKFP8a3yPoaQBfs4do5PolhuSr2ngA= =NBI+ -----END PGP SIGNATURE----- Merge tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs constification updates from Al Viro: "whack-a-mole: constifying struct path " tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs: constify path spufs: constify path nd_jump_link(): constify path audit_init_parent(): constify path __io_setxattr(): constify path do_proc_readlink(): constify path overlayfs: constify path fs/notify: constify path may_linkat(): constify path do_sys_name_to_handle(): constify path ->getprocattr(): attribute name is const char *, TYVM...	2022-10-06 17:31:02 -07:00
Linus Torvalds	ab29622157	whack-a-mole: cropped up open-coded file_inode() uses... -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxj0gAKCRBZ7Krx/gZQ 66/1AQC/KfIAINNOPxozsZaxOaOKo0ouVJ7sJV4ZGsPKpU69gwD/UodJZCtyZ52h wwkmfzTDjAgGt1QCKj96zk2XFqg4swE= =u0pv -----END PGP SIGNATURE----- Merge tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file_inode() updates from Al Vrio: "whack-a-mole: cropped up open-coded file_inode() uses..." * tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: use ->f_mapping _nfs42_proc_copy(): use ->f_mapping instead of file_inode()->i_mapping dma_buf: no need to bother with file_inode()->i_mapping nfs_finish_open(): don't open-code file_inode() bprm_fill_uid(): don't open-code file_inode() sgx: use ->f_mapping... exfat_iterate(): don't open-code file_inode(file) ibmvmc: don't open-code file_inode()	2022-10-06 17:22:11 -07:00
Linus Torvalds	7a3353c5c4	struct file-related stuff -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxjIQAKCRBZ7Krx/gZQ 6/FPAQCNCZygQzd+54//vo4kTwv5T2Bv3hS8J51rASPJT87/BQD/TfCLS5urt/Gt 81A1dFOfnTXseofuBKyGSXwQm0dWpgA= =PLre -----END PGP SIGNATURE----- Merge tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs file updates from Al Viro: "struct file-related stuff" * tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dma_buf_getfile(): don't bother with ->f_flags reassignments Change calling conventions for filldir_t locks: fix TOCTOU race when granting write lease	2022-10-06 17:13:18 -07:00
Linus Torvalds	70df64d6c6	d_path pile -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxjQAAKCRBZ7Krx/gZQ 683pAP9oSHaXo3Twl6rweirNbHocgm8MynCgIU3bpzeVPi6Z1wEApfEq4IInWQyL R6ObOneoSobi+9Iaqsoe+uKu54MghAY= =rt7w -----END PGP SIGNATURE----- Merge tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_path updates from Al Viro. * tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_path.c: typo fix... dynamic_dname(): drop unused dentry argument	2022-10-06 16:55:41 -07:00
Linus Torvalds	46811b5cb3	saner inode_init_always() -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxivgAKCRBZ7Krx/gZQ 66iyAQD2btSJlwKoqMlo+Xnj0J7nvHEYpFOf+rMWa/1SJIJOnAEAr9VYpgstELHP bAZ3EbGyLPZbLPnmiTzrDlvqZDbKDgs= =4e/D -----END PGP SIGNATURE----- Merge tag 'pull-inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs inode update from Al Viro: "Saner inode_init_always(), also fixing a nilfs problem" * tag 'pull-inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix UAF/GPF bug in nilfs_mdt_destroy	2022-10-06 16:49:00 -07:00
Linus Torvalds	0326074ff4	Networking changes for 6.1. Core ---- - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF --- - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols --------- - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API ---------- - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers ---------------------- - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers ------- - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmM7vtkACgkQMUZtbf5S Irvotg//dmh53rC+UMKO3OgOqPlSMnaqzbUdDEfN6mj4Mpox7Csb8zERVURHhBHY fvlXWsDgxmvgTebI5fvNC5+f1iW5xcqgJV2TWnNmDOKWwvQwb6qQfgixVmunvkpe IIukMXYt0dAf9bXeeEfbNXcCb85cPwB76stX0tMV6BX7osp3T0TL1fvFk0NJkL0j TeydLad/yAQtPb4TbeWYjNDoxPVDf0cVpUrevLGmWE88UMYmgTqPze+h1W5Wri52 bzjdLklY/4cgcIZClHQ6F9CeRWqEBxvujA5Hj/cwOcn/ptVVJWUGi7sQo3sYkoSs HFu+F8XsTec14kGNC0Ab40eVdqs5l/w8+E+4jvgXeKGOtVns8DwoiUIzqXpyty89 Ib04mffrwWNjFtHvo/kIsNwP05X2PGE9HUHfwsTUfisl/ASvMmQp7D7vUoqQC/4B AMVzT5qpjkmfBHYQQGuw8FxJhMeAOjC6aAo6censhXJyiUhIfleQsN0syHdaNb8q 9RZlhAgQoVb6ZgvBV8r8unQh/WtNZ3AopwifwVJld2unsE/UNfQy2KyqOWBES/zf LP9sfuX0JnmHn8s1BQEUMPU1jF9ZVZCft7nufJDL6JhlAL+bwZeEN4yCiAHOPZqE ymSLHI9s8yWZoNpuMWKrI9kFexVnQFKmA3+quAJUcYHNMSsLkL8= =Gsio -----END PGP SIGNATURE----- Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...	2022-10-04 13:38:03 -07:00
Linus Torvalds	725737e7c2	STATX_DIOALIGN for 6.1 Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpV2xQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKwF1AQDetPX5hyuq0/mwikOywLTTJsoHgGY5 euO+dISqjH/InwD9HAQqfPRkdM1j4ml82BjjkAfrhzZXOOWPKJm0zOhMIQg= =0Oav -----END PGP SIGNATURE----- Merge tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull STATX_DIOALIGN support from Eric Biggers: "Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update[1]" Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1] * tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: xfs: support STATX_DIOALIGN f2fs: support STATX_DIOALIGN f2fs: simplify f2fs_force_buffered_io() f2fs: move f2fs_force_buffered_io() into file.c ext4: support STATX_DIOALIGN fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN vfs: support STATX_DIOALIGN on block devices statx: add direct I/O alignment information	2022-10-03 20:33:41 -07:00
Linus Torvalds	5779aa2dac	fsverity updates for 6.1 Minor changes to convert uses of kmap() to kmap_local_page(). -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpKvRQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK9KSAP9PyrI3kDLBpiG9os3HIZtHyPHgt6OZ FA978i0UuAxgHAD7BPiIT55oBdOrn6CVy2g8PkwmkcKQx0kvhVQq8Pyz5ww= =49pm -----END PGP SIGNATURE----- Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "Minor changes to convert uses of kmap() to kmap_local_page()" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fs-verity: use kmap_local_page() instead of kmap() fs-verity: use memcpy_from_page()	2022-10-03 20:27:34 -07:00
Linus Torvalds	438b2cdd17	fscrypt updates for 6.1 This release contains some implementation changes, but no new features: - Rework the implementation of the fscrypt filesystem-level keyring to not be as tightly coupled to the keyrings subsystem. This resolves several issues. - Eliminate most direct uses of struct request_queue from fs/crypto/, since struct request_queue is considered to be a block layer implementation detail. - Stop using the PG_error flag to track decryption failures. This is a prerequisite for freeing up PG_error for other uses. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpMMRQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxYbAP0VrWjlqonO75gYkIxwX0aTxajoKC3m awUDAC/feQ910gD6A4WbJivanLngJKgcxfbhN5paalZJEGNOBBrOUB1WLgs= =CxSh -----END PGP SIGNATURE----- Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt updates from Eric Biggers: "This release contains some implementation changes, but no new features: - Rework the implementation of the fscrypt filesystem-level keyring to not be as tightly coupled to the keyrings subsystem. This resolves several issues. - Eliminate most direct uses of struct request_queue from fs/crypto/, since struct request_queue is considered to be a block layer implementation detail. - Stop using the PG_error flag to track decryption failures. This is a prerequisite for freeing up PG_error for other uses" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: work on block_devices instead of request_queues fscrypt: stop holding extra request_queue references fscrypt: stop using keyrings subsystem for fscrypt_master_key fscrypt: stop using PG_error to track error status fscrypt: remove fscrypt_set_test_dummy_encryption()	2022-10-03 20:18:34 -07:00
Linus Torvalds	f4309528f3	dlm for 6.1 This set of commits includes: . Fix a couple races found with a new torture test. . Improve errors when api functions are used incorrectly. . Improve tracing for lock requests from user space. . Fix use after free in recently added tracing code. . Small internal code cleanups. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJjOyfeAAoJEDgbc8f8gGmqHF4QALKGo+95JGzfXN37dNL2ve8L DAKxESYIwaTEWuKxmD4AGogClEl55UoC8kxMB3dHwLZEd4U0v5ZDULR6NUYXMpos 6miaoF+pJfBnpNRqpCieWRW5dYXD4TwSdquv5rUSmUBrdOSy34s/nORWB4kL443K hFPcbo5Mv1L0W70/+gdj1uBlBsenZxnXu6aEmrckONqwj9Q2SBjJTik9WuNwh+FF tEcmUt8kDanGkbwtMCxnbT3HDOdfQyW+qq4IJ6MOYHlW9Cqbp9QUvAIho4DEpr7f eGurQ/urSD3dltzuYQcZ81zGhaGxzaRt5d2AEHRrGugQ2ZvnsG74oSAmEINZTSw4 RV2EXyJ4hXcXK/yJXo3fGzFm2/5JFvYhnvddo6wts3vQZHwefExIRCHVz2cJL9eS gFpfFu4uB8z7w7l9s9LJKv7cTriaDd1WHuIWZGonz3wlFSUOn7IxunDxM3Hc5YO3 okawhr6sWe03fFcKsw1WeWymfDUwmk/7OV15OSDanItAwX5vkBYDBvAcA/cwm8cj P0Vb3c1/Sf1IjjHGGA13vHpD1JXJ7FHafg6jyWmjJNqaS+wtShvs2As9MqbtSWMb o2OcYTEEzME4mMIXZzVlKP7hhkLMaVR5PwGmbPovlyAkEUX0soH7nefyLMAqP3JG 7VZYV46VCL7wm3yjrKYw =sL1G -----END PGP SIGNATURE----- Merge tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Fix a couple races found with a new torture test - Improve errors when api functions are used incorrectly - Improve tracing for lock requests from user space - Fix use after free in recently added tracing cod. - Small internal code cleanups * tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: fix possible use after free if tracing fs: dlm: const void resource name parameter fs: dlm: LSFL_CB_DELAY only for kernel lockspaces fs: dlm: remove DLM_LSFL_FS from uapi fs: dlm: trace user space callbacks fs: dlm: change ls_clear_proc_locks to spinlock fs: dlm: remove dlm_del_ast prototype fs: dlm: handle rcom in else if branch fs: dlm: allow lockspaces have zero lvblen fs: dlm: fix invalid derefence of sb_lvbptr fs: dlm: handle -EINVAL as log_error() fs: dlm: use __func__ for function name fs: dlm: handle -EBUSY first in unlock validation fs: dlm: handle -EBUSY first in lock arg validation fs: dlm: fix race between test_bit() and queue_work() fs: dlm: fix race in lowcomms	2022-10-03 20:11:59 -07:00
Linus Torvalds	f90497a16e	NFSD 6.1 Release Notes This release is mostly bug fixes, clean-ups, and optimizations. One notable set of fixes addresses a subtle buffer overflow issue that occurs if a small RPC Call message arrives in an oversized RPC record. This is only possible on a framed RPC transport such as TCP. Because NFSD shares the receive and send buffers in one set of pages, an oversized RPC record steals pages from the send buffer that will be used to construct the RPC Reply message. NFSD must not assume that a full-sized buffer is always available to it; otherwise, it will walk off the end of the send buffer while constructing its reply. In this release, we also introduce the ability for the server to wait a moment for clients to return delegations before it responds with NFS4ERR_DELAY. This saves a retransmit and a network round- trip when a delegation recall is needed. This work will be built upon in future releases. The NFS server adds another shrinker to its collection. Because courtesy clients can linger for quite some time, they might be freeable when the server host comes under memory pressure. A new shrinker has been added that releases courtesy client resources during low memory scenarios. Lastly, of note: the maximum number of operations per NFSv4 COMPOUND that NFSD can handle is increased from 16 to 50. There are NFSv4 client implementations that need more than 16 to successfully perform a mount operation that uses a pathname with many components. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmM66P4ACgkQM2qzM29m f5fhMg//afS2mp4fgPz4MjoFIqD/Icep8qFEPA8Gy6I1dDGRxd9wNgjoN4JALFdr NKX1oRVISBvDrOG/C84GbYnXEDlzY8q1HmyPoJA8VAR57hnJXfPZN6CBN//Bx4mU nISPJeNGY9SMNVhS8916V/yzd41uWDQuD+H+i5mluBTJHONgSzwzc80sQ+eq+yZQ PV6mlJN6hcm14LCaDOTXF7oY2Wm6dQc2rV87YChJWnc+vdXKnme/LWTMY1ABkePD g88mSL6w3YDKEuKciWda5/QU1ETp/Q7XTjFGDKEQSnnNsvCLmUKogJTKVa2QqyLY P1qlrj6XwukqAe414W4amlLL3q4NUFmJZPNWDxdf+qtTrQrBBEFrsKy/bSt27XoD cTvBWcorMG2riSlYPViVeh8RpyC6qwhttPbvGAmflVF2KEyXpfgc5Pnn0/Xm1Ac9 XKzaCTJlUyRb/2wdqVtQbIpyh3sbhzp8zhv7sWKXgQOEXxKOO3ZAIrQXeL6oFN/b HlXDty7wKhFRj8IbkZfQ9SvN1saTONwB3clYHbCXTetkw/nnrUgLYcu8NIDBK9ou wkBcz1++XgVTqjRFwUwagb62cPJnRM6UiROYCVbQp7qcUe4/U+WP9t6dnZlnGZVZ dtipKlH/LTGKW+d7ysZOqb4hsRza5Kaduz5a7lML7UIGQXmxjM0= =hE6t -----END PGP SIGNATURE----- Merge tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: "This release is mostly bug fixes, clean-ups, and optimizations. One notable set of fixes addresses a subtle buffer overflow issue that occurs if a small RPC Call message arrives in an oversized RPC record. This is only possible on a framed RPC transport such as TCP. Because NFSD shares the receive and send buffers in one set of pages, an oversized RPC record steals pages from the send buffer that will be used to construct the RPC Reply message. NFSD must not assume that a full-sized buffer is always available to it; otherwise, it will walk off the end of the send buffer while constructing its reply. In this release, we also introduce the ability for the server to wait a moment for clients to return delegations before it responds with NFS4ERR_DELAY. This saves a retransmit and a network round- trip when a delegation recall is needed. This work will be built upon in future releases. The NFS server adds another shrinker to its collection. Because courtesy clients can linger for quite some time, they might be freeable when the server host comes under memory pressure. A new shrinker has been added that releases courtesy client resources during low memory scenarios. Lastly, of note: the maximum number of operations per NFSv4 COMPOUND that NFSD can handle is increased from 16 to 50. There are NFSv4 client implementations that need more than 16 to successfully perform a mount operation that uses a pathname with many components" * tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (53 commits) nfsd: extra checks when freeing delegation stateids nfsd: make nfsd4_run_cb a bool return function nfsd: fix comments about spinlock handling with delegations nfsd: only fill out return pointer on success in nfsd4_lookup_stateid NFSD: fix use-after-free on source server when doing inter-server copy NFSD: Cap rsize_bop result based on send buffer size NFSD: Rename the fields in copy_stateid_t nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_file_cache_stats_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_reply_cache_stats_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define client_info_fops nfsd: use DEFINE_SHOW_ATTRIBUTE to define export_features_fops and supported_enctypes_fops nfsd: use DEFINE_PROC_SHOW_ATTRIBUTE to define nfsd_proc_ops NFSD: Pack struct nfsd4_compoundres NFSD: Remove unused nfsd4_compoundargs::cachetype field NFSD: Remove "inline" directives on op_rsize_bop helpers NFSD: Clean up nfs4svc_encode_compoundres() SUNRPC: Fix typo in xdr_buf_subsegment's kdoc comment NFSD: Clean up WRITE arg decoders NFSD: Use xdr_inline_decode() to decode NFSv3 symlinks NFSD: Refactor common code out of dirlist helpers ...	2022-10-03 20:07:15 -07:00
Linus Torvalds	3497640a80	Changes since last update: - Introduce fscache-based domain to share blobs between images; - Support recording fragments in a special packed inode; - Support partial-referenced pclusters for global compressed data deduplication; - Fix an order >= MAX_ORDER warning due to crafted negative i_size; - Several cleanups. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYzq3FxEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBJbRAQDab/0DJu7iDktzupazfCibkg8vWzakXIi+ KE0y5O8VaQEAwn9bdPU4cp+raowoMt3z8eGsj4H9ZO9NM8NfPUX0uQQ= =TNVH -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, for container use cases, fscache-based shared domain is introduced [1] so that data blobs in the same domain will be storage deduplicated and it will also be used for page cache sharing later. Also, a special packed inode is now introduced to record inode fragments which keep the tail part of files by Yue Hu [2]. You can keep arbitary length or (at will) the whole file as a fragment and then fragments can be optionally compressed in the packed inode together and even deduplicated for smaller image sizes. In addition to that, global compressed data deduplication by sharing partial-referenced pclusters is also supported in this cycle. Summary: - Introduce fscache-based domain to share blobs between images - Support recording fragments in a special packed inode - Support partial-referenced pclusters for global compressed data deduplication - Fix an order >= MAX_ORDER warning due to crafted negative i_size - Several cleanups" Link: https://lore.kernel.org/r/20220916085940.89392-1-zhujia.zj@bytedance.com [1] Link: https://lore.kernel.org/r/cover.1663065968.git.huyue2@coolpad.com [2] * tag 'erofs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: clean up erofs_iget() erofs: clean up unnecessary code and comments erofs: fold in z_erofs_reload_indexes() erofs: introduce partial-referenced pclusters erofs: support on-disk compressed fragments data erofs: support interlaced uncompressed data for compressed files erofs: clean up .read_folio() and .readahead() in fscache mode erofs: introduce 'domain_id' mount option erofs: Support sharing cookies in the same domain erofs: introduce a pseudo mnt to manage shared cookies erofs: introduce fscache-based domain erofs: code clean up for fscache erofs: use kill_anon_super() to kill super in fscache mode erofs: fix order >= MAX_ORDER warning due to crafted negative i_size	2022-10-03 20:01:40 -07:00
Linus Torvalds	8bea8ff34a	fs.vfsuid.fat.v6.1 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYzqjGgAKCRCRxhvAZXjc ovVFAQCbZXflZk/DGy1CVEWHwJDkMlmN62jCY+3gZP6UPsCeCQEA4uB3Hub7jihO b2q9yhR+p6G7DHkeNAo2qUu4bI0lDgQ= =zG6Z -----END PGP SIGNATURE----- Merge tag 'fs.vfsuid.fat.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull fatfs vfsuid conversion from Christian Brauner: "Last cycle we introduced the new vfs{g,u}id_t types that we had agreed on. The most important parts of the vfs have been converted but there are a few more places we need to switch before we can remove the old helpers completely. This cycle we converted all filesystems that called idmapped mount helpers directly. The affected filesystems are f2fs, fat, fuse, ksmbd, overlayfs, and xfs. We've sent patches for all of them. Looking at -next f2fs, ksmbd, overlayfs, and xfs have all picked up these patches and they should land in mainline during the v6.1 merge window. So all filesystems that have a separate tree should send the vfsuid conversion themselves. Onle the fat conversion is going through this generic fs trees because there is no fat tree. In order to change time settings on an inode fat checks that the caller either is the owner of the inode or the inode's group is in the caller's group list. If fat is on an idmapped mount we compare whether the inode mapped into the mount is equivalent to the caller's fsuid. If it isn't we compare whether the inode's group mapped into the mount is in the caller's group list. We now use the new vfsuid based helpers for that" * tag 'fs.vfsuid.fat.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fat: port to vfs{g,u}id_t and associated helpers	2022-10-03 19:54:29 -07:00
Linus Torvalds	223b845253	fs.acl.rework.prep.v6.1 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYzqi8gAKCRCRxhvAZXjc orKNAQCGKPJ3Kc3LVVnh8qdjm9npP+j9UQAB7jDZi9q7RijIIAD/VYjj+z5XLg4V k96ibCyir1+4EOF8ihY0WQi40MSWYws= =S/Wf -----END PGP SIGNATURE----- Merge tag 'fs.acl.rework.prep.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs acl updates from Christian Brauner: "These are general fixes and preparatory changes related to the ongoing posix acl rework. The actual rework where we build a type safe posix acl api wasn't ready for this merge window but we're hopeful for the next merge window. General fixes: - Some filesystems like 9p and cifs have to implement custom posix acl handlers because they require access to the dentry in order to set and get posix acls while the set and get inode operations currently don't. But the ntfs3 filesystem has no such requirement and thus implemented custom posix acl xattr handlers when it really didn't have to. So this pr contains patch that just implements set and get inode operations for ntfs3 and switches it to rely on the generic posix acl xattr handlers. (We would've appreciated reviews from the ntfs3 maintainers but we didn't get any. But hey, if we really broke it we'll fix it. But fstests for ntfs3 said it's fine.) - The posix_acl_fix_xattr_common() helper has been adapted so it can be used by a few more callers and avoiding open-coding the same checks over and over. Other than the two general fixes this series introduces a new helper vfs_set_acl_prepare(). The reason for this helper is so that we can mitigate one of the source that change {g,u}id values directly in the uapi struct. With the vfs_set_acl_prepare() helper we can move the idmapped mount fixup into the generic posix acl set handler. The advantage of this is that it allows us to remove the posix_acl_setxattr_idmapped_mnt() helper which so far we had to call in vfs_setxattr() to account for idmapped mounts. While semantically correct the problem with this approach was that we had to keep the value parameter of the generic vfs_setxattr() call as non-const. This is rectified in this series. Ultimately, we will get rid of all the extreme kludges and type unsafety once we have merged the posix api - hopefully during the next merge window - built solely around get and set inode operations. Which incidentally will also improve handling of posix acls in security and especially in integrity modesl. While this will come with temporarily having two inode operation for posix acls that is nothing compared to the problems we have right now and so well worth it. We'll end up with something that we can actually reason about instead of needing to write novels to explain what's going on" * tag 'fs.acl.rework.prep.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: xattr: always us is_posix_acl_xattr() helper acl: fix the comments of posix_acl_xattr_set xattr: constify value argument in vfs_setxattr() ovl: use vfs_set_acl_prepare() acl: move idmapping handling into posix_acl_xattr_set() acl: add vfs_set_acl_prepare() acl: return EOPNOTSUPP in posix_acl_fix_xattr_common() ntfs3: rework xattr handlers and switch to POSIX ACL VFS helpers	2022-10-03 19:48:54 -07:00
Linus Torvalds	da380aefdd	fs/coredump fix -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzuCiQAKCRBZ7Krx/gZQ 65z6AQDbHgeZ3vXLyHdxs3VWsUkKWMUV1gb5MmzBs/eGq0K3hAD9FfGsOoT2ow9h 2ics8AGrvMZMHrFOwFAmolXjLW7qQws= =y0rq -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull coredump fix from Al Viro: "Brown paper bag bug fix for the coredumping fix late in the 6.0 release cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [brown paperbag] fix coredump breakage	2022-10-03 18:03:58 -07:00
Linus Torvalds	26b84401da	lsm/stable-6.1 PR 20221003 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmM68YIUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOTbA//TR8i+Wy8iswUCmtfmYg91h1uebpl /kjNsSmfgivAUTGamr3eN2WRlGhZfkFDPIHa25uybSA6Q+75p4lst83Rt3HDbjkv Ga7grCXnHwSDwJoHOSeFh0pojV2u7Zvfmiib2U5hPZEmd3kBw3NCgAJVcSGN80B2 dct36fzZNXjvpWDbygmFtRRkmEseslSkft8bUVvNZBP+B0zvv3vcNY1QFuKuK+W2 8wWpvO/cCSmke5i2c2ktHSk2f8/Y6n26Ik/OTHcTVfoKZLRaFbXEzLyxzLrNWd6m hujXgcxszTtHdmoXx+J6uBauju7TR8pi1x8mO2LSGrlpRc1cX0A5ED8WcH71+HVE 8L1fIOmZShccPZn8xRok7oYycAUm/gIfpmSLzmZA76JsZYAe+mp9Ze9FA6fZtSwp 7Q/rfw/Rlz25WcFBe4xypP078HkOmqutkCk2zy5liR+cWGrgy/WKX15vyC0TaPrX tbsRKuCLkipgfXrTk0dX3kmhz+3bJYjqeZEt7sfPSZYpaOGkNXVmAW0wnCOTuLMU +8pIVktvQxMmACEj2gBMz11iooR4DpWLxOcQQR/impgCpNdZ60nA0a6KPJoIXC+5 NfTa422FZkc99QRVblUZyWSgJBW78Z3ZAQcQlo1AGLlFydbfrSFTRLbmNJZo/Nkl KwpGvWs5nB0rVw0= =VZl5 -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull LSM updates from Paul Moore: "Seven patches for the LSM layer and we've got a mix of trivial and significant patches. Highlights below, starting with the smaller bits first so they don't get lost in the discussion of the larger items: - Remove some redundant NULL pointer checks in the common LSM audit code. - Ratelimit the lockdown LSM's access denial messages. With this change there is a chance that the last visible lockdown message on the console is outdated/old, but it does help preserve the initial series of lockdown denials that started the denial message flood and my gut feeling is that these might be the more valuable messages. - Open userfaultfds as readonly instead of read/write. While this code obviously lives outside the LSM, it does have a noticeable impact on the LSMs with Ondrej explaining the situation in the commit description. It is worth noting that this patch languished on the VFS list for over a year without any comments (objections or otherwise) so I took the liberty of pulling it into the LSM tree after giving fair notice. It has been in linux-next since the end of August without any noticeable problems. - Add a LSM hook for user namespace creation, with implementations for both the BPF LSM and SELinux. Even though the changes are fairly small, this is the bulk of the diffstat as we are also including BPF LSM selftests for the new hook. It's also the most contentious of the changes in this pull request with Eric Biederman NACK'ing the LSM hook multiple times during its development and discussion upstream. While I've never taken NACK's lightly, I'm sending these patches to you because it is my belief that they are of good quality, satisfy a long-standing need of users and distros, and are in keeping with the existing nature of the LSM layer and the Linux Kernel as a whole. The patches in implement a LSM hook for user namespace creation that allows for a granular approach, configurable at runtime, which enables both monitoring and control of user namespaces. The general consensus has been that this is far preferable to the other solutions that have been adopted downstream including outright removal from the kernel, disabling via system wide sysctls, or various other out-of-tree mechanisms that users have been forced to adopt since we haven't been able to provide them an upstream solution for their requests. Eric has been steadfast in his objections to this LSM hook, explaining that any restrictions on the user namespace could have significant impact on userspace. While there is the possibility of impacting userspace, it is important to note that this solution only impacts userspace when it is requested based on the runtime configuration supplied by the distro/admin/user. Frederick (the pathset author), the LSM/security community, and myself have tried to work with Eric during development of this patchset to find a mutually acceptable solution, but Eric's approach and unwillingness to engage in a meaningful way have made this impossible. I have CC'd Eric directly on this pull request so he has a chance to provide his side of the story; there have been no objections outside of Eric's" * tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lockdown: ratelimit denial messages userfaultfd: open userfaultfds with O_RDONLY selinux: Implement userns_create hook selftests/bpf: Add tests verifying bpf lsm userns_create hook bpf-lsm: Make bpf_lsm_userns_create() sleepable security, lsm: Introduce security_create_user_ns() lsm: clean up redundant NULL pointer check	2022-10-03 17:51:52 -07:00
Al Viro	4f526fef91	[brown paperbag] fix coredump breakage Let me count the ways in which I'd screwed up: * when emitting a page, handling of gaps in coredump should happen before fetching the current file position. * fix for a problem that occurs on rather uncommon setups (and hadn't been observed in the wild) had been sent very late in the cycle. * ... with badly insufficient testing, introducing an easily reproducible breakage. Without giving it time to soak in -next. Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: "J. R. Okajima" <hooanon05g@gmail.com> Tested-by: "J. R. Okajima" <hooanon05g@gmail.com> Fixes: 06bbaa6dc53c "[coredump] don't use __kernel_write() on kmap_local_page()" Cc: stable@kernel.org # v6.0-only Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-10-03 20:28:38 -04:00
Linus Torvalds	12ed00ba01	execve updates for v6.1-rc1 - Remove a.out implementation globally (Eric W. Biederman) - Remove unused linux_binprm::taso member (Lukas Bulwahn) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM4bQsWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkftD/4gTcnAd3BCUgerhiQfq64kYNPt l47p0BM6GzXvl1Mf4Q0TDV35WJ/JD5Yd3ij3V7J2XJWSHANUAlHbxm9yfChVLACU 99YRVhuWSdohJkF7p0b8dkAQO551aeodj/JUKGiNrJyNR4L336r+YG5aqEjOSPji jQH5I4SonDeaGdLy8nYO/aRhEryIF1FqvLH6egNp6Tt8Q69UqDYIojdCgZ/MS5lb dldrFsDb3ZjoXET0NdeIzZEZVS6zDM2iehb2W8dtRFoNsjMXz4jSiy7AEoprETBz wAKZ16t0dj2sARLLVGL8i3m2k6tzD6zzkIIoc9X3VOyeOADa6aghagDMyDpRo6ZB 2ML7wMNCHCboCVVfG3n2rWTIFmrqeycAiny0hZxU4bjBBYSxTK4qD9lFQtlXk0cD BESZhnM6gg87vVFgLV/8aefCvwRd5eb8Pugtwb3qF4NsSvgosZtIXhWnIeU3cjg2 425+4XOPbLsBv/u8NVkG8yIHaHZbtXH78JDIcNFMgSvw9iBwn2NH9184EpHI4qRx 9aBjHPz+VOqzPTnNR6ISP5J4VXaOvHeocb/ckbrS+xhY8zQmbWX9QHaAZ+XnDYby PVY0HjmYTFSlijHSRhqLqNMHwtOAeiSZwJNk4wh+39H0ynpTzThGjQ9NlGYQ5cUE TVF5ukO5QRa5GbmhIQ== =Ns31 -----END PGP SIGNATURE----- Merge tag 'execve-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "This removes a.out support globally; it has been disabled for a while now. - Remove a.out implementation globally (Eric W. Biederman) - Remove unused linux_binprm::taso member (Lukas Bulwahn)" * tag 'execve-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt: remove taso from linux_binprm struct a.out: Remove the a.out implementation	2022-10-03 16:56:40 -07:00
Linus Torvalds	d649d2c49b	pstore revert for v6.0-rc8 - Revert crypto acomp migration (Guilherme G. Piccoli) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM3CssWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJl75D/42DaeB92odK/3XG9cI1Frp4YS0 vcTIecUeheNrTf4okUjeL0kQjUJgDQLWYNxZU6O3ljNM3lItDegV14Ij4tiPJlNK pOg8e+ddnWmhTB6c1BrLwVIjPDUnJmhqd1L2G6D/1djlcQU2TzN/amjBE4PQYrC4 Kyqn8nmrVt5CgCvkV4PPDuIiyeGx808bFu+rK13J6BZYDH4mTZWqky25+wFuresn HNBQ0Xh7otaeeIMdbdohaitANlr7xHpUOLDuHnQaZ9Od1wJdFqNFpQ0sYDwvsqhY aTyNAAkZig9xMp10Sr6lCDfT/ZHkHDtSNOqn30iwPrFX83QVbbnaQhRsN7EK8Caz v7dx2HDjP5MxkYetT+3qPW4waScc5/s7dJNXoK/3E3oWrKoFRwx2CrmZUEuHmjHC S1M/kG+MoizxGAWbtvB93XbxXJlULdgmv1VKQyBjZRZ9JVBP7AnQmstljKjK0yn2 o/pgTFRduiTj7+N9iU7h7GxHNLXJm0YHwCZw1QhORc1dg14jbMkNEu6GN6ZNfuvg ADYs2tRN7YA7oGjACaCNAEuGxjY5/quvPAdYHdwankXBj8S8O2/BCxa02mbrPevC 7p32nwzRtUz4G+bjJr4N/RGO1q/Qzn5IcZnzOxGUVn5TC729JxkfRUFrb/m0l0PP PgfmGgNjSQ4KDaYIMQ== =KoAn -----END PGP SIGNATURE----- Merge tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore revert from Kees Cook: "A misbehavior with some compression backends in pstore was just discovered due to the recent crypto acomp migration. Since we're so close to release, it seems better to just simply revert it, and we can figure out what's going on without leaving it broken for a release. - Revert crypto acomp migration (Guilherme G. Piccoli)" * tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Revert "pstore: migrate to crypto acomp interface"	2022-09-30 08:54:14 -07:00
Guilherme G. Piccoli	40158dbf7e	Revert "pstore: migrate to crypto acomp interface" This reverts commit e4f0a7ec586b7644107839f5394fb685cf1aadcc. When using this new interface, both efi_pstore and ramoops backends are unable to properly decompress dmesg if using zstd, lz4 and lzo algorithms (and maybe more). It does succeed with deflate though. The message observed in the kernel log is: [2.328828] pstore: crypto_acomp_decompress failed, ret = -22! The pstore infrastructure is able to collect the dmesg with both backends tested, but since decompression fails it's unreadable. With this revert everything is back to normal. Fixes: e4f0a7ec586b ("pstore: migrate to crypto acomp interface") Cc: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220929215515.276486-1-gpiccoli@igalia.com	2022-09-30 08:16:06 -07:00
Linus Torvalds	987a926c1d	fix for breakage in dump_user_range() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzXkrwAKCRBZ7Krx/gZQ 6+BcAP4+OfgZO6qStFjroURhZ8o3qfCJpYde/fGN5S/F/5UMdAD6A+4yozxXLdMZ wb2oB0XhxHv33vFdFfEeodquTnn+MAQ= =1Mnt -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull coredump fix from Al Viro: "Fix for breakage in dump_user_range()" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [coredump] don't use __kernel_write() on kmap_local_page()	2022-09-29 14:37:45 -07:00
Jakub Kicinski	accc3b4a57	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-29 14:30:51 -07:00
Tetsuo Handa	cbddcc4fa3	btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for commit bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code") missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must not be moved to after clean_tree_block() because clean_tree_block() is calling btrfs_header_generation() since commit 55c69072d6bd5be1 ("Btrfs: Fix extent_buffer usage when nodesize != leafsize"). Since memzero_extent_buffer() will reset "struct btrfs_header" part, we can't move btrfs_set_header_generation() to before memzero_extent_buffer(). Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block(). Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1] Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com> Fixes: bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	db21370bff	btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	b54bb86556	btrfs: avoid pointless extent map tree search when flushing delalloc When flushing delalloc, in COW mode at cow_file_range(), before entering the loop that allocates extents and creates ordered extents, we do a call to btrfs_drop_extent_map_range() for the whole range. This is pointless because in the loop we call create_io_em(), which will also call btrfs_drop_extent_map_range() before inserting the new extent map. So remove that call at cow_file_range() not only because it is not needed, but also because it will make the btrfs_drop_extent_map_range() calls made from create_io_em() waste time searching the extent map tree, and that tree can be large for files with many extents. It also makes us waste time at btrfs_drop_extent_map_range() allocating and freeing the split extent maps for nothing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	6c05813ebb	btrfs: remove unnecessary next extent map search At __tree_search(), and its single caller __lookup_extent_mapping(), there is no point in finding the next extent map that starts after the search offset if we were able to find the previous extent map that ends before our search offset, because __lookup_extent_mapping() ignores the next acceptable extent map if we were able to find the previous one. So just return immediately if we were able to find the previous extent map, therefore avoiding wasting time iterating the tree looking for the next extent map which will not be used by __lookup_extent_mapping(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	08f088dd63	btrfs: remove unnecessary NULL pointer checks when searching extent maps The previous and next pointer arguments passed to __tree_search() are never NULL as the only caller of this function, __lookup_extent_mapping(), always passes the address of two on stack pointers. So remove the NULL checks and add assertions to verify the pointers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	74333c7d87	btrfs: assert tree is locked when clearing extent map from logging When calling clear_em_logging() we should have a write lock on the extent map tree, as we will try to merge the extent map with the previous and next ones in the tree. So assert that we have a write lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:31 +02:00
Filipe Manana	2e0cdaa028	btrfs: remove unnecessary extent map initializations When allocating an extent map, we use kmem_cache_zalloc() which guarantees the returned memory is initialized to zeroes, therefore it's pointless to initialize the generation and flags of the extent map to zero again. Remove those initializations, as they are pointless and slightly increase the object text size. Before removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9241 274 24 9539 2543 fs/btrfs/extent_map.o After removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9209 274 24 9507 2523 fs/btrfs/extent_map.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	ad5d6e9148	btrfs: remove the refcount warning/check at free_extent_map() At free_extent_map(), it's pointless to have a WARN_ON() to check if the refcount of the extent map is zero. Such check is already done by the refcount_t module and refcount_dec_and_test(), which loudly complains if we try to decrement a reference count that is currently 0. The WARN_ON() dates back to the time when used a regular atomic_t type for the reference counter, before we switched to the refcount_t type. The main goal of the refcount_t type/module is precisely to catch such types of bugs and loudly complain if they happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	a1ba4c080b	btrfs: add helper to replace extent map range with a new extent map We have several places that need to drop all the extent maps in a given file range and then add a new extent map for that range. Currently they call btrfs_drop_extent_map_range() to delete all extent maps in the range and then keep trying to add the new extent map in a loop that keeps retrying while the insertion of the new extent map fails with -EEXIST. So instead of repeating this logic, add a helper to extent_map.c that does these steps and name it btrfs_replace_extent_map_range(). Also add a comment about why the retry loop is necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	9c9d1b4f74	btrfs: move open coded extent map tree deletion out of inode eviction Move the loop that removes all the extent maps from the inode's extent map tree during inode eviction out of inode.c and into extent_map.c, to btrfs_drop_extent_map_range(). Anything manipulating extent maps or the extent map tree should be in extent_map.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	99ba0c8150	btrfs: use cond_resched_rwlock_write() during inode eviction At evict_inode_truncate_pages(), instead of manually checking if rescheduling is needed, then unlock the extent map tree, reschedule and then write lock again the tree, use the helper cond_resched_rwlock_write() which does all that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	f3109e33bb	btrfs: use extent_map_end() at btrfs_drop_extent_map_range() Instead of open coding the end offset calculation of an extent map, use the helper extent_map_end() and cache its result in a local variable, since it's used several times. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	4c0c8cfc84	btrfs: move btrfs_drop_extent_cache() to extent_map.c The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c. Additionally do the following changes: 1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common; 2) Change its 'skip_pinned' argument from int to bool; 3) Turn several of its local variables from int to bool, since they are used as booleans; 4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used; 5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment; 6) Remove checks for NULL before calling free_extent_map(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Filipe Manana	cef7820d6a	btrfs: fix missed extent on fsync after dropping extent maps When dropping extent maps for a range, through btrfs_drop_extent_cache(), if we find an extent map that starts before our target range and/or ends before the target range, and we are not able to allocate extent maps for splitting that extent map, then we don't fail and simply remove the entire extent map from the inode's extent map tree. This is generally fine, because in case anyone needs to access the extent map, it can just load it again later from the respective file extent item(s) in the subvolume btree. However, if that extent map is new and is in the list of modified extents, then a fast fsync will miss the parts of the extent that were outside our range (that needed to be split), therefore not logging them. Fix that by marking the inode for a full fsync. This issue was introduced after removing BUG_ON()s triggered when the split extent map allocations failed, done by commit 7014cdb49305ed ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and the fast fsync path already existed but was very recent. Also, in the case where we could allocate extent maps for the split operations but then fail to add a split extent map to the tree, mark the inode for a full fsync as well. This is not supposed to ever fail, and we assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is not set), it's the correct thing to do to make sure a fast fsync will not miss a new extent. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:30 +02:00
Jeff Layton	3050dfa63e	btrfs: remove stale prototype of btrfs_write_inode This function no longer exists, was removed in 3c4276936f6f ("Btrfs: fix btrfs_write_inode vs delayed iput deadlock"). Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	926078b21d	btrfs: enable nowait async buffered writes Enable nowait async buffered writes in btrfs_do_write_iter() and btrfs_file_open(). In this version encoded buffered writes have the optimization not enabled. Encoded writes are enabled by using an ioctl. io_uring currently does not support ioctls. This might be enabled in the future. Performance results: For fio the following results have been obtained with a queue depth of 1 and 4k block size (runtime 600 secs): sequential writes: without patch with patch libaio psync iops: 55k 134k 117K 148K bw: 221MB/s 538MB/s 469MB/s 592MB/s clat: 15286ns 82ns 994ns 6340ns For an io depth of 1, the new patch improves throughput by over two times (compared to the existing behavior, where buffered writes are processed by an io-worker process) and also the latency is considerably reduced. To achieve the same or better performance with the existing code an io depth of 4 is required. Increasing the iodepth further does not lead to improvements. The tests have been run like this: ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... Testing: This patch has been tested with xfstests, fsx, fio. xfstests shows no new diffs compared to running without the patch series. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	c922b016f3	btrfs: assert nowait mode is not used for some btree search functions Adds nowait asserts to btree search functions which are not used by buffered IO and direct IO paths. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:29 +02:00
Stefan Roesch	965f47aeb5	btrfs: make btrfs_buffered_write nowait compatible We need to avoid unconditionally calling balance_dirty_pages_ratelimited as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags with the BDP_ASYNC in case the buffered write is nowait, returning EAGAIN eventually. It also moves the function after the again label. This can cause the function to be called a bit later, but this should have no impact in the real world. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	304e45acdb	btrfs: plumb NOWAIT through the write path We have everywhere setup for nowait, plumb NOWAIT through the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	2fcab928cc	btrfs: make lock_and_cleanup_extent_if_need nowait compatible Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the nowait parameter is specified we try to lock the extent in nowait mode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Stefan Roesch	fc22600012	btrfs: make prepare_pages nowait compatible Add nowait parameter to the prepare_pages function. In case nowait is specified for an async buffered write request, do a nowait allocation or return -EAGAIN. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	80f9d24130	btrfs: make btrfs_check_nocow_lock nowait compatible Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add a nowait flag to btrfs_check_nocow_lock so it can be used by the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	d2c7a19f5c	btrfs: add btrfs_try_lock_ordered_range For IOCB_NOWAIT we're going to want to use try lock on the extent lock, and simply bail if there's an ordered extent in the range because the only choice there is to wait for the ordered extent to complete. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	1daedb1d6b	btrfs: add the ability to use NO_FLUSH for data reservations In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH data reservations, so plumb this through the delalloc reservation system. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:28 +02:00
Josef Bacik	26ce911446	btrfs: make can_nocow_extent nowait compatible If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:26 +02:00
Al Viro	06bbaa6dc5	[coredump] don't use __kernel_write() on kmap_local_page() passing kmap_local_page() result to __kernel_write() is unsafe - random ->write_iter() might (and 9p one does) get unhappy when passed ITER_KVEC with pointer that came from kmap_local_page(). Fix by providing a variant of __kernel_write() that takes an iov_iter from caller (__kernel_write() becomes a trivial wrapper) and adding dump_emit_page() that parallels dump_emit(), except that instead of __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source. Fixes: 3159ed57792b "fs/coredump: use kmap_local_page()" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-09-28 14:28:40 -04:00

1 2 3 4 5 ...

78296 Commits