12606 Commits

Author SHA1 Message Date
Viktor Prutyanov
af8ececda1 virtio: add VIRTIO_F_NOTIFICATION_DATA feature support
According to VirtIO spec v1.2, VIRTIO_F_NOTIFICATION_DATA feature
indicates that the driver passes extra data along with the queue
notifications.

In a split queue case, the extra data is 16-bit available index. In a
packed queue case, the extra data is 1-bit wrap counter and 15-bit
available index.

Add support for this feature for MMIO, channel I/O and modern PCI
transports.

Signed-off-by: Viktor Prutyanov <viktor@daynix.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20230413081855.36643-2-alvaro.karsz@solid-run.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-04-21 03:02:35 -04:00
Josh Triplett
519fe1bae7 ext4: Add a uapi header for ext4 userspace APIs
Create a uapi header include/uapi/linux/ext4.h, move the ioctls and
associated data structures to the uapi header, and include it from
fs/ext4/ext4.h.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/680175260970d977d16b5cc7e7606483ec99eb63.1680402881.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-19 23:39:42 -04:00
Chuck Lever
2fd5532044 net/handshake: Add a kernel API for requesting a TLSv1.3 handshake
To enable kernel consumers of TLS to request a TLS handshake, add
support to net/handshake/ to request a handshake upcall.

This patch also acts as a template for adding handshake upcall
support for other kernel transport layer security providers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19 18:48:48 -07:00
Chuck Lever
3b3009ea8a net/handshake: Create a NETLINK service for handling handshake requests
When a kernel consumer needs a transport layer security session, it
first needs a handshake to negotiate and establish a session. This
negotiation can be done in user space via one of the several
existing library implementations, or it can be done in the kernel.

No in-kernel handshake implementations yet exist. In their absence,
we add a netlink service that can:

a. Notify a user space daemon that a handshake is needed.

b. Once notified, the daemon calls the kernel back via this
   netlink service to get the handshake parameters, including an
   open socket on which to establish the session.

c. Once the handshake is complete, the daemon reports the
   session status and other information via a second netlink
   operation. This operation marks that it is safe for the
   kernel to use the open socket and the security session
   established there.

The notification service uses a multicast group. Each handshake
mechanism (eg, tlshd) adopts its own group number so that the
handshake services are completely independent of one another. The
kernel can then tell via netlink_has_listeners() whether a handshake
service is active and prepared to handle a handshake request.

A new netlink operation, ACCEPT, acts like accept(2) in that it
instantiates a file descriptor in the user space daemon's fd table.
If this operation is successful, the reply carries the fd number,
which can be treated as an open and ready file descriptor.

While user space is performing the handshake, the kernel keeps its
muddy paws off the open socket. A second new netlink operation,
DONE, indicates that the user space daemon is finished with the
socket and it is safe for the kernel to use again. The operation
also indicates whether a session was established successfully.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19 18:48:48 -07:00
Ondrej Kozina
9e05a2599a sed-opal: geometry feature reporting command
Locking range start and locking range length
attributes may be require to satisfy restrictions
exposed by OPAL2 geometry feature reporting.

Geometry reporting feature is described in TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).

4.3.5.2.1.1 RangeStart Behavior:

[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]

When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:

a) the AlignmentRequired (ALIGN above) column in the LockingInfo
   table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
   return an error status code INVALID_PARAMETER.

4.3.5.2.1.2 RangeLength Behavior:

If RangeStart is zero, then
	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]

If RangeStart is non-zero, then
	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]

When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:

a) the AlignmentRequired (ALIGN above) column in the LockingInfo
   table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
   return an error status code INVALID_PARAMETER

In userspace we stuck to logical block size reported by general
block device (via sysfs or ioctl), but we can not read
'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else and
we need to get those values from sed-opal interface otherwise
we will not be able to report or avoid locking range setup
INVALID_PARAMETER errors above.

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-19 14:07:13 -06:00
Ming Lei
2d786e66c9 block: ublk: switch to ioctl command encoding
All ublk commands(control, IO) should have taken ioctl command encoding
from the beginning, because ioctl command encoding defines each code
uniquely, so driver can figure out wrong command sent from userspace
easily; 2) it might help security subsystem for audit uring cmd[1].

Unfortunately we didn't do that way, and it could be one lesson for
ublk driver.

So switch to ioctl command encoding now, we still support commands encoded
in old way, but they become legacy definition. Any new command should take
ioctl encoding.

See ublksrv code for switching to ioctl command encoding in [2].

[1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
[2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18 20:13:30 -06:00
David Wei
ea97f6c855 io_uring: add support for multishot timeouts
A multishot timeout submission will repeatedly generate completions with
the IORING_CQE_F_MORE cflag set. Depending on the value of the `off'
field in the submission, these timeouts can either repeat indefinitely
until cancelled (`off' = 0) or for a fixed number of times (`off' > 0).

Only noseq timeouts (i.e. not dependent on the number of I/O
completions) are supported.

An indefinite timer will be cancelled if the CQ ever overflows.

Signed-off-by: David Wei <davidhwei@meta.com>
Link: https://lore.kernel.org/r/20230418225817.1905027-1-davidhwei@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18 19:38:36 -06:00
Kevin Brodsky
31088f6f79 uapi/linux/const.h: prefer ISO-friendly __typeof__
typeof is (still) a GNU extension, which means that it cannot be used when
building ISO C (e.g.  -std=c99).  It should therefore be avoided in uapi
headers in favour of the ISO-friendly __typeof__.

Unfortunately this issue could not be detected by
CONFIG_UAPI_HEADER_TEST=y as the __ALIGN_KERNEL() macro is not expanded in
any uapi header.

This matters from a userspace perspective, not a kernel one. uapi
headers and their contents are expected to be usable in a variety of
situations, and in particular when building ISO C applications (with
-std=c99 or similar).

This particular problem can be reproduced by trying to use the
__ALIGN_KERNEL macro directly in application code, say:

#include <linux/const.h>

int align(int x, int a)
{
	return __KERNEL_ALIGN(x, a);
}
	
and trying to build that with -std=c99.	

Link: https://lkml.kernel.org/r/20230411092747.3759032-1-kevin.brodsky@arm.com
Fixes: a79ff731a1b2 ("netfilter: xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reported-by: Ruben Ayrapetyan <ruben.ayrapetyan@arm.com>
Tested-by: Ruben Ayrapetyan <ruben.ayrapetyan@arm.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Tested-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:39:34 -07:00
Yang Yang
a3b2aeac9d delayacct: track delays from IRQ/SOFTIRQ
Delay accounting does not track the delay of IRQ/SOFTIRQ.  While
IRQ/SOFTIRQ could have obvious impact on some workloads productivity, such
as when workloads are running on system which is busy handling network
IRQ/SOFTIRQ.

Get the delay of IRQ/SOFTIRQ could help users to reduce such delay.  Such
as setting interrupt affinity or task affinity, using kernel thread for
NAPI etc.  This is inspired by "sched/psi: Add PSI_IRQ to track
IRQ/SOFTIRQ pressure"[1].  Also fix some code indent problems of older
code.

And update tools/accounting/getdelays.c:
    / # ./getdelays -p 156 -di
    print delayacct stats ON
    printing IO accounting
    PID     156

    CPU             count     real total  virtual total    delay total  delay average
                       15       15836008       16218149      275700790         18.380ms
    IO              count    delay total  delay average
                        0              0          0.000ms
    SWAP            count    delay total  delay average
                        0              0          0.000ms
    RECLAIM         count    delay total  delay average
                        0              0          0.000ms
    THRASHING       count    delay total  delay average
                        0              0          0.000ms
    COMPACT         count    delay total  delay average
                        0              0          0.000ms
    WPCOPY          count    delay total  delay average
                       36        7586118          0.211ms
    IRQ             count    delay total  delay average
                       42         929161          0.022ms

[1] commit 52b1364ba0b1("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")

Link: https://lkml.kernel.org/r/202304081728353557233@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Cc: wangyong <wang.yong12@zte.com.cn>
Cc: junhua huang <huang.junhua@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:39:34 -07:00
Josh Triplett
ddc65971bb prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance,
AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable
or ideal:

- Be main or the pre-main startup code, and grub through the stack above
  main. Doesn't work for a library.
- Call libc getauxval. Not ideal for libraries that are trying to be
  libc-independent and/or don't otherwise require anything from other
  libraries.
- Open and read /proc/self/auxv. Doesn't work for libraries that may run
  in arbitrarily constrained environments that may not have /proc
  mounted (e.g. libraries that might be used by an init program or a
  container setup tool).
- Assume you're on the main thread and still on the original stack, and
  try to walk the stack upwards, hoping to find auxv. Extremely bad
  idea.
- Ask the caller to pass auxv in for you. Not ideal for a user-friendly
  library, and then your caller may have the same problem.

Add a prctl that copies current->mm->saved_auxv to a userspace buffer.

Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:53 -07:00
Qu Wenruo
604e6681e1 btrfs: scrub: reject unsupported scrub flags
Since the introduction of scrub interface, the only flag that we support
is BTRFS_SCRUB_READONLY.  Thus there is no sanity checks, if there are
some undefined flags passed in, we just ignore them.

This is problematic if we want to introduce new scrub flags, as we have
no way to determine if such flags are supported.

Address the problem by introducing a check for the flags, and if
unsupported flags are set, return -EOPNOTSUPP to inform the user space.

This check should be backported for all supported kernels before any new
scrub flags are introduced.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Gregory Price
3f67987cdc ptrace: Provide set/get interface for syscall user dispatch
The syscall user dispatch configuration can only be set by the task itself,
but lacks a ptrace set/get interface which makes it impossible to implement
checkpoint/restore for it.

Add the required ptrace requests and the get/set functions in the syscall
user dispatch code to make that possible.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20230407171834.3558-4-gregory.price@memverge.com
2023-04-16 14:23:07 +02:00
Dave Marchevsky
d54730b50b bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
meant for use in BPF programs. Similarly to other opaque types like
bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
user-defined struct types a bpf_refcount can be located, so necessary
btf_record plumbing is added to enable this. bpf_refcount is sized to
hold a refcount_t.

Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
btf_record as refcount_off in addition to being in the field array.
Caching refcount_off makes sense for this field because further patches
in the series will modify functions that take local kptrs (e.g.
bpf_obj_drop) to change their behavior if the type they're operating on
is refcounted. So enabling fast "is this type refcounted?" checks is
desirable.

No such verifier behavior changes are introduced in this patch, just
logic to recognize 'struct bpf_refcount' in btf_record.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:49 -07:00
Ming Qian
302b988ca0 media: Add ABGR64_12 video format
ABGR64_12 is a reversed RGB format with alpha channel last,
12 bits per component like ABGR32,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 09:11:30 +01:00
Ming Qian
da0b7a400e media: Add BGR48_12 video format
BGR48_12 is a reversed RGB format with 12 bits per component like BGR24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 09:10:46 +01:00
Ming Qian
99c9549677 media: Add YUV48_12 video format
YUV48_12 is a YUV format with 12-bits per component like YUV24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.

[hverkuil: replaced a . by ,]

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 09:10:27 +01:00
Ming Qian
a490ea6844 media: Add Y012 video format
Y012 is a luma-only formats with 12-bits per pixel,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 09:06:34 +01:00
Ming Qian
aa10804042 media: Add P012 and P012M video format
P012 is a YUV format with 12-bits per component with interleaved UV,
like NV12, expanded to 16 bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
And P012M has two non contiguous planes.

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 09:05:53 +01:00
Tomi Valkeinen
f57fa29592 media: v4l2-subdev: Add new ioctl for client capabilities
Add new ioctls to set and get subdev client capabilities. Client in this
context means the userspace application which opens the subdev device
node. The client capabilities are stored in the file handle of the
opened subdev device node, and the client must set the capabilities for
each opened subdev.

For now we only add a single flag, V4L2_SUBDEV_CLIENT_CAP_STREAMS, which
indicates that the client is streams-aware.

The reason for needing such a flag is as follows:

Many structs passed via ioctls, e.g. struct v4l2_subdev_format, contain
reserved fields (usually a single array field). These reserved fields
can be used to extend the ioctl. The userspace is required to zero the
reserved fields.

We recently added a new 'stream' field to many of these structs, and the
space for the field was taken from these reserved arrays. The assumption
was that these new 'stream' fields are always initialized to zero if the
userspace does not use them. This was a mistake, as, as mentioned above,
the userspace is required to zero the _reserved_ fields. In other words,
there is no requirement to zero this new stream field, and if the
userspace doesn't use the field (which is the case for all userspace
applications at the moment), the field may contain random data.

This shows that the way the reserved fields are defined in v4l2 is, in
my opinion, somewhat broken, but there is nothing to do about that.

To fix this issue we need a way for the userspace to tell the kernel
that the userspace has indeed set the 'stream' field, and it's fine for
the kernel to access it. This is achieved with the new ioctl, which the
userspace should usually use right after opening the subdev device node.

Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-15 08:58:41 +01:00
Vladimir Oltean
a721c3e54b net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.

I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.

Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.

Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.

A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Vladimir Oltean
f62af20bed net/sched: mqprio: allow per-TC user input of FP adminStatus
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Jakub Kicinski
c2865b1122 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
 g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
 mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
 =StAc
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
Jakub Kicinski
800e68c44f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

tools/testing/selftests/net/config
  62199e3f1658 ("selftests: net: Add VXLAN MDB test")
  3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:04:28 -07:00
Dave Jiang
2442b7473a dmaengine: idxd: process batch descriptor completion record faults
Add event log processing for faulting of user batch descriptor completion
record.

When encountering an event log entry for a page fault on a completion
record, the driver is expected to do the following:
1. If the "first error in batch" bit in event log entry error info is
set, discard any previously recorded errors associated with the
"batch identifier".
2. Fix the page fault according to the fault address in the event log. If
successful, write the completion record to the fault address in user space.
3. If an error is encountered while writing the completion record and it is
associated to a descriptor in the batch, the driver associates the error
with the batch identifier of the event log entry and tracks it until the
event log entry for the corresponding batch desc is encountered.

While processing an event log entry for a batch descriptor with error
indicating that one or more descs in the batch had event log entries,
the driver will do the following before writing the batch completion
record:
1. If the status field of the completion record is 0x1, the driver will
change it to error code 0x5 (one or more operations in batch completed
with status not successful) and changes the result field to 1.
2. If the status is error code 0x6 (page fault on batch descriptor list
address), change the result field to 1.
3. If status is any other value, the completion record is not changed.
4. Clear the recorded error in preparation for next batch with same batch
identifier.

The result field is for user software to determine whether to set the
"Batch Error" flag bit in the descriptor for continuation of partial
batch descriptor completion. See DSA spec 2.0 for additional information.

If no error has been recorded for the batch, the batch completion record is
written to user space as is.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-12-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Dave Jiang
6926987185 dmaengine: idxd: add descs_completed field for completion record
The descs_completed field for a completion record is part of a batch
descriptor completion record. It takes the same location as bytes_completed
in a normal descriptor field. Add to expose to user.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-11-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Dave Jiang
c40bd7d973 dmaengine: idxd: process user page faults for completion record
DSA supports page fault handling through PRS. However, the DMA engine
that's processing the descriptor is blocked until the PRS response is
received. Other workqueues sharing the engine are also blocked.
Page fault handing by the driver with PRS disabled can be used to
mitigate the stalling.

With PRS disabled while ATS remain enabled, DSA handles page faults on
a completion record by reporting an event in the event log. In this
instance, the descriptor is completed and the event log contains the
completion record address and the contents of the completion record. Add
support to the event log handling code to fault in the completion record
and copy the content of the completion record to user memory.

A bitmap is introduced to keep track of discarded event log entries. When
the user process initiates ->release() of the char device, it no longer is
interested in any remaining event log entries tied to the relevant wq and
PASID. The driver will mark the event log entry index in the bitmap. Upon
encountering the entries during processing, the event log handler will just
clear the bitmap bit and skip the entry rather than attempt to process the
event log entry.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-10-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Dave Jiang
5fbe6503b5 dmanegine: idxd: add debugfs for event log dump
Add debugfs entry to dump the content of the event log for debugging. The
function will dump all non-zero entries in the event log. It will note
which entries are processed and which entries are still pending processing
at the time of the dump. The entries may not always be in chronological
order due to the log is a circular buffer.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-6-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Dave Jiang
2f431ba908 dmaengine: idxd: add interrupt handling for event log
An event log interrupt is raised in the misc interrupt INTCAUSE register
when an event is written by the hardware. Add basic event log processing
support to the interrupt handler. The event log is a ring where the
hardware owns the tail and the software owns the head. The hardware will
advance the tail index when an additional event has been pushed to memory.
The software will process the log entry and then advances the head. The
log is full when (tail + 1) % log_size = head. The hardware will stop
writing when the log is full. The user is expected to create a log size
large enough to handle all the expected events.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-5-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Dave Jiang
244da66cda dmaengine: idxd: setup event log configuration
Add setup of event log feature for supported device. Event log addresses
error reporting that was lacking in gen 1 DSA devices where a second error
event does not get reported when a first event is pending software
handling. The event log allows a circular buffer that the device can push
error events to. It is up to the user to create a large enough event log
ring in order to capture the expected events. The evl size can be set in
the device sysfs attribute. By default 64 entries are supported as minimal
when event log is enabled.

Tested-by: Tony Zhu <tony.zhu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20230407203143.2189681-4-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-04-12 23:18:45 +05:30
Andrii Nakryiko
47a71c1f9a bpf: Add log_true_size output field to return necessary log buffer size
Add output-only log_true_size and btf_log_true_size field to
BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return
the size of log buffer necessary to fit in all the log contents at
specified log_level. This is very useful for BPF loader libraries like
libbpf to be able to size log buffer correctly, but could be used by
users directly, if necessary, as well.

This patch plumbs all this through the code, taking into account actual
bpf_attr size provided by user to determine if these new fields are
expected by users. And if they are, set them from kernel on return.

We refactory btf_parse() function to accommodate this, moving attr and
uattr handling inside it. The rest is very straightforward code, which
is split from the logging accounting changes in the previous patch to
make it simpler to review logic vs UAPI changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Daniel Vetter
b8d85bb505 Merge tag 'drm-msm-next-2023-04-10' of https://gitlab.freedesktop.org/drm/msm into drm-next
main pull request for v6.4

Core Display:
============
* Bugfixes for error handling during probe
* rework UBWC decoder programming
* prepare_commit cleanup
* bindings for SM8550 (MDSS, DPU), SM8450 (DP)
* timeout calculation fixup
* atomic: use drm_crtc_next_vblank_start() instead of our own
  custom thing to calculate the start of next vblank

DP:
==
* interrupts cleanup

DPU:
===
* DSPP sub-block flush on sc7280
* support AR30 in addition to XR30 format
* Allow using REC_0 and REC_1 to handle wide (4k) RGB planes
* Split the HW catalog into individual per-SoC files

DSI:
===
* rework DSI instance ID detection on obscure platforms

GPU:
===
* uapi C++ compatibility fix
* a6xx: More robust gdsc reset
* a3xx and a4xx devfreq support
* update generated headers
* various cleanups and fixes
* GPU and GEM updates to avoid allocations which could trigger
  reclaim (shrinker) in fence signaling path
* dma-fence deadline hint support and wait-boost
* a640 speedbin support
* a650 speedbin support

Conflicts in drivers/gpu/drm/msm/adreno/adreno_gpu.c:

Conflict between the 7fa5047a436b ("drm: Use of_property_present() for
testing DT property presence") and 9f251f934012 ("drm/msm/adreno: Use
OPP for every GPU generation"). The latter removed the of_ function
call outright, so I went with what's in the PR unchanged.

From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGvwuj5tabyW910+N-B=5kFNAC7QNYoQ=0xi3roBjQvFFQ@mail.gmail.com
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
2023-04-11 12:21:50 +02:00
Ming Qian
ec9aa62a1e media: add RealVideo format RV30 and RV40
RealVideo, or also spelled as Real Video, is a suite of proprietary
video compression formats developed by RealNetworks -
the specific format changes with the version.
RealVideo codecs are identified by four-character codes.
RV30 and RV40 are RealNetworks' proprietary H.264-based codecs.

Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-10 14:08:53 +01:00
Ming Qian
ae77d13914 media: add Sorenson Spark video format
Sorenson Spark is an implementation of H.263 for use
in Flash Video and Adobe Flash files.
Sorenson Spark is an incomplete implementation of H.263.
It differs mostly in header structure and ranges of the coefficients.

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-04-10 14:06:47 +01:00
Oded Gabbay
a25c2f7a46 accel/habanalabs/uapi: new Gaudi2 server type
Add definition of a new Gaudi2 server type. This represents
the connectivity between the cards in that server type.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:34 +03:00
Danylo Piliaiev
f1af066bcf drm/msm: Rename drm_msm_gem_submit_reloc::or in C++ code
Clashes with C++ `or` keyword

Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Patchwork: https://patchwork.freedesktop.org/patch/528751/
Link: https://lore.kernel.org/r/20230326163813.535762-1-robdclark@gmail.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-04-06 20:29:39 +03:00
Oswald Buddenhagen
102882b5c6 ALSA: document that struct __snd_pcm_mmap_control64 is messed up
I'm not the first one to run into this, see e.g.
https://lore.kernel.org/all/29QBMJU8DE71E.2YZSH8IHT5HMH@mforney.org/

Signed-off-by: Oswald Buddenhagen <oswald.buddenhagen@gmx.de>
Link: https://lore.kernel.org/r/20230406132521.2252019-1-oswald.buddenhagen@gmx.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-04-06 16:41:52 +02:00
Daniel Vetter
52b113e968 Merge tag 'drm-misc-next-2023-04-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v6.4-rc1:

UAPI Changes:

Cross-subsystem Changes:
- Document port and rotation dt bindings better.
- For panel timing DT bindings, document that vsync and hsync are
  first, rather than last in image.
- Fix video/aperture typos.

Core Changes:
- Reject prime DMA-Buf attachment if get_sg_table is missing.
  (For self-importing dma-buf only.)
- Add prime import/export to vram-helper.
- Fix oops in drm/vblank when init is not called.
- Fixup xres/yres_virtual and other fixes in fb helper.
- Improve SCDC debugs.
- Skip setting deadline on modesets.
- Assorted TTM fixes.

Driver Changes:
- Add lima usage stats.
- Assorted fixes to bridge/lt8192b, tc358767, ivpu,
  bridge/ti-sn65dsi83, ps8640.
- Use pci aperture helpers in drm/ast lynxfb, radeonfb.
- Revert some lima patches, as they required a commit that has been
  reverted upstream.
- Add AUO NE135FBM-N41 v8.1 eDP panel.
- Add QAIC accel driver.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/64bb9696-a76a-89d9-1866-bcdf7c69c284@linux.intel.com
2023-04-06 14:37:15 +02:00
Daniel Vetter
f86286569e Merge tag 'drm-intel-gt-next-2023-04-06' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
UAPI Changes:

- (Build-time only, should not have any impact)
  drm/i915/uapi: Replace fake flex-array with flexible-array member

  "Zero-length arrays as fake flexible arrays are deprecated and we are
  moving towards adopting C99 flexible-array members instead."

  This is on core kernel request moving towards GCC 13.

Driver Changes:

- Fix context runtime accounting on sysfs fdinfo for heavy workloads (Tvrtko)
- Add support for OA media units on MTL (Umesh)
- Add new workarounds for Meteorlake (Daniele, Radhakrishna, Haridhar)
- Fix sysfs to read actual frequency for MTL and Gen6 and earlier
  (Ashutosh)
- Synchronize i915/BIOS on C6 enabling on MTL (Vinay)
- Fix DMAR error noise due to GPU error capture (Andrej)
- Fix forcewake during BAR resize on discrete (Andrzej)
- Flush lmem contents after construction on discrete (Chris)
- Fix GuC loading timeout on systems where IFWI programs low boot
  frequency (John)
- Fix race condition UAF in i915_perf_add_config_ioctl (Min)

- Sanitycheck MMIO access early in driver load and during forcewake
  (Matt)
- Wakeref fixes for GuC RC error scenario and active VM tracking (Chris)
- Cancel HuC delayed load timer on reset (Daniele)
- Limit double GT reset to pre-MTL (Daniele)
- Use i915 instead of dev_priv insied the file_priv structure (Andi)
- Improve GuC load error reporting (John)
- Simplify VCS/BSD engine selection logic (Tvrtko)
- Perform uc late init after probe error injection (Andrzej)
- Fix format for perf_limit_reasons in debugfs (Vinay)
- Create per-gt debugfs files (Andi)

- Documentation and kerneldoc fixes (Nirmoy, Lee)
- Selftest improvements (Fei, Jonathan)

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZC6APj/feB+jBf2d@jlahtine-mobl.ger.corp.intel.com
2023-04-06 14:21:00 +02:00
Jeffrey Hugo
c501ca23a6 accel/qaic: Add uapi and core driver file
Add the QAIC driver uapi file and core driver file that binds to the PCIe
device. The core driver file also creates the accel device and manages
all the interconnections between the different parts of the driver.

The driver can be built as a module. If so, it will be called "qaic.ko".

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-3-git-send-email-quic_jhugo@quicinc.com
2023-04-06 08:23:03 +02:00
Axel Rasmussen
0289184476 mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs
UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE
to resolve a missing fault, one can install a write-protected one.  This
is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination.

This was motivated by testing HugeTLB HGM [1], and in particular its
interaction with userfaultfd features.  Existing userfaultfd code supports
using WP and MINOR modes together (i.e.  you can register an area with
both enabled), but without this CONTINUE flag the combination is in
practice unusable.

So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as
UFFDIO_COPY_MODE_WP, but for *minor* faults.

Update the selftest to do some very basic exercising of the new flag.

Update Documentation/ to describe how these flags are used (neither the
COPY nor the new CONTINUE versions of this mode flag were described there
before).

[1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/

Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:48 -07:00
Peter Xu
2bad466cc9 mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

The new feature bit makes anonymous memory acts the same as file memory on
userfaultfd-wp in that it'll also wr-protect none ptes.

It can be useful in two cases:

(1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
    so pre-fault can be replaced by enabling this flag and speed up
    protections

(2) It helps to implement async uffd-wp mode that Muhammad is working on [1]

It's debatable whether this is the most ideal solution because with the
new feature bit set, wr-protect none pte needs to pre-populate the
pgtables to the last level (PAGE_SIZE).  But it seems fine so far to
service either purpose above, so we can leave optimizations for later.

The series brings pte markers to anonymous memory too.  There's some
change in the common mm code path in the 1st patch, great to have some eye
looking at it, but hopefully they're still relatively straightforward.


This patch (of 2):

This is a new feature that controls how uffd-wp handles none ptes.  When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not. 
For anonymous it was not as persistent because we used to assume that we
don't need protections on none ptes or known zero pages.

One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being
cleared).

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath.  It doesn't have any impact on file memories so far
because we already have pte markers taking care of that.  So it only
affects anonymous.

The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will
contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte
markers to anonymous), but also on creating the pgtables to store the pte
markers.  So there's potentially less chance of using thp on the first
fault for a none pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.

Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults.  But
they should be straightforward.

With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all
the memory before wr-protect during taking a live snapshot.  Quotting from
Muhammad's test result here [3] based on a simple program [4]:

  (1) With huge page disabled
  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 1111453 (pre-fault 1101011)
  Test MADVISE: 278276 (pre-fault 266378)
  Test WP-UNPOPULATE: 11712

  (2) With Huge page enabled
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 22521 (pre-fault 22348)
  Test MADVISE: 4909 (pre-fault 4743)
  Test WP-UNPOPULATE: 14448

There'll be a great perf boost for no-thp case, while for thp enabled with
extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE,
but that's low possibility in reality, also the overhead was not reduced
but postponed until a follow up write on any huge zero thp, so potentially
it is faster by making the follow up writes slower.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
[3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
[4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

[peterx@redhat.com: comment changes, oneliner fix to khugepaged]
  Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:44 -07:00
Ondrej Kozina
4c4dd04e75 sed-opal: Add command to read locking range parameters.
It returns following attributes:

locking range start
locking range length
read lock enabled
write lock enabled
lock state (RW, RO or LK)

It can be retrieved by user authority provided the authority
was added to locking range via prior IOC_OPAL_ADD_USR_TO_LR
ioctl command. The command was extended to add user in ACE that
allows to read attributes listed above.

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230405111223.272816-6-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-05 07:46:26 -06:00
Oliver Upton
e65733b5c5 KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.

Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Dmitry Fomichev
f1ba4e674f virtio-blk: fix to match virtio spec
The merged patch series to support zoned block devices in virtio-blk
is not the most up to date version. The merged patch can be found at

https://lore.kernel.org/linux-block/20221016034127.330942-3-dmitry.fomichev@wdc.com/

but the latest and reviewed version is

https://lore.kernel.org/linux-block/20221110053952.3378990-3-dmitry.fomichev@wdc.com/

The reason is apparently that the correct mailing lists and
maintainers were not copied.

The differences between the two are mostly cleanups, but there is one
change that is very important in terms of compatibility with the
approved virtio-zbd specification.

Before it was approved, the OASIS virtio spec had a change in
VIRTIO_BLK_T_ZONE_APPEND request layout that is not reflected in the
current virtio-blk driver code. In the running code, the status is
the first byte of the in-header that is followed by some pad bytes
and the u64 that carries the sector at which the data has been written
to the zone back to the driver, aka the append sector.

This layout turned out to be problematic for implementing in QEMU and
the request status byte has been eventually made the last byte of the
in-header. The current code doesn't expect that and this causes the
append sector value always come as zero to the block layer. This needs
to be fixed ASAP.

Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Message-Id: <20230330214953.1088216-2-dmitry.fomichev@wdc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-04-04 11:01:57 -04:00
Pavel Begunkov
d322818ef4 io_uring: kill unused notif declarations
There are two leftover structures from the notification registration
mechanism that has never been released, kill them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f05f65aebaf8b1b5bf28519a8fdb350e3e7c9ad0.1679924536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Jens Axboe
c56e022c0a io_uring: add support for user mapped provided buffer ring
The ring mapped provided buffer rings rely on the application allocating
the memory for the ring, and then the kernel will map it. This generally
works fine, but runs into issues on some architectures where we need
to be able to ensure that the kernel and application virtual address for
the ring play nicely together. This at least impacts architectures that
set SHM_COLOUR, but potentially also anyone setting SHMLBA.

To use this variant of ring provided buffers, the application need not
allocate any memory for the ring. Instead the kernel will do so, and
the allocation must subsequently call mmap(2) on the ring with the
offset set to:

	IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)

to get a virtual address for the buffer ring. Normally the application
would allocate a suitable piece of memory (and correctly aligned) and
simply pass that in via io_uring_buf_reg.ring_addr and the kernel would
map it.

Outside of the setup differences, the kernel allocate + user mapped
provided buffer ring works exactly the same.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jens Axboe
81cf17cd3a io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to'flags'
In preparation for allowing flags to be set for registration, rename
the padding and use it for that.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jakub Kicinski
54fd494af9 netfilter updates for net-next
1. No need to disable BH in nfnetlink proc handler, freeing happens
    via call_rcu.
 2. Expose classid in nfetlink_queue, from Eric Sage.
 3. Fix nfnetlink message description comments, from Matthieu De Beule.
 4. Allow removal of offloaded connections via ctnetlink, from Paul Blakey.
 
 Signed-off-by: Florian Westphal <fw@strlen.de>
 -----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmQmuXoNHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AAlxEACzFDfO6Jf+yGT6sF+927JjPugwSGpqywD24/ky
 9IqPjJLQurMoPRf9BaRsdjIlxKdhJXog2p6iUgNfXzNgIfT9r7nv/er0TET6ieQk
 3tIR8RqsphRR/JOMlk3vvWqp0Vlh+AX3kYaHo4kCO6Ir0b/OJx8eOYd1+Dl+mz1e
 ggGff5oROXXtMP7LvDs8WQ07aIsrUFeZnLrjZl0sUV7yoqld8K/V7JQrzHn+XXqi
 ERGeymHb4zd/9ljfnt/pkrMn9oJjU9ji8KC5QVfksT01WuI3g2WQtHjRaYuC9b1t
 8ASWIQCoxwMk3HBsBevj5e/QzrF8oe+yAWUeO133ezsTv5N07z/jT6sUbZ5mXHTF
 hz2qAKk3ABNhkCTZHxZlDxIEB9uf0ItbKVSLl0s0j33fjyrchPjfxLzxSSYekOXQ
 xH8zYECsdhivEMB28sKLZqywdWM8vCtazqojRHKunPRH43diESuu9WkAcge4y+XD
 EJQR8FWpyuHS0ux6m/Tx6KqVoSiE+LcCbhs7KOud+hHCdpY3aZ6rrBXpctN1LEsl
 37Eg0hrAEcBfKZ1z6rlULNYKKOAdjbD+ySSf2e8vnjlCeiNOZqq+c/yD1ElW14Yz
 u8xR099ZY89hy0TUhSgJRl/5VcSswMVy7VNiqB6gv+8X/u8XZF/+c7R8gO0Yxdmq
 OadEEg==
 =EBSm
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-2023-03-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter updates for net-next

1. No need to disable BH in nfnetlink proc handler, freeing happens
   via call_rcu.
2. Expose classid in nfetlink_queue, from Eric Sage.
3. Fix nfnetlink message description comments, from Matthieu De Beule.
4. Allow removal of offloaded connections via ctnetlink, from Paul Blakey.

* tag 'nf-next-2023-03-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: ctnetlink: Support offloaded conntrack entry deletion
  netfilter: Correct documentation errors in nf_tables.h
  netfilter: nfnetlink_queue: enable classid socket info retrieval
  netfilter: nfnetlink_log: remove rcu_bh usage
====================

Link: https://lore.kernel.org/r/20230331104809.2959-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-31 10:39:33 -07:00
Fenghua Yu
6fec8938b7 dmaengine: idxd: Add descriptor definitions for translation fetch operation
The translation fetch operation (0x0A) fetches address translations for the
address range specified in the descriptor by issuing address translation
(ATS) requests to the IOMMU.

Add descriptor definitions for the operation so that user can use DSA
to accelerate translation fetch.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20230303213413.3357431-4-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-03-31 17:25:26 +05:30
Fenghua Yu
12bbc2c260 dmaengine: idxd: Add descriptor definitions for DIX generate operation
The Data Integrity Extension (DIX) generate operation (0x17) computes
the Data Integrity Field (DIF) on the source data and writes only the
computed DIF for each source block to the PI destination address.

Add descriptor definitions for this operation so that user can use
DSA to accelerate DIX generate operation.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20230303213413.3357431-3-fenghua.yu@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-03-31 17:25:25 +05:30