IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-07-03
The following pull-request contains BPF updates for your *net-next* tree.
There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:
** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
struct dim_cq_moder moder,
struct mlx5e_cq_param *param,
struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...
drivers/net/ethernet/mellanox/mlx5/core/en.h +977
... and in mlx5e_open_xsk() ...
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64
... needs the same rename from net_dim_cq_moder into dim_cq_moder.
** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.
Let me know if you run into any issues. Anyway, the main changes are:
1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.
2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed for every RTT to help tracking TCP statistics, both features
from Stanislav.
3) Follow-up fix from loops in precision tracking which was not propagating
precision marks and as a result verifier assumed that some branches were
not taken and therefore wrongly removed as dead code, from Alexei.
4) Fix BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released and a new BPF
program is attached to the one of ancestor cgroups in parallel, from Roman.
5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.
6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.
7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.
8) Various cleanups and minor fixes all over the place from many others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
windows real servers can handle gre tunnels, this patch allows
gre encapsulation with the tunneling method, thereby letting ipvs
be load balancer for windows-based services
Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Uppercase is a reminiscence from the iptables infrastructure, rename
this header before this is included in stable kernels.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Implement DEVX dispatching event by looking up for the applicable
subscriptions for the reported event and using their target fd to
signal/set the event.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable subscription for device events over DEVX.
Each subscription is added to the two level xarray data structure
according to its event number and the DEVX object information in case was
given with the given target fd.
Those events will be reported over the given fd once will occur.
Downstream patches will mange the dispatching to any subscription.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce MLX5_IB_OBJECT_DEVX_ASYNC_EVENT_FD and its initial
implementation.
This object is from type class FD and will be used to read DEVX
async events.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Daniel Borkmann says:
====================
pull-request: bpf 2019-07-03
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
on BE architectures, from Jiong.
2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
from Luke and Xi.
3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
from Stanislav.
4) Properly handle the check that forwarding is enabled on the device
in bpf_ipv6_fib_lookup() helper code, from Anton.
5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
alignment such as m68k, from Baruch.
6) Fix kernel hanging in unregister_netdevice loop while unregistering
device bound to XDP socket, from Ilya.
7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.
8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.
9) Fix bpftool to use correct argument in cgroup errors, from Jakub.
10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.
11) Add Jonathan to AF_XDP reviewers, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for the Marvell MV64x60 line of bridge chips that contained
MPSC controllers has been removed and there are no other components
that have that controller so remove its driver.
Signed-off-by: Mark Greer <mgreer@animalcreek.com>
Link: https://lore.kernel.org/r/20190626160553.28518-1-mgreer@animalcreek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Performance impact should be minimal because it's under a new
BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled.
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Priyaranjan Jha <priyarjha@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Provide a keyctl() operation to grant/remove permissions. The grant
operation, wrapped by libkeyutils, looks like:
int ret = keyctl_grant_permission(key_serial_t key,
enum key_ace_subject_type type,
unsigned int subject,
unsigned int perm);
Where key is the key to be modified, type and subject represent the subject
to which permission is to be granted (or removed) and perm is the set of
permissions to be granted. 0 is returned on success. SET_SECURITY
permission is required for this.
The subject type currently must be KEY_ACE_SUBJ_STANDARD for the moment
(other subject types will come along later).
For subject type KEY_ACE_SUBJ_STANDARD, the following subject values are
available:
KEY_ACE_POSSESSOR The possessor of the key
KEY_ACE_OWNER The owner of the key
KEY_ACE_GROUP The key's group
KEY_ACE_EVERYONE Everyone
perm lists the permissions to be granted:
KEY_ACE_VIEW Can view the key metadata
KEY_ACE_READ Can read the key content
KEY_ACE_WRITE Can update/modify the key content
KEY_ACE_SEARCH Can find the key by searching/requesting
KEY_ACE_LINK Can make a link to the key
KEY_ACE_SET_SECURITY Can set security
KEY_ACE_INVAL Can invalidate
KEY_ACE_REVOKE Can revoke
KEY_ACE_JOIN Can join this keyring
KEY_ACE_CLEAR Can clear this keyring
If an ACE already exists for the subject, then the permissions mask will be
overwritten; if perm is 0, it will be deleted.
Currently, the internal ACL is limited to a maximum of 16 entries.
For example:
int ret = keyctl_grant_permission(key,
KEY_ACE_SUBJ_STANDARD,
KEY_ACE_OWNER,
KEY_ACE_VIEW | KEY_ACE_READ);
Signed-off-by: David Howells <dhowells@redhat.com>
Broken up commit to add the Soft iWarp RDMA driver.
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When sending a call-function IPI-many to vCPUs, yield if any of
the IPI target vCPUs was preempted, we just select the first
preempted target vCPU which we found since the state of target
vCPUs can change underneath and to avoid race conditions.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While using new non arhitectural features using PUNIT Mailbox and MMIO
read/write interface, still there is need to operate using MSRs to
control PUNIT. User space could have used user user-space MSR interface for
this, but when user space MSR access is disabled, then it can't. Here only
limited number of MSRs are allowed using this new interface.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Add an IOCTL to send mailbox commands to PUNIT using PUNIT PCI device.
A limited set of mailbox commands can be sent to PUNIT.
This MMIO interface is used by the intel-speed-select tool under
tools/x86/power to enumerate and control Intel Speed Select features.
The MBOX commands ids and semantics of the message can be checked from
the source code of the tool.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Added MMIO interface to read/write specific offsets in PUNIT PCI device
which export core priortization. This MMIO interface can be used using
ioctl interface on /dev/isst_interface using IOCTL ISST_IF_IO_CMD.
This MMIO interface is used by the intel-speed-select tool under
tools/x86/power to enumerate and set core priority. The MMIO offsets and
semantics of the message can be checked from the source code of the tool.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Add processing for IOCTL command ISST_IF_GET_PHY_ID. This converts from the
Linux logical CPU to PUNIT CPU numbering scheme.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Encapsulate common functions which all Intel Speed Select Technology
interface drivers can use. This creates API to register misc device for
user kernel communication and handle all common IOCTLs. As part of the
registry it allows a callback which is to handle domain specific ioctl
processing.
There can be multiple drivers register for services, which can be built
as modules. So this driver handle contention during registry and as well
as during removal. Once user space opened the misc device, the registered
driver will be prevented from removal. Also once misc device is opened by
the user space new client driver can't register, till the misc device is
closed.
There are two types of client drivers, one to handle mail box interface
and the other is to allow direct read/write to some specific MMIO space.
This common driver implements IOCTL ISST_IF_GET_PLATFORM_INFO.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
VTS5zps=
=huBY
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc6' into for-5.3/block
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.
* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...
The information which is currently provided as a response to the
"HL_INFO_HW_IDLE" IOCTL is merely a general boolean value.
This patch extends it and provides also a bitmask that indicates which
of the device engines are busy.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Shifting signed 32-bit value by 31 bits is undefined. Changing most
significant bit to unsigned.
Changes included in v2:
- use subsystem specific subject lines
- CC required mailing lists
Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow modification of the shadow mbr. If the shadow mbr is not marked as
done, this data will be presented read only as the device content. Only
after marking the shadow mbr as done and unlocking a locking range the
actual content is accessible.
Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Enable users to mark the shadow mbr as done without completely
deactivating the shadow mbr feature. This may be useful on reboots,
when the power to the disk is not disconnected in between and the shadow
mbr stores the required boot files. Of course, this saves also the
(few) commands required to enable the feature if it is already enabled
and one only wants to mark the shadow mbr as done.
Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
PSID is a 32 character password printed on the drive label,
to prove its physical access. This PSID reverttper function
is very useful to regain the control over the drive when it
is locked and the user can no longer access it because of some
failures. However, *all the data on the drive is completely
erased*. This method is advisable only when the user is exhausted
of all other recovery methods.
PSID capabilities are described in:
https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rvt_rwqe and rvt_rwq struct elements are shared between rdmavt and the
providers but are not in uapi directory. As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2, The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.
Move rvt_rwqe and rvt_rwq struct into rvt-abi.h header in uapi directory.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rvt_cq_wc struct elements are shared between rdmavt and the providers
but not in uapi directory. As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2 The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.
In that case, move rvt_cq_wc struct into the rvt-abi.h header file and
create a rvt_k_cq_w for the kernel completion queue.
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
VTS5zps=
=huBY
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc6' into rdma.git for-next
For dependencies in next patches.
Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge commit 1c8c5a9d38f60 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next") undid the
fix from commit 36f9814a494 ("bpf: fix uapi hole for 32 bit compat
applications") by taking the gpl_compatible 1-bit field definition from
commit b85fab0e67b162 ("bpf: Add gpl_compatible flag to struct
bpf_prog_info") as is. That breaks architectures with 16-bit alignment
like m68k. Add 31-bit pad after gpl_compatible to restore alignment of
following fields.
Thanks to Dmitry V. Levin his analysis of this bug history.
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The bpf_redirect_map() helper used by XDP programs doesn't return any
indication of whether it can successfully redirect to the map index it was
given. Instead, BPF programs have to track this themselves, leading to
programs using duplicate maps to track which entries are populated in the
devmap.
This patch fixes this by moving the map lookup into the bpf_redirect_map()
helper, which makes it possible to return failure to the eBPF program. The
lower bits of the flags argument is used as the return code, which means
that existing users who pass a '0' flag argument will get XDP_ABORTED.
With this, a BPF program can check the return code from the helper call and
react by, for instance, substituting a different redirect. This works for
any type of map used for redirect.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, we are seeing non-critical packets being transmitted outside of
their timeslice. We can confirm that the packets are being dequeued at the
right time. So, the delay is induced in the hardware side. The most likely
reason is the hardware queues are starving the lower priority queues.
In order to improve the performance of taprio, we will be making use of the
txtime feature provided by the ETF qdisc. For all the packets which do not
have the SO_TXTIME option set, taprio will set the transmit timestamp (set
in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit
time for the packet is set to when the gate is open. If SO_TXTIME is set,
the TAPrio qdisc will validate whether the timestamp (in skb->tstamp)
occurs when the gate corresponding to skb's traffic class is open.
Following two parameters added to support this mode:
- flags: used to enable txtime-assist mode. Will also be used to enable
other modes (like hardware offloading) later.
- txtime-delay: This indicates the minimum time it will take for the packet
to hit the wire. This is useful in determining whether we can transmit
the packet in the remaining time if the gate corresponding to the packet is
currently open.
An example configuration for enabling txtime-assist:
tc qdisc replace dev eth0 parent root handle 100 taprio \\
num_tc 3 \\
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
queues 1@0 1@0 1@0 \\
base-time 1558653424279842568 \\
sched-entry S 01 300000 \\
sched-entry S 02 300000 \\
sched-entry S 04 400000 \\
flags 0x1 \\
txtime-delay 40000 \\
clockid CLOCK_TAI
tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
offload delta 200000 clockid CLOCK_TAI
Note that all the traffic classes are mapped to the same queue. This is
only possible in taprio when txtime-assist is enabled. Also, note that the
ETF Qdisc is enabled with offload mode set.
In this mode, if the packet's traffic class is open and the complete packet
can be transmitted, taprio will try to transmit the packet immediately.
This will be done by setting skb->tstamp to current_time + the time delta
indicated in the txtime-delay parameter. This parameter indicates the time
taken (in software) for packet to reach the network adapter.
If the packet cannot be transmitted in the current interval or if the
packet's traffic is not currently transmitting, the skb->tstamp is set to
the next available timestamp value. This is tracked in the next_launchtime
parameter in the struct sched_entry.
The behaviour w.r.t admin and oper schedules is not changed from what is
present in software mode.
The transmit time is already known in advance. So, we do not need the HR
timers to advance the schedule and wakeup the dequeue side of taprio. So,
HR timer won't be run when this mode is enabled.
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, etf expects a socket with SO_TXTIME option set for each packet
it encounters. So, it will drop all other packets. But, in the future
commits we are planning to add functionality where tstamp value will be set
by another qdisc. Also, some packets which are generated from within the
kernel (e.g. ICMP packets) do not have any socket associated with them.
So, this commit adds support for skip_sock_check. When this option is set,
etf will skip checking for a socket and other associated options for all
skbs.
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BIT() macro isn't exported as part of the UAPI interface. So, the
compile-test to ensure they are self contained fails. So, use _BITUL()
instead.
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- fix includes for _MAX constants, atomic functions and fwdecls,
by Sven Eckelmann (3 patches)
- shorten multicast tt/tvlv worker spinlock section, by Linus Luessing
- routeable multicast preparations: implement MAC multicast filtering,
by Linus Luessing (2 patches, David Millers comments integrated)
- remove return value checks for debugfs_create, by Greg Kroah-Hartman
- add routable multicast optimizations, by Linus Luessing (2 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl0WG8UWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoWU9D/9bzRACAW/jwDqVw9NVk6kpjzXa
fRj+4raXjDCug67XLdWd50KW+gen5HkkkOonu96Iew6ZGceTvqOciPKdVXnjH30n
2Bi/+K56LYWomaUMWD+aVhjgEKOIjDYd1ueqZCGsMrIFuhA5MonFWrigpSiQv0Gr
s0cbxbKOl5EO+lGFgySYznVdJ/K+9RlHyGB+hkthg9CAFaJX92wvwcFW37+cLgOS
3B6T1PylyWn72uPfac2Jd0M8yQkPaNtZqwJBj8Y0wJ7cVfj29p/VZdlCn/g5oRKT
VtRnuukOWNGn942+C88pl4YhC9rTipFb4qd9sgj5oDlj5d6B9+ZHy8vyhFQEByzJ
2l4rjkNIjwfEkw3QIxZT9/HFnyymrT08sGUtGzKAhIUcwXAe0Zm83EazGzig0UQ3
hjXZRloPldQ93CnboychqX7erDz3qGBTb+v+T+JDXzV0bR5UD4VWs/X2K6tls0nB
bacin9O7VZgo2uaxkrzNKxzRKYPFn8LbpSitINZEYbNllncdCFfhEQ3depdc//SL
WachCKPkwLRXSJCIXjnhGQSmLR4SvTP6xucr9ImwpE12Px4DWRGv8jux289q3flu
ZdX2xAUsLMLsbKjELwAJrgH1HaKNZhsw16UxCc4QSNqe9RhwD/Qc/O6Gj11ONhGn
6ury81y2zFpBu4Acrg==
=nP3L
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20190627v2' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- fix includes for _MAX constants, atomic functions and fwdecls,
by Sven Eckelmann (3 patches)
- shorten multicast tt/tvlv worker spinlock section, by Linus Luessing
- routeable multicast preparations: implement MAC multicast filtering,
by Linus Luessing (2 patches, David Millers comments integrated)
- remove return value checks for debugfs_create, by Greg Kroah-Hartman
- add routable multicast optimizations, by Linus Luessing (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shifting signed 32-bit value by 31 bits is undefined. Changing most
significant bit to unsigned.
Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
passing them down to the kernel or bypass kernel completely.
BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
kernel returns.
Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, that's why I've added
__cgroup_bpf_prog_array_is_empty that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsocktop and user adding/removing bpf progs from a cgroup.
The return code of the BPF program is handled as follows:
* 0: EPERM
* 1: success, continue with next BPF program in the cgroup chain
v9:
* allow overwriting setsockopt arguments (Alexei Starovoitov):
* use set_fs (same as kernel_setsockopt)
* buffer is always kzalloc'd (no small on-stack buffer)
v8:
* use s32 for optlen (Andrii Nakryiko)
v7:
* return only 0 or 1 (Alexei Starovoitov)
* always run all progs (Alexei Starovoitov)
* use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
(decided to use optval=-1 instead, optval=0 might be a valid input)
* call getsockopt hook after kernel handlers (Alexei Starovoitov)
v6:
* rework cgroup chaining; stop as soon as bpf program returns
0 or 2; see patch with the documentation for the details
* drop Andrii's and Martin's Acked-by (not sure they are comfortable
with the new state of things)
v5:
* skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
v4:
* don't export bpf_sk_fullsock helper (Martin Lau)
* size != sizeof(__u64) for uapi pointers (Martin Lau)
* offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
v3:
* typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
* reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
Nakryiko)
* use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
* use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
* new CG_SOCKOPT_ACCESS macro to wrap repeated parts
v2:
* moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
* aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
* bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
* added [0,2] return code check to verifier (Martin Lau)
* dropped unused buf[64] from the stack (Martin Lau)
* use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
* dropped bpf_target_off from ctx rewrites (Martin Lau)
* use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split. This will also allow a
greater range of subjects to represented.
============
WHY DO THIS?
============
The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.
For SETATTR, this includes actions that are about controlling access to a
key:
(1) Changing a key's ownership.
(2) Changing a key's security information.
(3) Setting a keyring's restriction.
And actions that are about managing a key's lifetime:
(4) Setting an expiry time.
(5) Revoking a key.
and (proposed) managing a key as part of a cache:
(6) Invalidating a key.
Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.
Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission. It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.
As for SEARCH permission, that currently covers:
(1) Finding keys in a keyring tree during a search.
(2) Permitting keyrings to be joined.
(3) Invalidation.
But these don't really belong together either, since these actions really
need to be controlled separately.
Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.
===============
WHAT IS CHANGED
===============
The SETATTR permission is split to create two new permissions:
(1) SET_SECURITY - which allows the key's owner, group and ACL to be
changed and a restriction to be placed on a keyring.
(2) REVOKE - which allows a key to be revoked.
The SEARCH permission is split to create:
(1) SEARCH - which allows a keyring to be search and a key to be found.
(2) JOIN - which allows a keyring to be joined as a session keyring.
(3) INVAL - which allows a key to be invalidated.
The WRITE permission is also split to create:
(1) WRITE - which allows a key's content to be altered and links to be
added, removed and replaced in a keyring.
(2) CLEAR - which allows a keyring to be cleared completely. This is
split out to make it possible to give just this to an administrator.
(3) REVOKE - see above.
Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together. An ACE specifies a subject, such as:
(*) Possessor - permitted to anyone who 'possesses' a key
(*) Owner - permitted to the key owner
(*) Group - permitted to the key group
(*) Everyone - permitted to everyone
Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.
Further subjects may be made available by later patches.
The ACE also specifies a permissions mask. The set of permissions is now:
VIEW Can view the key metadata
READ Can read the key content
WRITE Can update/modify the key content
SEARCH Can find the key by searching/requesting
LINK Can make a link to the key
SET_SECURITY Can change owner, ACL, expiry
INVAL Can invalidate
REVOKE Can revoke
JOIN Can join this keyring
CLEAR Can clear this keyring
The KEYCTL_SETPERM function is then deprecated.
The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.
The KEYCTL_INVALIDATE function then requires INVAL.
The KEYCTL_REVOKE function then requires REVOKE.
The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.
The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.
======================
BACKWARD COMPATIBILITY
======================
To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.
It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.
SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
on if a keyring is being altered.
The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.
It will make the following mappings:
(1) INVAL, JOIN -> SEARCH
(2) SET_SECURITY -> SETATTR
(3) REVOKE -> WRITE if SETATTR isn't already set
(4) CLEAR -> WRITE
Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.
=======
TESTING
=======
This passes the keyutils testsuite for all but a couple of tests:
(1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
if the type doesn't have ->read(). You still can't actually read the
key.
(2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
work as Other has been replaced with Everyone in the ACL.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible for the application to determine whether the AF_XDP
socket is running in zero-copy mode. To achieve this, add a new
getsockopt option XDP_OPTIONS that returns flags. The only flag
supported for now is the zero-copy mode indicator.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To be able to apply our group aware multicast optimizations to packets
with a scope greater than link-local we need to not only keep track of
multicast listeners but also multicast routers.
With this patch a node detects the presence of multicast routers on
its segment by checking if
/proc/sys/net/ipv{4,6}/conf/<bat0|br0(bat)>/mc_forwarding is set for one
thing. This option is enabled by multicast routing daemons and needed
for the kernel's multicast routing tables to receive and route packets.
For another thing if a bridge is configured on top of bat0 then the
presence of an IPv6 multicast router behind this bridge is currently
detected by checking for an IPv6 multicast "All Routers Address"
(ff02::2). This should later be replaced by querying the bridge, which
performs proper, RFC4286 compliant Multicast Router Discovery (our
simplified approach includes more hosts than necessary, most notably
not just multicast routers but also unicast ones and is not applicable
for IPv4).
If no multicast router is detected then this is signalized via the new
BATADV_MCAST_WANT_NO_RTR4 and BATADV_MCAST_WANT_NO_RTR6
multicast tvlv flags.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Fill in some padding in the disc record structure, and add GCC
packed and aligned attributes to ensure that it is correctly
laid out.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently a key has a standard matching criteria of { type, description }
and this is used to only allow keys with unique criteria in a keyring.
This means, however, that you cannot have keys with the same type and
description but a different target namespace in the same keyring.
This is a potential problem for a containerised environment where, say, a
container is made up of some parts of its mount space involving netfs
superblocks from two different network namespaces.
This is also a problem for shared system management keyrings such as the
DNS records keyring or the NFS idmapper keyring that might contain keys
from different network namespaces.
Fix this by including a namespace component in a key's matching criteria.
Keyring types are marked to indicate which, if any, namespace is relevant
to keys of that type, and that namespace is set when the key is created
from the current task's namespace set.
The capability bit KEYCTL_CAPS1_NS_KEY_TAG is set if the kernel is
employing this feature.
Signed-off-by: David Howells <dhowells@redhat.com>
Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission). This isn't very container friendly, however.
Make the following changes:
(1) Make default session, process and thread keyring names begin with a
'.' instead of '_'.
(2) Keyrings whose names begin with a '.' aren't added to the list. Such
keyrings are system specials.
(3) Replace the global list with per-user_namespace lists. A keyring adds
its name to the list for the user_namespace that it is currently in.
(4) When a user_namespace is deleted, it just removes itself from the
keyring name list.
The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.
This can be tested by:
# keyctl newring foo @s
995906392
# unshare -U
$ keyctl show
...
995906392 --alswrv 65534 65534 \_ keyring: foo
...
$ keyctl session foo
Joined session keyring: 935622349
As can be seen, a new session keyring was created.
The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
For all string attributes for which we don't currently accept the element
as input, we only use it as output, set the string length to
RDMA_NLDEV_ATTR_EMPTY_STRING which is defined as 1. That way we will only
accept a null string for that element. This will prevent someone from
writing a new input routine that uses the element without also updating
the policy to have a valid value.
Also while there, make sure the existing entries that are valid have the
correct policy, if not, correct the policy. Remove unnecessary checks
for nla_strlcpy() overflow once the policy has been set correctly.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resolve conflict between d2912cb15bdd ("treewide: Replace GPLv2
boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer
and fe03d4745675 ("Update my email address") which updates Jozsef
Kadlecsik's email.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>