Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.
Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
This affected output route lookup for packets sent on an ip6gretap device
in cases where routing was dependent on the value of flowi6_proto.
Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.
Suggested-by: Jiri Benc <jbenc@redhat.com>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Bug fixes and tracepoints
Here are a bunch of bug fixes:
(1) Need to set the timestamp on a Tx packet before queueing it to avoid
trouble with the retransmission function.
(2) Don't send an ACK at the end of the service reply transmission; it's
the responsibility of the client to send an ACK to close the call.
The service can resend the last DATA packet or send a PING ACK.
(3) Wake sendmsg() on abnormal call termination.
(4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.
(5) Use before_eq() & co. to compare serial numbers (which may wrap).
(6) Start the resend timer on DATA packet transmission.
(7) Don't accidentally cancel a retransmission upon receiving a NACK.
(8) Fix the call timer setting function to deal with timeouts that are now
or past.
(9) Don't use a flag to communicate the presence of the last packet in the
Tx buffer from sendmsg to the input routines where ACK and DATA
reception is handled. The problem is that there's a window between
queueing the last packet for transmission and setting the flag in
which ACKs or reply DATA packets can arrive, causing apparent state
machine violation issues.
Instead use the annotation buffer to mark the last packet and pick up
and set the flag in the input routines.
(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
someone else nicked the ACK we were about to transmit.
There are also new tracepoints and one altered tracepoint used to track
down the above bugs:
(11) Call timer tracepoint.
(12) Data Tx tracepoint (and adjustments to ACK tracepoint).
(13) Injected Rx packet loss tracepoint.
(14) Ack proposal tracepoint.
(15) Retransmission selection tracepoint.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2016-09-23
Only two patches this time:
1) Fix a comment reference to struct xfrm_replay_state_esn.
From Richard Guy Briggs.
2) Convert xfrm_state_lookup to rcu, we don't need the
xfrm_state_lock anymore in the input path.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.
For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.
Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
Set vf vlan protocol 802.1ad:
ip link set eth0 vf 1 vlan 100 proto 802.1ad
Set vf to VST (802.1Q) mode:
ip link set eth0 vf 1 vlan 100 proto 802.1Q
Or by omitting the new parameter
ip link set eth0 vf 1 vlan 100
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:
net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]
This moves the two added functions inside the #ifdef that guards
their callers.
Fixes: 703286608a22 ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited. It also prints a warning
every time this feature is used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a tracepoint to log in rxrpc_resend() which packets will be
retransmitted. Note that if a positive ACK comes in whilst we have dropped
the lock to retransmit another packet, the actual retransmission may not
happen, though some of the effects will (such as altering the congestion
management).
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log proposed ACKs, including whether the proposal is
used to update a pending ACK or is discarded in favour of an earlier,
higher priority ACK.
Whilst we're at it, get rid of the rxrpc_acks() function and access the
name array directly. We do, however, need to validate the ACK reason
number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
array.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log transmission of DATA packets (including loss
injection).
Adjust the ACK transmission tracepoint to include the packet serial number
and to line this up with the DATA transmission display.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_send_call_packet() is invoking the tx_ack tracepoint before it checks
whether there's an ACK to transmit (another thread may jump in and transmit
it).
Fix this by only invoking the tracepoint if we get a valid ACK to transmit.
Further, only allocate a serial number if we're going to actually transmit
something.
Signed-off-by: David Howells <dhowells@redhat.com>
When the last packet of data to be transmitted on a call is queued, tx_top
is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this
leaves a race in the ACK processing side of things because the flag affects
the interpretation of tx_top and also allows us to start receiving reply
data before we've finished transmitting.
To fix this, make the following changes:
(1) rxrpc_queue_packet() now sets a marker in the annotation buffer
instead of setting the RXRPC_CALL_TX_LAST flag.
(2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
same context as the routines that use it.
(3) rxrpc_end_tx_phase() is simplified to just shift the call state.
The Tx window must have been rotated before calling to discard the
last packet.
(4) rxrpc_receiving_reply() is added to handle the arrival of the first
DATA packet of a reply to a client call (which is an implicit ACK of
the Tx phase).
(5) The last part of rxrpc_input_ack() is reordered to perform Tx
rotation, then soft-ACK application and then to end the phase if we've
rotated the last packet. In the event of a terminal ACK, the soft-ACK
application will be skipped as nAcks should be 0.
(6) rxrpc_input_ackall() now has to rotate as well as ending the phase.
In addition:
(7) Alter the transmit tracepoint to log the rotation of the last packet.
(8) Remove the no-longer relevant queue_reqack tracepoint note. The
ACK-REQUESTED packet header flag is now set as needed when we actually
transmit the packet and may vary by retransmission.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the call timer in the following ways:
(1) If call->resend_at or call->ack_at are before or equal to the current
time, then ignore that timeout.
(2) If call->expire_at is before or equal to the current time, then don't
set the timer at all (possibly we should queue the call).
(3) Don't skip modifying the timer if timer_pending() is true. This
indicates that the timer is working, not that it has expired and is
running/waiting to run its expiry handler.
Also call rxrpc_set_timer() to start the call timer going rather than
calling add_timer().
Signed-off-by: David Howells <dhowells@redhat.com>
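The first two timer rules above can be sketched in userspace C as follows. This is only an illustration: pick_call_timeout() and its plain integer timestamps are hypothetical stand-ins, not the rxrpc code, which works on the call structure and kernel time types. Rule (3), about timer_pending(), is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose when the call timer should fire.
 * Returns false if the call has already expired (rule 2), in which
 * case the timer is not set at all.
 */
static bool pick_call_timeout(uint64_t now, uint64_t expire_at,
                              uint64_t resend_at, uint64_t ack_at,
                              uint64_t *when)
{
    uint64_t t;

    if (expire_at <= now)
        return false;

    t = expire_at;
    /* Rule 1: resend/ack timeouts that are now or past are ignored. */
    if (resend_at > now && resend_at < t)
        t = resend_at;
    if (ack_at > now && ack_at < t)
        t = ack_at;

    *when = t;
    return true;
}
```

With now = 100 and expire_at = 500, a resend_at of 90 is skipped and an ack_at of 200 becomes the next timeout.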
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
it updates the Tx packet annotations in the annotation buffer. If a
soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
states and that is fine; but if the soft-ACK is an NACK, we overwrite the
to-be-retransmitted with a nak - which isn't.
Instead, we need to let any scheduled retransmission stand if the packet
was NAK'd.
Note that we don't reissue a resend if the annotation is in the
to-be-retransmitted state because someone else must've scheduled the
resend already.
Signed-off-by: David Howells <dhowells@redhat.com>
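The annotation update described above amounts to a small state transition. A minimal sketch, assuming four annotation states; the enum and function names are made up for illustration and are not the identifiers used in rxrpc:

```c
#include <stdbool.h>

/* Hypothetical annotation states for a packet in the Tx buffer. */
enum tx_annotation {
    TX_ANNO_UNACK,
    TX_ANNO_ACK,
    TX_ANNO_NAK,
    TX_ANNO_RETRANS,   /* retransmission already scheduled */
};

/* Apply one soft-ACK entry to the current annotation. */
static enum tx_annotation apply_soft_ack(enum tx_annotation cur, bool is_ack)
{
    if (is_ack)
        return TX_ANNO_ACK;        /* an ACK overrides any earlier state */
    if (cur == TX_ANNO_RETRANS)
        return TX_ANNO_RETRANS;    /* let the scheduled resend stand */
    return TX_ANNO_NAK;
}
```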
When a DATA packet has its initial transmission, we may need to start or
adjust the resend timer. Without this we end up relying on being sent a
NACK to initiate the resend.
Signed-off-by: David Howells <dhowells@redhat.com>
before_eq() and friends should be used to compare serial numbers (when not
checking for (non)equality) rather than casting to int, subtracting and
checking the result.
Signed-off-by: David Howells <dhowells@redhat.com>
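For readers unfamiliar with wrap-safe comparison, the idea behind before_eq() and friends can be sketched in userspace C. The serial_*() names below are hypothetical; they only mirror the usual pattern of taking the signed difference of two 32-bit counters.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if serial a comes strictly before b, allowing for wraparound. */
static inline bool serial_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True if serial a comes before or equals b, allowing for wraparound. */
static inline bool serial_before_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}
```

A plain a < b would claim that 0xfffffff0 comes after 0x10, although the counter has merely wrapped; the helpers above order them correctly as long as the two values are less than 2^31 apart.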
Add a small helper that complements 36bbef52c7eb ("bpf: direct packet
write and access for helpers for clsact progs") for invalidating the
current skb->hash after mangling on headers via direct packet write.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same motivation as in commit 80b48c445797 ("bpf: don't use raw processor
id in generic helper"), but this time for XDP typed programs. Thus, allow
for preemption checks when we have DEBUG_PREEMPT enabled, and otherwise
use the raw variant.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use skb_to_full_sk() helper introduced in commit bd5eb35f16a9
("xfrm: take care of request sockets") as otherwise we miss tcp synack
messages, since ownership is on request socket and therefore it would
miss the sk_fullsock() check. Use skb_to_full_sk() as also done similarly
in the bpf_get_cgroup_classid() helper via 2309236c13fe ("cls_cgroup:
get sk_classid only from full sockets") fix to not let this fall through.
Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today the DSA drivers are in charge of flushing the MAC addresses
associated to a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.
This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
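The generic logic described above could look roughly like the following userspace sketch. The enum, the ops structure and the function names are illustrative assumptions, not the DSA definitions; only the rule itself (flush when leaving Learning or Forwarding for Disabled, Blocking or Listening, and only if the driver provides the optional operation) comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

enum stp_state {
    STP_DISABLED,
    STP_BLOCKING,
    STP_LISTENING,
    STP_LEARNING,
    STP_FORWARDING,
};

/* Hypothetical ops table with an optional fast-age callback. */
struct example_switch_ops {
    void (*port_fast_age)(int port);
};

/* True when the transition requires flushing learned MAC addresses. */
static bool needs_fast_age(enum stp_state old_state, enum stp_state new_state)
{
    bool was_learning = old_state == STP_LEARNING ||
                        old_state == STP_FORWARDING;
    bool stops_learning = new_state == STP_DISABLED ||
                          new_state == STP_BLOCKING ||
                          new_state == STP_LISTENING;

    return was_learning && stops_learning;
}

/* Generic layer: call the driver only if it implements the operation. */
static void port_state_changed(const struct example_switch_ops *ops, int port,
                               enum stp_state old_state,
                               enum stp_state new_state)
{
    if (needs_fast_age(old_state, new_state) && ops->port_fast_age != NULL)
        ops->port_fast_age(port);
}
```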
Add a void helper to set the STP state of a port, checking first if the
required routine is provided by the driver.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If DBGUNDO() is enabled (FASTRETRANS_DEBUG > 1), a compile
error will happen, since inet6_sk(sk)->daddr became sk->sk_v6_daddr
Fixes: efe4208f47f9 ("ipv6: make lookups simpler and faster")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the sent data. At that point, the call is complete.
Signed-off-by: David Howells <dhowells@redhat.com>
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Signed-off-by: David Howells <dhowells@redhat.com>
With TCP MTU probing enabled and offload TX checksumming disabled,
tcp_mtu_probe() calculated the wrong checksum when a fragment being copied
into the probe's SKB had an odd length. This was caused by the direct use
of skb_copy_and_csum_bits() to calculate the checksum, as it pads the
fragment being copied, if needed. When this fragment was not the last, a
subsequent call used the previous checksum without considering this
padding.
The effect was a connection stalled in one direction, as even retransmissions
wouldn't solve the problem, because the checksum was never recalculated for
the full SKB length.
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
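To see why an odd-length fragment needs special care, consider this userspace sketch of the Internet checksum's ones'-complement sum. It is not the tcp_mtu_probe() fix itself; sum_words(), fold16(), csum_of() and csum_combine() are hypothetical helpers written for illustration. A partial sum computed over a fragment that starts at an odd offset has its bytes in the wrong lanes and must be byte-swapped before it is added to the running sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a buffer as big-endian 16-bit words; an odd trailing byte is zero-padded. */
static uint32_t sum_words(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    return sum;
}

/* Fold carries back in to get a 16-bit ones'-complement sum. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

static uint16_t csum_of(const uint8_t *p, size_t len)
{
    return fold16(sum_words(p, len));
}

/*
 * Combine the sum of a second fragment that starts at byte 'offset'
 * of the full buffer. At an odd offset the second sum is byte-swapped.
 */
static uint16_t csum_combine(uint16_t first, uint16_t second, size_t offset)
{
    if (offset & 1)
        second = (uint16_t)((second << 8) | (second >> 8));
    return fold16((uint32_t)first + second);
}
```

For a 7-byte buffer split into fragments of 3 and 4 bytes, simply adding the two partial sums gives a different value from the sum over the whole buffer, while csum_combine() with offset 3 gives the same value.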
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.
We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)
Add an EWMA of the unthrottle latency to help diagnostics.
This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically called from a dequeue() called after an enqueue()
for a different flow.
Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &
Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52
After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52
A new iproute2 tc can output the 'unthrottle latency' :
$ tc -s qd sh dev eth0 | grep latency
0 gc, 0 highprio, 32490767 throttled, 2382 ns latency
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
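An EWMA like the one mentioned above is cheap to maintain with shifts. A minimal sketch, assuming a weight of 1/8 for each new sample; the weight and the function name are assumptions made for this illustration:

```c
#include <stdint.h>

/*
 * Exponentially weighted moving average with weight 1/8:
 * avg = avg - avg/8 + sample/8, using shifts instead of division.
 */
static uint64_t ewma_update(uint64_t avg_ns, uint64_t sample_ns)
{
    avg_ns -= avg_ns >> 3;
    avg_ns += sample_ns >> 3;
    return avg_ns;
}
```

Each sample would be the difference between the current time and the scheduled time of the oldest packet in the delayed tree, so the average follows recent timer latency without storing any history.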
sctp_acked() is using 32-bit arithmetic on 16-bit variables, via the
TSN_lte() macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled
and thus to verify the Gap Ack blocks we can just use plain
less-than-or-equal and greater-than-or-equal checks.
Also, rename gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
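The point about wrapping can be illustrated with a small userspace sketch. The structure and function names are hypothetical; the logic follows the description above: handle wraparound once when computing the offset from the cumulative TSN, then compare against the 16-bit Gap Ack block bounds with ordinary operators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Gap Ack block: start and end are 16-bit offsets from the cumulative TSN. */
struct gap_ack_block {
    uint16_t start;
    uint16_t end;
};

static bool tsn_is_acked(uint32_t tsn, uint32_t ctsn,
                         const struct gap_ack_block *blocks, size_t n)
{
    uint32_t tsn_offset;
    size_t i;

    /* At or before the cumulative ack point: wrap-safe serial comparison. */
    if ((int32_t)(tsn - ctsn) <= 0)
        return true;

    /* Wrapping is absorbed by the unsigned subtraction. */
    tsn_offset = tsn - ctsn;

    for (i = 0; i < n; i++)
        if (blocks[i].start <= tsn_offset && tsn_offset <= blocks[i].end)
            return true;
    return false;
}
```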
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On error path in route4_change(), 'f' could be NULL,
so we should check NULL before calling tcf_exts_destroy().
Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already checked for !found just a bit before:
        if (!found) {
                regs->verdict.code = NFT_BREAK;
                return;
        }
        if (found && set->flags & NFT_SET_MAP)
            ^^^^^
So this redundant check can just go away.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's better to use sizeof(info->name)-1 as index to force set the string
tail instead of literal number '29'.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
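As a small illustration of the idiom, using a made-up structure rather than the real one: deriving the index from sizeof keeps the terminator in the right place even if the array size changes later.

```c
/* Hypothetical structure with a fixed-size name field. */
struct example_info {
    char name[30];
};

/* Force NUL termination at the last byte of the array, whatever its size. */
static void terminate_name(struct example_info *info)
{
    info->name[sizeof(info->name) - 1] = '\0';
}
```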
Several places in netfilter open-code a one-time random initialization.
We can use net_get_random_once() to simplify this code.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
pkt->xt.thoff is not always set properly, but we use it without any check.
For the payload expression, this causes wrong results. For nftrace, we may
report the wrong network or transport header to user space. Furthermore,
adding the following nft rule prints a warning message:
# nft add rule arp filter output meta nftrace set 1
WARNING: CPU: 0 PID: 13428 at net/netfilter/nf_tables_trace.c:263
nft_trace_notify+0x4a3/0x5e0 [nf_tables]
Call Trace:
[<ffffffff813d58ae>] dump_stack+0x63/0x85
[<ffffffff810a4c0b>] __warn+0xcb/0xf0
[<ffffffff810a4d3d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa0589703>] nft_trace_notify+0x4a3/0x5e0 [nf_tables]
[ ... ]
[<ffffffffa05690a8>] nft_do_chain_arp+0x78/0x90 [nf_tables_arp]
[<ffffffff816f4aa2>] nf_iterate+0x62/0x80
[<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
[<ffffffff81732bbf>] arp_xmit+0x8f/0xb0
[ ... ]
[<ffffffff81732d36>] arp_solicit+0x106/0x2c0
So before we use pkt->xt.thoff, check the tprot_set first.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There's an off-by-one issue in nft_payload_fast_eval(): skb_tail_pointer()
and ptr + priv->len both point to the last valid address plus 1. So if
they are equal, we can still fetch valid data, and it's unnecessary to
fall back to nft_payload_eval().
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
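The boundary condition can be sketched as follows. payload_fits() is a hypothetical helper; the important detail is that a read ending exactly at the tail pointer is still in bounds, because the tail points one past the last valid byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if 'len' bytes starting at 'ptr' lie entirely before 'tail',
 * where 'tail' points one past the last valid byte (ptr <= tail).
 */
static bool payload_fits(const uint8_t *ptr, size_t len, const uint8_t *tail)
{
    return len <= (size_t)(tail - ptr);
}
```

With an 8-byte buffer, reading 4 bytes at offset 4 ends exactly at the tail and is allowed; reading 4 bytes at offset 5 is not.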
Currently, the user can specify the queue numbers via the _QUEUE_NUM and
_QUEUE_TOTAL attributes, which is enough in most situations.
But actually, it is not very flexible, for example:
tcp dport 80 mapped to queue0
tcp dport 81 mapped to queue1
tcp dport 82 mapped to queue2
In order to do this, we must add 3 nft rules, and more
mappings mean more rules ...
So take one register to select the queue number, then we can add one
simple rule to map queues, maybe like this:
queue num tcp dport map { 80:0, 81:1, 82:2 ... }
Florian Westphal also proposed wider usage scenarios:
queue num jhash ip saddr . ip daddr mod ...
queue num meta cpu ...
queue num meta mark ...
The last point is how to load the queue number from sreg. We could use
*(u16 *)&regs->data[reg] to load the queue number, just as the nat
expression does to load its l4port.
But we need to cooperate with the hash, meta cpu, meta mark expressions and
so on. They all store their result as a u32, so casting to a u16 pointer and
dereferencing it would produce a wrong result on big-endian systems.
So just keep it simple: treat the queue number as a u32, although a u16
would already be enough.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
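The endianness problem can be shown without touching raw memory. In this sketch (all names are hypothetical) the two u16_view_*() helpers model what a 16-bit load from the start of a 32-bit register would return on each byte order, while reading the whole u32 gives the same answer everywhere.

```c
#include <stdint.h>

/* A u16 load at the register's address on little-endian: the low half. */
static uint16_t u16_view_little_endian(uint32_t reg)
{
    return (uint16_t)(reg & 0xffff);
}

/* The same load on big-endian: the high half, which is 0 for small values. */
static uint16_t u16_view_big_endian(uint32_t reg)
{
    return (uint16_t)(reg >> 16);
}

/* Treating the queue number as u32 avoids the problem entirely. */
static uint32_t queue_num_from_reg(uint32_t reg)
{
    return reg;
}
```

If a meta or hash expression stores the value 2 as a u32, the big-endian 16-bit view reads 0 instead of 2.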
Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.
This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
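The kind of validation described above can be sketched in plain C. This is a simplified stand-in that takes an already decoded value instead of a netlink attribute; the function name and signature are assumptions for illustration only.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical range check: accept 'val' only if it fits the destination
 * field, whose largest representable value is 'max'.
 */
static int parse_u32_check(uint32_t val, uint32_t max, uint32_t *dest)
{
    if (val > max)
        return -ERANGE;
    *dest = val;
    return 0;
}
```

A caller storing the result in a u8 field would pass 255 as max, so that an attribute value such as 256 is rejected instead of being silently truncated to 0.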
From: Andrey Vagin <avagin@openvz.org>
Each namespace has an owning user namespace, and there is currently no way
to discover these relationships.
Pid and user namespaces are hierarchical. There is no way to discover
parent-child relationships either.
Why might we want to know the relationships between namespaces?
One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?
One more use-case (which is usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.
There [1] was a discussion about which interface to choose to determine
relationships between namespaces.
Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.
Here is an implementation of these ioctl-s.
$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:
fd = ioctl(ns_fd, ioctl_type);
where ioctl_type is one of the following:
NS_GET_USERNS
Returns a file descriptor that refers to an owning user namespace.
NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.
In addition to generic ioctl(2) errors, the following specific ones
can occur:
EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
EPERM The requested namespace is outside of the current namespace
scope.
[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101
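A userspace caller of the interface described in the man page excerpt might look like the sketch below. get_owning_userns() is a hypothetical helper; the fallback ioctl numbers are believed to match those in <linux/nsfs.h> and should be treated as an assumption if that header is not available.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback definitions, assumed to match <linux/nsfs.h>. */
#ifndef NS_GET_USERNS
#define NSIO          0xb7
#define NS_GET_USERNS _IO(NSIO, 0x1)
#define NS_GET_PARENT _IO(NSIO, 0x2)
#endif

/*
 * Open a namespace file (for example "/proc/self/ns/pid") and ask for the
 * user namespace that owns it. Returns a new file descriptor, or -1 with
 * errno set (EPERM if the owner is outside the caller's scope).
 */
static int get_owning_userns(const char *ns_path)
{
    int ns_fd = open(ns_path, O_RDONLY);
    int userns_fd;
    int saved_errno;

    if (ns_fd < 0)
        return -1;

    userns_fd = ioctl(ns_fd, NS_GET_USERNS);
    saved_errno = errno;
    close(ns_fd);            /* close() must not clobber the ioctl's errno */
    errno = saved_errno;
    return userns_fd;
}
```

The returned descriptor refers to a user namespace and can itself be passed to NS_GET_PARENT to walk further up the hierarchy.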
Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric
Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Return -EPERM if an owning user namespace is outside of a process
current user namespace.
v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special case was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The current error codes returned when the per-user per-user-namespace
limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it was made clear that those were
the wrong error codes, but a correct error code was not suggested.
The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
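In code, the convention amounts to something like this sketch. The structure and function are hypothetical and only show a counter that reports ENOSPC once its limit is reached.

```c
#include <errno.h>

/* Hypothetical per-user counter with a configurable maximum. */
struct example_count {
    int count;
    int max;
};

/* Take one unit; returns 0 on success or -ENOSPC when the limit is hit. */
static int example_count_inc(struct example_count *c)
{
    if (c->count >= c->max)
        return -ENOSPC;
    c->count++;
    return 0;
}
```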
Pull networking fixes from David Miller:
"Mostly small bits scattered all over the place, which is usually how
things go this late in the -rc series.
1) Proper driver init device resets in bnx2, from Baoquan He.
2) Fix accounting overflow in __tcp_retransmit_skb(),
sk_forward_alloc, and ip_idents_reserve, from Eric Dumazet.
3) Fix crash in bna driver ethtool stats handling, from Ivan Vecera.
4) Missing check of skb_linearize() return value in mac80211, from
Johannes Berg.
5) Endianness fix in nf_table_trace dumps, from Liping Zhang.
6) SSN comparison fix in SCTP, from Marcelo Ricardo Leitner.
7) Update DSA and b44 MAINTAINERS entries.
8) Make input path of vti6 driver work again, from Nicolas Dichtel.
9) Off-by-one in mlx4, from Sebastian Ott.
10) Fix fallback route lookup handling in ipv6, from Vincent Bernat.
11) Fix stack corruption on probe in qed driver, from Yuval Mintz.
12) PHY init fixes in r8152 from Hayes Wang.
13) Missing SKB free in irda_accept error path, from Phil Turnbull"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tcp: properly account Fast Open SYN-ACK retrans
tcp: fix under-accounting retransmit SNMP counters
MAINTAINERS: Update b44 maintainer.
net: get rid of an signed integer overflow in ip_idents_reserve()
net/mlx4_core: Fix to clean devlink resources
net: can: ifi: Configure transmitter delay
vti6: fix input path
ipmr, ip6mr: return lastuse relative to now
r8152: disable ALDPS and EEE before setting PHY
r8152: remove r8153_enable_eee
r8152: move PHY settings to hw_phy_cfg
r8152: move enabling PHY
r8152: move some functions
cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter
qed: Fix stack corruption on probe
MAINTAINERS: Add an entry for the core network DSA code
net: ipv6: fallback to full lookup if table lookup is unsuitable
net/mlx5: E-Switch, Handle mode change failures
net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code
net/mlx5: Fix flow counter bulk command out mailbox allocation
...
Scan response data should not be updated unless there
is an advertising instance.
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add support for an offset value for the incremental counter and the random
generator. With this option the sysadmin is able to start the counter at a
certain value and then apply the generated number.
Example:
meta mark set numgen inc mod 2 offset 100
This will generate marks with the series 100, 101, 100, 101, ...
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
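The arithmetic behind the example rule can be sketched as follows. The structure and function names are made up, and the real expression keeps its counter atomically; this only shows how modulus and offset combine.

```c
#include <stdint.h>

/* Hypothetical incremental generator: value = (counter mod modulus) + offset. */
struct numgen_inc {
    uint32_t counter;
    uint32_t modulus;
    uint32_t offset;
};

static uint32_t numgen_inc_next(struct numgen_inc *g)
{
    uint32_t v = g->counter % g->modulus;

    g->counter++;
    return v + g->offset;
}
```

With modulus 2 and offset 100, successive calls return 100, 101, 100, 101, matching the series in the example above.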
Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Preparation for slow-start algorithm [ver #2]
Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:
(1) Stop storing the protocol header in the Tx socket buffers, but rather
generate it on the fly. This potentially saves a little space and
makes it easier to alter the header just before transmission (the
flags may get altered and the serial number has to be changed).
(2) Mask off the Tx buffer annotations and add a flag to record which ones
have already been resent.
(3) Track RTT on a per-peer basis for use in future changes. Tracepoints
are added to log this.
(4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
ACK from which RTT data can be calculated. The response also carries
other useful information.
(5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively
using sendmsg, this allows us, under some circumstances, to avoid
having to rely on the background work item to run to generate this
ACK.
This requires ktime_sub_ms() to be added.
(6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
ACKs from which RTT data can be calculated.
(7) Limit the use of pings and ACK requests for RTT determination.
Changes:
(V2) Don't use the C division operator for 64-bit division. One instance
should use do_div() and the other should be using nsecs_to_jiffies().
The last two patches got transposed, leading to an undefined symbol
in one of them.
Reported-by: kbuild test robot <lkp@intel.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
This could probably be made adjustable in future.
Signed-off-by: David Howells <dhowells@redhat.com>
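The policy described above (ping the first three, then at most once per second) can be sketched like this. The structure, the millisecond timestamps and the function name are assumptions for illustration; the actual rxrpc bookkeeping may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state. */
struct ping_limiter {
    unsigned int sent;      /* pings sent so far, capped at 3 */
    uint64_t last_ms;       /* time of the most recent ping */
};

/* Decide whether a new incoming call at time now_ms should trigger a PING ACK. */
static bool should_send_ping(struct ping_limiter *l, uint64_t now_ms)
{
    if (l->sent < 3) {
        l->sent++;
        l->last_ms = now_ms;
        return true;
    }
    if (now_ms - l->last_ms >= 1000) {
        l->last_ms = now_ms;
        return true;
    }
    return false;
}
```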