93ab6cc691
Some networks can make sure TCP payload can exactly fit 4KB pages, with well chosen MSS/MTU and architectures. Implement mmap() system call so that applications can avoid copying data without complex splice() games. Note that a successful mmap( X bytes) on TCP socket is consuming bytes, as if recvmsg() has been done. (tp->copied += X) Only PROT_READ mappings are accepted, as skb page frags are fundamentally shared and read only. If tcp_mmap() finds data that is not a full page, or a patch of urgent data, -EINVAL is returned, no bytes are consumed. Application must fallback to recvmsg() to read the problematic sequence. mmap() wont block, regardless of socket being in blocking or non-blocking mode. If not enough bytes are in receive queue, mmap() would return -EAGAIN, or -EIO if socket is in a state where no other bytes can be added into receive queue. An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD) to efficiently use mmap() On the sender side, MSG_EOR might help to clearly separate unaligned headers and 4K-aligned chunks if necessary. Tested: mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch. MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header) Without mmap() (tcp_mmap -s) received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit, cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit, cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit, cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit, cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches With mmap() on receiver (tcp_mmap -s -z) received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit, cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit, cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit, cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit, cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
---|---|---|
.. | ||
ila | ||
netfilter | ||
addrconf_core.c | ||
addrconf.c | ||
addrlabel.c | ||
af_inet6.c | ||
ah6.c | ||
anycast.c | ||
calipso.c | ||
datagram.c | ||
esp6_offload.c | ||
esp6.c | ||
exthdrs_core.c | ||
exthdrs_offload.c | ||
exthdrs.c | ||
fib6_notifier.c | ||
fib6_rules.c | ||
fou6.c | ||
icmp.c | ||
inet6_connection_sock.c | ||
inet6_hashtables.c | ||
ip6_checksum.c | ||
ip6_fib.c | ||
ip6_flowlabel.c | ||
ip6_gre.c | ||
ip6_icmp.c | ||
ip6_input.c | ||
ip6_offload.c | ||
ip6_offload.h | ||
ip6_output.c | ||
ip6_tunnel.c | ||
ip6_udp_tunnel.c | ||
ip6_vti.c | ||
ip6mr.c | ||
ipcomp6.c | ||
ipv6_sockglue.c | ||
Kconfig | ||
Makefile | ||
mcast_snoop.c | ||
mcast.c | ||
mip6.c | ||
ndisc.c | ||
netfilter.c | ||
output_core.c | ||
ping.c | ||
proc.c | ||
protocol.c | ||
raw.c | ||
reassembly.c | ||
route.c | ||
seg6_hmac.c | ||
seg6_iptunnel.c | ||
seg6_local.c | ||
seg6.c | ||
sit.c | ||
syncookies.c | ||
sysctl_net_ipv6.c | ||
tcp_ipv6.c | ||
tcpv6_offload.c | ||
tunnel6.c | ||
udp_impl.h | ||
udp_offload.c | ||
udp.c | ||
udplite.c | ||
xfrm6_input.c | ||
xfrm6_mode_beet.c | ||
xfrm6_mode_ro.c | ||
xfrm6_mode_transport.c | ||
xfrm6_mode_tunnel.c | ||
xfrm6_output.c | ||
xfrm6_policy.c | ||
xfrm6_protocol.c | ||
xfrm6_state.c | ||
xfrm6_tunnel.c |