3a8a670eee
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by
     a new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid
   - Enable socket busy polling with CONFIG_RT
   - Improve reliability and efficiency of reporting for ref_tracker
   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]
   - Use per-VMA locking for "page-flipping" TCP receive zerocopy
   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags
   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative
   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)
   - Avoid waking up applications using TLS sockets until we have a
     full record
   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring
   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address
   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch
   - PPPoE: make number of hash bits configurable
   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP
     client (ipconfig)
   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)
   - Support matching on Connectivity Fault Management (CFM) packets
   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug
   - HSR: don't enable promiscuous mode if device offloads the proto
   - Support active scanning in IEEE 802.15.4
   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators
   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything
   - Accept dynptr memory as memory arguments passed to helpers
   - Add routing table ID to bpf_fib_lookup BPF helper
   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)
   - Show target_{obj,btf}_id in tracing link fdinfo
   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value
   - Increase ip_vs_conn_tab_bits range for 64BIT builds
   - Allow updating size of a set
   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)
   - Support configuring rate selection pins of SFP modules
   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines
   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer
   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)
   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message
   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver
   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE
   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10
   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames
   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)
   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of
   - CAN: Kvaser PCIEcan:
      - support packet timestamping
   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1649 lines
42 KiB
C
// SPDX-License-Identifier: GPL-2.0-only
/*
 * linux/net/sunrpc/svcsock.c
 *
 * These are the RPC server socket internals.
 *
 * The server scheduling algorithm does not always distribute the load
 * evenly when servicing a single client. May need to modify the
 * svc_xprt_enqueue procedure...
 *
 * TCP support is largely untested and may be a little slow. The problem
 * is that we currently do two separate recvfrom's, one for the 4-byte
 * record length, and the second for the actual record. This could possibly
 * be improved by always reading a minimum size of around 100 bytes and
 * tucking any superfluous bytes away in a temporary store. Still, that
 * leaves write requests out in the rain. An alternative may be to peek at
 * the first skb in the queue, and if it matches the next TCP sequence
 * number, to extract the record marker. Yuck.
 *
 * Copyright (C) 1995, 1996 Olaf Kirch <okir@monad.swb.de>
 */

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/inet.h>
#include <linux/udp.h>
#include <linux/tcp.h>
#include <linux/unistd.h>
#include <linux/slab.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/file.h>
#include <linux/freezer.h>
#include <net/sock.h>
#include <net/checksum.h>
#include <net/ip.h>
#include <net/ipv6.h>
#include <net/udp.h>
#include <net/tcp.h>
#include <net/tcp_states.h>
#include <net/tls.h>
#include <net/handshake.h>
#include <linux/uaccess.h>
#include <linux/highmem.h>
#include <asm/ioctls.h>
#include <linux/key.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/xdr.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/sunrpc/svcsock.h>
#include <linux/sunrpc/stats.h>
#include <linux/sunrpc/xprt.h>

#include <trace/events/sock.h>
#include <trace/events/sunrpc.h>

#include "socklib.h"
#include "sunrpc.h"

#define RPCDBG_FACILITY	RPCDBG_SVCXPRT

/* To-do: to avoid tying up an nfsd thread while waiting for a
 * handshake request, the request could instead be deferred.
 */
enum {
	SVC_HANDSHAKE_TO = 5U * HZ
};

static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
					 int flags);
static int svc_udp_recvfrom(struct svc_rqst *);
static int svc_udp_sendto(struct svc_rqst *);
static void svc_sock_detach(struct svc_xprt *);
static void svc_tcp_sock_detach(struct svc_xprt *);
static void svc_sock_free(struct svc_xprt *);

static struct svc_xprt *svc_create_socket(struct svc_serv *, int,
					  struct net *, struct sockaddr *,
					  int, int);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key svc_key[2];
static struct lock_class_key svc_slock_key[2];

static void svc_reclassify_socket(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
		return;

	switch (sk->sk_family) {
	case AF_INET:
		sock_lock_init_class_and_name(sk, "slock-AF_INET-NFSD",
					      &svc_slock_key[0],
					      "sk_xprt.xpt_lock-AF_INET-NFSD",
					      &svc_key[0]);
		break;

	case AF_INET6:
		sock_lock_init_class_and_name(sk, "slock-AF_INET6-NFSD",
					      &svc_slock_key[1],
					      "sk_xprt.xpt_lock-AF_INET6-NFSD",
					      &svc_key[1]);
		break;

	default:
		BUG();
	}
}
#else
static void svc_reclassify_socket(struct socket *sock)
{
}
#endif

/**
 * svc_tcp_release_ctxt - Release transport-related resources
 * @xprt: the transport which owned the context
 * @ctxt: the context from rqstp->rq_xprt_ctxt or dr->xprt_ctxt
 *
 */
static void svc_tcp_release_ctxt(struct svc_xprt *xprt, void *ctxt)
{
}

/**
 * svc_udp_release_ctxt - Release transport-related resources
 * @xprt: the transport which owned the context
 * @ctxt: the context from rqstp->rq_xprt_ctxt or dr->xprt_ctxt
 *
 */
static void svc_udp_release_ctxt(struct svc_xprt *xprt, void *ctxt)
{
	struct sk_buff *skb = ctxt;

	if (skb)
		consume_skb(skb);
}

union svc_pktinfo_u {
	struct in_pktinfo pkti;
	struct in6_pktinfo pkti6;
};
#define SVC_PKTINFO_SPACE \
	CMSG_SPACE(sizeof(union svc_pktinfo_u))

static void svc_set_cmsg_data(struct svc_rqst *rqstp, struct cmsghdr *cmh)
{
	struct svc_sock *svsk =
		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
	switch (svsk->sk_sk->sk_family) {
	case AF_INET: {
			struct in_pktinfo *pki = CMSG_DATA(cmh);

			cmh->cmsg_level = SOL_IP;
			cmh->cmsg_type = IP_PKTINFO;
			pki->ipi_ifindex = 0;
			pki->ipi_spec_dst.s_addr =
				 svc_daddr_in(rqstp)->sin_addr.s_addr;
			cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
		}
		break;

	case AF_INET6: {
			struct in6_pktinfo *pki = CMSG_DATA(cmh);
			struct sockaddr_in6 *daddr = svc_daddr_in6(rqstp);

			cmh->cmsg_level = SOL_IPV6;
			cmh->cmsg_type = IPV6_PKTINFO;
			pki->ipi6_ifindex = daddr->sin6_scope_id;
			pki->ipi6_addr = daddr->sin6_addr;
			cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
		}
		break;
	}
}

static int svc_sock_result_payload(struct svc_rqst *rqstp, unsigned int offset,
				   unsigned int length)
{
	return 0;
}

/*
 * Report socket names for nfsdfs
 */
static int svc_one_sock_name(struct svc_sock *svsk, char *buf, int remaining)
{
	const struct sock *sk = svsk->sk_sk;
	const char *proto_name = sk->sk_protocol == IPPROTO_UDP ?
							"udp" : "tcp";
	int len;

	switch (sk->sk_family) {
	case PF_INET:
		len = snprintf(buf, remaining, "ipv4 %s %pI4 %d\n",
				proto_name,
				&inet_sk(sk)->inet_rcv_saddr,
				inet_sk(sk)->inet_num);
		break;
#if IS_ENABLED(CONFIG_IPV6)
	case PF_INET6:
		len = snprintf(buf, remaining, "ipv6 %s %pI6 %d\n",
				proto_name,
				&sk->sk_v6_rcv_saddr,
				inet_sk(sk)->inet_num);
		break;
#endif
	default:
		len = snprintf(buf, remaining, "*unknown-%d*\n",
				sk->sk_family);
	}

	if (len >= remaining) {
		*buf = '\0';
		return -ENAMETOOLONG;
	}
	return len;
}

static int
svc_tcp_sock_process_cmsg(struct svc_sock *svsk, struct msghdr *msg,
			  struct cmsghdr *cmsg, int ret)
{
	if (cmsg->cmsg_level == SOL_TLS &&
	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
		u8 content_type = *((u8 *)CMSG_DATA(cmsg));

		switch (content_type) {
		case TLS_RECORD_TYPE_DATA:
			/* TLS sets EOR at the end of each application data
			 * record, even though there might be more frames
			 * waiting to be decrypted.
			 */
			msg->msg_flags &= ~MSG_EOR;
			break;
		case TLS_RECORD_TYPE_ALERT:
			ret = -ENOTCONN;
			break;
		default:
			ret = -EAGAIN;
		}
	}
	return ret;
}

static int
svc_tcp_sock_recv_cmsg(struct svc_sock *svsk, struct msghdr *msg)
{
	union {
		struct cmsghdr cmsg;
		u8 buf[CMSG_SPACE(sizeof(u8))];
	} u;
	int ret;

	msg->msg_control = &u;
	msg->msg_controllen = sizeof(u);
	ret = sock_recvmsg(svsk->sk_sock, msg, MSG_DONTWAIT);
	if (unlikely(msg->msg_controllen != sizeof(u)))
		ret = svc_tcp_sock_process_cmsg(svsk, msg, &u.cmsg, ret);
	return ret;
}

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
static void svc_flush_bvec(const struct bio_vec *bvec, size_t size, size_t seek)
{
	struct bvec_iter bi = {
		.bi_size = size + seek,
	};
	struct bio_vec bv;

	bvec_iter_advance(bvec, &bi, seek & PAGE_MASK);
	for_each_bvec(bv, bvec, bi, bi)
		flush_dcache_page(bv.bv_page);
}
#else
static inline void svc_flush_bvec(const struct bio_vec *bvec, size_t size,
				  size_t seek)
{
}
#endif

/*
 * Read from @rqstp's transport socket. The incoming message fills whole
 * pages in @rqstp's rq_pages array until the last page of the message
 * has been received into a partial page.
 */
static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
				size_t seek)
{
	struct svc_sock *svsk =
		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
	struct bio_vec *bvec = rqstp->rq_bvec;
	struct msghdr msg = { NULL };
	unsigned int i;
	ssize_t len;
	size_t t;

	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);

	for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
		bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
	rqstp->rq_respages = &rqstp->rq_pages[i];
	rqstp->rq_next_page = rqstp->rq_respages + 1;

	iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
	if (seek) {
		iov_iter_advance(&msg.msg_iter, seek);
		buflen -= seek;
	}
	len = svc_tcp_sock_recv_cmsg(svsk, &msg);
	if (len > 0)
		svc_flush_bvec(bvec, len, seek);

	/* If we read a full record, then assume there may be more
	 * data to read (stream based sockets only!)
	 */
	if (len == buflen)
		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);

	return len;
}

/*
 * Set socket snd and rcv buffer lengths
 */
static void svc_sock_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
{
	unsigned int max_mesg = svsk->sk_xprt.xpt_server->sv_max_mesg;
	struct socket *sock = svsk->sk_sock;

	nreqs = min(nreqs, INT_MAX / 2 / max_mesg);

	lock_sock(sock->sk);
	sock->sk->sk_sndbuf = nreqs * max_mesg * 2;
	sock->sk->sk_rcvbuf = nreqs * max_mesg * 2;
	sock->sk->sk_write_space(sock->sk);
	release_sock(sock->sk);
}

static void svc_sock_secure_port(struct svc_rqst *rqstp)
{
	if (svc_port_is_privileged(svc_addr(rqstp)))
		set_bit(RQ_SECURE, &rqstp->rq_flags);
	else
		clear_bit(RQ_SECURE, &rqstp->rq_flags);
}

/*
 * INET callback when data has been received on the socket.
 */
static void svc_data_ready(struct sock *sk)
{
	struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;

	trace_sk_data_ready(sk);

	if (svsk) {
		/* Refer to svc_setup_socket() for details. */
		rmb();
		svsk->sk_odata(sk);
		trace_svcsock_data_ready(&svsk->sk_xprt, 0);
		if (test_bit(XPT_HANDSHAKE, &svsk->sk_xprt.xpt_flags))
			return;
		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
			svc_xprt_enqueue(&svsk->sk_xprt);
	}
}

/*
 * INET callback when space is newly available on the socket.
 */
static void svc_write_space(struct sock *sk)
{
	struct svc_sock *svsk = (struct svc_sock *)(sk->sk_user_data);

	if (svsk) {
		/* Refer to svc_setup_socket() for details. */
		rmb();
		trace_svcsock_write_space(&svsk->sk_xprt, 0);
		svsk->sk_owspace(sk);
		svc_xprt_enqueue(&svsk->sk_xprt);
	}
}

static int svc_tcp_has_wspace(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);

	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
		return 1;
	return !test_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
}

static void svc_tcp_kill_temp_xprt(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);

	sock_no_linger(svsk->sk_sock->sk);
}

/**
 * svc_tcp_handshake_done - Handshake completion handler
 * @data: address of xprt to wake
 * @status: status of handshake
 * @peerid: serial number of key containing the remote peer's identity
 *
 * If a security policy is specified as an export option, we don't
 * have a specific export here to check. So we set a "TLS session
 * is present" flag on the xprt and let an upper layer enforce local
 * security policy.
 */
static void svc_tcp_handshake_done(void *data, int status, key_serial_t peerid)
{
	struct svc_xprt *xprt = data;
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);

	if (!status) {
		if (peerid != TLS_NO_PEERID)
			set_bit(XPT_PEER_AUTH, &xprt->xpt_flags);
		set_bit(XPT_TLS_SESSION, &xprt->xpt_flags);
	}
	clear_bit(XPT_HANDSHAKE, &xprt->xpt_flags);
	complete_all(&svsk->sk_handshake_done);
}

/**
 * svc_tcp_handshake - Perform a transport-layer security handshake
 * @xprt: connected transport endpoint
 *
 */
static void svc_tcp_handshake(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct sock *sk = svsk->sk_sock->sk;
	struct tls_handshake_args args = {
		.ta_sock = svsk->sk_sock,
		.ta_done = svc_tcp_handshake_done,
		.ta_data = xprt,
	};
	int ret;

	trace_svc_tls_upcall(xprt);

	clear_bit(XPT_TLS_SESSION, &xprt->xpt_flags);
	init_completion(&svsk->sk_handshake_done);

	ret = tls_server_hello_x509(&args, GFP_KERNEL);
	if (ret) {
		trace_svc_tls_not_started(xprt);
		goto out_failed;
	}

	ret = wait_for_completion_interruptible_timeout(&svsk->sk_handshake_done,
							SVC_HANDSHAKE_TO);
	if (ret <= 0) {
		if (tls_handshake_cancel(sk)) {
			trace_svc_tls_timed_out(xprt);
			goto out_close;
		}
	}

	if (!test_bit(XPT_TLS_SESSION, &xprt->xpt_flags)) {
		trace_svc_tls_unavailable(xprt);
		goto out_close;
	}

	/* Mark the transport ready in case the remote sent RPC
	 * traffic before the kernel received the handshake
	 * completion downcall.
	 */
	set_bit(XPT_DATA, &xprt->xpt_flags);
	svc_xprt_enqueue(xprt);
	return;

out_close:
	set_bit(XPT_CLOSE, &xprt->xpt_flags);
out_failed:
	clear_bit(XPT_HANDSHAKE, &xprt->xpt_flags);
	set_bit(XPT_DATA, &xprt->xpt_flags);
	svc_xprt_enqueue(xprt);
}

/*
 * See net/ipv4/ip_sockglue.c : ip_cmsg_recv_pktinfo
 */
static int svc_udp_get_dest_address4(struct svc_rqst *rqstp,
				     struct cmsghdr *cmh)
{
	struct in_pktinfo *pki = CMSG_DATA(cmh);
	struct sockaddr_in *daddr = svc_daddr_in(rqstp);

	if (cmh->cmsg_type != IP_PKTINFO)
		return 0;

	daddr->sin_family = AF_INET;
	daddr->sin_addr.s_addr = pki->ipi_spec_dst.s_addr;
	return 1;
}

/*
 * See net/ipv6/datagram.c : ip6_datagram_recv_ctl
 */
static int svc_udp_get_dest_address6(struct svc_rqst *rqstp,
				     struct cmsghdr *cmh)
{
	struct in6_pktinfo *pki = CMSG_DATA(cmh);
	struct sockaddr_in6 *daddr = svc_daddr_in6(rqstp);

	if (cmh->cmsg_type != IPV6_PKTINFO)
		return 0;

	daddr->sin6_family = AF_INET6;
	daddr->sin6_addr = pki->ipi6_addr;
	daddr->sin6_scope_id = pki->ipi6_ifindex;
	return 1;
}

/*
 * Copy the UDP datagram's destination address to the rqstp structure.
 * The 'destination' address in this case is the address to which the
 * peer sent the datagram, i.e. our local address. For multihomed
 * hosts, this can change from msg to msg. Note that only the IP
 * address changes, the port number should remain the same.
 */
static int svc_udp_get_dest_address(struct svc_rqst *rqstp,
				    struct cmsghdr *cmh)
{
	switch (cmh->cmsg_level) {
	case SOL_IP:
		return svc_udp_get_dest_address4(rqstp, cmh);
	case SOL_IPV6:
		return svc_udp_get_dest_address6(rqstp, cmh);
	}

	return 0;
}

/**
 * svc_udp_recvfrom - Receive a datagram from a UDP socket.
 * @rqstp: request structure into which to receive an RPC Call
 *
 * Called in a loop when XPT_DATA has been set.
 *
 * Returns:
 *   On success, the number of bytes in a received RPC Call, or
 *   %0 if a complete RPC Call message was not ready to return
 */
static int svc_udp_recvfrom(struct svc_rqst *rqstp)
{
	struct svc_sock *svsk =
		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
	struct sk_buff *skb;
	union {
		struct cmsghdr hdr;
		long all[SVC_PKTINFO_SPACE / sizeof(long)];
	} buffer;
	struct cmsghdr *cmh = &buffer.hdr;
	struct msghdr msg = {
		.msg_name = svc_addr(rqstp),
		.msg_control = cmh,
		.msg_controllen = sizeof(buffer),
		.msg_flags = MSG_DONTWAIT,
	};
	size_t len;
	int err;

	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
		/* udp sockets need large rcvbuf as all pending
		 * requests are still in that buffer. sndbuf must
		 * also be large enough that there is enough space
		 * for one reply per thread. We count all threads
		 * rather than threads in a particular pool, which
		 * provides an upper bound on the number of threads
		 * which will access the socket.
		 */
		svc_sock_setbufsize(svsk, serv->sv_nrthreads + 3);

	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
			     0, 0, MSG_PEEK | MSG_DONTWAIT);
	if (err < 0)
		goto out_recv_err;
	skb = skb_recv_udp(svsk->sk_sk, MSG_DONTWAIT, &err);
	if (!skb)
		goto out_recv_err;

	len = svc_addr_len(svc_addr(rqstp));
	rqstp->rq_addrlen = len;
	if (skb->tstamp == 0) {
		skb->tstamp = ktime_get_real();
		/* Don't enable netstamp, sunrpc doesn't
		   need that much accuracy */
	}
	sock_write_timestamp(svsk->sk_sk, skb->tstamp);
	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */

	len = skb->len;
	rqstp->rq_arg.len = len;
	trace_svcsock_udp_recv(&svsk->sk_xprt, len);

	rqstp->rq_prot = IPPROTO_UDP;

	if (!svc_udp_get_dest_address(rqstp, cmh))
		goto out_cmsg_err;
	rqstp->rq_daddrlen = svc_addr_len(svc_daddr(rqstp));

	if (skb_is_nonlinear(skb)) {
		/* we have to copy */
		local_bh_disable();
		if (csum_partial_copy_to_xdr(&rqstp->rq_arg, skb))
			goto out_bh_enable;
		local_bh_enable();
		consume_skb(skb);
	} else {
		/* we can use it in-place */
		rqstp->rq_arg.head[0].iov_base = skb->data;
		rqstp->rq_arg.head[0].iov_len = len;
		if (skb_checksum_complete(skb))
			goto out_free;
		rqstp->rq_xprt_ctxt = skb;
	}

	rqstp->rq_arg.page_base = 0;
	if (len <= rqstp->rq_arg.head[0].iov_len) {
		rqstp->rq_arg.head[0].iov_len = len;
		rqstp->rq_arg.page_len = 0;
		rqstp->rq_respages = rqstp->rq_pages+1;
	} else {
		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
		rqstp->rq_respages = rqstp->rq_pages + 1 +
			DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
	}
	rqstp->rq_next_page = rqstp->rq_respages+1;

	if (serv->sv_stats)
		serv->sv_stats->netudpcnt++;

	svc_sock_secure_port(rqstp);
	svc_xprt_received(rqstp->rq_xprt);
	return len;

out_recv_err:
	if (err != -EAGAIN) {
		/* possibly an icmp error */
		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
	}
	trace_svcsock_udp_recv_err(&svsk->sk_xprt, err);
	goto out_clear_busy;
out_cmsg_err:
	net_warn_ratelimited("svc: received unknown control message %d/%d; dropping RPC reply datagram\n",
			     cmh->cmsg_level, cmh->cmsg_type);
	goto out_free;
out_bh_enable:
	local_bh_enable();
out_free:
	kfree_skb(skb);
out_clear_busy:
	svc_xprt_received(rqstp->rq_xprt);
	return 0;
}

/**
 * svc_udp_sendto - Send out a reply on a UDP socket
 * @rqstp: completed svc_rqst
 *
 * xpt_mutex ensures @rqstp's whole message is written to the socket
 * without interruption.
 *
 * Returns the number of bytes sent, or a negative errno.
 */
static int svc_udp_sendto(struct svc_rqst *rqstp)
{
	struct svc_xprt *xprt = rqstp->rq_xprt;
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct xdr_buf *xdr = &rqstp->rq_res;
	union {
		struct cmsghdr hdr;
		long all[SVC_PKTINFO_SPACE / sizeof(long)];
	} buffer;
	struct cmsghdr *cmh = &buffer.hdr;
	struct msghdr msg = {
		.msg_name = &rqstp->rq_addr,
		.msg_namelen = rqstp->rq_addrlen,
		.msg_control = cmh,
		.msg_controllen = sizeof(buffer),
	};
	unsigned int sent;
	int err;

	svc_udp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
	rqstp->rq_xprt_ctxt = NULL;

	svc_set_cmsg_data(rqstp, cmh);

	mutex_lock(&xprt->xpt_mutex);

	if (svc_xprt_is_dead(xprt))
		goto out_notconn;

	err = xdr_alloc_bvec(xdr, GFP_KERNEL);
	if (err < 0)
		goto out_unlock;

	err = xprt_sock_sendmsg(svsk->sk_sock, &msg, xdr, 0, 0, &sent);
	if (err == -ECONNREFUSED) {
		/* ICMP error on earlier request. */
		err = xprt_sock_sendmsg(svsk->sk_sock, &msg, xdr, 0, 0, &sent);
	}
	xdr_free_bvec(xdr);
	trace_svcsock_udp_send(xprt, err);
out_unlock:
	mutex_unlock(&xprt->xpt_mutex);
	if (err < 0)
		return err;
	return sent;

out_notconn:
	mutex_unlock(&xprt->xpt_mutex);
	return -ENOTCONN;
}

static int svc_udp_has_wspace(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct svc_serv *serv = xprt->xpt_server;
	unsigned long required;

	/*
	 * Set the SOCK_NOSPACE flag before checking the available
	 * sock space.
	 */
	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
	if (required*2 > sock_wspace(svsk->sk_sk))
		return 0;
	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
	return 1;
}

static struct svc_xprt *svc_udp_accept(struct svc_xprt *xprt)
{
	BUG();
	return NULL;
}

static void svc_udp_kill_temp_xprt(struct svc_xprt *xprt)
{
}

static struct svc_xprt *svc_udp_create(struct svc_serv *serv,
				       struct net *net,
				       struct sockaddr *sa, int salen,
				       int flags)
{
	return svc_create_socket(serv, IPPROTO_UDP, net, sa, salen, flags);
}

static const struct svc_xprt_ops svc_udp_ops = {
	.xpo_create = svc_udp_create,
	.xpo_recvfrom = svc_udp_recvfrom,
	.xpo_sendto = svc_udp_sendto,
	.xpo_result_payload = svc_sock_result_payload,
	.xpo_release_ctxt = svc_udp_release_ctxt,
	.xpo_detach = svc_sock_detach,
	.xpo_free = svc_sock_free,
	.xpo_has_wspace = svc_udp_has_wspace,
	.xpo_accept = svc_udp_accept,
	.xpo_kill_temp_xprt = svc_udp_kill_temp_xprt,
};

static struct svc_xprt_class svc_udp_class = {
	.xcl_name = "udp",
	.xcl_owner = THIS_MODULE,
	.xcl_ops = &svc_udp_ops,
	.xcl_max_payload = RPCSVC_MAXPAYLOAD_UDP,
	.xcl_ident = XPRT_TRANSPORT_UDP,
};

static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
{
	svc_xprt_init(sock_net(svsk->sk_sock->sk), &svc_udp_class,
		      &svsk->sk_xprt, serv);
	clear_bit(XPT_CACHE_AUTH, &svsk->sk_xprt.xpt_flags);
	svsk->sk_sk->sk_data_ready = svc_data_ready;
	svsk->sk_sk->sk_write_space = svc_write_space;

	/* The initial setting must have enough space to
	 * receive and respond to one request.
	 * svc_udp_recvfrom will re-adjust if necessary.
	 */
	svc_sock_setbufsize(svsk, 3);

	/* data might have come in before data_ready set up */
	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
	set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);

	/* make sure we get destination address info */
	switch (svsk->sk_sk->sk_family) {
	case AF_INET:
		ip_sock_set_pktinfo(svsk->sk_sock->sk);
		break;
	case AF_INET6:
		ip6_sock_set_recvpktinfo(svsk->sk_sock->sk);
		break;
	default:
		BUG();
	}
}

/*
 * A data_ready event on a listening socket means there's a connection
 * pending. Do not use state_change as a substitute for it.
 */
static void svc_tcp_listen_data_ready(struct sock *sk)
{
	struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;

	trace_sk_data_ready(sk);

	/*
	 * This callback may be called twice when a new connection
	 * is established as a child socket inherits everything
	 * from a parent LISTEN socket.
	 * 1) data_ready method of the parent socket will be called
	 *    when one of the child sockets becomes ESTABLISHED.
	 * 2) data_ready method of the child socket may be called
	 *    when it receives data before the socket is accepted.
	 * In case of 2, we should ignore it silently and DO NOT
	 * dereference svsk.
	 */
	if (sk->sk_state != TCP_LISTEN)
		return;

	if (svsk) {
		/* Refer to svc_setup_socket() for details. */
		rmb();
		svsk->sk_odata(sk);
		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
		svc_xprt_enqueue(&svsk->sk_xprt);
	}
}

/*
 * A state change on a connected socket means it's dying or dead.
 */
static void svc_tcp_state_change(struct sock *sk)
{
	struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;

	if (svsk) {
		/* Refer to svc_setup_socket() for details. */
		rmb();
		svsk->sk_ostate(sk);
		trace_svcsock_tcp_state(&svsk->sk_xprt, svsk->sk_sock);
		if (sk->sk_state != TCP_ESTABLISHED)
			svc_xprt_deferred_close(&svsk->sk_xprt);
	}
}

/*
 * Accept a TCP connection
 */
static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct sockaddr_storage addr;
	struct sockaddr *sin = (struct sockaddr *) &addr;
	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
	struct socket *sock = svsk->sk_sock;
	struct socket *newsock;
	struct svc_sock *newsvsk;
	int err, slen;

	if (!sock)
		return NULL;

	clear_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
	err = kernel_accept(sock, &newsock, O_NONBLOCK);
	if (err < 0) {
		if (err != -EAGAIN)
			trace_svcsock_accept_err(xprt, serv->sv_name, err);
		return NULL;
	}
	if (IS_ERR(sock_alloc_file(newsock, O_NONBLOCK, NULL)))
		return NULL;

	set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);

	err = kernel_getpeername(newsock, sin);
	if (err < 0) {
		trace_svcsock_getpeername_err(xprt, serv->sv_name, err);
		goto failed; /* aborted connection or whatever */
	}
	slen = err;

	/* Reset the inherited callbacks before calling svc_setup_socket */
	newsock->sk->sk_state_change = svsk->sk_ostate;
	newsock->sk->sk_data_ready = svsk->sk_odata;
	newsock->sk->sk_write_space = svsk->sk_owspace;

	/* make sure that a write doesn't block forever when
	 * low on memory
	 */
	newsock->sk->sk_sndtimeo = HZ*30;

	newsvsk = svc_setup_socket(serv, newsock,
				   (SVC_SOCK_ANONYMOUS | SVC_SOCK_TEMPORARY));
	if (IS_ERR(newsvsk))
		goto failed;
	svc_xprt_set_remote(&newsvsk->sk_xprt, sin, slen);
	err = kernel_getsockname(newsock, sin);
	slen = err;
	if (unlikely(err < 0))
		slen = offsetof(struct sockaddr, sa_data);
	svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);

	if (sock_is_loopback(newsock->sk))
		set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
	else
		clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
	if (serv->sv_stats)
		serv->sv_stats->nettcpconn++;

	return &newsvsk->sk_xprt;

failed:
	sockfd_put(newsock);
	return NULL;
}

static size_t svc_tcp_restore_pages(struct svc_sock *svsk,
				    struct svc_rqst *rqstp)
{
	size_t len = svsk->sk_datalen;
	unsigned int i, npages;

	if (!len)
		return 0;
	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		if (rqstp->rq_pages[i] != NULL)
			put_page(rqstp->rq_pages[i]);
		BUG_ON(svsk->sk_pages[i] == NULL);
		rqstp->rq_pages[i] = svsk->sk_pages[i];
		svsk->sk_pages[i] = NULL;
	}
	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
	return len;
}

static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
{
	unsigned int i, len, npages;

	if (svsk->sk_datalen == 0)
		return;
	len = svsk->sk_datalen;
	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		svsk->sk_pages[i] = rqstp->rq_pages[i];
		rqstp->rq_pages[i] = NULL;
	}
}

static void svc_tcp_clear_pages(struct svc_sock *svsk)
{
	unsigned int i, len, npages;

	if (svsk->sk_datalen == 0)
		goto out;
	len = svsk->sk_datalen;
	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		if (svsk->sk_pages[i] == NULL) {
			WARN_ON_ONCE(1);
			continue;
		}
		put_page(svsk->sk_pages[i]);
		svsk->sk_pages[i] = NULL;
	}
out:
	svsk->sk_tcplen = 0;
	svsk->sk_datalen = 0;
}

/*
 * Receive fragment record header into sk_marker.
 */
static ssize_t svc_tcp_read_marker(struct svc_sock *svsk,
				   struct svc_rqst *rqstp)
{
	ssize_t want, len;

	/* If we haven't gotten the record length yet,
	 * get the next four bytes.
	 */
	if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
		struct msghdr msg = { NULL };
		struct kvec iov;

		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
		iov.iov_base = ((char *)&svsk->sk_marker) + svsk->sk_tcplen;
		iov.iov_len = want;
		iov_iter_kvec(&msg.msg_iter, ITER_DEST, &iov, 1, want);
		len = svc_tcp_sock_recv_cmsg(svsk, &msg);
		if (len < 0)
			return len;
		svsk->sk_tcplen += len;
		if (len < want) {
			/* call again to read the remaining bytes */
			goto err_short;
		}
		trace_svcsock_marker(&svsk->sk_xprt, svsk->sk_marker);
		if (svc_sock_reclen(svsk) + svsk->sk_datalen >
		    svsk->sk_xprt.xpt_server->sv_max_mesg)
			goto err_too_large;
	}
	return svc_sock_reclen(svsk);

err_too_large:
	net_notice_ratelimited("svc: %s %s RPC fragment too large: %d\n",
			       __func__, svsk->sk_xprt.xpt_server->sv_name,
			       svc_sock_reclen(svsk));
	svc_xprt_deferred_close(&svsk->sk_xprt);
err_short:
	return -EAGAIN;
}

static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp)
{
	struct rpc_xprt *bc_xprt = svsk->sk_xprt.xpt_bc_xprt;
	struct rpc_rqst *req = NULL;
	struct kvec *src, *dst;
	__be32 *p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
	__be32 xid;
	__be32 calldir;

	xid = *p++;
	calldir = *p;

	if (!bc_xprt)
		return -EAGAIN;
	spin_lock(&bc_xprt->queue_lock);
	req = xprt_lookup_rqst(bc_xprt, xid);
	if (!req)
		goto unlock_notfound;

	memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
	/*
	 * XXX!: cheating for now! Only copying HEAD.
	 * But we know this is good enough for now (in fact, for any
	 * callback reply in the foreseeable future).
	 */
	dst = &req->rq_private_buf.head[0];
	src = &rqstp->rq_arg.head[0];
	if (dst->iov_len < src->iov_len)
		goto unlock_eagain; /* whatever; just giving up. */
	memcpy(dst->iov_base, src->iov_base, src->iov_len);
	xprt_complete_rqst(req->rq_task, rqstp->rq_arg.len);
	rqstp->rq_arg.len = 0;
	spin_unlock(&bc_xprt->queue_lock);
	return 0;
unlock_notfound:
	printk(KERN_NOTICE
		"%s: Got unrecognized reply: "
		"calldir 0x%x xpt_bc_xprt %p xid %08x\n",
		__func__, ntohl(calldir),
		bc_xprt, ntohl(xid));
unlock_eagain:
	spin_unlock(&bc_xprt->queue_lock);
	return -EAGAIN;
}

static void svc_tcp_fragment_received(struct svc_sock *svsk)
{
	/* If we have more data, signal svc_xprt_enqueue() to try again */
	svsk->sk_tcplen = 0;
	svsk->sk_marker = xdr_zero;
}

/**
 * svc_tcp_recvfrom - Receive data from a TCP socket
 * @rqstp: request structure into which to receive an RPC Call
 *
 * Called in a loop when XPT_DATA has been set.
 *
 * Read the 4-byte stream record marker, then use the record length
 * in that marker to set up exactly the resources needed to receive
 * the next RPC message into @rqstp.
 *
 * Returns:
 *   On success, the number of bytes in a received RPC Call, or
 *   %0 if a complete RPC Call message was not ready to return
 *
 * The zero return case handles partial receives and callback Replies.
 * The state of a partial receive is preserved in the svc_sock for
 * the next call to svc_tcp_recvfrom.
 */
static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
{
	struct svc_sock *svsk =
		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
	size_t want, base;
	ssize_t len;
	__be32 *p;
	__be32 calldir;

	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
	len = svc_tcp_read_marker(svsk, rqstp);
	if (len < 0)
		goto error;

	base = svc_tcp_restore_pages(svsk, rqstp);
	want = len - (svsk->sk_tcplen - sizeof(rpc_fraghdr));
	len = svc_tcp_read_msg(rqstp, base + want, base);
	if (len >= 0) {
		trace_svcsock_tcp_recv(&svsk->sk_xprt, len);
		svsk->sk_tcplen += len;
		svsk->sk_datalen += len;
	}
	if (len != want || !svc_sock_final_rec(svsk))
		goto err_incomplete;
	if (svsk->sk_datalen < 8)
		goto err_nuts;

	rqstp->rq_arg.len = svsk->sk_datalen;
	rqstp->rq_arg.page_base = 0;
	if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
		rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
		rqstp->rq_arg.page_len = 0;
	} else
		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;

	rqstp->rq_xprt_ctxt = NULL;
	rqstp->rq_prot = IPPROTO_TCP;
	if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
		set_bit(RQ_LOCAL, &rqstp->rq_flags);
	else
		clear_bit(RQ_LOCAL, &rqstp->rq_flags);

	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
	calldir = p[1];
	if (calldir)
		len = receive_cb_reply(svsk, rqstp);

	/* Reset TCP read info */
	svsk->sk_datalen = 0;
	svc_tcp_fragment_received(svsk);

	if (len < 0)
		goto error;

	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
	if (serv->sv_stats)
		serv->sv_stats->nettcpcnt++;

	svc_sock_secure_port(rqstp);
	svc_xprt_received(rqstp->rq_xprt);
	return rqstp->rq_arg.len;

err_incomplete:
	svc_tcp_save_pages(svsk, rqstp);
	if (len < 0 && len != -EAGAIN)
		goto err_delete;
	if (len == want)
		svc_tcp_fragment_received(svsk);
	else
		trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
				svc_sock_reclen(svsk),
				svsk->sk_tcplen - sizeof(rpc_fraghdr));
	goto err_noclose;
error:
	if (len != -EAGAIN)
		goto err_delete;
	trace_svcsock_tcp_recv_eagain(&svsk->sk_xprt, 0);
	goto err_noclose;
err_nuts:
	svsk->sk_datalen = 0;
err_delete:
	trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
	svc_xprt_deferred_close(&svsk->sk_xprt);
err_noclose:
	svc_xprt_received(rqstp->rq_xprt);
	return 0; /* record not complete */
}

static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
			     int flags)
{
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | flags, };

	iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, vec, 1, vec->iov_len);
	return sock_sendmsg(sock, &msg);
}

/*
 * MSG_SPLICE_PAGES is used exclusively to reduce the number of
 * copy operations in this path. Therefore the caller must ensure
 * that the pages backing @xdr are unchanging.
 *
 * In addition, the logic assumes that .bv_len is never larger
 * than PAGE_SIZE.
 */
static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
			   rpc_fraghdr marker, unsigned int *sentp)
{
	const struct kvec *head = xdr->head;
	const struct kvec *tail = xdr->tail;
	struct kvec rm = {
		.iov_base = &marker,
		.iov_len = sizeof(marker),
	};
	struct msghdr msg = {
		.msg_flags = 0,
	};
	int ret;

	*sentp = 0;
	ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
	if (ret < 0)
		return ret;

	ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
	if (ret < 0)
		return ret;
	*sentp += ret;
	if (ret != rm.iov_len)
		return -EAGAIN;

	ret = svc_tcp_send_kvec(sock, head, 0);
	if (ret < 0)
		return ret;
	*sentp += ret;
	if (ret != head->iov_len)
		goto out;

	msg.msg_flags = MSG_SPLICE_PAGES;
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec,
		      xdr_buf_pagecount(xdr), xdr->page_len);
	ret = sock_sendmsg(sock, &msg);
	if (ret < 0)
		return ret;
	*sentp += ret;

	if (tail->iov_len) {
		ret = svc_tcp_send_kvec(sock, tail, 0);
		if (ret < 0)
			return ret;
		*sentp += ret;
	}

out:
	return 0;
}

/**
 * svc_tcp_sendto - Send out a reply on a TCP socket
 * @rqstp: completed svc_rqst
 *
 * xpt_mutex ensures @rqstp's whole message is written to the socket
 * without interruption.
 *
 * Returns the number of bytes sent, or a negative errno.
 */
static int svc_tcp_sendto(struct svc_rqst *rqstp)
{
	struct svc_xprt *xprt = rqstp->rq_xprt;
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct xdr_buf *xdr = &rqstp->rq_res;
	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
					 (u32)xdr->len);
	unsigned int sent;
	int err;

	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
	rqstp->rq_xprt_ctxt = NULL;

	atomic_inc(&svsk->sk_sendqlen);
	mutex_lock(&xprt->xpt_mutex);
	if (svc_xprt_is_dead(xprt))
		goto out_notconn;
	tcp_sock_set_cork(svsk->sk_sk, true);
	err = svc_tcp_sendmsg(svsk->sk_sock, xdr, marker, &sent);
	xdr_free_bvec(xdr);
	trace_svcsock_tcp_send(xprt, err < 0 ? (long)err : sent);
	if (err < 0 || sent != (xdr->len + sizeof(marker)))
		goto out_close;
	if (atomic_dec_and_test(&svsk->sk_sendqlen))
		tcp_sock_set_cork(svsk->sk_sk, false);
	mutex_unlock(&xprt->xpt_mutex);
	return sent;

out_notconn:
	atomic_dec(&svsk->sk_sendqlen);
	mutex_unlock(&xprt->xpt_mutex);
	return -ENOTCONN;
out_close:
	pr_notice("rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n",
		  xprt->xpt_server->sv_name,
		  (err < 0) ? "got error" : "sent",
		  (err < 0) ? err : sent, xdr->len);
	svc_xprt_deferred_close(xprt);
	atomic_dec(&svsk->sk_sendqlen);
	mutex_unlock(&xprt->xpt_mutex);
	return -EAGAIN;
}

static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
				       struct net *net,
				       struct sockaddr *sa, int salen,
				       int flags)
{
	return svc_create_socket(serv, IPPROTO_TCP, net, sa, salen, flags);
}

static const struct svc_xprt_ops svc_tcp_ops = {
	.xpo_create = svc_tcp_create,
	.xpo_recvfrom = svc_tcp_recvfrom,
	.xpo_sendto = svc_tcp_sendto,
	.xpo_result_payload = svc_sock_result_payload,
	.xpo_release_ctxt = svc_tcp_release_ctxt,
	.xpo_detach = svc_tcp_sock_detach,
	.xpo_free = svc_sock_free,
	.xpo_has_wspace = svc_tcp_has_wspace,
	.xpo_accept = svc_tcp_accept,
	.xpo_kill_temp_xprt = svc_tcp_kill_temp_xprt,
	.xpo_handshake = svc_tcp_handshake,
};

static struct svc_xprt_class svc_tcp_class = {
	.xcl_name = "tcp",
	.xcl_owner = THIS_MODULE,
	.xcl_ops = &svc_tcp_ops,
	.xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
	.xcl_ident = XPRT_TRANSPORT_TCP,
};

void svc_init_xprt_sock(void)
{
	svc_reg_xprt_class(&svc_tcp_class);
	svc_reg_xprt_class(&svc_udp_class);
}

void svc_cleanup_xprt_sock(void)
{
	svc_unreg_xprt_class(&svc_tcp_class);
	svc_unreg_xprt_class(&svc_udp_class);
}

static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
{
	struct sock *sk = svsk->sk_sk;

	svc_xprt_init(sock_net(svsk->sk_sock->sk), &svc_tcp_class,
		      &svsk->sk_xprt, serv);
	set_bit(XPT_CACHE_AUTH, &svsk->sk_xprt.xpt_flags);
	set_bit(XPT_CONG_CTRL, &svsk->sk_xprt.xpt_flags);
	if (sk->sk_state == TCP_LISTEN) {
		strcpy(svsk->sk_xprt.xpt_remotebuf, "listener");
		set_bit(XPT_LISTENER, &svsk->sk_xprt.xpt_flags);
		sk->sk_data_ready = svc_tcp_listen_data_ready;
		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
	} else {
		sk->sk_state_change = svc_tcp_state_change;
		sk->sk_data_ready = svc_data_ready;
		sk->sk_write_space = svc_write_space;

		svsk->sk_marker = xdr_zero;
		svsk->sk_tcplen = 0;
		svsk->sk_datalen = 0;
		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));

		tcp_sock_set_nodelay(sk);

		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
		switch (sk->sk_state) {
		case TCP_SYN_RECV:
		case TCP_ESTABLISHED:
			break;
		default:
			svc_xprt_deferred_close(&svsk->sk_xprt);
		}
	}
}

void svc_sock_update_bufs(struct svc_serv *serv)
{
	/*
	 * The number of server threads has changed. Update
	 * rcvbuf and sndbuf accordingly on all sockets
	 */
	struct svc_sock *svsk;

	spin_lock_bh(&serv->sv_lock);
	list_for_each_entry(svsk, &serv->sv_permsocks, sk_xprt.xpt_list)
		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
	spin_unlock_bh(&serv->sv_lock);
}
EXPORT_SYMBOL_GPL(svc_sock_update_bufs);

/*
 * Initialize socket for RPC use and create svc_sock struct
 */
static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
					 struct socket *sock,
					 int flags)
{
	struct svc_sock *svsk;
	struct sock *inet;
	int pmap_register = !(flags & SVC_SOCK_ANONYMOUS);

	svsk = kzalloc(sizeof(*svsk), GFP_KERNEL);
	if (!svsk)
		return ERR_PTR(-ENOMEM);

	inet = sock->sk;

	if (pmap_register) {
		int err;

		err = svc_register(serv, sock_net(sock->sk), inet->sk_family,
				   inet->sk_protocol,
				   ntohs(inet_sk(inet)->inet_sport));
		if (err < 0) {
			kfree(svsk);
			return ERR_PTR(err);
		}
	}

	svsk->sk_sock = sock;
	svsk->sk_sk = inet;
	svsk->sk_ostate = inet->sk_state_change;
	svsk->sk_odata = inet->sk_data_ready;
	svsk->sk_owspace = inet->sk_write_space;
	/*
	 * This barrier is necessary in order to prevent a race condition
	 * with svc_data_ready(), svc_tcp_listen_data_ready(), and others
	 * when calling the callbacks saved above.
	 */
	wmb();
	inet->sk_user_data = svsk;

	/* Initialize the socket */
	if (sock->type == SOCK_DGRAM)
		svc_udp_init(svsk, serv);
	else
		svc_tcp_init(svsk, serv);

	trace_svcsock_new(svsk, sock);
	return svsk;
}

/**
 * svc_addsock - add a listener socket to an RPC service
 * @serv: pointer to RPC service to which to add a new listener
 * @net: caller's network namespace
 * @fd: file descriptor of the new listener
 * @name_return: pointer to buffer to fill in with name of listener
 * @len: size of the buffer
 * @cred: credential
 *
 * Fills in socket name and returns positive length of name if successful.
 * Name is terminated with '\n'. On error, returns a negative errno
 * value.
 */
int svc_addsock(struct svc_serv *serv, struct net *net, const int fd,
		char *name_return, const size_t len, const struct cred *cred)
{
	int err = 0;
	struct socket *so = sockfd_lookup(fd, &err);
	struct svc_sock *svsk = NULL;
	struct sockaddr_storage addr;
	struct sockaddr *sin = (struct sockaddr *)&addr;
	int salen;

	if (!so)
		return err;
	err = -EINVAL;
	if (sock_net(so->sk) != net)
		goto out;
	err = -EAFNOSUPPORT;
	if ((so->sk->sk_family != PF_INET) && (so->sk->sk_family != PF_INET6))
		goto out;
	err = -EPROTONOSUPPORT;
	if (so->sk->sk_protocol != IPPROTO_TCP &&
	    so->sk->sk_protocol != IPPROTO_UDP)
		goto out;
	err = -EISCONN;
	if (so->state > SS_UNCONNECTED)
		goto out;
	err = -ENOENT;
	if (!try_module_get(THIS_MODULE))
		goto out;
	svsk = svc_setup_socket(serv, so, SVC_SOCK_DEFAULTS);
	if (IS_ERR(svsk)) {
		module_put(THIS_MODULE);
		err = PTR_ERR(svsk);
		goto out;
	}
	salen = kernel_getsockname(svsk->sk_sock, sin);
	if (salen >= 0)
		svc_xprt_set_local(&svsk->sk_xprt, sin, salen);
	svsk->sk_xprt.xpt_cred = get_cred(cred);
	svc_add_new_perm_xprt(serv, &svsk->sk_xprt);
	return svc_one_sock_name(svsk, name_return, len);
out:
	sockfd_put(so);
	return err;
}
EXPORT_SYMBOL_GPL(svc_addsock);

/*
 * Create socket for RPC service.
 */
static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
					  int protocol,
					  struct net *net,
					  struct sockaddr *sin, int len,
					  int flags)
{
	struct svc_sock *svsk;
	struct socket *sock;
	int error;
	int type;
	struct sockaddr_storage addr;
	struct sockaddr *newsin = (struct sockaddr *)&addr;
	int newlen;
	int family;

	if (protocol != IPPROTO_UDP && protocol != IPPROTO_TCP) {
		printk(KERN_WARNING "svc: only UDP and TCP "
				"sockets supported\n");
		return ERR_PTR(-EINVAL);
	}

	type = (protocol == IPPROTO_UDP)? SOCK_DGRAM : SOCK_STREAM;
	switch (sin->sa_family) {
	case AF_INET6:
		family = PF_INET6;
		break;
	case AF_INET:
		family = PF_INET;
		break;
	default:
		return ERR_PTR(-EINVAL);
	}

	error = __sock_create(net, family, type, protocol, &sock, 1);
	if (error < 0)
		return ERR_PTR(error);

	svc_reclassify_socket(sock);

	/*
	 * If this is a PF_INET6 listener, we want to avoid
	 * getting requests from IPv4 remotes. Those should
	 * be shunted to a PF_INET listener via rpcbind.
	 */
	if (family == PF_INET6)
		ip6_sock_set_v6only(sock->sk);
	if (type == SOCK_STREAM)
		sock->sk->sk_reuse = SK_CAN_REUSE; /* allow address reuse */
	error = kernel_bind(sock, sin, len);
	if (error < 0)
		goto bummer;

	error = kernel_getsockname(sock, newsin);
	if (error < 0)
		goto bummer;
	newlen = error;

	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock, 64)) < 0)
			goto bummer;
	}

	svsk = svc_setup_socket(serv, sock, flags);
	if (IS_ERR(svsk)) {
		error = PTR_ERR(svsk);
		goto bummer;
	}
	svc_xprt_set_local(&svsk->sk_xprt, newsin, newlen);
	return (struct svc_xprt *)svsk;
bummer:
	sock_release(sock);
	return ERR_PTR(error);
}

/*
 * Detach the svc_sock from the socket so that no
 * more callbacks occur.
 */
static void svc_sock_detach(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct sock *sk = svsk->sk_sk;

	/* put back the old socket callbacks */
	lock_sock(sk);
	sk->sk_state_change = svsk->sk_ostate;
	sk->sk_data_ready = svsk->sk_odata;
	sk->sk_write_space = svsk->sk_owspace;
	sk->sk_user_data = NULL;
	release_sock(sk);
}

/*
 * Disconnect the socket, and reset the callbacks
 */
static void svc_tcp_sock_detach(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);

	svc_sock_detach(xprt);

	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
		svc_tcp_clear_pages(svsk);
		kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR);
	}
}

/*
 * Free the svc_sock's socket resources and the svc_sock itself.
 */
static void svc_sock_free(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
	struct socket *sock = svsk->sk_sock;

	trace_svcsock_free(svsk, sock);

	tls_handshake_cancel(sock->sk);
	if (sock->file)
		sockfd_put(sock);
	else
		sock_release(sock);
	kfree(svsk);
}