[MAJOR] replaced the client-side HTTP parser with a new one

The new parser uses a finite-state machine (FSM) to follow RFC2616 strictly. Headers are indexed and parsed only once all of them are available, so complex regexes make more sense. HTTP processing is now performed in several phases by calling multiple functions, which makes the code cleaner and easier to read.

Note that req[i]pass does not work anymore, because it would require marking a header as ignored. What is really needed is the ability to add an exception to a match (match xx except yy).

Several bugs were fixed in appsession during the conversion to the new FSM (method length and recovery on malloc errors).

The code builds and works with the debug examples, but it cannot be used to connect to anything yet because it does not forward requests.
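
As a rough illustration of the FSM idea described in this message, here is a minimal sketch. It is not the code from this commit: the `sk_` names and states are hypothetical, CR is simply skipped wherever it appears, and no character validation is done. It only shows how a byte-by-byte state machine can ignore leading empty lines, accept a bare LF as a line terminator, fold continuation lines, and detect the end of the headers.

```c
#include <stddef.h>

/* Hypothetical states for this sketch; they are not the HTTP_PA_* values. */
enum sk_state {
    SK_EMPTY,     /* before the start line, empty lines are ignored */
    SK_START,     /* inside the start line */
    SK_START_LF,  /* just after the LF ending the start line */
    SK_HEADER,    /* inside a header line */
    SK_HDR_LF,    /* just after the LF ending a header line */
    SK_DONE,      /* blank line seen: end of headers */
    SK_ERROR      /* syntax error */
};

struct sk_parser {
    enum sk_state state;
    int headers;  /* logical header lines seen (folded lines count once) */
};

static void sk_init(struct sk_parser *p)
{
    p->state = SK_EMPTY;
    p->headers = 0;
}

/* Feed bytes to the parser. Returns the number of bytes consumed, which
 * stops right after the blank line that ends the headers. */
static size_t sk_feed(struct sk_parser *p, const char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len && p->state != SK_DONE && p->state != SK_ERROR; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c == '\r')
            continue;  /* simplification: every CR is skipped */

        switch (p->state) {
        case SK_EMPTY:
            if (c != '\n')
                p->state = SK_START;
            break;
        case SK_START:
            if (c == '\n')
                p->state = SK_START_LF;
            break;
        case SK_START_LF:
            if (c == '\n')
                p->state = SK_DONE;
            else if (c == ' ' || c == '\t')
                p->state = SK_ERROR;  /* nothing to continue yet */
            else {
                p->state = SK_HEADER;
                p->headers++;
            }
            break;
        case SK_HEADER:
            if (c == '\n')
                p->state = SK_HDR_LF;
            break;
        case SK_HDR_LF:
            if (c == '\n')
                p->state = SK_DONE;
            else if (c == ' ' || c == '\t')
                p->state = SK_HEADER;  /* LWS: continuation of the header */
            else {
                p->state = SK_HEADER;
                p->headers++;
            }
            break;
        default:
            break;
        }
    }
    return i;
}
```

Because the state is kept in the structure, `sk_feed()` can be called again with the next chunk of data when a request arrives in several reads.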
2006-12-04 02:26:12 +01:00
commit 58f10d7478
parent b7eba10304
7 changed files with 1401 additions and 621 deletions
--- a/2
+++ b/2
@@ -166,3 +166,5 @@ TODO for 1.3
  - check all copyrights
  - fix Makefile.bsd
  - separate inline functions to put them in files covered by GPL
+  - implement HTTP status 414 - request URI too long
+
--- a/doc/http-parsing.txt
+++ b/doc/http-parsing.txt
@@ -0,0 +1,214 @@
+--- Relevant portions of RFC2616 ---
+
+OCTET               = <any 8-bit sequence of data>
+CHAR                = <any US-ASCII character (octets 0 - 127)>
+UPALPHA             = <any US-ASCII uppercase letter "A".."Z">
+LOALPHA             = <any US-ASCII lowercase letter "a".."z">
+ALPHA               = UPALPHA | LOALPHA
+DIGIT               = <any US-ASCII digit "0".."9">
+CTL                 = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
+CR                  = <US-ASCII CR, carriage return (13)>
+LF                  = <US-ASCII LF, linefeed (10)>
+SP                  = <US-ASCII SP, space (32)>
+HT                  = <US-ASCII HT, horizontal-tab (9)>
+<">                 = <US-ASCII double-quote mark (34)>
+CRLF                = CR LF
+LWS                 = [CRLF] 1*( SP | HT )
+TEXT                = <any OCTET except CTLs, but including LWS>
+HEX                 = "A" | "B" | "C" | "D" | "E" | "F"
+                      | "a" | "b" | "c" | "d" | "e" | "f" | DIGIT
+separators          = "(" | ")" | "<" | ">" | "@"
+                    | "," | ";" | ":" | "\" | <">
+                    | "/" | "[" | "]" | "?" | "="
+                    | "{" | "}" | SP | HT
+token               = 1*<any CHAR except CTLs or separators>
+
+quoted-pair         = "\" CHAR
+ctext               = <any TEXT excluding "(" and ")">
+qdtext              = <any TEXT except <">>
+quoted-string       = ( <"> *(qdtext | quoted-pair ) <"> )
+comment             = "(" *( ctext | quoted-pair | comment ) ")"
+
+
+
+
+
+4 HTTP Message
+4.1 Message Types
+
+HTTP messages consist of requests from client to server and responses from
+server to client. Request (section 5) and Response (section 6) messages use the
+generic message format of RFC 822 [9] for transferring entities (the payload of
+the message). Both types of message consist of :
+
+  - a start-line
+  - zero or more header fields (also known as "headers")
+  - an empty line (i.e., a line with nothing preceding the CRLF) indicating the
+    end of the header fields
+  - and possibly a message-body.
+
+
+HTTP-message        = Request | Response
+
+start-line          = Request-Line | Status-Line
+generic-message     = start-line
+                      *(message-header CRLF)
+                      CRLF
+                      [ message-body ]
+
+In the interest of robustness, servers SHOULD ignore any empty line(s) received
+where a Request-Line is expected. In other words, if the server is reading the
+protocol stream at the beginning of a message and receives a CRLF first, it
+should ignore the CRLF.
+
+
+4.2 Message headers
+
+- Each header field consists of a name followed by a colon (":") and the field
+  value.
+- Field names are case-insensitive.
+- The field value MAY be preceded by any amount of LWS, though a single SP is
+  preferred.
+- Header fields can be extended over multiple lines by preceding each extra
+  line with at least one SP or HT.
+
+
+message-header      = field-name ":" [ field-value ]
+field-name          = token
+field-value         = *( field-content | LWS )
+field-content       = <the OCTETs making up the field-value and consisting of
+                       either *TEXT or combinations of token, separators, and
+                       quoted-string>
+
+
+The field-content does not include any leading or trailing LWS occurring before
+the first non-whitespace character of the field-value or after the last
+non-whitespace character of the field-value. Such leading or trailing LWS MAY
+be removed without changing the semantics of the field value. Any LWS that
+occurs between field-content MAY be replaced with a single SP before
+interpreting the field value or forwarding the message downstream.
+
+
+=> header format = 1*(CHAR & !ctl & !sep) ":" *(OCTET & (!ctl | LWS))
+=> header-matching regexes apply to field-content, and may use field-value
+   as a working area (but preferably after the first SP).
+
+(19.3) The line terminator for message-header fields is the sequence CRLF.
+However, we recommend that applications, when parsing such headers, recognize
+a single LF as a line terminator and ignore the leading CR.
+
+
+
+
+
+message-body    = entity-body
+                | <entity-body encoded as per Transfer-Encoding>
+
+
+
+5 Request
+
+Request         = Request-Line
+                  *(( general-header
+                    | request-header
+                    | entity-header ) CRLF)
+                  CRLF
+                  [ message-body ]
+
+
+
+5.1 Request line
+
+The elements are separated by SP characters. No CR or LF is allowed except in
+the final CRLF sequence.
+
+Request-Line = Method SP Request-URI SP HTTP-Version CRLF
+
+(19.3) Clients SHOULD be tolerant in parsing the Status-Line and servers
+tolerant when parsing the Request-Line. In particular, they SHOULD accept any
+amount of SP or HT characters between fields, even though only a single SP is
+required.
+
+4.5 General headers
+Apply to MESSAGE.
+
+general-header  = Cache-Control
+                | Connection
+                | Date
+                | Pragma
+                | Trailer
+                | Transfer-Encoding
+                | Upgrade
+                | Via
+                | Warning
+
+General-header field names can be extended reliably only in combination with a
+change in the protocol version. However, new or experimental header fields may
+be given the semantics of general header fields if all parties in the
+communication recognize them to be general-header fields. Unrecognized header
+fields are treated as entity-header fields.
+
+
+
+
+5.3 Request Header Fields
+
+The request-header fields allow the client to pass additional information about
+the request, and about the client itself, to the server. These fields act as
+request modifiers, with semantics equivalent to the parameters on a programming
+language method invocation.
+
+request-header  = Accept
+                | Accept-Charset
+                | Accept-Encoding
+                | Accept-Language
+                | Authorization
+                | Expect
+                | From
+                | Host
+                | If-Match
+                | If-Modified-Since
+                | If-None-Match
+                | If-Range
+                | If-Unmodified-Since
+                | Max-Forwards
+                | Proxy-Authorization
+                | Range
+                | Referer
+                | TE
+                | User-Agent
+
+Request-header field names can be extended reliably only in combination with a
+change in the protocol version. However, new or experimental header fields MAY
+be given the semantics of request-header fields if all parties in the
+communication recognize them to be request-header fields. Unrecognized header
+fields are treated as entity-header fields.
+
+
+
+7.1 Entity header fields
+
+Entity-header fields define metainformation about the entity-body or, if no
+body is present, about the resource identified by the request. Some of this
+metainformation is OPTIONAL; some might be REQUIRED by portions of this
+specification.
+
+entity-header   = Allow
+                | Content-Encoding
+                | Content-Language
+                | Content-Length
+                | Content-Location
+                | Content-MD5
+                | Content-Range
+                | Content-Type
+                | Expires
+                | Last-Modified
+                | extension-header
+extension-header = message-header
+
+The extension-header mechanism allows additional entity-header fields to be
+defined without changing the protocol, but these fields cannot be assumed to be
+recognizable by the recipient. Unrecognized header fields SHOULD be ignored by
+the recipient and MUST be forwarded by transparent proxies.
+
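
The grammar quoted in doc/http-parsing.txt maps fairly directly onto small character-class checks. The following is a minimal sketch under stated assumptions, not code from this commit: all `sk_` names are hypothetical, and the input is assumed to be a single unfolded header line without its trailing CRLF. It validates the field-name as a token and strips leading and trailing SP/HT around the value, which the RFC text above says may be removed without changing the semantics.

```c
#include <stddef.h>
#include <string.h>

/* CTL = octets 0 - 31 and DEL (127), as in the RFC2616 excerpt. */
static int sk_is_ctl(unsigned char c)
{
    return c < 32 || c == 127;
}

/* separators, as listed in the RFC2616 excerpt (includes SP and HT). */
static int sk_is_separator(unsigned char c)
{
    return c != '\0' && strchr("()<>@,;:\\\"/[]?={} \t", c) != NULL;
}

/* token char = any CHAR (0 - 127) except CTLs or separators. */
static int sk_is_token_char(unsigned char c)
{
    return c < 128 && !sk_is_ctl(c) && !sk_is_separator(c);
}

/* Hypothetical result type: pointers into the caller's line buffer. */
struct sk_field {
    const char *name;
    size_t name_len;
    const char *value;   /* value with surrounding SP/HT stripped */
    size_t value_len;
};

/* Split "field-name ':' [ field-value ]".
 * Returns 0 on success, -1 if the line does not start with a token
 * immediately followed by a colon. */
static int sk_split_header(const char *line, size_t len, struct sk_field *out)
{
    size_t i = 0, start, end;

    while (i < len && sk_is_token_char((unsigned char)line[i]))
        i++;
    if (i == 0 || i == len || line[i] != ':')
        return -1;

    out->name = line;
    out->name_len = i;

    start = i + 1;
    end = len;
    while (start < end && (line[start] == ' ' || line[start] == '\t'))
        start++;
    while (end > start && (line[end - 1] == ' ' || line[end - 1] == '\t'))
        end--;

    out->value = line + start;
    out->value_len = end - start;
    return 0;
}
```

Returning pointers into the original buffer rather than copies keeps the sketch allocation-free; whether the real parser does the same is not shown in this part of the diff.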
--- a/include/proto/proto_http.h
+++ b/include/proto/proto_http.h
@@ -27,6 +27,13 @@
 #include <types/session.h>
 #include <types/task.h>

+/*
+ * some macros used for the request parsing.
+ * from RFC2616:
+ *   CTL                 = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
+ */
+static inline int IS_CTL(const unsigned char x) { return (x < 32)||(x == 127);}
+

 int event_accept(int fd);
 int process_session(struct task *t);
@@ -39,6 +46,10 @@ void srv_close_with_err(struct session *t, int err, int finst,
 			int status, int msglen, const char *msg);

 int produce_content(struct session *s);
+void debug_hdr(const char *dir, struct session *t, const char *start, const char *end);
+void get_srv_from_appsession(struct session *t, const char *begin, const char *end);
+void apply_filters_to_session(struct session *t, struct buffer *req, struct hdr_exp *exp);
+void manage_client_side_cookies(struct session *t, struct buffer *req);

 #endif /* _PROTO_PROTO_HTTP_H */

--- a/include/types/proto_http.h
+++ b/include/types/proto_http.h
@@ -51,6 +51,21 @@
 #define SV_STCLOSE	6


+/* Possible states while parsing HTTP messages (request|response) */
+#define HTTP_PA_EMPTY      0    /* leading LF, before start line */
+#define HTTP_PA_START      1    /* inside start line */
+#define HTTP_PA_STRT_LF    2    /* LF after start line */
+#define HTTP_PA_HEADER     3    /* inside a header */
+#define HTTP_PA_HDR_LF     4    /* LF after a header */
+#define HTTP_PA_HDR_LWS    5    /* LWS after a header */
+#define HTTP_PA_LFLF       6    /* after double LF/CRLF at the end of headers */
+#define HTTP_PA_ERROR      7    /* syntax error in the message */
+#define HTTP_PA_CR_SKIP 0x10    /* ORed with other values when a CR was skipped */
+#define HTTP_PA_LF_EXP  0x20    /* ORed with other values when a CR is seen and
+				 * an LF is expected before entering the
+				 * designated state. */
+
+
 #endif /* _TYPES_PROTO_HTTP_H */

 /*
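
The HTTP_PA_CR_SKIP and HTTP_PA_LF_EXP values are documented above as being ORed with the base states. One way such a combined value could be taken apart is sketched below. The constants are copied from the hunk above, but `SK_STATE_MASK` and the two helper functions are assumptions made for illustration and do not appear in the commit.

```c
/* Values copied from the hunk above. */
#define HTTP_PA_HEADER     3
#define HTTP_PA_CR_SKIP 0x10
#define HTTP_PA_LF_EXP  0x20

/* Hypothetical mask: the base states 0..7 fit in the low four bits. */
#define SK_STATE_MASK   0x0F

/* Return the base parsing state with any ORed flags removed. */
static int sk_base_state(int hdr_state)
{
    return hdr_state & SK_STATE_MASK;
}

/* Return non-zero if the given flag is set in the combined value. */
static int sk_has_flag(int hdr_state, int flag)
{
    return (hdr_state & flag) != 0;
}
```

For instance, `HTTP_PA_HEADER | HTTP_PA_LF_EXP` would decode to base state HTTP_PA_HEADER with the LF_EXP flag set and the CR_SKIP flag clear.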
--- a/include/types/session.h
+++ b/include/types/session.h
@@ -122,6 +122,7 @@ struct session {
 	char **req_cap;				/* array of captured request headers (may be NULL) */
 	char **rsp_cap;				/* array of captured response headers (may be NULL) */
 	struct hdr_idx hdr_idx;                 /* array of header indexes (max: MAX_HTTP_HDR) */
+	int hdr_state;                          /* where we are in the current header parsing */
 	struct chunk req_line;			/* points to first line */
 	struct chunk auth_hdr;			/* points to 'Authorization:' header */
 	struct {
--- a/src/client.c
+++ b/src/client.c
@@ -161,6 +161,7 @@ int event_accept(int fd) {

 		s->cli_state = (p->mode == PR_MODE_HTTP) ?  CL_STHEADERS : CL_STDATA; /* no HTTP headers for non-HTTP proxies */
 		s->srv_state = SV_STIDLE;
+		s->hdr_state = HTTP_PA_EMPTY; /* at the very beginning of the request */
 		s->req = s->rep = NULL; /* will be allocated later */

 		s->cli_fd = cfd;
--- a/src/proto_http.c
+++ b/src/proto_http.c