libxml2

mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2024-10-26 12:25:09 +03:00

Author	SHA1	Message	Date
Nick Wellnhofer	c93679381c	html: Fix check for end of comment in push parser Make sure to reset checkIndex. Handle case where "--" or "--!" is at the end of the buffer. Fix "avail" check in htmlParseOrTryFinish.	2022-11-20 21:27:59 +01:00
Nick Wellnhofer	68a6518c45	parser: Rewrite push parser boundary checks Remove inaccurate xmlParseCheckTransition check. Remove non-incremental xmlParseGetLasts check. Add functions that check for several boundary constructs more accurately, keeping track of progress in ctxt->checkIndex. Fixes #439.	2022-11-20 21:27:08 +01:00
Nick Wellnhofer	6843fc726f	Remove or annotate char casts	2022-09-01 04:31:30 +02:00
Nick Wellnhofer	2cac626976	Don't use sizeof(xmlChar) or sizeof(char)	2022-09-01 03:35:19 +02:00
Nick Wellnhofer	ad338ca737	Remove explicit integer casts Remove explicit integer casts as final operation: in assignments; when passing arguments; when returning values. Remove casts: to the same type; from certain range-bound values. The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly. Document some explicit casts as truncating and add a few missing ones.	2022-09-01 02:33:57 +02:00
Nick Wellnhofer	65dc8a63ac	Make xmlNewSAXParserCtxt take a const sax handler Also improve documentation.	2022-09-01 00:17:45 +02:00
Nick Wellnhofer	0f568c0b73	Consolidate private header files Private functions were previously declared: in header files in the root directory; in public headers guarded with IN_LIBXML; in libxml.h; redundantly in source files that used them. Consolidate all private header files in include/private.	2022-08-26 02:11:56 +02:00
Nick Wellnhofer	58fc89e8a9	Deprecate internal parser functions	2022-08-25 21:04:57 +02:00
Nick Wellnhofer	a308c0cdf7	Deprecate old HTML SAX API	2022-08-25 21:04:57 +02:00
Nick Wellnhofer	9a82b94a94	Introduce xmlNewSAXParserCtxt and htmlNewSAXParserCtxt Add API functions to create a parser context with a custom SAX handler without having to mess with ctxt->sax manually.	2022-08-24 14:07:55 +02:00
Nick Wellnhofer	0a04db19fc	Don't mess with parser options in htmlParseDocument Don't set ctxt->html. This member should already be initialized. Set ctxt->linenumbers in htmlCtxtUseOptions like the XML parser does.	2022-08-24 14:06:00 +02:00
Nick Wellnhofer	d45263a262	Remove useless call to htmlDefaultSAXHandlerInit This function is already called from xmlInitParser.	2022-08-24 14:04:35 +02:00
Nick Wellnhofer	4b184240be	Remove htmlDefaultSAXHandler from non-SAX1 build This matches long-standing behavior of the XML counterpart.	2022-08-22 14:24:25 +02:00
Nick Wellnhofer	80bd34c3c6	Don't initialize SAX handler in htmlReadMemory The SAX handler is already initialized when creating the parser context.	2022-08-22 14:06:37 +02:00
Nick Wellnhofer	37cedc0b15	Fix htmlReadMemory mixing up XML and HTML functions Also see `fe6890e2`.	2022-08-22 14:04:07 +02:00
Nick Wellnhofer	920753c4aa	Don't use default SAX handler to report unrelated errors	2022-08-22 13:48:59 +02:00
Nick Wellnhofer	38f04779f7	Fix HTML parser with threads and --without-legacy If the legacy functions are disabled, the default "V1" HTML SAX handler isn't initialized in threads other than the main thread. htmlInitParserCtxt would later use the empty V1 SAX handler, resulting in NULL documents. Change htmlInitParserCtxt to initialize the HTML SAX handler by calling xmlSAX2InitHtmlDefaultSAXHandler. This removes the ability to change the default handler but is more in line with the XML parser which initializes the SAX handler by calling xmlSAXVersion, ignoring the V1 default handler. Fixes #399.	2022-08-22 13:48:59 +02:00
Nick Wellnhofer	5b2d07a726	Use xmlStrlen in *CtxtReadDoc xmlStrlen handles buffers larger than INT_MAX more gracefully.	2022-08-20 17:00:50 +02:00
Nick Wellnhofer	4ad71c2d72	Fix xmlCtxtReadDoc with encoding xmlCtxtReadDoc used to create an input stream involving xmlNewStringInputStream. This would create a stream without an input buffer, causing problems with encodings (see #34). After commit `aab584dc3`, an error was returned even with UTF-8 encodings which happened to work before. Make xmlCtxtReadDoc call xmlCtxtReadMemory which doesn't suffer from these issues. Also fix htmlCtxtReadDoc. Fixes #397.	2022-08-20 16:34:08 +02:00
Nick Wellnhofer	e986d09cf5	Skip incorrectly opened HTML comments Commit `4fd69f3e` fixed handling of '<' characters not followed by an ASCII letter. But a '<!' sequence followed by invalid characters should be treated as bogus comment and skipped. Fixes #380.	2022-08-02 14:38:09 +02:00
Nick Wellnhofer	6722d22c88	Reduce indentation in HTMLparser.c No functional change.	2022-08-02 14:38:09 +02:00
Nick Wellnhofer	a82ea25fc8	Also reset nsNr in htmlCtxtReset	2022-07-28 21:36:10 +02:00
David Kilzer	44e9118c02	Prevent integer-overflow in htmlSkipBlankChars() and xmlSkipBlankChars() HTMLparser.c (htmlSkipBlankChars), parser.c (xmlSkipBlankChars): Cap the return value at INT_MAX. The commit range that OSS-Fuzz listed for the fix didn't make any changes to xmlSkipBlankChars(), so it seems like this issue may still exist. Found by OSS-Fuzz Issue 44803.	2022-04-11 18:09:37 +00:00
Nick Wellnhofer	40483d0ce2	Deprecate module init and cleanup functions These functions shouldn't be part of the public API. Most init functions are only thread-safe when called from xmlInitParser. Global variables should only be cleaned up by calling xmlCleanupParser.	2022-03-06 15:59:43 +01:00
Nick Wellnhofer	ebb1797030	Remove unneeded #includes	2022-03-04 22:11:49 +01:00
Mike Dalessio	d7b287b94c	htmlParseComment: handle abruptly-closed comments See guidance provided on abruptly-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-abrupt-closing-of-empty-comment	2022-03-02 14:42:47 +00:00
Nick Wellnhofer	776d15d383	Don't check for standard C89 headers Don't check for: ctype.h, errno.h, float.h, limits.h, math.h, signal.h, stdarg.h, stdlib.h, string.h, time.h. Stop including non-standard headers: malloc.h, strings.h.	2022-03-02 00:43:54 +01:00
Nick Wellnhofer	4fd69f3e27	Fix recovery from invalid HTML start tags Only try to parse a start tag if there's a '<' followed by an ASCII letter. This is more in line with HTML5 and the old behavior in recovery mode. Emit a literal '<' if the following character is invalid. Fixes #101. Fixes #339.	2022-02-22 18:41:00 +01:00
Nick Wellnhofer	346c3a930c	Remove elfgcchack.h The same optimization can be enabled with -fno-semantic-interposition since GCC 5. clang has always used this option by default.	2022-02-20 21:49:04 +01:00
Nick Wellnhofer	d7cb33cf44	Rework validation context flags Use a bitmask instead of magic values to keep track of whether the validation context is part of a parser context and whether xmlValidateDtdFinal was called. This allows adding additional flags later. Note that this deliberately changes the name of a public struct member, assuming that this was always private data never to be used by client code.	2022-02-20 21:49:04 +01:00
Nick Wellnhofer	96dc7f4ae6	Also register HTML document nodes Fixes #196.	2022-02-01 16:38:29 +01:00
Finn Barber	fe6890e292	Fix htmlReadFd, which was using a mix of xml and html context functions	2022-01-16 15:31:54 +01:00
David King	e7d1c53a49	Fix memory leak in xmlFreeParserInputBuffer Found by Coverity. https://bugzilla.redhat.com/show_bug.cgi?id=1938806	2022-01-16 14:10:34 +01:00
Nick Wellnhofer	798bdf13f6	Different approach to fix quadratic behavior in HTML push parser The old approach introduced a regression, see issue #312 and the previous commit. Disable code that tries to recover from invalid start tags. This only affects "recovery" mode. Add a comment outlining a better fix in accordance with the HTML5 spec.	2022-01-10 14:50:20 +01:00
Nick Wellnhofer	094fc08a09	Fix regression when parsing invalid HTML tags in push mode Revert part of commit `173a0830` that changed behavior when parsing malformed start tags with the push parser. This reintroduces quadratic behavior in recovery mode which will be worked around in the next commit. Fixes #312.	2022-01-10 14:49:00 +01:00
Nick Wellnhofer	2732b23466	Fix regression parsing public IDs literals in HTML Fix regression introduced when reworking htmlParsePubidLiteral in commit `93ce33c2`. Fixes #318.	2022-01-10 13:37:59 +01:00
Nick Wellnhofer	7279d23636	Fix htmlTagLookup Fix regression introduced with `b25acce8`. Some users like libxslt may call the HTML output functions on documents with uppercase tag names, so we must keep case-insensitive string comparison. Fixes #248.	2021-05-06 10:54:29 +02:00
Nick Wellnhofer	683de7efe4	Fix duplicate xmlStrEqual calls in htmlParseEndTag	2021-03-04 19:22:35 +01:00
Nick Wellnhofer	8095365b77	Speed up htmlCheckAutoClose Switch to binary search.	2021-03-04 19:22:35 +01:00
Nick Wellnhofer	b25acce858	Speed up htmlTagLookup Switch to binary search. This is the first time bsearch is used in the libxml2 code base. But it's a standard library function since C89 and should be portable.	2021-03-04 17:44:45 +01:00
Nick Wellnhofer	0fb3ae5840	Revert "Improve HTML fuzzer stability" This reverts commit `de1b51eddc`.	2021-02-22 17:31:05 +01:00
Nick Wellnhofer	de1b51eddc	Improve HTML fuzzer stability Call htmlInitAutoClose during fuzzer initialization to fix stability issue. Leave a note concerning problems with this function.	2021-02-22 13:21:38 +01:00
Nick Wellnhofer	dcb80b92da	Fix slow parsing of HTML with encoding errors Under certain circumstances, the HTML parser would try to guess and switch input encodings multiple times, leading to slow processing of documents with encoding errors. The repeated scanning of the input buffer when guessing encodings could even lead to quadratic behavior. The code htmlCurrentChar probably assumed that if there's an encoding handler, it is guaranteed to produce valid UTF-8. This holds true in general, but if the detected encoding was "UTF-8", the UTF8ToUTF8 encoding handler simply invoked memcpy without checking for invalid UTF-8. This still must be fixed, preferably by not using this handler at all. Also leave a note that switching encodings twice seems impossible to implement correctly. Add a check when handling UTF-8 encoding errors in htmlCurrentChar to avoid this situation, even if encoders produce invalid UTF-8. Found by OSS-Fuzz.	2021-02-20 21:28:56 +01:00
Nick Wellnhofer	954696e7cf	Fix infinite loop in HTML parser introduced with recent commits Check for XML_PARSER_EOF to avoid an infinite loop introduced with recent changes to the HTML push parser. Found by OSS-Fuzz.	2021-02-07 14:38:55 +01:00
Mike Dalessio	a67b63d183	use new htmlParseLookupCommentEnd to find comment ends Note that the caret in error messages generated during comment parsing may have moved by one byte. See guidance provided on incorrectly-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment	2020-12-16 16:12:07 +01:00
Mike Dalessio	29f5d20e84	htmlParseComment: treat `--!>` as if it closed the comment See guidance provided on incorrectly-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment	2020-12-16 16:12:07 +01:00
Nick Wellnhofer	94c2e415a9	Fix quadratic runtime in HTML push parser with null bytes Null bytes in the input stream do not necessarily signal an EOF condition. Check the stream pointers for EOF to avoid quadratic rescanning of input data. Note that the CUR_CHAR macro used in functions like htmlParseCharData calls htmlCurrentChar which translates null bytes. Found by OSS-Fuzz.	2020-12-06 16:44:11 +01:00
Nick Wellnhofer	438e595a8c	Stop counting nbChars in parser context The value was inaccurate and never used.	2020-08-09 15:01:45 +02:00
Nick Wellnhofer	f6a9541fb8	Remove unneeded progress checks in HTML parser The HTML parser should now be guaranteed to make progress, so the checks became unnecessary.	2020-08-09 14:54:37 +02:00
Nick Wellnhofer	93ce33c2b8	Fix several quadratic runtime issues in HTML push parser Fix a few remaining cases where the HTML push parser would scan more content during lookahead than being parsed later. Make sure that htmlParseDocTypeDecl consumes all content up to the final '>' in case of errors. The old comment said "We shouldn't try to resynchronize", but ignoring invalid content is also what the HTML5 spec mandates. Likewise, make htmlParseEndTag skip to the final '>' in invalid end tags even if not in recovery mode. This is probably the most visible change in practice and leads to different output for some tests but is also more in line with HTML5. Make sure that htmlParsePI and htmlParseComment don't abort if invalid characters are encountered but log an error and ignore the character. Change some other end-of-buffer checks to test for a zero byte instead of relying on IS_CHAR. Fix usage of IS_CHAR macro in htmlParseScript.	2020-07-23 20:47:35 +02:00
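
Commits `b25acce858` and `7279d23636` above describe switching the tag lookup to a binary search while keeping the string comparison case-insensitive. The following is a minimal sketch of that general technique using only the C standard library; it is not the code from HTMLparser.c. The table `demoTags` and the functions `demoCaseCmp`, `demoCompareTag`, and `demoTagLookup` are hypothetical names made up for illustration.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical table for illustration only. It must be sorted in the
 * same (case-insensitive) order that the comparison function uses. */
static const char *const demoTags[] = {
    "a", "body", "div", "html", "p", "span", "table"
};

/* Case-insensitive comparison of two NUL-terminated strings. */
static int
demoCaseCmp(const char *s1, const char *s2) {
    for (;; s1++, s2++) {
        int c1 = tolower((unsigned char) *s1);
        int c2 = tolower((unsigned char) *s2);

        if (c1 != c2)
            return c1 - c2;
        if (c1 == 0)
            return 0;
    }
}

/* bsearch callback: the key is the searched name, the member is a
 * pointer to one entry of demoTags. */
static int
demoCompareTag(const void *key, const void *member) {
    const char *name = (const char *) key;
    const char *const *entry = (const char *const *) member;

    return demoCaseCmp(name, *entry);
}

/* Returns the canonical table entry, or NULL if the name is unknown. */
static const char *
demoTagLookup(const char *name) {
    const char *const *hit;

    if (name == NULL)
        return NULL;

    hit = bsearch(name, demoTags,
                  sizeof(demoTags) / sizeof(demoTags[0]),
                  sizeof(demoTags[0]), demoCompareTag);
    return hit != NULL ? *hit : NULL;
}
```

With this sketch, `demoTagLookup("DIV")` finds the `"div"` entry, while an unknown name such as `"blink"` yields NULL. Lowercasing both sides in the comparison is what keeps callers that pass uppercase tag names working, which is the concern raised in `7279d23636`.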

363 Commits