libxml2

mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2025-01-12 09:17:37 +03:00

Author	SHA1	Message	Date
Nick Wellnhofer	e395946194	html: Reenable buggy detection of XML declarations Switch to UTF-8 if a document starts with '<?xm' to match old behavior. Also enable this check in the push parser. Fixes #637.	2023-11-30 16:22:59 +01:00
Nick Wellnhofer	d7d0bc6581	SAX2: Ignore namespaces in HTML documents In commit `21ca8829`, we started to ignore namespaces in HTML element names but we still called xmlSplitQName, effectively stripping the namespace prefix. This would cause elements like <o:p> being parsed as <p>. Now we leave the name untouched. Fixes #508.	2023-03-31 17:08:43 +02:00
Nick Wellnhofer	76c6da4209	error: Make sure that error messages are valid UTF-8 This has caused issues with the Python bindings for a long time. Should fix #64.	2022-12-04 23:34:19 +01:00
Nick Wellnhofer	76d6b0d768	html: Don't escape ASCII chars in href attributes In several cases, href attributes can contain ASCII characters which are illegal in URIs. Escaping them often does more harm than good. Fixes #321.	2022-11-20 21:16:03 +01:00
Nick Wellnhofer	e986d09cf5	Skip incorrectly opened HTML comments Commit `4fd69f3e` fixed handling of '<' characters not followed by an ASCII letter. But a '<!' sequence followed by invalid characters should be treated as bogus comment and skipped. Fixes #380.	2022-08-02 14:38:09 +02:00
Nick Wellnhofer	f1c32b4c78	Allow missing result files in runtest Treat missing files as empty.	2022-04-04 04:28:15 +02:00
Mike Dalessio	d7b287b94c	htmlParseComment: handle abruptly-closed comments See guidance provided on abrutply-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-abrupt-closing-of-empty-comment	2022-03-02 14:42:47 +00:00
Mike Dalessio	24cdc89006	test coverage for abruptly-closed comments These establish baseline behavior so that the subsequent commit is clear about the behavior it will modify.	2022-03-02 14:42:47 +00:00
Nick Wellnhofer	2732b23466	Fix regression parsing public IDs literals in HTML Fix regression introduced when reworking htmlParsePubidLiteral in commit `93ce33c2`. Fixes #318.	2022-01-10 13:37:59 +01:00
Mike Dalessio	a67b63d183	use new htmlParseLookupCommentEnd to find comment ends Note that the caret in error messages generated during comment parsing may have moved by one byte. See guidance provided on incorrectly-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment	2020-12-16 16:12:07 +01:00
Mike Dalessio	29f5d20e84	htmlParseComment: treat `--!>` as if it closed the comment See guidance provided on incorrectly-closed comments here: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment	2020-12-16 16:12:07 +01:00
Mike Dalessio	e28d9347bc	add test coverage for incorrectly-closed comments this establishes the baseline behavior so that subsequent commits which modify this behavior are clear about what's being changed.	2020-12-16 16:12:07 +01:00
Nick Wellnhofer	93ce33c2b8	Fix several quadratic runtime issues in HTML push parser Fix a few remaining cases where the HTML push parser would scan more content during lookahead than being parsed later. Make sure that htmlParseDocTypeDecl consumes all content up to the final '>' in case of errors. The old comment said "We shouldn't try to resynchronize", but ignoring invalid content is also what the HTML5 spec mandates. Likewise, make htmlParseEndTag skip to the final '>' in invalid end tags even if not in recovery mode. This is probably the most visible change in practice and leads to different output for some tests but is also more in line with HTML5. Make sure that htmlParsePI and htmlParseComment don't abort if invalid characters are encountered but log an error and ignore the character. Change some other end-of-buffer checks to test for a zero byte instead of relying on IS_CHAR. Fix usage of IS_CHAR macro in htmlParseScript.	2020-07-23 20:47:35 +02:00
Nick Wellnhofer	477c7f6aff	Fix quadratic runtime in HTML parser Commit `eeb99329` removed an important optimization avoiding quadratic runtime when repeatedly scanning the input buffer for terminating characters in the HTML push parser. The related bug is https://bugzilla.gnome.org/show_bug.cgi?id=444994 Make sure that ctxt->checkIndex is always written and store additional parser state in ctxt->inSubset which is unused in the HTML parser. Found by OSS-Fuzz.	2020-07-06 12:17:20 +02:00
Nick Wellnhofer	0b2d5c48e3	Initialize keepBlanks in HTML parser This caused failures in the HTML push tests but the fix required to change the expected output of the HTML SAX tests.	2017-06-12 19:11:54 +02:00
David Kilzer	85c112a082	Add test cases for bug 758518 test/HTML/758518-entity.html exposed a bug in pushParseTest() in runtest.c which assumed that an input file was at least 4 bytes long. That test case is only 3 bytes, so we now take the minimum of 4 bytes or the length of the test input. We also now use 'chunkSize' in place of the hard-coded value '1024' later in the function.	2017-06-12 18:26:11 +02:00
Pranjal Jumde	0bcd05c5cd	Heap-based buffer overread in htmlCurrentChar For https://bugzilla.gnome.org/show_bug.cgi?id=758606 * parserInternals.c: (xmlNextChar): Add an test to catch other issues on ctxt->input corruption proactively. For non-UTF-8 charsets, xmlNextChar() failed to check for the end of the input buffer and would continuing reading. Fix this by pulling out the check for the end of the input buffer into common code, and return if we reach the end of the input buffer prematurely. * result/HTML/758606.html: Added. * result/HTML/758606.html.err: Added. * result/HTML/758606.html.sax: Added. * result/HTML/758606_2.html: Added. * result/HTML/758606_2.html.err: Added. * result/HTML/758606_2.html.sax: Added. * test/HTML/758606.html: Added test case. * test/HTML/758606_2.html: Added test case.	2016-05-23 15:01:07 +08:00
Hugh Davenport	beca86e8c8	Detect change of encoding when parsing HTML names From https://bugzilla.gnome.org/show_bug.cgi?id=758518 Happens when a file has a name getting parsed, but no valid encoding set, so libxml has to guess what the encoding is. This patch detects when the buffer location changes, and if it does, restarts the parsing of the name. This slightly change a couple of regression tests output	2016-05-23 15:01:07 +08:00
Pranjal Jumde	a820dbeac2	Bug 758605: Heap-based buffer overread in xmlDictAddString <https://bugzilla.gnome.org/show_bug.cgi?id=758605 > Reviewed by David Kilzer. * HTMLparser.c: (htmlParseName): Add bounds check. (htmlParseNameComplex): Ditto. * result/HTML/758605.html: Added. * result/HTML/758605.html.err: Added. * result/HTML/758605.html.sax: Added. * runtest.c: (pushParseTest): The input for the new test case was so small (4 bytes) that htmlParseChunk() was never called after htmlCreatePushParserCtxt(), thereby creating a false positive test failure. Fixed by using a do-while loop so we always call htmlParseChunk() at least once. * test/HTML/758605.html: Added.	2016-05-23 15:01:07 +08:00
Daniel Veillard	f933c89813	Keep non-significant blanks node in HTML parser For https://bugzilla.gnome.org/show_bug.cgi?id=681822 Regardless if the option HTML_PARSE_NOBLANKS is set or not, blank nodes are removed from a HTML document, for example: <html> <head> <title>This is a test.</title> </head> <body> <p>This is a test.</p> </body> </html> is read as: <html><head><title>This is a test.</title></head><body> <p>This is a test.</p> </body></html> This changes the default behaviour but the old behaviour is available as expected when using the parser flag HTML_PARSE_NOBLANKS Based on original patch from Igor Ignatyuk <igor_ignatiouk@hotmail.com> * HTMLparser.c: change various places in the parser where ignorable_space SAX callback was called without checking for the parser flag preference * xmllint.c: make sure we use the new flag even for HTML parsing * result/HTML/*: this modifies the output of a number of tests	2012-09-07 19:32:12 +08:00
Denis Pauk	a0cd075d94	HTML parser error with <noscript> in the <head> For https://bugzilla.gnome.org/show_bug.cgi?id=615785 When the <noscript> is found, <head> is closed and a <body> element is created. The real <body id="xxx"> gets skipped over, so I can't see any of the body's attributes. Just don't close <head> when encountering a <noscript> Add a regression test too	2012-05-11 19:31:12 +08:00
Denis Pauk	868d92da89	Add HTML parser support for HTML5 meta charset encoding declaration For https://bugzilla.gnome.org/show_bug.cgi?id=655218 http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element """ The charset attribute specifies the character encoding used by the document. This is a character encoding declaration. If the attribute is present in an XML document, its value must be an ASCII case-insensitive match for the string "UTF-8" (and the document is therefore forced to use UTF-8 as its encoding). """ However, while <meta http-equiv="Content-Type" content="text/html; charset=utf8"> works, <meta charset="utf8"> does not. While libxml2 HTML parser is not tuned for HTML5, this is a simple addition Also added a testcase	2012-05-10 15:34:57 +08:00
Daniel Veillard	3c080d6d72	Don't give default HTML boolean attribute values in parser * HTMLparser.c: don't default value of HTML boolean attributes in the parser * SAX2.c: move this to SAX2 tree building backend * result/HTML/doc2.htm.sax result/HTML/doc3.htm.sax result/HTML/wired.html.sax: this changes a few HTML SAX regression tests	2010-03-15 15:47:50 +01:00
Daniel Veillard	a57ba4ce96	fix an HTML parsing error on large data sections reported by Mike Day add * HTMLparser.c: fix an HTML parsing error on large data sections reported by Mike Day * test/HTML/utf8bug.html result/HTML/utf8bug.html.err result/HTML/utf8bug.html.sax result/HTML/utf8bug.html: add the reproducer to the test suite daniel svn path=/trunk/; revision=3797	2008-09-25 16:06:18 +00:00
Daniel Veillard	42720248e6	change the way script/style are parsed to not try to detect comments, * HTMLparser.c: change the way script/style are parsed to not try to detect comments, reported by Mike Day * result/HTML/doc3.*: affects the result of that test Daniel svn path=/trunk/; revision=3598	2007-04-16 07:02:31 +00:00
Daniel Veillard	c47d263049	fixing HTML minimized attribute values to be generated internally if not * HTMLparser.c: fixing HTML minimized attribute values to be generated internally if not present, fixes bug #332124 * result/HTML/doc2.htm.sax result/HTML/doc3.htm.sax result/HTML/wired.html.sax: this affects the SAX event strem for a few test cases Daniel	2006-10-17 16:13:27 +00:00
Daniel Veillard	48519092e5	fixing HTML entities in attributes parsing bug #362552 added to the * HTMLparser.c: fixing HTML entities in attributes parsing bug #362552 * result/HTML/entities2.html* test/HTML/entities2.html: added to the regression suite Daniel	2006-10-17 15:56:35 +00:00
Daniel Veillard	b990008f05	script HTML parser error fix, corrects bug #319715 added test from Michael * HTMLparser.c: script HTML parser error fix, corrects bug #319715 * result/HTML/53867* test/HTML/53867.html: added test from Michael Day to the regression suite Daniel	2005-10-25 12:36:29 +00:00
Daniel Veillard	36d73403ff	Applied the last patch from Gary Coady for #304637 changing the behaviour * HTMLparser.c: Applied the last patch from Gary Coady for #304637 changing the behaviour when text nodes are found in body * result/HTML/*: this changes the output of some tests Daniel	2005-09-01 09:52:30 +00:00
Daniel Veillard	b8c8016044	fixed bug #310333 with a patch close to the provided patch for HTML UTF-8 * HTMLtree.c: fixed bug #310333 with a patch close to the provided patch for HTML UTF-8 serialization * result/HTML/script2.html: this changed the output of that test Daniel	2005-08-08 13:46:45 +00:00
Daniel Veillard	358fef4b1e	applied UTF-8 script parsing bug #310229 fix from Jiri Netolicky added the * HTMLparser.c: applied UTF-8 script parsing bug #310229 fix from Jiri Netolicky * result/HTML/script2.html* test/HTML/script2.html: added the test case from the regression suite Daniel	2005-07-13 16:37:38 +00:00
Daniel Veillard	597f1c1f34	applied patch from James Bursa fixing an html parsing bug in push mode * HTMLparser.c: applied patch from James Bursa fixing an html parsing bug in push mode * result/HTML/repeat.html* test/HTML/repeat.html: added the test to the regression suite Daniel	2005-07-03 23:00:18 +00:00
Daniel Veillard	fc484dd0a0	added support for HTML PIs #156087 added specific tests Daniel * HTMLparser.c: added support for HTML PIs #156087 * test/HTML/python.html result/HTML/python.html*: added specific tests Daniel	2004-10-22 14:34:23 +00:00
Daniel Veillard	18a65095e0	fix to the fix for #141864 from Paul Elseth apply fix from David Gatwood * xmlIO.c: fix to the fix for #141864 from Paul Elseth * HTMLparser.c result/HTML/doc3.htm: apply fix from David Gatwood for #141195 about text between comments. Daniel	2004-05-11 15:57:42 +00:00
Daniel Veillard	42fd412637	change --html to make sure we use the HTML serialization rule by default * xmllint.c: change --html to make sure we use the HTML serialization rule by default when HTML parser is used, add --xmlout to allow to force the XML serializer on HTML. * HTMLtree.c: ugly tweak to fix the output on <p> element and solve #125093 * result/HTML/*: this changes the output of some tests Daniel	2003-11-04 08:47:48 +00:00
Daniel Veillard	652f9aa966	Fix #124907 by simply backporting the same fix as for the XML parser * HTMLparser.c: Fix #124907 by simply backporting the same fix as for the XML parser * result/HTML/doc3.htm.err: change to ID detecting modified one test result. Daniel	2003-10-28 22:04:45 +00:00
Daniel Veillard	05bcb7ed30	fixed to not send NULL to %s printing cleaning up some of the regression * HTMLparser.c: fixed to not send NULL to %s printing * python/tests/error.py result/HTML/doc3.htm.err result/HTML/test3.html.err result/HTML/wired.html.err result/valid/t8.xml.err result/valid/t8a.xml.err: cleaning up some of the regression tests error Daniel	2003-10-19 14:26:34 +00:00
Daniel Veillard	f403d298c3	more code cleanup, especially around error messages, the HTML parser has * HTMLparser.c Makefile.am legacy.c parser.c parserInternals.c include/libxml/xmlerror.h: more code cleanup, especially around error messages, the HTML parser has now been upgraded to the new handling. * result/HTML/*: a few changes in the resulting error messages Daniel	2003-10-05 13:51:35 +00:00
Daniel Veillard	4b1577f14a	removing the SAXresults tree, keeping result in the same tree, added * Makefile.am results/.sax SAXResult/: removing the SAXresults tree, keeping result in the same tree, added SAXtests to the default "make tests" Daniel	2003-09-03 13:10:37 +00:00
Daniel Veillard	20aa0fb478	fixed a small problem in the patch for #118763 this reverts back to the * tree.c: fixed a small problem in the patch for #118763 * result/HTML/doc3.htm*: this reverts back to the previous result Daniel	2003-08-04 19:43:15 +00:00
Daniel Veillard	39057f40d6	fixing HTML attribute serialization bug #118763 applying a modified * tree.c: fixing HTML attribute serialization bug #118763 applying a modified version of the patch from Bacek * result/HTML/doc3.htm*: this modifies the output from one test Daniel	2003-08-04 01:33:43 +00:00
Daniel Veillard	8265a18a6a	do not generate " for " outside of attributes this changes the output * entities.c: do not generate " for " outside of attributes * result//*: this changes the output of some tests Daniel	2003-06-13 10:05:56 +00:00
William M. Brack	3b811174f7	Updated testfiles for error.c fix	2003-05-14 02:53:43 +00:00
Daniel Veillard	ef0b450163	fixed some problems related to #75813 about handling of Result Value Trees * xpath.c: fixed some problems related to #75813 about handling of Result Value Trees Daniel	2003-03-24 13:57:34 +00:00
Daniel Veillard	77a90a7f8e	patch from johan@evenhuis.nl for #107937 fixing some line counting * HTMLparser.c parser.c parserInternals.c: patch from johan@evenhuis.nl for #107937 fixing some line counting problems, and some other cleanups. * result/HTML/: this result in some line number changes Daniel	2003-03-22 00:04:05 +00:00
Daniel Veillard	fee408f5eb	final touch at closing #87235 </p> end tags need to be generated. this * HTMLparser.c: final touch at closing #87235 </p> end tags need to be generated. * result/HTML/cf_128.html result/HTML/test2.html result/HTML/test3.html: this change slightly the output of a few tests * doc/*: regenerated Daniel	2002-11-22 13:18:30 +00:00
Daniel Veillard	ce02dbc430	Mikhail Sogrine pointed out a bug in HTML parsing, applied his patch added * HTMLparser.c: Mikhail Sogrine pointed out a bug in HTML parsing, applied his patch * result/HTML/attrents.html result/HTML/attrents.html.err result/HTML/attrents.html.sax test/HTML/attrents.html: added the test and result case provided by Mikhail Sogrine Daniel	2002-10-22 19:14:58 +00:00
Daniel Veillard	8c9872ca2e	trying to fix 87235 about discarded white spaces in the HTML parser. this * HTMLparser.c: trying to fix 87235 about discarded white spaces in the HTML parser. * result/HTML/*: this changes the output of a number of HTML regression tests Daniel	2002-07-05 18:17:10 +00:00
Daniel Veillard	6231e84559	fixed & serialization bug introduced in 2.4.20 this changes a few things * HTMLtree.c: fixed & serialization bug introduced in 2.4.20 * result/HTML/*: this changes a few things in the results Daniel	2002-04-18 11:54:04 +00:00
Daniel Veillard	eb475a37df	fixing bug #78662 i.e. add proper escaping of URI when saving HTML files. * HTMLtree.c uri.c: fixing bug #78662 i.e. add proper escaping of URI when saving HTML files. * result/HTML/*: this impacted some tests Daniel	2002-04-14 22:00:22 +00:00

1 2

88 Commits