Nick Wellnhofer
9f04cce695
html: Remove unused or useless return codes
...
htmlParseStartTag should always succeed (except for malloc failures).
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e
html: Stop reporting syntax errors
...
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c6af101728
html: Test tokenizer against html5lib test suite
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
27752f75ca
html: Fix EOF handling in start tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
b19d353970
html: Fix EOF handling in comments
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17e56ac54a
html: Fix parsing of end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
24a09033c9
html: Fix bogus end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
bca6485476
html: Allow U+000C FORM FEED as whitespace
2024-10-06 18:13:05 +02:00
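For illustration, the HTML5 set of ASCII whitespace is TAB, LF, FF, CR and SPACE. A minimal sketch of such a check follows; the helper name is hypothetical and is not taken from the libxml2 sources.

```c
/* Hypothetical helper, not libxml2 code: returns nonzero if c is one of
 * the five HTML5 ASCII whitespace characters, including U+000C. */
static int is_html5_whitespace(int c) {
    return c == 0x09 ||  /* CHARACTER TABULATION */
           c == 0x0A ||  /* LINE FEED */
           c == 0x0C ||  /* FORM FEED */
           c == 0x0D ||  /* CARRIAGE RETURN */
           c == 0x20;    /* SPACE */
}
```

Note that U+000B (vertical tab) is deliberately absent from the set.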
Nick Wellnhofer
6edf1a645e
html: Fix DOCTYPE parsing
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
9678163f54
html: Don't check for valid XML characters
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a6955c13c7
html: Parse numeric character references according to HTML5
2024-10-06 18:13:05 +02:00
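As a rough sketch of what HTML5 asks for here: after the digits of a numeric character reference are read, the resulting code point is adjusted. Zero, surrogates and values above U+10FFFF become U+FFFD, and most values in 0x80..0x9F are remapped the way Windows-1252 would decode them. The function and table names below are hypothetical and this is not the libxml2 implementation.

```c
#include <stdint.h>

/* Replacements for 0x80..0x9F; 0 means "leave unchanged". */
static const uint16_t c1_map[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: map the numeric value of a character reference
 * to the code point that should be emitted. */
static uint32_t html5_fix_char_ref(uint32_t code) {
    if (code == 0 || code > 0x10FFFF)
        return 0xFFFD;                      /* null or out of range */
    if (code >= 0xD800 && code <= 0xDFFF)
        return 0xFFFD;                      /* surrogate */
    if (code >= 0x80 && code <= 0x9F && c1_map[code - 0x80] != 0)
        return c1_map[code - 0x80];         /* C1 control remapping */
    return code;
}
```

Under these rules, `&#x80;` yields U+20AC (the euro sign) and `&#0;` yields U+FFFD.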
Nick Wellnhofer
4eeac30944
html: Start to fix EOF and U+0000 handling
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e062a4a9b3
html: Add HTML5 parser option
...
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.
This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.
An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522
html: Normalize newlines
2024-10-06 18:13:05 +02:00
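HTML5 input preprocessing turns every CR LF pair and every lone CR into a single LF. A minimal in-place sketch of that rule follows, assuming the whole input sits in one buffer; the function name is hypothetical, and a streaming parser would also have to handle a CR at the very end of a chunk.

```c
#include <stddef.h>

/* Hypothetical helper: normalize newlines in buf[0..len) in place and
 * return the new length. CR LF and lone CR both become LF. */
static size_t normalize_newlines(char *buf, size_t len) {
    size_t in = 0, out = 0;

    while (in < len) {
        char c = buf[in++];
        if (c == '\r') {
            /* Swallow the LF of a CR LF pair. */
            if (in < len && buf[in] == '\n')
                in++;
            c = '\n';
        }
        buf[out++] = c;
    }
    return out;
}
```

Because the output never grows, the same buffer can be reused without a second allocation.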
Nick Wellnhofer
341dc78f24
html: Deduplicate code in htmlCurrentChar
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87
html: Parse bogus comments instead of ignoring them
...
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
8444017578
html: Add missing calls to htmlCheckParagraph()
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
86d6b9b051
html: Deduplicate some code
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
0d324bde36
html: Simplify node info accounting
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
ccb61f599e
html: Remove duplicate calls to htmlAutoClose
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e1834745e0
html: Add character data tests
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972
html: HTML5 character data states
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239
html: Parse named character references according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
d5cd0f07f8
html: Prefer SKIP(1) over NEXT in HTML parser
...
Use SKIP(1) where it's safe to avoid a function call.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dc2d498318
html: Rework htmlLookupSequence
...
Rename to htmlLookupString and use strstr for increased performance.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
637215a4de
html: Always terminate doctype declarations on '>'
...
Align with the HTML5 spec. This allows removing the old quote handling in
htmlLookupSequence.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
72e29f9a3d
html: Fix quadratic behavior in push parser
...
Fix quadratic behavior related to unquoted attribute values. We really
have to replicate parts of the HTML5 state machine to find the end of
tags reliably.
Fixes #533.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9
html: Allow attributes in end tags
...
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f2272c231b
html: Handle unexpected-solidus-in-tag according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
939b53ee12
html: Stop skipping tag content
...
Tag and attribute names should always be parsed successfully now.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dcb2abb2fe
html: Parse tag and attribute names according to HTML5
...
HTML5 allows basically all characters in tag and attribute names.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
d67833a3c5
xmllint: Use proper type to store seconds since epoch
...
Should avoid the year 2038 problem.
Fixes #801.
2024-09-26 19:34:34 +02:00
correctmost
81d38ed069
meson: Fix duplicate listing of libxml2.devhelp2
...
The duplication caused a warning when uninstalling.
2024-09-25 07:52:10 -04:00
Nick Wellnhofer
b1c5aa6544
xpath: Deprecate xmlXPathNAN and xmlXPath*INF
...
Users should simply use the C99 macros.
2024-09-19 12:50:59 +02:00
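A minimal sketch of the suggested replacement, assuming the C99 macros in question are `NAN` and `INFINITY` from `<math.h>`. The helper names are hypothetical and exist only to show the values that callers of the deprecated globals would use instead.

```c
#include <math.h>

/* Hypothetical helpers: use the C99 macros directly instead of reading
 * the deprecated xmlXPathNAN, xmlXPathPINF and xmlXPathNINF globals. */
static double my_nan(void)     { return NAN; }
static double my_pos_inf(void) { return INFINITY; }
static double my_neg_inf(void) { return -INFINITY; }
```

Results can then be classified with the standard `isnan()` and `isinf()` macros from the same header.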
Nick Wellnhofer
55ddccb645
io: Make sure not to pass partial UTF-8 to write callback
...
We cannot split UTF-8 at arbitrary boundaries.
2024-09-14 00:05:13 +02:00
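To illustrate the constraint: before handing a chunk to a write callback, the buffer can be cut at the last point that does not fall inside a multi-byte sequence, keeping the remainder for the next call. The sketch below assumes the input is otherwise valid UTF-8; the function name is hypothetical and this is not the code from the commit.

```c
#include <stddef.h>

/* Hypothetical helper: return the length of the longest prefix of
 * buf[0..len) that does not end in the middle of a UTF-8 sequence. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t len) {
    size_t i = len;
    size_t back = 0;

    /* Step back over at most three trailing continuation bytes. */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;  /* no lead byte in sight: leave the data untouched */

    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;  /* not a lead byte: leave the data untouched */

    /* Bytes available from the lead byte to the end of the buffer. */
    size_t have = len - (i - 1);
    return (have < need) ? i - 1 : len;
}
```

A caller would write the returned number of bytes and carry the remaining one to three bytes over to the next write.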
Nick Wellnhofer
c46b89e243
xpath: Deprecate xmlXPathEvalExpr
...
Also check the argument instead of crashing if there's no context.
2024-09-13 21:06:36 +02:00
Nick Wellnhofer
03f1bdd260
xpath: Make recursion check work with xmlXPathCompile
...
The check for maximum recursion depth required a parser context with an
xmlXPathContext which xmlXPathCompile didn't provide.
All other code should already set up or require an xmlXPathContext.
2024-09-13 20:59:47 +02:00
Nick Wellnhofer
dae160c64b
encoding: Fix table entry for "UTF16"
2024-09-13 12:08:20 +02:00
Nick Wellnhofer
5e7874015e
save: Make xmlEscapeTab signed
...
Fixes issues on platforms where char is unsigned.
Fixes #797.
2024-09-10 17:50:08 +02:00
Nick Wellnhofer
6e503eb742
encoding: Handle more ICU error codes
...
U_ILLEGAL_ESCAPE_SEQUENCE and U_UNSUPPORTED_ESCAPE_SEQUENCE can occur
with ISO-2022.
2024-09-10 03:34:46 +02:00
Nick Wellnhofer
55d36c5990
encoding: Fix error code in xmlUconvConvert
...
Broke in 46ec621e.
2024-09-10 03:11:18 +02:00
Nick Wellnhofer
de10d4cd5f
include: Check whether _MSC_VER is defined
...
Should fix #795.
2024-09-04 16:32:22 +02:00
Nick Wellnhofer
bd9eed4694
parser: Make unsupported encodings an error in declarations
...
This was changed in 45157261, but in encoding declarations, unsupported
encodings should raise a fatal error.
Fixes #794.
2024-09-02 19:29:39 +02:00
Nick Wellnhofer
40abebbc73
python: Fix SAX driver with character streams
...
This apparently broke with Python 3.5 which introduced character
streams.
Fixes #790.
2024-08-29 01:31:26 +02:00
Nick Wellnhofer
8ae06d5223
SAX2: Don't merge CDATA sections
...
The Document Object Model (DOM) Level 3 Core Specification says:
> Adjacent CDATASection nodes are not merged by use of the normalize
> method of the Node interface.
Fixes #412.
2024-08-29 01:31:19 +02:00
Nick Wellnhofer
dde62ae5d5
parser: Align push parsing of CDATA sections with pull parser
...
Remove special handling of CDATA sections in push parser. This makes
sure that only a single callback is generated for large sections.
Fixes #22 and is needed for #412.
2024-08-29 01:28:49 +02:00
Nick Wellnhofer
4d10e53af1
parser: Make sure to set and increment input id
...
Revert part of commits 410931e3 and b9d2f3c9.
2024-08-28 22:47:20 +02:00
Nick Wellnhofer
6d365ca02c
doc: XML_PARSE_NO_XXE is available since 2.13.0
2024-08-28 22:09:30 +02:00
Nick Wellnhofer
8ad618d2d6
doc: Document all xmllint options
...
Remove --pushsmall.
Fixes #785.
2024-08-28 22:03:30 +02:00
triallax
67ff748c3e
io: don't set the executable bit when creating files
...
Issue seems to have been introduced in 0bef93bf24.
2024-08-26 23:53:29 +01:00