mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2024-10-26 12:25:09 +03:00
Commit Graph

5990 Commits

Author SHA1 Message Date
Nick Wellnhofer
507f11edf0 encoding: Remove debugging code 2023-08-16 19:50:36 +02:00
Nick Wellnhofer
138213acdf python: Fix tests on MinGW
Add the directory containing libxml2.dll with os.add_dll_directory to
make tests work on MinGW.

This has changed in Python 3.8 but for some reason, the issue only
turned up with Python 3.11 on MinGW. Contrary to documentation, copying
libxml2.dll into the directory containing the .pyd file doesn't work.
2023-08-15 12:55:35 +02:00
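A minimal Python sketch of the idea behind this commit, not the actual change made to the test scripts. The helper name `register_dll_dir` is hypothetical. `os.add_dll_directory` exists only on Windows builds of Python 3.8 and later, so the sketch guards the call and does nothing elsewhere.

```python
import os
from pathlib import Path


def register_dll_dir(dll_dir):
    """Add dll_dir to the DLL search path where the platform supports it.

    Returns the handle from os.add_dll_directory, or None when the
    directory does not exist or the call is unavailable (non-Windows
    platforms, or Python older than 3.8).
    """
    path = Path(dll_dir).resolve()
    if not path.is_dir() or not hasattr(os, "add_dll_directory"):
        return None
    # Must run before importing the extension module that needs the DLL.
    return os.add_dll_directory(str(path))
```

A test script would call the helper with the directory containing libxml2.dll before `import libxml2`.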
Nick Wellnhofer
e2ab48b9b5 malloc-fail: Fix unsigned integer overflow in xmlTextReaderPushData
Return immediately if xmlParserInputBufferRead fails.

Found by OSS-Fuzz, see #344.
2023-08-14 15:06:31 +02:00
Nick Wellnhofer
0d24fc0a47 html: Remove encoding hack in htmlCreateFileParserCtxt
Switch encoding directly instead of calling htmlCheckEncoding with faked
content.
2023-08-14 12:53:49 +02:00
Nick Wellnhofer
5db5a704eb html: Fix UAF in htmlCurrentChar
Short-lived regression found by OSS-Fuzz.
2023-08-09 18:40:25 +02:00
Nick Wellnhofer
b973ceaf2f parser: Fix mistake in xmlDetectEncoding
Short-lived regression.
2023-08-09 18:40:25 +02:00
Nick Wellnhofer
cb717d7e02 parser: Update line number after coalescing text nodes
This should make the line number of text nodes deterministic. Before,
it depended on the callback sequence which depends on the size of chunks
fed to the parser.
2023-08-09 16:58:33 +02:00
Nick Wellnhofer
855818bd2b parser: Check for truncated multi-byte sequences
When decoding input data, check whether the "raw" buffer is empty after
parsing the document. Otherwise, the input ends with a truncated
multi-byte sequence which shouldn't be silently ignored.
2023-08-08 15:21:37 +02:00
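The check described here can be illustrated with Python's standard incremental decoders. This is an analogy under stated assumptions, not libxml2's C implementation: the function name `decode_chunks` is hypothetical, and the decoder's pending bytes stand in for the parser's "raw" buffer. Input is fed chunk by chunk, as with a push parser, so a sequence split across chunks is fine, but bytes still pending after the last chunk indicate a truncated sequence.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode chunks incrementally and reject a truncated trailing sequence."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # Partial sequences at a chunk boundary are buffered, not rejected.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # getstate() returns (pending_bytes, flags); pending bytes after the
    # last chunk mean the input stopped in the middle of a sequence.
    pending, _ = decoder.getstate()
    if pending:
        raise ValueError(
            f"input ends with {len(pending)} byte(s) of a truncated sequence"
        )
    return "".join(parts)
```

For example, `decode_chunks([b"caf", b"\xc3", b"\xa9"])` decodes the split two-byte sequence, while `decode_chunks([b"caf\xc3"])` raises instead of silently dropping the last byte.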
Nick Wellnhofer
95e81a360c parser: Decode all data in xmlCharEncInput
Even with flush set to true, xmlCharEncInput didn't guarantee to decode
all data. This complicated the push parser.

Remove the flush flag and always decode all available data.

Also fix ICU code where the flush flag has a different meaning. Always
set flush to false and retry even with empty input buffers.
2023-08-08 15:21:31 +02:00
Nick Wellnhofer
834b8123ef parser: Stream data when reading from memory
Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.

Historically, it was probably envisioned to read data from memory
without additional copying. This doesn't work reliably with the current
design of the XML parser which requires a terminating null byte at the
end of input buffers. This led to xmlReadMemory interfaces, which
expect pointer and size arguments, being changed to make a
zero-terminated copy of the input buffer. Interfaces based on
xmlReadDoc, which actually expect a zero-terminated string and
would make zero-copy operation work, were then simplified to rely on
xmlReadMemory, resulting in an unnecessary copy.

To avoid copying (possibly gigabytes) of memory temporarily, we now
stream in-memory input just like content read from files in a
chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of
250 bytes). As a side effect, we also avoid another copy of the whole
input when handling non-UTF-8 data which was made possible by some
earlier commits.

Interfaces expecting zero-terminated strings now make use of strnlen
which unfortunately isn't part of the standard C library and only
mandated since POSIX 2008.
2023-08-08 15:21:28 +02:00
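The chunk-by-chunk reading described above can be sketched in Python as an analogy, not as libxml2's C code. `iter_chunks` is a hypothetical name; the value 250 is the INPUT_CHUNK size named in the commit message. Slicing a `memoryview` yields windows onto the original buffer, so no copy of the whole input is made.

```python
INPUT_CHUNK = 250  # chunk size named in the commit message


def iter_chunks(data, chunk_size=INPUT_CHUNK):
    """Yield successive zero-copy views of at most chunk_size bytes."""
    view = memoryview(data)
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview references the original buffer instead
        # of copying it.
        yield view[start:start + chunk_size]
```

A consumer processes each view in turn and can discard it afterwards, so the extra memory held at any moment is bounded by the chunk size rather than the input size.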
Nick Wellnhofer
5aff27ae78 parser: Optimize xmlLoadEntityContent
Load entity content via xmlParserInputBufferGrow, avoiding a copy.

This also fixes an entity size accounting error.
2023-08-08 15:21:25 +02:00
Nick Wellnhofer
facc2a06da parser: Don't overwrite EOF parser state 2023-08-08 15:21:21 +02:00
Nick Wellnhofer
59fa0bb383 parser: Simplify input pointer updates
The base member always points to the beginning of the buffer.
2023-08-08 15:21:14 +02:00
Nick Wellnhofer
c88ab7e329 parser: Don't reinitialize parser input members
The parser input struct should already be initialized.
2023-08-08 15:19:54 +02:00
Nick Wellnhofer
4ee0815514 encoding: Move rawconsumed accounting to xmlCharEncInput 2023-08-08 15:19:51 +02:00
Nick Wellnhofer
a0462e2d54 test: Add push parser test with overridden encoding
After recent changes, it should work to call xmlSwitchEncoding to
override the encoding for the push parser. This was never properly
supported, so Chromium and WebKit added a hack to reset the encoding in
the startDocument SAX handler.
2023-08-08 15:19:49 +02:00
Nick Wellnhofer
ec7be50662 parser: Rework encoding detection
Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set
when xmlSwitchEncoding is called. The parser can use the flag to
reliably detect whether an encoding was already set via user override,
BOM or other auto-detection. In this case, the encoding declaration
won't be used to switch the encoding.

Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding
and ctxt->input->buf->encoder was used.

Introduce private helper functions to switch encodings used by both the
XML and HTML parser:

- xmlDetectEncoding which skips over the BOM, allowing the BOM checks
  to be removed from other encoding functions.
- xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns
  about encoding mismatches.

If users override the encoding, store the declared instead of the actual
encoding in xmlDoc. In this case, the actual encoding is known and the
raw value from the doc is more useful.

Also use the input flags to store the ISO-8859-1 fallback state.
Restrict the fallback to cases where no encoding was specified. (The
fallback is only useful in recovery mode and these days broken UTF-8 is
probably more likely than ISO-8859-1, so it might eventually be removed
completely.)

The 'charset' member of xmlParserCtxt is now unused. The 'encoding'
member of xmlParserInput is now unused.

The 'standalone' member of xmlParserInput is renamed to 'flags'.

A new parser state XML_PARSER_XML_DECL is added for the push parser.
2023-08-08 15:19:46 +02:00
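The precedence described in this commit (an encoding set by user override or BOM wins over the encoding declaration) can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser's logic: `pick_encoding` is a hypothetical name, only three BOMs are checked, the other auto-detection mentioned in the message is omitted, and UTF-8 is assumed as the default when nothing is specified.

```python
import codecs

# BOMs checked by this sketch; a real detector covers more cases.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def pick_encoding(data, override=None, declared=None):
    """Choose an encoding: user override, then BOM, then declaration."""
    if override is not None:
        return override
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # The declaration is consulted only if nothing was set earlier.
    if declared is not None:
        return declared
    return "UTF-8"
```

With this ordering, a document that starts with a UTF-8 BOM but declares ISO-8859-1 is still read as UTF-8.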
Nick Wellnhofer
d38e73f91e parser: Always create UTF-8 in xmlParseReference
It seems that this code path could only be triggered after an encoding
error in recovery mode. Creating char-ref nodes is unnecessary and
typically unexpected.
2023-08-08 15:19:44 +02:00
Nick Wellnhofer
131d0dc0a7 parser: Don't use 'standalone' member of xmlParserInput
The standalone declaration is only parsed in the main input stream.
2023-08-08 15:19:39 +02:00
Nick Wellnhofer
d9ec182b65 parser: Don't detect encoding in xmlCtxtResetPush
The encoding will be detected in xmlParseTryOrFinish.
2023-08-08 15:19:36 +02:00
Nick Wellnhofer
3a64f39448 html: Remove some debugging code in htmlParseTryOrFinish 2023-08-08 15:19:25 +02:00
Nick Wellnhofer
58de9d31da valid: Fix c1->parent pointer in xmlCopyDocElementContent
Fixes #572.
2023-08-03 12:00:55 +02:00
Nick Wellnhofer
7569328138 malloc-fail: Fix memory leak in xmlCompileAttributeTest
Found by OSS-Fuzz, see #344.
2023-07-21 14:50:30 +02:00
Nick Wellnhofer
90bcbcfcc7 parser: Fix potential use-after-free in xmlParseCharDataInternal
Return immediately if a SAX handler stops the parser.

Fixes #569.
2023-07-20 21:40:57 +02:00
Nick Wellnhofer
8844744772 parser: Fix typo in previous commit 2023-06-23 23:04:30 +02:00
Nick Wellnhofer
9d0541dd2f parser: Make xmlSwitchEncoding always skip the BOM
Chromium calls xmlSwitchEncoding from the start document handler and
relies on this function to skip the BOM. Commit 98840d40 changed the
behavior when switching to UTF-16 since inspecting the input buffer at
this point is fragile.

Revert part of the commit to also skip a potential (decoded UTF-8) BOM
when switching to UTF-16. Make sure that we do this only at the start of
an input stream to avoid U+FEFF characters being lost.

BOM handling should ultimately be moved to the parsing code to avoid
such bugs.

See https://bugs.chromium.org/p/chromium/issues/detail?id=1451026
2023-06-22 18:22:32 +02:00
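The rule stated above (skip a decoded BOM only at the start of an input stream) can be sketched in Python. This is an illustration with a hypothetical helper name, `strip_leading_bom`, operating on already-decoded text chunks; it is not the xmlSwitchEncoding code.

```python
BOM = "\ufeff"


def strip_leading_bom(decoded_chunks):
    """Drop one U+FEFF at the very start of the stream, keep all others."""
    at_start = True
    for chunk in decoded_chunks:
        if at_start and chunk:
            # Only the first non-empty chunk can hold the stream's BOM.
            at_start = False
            if chunk.startswith(BOM):
                chunk = chunk[1:]
        yield chunk
```

A U+FEFF that appears later in the stream is ordinary content and passes through unchanged.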
Christoph Reiter
2473b4855e autotools: fix Python module file ext for cygwin/msys2
both use .dll, not .pyd
2023-06-21 14:38:38 +02:00
David Kilzer
5f54bac9eb testapi: test_xmlSAXDefaultVersion() leaves xmlSAX2DefaultVersionValue set to 1 with LIBXML_SAX1_ENABLED
Add code to save and to restore the default value of
xmlSAX2DefaultVersionValue.

Fixes #554.
2023-06-10 10:55:38 -07:00
Nick Wellnhofer
b236b7a588 parser: Halt parser when growing buffer results in OOM
Fix short-lived regression from previous commit.

It might be safer to make xmlBufSetInputBaseCur use the original buffer
even in case of errors.

Found by OSS-Fuzz.
2023-06-08 21:59:20 +02:00
Nick Wellnhofer
20f5c73457 parser: Recover more input from encoding errors
Don't halt the parser in xmlParserGrow to allow more input to be
recovered in case of encoding errors.

Fixes #543.
2023-06-07 14:05:34 +02:00
Nick Wellnhofer
db21cd5db9 malloc-fail: Handle malloc failures in xmlAddEncodingAlias
Avoid memory errors if an allocation fails.

See #344. Fixes #553.
2023-06-06 14:25:30 +02:00
Nick Wellnhofer
305a75ccbe malloc-fail: Fix null-deref with xmllint --copy
See #344. Fixes #552.
2023-06-06 13:15:46 +02:00
Nick Wellnhofer
6273df6c6d xpath: Ignore entity ref nodes when computing node hash
XPath queries only work reliably if entities are substituted.
Nevertheless, it's possible to query a document with entity reference
nodes. xmllint even deletes entities when the `--dropdtd` option is
passed, resulting in dangling pointers, so it's best to skip entity
reference nodes to avoid a use-after-free.

Fixes #550.
2023-05-30 12:30:27 +02:00
Nick Wellnhofer
e2f21c22d3 win32: Deprecate old Windows build system 2023-05-30 12:03:45 +02:00
Nick Wellnhofer
1e8ab6977d gitlab-ci: Lower _XOPEN_SOURCE value 2023-05-25 03:25:48 +02:00
Nick Wellnhofer
cb8ccb1078 testapi: Don't set http_proxy environment variable
We already disable network access, so this has no effect.
2023-05-25 03:17:45 +02:00
Nick Wellnhofer
9fd57df815 autotools: Improve iconv check
Use a custom test program which includes iconv.h, so we can check
whether the possibly redefined symbols in this header file match the
symbols in the iconv library.

Should fix #547.
2023-05-25 02:47:27 +02:00
Nick Wellnhofer
c3c6cc6202 runtest: Fix compilation without LIBXML_HTML_ENABLED
Fixes #545.
2023-05-24 20:08:56 +02:00
Nick Wellnhofer
981093abd1 test: Add push parser tests for split UTF-8 sequences 2023-05-18 19:35:16 +02:00
Nick Wellnhofer
e0f3016f71 parser: Fix regression when push parsing UTF-8 sequences
Partial UTF-8 sequences are allowed when push parsing.

Fixes #542.
2023-05-18 18:21:20 +02:00
Nick Wellnhofer
687a2b719e xinclude: Lower initial table size when fuzzing
We don't have test cases with many documents, so set the initial table
size to 1 when fuzzing, so there is a chance to detect reallocation
issues.
2023-05-11 13:27:52 +02:00
Nick Wellnhofer
c40cbf07a3 malloc-fail: Fix null deref after xmlXIncludeNewRef
See #344.
2023-05-11 13:27:52 +02:00
Nick Wellnhofer
105ce73da0 xinclude: Fix false positives in inclusion loop detection
xmlXIncludeRecurseDoc can realloc the cache.
2023-05-11 13:27:52 +02:00
Nick Wellnhofer
bdb5667a5c autotools: Fix ICU detection
Fixes #540.
2023-05-10 18:13:47 +02:00
Nick Wellnhofer
9dae389cee parser: Fix "huge input lookup" error with push parser
Fix parsing of larger documents without XML_PARSE_HUGE.

Should fix #538.
2023-05-09 13:30:21 +02:00
Nick Wellnhofer
b8961df65d SAX: Always validate xml:ids
The behavior shouldn't depend on mostly random configuration options.
2023-05-09 03:25:24 +02:00
Nick Wellnhofer
f24ffddbb9 Stop using sprintf
Switch remaining users to snprintf.
2023-05-08 23:33:04 +02:00
Nick Wellnhofer
01723fc68f xpath: Fix build without LIBXML_XPATH_ENABLED
Move static function declaration into XPATH block. Also move comparison
functions.

Fixes #537.
2023-05-08 23:15:30 +02:00
Nick Wellnhofer
235b15a590 SAX: Always initialize SAX1 element handlers
Follow-up to commit d0c3f01e. A parser context will be initialized to
SAX version 2, but this can be overridden with XML_PARSE_SAX1 later,
so we must initialize the SAX1 element handlers as well.

Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so
we don't switch to SAX1 if the SAX2 element handlers are NULL.
2023-05-08 19:15:44 +02:00
Mike Dalessio
3463063001 autoconf: fix iconv library paths
and pass cflags when building executables

See 0f77167f for prior related work
2023-05-06 12:26:17 -04:00