1
0
mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2024-10-27 04:55:04 +03:00
Commit Graph

6014 Commits

Author SHA1 Message Date
Nick Wellnhofer
f45abbd3e5 dict: Fix integer overflow of string lengths
Fixes #546.
2023-09-04 16:07:40 +02:00
Nick Wellnhofer
edc2dd48cb dict: Update hash function
Update hash function from classic Jenkins OAAT (dict.c) and a variant of
DJB2 (hash.c) to "GoodOAAT" taken from the SMHasher repo. This hash
function passes all SMHasher tests.
2023-09-04 16:07:23 +02:00
James Le Cuirot
93e8bb2a40
build: Generate better pkg-config files for static-only builds
pkg-config supports `Requires.private` and `Libs.private` fields for
static linking. However, if you're building a dynamic binary, then
pkg-config will use the non-private fields, even if just the static
libxml2 is available. This will result in libxml2 being underlinked,
causing the build to fail. The solution is to fold the private fields
into the non-private fields when the shared libxml2 is not being built.

This works for Autotools and CMake. Meson also knows how to handle this
when it automatically generates pkg-config files.
2023-09-03 08:52:36 +01:00
James Le Cuirot
4640ccac85
build: Generate better pkg-config file for SYSROOT builds
The -I and -L flags you use to build should not necessarily be the same
ones you bake into installed files. If you are building with
dependencies located under a SYSROOT then the installed files should
have no knowledge of that SYSROOT. For example, if the build requires
`-L/path/to/sysroot/usr/lib/foo` then only `-L/usr/lib/foo` should be
baked into the installed files.

pkg-config is SYSROOT-aware, so this issue can be sidestepped by using
the `Requires` field rather than the `Libs` and `Cflags` fields. This is
easily resolved if you rely solely on pkg-config, but this project falls
back to standard Autoconf checks, so a little more effort is required.

Unfortunately, this issue cannot feasibly be resolved for CMake.
`find_package` is used rather than `pkg_check_modules`, so we cannot
tell whether a pkg-config file for each dependency is present or not,
even if `find_package` uses pkg-config behind the scenes. The CMake
build does not record any dependency -I or -L flags into the pkg-config
file anyway. This is a problem in itself, although these dependencies
are most likely installed to standard locations.

Meson is very much better at handling this, as it generates the
pkg-config file automatically using the correct logic.
2023-09-03 08:52:22 +01:00
Nick Wellnhofer
54a0b19a9f autoconf: Allow custom --with-icu configure option 2023-09-01 14:52:14 +02:00
Nick Wellnhofer
c5989473b9 dict: Use thread-local storage for PRNG state 2023-09-01 14:52:11 +02:00
Nick Wellnhofer
57cfd221a6 dict: Use xoroshiro64** as PRNG
Stop using rand_r. This enables hash randomization on all platforms.
2023-09-01 14:52:04 +02:00
Nick Wellnhofer
6d7aaaa835 dict: Tune hash table growth
Introduce load factor as main trigger and increase MAX_HASH_LEN. This
should make growth behavior more predictable.

Raise size limit to INT_MAX. This avoids quadratic behavior with larger
tables.
2023-09-01 14:51:55 +02:00
Nick Wellnhofer
4b8f7cf05d hash: Fix integer overflow of nbElems 2023-09-01 14:43:08 +02:00
Nick Wellnhofer
bfd7d28698 xmllint: Fix more error messages 2023-08-29 21:16:34 +02:00
Nick Wellnhofer
373244bc66 xmllint: Fix error message when push parsing empty documents 2023-08-29 21:05:32 +02:00
Nick Wellnhofer
53050b1dd8 parser: More fixes to push parser error handling 2023-08-29 20:06:43 +02:00
Nick Wellnhofer
bbd918b2e7 parser: Fix detection of null bytes
Also suppress misleading extra errors.

Fixes #122.
2023-08-29 18:43:10 +02:00
Nick Wellnhofer
c6083a32d6 parser: Improve error handling in push parser
- Report errors earlier
- Align error messages with pull parser
2023-08-29 18:41:05 +02:00
Nick Wellnhofer
1edae30f82 parser: Don't check inputNr in xmlParseTryOrFinish
There's no apparent reason for this check. inputNr should always be 1
here.
2023-08-29 18:17:14 +02:00
Nick Wellnhofer
e48f2695fe parser: Remove push parser debugging code 2023-08-29 18:17:09 +02:00
Nick Wellnhofer
cde4499778 SAX2: Allow multiple top-level elements
When parsing with HTML_PARSE_NOIMPLIED, the result document can contain
multiple top-level elements. Rework xmlSAX2StartElement to simply add
the element as a child of ctxt->node or ctxt->myDoc.

Don't invoke xmlAddSibling for non-element parents. The context node
should always be an element node.

Fixes #584.
2023-08-27 16:35:23 +02:00
Nick Wellnhofer
d39f78069d tree: Fix copying of DTDs
- Don't create multiple DTD nodes.
- Fix UAF if malloc fails.
- Skip DTD nodes if tree module is disabled.

Fixes #583.
2023-08-23 20:43:14 +02:00
Nick Wellnhofer
4e4c89a4bc doc: Improve documentation of configuration options 2023-08-21 11:13:33 +02:00
Nick Wellnhofer
778cca386d legacy: Add stubs for disabled modules
When legacy support is requested, always enable stubs for FTP and
XPointer location modules which were removed from the standard
configuration. Going forward, the --with-legacy configuration option
should be used to provide maximum ABI compatibility.

Fixes #433.
2023-08-20 23:16:12 +02:00
Nick Wellnhofer
ed3bd05284 parser: Allow to set maximum amplification factor 2023-08-20 20:49:16 +02:00
Nick Wellnhofer
9d80a2b134 entities: Don't change doc when encoding entities
doc->encoding shouldn't be touched by xmlEncodeEntitiesInternal.
2023-08-17 12:47:14 +02:00
Nick Wellnhofer
f1c1f5c6b4 parser: Revert change to doc->encoding
Fixes #579.
2023-08-17 12:47:14 +02:00
Nick Wellnhofer
61b8e097b9 parser: Never use UTF-8 encoding handler 2023-08-16 19:50:36 +02:00
Nick Wellnhofer
507f11edf0 encoding: Remove debugging code 2023-08-16 19:50:36 +02:00
Nick Wellnhofer
138213acdf python: Fix tests on MinGW
Add the directory containing libxml2.dll with os.add_dll_directory to
make tests work on MinGW.

This has changed in Python 3.8 but for some reason, the issue only
turned up with Python 3.11 on MinGW. Contrary to documentation, copying
libxml2.dll into the directory containing the .pyd file doesn't work.
2023-08-15 12:55:35 +02:00
Nick Wellnhofer
e2ab48b9b5 malloc-fail: Fix unsigned integer overflow in xmlTextReaderPushData
Return immediately if xmlParserInputBufferRead fails.

Found by OSS-Fuzz, see #344.
2023-08-14 15:06:31 +02:00
Nick Wellnhofer
0d24fc0a47 html: Remove encoding hack in htmlCreateFileParserCtxt
Switch encoding directly instead of calling htmlCheckEncoding with faked
content.
2023-08-14 12:53:49 +02:00
Nick Wellnhofer
5db5a704eb html: Fix UAF in htmlCurrentChar
Short-lived regression found by OSS-Fuzz.
2023-08-09 18:40:25 +02:00
Nick Wellnhofer
b973ceaf2f parser: Fix mistake in xmlDetectEncoding
Short-lived regression.
2023-08-09 18:40:25 +02:00
Nick Wellnhofer
cb717d7e02 parser: Update line number after coalescing text nodes
This should make the line number of text nodes deterministic. Before,
it depended on the callback sequence which depends on the size of chunks
fed to the parser.
2023-08-09 16:58:33 +02:00
Nick Wellnhofer
855818bd2b parser: Check for truncated multi-byte sequences
When decoding input data, check whether the "raw" buffer is empty after
parsing the document. Otherwise, the input ends with a truncated
multi-byte sequence which shouldn't be silently ignored.
2023-08-08 15:21:37 +02:00
Nick Wellnhofer
95e81a360c parser: Decode all data in xmlCharEncInput
Even with flush set to true, xmlCharEncInput didn't guarantee to decode
all data. This complicated the push parser.

Remove the flush flag and always decode all available data.

Also fix ICU code where the flush flag has a different meaning. Always
set flush to false and retry even with empty input buffers.
2023-08-08 15:21:31 +02:00
Nick Wellnhofer
834b8123ef parser: Stream data when reading from memory
Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.

Historically, it was probably envisioned to read data from memory
without additional copying. This doesn't work reliably with the current
design of the XML parser which requires a terminating null byte at the
end of input buffers. This lead to xmlReadMemory interfaces, which
expect pointer and size arguments, being changed to make a
zero-terminated copy of the input buffer. Interfaces based on
xmlReadDoc, which actually expect a zero-terminated string and
would make zero-copy operation work, were then simplified to rely on
xmlReadMemoryi, resulting in an unnecessary copy.

To avoid copying (possibly gigabytes) of memory temporarily, we now
stream in-memory input just like content read from files in a
chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of
250 bytes). As a side effect, we also avoid another copy of the whole
input when handling non-UTF-8 data which was made possible by some
earlier commits.

Interfaces expecting zero-terminated strings now make use of strnlen
which unfortunately isn't part of the standard C library and only
mandated since POSIX 2008.
2023-08-08 15:21:28 +02:00
Nick Wellnhofer
5aff27ae78 parser: Optimize xmlLoadEntityContent
Load entity content via xmlParserInputBufferGrow, avoiding a copy.

This also fixes an entity size accounting error.
2023-08-08 15:21:25 +02:00
Nick Wellnhofer
facc2a06da parser: Don't overwrite EOF parser state 2023-08-08 15:21:21 +02:00
Nick Wellnhofer
59fa0bb383 parser: Simplify input pointer updates
The base member always points to the beginning of the buffer.
2023-08-08 15:21:14 +02:00
Nick Wellnhofer
c88ab7e329 parser: Don't reinitialize parser input members
The parser input struct should already be initialized.
2023-08-08 15:19:54 +02:00
Nick Wellnhofer
4ee0815514 encoding: Move rawconsumed accounting to xmlCharEncInput 2023-08-08 15:19:51 +02:00
Nick Wellnhofer
a0462e2d54 test: Add push parser test with overridden encoding
After recent changes, it should work to call xmlSwitchEncoding to
override the encoding for the push parser. This was never properly
supported, so Chromium and WebKit added a hack to reset the encoding in
the startDocument SAX handler.
2023-08-08 15:19:49 +02:00
Nick Wellnhofer
ec7be50662 parser: Rework encoding detection
Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set
when xmlSwitchEncoding is called. The parser can use the flag to
reliably detect whether an encoding was already set via user override,
BOM or other auto-detection. In this case, the encoding declaration
won't be used to switch the encoding.

Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding
and ctxt->input->buf->encoder was used.

Introduce private helper functions to switch encodings used by both the
XML and HTML parser:

- xmlDetectEncoding which skips over the BOM, allowing to remove the
  BOM checks from other encoding functions.
- xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns
  about encoding mismatches.

If users override the encoding, store the declared instead of the actual
encoding in xmlDoc. In this case, the actual encoding is known and the
raw value from the doc is more useful.

Also use the input flags to store the ISO-8859-1 fallback state.
Restrict the fallback to cases where no encoding was specified. (The
fallback is only useful in recovery mode and these days broken UTF-8 is
probably more likely than ISO-8859-1, so it might eventually be removed
completely.)

The 'charset' member of xmlParserCtxt is now unused. The 'encoding'
member of xmlParserInput is now unused.

The 'standalone' member of xmlParserInput is renamed to 'flags'.

A new parser state XML_PARSER_XML_DECL is added for the push parser.
2023-08-08 15:19:46 +02:00
Nick Wellnhofer
d38e73f91e parser: Always create UTF-8 in xmlParseReference
It seems that this code path could only be triggered after an encoding
error in recovery mode. Creating char-ref nodes is unnecessary and
typically unexpected.
2023-08-08 15:19:44 +02:00
Nick Wellnhofer
131d0dc0a7 parser: Don't use 'standalone' member of xmlParserInput
The standalone declaration is only parsed in the main input stream.
2023-08-08 15:19:39 +02:00
Nick Wellnhofer
d9ec182b65 parser: Don't detect encoding in xmlCtxtResetPush
The encoding will be detected in xmlParseTryOrFinish.
2023-08-08 15:19:36 +02:00
Nick Wellnhofer
3a64f39448 html: Remove some debugging code in htmlParseTryOrFinish 2023-08-08 15:19:25 +02:00
Nick Wellnhofer
58de9d31da valid: Fix c1->parent pointer in xmlCopyDocElementContent
Fixes #572.
2023-08-03 12:00:55 +02:00
Nick Wellnhofer
7569328138 malloc-fail: Fix memory leak in xmlCompileAttributeTest
Found by OSS-Fuzz, see #344.
2023-07-21 14:50:30 +02:00
Nick Wellnhofer
90bcbcfcc7 parser: Fix potential use-after-free in xmlParseCharDataInternal
Return immediately if a SAX handler stops the parser.

Fixes #569.
2023-07-20 21:40:57 +02:00
Nick Wellnhofer
8844744772 parser: Fix typo in previous commit 2023-06-23 23:04:30 +02:00
Nick Wellnhofer
9d0541dd2f parser: Make xmlSwitchEncoding always skip the BOM
Chromium calls xmlSwitchEncoding from the start document handler and
relies on this function to skip the BOM. Commit 98840d40 changed the
behavior when switching to UTF-16 since inspecting the input buffer at
this point is fragile.

Revert part of the commit to also skip a potential (decoded UTF-8) BOM
when switching to UTF-16. Make sure that we do this only at the start of
an input stream to avoid U-FEFF characters being lost.

BOM handling should ultimately be moved to the parsing code to avoid
such bugs.

See https://bugs.chromium.org/p/chromium/issues/detail?id=1451026
2023-06-22 18:22:32 +02:00