mirror of https://github.com/samba-team/samba.git synced 2024-12-22 13:34:15 +03:00

libutil/iconv: don't allow wtf-8 surrogate pairs

At present, if we meet a string like "hello \xed\xa7\x96 world", the
bytes in the middle will be converted into half of a surrogate pair,
and the UTF-16 will be invalid. It is better to error out immediately,
because the UTF-8 string is already invalid.
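
To make the example concrete, here is a minimal standalone sketch of the arithmetic. It is not the Samba code: `raw_codepoint_3byte`, `decode_utf8_3byte`, and `example_usage` are hypothetical names used only for illustration. It shows that the bytes `\xed\xa7\x96` carry the payload 0xD9D6, which lies in the high-surrogate range, and that a decoder following this commit's approach would refuse it with EILSEQ.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a three-byte
 * sequence, with no validation at all. */
static uint32_t raw_codepoint_3byte(const uint8_t *c)
{
	return ((uint32_t)(c[0] & 0x0f) << 12) |
	       ((uint32_t)(c[1] & 0x3f) << 6) |
	       (uint32_t)(c[2] & 0x3f);
}

/* Hypothetical helper: decode one three-byte UTF-8 sequence.
 * Returns 0 on success, or -1 with errno set to EILSEQ when the
 * bytes are malformed, overlong, or encode a UTF-16 surrogate. */
static int decode_utf8_3byte(const uint8_t *c, uint32_t *out)
{
	uint32_t codepoint;

	if ((c[0] & 0xf0) != 0xe0 ||
	    (c[1] & 0xc0) != 0x80 ||
	    (c[2] & 0xc0) != 0x80) {
		errno = EILSEQ;
		return -1;
	}
	codepoint = raw_codepoint_3byte(c);
	if (codepoint < 0x800) {
		/* overlong: fits in one or two bytes */
		errno = EILSEQ;
		return -1;
	}
	if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
		/* half of a surrogate pair: invalid in UTF-8 */
		errno = EILSEQ;
		return -1;
	}
	*out = codepoint;
	return 0;
}

/* Usage: the bytes from the commit message versus a valid sequence. */
static void example_usage(void)
{
	const uint8_t bad[3] = { 0xed, 0xa7, 0x96 };
	const uint8_t euro[3] = { 0xe2, 0x82, 0xac }; /* U+20AC */
	uint32_t cp = 0;

	assert(raw_codepoint_3byte(bad) == 0xd9d6);
	assert(decode_utf8_3byte(bad, &cp) == -1);
	assert(decode_utf8_3byte(euro, &cp) == 0 && cp == 0x20ac);
}
```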

https://learn.microsoft.com/en-us/windows/win32/api/Stringapiset/nf-stringapiset-widechartomultibyte#remarks
is a citation for the statement about this being a pre-Vista
problem.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
Douglas Bagnall 2023-07-05 13:26:12 +12:00 committed by Andrew Bartlett
parent d7481f94e0
commit 949fe57077

@@ -861,6 +861,39 @@ static size_t utf8_pull(void *cd, const char **inbuf, size_t *inbytesleft,
	errno = EILSEQ;
	goto error;
}
if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
	/*
	 * This is an invalid codepoint, per
	 * RFC3629, as it encodes part of a
	 * UTF-16 surrogate pair for a
	 * character over U+10000, which ought
	 * to have been encoded as a four byte
	 * utf-8 sequence.
	 *
	 * Prior to Vista, Windows might
	 * sometimes produce invalid strings
	 * where a utf-16 sequence containing
	 * surrogate pairs was converted
	 * "verbatim" into utf-8, instead of
	 * encoding the actual codepoint. This
	 * format is sometimes called "WTF-8".
	 *
	 * If we were to support that, we'd
	 * have a branch here for the case
	 * where the codepoint is between
	 * 0xd800 and 0xdbff (a "high
	 * surrogate"), and read a *six*
	 * character sequence from there which
	 * would include a low surrogate. But
	 * that would undermine the
	 * hard-learnt principle that each
	 * character should only have one
	 * encoding.
	 */
	errno = EILSEQ;
	goto error;
}
uc[0] = codepoint & 0xff;
uc[1] = codepoint >> 8;
c += 3;
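
The "one encoding" point in the comment above can be illustrated with a small standalone sketch. It is not part of the patch, and `put_3byte`, `encode_utf8_4byte`, `encode_surrogates_6byte`, and `example_usage` are hypothetical names. For U+1F600 it produces both the valid four-byte UTF-8 form and the six-byte form obtained by encoding each UTF-16 surrogate separately, which is what the comment calls "WTF-8" (this layout is also described elsewhere as CESU-8). Accepting both would give the same character two byte representations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write a 16-bit value in the range
 * 0x800..0xffff as a three-byte UTF-8 style sequence. */
static void put_3byte(uint32_t v, uint8_t *out)
{
	out[0] = 0xe0 | (v >> 12);
	out[1] = 0x80 | ((v >> 6) & 0x3f);
	out[2] = 0x80 | (v & 0x3f);
}

/* Valid UTF-8: one four-byte sequence for U+10000..U+10FFFF.
 * Returns the number of bytes written, or 0 if cp is out of range. */
static size_t encode_utf8_4byte(uint32_t cp, uint8_t out[4])
{
	if (cp < 0x10000 || cp > 0x10ffff) {
		return 0;
	}
	out[0] = 0xf0 | (cp >> 18);
	out[1] = 0x80 | ((cp >> 12) & 0x3f);
	out[2] = 0x80 | ((cp >> 6) & 0x3f);
	out[3] = 0x80 | (cp & 0x3f);
	return 4;
}

/* Invalid form that the patch rejects: split cp into a UTF-16
 * surrogate pair and encode each half as three bytes. */
static size_t encode_surrogates_6byte(uint32_t cp, uint8_t out[6])
{
	uint32_t v;

	if (cp < 0x10000 || cp > 0x10ffff) {
		return 0;
	}
	v = cp - 0x10000;
	put_3byte(0xd800 + (v >> 10), out);       /* high surrogate */
	put_3byte(0xdc00 + (v & 0x3ff), out + 3); /* low surrogate */
	return 6;
}

/* Usage: the same character, two different byte strings. */
static void example_usage(void)
{
	uint8_t four[4], six[6];

	assert(encode_utf8_4byte(0x1f600, four) == 4);
	assert(memcmp(four, "\xf0\x9f\x98\x80", 4) == 0);
	assert(encode_surrogates_6byte(0x1f600, six) == 6);
	assert(memcmp(six, "\xed\xa0\xbd\xed\xb8\x80", 6) == 0);
}
```

In the six-byte form the first three bytes carry the payload 0xD83D, so the surrogate range check added in this hunk fails the conversion at the first half with EILSEQ.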