mirror of
https://github.com/samba-team/samba.git
synced 2024-12-22 13:34:15 +03:00
libutil/iconv: don't allow wtf-8 surrogate pairs
At present, if we meet a string like "hello \xed\xa7\x96 world", the bytes in the middle will be converted into half of a surrogate pair, and the UTF-16 will be invalid. It is better to error out immediately, because the UTF-8 string is already invalid.

https://learn.microsoft.com/en-us/windows/win32/api/Stringapiset/nf-stringapiset-widechartomultibyte#remarks is a citation for the statement about this being a pre-Vista problem.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
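To make the example in the message concrete, here is a minimal sketch of how the three bytes \xed\xa7\x96 decode. The helper names `raw_utf8_3byte` and `checked_utf8_3byte` are hypothetical and are not part of Samba; the second one only mirrors the kind of range check this commit adds.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not Samba code): combine the payload bits of a
 * three-byte UTF-8 sequence without any validity checks.
 */
static uint32_t raw_utf8_3byte(const uint8_t *c)
{
	return ((uint32_t)(c[0] & 0x0f) << 12) |
	       ((uint32_t)(c[1] & 0x3f) << 6) |
	       (uint32_t)(c[2] & 0x3f);
}

/*
 * Hypothetical helper (not Samba code): decode a three-byte sequence,
 * returning -1 for truncated, malformed, overlong, or surrogate input.
 */
static int32_t checked_utf8_3byte(const uint8_t *c, size_t len)
{
	uint32_t codepoint;

	if (len < 3 ||
	    (c[0] & 0xf0) != 0xe0 ||
	    (c[1] & 0xc0) != 0x80 ||
	    (c[2] & 0xc0) != 0x80) {
		return -1;
	}
	codepoint = raw_utf8_3byte(c);
	if (codepoint < 0x800) {
		/* overlong: fits in one or two bytes */
		return -1;
	}
	if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
		/* half of a UTF-16 surrogate pair, invalid per RFC 3629 */
		return -1;
	}
	return (int32_t)codepoint;
}
```

With these definitions, `raw_utf8_3byte` applied to \xed\xa7\x96 gives 0xd9d6, which lies in the high surrogate range 0xd800 to 0xdbff, so `checked_utf8_3byte` rejects it. An ordinary three-byte character such as \xe2\x82\xac (U+20AC) still decodes normally.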
parent d7481f94e0
commit 949fe57077
@@ -861,6 +861,39 @@ static size_t utf8_pull(void *cd, const char **inbuf, size_t *inbytesleft,
 				errno = EILSEQ;
 				goto error;
 			}
+			if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
+				/*
+				 * This is an invalid codepoint, per
+				 * RFC3629, as it encodes part of a
+				 * UTF-16 surrogate pair for a
+				 * character over U+10000, which ought
+				 * to have been encoded as a four byte
+				 * utf-8 sequence.
+				 *
+				 * Prior to Vista, Windows might
+				 * sometimes produce invalid strings
+				 * where a utf-16 sequence containing
+				 * surrogate pairs was converted
+				 * "verbatim" into utf-8, instead of
+				 * encoding the actual codepoint. This
+				 * format is sometimes called "WTF-8".
+				 *
+				 * If we were to support that, we'd
+				 * have a branch here for the case
+				 * where the codepoint is between
+				 * 0xd800 and 0xdbff (a "high
+				 * surrogate"), and read a *six*
+				 * character sequence from there which
+				 * would include a low surrogate. But
+				 * that would undermine the
+				 * hard-learnt principle that each
+				 * character should only have one
+				 * encoding.
+				 */
+				errno = EILSEQ;
+				goto error;
+			}
+
 			uc[0] = codepoint & 0xff;
 			uc[1] = codepoint >> 8;
 			c += 3;
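The "one encoding per character" point in the comment can be illustrated with a small sketch. The code below is not Samba code and is deliberately the branch the commit declines to add: the hypothetical `decode_surrogate_pair_6byte` reads two three-byte sequences as a high and low surrogate, while the hypothetical `decode_utf8_4byte` reads the normal four-byte form. Neither performs full validation.

```c
#include <stdint.h>

/* Hypothetical helper: payload bits of a three-byte sequence, unchecked. */
static uint32_t payload_3byte(const uint8_t *c)
{
	return ((uint32_t)(c[0] & 0x0f) << 12) |
	       ((uint32_t)(c[1] & 0x3f) << 6) |
	       (uint32_t)(c[2] & 0x3f);
}

/*
 * Hypothetical helper: decode a four-byte UTF-8 sequence. Continuation
 * bytes are assumed to be well formed; this is only an illustration.
 */
static uint32_t decode_utf8_4byte(const uint8_t *c)
{
	return ((uint32_t)(c[0] & 0x07) << 18) |
	       ((uint32_t)(c[1] & 0x3f) << 12) |
	       ((uint32_t)(c[2] & 0x3f) << 6) |
	       (uint32_t)(c[3] & 0x3f);
}

/*
 * Hypothetical helper: the six-byte "verbatim surrogate" decoding that
 * the commit chooses NOT to support. Returns 0 if the six bytes are not
 * a high surrogate followed by a low surrogate.
 */
static uint32_t decode_surrogate_pair_6byte(const uint8_t *c)
{
	uint32_t hi = payload_3byte(c);
	uint32_t lo = payload_3byte(c + 3);

	if (hi < 0xd800 || hi > 0xdbff || lo < 0xdc00 || lo > 0xdfff) {
		return 0;
	}
	return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}
```

Under these definitions, the six bytes ed a0 bd ed b8 80 and the four bytes f0 9f 98 80 both decode to U+1F600. Accepting both would give that character two byte-level spellings, which is the ambiguity the comment warns about.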