mirror of
https://github.com/samba-team/samba.git
synced 2024-12-22 13:34:15 +03:00
libutil/iconv: don't allow wtf-8 surrogate pairs
At present, if we meet a string like "hello \xed\xa7\x96 world", the bytes in the middle will be converted into half of a surrogate pair, and the UTF-16 will be invalid. It is better to error out immediately, because the UTF-8 string is already invalid.

https://learn.microsoft.com/en-us/windows/win32/api/Stringapiset/nf-stringapiset-widechartomultibyte#remarks is a citation for the statement about this being a pre-Vista problem.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
parent d7481f94e0
commit 949fe57077
@@ -861,6 +861,39 @@ static size_t utf8_pull(void *cd, const char **inbuf, size_t *inbytesleft,
 				errno = EILSEQ;
 				goto error;
 			}
+			if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
+				/*
+				 * This is an invalid codepoint, per
+				 * RFC3629, as it encodes part of a
+				 * UTF-16 surrogate pair for a
+				 * character over U+10000, which ought
+				 * to have been encoded as a four byte
+				 * utf-8 sequence.
+				 *
+				 * Prior to Vista, Windows might
+				 * sometimes produce invalid strings
+				 * where a utf-16 sequence containing
+				 * surrogate pairs was converted
+				 * "verbatim" into utf-8, instead of
+				 * encoding the actual codepoint. This
+				 * format is sometimes called "WTF-8".
+				 *
+				 * If we were to support that, we'd
+				 * have a branch here for the case
+				 * where the codepoint is between
+				 * 0xd800 and 0xdbff (a "high
+				 * surrogate"), and read a *six*
+				 * character sequence from there which
+				 * would include a low surrogate. But
+				 * that would undermine the
+				 * hard-learnt principle that each
+				 * character should only have one
+				 * encoding.
+				 */
+				errno = EILSEQ;
+				goto error;
+			}
+
 			uc[0] = codepoint & 0xff;
 			uc[1] = codepoint >> 8;
 			c += 3;
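
The surrogate check in this hunk can be tried in isolation. What follows is a minimal sketch, not Samba code: the helper name `decode_utf8_3byte`, its signature, and its return convention are assumptions made for illustration. It decodes one three-byte UTF-8 sequence and rejects overlong forms and surrogate halves by setting errno to EILSEQ, in the same spirit as the branch added to utf8_pull.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of Samba: decode one three-byte UTF-8
 * sequence at `c` into `*out`. Returns 0 on success, or -1 with errno
 * set to EILSEQ if the sequence is truncated, malformed, overlong, or
 * encodes a UTF-16 surrogate (U+D800..U+DFFF).
 */
static int decode_utf8_3byte(const uint8_t *c, size_t len, uint32_t *out)
{
	uint32_t codepoint;

	assert(c != NULL && out != NULL);

	/* Need a 1110xxxx lead byte and two 10xxxxxx continuation bytes. */
	if (len < 3 ||
	    (c[0] & 0xf0) != 0xe0 ||
	    (c[1] & 0xc0) != 0x80 ||
	    (c[2] & 0xc0) != 0x80) {
		errno = EILSEQ;
		return -1;
	}

	codepoint = ((uint32_t)(c[0] & 0x0f) << 12) |
		    ((uint32_t)(c[1] & 0x3f) << 6) |
		    (uint32_t)(c[2] & 0x3f);

	if (codepoint < 0x800) {
		/* overlong: this value fits in one or two bytes */
		errno = EILSEQ;
		return -1;
	}
	if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
		/* half of a surrogate pair, i.e. a WTF-8 style sequence */
		errno = EILSEQ;
		return -1;
	}

	*out = codepoint;
	return 0;
}
```

With the bytes from the commit message, "\xed\xa7\x96", the decoded value is 0xd9d6, which falls in the high surrogate range, so the helper returns -1 with errno set to EILSEQ. A neighbouring valid sequence such as "\xed\x9f\xbf" (U+D7FF) is accepted.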