<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
  <title>Libxml Internationalization support</title>
  <meta name="GENERATOR" content="amaya V3.2">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<h1 align="center">Libxml Internationalization support</h1>

<p>Location: <a
href="http://xmlsoft.org/encoding.html">http://xmlsoft.org/encoding.html</a></p>

<p>Libxml home page: <a href="http://xmlsoft.org/">http://xmlsoft.org/</a></p>

<p>Mailing-list archive: <a
href="http://xmlsoft.org/messages/">http://xmlsoft.org/messages/</a></p>

<p>Version: $Revision$</p>

<p>Table of Contents:</p>
<ol>
  <li><a href="#What">What does internationalization support mean?</a></li>
  <li><a href="#internal">The internal encoding, how and why</a></li>
  <li><a href="#implemente">How is it implemented?</a></li>
  <li><a href="#Default">Default supported encodings</a></li>
  <li><a href="#extend">How to extend the existing support</a></li>
</ol>

<h2><a name="What">What does internationalization support mean?</a></h2>

<p>XML was designed from the start to support any character set by using
Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16
default encodings, which can both express the full Unicode range. UTF-8 is a
variable-length encoding whose greatest strengths are that it reuses the same
encoding for ASCII and saves space for Western languages, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two such 2-byte units into a surrogate pair); it makes
implementation easier, but looks a bit like overkill for Western languages.
Moreover, the XML specification allows documents to be encoded in other
encodings, on the condition that they are clearly labelled as such. For
example, the following is a well-formed XML document encoded in ISO-8859-1
and using the accented letters that we French like, for both markup and
content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre>
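<p>To make the variable-length point concrete, here is a minimal sketch in C
of how a single Unicode code point maps to 1 to 4 UTF-8 bytes. The function
name utf8_encode is made up for this illustration and is not part of
libxml:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF).
 * Hypothetical helper for illustration only, not a libxml function. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* ASCII: one byte, unchanged */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                       /* e.g. accented Latin letters */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* rest of the 16-bit range */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* beyond the 16-bit range */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

<p>With this sketch, the è of the example above (U+00E8, the single byte 0xE8
in ISO-8859-1) becomes the two bytes 0xC3 0xA8, while plain ASCII characters
keep their one-byte value.</p>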

<p>Having internationalization support in libxml means the following:</p>
<ul>
  <li>the document is properly parsed</li>
  <li>information about its encoding is saved</li>
  <li>it can be modified</li>
  <li>it can be saved in its original encoding</li>
  <li>it can also be saved in another encoding supported by libxml (for
    example straight UTF-8 or even an ASCII form)</li>
</ul>

<p>Another very important point is that the whole libxml API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.</p>

<p>It should also be noted that the HTML parser embedded in libxml now obeys
the same rules; the following document will be (as of 2.2.2) handled in an
internationalized fashion by libxml too:</p>
<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre>

<h2><a name="internal">The internal encoding, how and why</a></h2>

<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and that encoding to be UTF-8. Here is the
rationale for those choices:</p>
<ul>
  <li>Keeping the native encoding in the internal form would force libxml
    users (or the associated code) to be fully aware of the encoding of the
    original document. For example, when adding a text node to a document,
    the content would have to be provided in the document encoding, i.e. the
    client code would have to check it beforehand, make sure it conforms to
    the encoding, etc. This is very hard in practice, though in some specific
    cases it may make sense.</li>
  <li>The second decision was which encoding. From the XML spec only UTF-8 and
    UTF-16 really make sense, as they are the only two encodings for which
    support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
    considered an intelligent choice too, since it is a direct mapping of
    Unicode. I selected UTF-8 on the basis of efficiency and compatibility
    with surrounding software:
    <ul>
      <li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU-wise), is also far more compact
        than UTF-16 (and UCS-4) for a majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architectures is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will thrash
        caches (main memory/external caches/internal caches), and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8.</li>
      <li>Most libxml version 1 users were using it with straight ASCII
        most of the time; an internal encoding that required all their code
        to be rewritten was a serious show-stopper for using UTF-16 or
        UCS-4.</li>
      <li>UTF-8 is being used as the de facto internal encoding standard for
        related code like <a href="http://www.pango.org/">pango</a>, the
        upcoming Gnome text widget, and a lot of Unix code (yep, another place
        where the Unix programmer base takes a different approach from
        Microsoft, which uses UTF-16).</li>
    </ul>
  </li>
</ul>
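<p>The compactness argument can be checked with a small sketch that counts
the bytes needed to store the same sequence of code points in UTF-8, UTF-16
and UCS-4. The struct and function names are invented for this illustration;
nothing here comes from libxml itself:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Storage cost, in bytes, of one sequence of code points in each encoding.
 * Hypothetical types and names, for illustration only. */
struct enc_sizes {
    size_t utf8;
    size_t utf16;
    size_t ucs4;
};

struct enc_sizes encoded_sizes(const unsigned long *cps, size_t n)
{
    struct enc_sizes s = {0, 0, 0};
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned long cp = cps[i];

        /* UTF-8: 1 to 4 bytes depending on the magnitude of the code point */
        s.utf8 += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        /* UTF-16: one 2-byte unit, or a surrogate pair above U+FFFF */
        s.utf16 += cp < 0x10000 ? 2 : 4;
        /* UCS-4: always 4 bytes */
        s.ucs4 += 4;
    }
    return s;
}
```

<p>For a pure ASCII document of n characters the three counts are n, 2n and
4n bytes, which is the near doubling of memory discussed above. Text made
mostly of characters in the U+0800 to U+FFFF range tips the balance the other
way, since each of them takes 3 bytes in UTF-8 but only 2 in UTF-16.</p>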

<p>What does this mean in practice for the libxml user:</p>
<ul>
  <li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
    as valid UTF-8 strings. The proper way to terminate an xmlChar * string is
    simply to append a 0 byte, as usual.</li>
  <li>One just needs to make sure that when using chars outside the ASCII set,
    the values have been properly converted to UTF-8.</li>
</ul>
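<p>One way to follow the second point is to check strings before handing
them to the library. The following is a rough sketch of such a check on a
0-terminated byte string; is_valid_utf8 is a hypothetical helper written for
this page, not the checker libxml uses internally:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the 0-terminated byte string s is well-formed UTF-8,
 * 0 otherwise. Hypothetical helper, for illustration only. */
int is_valid_utf8(const unsigned char *s)
{
    assert(s != NULL);
    while (*s) {
        unsigned long cp, min;
        int extra, i;

        if (*s < 0x80) {                    /* plain ASCII byte */
            s++;
            continue;
        } else if ((*s & 0xE0) == 0xC0) {   /* lead byte of a 2-byte form */
            extra = 1; cp = *s & 0x1F; min = 0x80;
        } else if ((*s & 0xF0) == 0xE0) {   /* lead byte of a 3-byte form */
            extra = 2; cp = *s & 0x0F; min = 0x800;
        } else if ((*s & 0xF8) == 0xF0) {   /* lead byte of a 4-byte form */
            extra = 3; cp = *s & 0x07; min = 0x10000;
        } else {
            return 0;                       /* stray continuation or bad lead */
        }
        s++;
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)        /* also catches an early 0 byte */
                return 0;
            cp = (cp << 6) | (unsigned long) (*s & 0x3F);
        }
        /* reject overlong forms, out-of-range values and surrogates */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

<p>A string holding the ISO-8859-1 byte 0xE8 for è fails this check, while
the same letter written as the two bytes 0xC3 0xA8 passes.</p>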

<h2><a name="implemente">How is it implemented?</a></h2>

<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations,
i.e. when reading a document or saving one. Let's look first at the reading
sequence:</p>
<ol>
  <li>When a document is processed, we usually don't know the encoding. A
    simple heuristic makes it possible to tell UTF-16 and UCS-4 apart from
    those encodings where the ASCII range (0-0x7F) maps to ASCII.</li>
  <li>The XML declaration, if available, is parsed, including the encoding
    declaration. At that point, if the autodetected encoding is different from
    the one declared, a call to xmlSwitchEncoding() is issued.</li>
  <li>If there is no encoding declaration, then the input has to be in either
    UTF-8 or UTF-16. If it is not, then at some point when processing the
    input, the converter/checker of UTF-8 form will raise an encoding error.
    You may end up with a garbled document, or no document at all! Example:
    <pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
   ^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
   ^</pre>
  </li>
  <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
    then searches the default registered encoding converters for that
    encoding. If it's not within the default set and iconv() support has been
    compiled in, it will ask iconv for such an encoder. If this fails, then
    the parser will report an error and stop processing:
    <pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
                                             ^</pre>
  </li>
  <li>From that point on, the encoder progressively processes the input (it is
    plugged in as a front-end to the I/O module) for that entity. It captures
    and converts on the fly the document to be parsed to UTF-8. The parser
    itself just does UTF-8 checking of this input and processes it
    transparently. The only difference is that the encoding information has
    been added to the parsing context (more precisely, to the input
    corresponding to this entity).</li>
  <li>The result (when using DOM) is an internal form completely in UTF-8,
    with just the encoding information on the document node.</li>
</ol>
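<p>The heuristic of the first step can be sketched as follows. This is only
an approximation modelled on the autodetection appendix of the XML
specification, under the assumption that the entity starts with either a byte
order mark or the "&lt;?" of an XML declaration; the enum and the function
name are made up for this page and do not reproduce libxml's own code:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch (not libxml's enum). */
enum guess {
    GUESS_NONE,
    GUESS_UTF8,
    GUESS_UTF16LE,
    GUESS_UTF16BE,
    GUESS_UCS4LE,
    GUESS_UCS4BE
};

/* Guess the encoding family from the first bytes of an entity, using a
 * byte order mark or the byte layout of the leading '<' and '?'. */
enum guess guess_encoding(const unsigned char *in, size_t len)
{
    assert(in != NULL || len == 0);
    if (len >= 4) {
        /* '<' stored on 4 bytes: UCS-4, big or little endian */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return GUESS_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return GUESS_UCS4LE;
        /* "<?" stored on 2 bytes each: UTF-16 without a byte order mark */
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return GUESS_UTF16BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return GUESS_UTF16LE;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return GUESS_UTF8;                  /* UTF-8 byte order mark */
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF) /* UTF-16 byte order marks */
            return GUESS_UTF16BE;
        if (in[0] == 0xFF && in[1] == 0xFE)
            return GUESS_UTF16LE;
    }
    /* ASCII-compatible start: the encoding declaration has to decide */
    return GUESS_NONE;
}
```

<p>When the result is GUESS_NONE the first bytes are readable as ASCII, so
the parser can read the XML declaration as is and let the declared encoding
name, if any, select the converter.</p>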

<p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
<ol>
  <li>If no encoding is given, libxml will look for an encoding value
    associated with the document and, if it exists, will try to save to that
    encoding;
    <p>otherwise everything is written in the internal form, i.e. UTF-8.</p>
  </li>
  <li>So if an encoding was specified, either at the API level or on the
    document, libxml will again canonicalize the encoding name and look up a
    converter in the registered set or through iconv. If none is found, the
    function will return an error code.</li>
  <li>The converter is placed before the I/O buffer layer, as another kind of
    buffer; libxml will then simply push the UTF-8 serialization through
    that buffer, which will then progressively be converted and pushed onto
    the I/O layer.</li>
  <li>It is possible that the converter code fails on some input; for example,
    trying to push a UTF-8 encoded Chinese character through the UTF-8 to
    ISO-8859-1 converter won't work. Since the encoders are progressive, they
    will just report the error and the number of bytes converted. At that
    point libxml will decode the offending character, remove it from the
    buffer, replace it with the associated charRef encoding &amp;#123; and
    resume the conversion. This guarantees that any document will be saved
    without losses (except for markup names, where this is not legal; this is
    a problem in the current version, so in practice avoid using non-ASCII
    characters for tag or attribute names). A special "ascii" encoding
    name can be used to save documents in a pure ASCII form when
    portability is really crucial.</li>
</ol>
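<p>The character reference fallback of the last step can be illustrated with
a simplified, non-progressive sketch that turns UTF-8 into pure ASCII. The
function name and its return convention are assumptions made for this
example; libxml's real converters work incrementally on buffers:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Convert 0-terminated UTF-8 to pure ASCII, writing every non-ASCII
 * character as a decimal character reference such as &#224;.
 * Returns the output length (without the terminating 0), or -1 if out
 * is too small or a lead/continuation byte is malformed.
 * Hypothetical helper, for illustration only. */
int utf8_to_ascii_charref(const unsigned char *in, char *out, size_t outsz)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    while (*in) {
        unsigned long cp;
        int extra, n;

        /* decode one character from its lead byte */
        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return -1;
        in++;
        while (extra-- > 0) {
            if ((*in & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (unsigned long) (*in++ & 0x3F);
        }

        if (cp < 0x80) {
            /* ASCII goes through unchanged; keep room for the final 0 */
            if (used + 1 >= outsz)
                return -1;
            out[used++] = (char) cp;
        } else {
            /* anything else is replaced by its character reference */
            n = snprintf(out + used, outsz - used, "&#%lu;", cp);
            if (n < 0 || (size_t) n >= outsz - used)
                return -1;
            used += (size_t) n;
        }
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return (int) used;
}
```

<p>Applied to the content of the test document, the two UTF-8 bytes of à
become the six ASCII characters of its decimal reference, so nothing is lost
even though the target encoding cannot represent the letter directly.</p>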

<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; </pre>

<p>The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a &lt;meta&gt; tag under the &lt;head&gt;, so
a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that, the processing is the same
(and again reuses the same code).</p>
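<p>To give an idea of what looking up the encoding in such a tag involves,
here is a small sketch that extracts the charset parameter from the value of
a Content-Type attribute. It is a hypothetical stand-alone helper, not the
code behind htmlGetMetaEncoding(), and it assumes the attribute value has
already been isolated:</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy the charset parameter of a Content-Type value such as
 * "text/html; charset=ISO-8859-1" into out.
 * Returns 1 if a charset was found and fits in out, 0 otherwise.
 * Hypothetical helper, for illustration only. */
int charset_from_content(const char *content, char *out, size_t outsz)
{
    static const char key[] = "charset=";
    const char *p;
    size_t i, n;

    assert(content != NULL && out != NULL);
    for (p = content; *p; p++) {
        /* case-insensitive match of "charset=" at position p */
        for (i = 0; key[i] && tolower((unsigned char) p[i]) == key[i]; i++)
            ;
        if (key[i] != '\0')
            continue;
        p += i;
        /* the value runs up to a separator, a space or the end */
        for (n = 0; p[n] && p[n] != ';' && !isspace((unsigned char) p[n]); n++)
            ;
        if (n == 0 || n >= outsz)
            return 0;
        for (i = 0; i < n; i++)
            out[i] = p[i];
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

<p>For the HTML example shown earlier, this yields ISO-8859-1, which is the
name the parser would then use to pick a converter.</p>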

<h2><a name="Default">Default supported encodings</a></h2>

<p>libxml has a set of default converters for the following encodings (located
in encoding.c):</p>
<ol>
  <li>UTF-8 is supported by default (null handlers)</li>
  <li>UTF-16, both little and big endian</li>
  <li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
  <li>ASCII, useful mostly for saving</li>
  <li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
    predefined entities like &amp;copy; for the Copyright sign.</li>
</ol>

<p>Moreover, when compiled on a Unix platform with iconv support, the full set
of encodings supported by iconv can be instantly used by libxml. On a Linux
machine with glibc-2.1, the list of supported encodings and aliases fills 3
full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.</p>

<h3>Encoding aliases</h3>

<p>From 2.2.3, libxml has support for registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases
for existing encodings. Once registered, libxml will automatically look up
the aliases when handling a document:</p>
<ul>
  <li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
  <li>int xmlDelEncodingAlias(const char *alias);</li>
  <li>const char * xmlGetEncodingAlias(const char *alias);</li>
  <li>void xmlCleanupEncodingAliases(void);</li>
</ul>
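<p>The idea behind these functions can be pictured with a toy alias table.
The sketch below is a stand-alone illustration with invented names
(add_alias, get_alias) and a fixed-size table; it only mirrors the argument
order of the functions listed above and is not how libxml stores its
aliases:</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_ALIASES 16

/* One alias -> canonical name entry; the strings are borrowed, not copied.
 * Hypothetical structure, for illustration only. */
struct alias {
    const char *alias;
    const char *name;
};

static struct alias alias_table[MAX_ALIASES];
static size_t alias_count = 0;

/* Case-insensitive comparison of two encoding names. */
static int same_name(const char *a, const char *b)
{
    while (*a && *b &&
           toupper((unsigned char) *a) == toupper((unsigned char) *b)) {
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Register alias for name; returns 0 on success, -1 if the table is full. */
int add_alias(const char *name, const char *alias)
{
    size_t i;

    assert(name != NULL && alias != NULL);
    for (i = 0; i < alias_count; i++) {
        if (same_name(alias_table[i].alias, alias)) {
            alias_table[i].name = name;     /* override an existing alias */
            return 0;
        }
    }
    if (alias_count == MAX_ALIASES)
        return -1;
    alias_table[alias_count].alias = alias;
    alias_table[alias_count].name = name;
    alias_count++;
    return 0;
}

/* Return the canonical name registered for alias, or NULL if unknown. */
const char *get_alias(const char *alias)
{
    size_t i;

    assert(alias != NULL);
    for (i = 0; i < alias_count; i++) {
        if (same_name(alias_table[i].alias, alias))
            return alias_table[i].name;
    }
    return NULL;
}
```

<p>After add_alias("ISO-8859-1", "LATIN1"), a document declaring
encoding="latin1" would resolve to ISO-8859-1 before the converter lookup
takes place, which is the behaviour the alias functions are meant to
provide.</p>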

<h2><a name="extend">How to extend the existing support</a></h2>

<p>Well, adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8 and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders, their arguments and
expected return values are described in the encoding.h header.</p>
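<p>As an illustration of what an input routine (the xxxToUTF8 side) might
look like, here is a sketch of an ISO-8859-1 to UTF-8 converter. The
parameter layout (output buffer, in/out lengths, input buffer) follows the
general shape of the converter callbacks, but the function name and the exact
return convention used here are assumptions made for the example; check
encoding.h for the values your libxml version expects:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Convert ISO-8859-1 bytes to UTF-8.
 * On entry *inlen is the input length and *outlen the output capacity;
 * on return they hold the bytes consumed and the bytes produced.
 * Returns 0 when all the input was converted, -1 when out ran out of
 * room (the caller can resume from in + *inlen).
 * Sketch only: the return convention is an assumption of this example. */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int i = 0, o = 0, rc;

    assert(out != NULL && outlen != NULL && in != NULL && inlen != NULL);
    while (i < *inlen) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII range: copied unchanged */
            if (o + 1 > *outlen)
                break;
            out[o++] = c;
        } else {
            /* 0x80-0xFF map to U+0080-U+00FF: always two UTF-8 bytes */
            if (o + 2 > *outlen)
                break;
            out[o++] = (unsigned char) (0xC0 | (c >> 6));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
        i++;
    }
    rc = (i == *inlen) ? 0 : -1;
    *inlen = i;
    *outlen = o;
    return rc;
}
```

<p>Because the routine reports how much it consumed and produced, it can be
called repeatedly on successive chunks, which matches the progressive way the
converters are driven by the I/O layer as described earlier.</p>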

<p>A quick note on the topic of subverting the parser to use an internal
encoding other than UTF-8: in some cases people will absolutely want to keep
the internal encoding different. I think it's still possible (but the
encoding must be compatible with ASCII on the same subrange), though I
haven't tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and to bypass the
UTF-8 checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but there
is no guarantee that this will work. You may also have some trouble saving
back.</p>

<p>Basically, proper I18N support is important; it requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>

<p><a href="mailto:daniel@veillard.com">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>