mirror of
https://github.com/samba-team/samba.git
synced 2025-01-27 14:04:05 +03:00
a2e3ba6e12
(This used to be commit d8fe4a81fb0d4972b2331b3d5fc4890244b44c33)
175 lines
6.6 KiB
XML
175 lines
6.6 KiB
XML
<chapter id="unicode">
|
|
<chapterinfo>
|
|
&author.jelmer;
|
|
<author>
|
|
<firstname>TAKAHASHI</firstname><surname>Motonobu</surname>
|
|
<affiliation>
|
|
<address><email>monyo@home.monyo.com</email></address>
|
|
</affiliation>
|
|
</author>
|
|
<pubdate>25 March 2003</pubdate>
|
|
</chapterinfo>
|
|
|
|
<title>Unicode/Charsets</title>
|
|
|
|
<sect1>
|
|
<title>Features and Benefits</title>
|
|
|
|
<para>
|
|
Every industry eventually matures. One of the great areas of maturation is in
|
|
the focus that has been given over the past decade to make it possible for anyone
|
|
anywhere to use a computer. It has not always been that way, in fact, not so long
|
|
ago it was common for software to be written for exclusive use in the country of
|
|
origin.
|
|
</para>
|
|
|
|
<para>
|
|
Of all the effort that has been brought to bear on providing native language support
|
|
for all computer users, the efforts of the <ulink url="http://www.openi18n.org/">Openi18n organisation</ulink> is deserving of
|
|
special mention.
|
|
</para>
|
|
|
|
<para>
|
|
Samba-2.x supported a single locale through a mechanism called
|
|
<emphasis>codepages</emphasis>. Samba-3 is destined to become a truly trans-global
|
|
file and printer sharing platform.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1>
|
|
<title>What are charsets and unicode?</title>
|
|
|
|
<para>
|
|
Computers communicate in numbers. In texts, each number will be
|
|
translated to a corresponding letter. The meaning that will be assigned
|
|
to a certain number depends on the <emphasis>character set(charset)
|
|
</emphasis> that is used.
|
|
A charset can be seen as a table that is used to translate numbers to
|
|
letters. Not all computers use the same charset (there are charsets
|
|
with German umlauts, Japanese characters, etc). Usually a charset contains
|
|
256 characters, which means that storing a character with it takes
|
|
exactly one byte. </para>
|
|
|
|
<para>
|
|
There are also charsets that support even more characters,
|
|
but those need twice(or even more) as much storage space. These
|
|
charsets can contain <command>256 * 256 = 65536</command> characters, which
|
|
is more then all possible characters one could think of. They are called
|
|
multibyte charsets (because they use more then one byte to
|
|
store one character).
|
|
</para>
|
|
|
|
<para>
|
|
A standardised multibyte charset is <ulink url="http://www.unicode.org/">unicode</ulink>.
|
|
A big advantage of using a multibyte charset is that you only need one; there
|
|
is no need to make sure two computers use the same charset when they are
|
|
communicating.
|
|
</para>
|
|
|
|
<para>Old windows clients use single-byte charsets, named
|
|
'codepages' by Microsoft. However, there is no support for
|
|
negotiating the charset to be used in the smb protocol. Thus, you
|
|
have to make sure you are using the same charset when talking to an older client.
|
|
Newer clients (Windows NT, 2K, XP) talk unicode over the wire.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1>
|
|
<title>Samba and charsets</title>
|
|
|
|
<para>
|
|
As of samba 3.0, samba can (and will) talk unicode over the wire. Internally,
|
|
samba knows of three kinds of character sets:
|
|
</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><smbconfoption><name>unix charset</name></smbconfoption></term>
|
|
<listitem><para>
|
|
This is the charset used internally by your operating system.
|
|
The default is <constant>UTF-8</constant>, which is fine for most
|
|
systems. The default in previous samba releases was <constant>ASCII</constant>.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><smbconfoption><name>display charset</name></smbconfoption></term>
|
|
<listitem><para>This is the charset samba will use to print messages
|
|
on your screen. It should generally be the same as the <command>unix charset</command>.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><smbconfoption><name>dos charset</name></smbconfoption></term>
|
|
<listitem><para>This is the charset samba uses when communicating with
|
|
DOS and Windows 9x clients. It will talk unicode to all newer clients.
|
|
The default depends on the charsets you have installed on your system.
|
|
Run <command>testparm -v | grep "dos charset"</command> to see
|
|
what the default is on your system.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
</sect1>
|
|
|
|
<sect1>
|
|
<title>Conversion from old names</title>
|
|
|
|
<para>Because previous samba versions did not do any charset conversion,
|
|
characters in filenames are usually not correct in the unix charset but only
|
|
for the local charset used by the DOS/Windows clients.</para>
|
|
|
|
<para>Bjoern Jacke has written a utility named <ulink url="http://j3e.de/linux/convmv/">convm</ulink> that can convert whole directory
|
|
structures to different charsets with one single command.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1>
|
|
<title>Japanese charsets</title>
|
|
|
|
<para>Samba doesn't work correctly with Japanese charsets yet. Here are
|
|
points of attention when setting it up:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para>You should set <smbconfoption><name>mangling method</name><value>hash</value></smbconfoption></para></listitem>
|
|
|
|
<listitem><para>There are various iconv() implementations around and not
|
|
all of them work equally well. glibc2's iconv() has a critical problem
|
|
in CP932. libiconv-1.8 works with CP932 but still has some problems and
|
|
does not work with EUC-JP.</para></listitem>
|
|
|
|
<listitem><para>You should set <smbconfoption><name>dos charset</name><value>CP932</value></smbconfoption>, not
|
|
Shift_JIS, SJIS...</para></listitem>
|
|
|
|
<listitem><para>Currently only <smbconfoption><name>unix charset</name><value>CP932</value></smbconfoption>
|
|
will work (but still has some problems...) because of iconv() issues.
|
|
<smbconfoption><name>unix charset</name><value>EUC-JP</value></smbconfoption> doesn't work well because of
|
|
iconv() issues.</para></listitem>
|
|
|
|
<listitem><para>Currently Samba 3.0 does not support <smbconfoption><name>unix charset</name><value>UTF8-MAC/CAP/HEX/JIS*</value></smbconfoption></para></listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>More information (in Japanese) is available at: <ulink noescape="1" url="http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html">http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html</ulink>.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1>
|
|
<title>Common errors</title>
|
|
|
|
<sect2>
|
|
<title>CP850.so can't be found</title>
|
|
|
|
<para><quote>Samba is complaining about a missing <filename>CP850.so</filename> file</quote>.</para>
|
|
|
|
<para>CP850 is the default <smbconfoption><name>dos charset</name></smbconfoption>. The <smbconfoption><name>dos charset</name></smbconfoption> is used to convert data to the codepage used by your dos clients. If you don't have any dos clients, you can safely ignore this message. </para>
|
|
|
|
<para>CP850 should be supported by your local iconv implementation. Make sure you have all the required packages installed. If you compiled samba from source, make sure configure found iconv.</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
</chapter>
|