<!-- Mirror of https://github.com/samba-team/samba.git, synced 2025-01-24 02:04:21 +03:00, commit cc841dde2f (this used to be commit 85434d3144656e6fe587637276d6a2667df1857f). -->
<chapter id="unicode">
<chapterinfo>
&author.jelmer;
<author>
<firstname>TAKAHASHI</firstname><surname>Motonobu</surname>
<affiliation>
<address><email>monyo@home.monyo.com</email></address>
</affiliation>
</author>
<pubdate>25 March 2003</pubdate>
</chapterinfo>

<title>Unicode/Charsets</title>

<sect1>
<title>What are charsets and Unicode?</title>

<para>
Computers communicate in numbers. In text, each number is
translated to a corresponding letter. The meaning assigned
to a given number depends on the <emphasis>character set (charset)</emphasis>
that is used.
A charset can be seen as a table that is used to translate numbers to
letters. Not all computers use the same charset (there are charsets
with German umlauts, Japanese characters, and so on). A charset usually contains
256 characters, which means that storing a character with it takes
exactly one byte.
</para>
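
<para>
The "table" idea can be illustrated with a minimal sketch. It assumes
Python's built-in <constant>cp850</constant> and <constant>iso8859_15</constant>
codecs, which are used here purely for illustration and are not part of Samba.
The same byte value is looked up in two different tables and yields two
different letters.
</para>

<programlisting>
```python
# One byte, two charsets: the number stays the same, the letter does not.
raw = bytes([0xE4])

# Look the byte up in the ISO 8859-15 table (common on Western European Unix systems).
as_latin9 = raw.decode("iso8859_15")

# Look the same byte up in the CP850 table (a DOS codepage for Western Europe).
as_cp850 = raw.decode("cp850")

# The reverse direction: one letter, two different byte values.
a_umlaut = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
latin9_bytes = a_umlaut.encode("iso8859_15")
cp850_bytes = a_umlaut.encode("cp850")

print(repr(as_latin9), repr(as_cp850))
print(latin9_bytes.hex(), cp850_bytes.hex())
```
</programlisting>

<para>
If two computers exchange the byte without agreeing on the table, each one
will display a different letter.
</para>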

<para>
There are also charsets that support even more characters,
but those need twice as much storage space (or even more). A charset
that uses two bytes per character can contain
<command>256 * 256 = 65536</command> characters, which
is enough for most characters in everyday use. They are called
multibyte charsets because they use more than one byte to
store one character.
</para>

<para>
A standardised multibyte charset is Unicode; information is available at
<ulink url="http://www.unicode.org/">www.unicode.org</ulink>.
A big advantage of using a multibyte charset is that you only need one: there is no
need to make sure two computers use the same charset when they are
communicating.
</para>
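
<para>
The storage cost of single-byte and multibyte charsets can be compared with a
short sketch. It assumes Python's built-in codecs
(<constant>cp850</constant>, <constant>utf_16_le</constant> as one encoding of
Unicode, and <constant>cp932</constant>); the helper name
<function>bytes_per_char</function> is hypothetical and chosen only for this example.
</para>

<programlisting>
```python
def bytes_per_char(text, codec):
    """Return the number of bytes each character of text occupies in codec."""
    return [len(ch.encode(codec)) for ch in text]

# Three Japanese characters, written as escapes to keep this listing ASCII-only.
japanese = "\u65e5\u672c\u8a9e"

# A single-byte charset stores every character in exactly one byte.
single = bytes_per_char("abc", "cp850")

# UTF-16 stores each of these characters in two bytes.
utf16 = bytes_per_char(japanese, "utf_16_le")

# CP932 mixes widths: one byte for ASCII letters, two bytes for kanji.
mixed = bytes_per_char("a" + japanese, "cp932")

print(single, utf16, mixed)
```
</programlisting>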

<para>Old Windows clients used single-byte charsets, named
'codepages' by Microsoft. However, the SMB protocol has no support for
negotiating the charset to be used. Thus, you
have to make sure you are using the same charset when talking to an old client.
Newer clients (Windows NT, 2K, XP) talk Unicode over the wire.
</para>
</sect1>

<sect1>
<title>Samba and charsets</title>

<para>
As of Samba 3.0, Samba can (and will) talk Unicode over the wire. Internally,
Samba knows of three kinds of character sets:
</para>

<variablelist>
<varlistentry>
<term><parameter>unix charset</parameter></term>
<listitem><para>
This is the charset used internally by your operating system.
The default is <constant>ASCII</constant>, which is fine for most
systems.
</para></listitem>
</varlistentry>

<varlistentry>
<term><parameter>display charset</parameter></term>
<listitem><para>This is the charset Samba will use to print messages
on your screen. It should generally be the same as the <command>unix charset</command>.
</para></listitem>
</varlistentry>

<varlistentry>
<term><parameter>dos charset</parameter></term>
<listitem><para>This is the charset Samba uses when communicating with
DOS and Windows 9x clients. Samba will talk Unicode to all newer clients.
The default depends on the charsets you have installed on your system.
Run <command>testparm -v | grep "dos charset"</command> to see
what the default is on your system.
</para></listitem>
</varlistentry>
</variablelist>

</sect1>

<sect1>
<title>Conversion from old names</title>

<para>Because previous Samba versions did not do any charset conversion,
characters in filenames are usually not correct in the unix charset, but only
in the local charset used by the DOS/Windows clients.</para>

<para>The following script from Steve Langasek converts all
filenames from CP850 to the iso8859-15 charset.</para>

<para>
<prompt>#</prompt><userinput>find <replaceable>/path/to/share</replaceable> -type f -exec bash -c 'CP="{}"; ISO=`echo -n "$CP" | iconv -f cp850 \
-t iso8859-15`; if [ "$CP" != "$ISO" ]; then mv "$CP" "$ISO"; fi' \;
</userinput>
</para>
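
<para>
The same renaming idea can be sketched in Python. This is an illustrative
alternative, not the script referred to above: the function names
<function>convert_name</function> and <function>rename_tree</function> and the
<parameter>dry_run</parameter> argument are hypothetical, and the sketch assumes
Python's built-in <constant>cp850</constant> and <constant>iso8859_15</constant>
codecs rather than iconv. By default it only reports what it would rename.
</para>

<programlisting>
```python
import os

def convert_name(name, src="cp850", dst="iso8859_15"):
    """Re-encode one file name, given as bytes, from charset src to charset dst."""
    return name.decode(src).encode(dst)

def rename_tree(root, src="cp850", dst="iso8859_15", dry_run=True):
    """Collect (old, new) path pairs for files whose names change under conversion.

    Files are renamed only when dry_run is False. Unlike "find -type f",
    os.walk also lists symlinks and other non-directory entries as files.
    """
    planned = []
    # Walking a bytes path makes os.walk return raw, undecoded file names.
    for dirpath, _dirs, files in os.walk(os.fsencode(root)):
        for old in files:
            try:
                new = convert_name(old, src, dst)
            except UnicodeError:
                continue  # name has no representation in dst; leave it untouched
            if new != old:
                pair = (os.path.join(dirpath, old), os.path.join(dirpath, new))
                planned.append(pair)
                if not dry_run:
                    os.rename(pair[0], pair[1])
    return planned

if __name__ == "__main__":
    # Dry run over the current directory: print the planned renames only.
    for old_path, new_path in rename_tree("."):
        print(old_path, "to", new_path)
```
</programlisting>

<para>
Running the dry run first and reviewing its output before calling
<function>rename_tree</function> with <parameter>dry_run=False</parameter> is
advisable, since a rename that overwrites an existing file cannot be undone.
</para>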

</sect1>

<sect1>
<title>Japanese charsets</title>

<para>Samba does not work correctly with Japanese charsets yet. Here are
some points to watch when setting it up:</para>

<itemizedlist>

<listitem><para>You should set <parameter>mangling method =
hash</parameter>.</para></listitem>

<listitem><para>There are various iconv() implementations around and not
all of them work equally well. glibc2's iconv() has a critical problem
with CP932. libiconv-1.8 works with CP932 but still has some problems and
does not work with EUC-JP.</para></listitem>

<listitem><para>You should set <parameter>dos charset = CP932</parameter>, not
Shift_JIS, SJIS, etc.</para></listitem>

<listitem><para>Currently only <parameter>unix charset = CP932</parameter>
will work (though it still has some problems) because of iconv() issues.
<parameter>unix charset = EUC-JP</parameter> does not work well because of
iconv() issues.</para></listitem>

<listitem><para>Currently Samba 3.0 does not support <parameter>unix charset
= UTF8-MAC/CAP/HEX/JIS*</parameter>.</para></listitem>

</itemizedlist>
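
<para>
CP932 and Shift_JIS are closely related but not identical, which may be one
reason the exact charset name matters. The following minimal sketch shows the
difference using Python's built-in <constant>cp932</constant> and
<constant>shift_jis</constant> codecs. These are Python's own tables, not the
iconv() implementations discussed above, so the result only illustrates the
general point. The helper name <function>encodable</function> is hypothetical.
</para>

<programlisting>
```python
def encodable(text, codec):
    """Return True if text can be encoded with the named Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# CIRCLED DIGIT ONE: a vendor extension character that CP932 includes.
circled_one = "\u2460"

# An ordinary kanji that both charsets contain.
kanji = "\u65e5"

for codec in ("cp932", "shift_jis"):
    print(codec, encodable(kanji, codec), encodable(circled_one, codec))
```
</programlisting>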

<para>More information (in Japanese) is available at: <ulink url="http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html">http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html</ulink>.</para>

</sect1>

</chapter>