Subject: Re: What are codepages?
From: Andras Kadinger (bandit@freeside.elte.hu)
Date: Mon Jul 10 2000 - 04:37:32 EDT
Peter Gutowski wrote:
> >I hope after this short tour, the above sentence is
> >understandable now.
> Wow! That was quite an explanation. I'm delighted of course -- I never expected such depth. Thank you.
You're welcome! :)
> You mentioned
> > user could 'turn pages', change the character
> > set displayed with a command
>
> Are there Linux (Windows, Mac..) commands that do that?
There is a DOS command (chcp) to change codepages in DOS. In Windows,
AFAIK the closest thing you can do (after having installed the
corresponding regional/language support files) is changing the current
language/locale, which in turn changes the keyboard layout, and also -
in some way I don't yet know - tells programs which codepage (for
displaying characters) and language (sorting order, etc.) to use.
Windows also contains support for Unicode text and font handling, so
Unicode-aware programs with Unicode-aware file-formats work relatively
well when used with Unicode-aware fonts; for these programs changing the
language setting mostly changes the keyboard layout, as they probably (I
don't really know Windows) already receive Unicode character codes from
the keyboard-driver.
On the Macintosh, AFAIK the keyboard layout is separately settable using
the Keyboard control panel (or the flag on the menubar is displayed),
while the locale/language settings have a separate control panel called
Text I think (or is it Script?).
Under Linux and probably other unices as well, things are even more
separated; under the Linux console you can freely change your screen
fonts, and your keyboard layout, and I think there is a way to tell the
kernel/console which codepage/encoding you want to use (this is used
when displaying UCS-encoded Unicode filenames - someone more
knowledgeable should fill this in please), and you can set environment
variables to tell programs which language/locale settings (sort order,
date/time, number and currency/money formats, etc.) you intend to use.
Under X, you can also set the keyboard layout, however fonts are handled
in a more generic way that allows specifying the codepage/encoding to be
used as part of the font 'name' or reference; this allows text in
multiple languages/codepages to be displayed simultaneously, and
automatically if the application in question supports mechanisms and
file formats for conveying and setting the correct character set.
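
To illustrate the point about X font names: in the usual long X font
name (XLFD), the last two dash-separated fields name the character set
registry and encoding. Here is a minimal Python sketch that pulls them
out. The font name below is just an example of that format and
charset_of() is a helper name I made up; it is not part of any X
library.

```python
def charset_of(xlfd_name: str) -> str:
    """Return the 'registry-encoding' part of a long X font name."""
    # A full XLFD name starts with '-' and has 14 dash-separated fields.
    fields = xlfd_name.split("-")[1:]
    if len(fields) != 14:
        raise ValueError("not a full 14-field X font name")
    registry, encoding = fields[12], fields[13]
    return registry + "-" + encoding


# Example font name in XLFD form (assumed here for illustration only).
font = "-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1"
charset = charset_of(font)  # "iso8859-1"
```

An application that knows its text is in ISO-8859-2 could then ask for
a font whose name ends in iso8859-2 instead, without touching the rest
of the name.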
> Are we just talking about file names or text content of files?
Well, the concepts of character codes, codepages and Unicode in the
representation of text are general. If you are asking what the codepage
support in netatalk is designed to do, then the answer is: file names.
It is generally the wrong thing for a general-purpose file serving
system like netatalk to change the contents of files in any way, as it
would be very hard and complicated to determine which parts of a file
are text in need of conversion, and which are other data not to be
touched.
> If I write:
>
> "Gäste kamen und Gäste gingen" (R. Wagner, "Die Walküre", 1. Acte)
>
> ...does this look incorrect to you? (Umlauts over a's in Gaste and u in Walkure)
It doesn't look incorrect; I can read it perfectly, because:
- Internet standards people have invented a generic/universal way to
convey the format and encoding of email messages, including conveying
the character set used
- your email software has correctly translated the text you entered on
your Macintosh - probably using the MacRoman codepage - into ISO-8859-1
(ISOLatin-1), the non-ASCII codepage/character set most commonly used in
emails, and has also indicated this in the headers of the email message
- my email software and system has direct support for ISO-8859-1, so
without any further transcoding it could display the text perfectly. And
as I myself have some support for the German language, I could even read
it - but that has less to do with codepages. :)
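
The translation step in the list above can be sketched in a few lines
of Python, using the "mac_roman" and "iso-8859-1" codecs that ship with
Python. This is only an illustration of the idea; I am assuming MacRoman
as the source codepage, and I don't know what your email software
actually does internally.

```python
# The first word of the quoted line; \u00e4 is a-umlaut.
text = "G\u00e4ste"

mac_bytes = text.encode("mac_roman")      # bytes as stored on the Macintosh
latin1_bytes = text.encode("iso-8859-1")  # bytes as sent in the email

# The same letter has a different byte value in each codepage:
# a-umlaut is 0x8a in MacRoman and 0xe4 in ISO-8859-1.
mac_value = mac_bytes[1]
latin1_value = latin1_bytes[1]

# Transcoding means: decode with the source codepage, then encode with
# the target codepage.
transcoded = mac_bytes.decode("mac_roman").encode("iso-8859-1")

# Without transcoding, the receiver would interpret the MacRoman byte
# in its own codepage and get a different character.
misread = mac_bytes.decode("iso-8859-1")
```

Here transcoded is byte-for-byte identical to latin1_bytes, while
misread no longer contains the a-umlaut.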
> I noticed when looking at Mac files (e.g. "OurLogo/Blk&PMS284.eps") in a
> terminal window under Linux and that the "objectionable" characters were
> rendered with a colon (:) followed by 2 hexidecimal digits of that character
> (same example: OurLogo:2fBlk&PMS284.eps). My afpd (preasun2.1.4-39) seems not
> to render the "/" as a single charater, but as a trigraph. Is that what a sun
> means that it's without codepage support?
Yes; it used to be that way with netatalk since the University of
Michigan version; it was a clever idea: the '/' is illegal in unix
filenames, and the other 'objectionable' characters also could cause
problems with older utilities that were written in the age of ASCII, and
were unable to cope with characters in the upper half of the character
table. A hack was born: let's encode the problematic character (byte)
values in hex, and indicate this with a ':' - an invalid character in
Macintosh filenames, well suited to be an escape character for this
purpose. No one really looked at these filenames on the unix side, so it
didn't really matter what they looked like.
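
A rough Python sketch of this escaping scheme follows. It is not the
actual netatalk code; the function names are made up, and I am assuming
that only '/' and bytes in the upper half of the table get escaped,
which may not match exactly what afpd does.

```python
def escape_name(mac_name: bytes) -> str:
    """Turn Macintosh filename bytes into a plain-ASCII unix filename."""
    out = []
    for value in mac_name:
        # Assumption: escape '/' (illegal on unix) and all bytes >= 0x80.
        if value == ord("/") or value >= 0x80:
            out.append(":%02x" % value)  # ':' followed by two hex digits
        else:
            out.append(chr(value))
    return "".join(out)


def unescape_name(unix_name: str) -> bytes:
    """Reverse of escape_name: turn ':xx' sequences back into bytes."""
    out = bytearray()
    i = 0
    while i < len(unix_name):
        if unix_name[i] == ":":
            out.append(int(unix_name[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(unix_name[i]))
            i += 1
    return bytes(out)


escaped = escape_name(b"OurLogo/Blk&PMS284.eps")  # "OurLogo:2fBlk&PMS284.eps"
```

Because ':' cannot occur in a Macintosh filename, every ':' on the unix
side is known to start an escape, so the mapping is reversible.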
It was when the cross-platform integration of samba (Windows), netatalk
(Macintosh) and maybe even NFS (unix) came about that this proved to be
a less than convenient method. After all, all three platforms supported
more or less the same characters, at least the ones in major accented
languages, so why not handle the filenames in a way that would allow
them to appear the same (or as close as character sets permit) on all
three platforms? The first answer given to this was to use rudimentary
codepage (or character set) conversion: the lookup tables Paul Sander
mentioned in his email. With this support in place in netatalk, the
character values (bytes) in a filename provided by a Macintosh are first
converted using the lookup table, and then the result is used as the
unix filename. This is the 'codepage support' Adrian was referring to as
being disabled in the _test versions.
Linux is usually used with an ISO-8859-1 codepage, so when the
conversion routine in netatalk is provided with a MacRoman -> ISOLatin-1
table, filenames would appear (mostly) correctly in Linux.
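
As an illustration of such a lookup table, here is a Python sketch that
builds a 256-entry MacRoman -> ISOLatin-1 table and applies it to a
filename. The table is derived from Python's own codecs as a stand-in
for the real netatalk table files, and build_table() and the '?'
fallback are my own assumptions.

```python
def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Build a 256-byte table mapping each src byte to a dst byte."""
    table = bytearray(256)
    for value in range(256):
        char = bytes([value]).decode(src)
        try:
            table[value] = char.encode(dst)[0]
        except UnicodeEncodeError:
            # The character does not exist in the target codepage.
            table[value] = fallback[0]
    return bytes(table)


MAC_TO_LATIN1 = build_table("mac_roman", "iso-8859-1")

mac_name = b"G\x8aste"  # "Gaste" with a-umlaut, in MacRoman
unix_name = mac_name.translate(MAC_TO_LATIN1)  # one table lookup per byte
```

The fallback is the reason for the '(mostly)': MacRoman has characters,
such as the dagger at 0xa0, that ISOLatin-1 simply does not have, and a
table that collapses them onto '?' cannot be reversed.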
If used with a MacRoman -> Windows table, filenames would appear
(mostly) correctly under Windows through samba (which also had no
conversion until recently), one would think. This might be true, however
there is more to it, as I think (someone please chime in, I really don't
know this part) Windows nowadays probably uses UCS-encoded
multibyte/character filenames, or fixed two-byte encoded filenames, and
a simple byte-to-byte lookup table is obviously not suitable for this
conversion. Maybe the Macs also do multibyte character encoding in
filenames nowadays?
The point is, a more complex approach is needed to support this, and I
think this is what Adrian has termed support for double-byte
codepages.
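
To show why a byte-to-byte table falls short here, this small Python
sketch encodes the same name three ways. UTF-16 and UTF-8 are used only
as examples of a fixed two-byte and a multibyte encoding; I am not
claiming these are what Windows or the Mac actually use for filenames.

```python
name = "G\u00e4ste"  # a-umlaut as the second character

single_byte = name.encode("iso-8859-1")  # one byte per character
two_byte = name.encode("utf-16-le")      # two bytes per character here
multibyte = name.encode("utf-8")         # one or two bytes per character here

lengths = (len(single_byte), len(two_byte), len(multibyte))  # (5, 10, 6)
```

A 256-entry table maps one input byte to exactly one output byte, so it
cannot produce output that is longer than its input, which both of the
latter encodings require.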
Regards,
Andras Kadinger