Subject: Re: What are codepages?
From: Andras Kadinger (bandit@freeside.elte.hu)
Date: Sun Jul 09 2000 - 11:09:00 EDT


Peter Gutowski wrote:
>
> I'll risk betraying my Macintosh roots, but would somebody please explain what 'codepages' are, and what they're used for. I've always got the sense that it had something to do with character encoding, but have no idea how to appreciate and evaluate what a sun wrote last month...

Computers in the last few decades (historians would probably cringe at
such a vague designation) have used one byte to represent a
character; but a byte contains a number, not a character, so there
had to be an agreement defining which number should represent which
character. One of the early ones, still very popular on most computers,
is ASCII (which, if memory serves, stands for American
Standard Code for Information Interchange); it defined numbers (codes)
for the 52 lowercase and uppercase letters of the English alphabet, the
digits, punctuation, and also some control codes, like line feed,
carriage return, end of transmission, and so on. Although it is rarely
called one, ASCII can very well be considered a codepage, as it lists
'on a page' the numbers (codes) assigned to characters, in turn making
it possible and easy to use numbers to convey textual messages.
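
To make this concrete, here is a tiny illustration in Python (just a
convenient way to look at the bytes - nothing netatalk-specific): the
text 'Hi!' is stored as the three ASCII codes 72, 105 and 33.

    # Characters <-> numbers under ASCII (illustration only, Python 3).
    text = "Hi!"
    codes = [ord(c) for c in text]         # characters -> numbers
    print(codes)                           # [72, 105, 33]
    print(bytes(codes).decode("ascii"))    # numbers -> characters: Hi!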

However, there are more characters in use than those of the English
alphabet, and the codes in the ASCII codepage are unable to convey
them. So ASCII had to be extended. One of the very first attempts was
IBM's, which took shape as the so-called IBM PC character set:
an extension of ASCII with characters from some common western languages
(French, German, Swedish, Spanish, etc.) and, to help the first IBM
PCs that had no or limited graphics capabilities, some line-drawing and
other graphical characters. This was in wide use in the character-based
world of MS-DOS, before Windows became overwhelmingly popular.

However, this character set still lacked numerous characters from
yet other languages, while all 256 character codes offered by one
byte were already taken up in the IBM PC character set. Upon realizing
this, Microsoft decided to keep the then widely adopted one byte -
one character design and created different character sets for the
different language regions by replacing parts of the characters in the
IBM PC character set; I think they coined the term 'codepage', as
the user could 'turn pages' - change the character set displayed - with
a command. This, however, led to incompatibility, as codes meaning one
character in one codepage meant a completely different one in another
codepage. Encoded texts were no longer self-contained: the codepage
used at encoding time had to be conveyed as well to allow correct
display and entry of the text. There was no unified, universal
approach to conveying this information; usually users had to set
the right codepage manually, and even then they were unable to properly
display text they received written in a codepage not installed on their
particular computer.
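
To see the problem in action, here is a small sketch in Python (the
codec names 'cp437', 'cp850' and 'cp1252' are Python's names for the
IBM PC, DOS Western European and Windows Western European codepages,
used here purely for illustration): the very same byte value stands for
a different character in each codepage.

    # One byte, three codepages, three different characters.
    import unicodedata
    for codec in ("cp437", "cp850", "cp1252"):
        ch = bytes([0xE9]).decode(codec)
        print(codec, unicodedata.name(ch))
    # cp437  GREEK CAPITAL LETTER THETA
    # cp850  LATIN CAPITAL LETTER U WITH ACUTE
    # cp1252 LATIN SMALL LETTER E WITH ACUTE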

The Macintosh, designed in roughly the same era, at first also
adopted the one byte - one character design, but used yet another
character set with characters not found elsewhere (like the Apple
symbol), and a different codepage (nowadays called MacRoman, AFAIK) that
laid out even the more widely used non-ASCII characters (e.g. accented
French letters) over the possible codes differently. So even the subset
of non-ASCII characters common to both Macintosh and Windows
couldn't be conveyed between the two without some kind of conversion.
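
Again just as an illustration (Python codec names, not anything from the
Mac OS or Windows APIs): the French 'é' sits at a different byte value
in MacRoman than in the Windows Western codepage, so the raw bytes have
to be converted when text moves between the two.

    # 'é' is stored as a different byte on the Mac than on Windows.
    print(hex("é".encode("mac_roman")[0]))   # 0x8e
    print(hex("é".encode("cp1252")[0]))      # 0xe9
    # Reading the MacRoman byte as if it were CP1252 gives the wrong character:
    print(b"\x8e".decode("cp1252"))          # 'Ž', not 'é'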

Some time later, decades after ASCII, standards came about for one-byte
character sets with a philosophy similar to Microsoft's codepages,
standardizing the sets of characters and their codes for
about a dozen language regions: the ISO 8859 series, also frequently
called the isolatin codepages or character sets. Now at least people
interested in using one particular language, or a few, could ensure that
their texts would be correctly readable when using equipment and
software conforming to these standards.
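
The ISO 8859 parts still reuse the same range of codes for different
characters - they just do so in a standardized, documented way. A quick
Python illustration:

    # The byte 0xF5 in two different ISO 8859 parts:
    b = bytes([0xF5])
    print(b.decode("iso8859_1"))   # 'õ' (Western European, ISO 8859-1)
    print(b.decode("iso8859_2"))   # 'ő' (Central European, ISO 8859-2)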

But this still wasn't the ultimate solution, and it slowly became
obvious to all parties involved, especially with globalization, that the
one byte - one character approach is a rather serious bottleneck in the
flow of communication. So, striving for a perfect, all-encompassing
solution, a new standard dubbed Unicode emerged, which used two
bytes to encode one character; that means 65536 possible character
codes - enough to encode all characters and symbols in all the languages
of the world, one would think at first. However, in the following years
it became clear that this is still not enough, due to the many tens
of thousands of characters (ideograms) used mostly in Asian ideographic
languages (Chinese, Japanese, Korean, etc.); so even Unicode had to be
extended to provide more than one 65536-symbol block. It also became
apparent that the fixed two byte - one character approach is lacking in
some respects as well: many software routines would have to be rewritten
and rethought, and since, as it turned out, there are more characters to
encode than fit into two bytes, it would have been a futile
attempt anyway.
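
A quick way to convince yourself that two bytes per character are not
enough (again a Python illustration; U+20000 is simply one of the CJK
ideographs that ended up outside the first 65536 codes):

    # A character beyond the first 65536 codes cannot fit in a fixed
    # two-bytes-per-character scheme; UTF-16 needs four bytes
    # (a surrogate pair) for it.
    ch = chr(0x20000)
    print(0x20000 > 0xFFFF)               # True
    print(len(ch.encode("utf-16-le")))    # 4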

Further methods were invented to represent the more than 65536 character
codes in bytes, ones that try to provide backwards compatibility.
Indeed, if a text only uses ASCII characters and control codes, it looks
and behaves exactly the same as if it were written in plain old ASCII,
and only when some of the more 'exotic' characters are used does it
become a bit 'cryptic'; but even then, byte values that represent valid
characters or control codes in the ASCII character set aren't used, so
algorithms able to handle one-byte-encoded text won't get confused by
the new encoding, and as a result less program code has to be rewritten.
These are the UTF encodings (UCS Transformation Formats); the one just
described is UTF-8.
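
Here is the behaviour described above, sketched in Python with UTF-8:
plain ASCII text encodes byte-for-byte identically, and the bytes used
for 'exotic' characters all lie outside the ASCII range.

    # ASCII text is unchanged under UTF-8...
    ascii_text = "plain old ASCII text\r\n"
    print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True

    # ...and every byte UTF-8 uses for a non-ASCII character is >= 0x80,
    # so no ASCII letter, digit or control code ever appears by accident.
    for ch in ("é", "ő", "€"):
        data = ch.encode("utf-8")
        print(ch, [hex(b) for b in data], all(b >= 0x80 for b in data))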

Striving for perfection is (probably) an unreachable goal, and as a
result Unicode is still (and probably will be for years to come) under
continuous improvement. Nonetheless, stabilized parts of it are rarely
(if ever) touched in a way that would break compatibility with products
already compliant with Unicode; AFAIK the current work is mostly
extension - discovering and codifying the characters/symbols of more and
more written languages of the world and adding them to Unicode. The
encoding mechanism has, AFAIK, been extended to support at least 2^31
characters (that's more than two billion [thousand million] characters,
roughly a third of a character for each person living on Earth), but I
assume it can be extended relatively easily even further.
Hopefully we should only expect the next major revision of text encoding
standards around the time interplanetary/intercivilization
communications become imminent.

Back to Adrian's work, and how this all affects netatalk:

One increasingly important aspect of netatalk, and of the server
product made by the company Adrian works for, is providing
cross-platform file services to Windows, Macintosh and unix computers.
Some of these platforms (probably for the sake of backwards
compatibility) still use a one byte - one character method to represent
filenames, but with different codepages, while some may already use
Unicode with a UTF encoding; so transcoding between them is needed in
order to give people the closest rendition of the original filename
possible on their platform, accounting for the differences in
codepages/encodings. If transcoding is not present, filenames entered on
one platform could be illegible on other platforms, or could even
contain disallowed characters. E.g. Macs allow '/', while unix uses it
as the directory separator and Windows doesn't allow it either;
conversely, ':' and '\', perfectly valid in unix, are illegal in
Macintosh and Windows filenames respectively, as they are used as the
directory separator there. Without transcoding, this could cause some
files to become inaccessible from some platforms - a functional
deficiency, hardly to be called seamless cross-platform interconnection.

This transcoding mechanism is called codepage support.
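
Just to sketch the idea (this is not netatalk's actual code, and the
':xx' escaping convention below is made up purely for illustration): a
filename stored in MacRoman can be re-encoded for, say, a Latin-1 unix
client, with characters that are illegal or unrepresentable on the
target replaced by an escape sequence so no file becomes unreachable.

    # Hypothetical filename transcoding sketch (Python 3).
    UNIX_ILLEGAL = {"/"}   # '/' is the unix directory separator

    def mac_to_unix(mac_bytes: bytes) -> str:
        out = []
        for byte in mac_bytes:
            ch = bytes([byte]).decode("mac_roman")
            if ch in UNIX_ILLEGAL:
                out.append(":%02x" % byte)     # escape the path separator
                continue
            try:
                ch.encode("latin_1")           # representable for this client?
                out.append(ch)
            except UnicodeEncodeError:
                out.append(":%02x" % byte)     # escape unmappable characters
        return "".join(out)

    # 0x8e is 'é' in MacRoman; '/' is legal on the Mac but not under unix.
    print(mac_to_unix(b"Caf\x8e 1/2"))   # Café 1:2f2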

From Adrian's announcement message:
> > it has the _test suffix because i currently have the
> > codepage support disabled. i'm in the midst of re-writing
> > it to handle double-byte codepages.

I hope that, after this short tour, the sentences above are now understandable.

Regards,
Andras Kadinger


