writefln and ASCII

Steve Horne stephenwantshornenospam100 at aol.com
Wed Sep 13 03:35:08 PDT 2006


On Tue, 12 Sep 2006 15:03:20 +0300, Serg Kovrov <kovrov at no.spam>
wrote:

>How do I writefln a string from ASCII file contained illegal UTF-8 
>characters, but legal as ASCII? For example ndash symbol - ASCII 0x96).

Just to add some angry ranting to what has already been said...

First, there was ASCII. ASCII had character codes 0 to 127.

Then, there was a whole bunch of codepages - different character sets
for different countries. These exploited characters 128 to 255, but
each codepage defined the characters differently. Some codepages had
multi-byte characters.

Then, there was Unicode. Unicode was supposed to make things easier.
But instead, it made things harder.

There are over a million possible codes in Unicode, around 100,000 of
them assigned so far. A code does not necessarily represent a
character - it may take several codes in sequence (for example, when
applying diacritics).
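
To make the "several codes, one character" point concrete, here is a
minimal sketch in Python (chosen only for brevity; nothing here is
specific to D). The variable names are mine, and NFC normalization is
just one way of folding the two forms together.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codes, one visible character.
decomposed = "e\u0301"

# NFC normalization folds the pair into the single precomposed code U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 2
print(len(composed))          # 1
print(composed == "\u00e9")   # True
```

Both strings display as the same accented letter, which is why counting
codes is not the same as counting characters.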

In addition, the codes are just numbers. There are several different
ways to encode them into streams of bytes or whatever.

In UTF-8, several bytes may be needed to specify a Unicode code, and
several codes may be needed to specify a single character. I don't
think any codepage has that level of complexity.

Oh, and let's not forget UTF-16, UTF-32 and other encodings.
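
As a rough illustration of how the same code turns into different
byte counts depending on the encoding, here is a small Python sketch.
The helper name encoded_sizes and the sample characters are my own
picks; the "-le" codec variants are used only so that no byte order
mark is counted.

```python
samples = {
    "A": "A",                # U+0041, plain ASCII
    "e-acute": "\u00e9",     # U+00E9
    "en dash": "\u2013",     # U+2013
    "G clef": "\U0001D11E",  # U+1D11E, outside the 16-bit range
}

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, text in samples.items():
    print(name, encoded_sizes(text))
# A (1, 2, 4)
# e-acute (2, 2, 4)
# en dash (3, 2, 4)
# G clef (4, 4, 4)
```

Only UTF-32 is fixed width per code; UTF-8 and UTF-16 both vary.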

And then, of course, understanding that many codes is a tall order.
Especially since the number isn't fixed, even now. Back when there
were 40,000 or so codes, people thought the set was almost complete -
hah hah hah! So Unicode explicitly allows applications and operating
systems not to understand every possible code. So you have the
Windows XP subset, the MacOS subset, the GTK subset etc etc etc.

And then there are byte order marks to worry about!
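
For what the byte order mark worry looks like in practice, here is a
hedged Python sketch. detect_bom is a hypothetical helper of mine, not
a library function; the BOM constants come from Python's standard
codecs module. A file with no mark tells you nothing, and a UTF-16
little-endian file whose first character after the mark is U+0000 would
be misread as UTF-32 by this check.

```python
import codecs

def detect_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    # UTF-32 is checked first: its little-endian mark (FF FE 00 00)
    # begins with the UTF-16 little-endian mark (FF FE).
    marks = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for mark, name in marks:
        if data.startswith(mark):
            return name
    return None

print(detect_bom(b"\xff\xfeA\x00"))  # utf-16-le
print(detect_bom(b"\xef\xbb\xbfA"))  # utf-8
print(detect_bom(b"A"))              # None
```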

So much for easier. So much for one standard.


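Anyway, the first 256 Unicode codes match ISO-8859-1 (Latin-1), which
is close to, but not the same as, the Windows Western codepage
(Windows-1252). Your 0x96 is not ASCII at all: it is the Windows-1252
en dash, which in Unicode is U+2013. Because of the way UTF-8 encoding
works, a genuine ASCII file is also a UTF-8 file. But a file that uses
bytes 128 to 255 from any codepage, Western included, is almost never
a valid UTF-8 file.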
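
Here is roughly what happens to that 0x96 byte, sketched in Python
rather than D. decode_legacy is a hypothetical helper, and falling back
to Windows-1252 ("cp1252") is an assumption about where the file came
from; the bytes themselves cannot tell you which codepage was meant.

```python
raw = b"pages 10\x9620"  # 0x96 as a Windows-1252 editor would write an en dash

def decode_legacy(data, fallback="cp1252"):
    """Try strict UTF-8 first; on failure, reinterpret the bytes in a codepage."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file was saved in the fallback codepage.
        return data.decode(fallback)

text = decode_legacy(raw)
print(text == "pages 10\u201320")  # True: 0x96 became U+2013 EN DASH
print(text.encode("utf-8"))        # b'pages 10\xe2\x80\x9320'
```

The same idea applies in D: convert the bytes from their real codepage
to UTF-8 before handing the string to writefln.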

-- 
Remove 'wants' and 'nospam' from e-mail.


