To Walter, about char[] initialization by FF

Mon Jul 31 13:13:44 PDT 2006

Derek, thanks for summarizing all this, but I would put it as follows.

There are two types of text encodings, for two distinct use cases:
  1) transport/storage encodings - one Unicode code point may be
      represented as multiple code units of the encoded sequence
      (e.g. UTF-8 or UTF-16).
      string.length returns the length in code units of the encoding,
      not in characters.

  2) manipulation encodings - one Unicode code point is represented
      as one and only one element of the sequence (e.g. one byte, word
      or dword).
      string.length here returns the length in code points (mapped
      character glyphs).
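
To make the difference between the two kinds of length concrete, here is
a small sketch. It is written in Python rather than D, purely for
illustration, and the sample string is an assumption chosen to contain
both Cyrillic letters and one character outside the BMP.

```python
# Illustrative sample: five Cyrillic letters plus one character outside the BMP.
text = "текст\U0001F600"

utf8 = text.encode("utf-8")

code_points = len(text)                            # Python str is a sequence of code points
utf8_units = len(utf8)                             # 8-bit code units
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(text.encode("utf-32-le")) // 4   # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)  # 6 14 7 6

# Slicing by code unit can cut a character in half: the first three
# UTF-8 code units hold one whole letter plus half of the next one.
try:
    utf8[:3].decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(cut_is_valid)  # False
```

Only the 32-bit form has one element per code point here, so only there
does the length of the sequence equal the number of code points.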

The problem, as I see it, is this:
D proposes to use a transport encoding for manipulation purposes,
which is the main problem here, IMO. Transport encodings are not
designed for manipulation, and it is extremely difficult to use
them for manipulation in practice, as we can see.

One more problem:

Encodings like UTF-8 and UTF-16 are almost useless with, let's say,
the Windows API, e.g. the TextOutA and TextOutW functions.
Neither of them will accept D's char[] or wchar[] directly.

- ***A functions in Windows take a byte string (LPSTR) and use the
  current codepage id to render text.
  ( byte + codepage = Unicode Code Point )

- ***W functions in Windows use LPWSTR strings, which are sequences
  of code points from the Unicode Basic Multilingual Plane (BMP).
  ( cast(dword) word = Unicode Code Point )
  Only a few functions in the Windows API treat LPWSTR as UTF-16.
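
The two mappings above can be sketched as follows. This is Python, not
the Windows API; the byte value 0xF0 and the sample characters are
assumptions picked for illustration, and "cp1251" and "koi8_r" are
Python's codec names for two Cyrillic code pages.

```python
import struct

# byte + codepage = code point: the same byte means different
# characters depending on which code page is in effect.
raw = bytes([0xF0])
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic code page
as_koi8r = raw.decode("koi8_r")    # KOI8-R
print(hex(ord(as_cp1251)), hex(ord(as_koi8r)))

# Inside the BMP, the single 16-bit unit is numerically the code point.
bmp_char = "\u0416"
bmp_units = struct.unpack("<1H", bmp_char.encode("utf-16-le"))
print(bmp_units[0] == ord(bmp_char))  # True

# Outside the BMP, UTF-16 needs two 16-bit units (a surrogate pair),
# and neither unit on its own equals the code point.
astral_char = "\U0001D11E"
astral_units = struct.unpack("<2H", astral_char.encode("utf-16-le"))
print([hex(u) for u in astral_units])  # ['0xd834', '0xdd1e']
```

Code that treats each 16-bit element as a whole character works for
the first case and goes wrong for the second.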

-----------------
"D strings are utf encoded sequences only" is a design mistake, IMO.
On disk (in serialized form) - yes. But not in memory for manipulation, please.

Andrew Fedoniouk.
http://terrainformatica.com

"Derek" <derek at psyc.ward> wrote in message 
news:177u058vq8cdj.koexsq99n112.dlg at 40tude.net...
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
>
>
>> ... but this is far from concept of null codepoint in character
>> encodings.
>
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
>
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8,
> and thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
>
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh
> these issues and there are ways to compensate for these too. Point (d) is
> a casualty of history and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
>
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you
> wish to use some other encodings, then use a more appropriate data
> structure for it. For example, to hold 'KOI-8' encodings of Russian text,
> I would recommend using ubyte[] instead. To transform char[] to any other
> encoding you will have to provide the functions to do that, as I don't
> think it is Walter's or D's responsibility to do it. The point of
> initializing UTF-8 strings with illegal values is to help detect coding or
> logical mistakes. And a leading octet with the value of hex-FF in a UTF-8
> encoded Unicode codepoint *is* illegal. If you must store an octet of
> hex-FF then use ubyte[] arrays to do it.
>
> -- 
> Derek Parnell
> Melbourne, Australia
> "Down with mediocrity!"
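
The hex-FF point in the quoted message can be checked with a short
sketch. It is in Python rather than D, for illustration only; the helper
is_valid_utf8 is a hypothetical name, and Python's bytes type stands in
for the plain byte array (ubyte[]) suggested above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A buffer filled with hex-FF, standing in for a char[] left at its
# initial value: it is never well-formed UTF-8, so a validating
# consumer will notice it.
ff_filled = bytes([0xFF] * 4)
print(is_valid_utf8(ff_filled))  # False

# Even the highest code point does not use the byte 0xFF.
highest = "\U0010FFFF".encode("utf-8")
print(0xFF in highest)  # False

# Russian text in KOI8-R is just bytes; it is not valid UTF-8, which
# is why it belongs in a plain byte array rather than a UTF-8 string.
koi8_text = "да".encode("koi8_r")
print(is_valid_utf8(koi8_text))  # False
print(is_valid_utf8("да".encode("utf-8")))  # True
```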