First Impressions

Chad J "gamerChad" at spamIsBad gmail.com
Fri Sep 29 11:21:36 PDT 2006


BCS wrote:
> Johan Granberg wrote:
> 
>>
>>
>> I completely agree: char should hold a character, independent of 
>> encoding, and NOT a code unit or something else. I think it would be
>> beneficial to D in the long term if chars were done right (meaning 
>> that they can store any character). How it is implemented is not 
>> important, and I believe performance is not a problem here, so ease 
>> of use and correctness would be appreciated.
> 
> 
> Why isn't performance a problem?
> 
> If you are saying that this won't cause performance hits in run times or 
>  memory space, I might be able to buy it, but I'm not yet convinced.
> 
> If you are saying that causing a performance hit in run times or memory 
> space is not a problem... in that case I think you are dead wrong and 
> you will not convince me otherwise.
> 
> In my opinion, any compiled language should allow fairly direct access 
> to the most efficient practical means of doing something*. If I didn't 
> care about speed and memory I would use some sort of scripting language.
> 
> A good set of libs should make most of this moot. Leave the char as is 
> and define a typedef struct or whatever that provides the added 
> functionality that you want.
> 
> * OTOH a language should not mandate code to be efficient at the expense 
> of ease of coding.

I will go ahead and say that the current state of char[] is incorrect. 
That is, if you write a program manipulating char[] strings and then run 
it in China, you will be disappointed with the results.  It won't matter 
how fast the program runs, because bad things will happen, like entire 
strings becoming unreadable to the user.

Technically, if you follow UTF and do your char[] manipulations very 
carefully, the result is correct; but realistically, few if any people 
will do such things (I won't).  Also, if you do this, your program will 
probably run as slow as one with the proposed char/string solution, 
maybe slower (since language/stdlib-level support can be heavily 
optimized).

What I'd like, then, is a program that is correct first, and as fast as 
possible while still being correct.

Sure, you can get some speed gains by just using ASCII and saying to 
hell with UTF, but you should probably only do that when profiling has 
shown that such speed gains are actually useful/needed in your program.

Ultimately we have to decide whether we want D to default to UTF code, 
which might run slightly slower but allows better localization and 
international friendliness, or to default to ASCII or some such encoding 
that runs slightly faster but is mostly limited to English.

I'd like the default to be UTF.  Then we can have a base of code to 
correctly manipulate UTF strings (in phobos and supported by the 
language).  Writing correct ASCII manipulation routines without good 
library/language support is a lot easier than writing good UTF 
manipulation routines without such support, and UTF will probably be 
used much more than ASCII.

Also, if we move over to full-blown UTF, we won't have to give up ASCII. 
  It seems to me like the phobos std.string functions are pretty much 
ASCII string manipulation functions (no multibyte string support).  So 
just copy those out to a separate library, call it "ASCII lib", and 
there's your library support for ASCII.  That leaves string literals, 
which are a slight problem, but I suppose easily fixed:
ubyte[] hi = "hello!"a;
Just add a postfix 'a' for strings, which makes the string an ASCII 
literal of type ubyte[].  D arrays don't seem powerful enough to do UTF 
manipulations without special attention, but they are powerful enough to 
do ASCII manipulations without special attention, so using ubyte[] as an 
ASCII string should give full language support for these.  Given that 
and the ASCII lib, you pretty much have the current D string 
manipulation capabilities afaik, and it will be fast.
