Top 5

Sat Oct 11 13:38:54 PDT 2008

Benji Smith wrote:
> Andrei Alexandrescu wrote:
>> bearophile wrote:
>>> Benji Smith:
>>>> Java's design decision to always use two-byte characters is a
>>>> superior choice,<
>>>
>>> It's a design error caused by the early adoption of unicode by Java,
>>> because unicode needs 4 bytes. So it may lead to problems.
>>
>> I agree. I find it odd that anyone finds Java's character choice 
>> superior now, when it's acknowledged it missed the mark somewhat 
>> dramatically (only a short time shy of UTF-32 adoption).
>>
>> Andrei
> 
> I think you make a good point. It never occurred to me before, though, 
> because I've never actually run across it in the last eight years of 
> Java coding.

Well, it does occur, and the fact that it occurs so rarely makes it 
all the more catastrophic when it hits the unprepared. A friend of 
mine working at Adobe told me they have had huge issues with very, very 
rare 4-byte surrogate pairs occurring amid otherwise tame 16-bit 
characters.
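
To make that failure mode concrete, here is a minimal Java sketch. The 
class name SurrogateDemo, its helper methods, and the sample string are 
made up for illustration and have nothing to do with the Adobe case; 
only the java.lang.String and java.lang.Character calls are real API.

```java
// Hypothetical demo class; the name and sample text are illustrative only.
class SurrogateDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
    // stores it as the surrogate pair D834 DD1E.
    static final String SAMPLE = "a\uD834\uDD1Eb";

    // Number of 16-bit code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of actual Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Naive truncation by code units; it may cut a surrogate pair in half.
    static String naiveTruncate(String s, int units) {
        return s.substring(0, Math.min(units, s.length()));
    }

    // Truncation that counts code points and never splits a pair.
    static String safeTruncate(String s, int points) {
        int n = Math.min(points, codePoints(s));
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // 4
        System.out.println(codePoints(SAMPLE));  // 3
        // The naive cut ends in a lone high surrogate (ill-formed UTF-16).
        String cut = naiveTruncate(SAMPLE, 2);
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true
        System.out.println(safeTruncate(SAMPLE, 2).length());         // 3
    }
}
```

Code that only ever sees BMP text passes every test with the naive 
version, which is why the bug stays hidden until the rare input arrives.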

> But if you think java's implementation is a design mistake, because of 
> sloppy integration across two-byte/four-byte lines, isn't D's string 
> design guilty of the same mistake, but also across the one-byte/two-byte 
> line?

I don't think so, because D openly acknowledges 8/16/32-bit encodings, 
whereas Java only does 16 and kind of acts as if 32-bit surrogate pairs 
don't exist. I honestly think D is on the brink of receiving the best 
character processing abilities of all languages in existence. It has 
openly embraced the reality of multi-byte characters when others either 
try to settle on one-size-fits-all or sweep the issue under the rug. The 
best encoding of the day, UTF, which is there to stay, is the standard 
embraced by the language.
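
As a rough illustration of why the three widths matter, the sketch 
below counts how many 8-, 16-, and 32-bit code units the same text 
needs (the units D calls char, wchar, and dchar). It is written in 
Java only to stay comparable with the discussion above; the class name 
EncodingWidths and its methods are hypothetical, and the UTF-32 count 
is derived from the code point count rather than from an actual UTF-32 
encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class EncodingWidths {
    // 'A' (1 UTF-8 byte), U+00E9 (2 bytes), U+20AC (3 bytes),
    // U+1D11E (4 bytes, written here as a UTF-16 surrogate pair).
    static final String SAMPLE = "A\u00E9\u20AC\uD834\uDD1E";

    // 8-bit code units: the length of the UTF-8 encoding in bytes.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // 16-bit code units: Java strings are already UTF-16.
    static int utf16Units(String s) {
        return s.length();
    }

    // 32-bit code units: one per code point (assumes well-formed input).
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(utf8Units(SAMPLE));   // 10
        System.out.println(utf16Units(SAMPLE));  // 5
        System.out.println(utf32Units(SAMPLE));  // 4
    }
}
```

Only the 32-bit form has one unit per code point; both narrower forms 
are variable-length, so neither can be indexed by character in 
constant time.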

In addition, the std.encoding module (where's Janice? sigh) is very 
promising in that it offers open-ended support to other current and 
possibly future encodings. I plan to work on that at some time to make 
it fast, because its use of delegates is inefficient. The advent of 
ranges also clarifies that the right way to treat a string of any 
encoding as a collection of characters is a bidirectional range: you can 
move forward or backward, but there's no random access. Once a library 
type UTFRange is in place, that will work with all current and future 
algorithms accepting bidirectional ranges. Insertion and replacement in 
strings are not easy to code, but they are certainly doable and most of 
the time as efficient as for regular arrays.
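
The bidirectional idea can be sketched as follows. This is not the 
UTFRange type mentioned above, which does not exist yet; it is a 
hypothetical Java class (CodePointRange, with method names of my own 
choosing) that walks a UTF-16 string one code point at a time from 
either end and deliberately offers no indexing.

```java
// Hypothetical sketch of a bidirectional view over a UTF-16 string.
// It exposes whole code points from either end and no random access.
class CodePointRange {
    private final String s;
    private int lo;  // index of the first remaining code unit
    private int hi;  // one past the last remaining code unit

    CodePointRange(String s) {
        this.s = s;
        this.lo = 0;
        this.hi = s.length();
    }

    boolean empty() {
        return lo >= hi;
    }

    // First remaining code point (decodes a surrogate pair if present).
    int front() {
        return s.codePointAt(lo);
    }

    // Last remaining code point (decodes a surrogate pair if present).
    int back() {
        return s.codePointBefore(hi);
    }

    // Advance past one code point, which is 1 or 2 code units wide.
    void popFront() {
        lo += Character.charCount(front());
    }

    // Retreat past one code point, which is 1 or 2 code units wide.
    void popBack() {
        hi -= Character.charCount(back());
    }

    // Example algorithm that needs only the backward half of the interface.
    static String reversed(String s) {
        StringBuilder sb = new StringBuilder();
        for (CodePointRange r = new CodePointRange(s); !r.empty(); r.popBack()) {
            sb.appendCodePoint(r.back());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "a\uD834\uDD1Eb";  // 'a', U+1D11E, 'b'
        // The surrogate pair survives intact in the reversed string.
        System.out.println(reversed(sample).equals("b\uD834\uDD1Ea"));
    }
}
```

Each step costs constant time, but reaching the n-th character means 
stepping over the n characters before it, which is exactly the 
"no random access" property.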

Andrei