Top 5
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sat Oct 11 13:38:54 PDT 2008
Benji Smith wrote:
> Andrei Alexandrescu wrote:
>> bearophile wrote:
>>> Benji Smith:
>>>> Java's design decision to always use two-byte characters is a
>>>> superior choice,<
>>>
>>> It's a design error caused by the early adoption of unicode by Java,
>>> because unicode needs 4 bytes. So it may lead to problems.
>>
>> I agree. I find it odd that anyone finds Java's character choice
>> superior now, when it's acknowledged it missed the mark somewhat
>> dramatically (only a short time shy of UTF-32 adoption).
>>
>> Andrei
>
> I think you make a good point. It never occurred to me before, though,
> because I've never actually run across it in the last eight years of
> Java coding.

Well, it does occur, and the fact that it occurs less frequently makes it
all the more catastrophic when it hits the unprepared. A friend of
mine working at Adobe told me they have had huge issues with very, very
rare 4-byte surrogate pairs occurring in otherwise tame 16-bit text.
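
To make the failure mode concrete, here is a minimal sketch in Java. The class name SurrogateDemo and the sample string are made up for illustration; only standard java.lang calls are used. It shows how code that treats every 16-bit char as a whole character miscounts and corrupts text containing a single character outside the Basic Multilingual Plane.

```java
// Hypothetical demo class; the sample string is illustrative, not from the thread.
class SurrogateDemo {
    // Number of 16-bit char units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Naive reversal that assumes every char is a complete character.
    static String naiveReverse(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[in.length - 1 - i] = in[i];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) needs two chars: \uD834 \uDD1E.
        String clef = "a\uD834\uDD1Eb";
        System.out.println("char units:  " + codeUnits(clef));
        System.out.println("code points: " + codePoints(clef));

        // After the naive reversal the pair is swapped, so the low surrogate
        // now comes first and the string is no longer well-formed UTF-16.
        String reversed = naiveReverse(clef);
        System.out.println("low surrogate at index 1: "
                + Character.isLowSurrogate(reversed.charAt(1)));
    }
}
```

The two counts differ for this input, and the reversed string contains unpaired surrogates. Text made only of BMP characters would pass through the same code untouched, which is why such bugs can stay hidden for years.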
> But if you think java's implementation is a design mistake, because of
> sloppy integration across two-byte/four-byte lines, isn't D's string
> design guilty of the same mistake, but also across the one-byte/two-byte
> line?

I don't think so, because D openly acknowledges 8/16/32-bit encodings,
whereas Java only does 16 and kind of acts as if 4-byte surrogate pairs
don't exist. I honestly think D is on the brink of receiving the best
character processing abilities of all languages in existence. It has
openly embraced the reality of multi-byte characters when others either
try to settle on one-size-fits-all or sweep the issue under the rug. The
best encoding family of the day, UTF, which is here to stay, is the
standard embraced by the language.
In addition, the std.encoding module (where's Janice? sigh) is very
promising in that it offers open-ended support to other current and
possibly future encodings. I plan to work on that at some time to make
it fast, because its use of delegates is inefficient. The advent of
ranges also clarifies that the right way to treat a string of any
encoding as a collection of characters is a bidirectional range: you can
move forward or backward, but there's no random access. Once a library
type UTFRange is in place, that will work with all current and future
algorithms accepting bidirectional ranges. Insertion and replacement in
strings are not easy to code, but they are certainly doable and most of
the time as efficient as for regular arrays.
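
The bidirectional-range idea can be sketched in Java over a UTF-16 string. This is only an illustration under stated assumptions: CodePointCursor and its methods are hypothetical names, not a JDK class and not the proposed D UTFRange. The cursor can consume whole code points from either end, but it deliberately offers no index-based access.

```java
// Hypothetical sketch; CodePointCursor is not a JDK class and is not D's UTFRange.
class CodePointCursor {
    private final String text;
    private int front; // index of the first unread char
    private int back;  // index one past the last unread char

    CodePointCursor(String text) {
        this.text = text;
        this.front = 0;
        this.back = text.length();
    }

    boolean isEmpty() {
        return front >= back;
    }

    // Consume and return the code point at the front of the range.
    int popFront() {
        if (isEmpty()) throw new IllegalStateException("empty range");
        int cp = text.codePointAt(front);
        front += Character.charCount(cp); // advance by 1 or 2 chars
        return cp;
    }

    // Consume and return the code point at the back of the range.
    int popBack() {
        if (isEmpty()) throw new IllegalStateException("empty range");
        int cp = text.codePointBefore(back);
        back -= Character.charCount(cp); // retreat by 1 or 2 chars
        return cp;
    }

    // Example algorithm written only against the two-ended interface.
    static String reverse(String s) {
        CodePointCursor c = new CodePointCursor(s);
        StringBuilder sb = new StringBuilder(s.length());
        while (!c.isEmpty()) {
            sb.appendCodePoint(c.popBack());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E is stored as the surrogate pair \uD834 \uDD1E.
        String clef = "a\uD834\uDD1Eb";
        System.out.println(reverse(clef).equals("b\uD834\uDD1Ea"));
    }
}
```

Because reverse only ever moves one code point at a time from the ends, it keeps the surrogate pair intact without any special-casing in the algorithm itself. That is the property the range abstraction is meant to provide.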
Andrei