Creeping Bloat in Phobos
Dmitry Olshansky via Digitalmars-d
digitalmars-d at puremagic.com
Sun Sep 28 13:08:29 PDT 2014
28-Sep-2014 23:44, Uranuz wrote:
>> I totally agree with all of that.
>>
>> It's one of those cases where correct by default is far too slow (that
>> would have to be graphemes) but fast by default is far too broken.
>> Better to force an explicit choice.
>>
>> There is no magic bullet for unicode in a systems language such as D.
>> The programmer must be aware of it and make choices about how to treat
>> it.
>
> I see. I didn't know about the difference between byCodeUnit and
> byGrapheme, because I speak Russian and it is close to English,
> because it doesn't have diacritics. As far as I remember, German,
> which I learned at school, has diacritics. So you opened my eyes
> on this question. My position as an ordinary programmer is that I
> speak a language whose graphemes are encoded in 2 bytes
In UTF-16 and UTF-8.
> and I always
> need to decode, otherwise my program will be broken. The other
> possibility is to use wstring or dstring, but that is less memory
> efficient. Also, UTF-8 is more commonly used on the Internet, so I
> don't want to do conversions to UTF-32, for example.
>
> Where I could read about byGrapheme?
std.uni docs:
http://dlang.org/phobos/std_uni.html#.byGrapheme
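To make the three levels concrete, here is a minimal sketch that counts code units, code points and graphemes for the same text. The helper names (codeUnits, codePoints, graphemes) are made up for illustration; byCodeUnit comes from std.utf, byGrapheme from std.uni, and the code point count relies on the default decoding of narrow strings.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, named here only for illustration.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t codePoints(string s) { return s.walkLength; } // narrow strings decode to dchar
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // Each of these Cyrillic letters is one code point, two UTF-8 code units.
    string ru = "мир";
    assert(codeUnits(ru) == 6);
    assert(codePoints(ru) == 3);
    assert(graphemes(ru) == 3);

    // The same text in UTF-16: three code units of two bytes each.
    wstring ru16 = "мир"w;
    assert(ru16.length == 3);

    // 'e' followed by U+0301 (combining acute accent):
    // two code points, but a single grapheme.
    string accented = "e\u0301";
    assert(codeUnits(accented) == 3);
    assert(codePoints(accented) == 2);
    assert(graphemes(accented) == 1);
}
```

For text like the first string, decoding to code points already gives the expected answer; byGrapheme only starts to matter once combining marks show up.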
> Isn't this approach
> overcomplicated? I don't want to write Dostoevskiy's book "War
> and Peace" just to write a parser for a simple DSL.
It's Tolstoy actually:
http://en.wikipedia.org/wiki/War_and_Peace
You don't need byGrapheme for a simple DSL. In fact, as long as the DSL
is simple enough (ASCII only) you may safely avoid decoding. If it's in
Russian you might want to decode. Even in that case there are ways to
avoid decoding, though it may involve about as much writing as a typical
short novel ;)
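As a rough sketch of the "avoid decoding" case: when the DSL's delimiters are ASCII, you can scan raw UTF-8 code units, because every byte of a multi-byte sequence is >= 0x80 and can never be mistaken for an ASCII character. The function name splitOnComma is hypothetical, and a real lexer would of course do more than split on commas.

```d
/// Hypothetical helper: splits `src` on ASCII ',' without decoding.
string[] splitOnComma(string src)
{
    string[] parts;
    size_t start = 0;
    // foreach with an explicit `char` element type walks code units,
    // so no UTF-8 decoding takes place here.
    foreach (i, char c; src)
    {
        if (c == ',')
        {
            parts ~= src[start .. i];
            start = i + 1;
        }
    }
    parts ~= src[start .. $];
    return parts;
}

void main()
{
    assert(splitOnComma("a,b,c") == ["a", "b", "c"]);
    // Non-ASCII payload passes through untouched; slices stay valid UTF-8
    // because cuts happen only at ASCII delimiters.
    assert(splitOnComma("имя,значение") == ["имя", "значение"]);
}
```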
In fact, I did a couple of such literary exercises in the standard library.
For codepoint lookups on non-decoded strings:
http://dlang.org/phobos/std_uni.html#.utfMatcher
And to create sets of codepoints to detect with the matcher:
http://dlang.org/phobos/std_uni.html#.CodepointSet
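A small sketch of how the two fit together, assuming the interface described in the std.uni docs (CodepointSet built from half-open intervals, and a matcher whose match method advances the string on success). The particular set and the test string are made up for illustration.

```d
import std.uni : CodepointSet, utfMatcher;

void main()
{
    // Half-open interval [а, я + 1): lowercase Cyrillic а..я.
    auto cyrLower = CodepointSet('а', 'я' + 1);
    assert(cyrLower['м']);   // Cyrillic em is in the set
    assert(!cyrLower['m']);  // Latin m is not

    // Matcher that classifies code points straight from UTF-8 code units.
    auto m = utfMatcher!char(cyrLower);

    string s = "мir";        // Cyrillic 'м' followed by Latin "ir"
    assert(m.match(s));      // matched: both code units of 'м' are consumed
    assert(s == "ir");
    assert(!m.match(s));     // 'i' is not in the set
    assert(s == "ir");       // input is left unchanged on a failed match
}
```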
--
Dmitry Olshansky