Formal Review of std.uni

Dmitry Olshansky dmitry.olsh at gmail.com
Sun May 12 12:27:59 PDT 2013


30-Apr-2013 23:17, Jonathan M Davis wrote:
> On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote:
>> Unicode --> can't be done on character by character basis
>
> Sure it can. It operates on dchar.

Getting back to this.

No, it can't. I hate to break the illusion, but the keyword here is,
e.g., Unicode case folding. Another one is combining character
sequences.
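
To make both points concrete, here is a minimal sketch. It is written in
Python purely for illustration and is not code from std.uni; the helper
names naive_icmp_equal, folded_equal and canonically_equal are made up
for the example.

```python
import unicodedata

def naive_icmp_equal(a, b):
    # ASCII-style approach: walk both strings in lockstep and compare
    # one lowercased code point at a time.
    if len(a) != len(b):
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))

def folded_equal(a, b):
    # Full case folding works on whole strings, because a single code
    # point may fold to several (U+00DF folds to "ss").
    return a.casefold() == b.casefold()

def canonically_equal(a, b):
    # Compare after canonical composition (NFC), so a precomposed
    # letter matches its base letter + combining mark spelling.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(naive_icmp_equal("stra\u00dfe", "STRASSE"))  # False: lengths differ
print(folded_equal("stra\u00dfe", "STRASSE"))      # True
print("\u00e9" == "e\u0301")                       # False: 1 vs 2 code points
print(canonically_equal("\u00e9", "e\u0301"))      # True
```

The per-code-point comparison has no way to match one code point against
two, which is exactly what full case folding and combining sequences
require.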

> So, with how it's been, std.uni would only be operating on dchars, and putting
> a function in there which operated on strings wouldn't make any sense. Maybe
> that doesn't work if you've done a bunch of grapheme stuff, and things will
> have to be adjusted, but it would be a definite shift to put anything in
> std.uni which operated on strings, and I think that it would need some definite
> justification (and there's a good chance that I'd be inclined to argue that it
> should still go in std.string, possibly using some internal modules if
> necessary).

The justification is that we'd rather have exactly one module dealing
with a bunch of Unicode data arranged into intricate tables.

Strictly speaking, I'd abolish every Unicode-related algorithm in
std.string, since it's almost certainly doing it wrong anyway (I've
checked only 2, and both are broken).

There is not a single sign of the Unicode standard being used, just the
fallacious logic: byte --> dchar and use the same algorithm as with
ASCII. It won't work.

>
> But clearly I need to take the time to take a look at what you've actually
> done (I keep meaning to but haven't gotten around to it yet). It had been my
> impression that what you were doing was primarily a matter of improving the
> implementation, but it sounds like you're doing something beyond that.

Take a peek at icmp and sicmp in the new std.uni.
The current fork of Phobos is here:
https://github.com/blackwhale/phobos/tree/new-std-uni

Eventually we'd have to do a bit more in the same direction, e.g. title
casing, splitting by word boundary, etc. (all of these need fixing in
std.string).
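
As a small illustration of why title casing needs its own Unicode data
(a Python sketch, not std.uni code): some characters have a titlecase
form that differs from their uppercase form, so "uppercase the first
letter" is not a correct title-casing rule.

```python
import unicodedata

dz_small = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON (a digraph)

upper = dz_small.upper()  # uppercase mapping of the digraph
title = dz_small.title()  # titlecase mapping of the digraph

print(f"U+{ord(upper):04X}", unicodedata.category(upper))
print(f"U+{ord(title):04X}", unicodedata.category(title))
print(upper == title)
```

The uppercase form lands in general category Lu and the titlecase form
in Lt, and they are distinct code points.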

Also, all of the core tools are now in the open: CodepointSet, and
generating Tries from sets and AAs (associative arrays).
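
For readers unfamiliar with the idea, here is a rough sketch of what a
code point set and a trie generated from it can look like. It is Python
for illustration only; IntervalSet, build_trie and trie_contains are
hypothetical names, and the layout (sorted intervals, a two-level table
with 256-entry pages) is an assumption for the example, not a
description of how std.uni implements CodepointSet or its Tries.

```python
from bisect import bisect_right

class IntervalSet:
    """Code point set stored as sorted, non-overlapping [start, end) intervals."""

    def __init__(self, intervals):
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Flat boundary list: [s0, e0, s1, e1, ...]
        self.bounds = [b for pair in merged for b in pair]

    def __contains__(self, cp):
        # An odd number of boundaries <= cp means cp is inside an interval.
        return bisect_right(self.bounds, cp) % 2 == 1

def build_trie(cpset, page_bits=8, limit=0x110000):
    """Build a two-level lookup table; identical pages are stored once."""
    page_size = 1 << page_bits
    pages, index, seen = [], [], {}
    for base in range(0, limit, page_size):
        page = tuple((base + i) in cpset for i in range(page_size))
        if page not in seen:
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages

def trie_contains(trie, cp, page_bits=8):
    # Two array lookups, no searching: high bits pick the page,
    # low bits pick the slot inside it.
    index, pages = trie
    return pages[index[cp >> page_bits]][cp & ((1 << page_bits) - 1)]

# ASCII letters plus the Cyrillic block, as a toy set.
letters = IntervalSet([(0x41, 0x5B), (0x61, 0x7B), (0x400, 0x500)])
trie = build_trie(letters)
print(trie_contains(trie, ord("q")), trie_contains(trie, ord("1")))
print(len(trie[1]), "unique pages")
```

The interval form is compact and easy to combine with set operations;
the trie trades some memory for constant-time lookups, and page sharing
keeps that cost small because most of the code space looks the same.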


-- 
Dmitry Olshansky

