ICU D Wrapper

Trent Forkert via Digitalmars-d digitalmars-d at puremagic.com
Sat Dec 13 09:28:24 PST 2014


On Saturday, 13 December 2014 at 15:44:59 UTC, Sean Kelly wrote:
> On Friday, 12 December 2014 at 17:57:41 UTC, Trent Forkert 
> wrote:
>>
>> I've looked into writing a binding for ICU recently, but 
>> ultimately decided to abandon that idea in favor of writing a 
>> replacement for it in D.
>
> Wow... really?  You're actually going to write transcoders for 
> all available encodings? Plus the conversion and parsing tools, 
> plus expand our calendar functionality to handle the things it 
> doesn't do now, plus...  I mean I'd love it, but the scope of 
> the project can be measured in tens of man-years.

Running down the icu4c API listing:

* Basic Types and Constants - only as needed
* Strings and character iteration - Just use D strings, std.string
* Unicode character properties and names - I think std.uni 
handles this
* Sets of Unicode Code Points and Strings - ditto
* Codepage conversion - ignoring, at least for now. See below.
* Unicode text compression - again, I think std.uni handles this
* Locales - yes
* Resource Bundles - will offer equivalent functionality, just 
not identical
* Normalization - std.uni
* Calendars - see below
* Date and time formatting - yes
* Message formatting - yes
* Number formatting / spell-out - yes
* Transliteration - yes, but may be delayed until after initial 
release
* Bidirectional Algorithm - not at first; is this in std.uni?
* Arabic shaping - not at first; is this in std.uni?
* Collation - I'm delaying this until after the initial release 
to get it out faster
* String searching - depends on Collation
* Index characters - depends on Collation
* Text Boundary analysis - depends on Collation
* Regular Expression - use std.regex
* StringPrep - not initially; is this in std.uni?
* IDNA - not initially; is this in Phobos?
* Identifier spoofing and confusability - not initially
* Layout engine - delayed, looks like ICU is removing this and 
pointing to another library
* Universal Time Scale - see below
* ICU I/O - use Phobos

Very few of the items above cannot be generated from CLDR data. 
Of those that cannot, most are RFC-defined algorithms, several of 
which I believe are already part of Phobos.

If I add codepage conversion, it will likely be in terms of iconv 
on POSIX and MultiByteToWideChar and friends on Windows. 
Alternatively, I could "borrow" the IBM CDRA/UCM data the way I'm 
getting almost everything else from CLDR data.
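
For reference, the POSIX route mentioned above could look roughly 
like this in C. This is only a sketch: the helper name to_utf8 and 
the fixed UTF-8 target are assumptions made for illustration, not 
part of any planned API. iconv_open, iconv and iconv_close are the 
standard POSIX calls.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from the post): convert `inlen` bytes of `in`,
 * encoded in the codepage named by `from`, to UTF-8 in `out`.
 * Returns the number of bytes written, or (size_t)-1 on failure
 * (unknown codepage, invalid input, or output buffer too small). */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    assert(from != NULL && in != NULL && out != NULL);

    /* Note the argument order: iconv_open(tocode, fromcode). */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv takes char ** for the input but does not modify the bytes. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;

    return outcap - outleft;
}
```

For example, calling to_utf8("ISO-8859-1", "caf\xE9", 4, buf, sizeof buf) 
writes the five bytes "caf\xC3\xA9". On some systems, such as macOS, 
iconv lives in a separate library and needs -liconv at link time. The 
Windows route would wrap MultiByteToWideChar in a similar way.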

Support for other calendar systems is up in the air at the moment. 
I had thought CLDR contained what I needed, but it looks like it 
might not. It has locale-specific formatting and display info for 
calendars, and mappings of when other calendars' eras begin in 
terms of the Gregorian calendar, but I don't see any further 
breakdown of information. So, initially it looks like I'll only 
be supporting the Gregorian calendar, but I may add the others in 
the future.

It is a lot of work, yes, but the Unicode Consortium already does 
a significant chunk of it with CLDR.

  - Trent

