[Issue 5543] to!int to see a char as a single-char string

d-bugmail at puremagic.com d-bugmail at puremagic.com
Fri Dec 21 08:01:00 PST 2012


http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #11 from Dmitry Olshansky <dmitry.olsh at gmail.com> 2012-12-21 08:00:56 PST ---
(In reply to comment #10)
> (In reply to comment #5)
> > 
> > I'm wrapping up a revamp of std.uni that makes it a piece of cake to create
> > character sets. And maps are converted to multi-staged tables that are faster
> > than binary search on a large set. I'd suggest to wait a bit on it (so as to not
> > duplicate work) and introduce only std.ascii version as the most useful.
> > 
> > The ongoing polishing, fixing and testing against ICU is going on here:
> > https://github.com/blackwhale/gsoc-bench-2012
> 
> OK: The thing I was having trouble though is that existing binary search
> returns a bool, whereas I need the actual entry, so I can do "value -
> entry[0]", eg:
> 
> //----
>     static immutable dchar[2][] table1 = [
>     [ 0x0030,  0x0039], //
>     [ 0x0660,  0x0669], //ARABIC-INDIC
>     [ 0x06F0,  0x06F9], //EXTENDED ARABIC-INDIC
> 
> ...
> //---
> That's because all the entries in [Nd] are consecutive numerals starting at 0.
> I can also cram a select couple of entries from [Nl] and [Po] that also use
> this scheme.
> 

Sometimes I was able to abuse the natural format of the data, and sometimes I
failed. But what proved to be quite good is varying the sizes of the
multi-staged table to match the "periods" of the data. In the end, if the data
has a lot of common "rows", a multi-staged table with a certain size per stage
is bound to hit a sweet spot.
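
To make the "common rows" point concrete, here is a minimal two-stage sketch.
It is not the std.uni implementation; `TwoStage`, `build` and the 16-entry
block size are assumptions picked for illustration. With 16-wide blocks the
three digit ranges quoted above all produce the same row, so only two distinct
blocks end up stored.

```d
import std.stdio;

enum blockBits = 4;              // assumed stage size: 16 entries per block
enum blockSize = 1 << blockBits;

// Hypothetical two-stage table: stage 1 maps a block number to a stored
// block, stage 2 holds each distinct block exactly once.
struct TwoStage
{
    ubyte[] index;  // stage 1
    byte[] blocks;  // stage 2

    byte opIndex(uint key) const
    {
        immutable hi = key >> blockBits;
        if (hi >= index.length)
            return -1; // outside the covered range
        return blocks[index[hi] * blockSize + (key & (blockSize - 1))];
    }
}

// Builds the table from a flat array, reusing identical blocks ("rows").
TwoStage build(byte[] flat)
{
    assert(flat.length % blockSize == 0);
    TwoStage t;
    foreach (b; 0 .. flat.length / blockSize)
    {
        auto row = flat[b * blockSize .. (b + 1) * blockSize];
        size_t found = size_t.max;
        foreach (k; 0 .. t.blocks.length / blockSize)
        {
            if (t.blocks[k * blockSize .. (k + 1) * blockSize] == row)
            {
                found = k;
                break;
            }
        }
        if (found == size_t.max)
        {
            // First time this row is seen: store it.
            found = t.blocks.length / blockSize;
            t.blocks ~= row;
        }
        t.index ~= cast(ubyte) found;
    }
    return t;
}

void main()
{
    // Flat digit-value table for U+0000 .. U+06FF; -1 means "not a digit".
    auto flat = new byte[](0x700);
    flat[] = -1;
    foreach (start; [0x0030, 0x0660, 0x06F0])
        foreach (d; 0 .. 10)
            flat[start + d] = cast(byte) d;

    auto t = build(flat);
    assert(t[0x0665] == 5);
    assert(t[0x0041] == -1);
    writeln(t.blocks.length / blockSize, " distinct blocks, ",
            t.index.length, " index entries");
}
```

A different block size changes how many rows coincide, which is the tuning
knob described above.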

> So if I have the unicode 0x0665 (The ARABIC-INDIC numeral '5'), I'd want to
> find [ 0x0660,  0x0669], and then "return 0x0665 - 0x0660".
> 
> Well, I don't need the entire pair, but at least the lhs of the pair.
> 
> If you could keep that in mind during your re-write. Or not. Just throwing it
> out there.
> 
> For all other entries in [Nl] and [Po], I'd have:
>     static immutable dchar[2][] table1 = [
>     [ 0x216D,  100], //ROMAN NUMERAL ONE HUNDRED
> 
> So that's just basic dictionary. But I don't think you can statically allocate
> an AA. So yeah, just throwing that your direction too.
> 

Well, AA is a fat pig w.r.t RAM usage. But thanks anyway.
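
For the range lookup quoted above, a sorted static table searched by hand
already gives "value - entry[0]" without any AA. A rough sketch, assuming only
the three ranges shown in the quote; `digitRanges` and `digitValue` are
hypothetical names, not existing Phobos symbols:

```d
// Sorted, non-overlapping [first, last] ranges of consecutive digits.
static immutable dchar[2][] digitRanges = [
    [0x0030, 0x0039], // ASCII digits
    [0x0660, 0x0669], // ARABIC-INDIC
    [0x06F0, 0x06F9], // EXTENDED ARABIC-INDIC
];

// Hypothetical helper: binary search that locates the containing range and
// returns the offset from its first code point, or -1 if no range matches.
int digitValue(dchar c)
{
    size_t lo = 0, hi = digitRanges.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < digitRanges[mid][0])
            hi = mid;
        else if (c > digitRanges[mid][1])
            lo = mid + 1;
        else
            return cast(int)(c - digitRanges[mid][0]);
    }
    return -1;
}

void main()
{
    assert(digitValue(0x0665) == 5);  // 0x0665 - 0x0660
    assert(digitValue('7') == 7);
    assert(digitValue('A') == -1);
}
```

The -1 here is safe only because this table holds plain digits 0 to 9; it does
not settle the sentinel question for the general numeric value.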

> > > The file is too large for std.xml to handle, so it's back to C++ for me :/
> > > 
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> > 
> > Same thing but no useless XML trash. Description of fields is somewhere in the
> > middle of this document 
> > http://www.unicode.org/reports/tr44/
> 
> Nice, TY.
> 
> > > The only questions I have is:
> > > Return value: int or double?
> > 
> > Should be rational to accurately represent things like "1/5" character ;)
> > I do suspect some simple custom type could do (2 shorts packed in one struct
> > etc.).
> > 
> > > Input is not numeric: -1 or exception?
> > 
> > -1 is fine I think as this is rather low level (per character) and it's not at all
> > convenient to throw (and then catch).
> 
> The only issue I have with returning -1 is that it is a magic value. The fact
> that there is no unicode for -1 is pure coincidence, and not by design. In
> particular, any attempt to write "if (numericValue(c) < 0) fail" would also be
> wrong because:
> http://unicode.org/cldr/utility/character.jsp?a=0F33
> The TIBETAN DIGIT HALF ZERO returns -0.5
> 
> Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)"
> ?
> 
> ...
> 
> Damn you unicode!

Aye, and given there are things like "1e12" I don't think packing it would work
any better... some kind of custom type is required.
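
One possible shape for such a type, purely as a sketch: keep a numerator and a
denominator and let a zero denominator mean "not numeric", so validity never
depends on comparing the value with a magic number. `NumericValue` and its
members are hypothetical names, and the field widths are assumptions.

```d
// Hypothetical result type for a per-character numeric value.
struct NumericValue
{
    long num; // numerator, wide enough for values like 1e12
    uint den; // denominator; 0 (the default) means "not numeric"

    @property bool valid() const
    {
        return den != 0;
    }

    // Lossy convenience conversion; NaN when the character is not numeric.
    @property double toDouble() const
    {
        return valid ? cast(double) num / den : double.nan;
    }
}

void main()
{
    auto halfZero = NumericValue(-1, 2);                // -1/2, as in the 0F33 case
    auto trillion = NumericValue(1_000_000_000_000, 1); // a value like 1e12
    NumericValue none;                                  // default: not numeric

    assert(halfZero.valid && halfZero.toDouble == -0.5);
    assert(trillion.valid && trillion.num == 1_000_000_000_000);
    assert(!none.valid);
}
```

Callers then write "if (!v.valid)" instead of testing against -1 or -0.7.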

