numericValue for (unicode) characters

Fri Jan 4 12:33:11 PST 2013

04-Jan-2013 21:48, monarch_dodra wrote:
> On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 15:58, Jonathan M Davis wrote:
>>> On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
>>>> So... do we agree on
>>>> ascii: int - not found => -1
>>>> uni: double - not found => nan
>>>
>>> I'm not a fan of the ASCII version returning -1, but I don't really have a
>>> better suggestion. I suppose that you could throw instead, but I don't know
>>> if that's a good idea or not. It _would_ be more consistent with our other
>>> conversion functions however.
>>>
>>> - Jonathan M Davis
>>
>> I find low-level stuff that throws to be overly awkward to deal with
>> (not to mention performance problems).
>>
>> Hm... I've found a brilliant primitive, Expected!T, that could be of
>> great help with the error codes vs. exceptions problem. See Andrei's
>> recent talk that went live not long ago:
>>
>> http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C
>>
>>
>> Time to put the analogous stuff into Phobos?
>
> I finished an implementation:
>
> https://github.com/D-Programming-Language/phobos/pull/1052
>
> It is not "pull ready", so we can still discuss it.
>

Well, for a start, it features tons of code duplication. But I'm replacing
the whole std.uni anyway...

> I raised a couple of issues in the pull, which I'll copy here:
>
> //----
> I did run into a couple of issues, namely that I'm not getting 100%
> equivalence between chars that are numeric, and chars with numeric
> value... Is this normal...?
>

Yes, it's called Unicode ;)

> * There's a fair bit of chars that have numeric value, but aren't
> isNumber. I think they might be new in 6.1.0. But I'm not sure. I
> decided it was best to have them return nan, instead of having
> inconsistent behavior.

You might also be using 6.2. It was released in the fall of 2012.

> * There's a couple characters in tableLo that have numeric values. These
> aren't considered in isNumber either. I think this might be a bug though.
> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
> SIGN". These return wild values, and in particular two of them return
> -1. I *think* this should actually return nan for us, because (AFAIK),
> -1 is just wild for invalid :/

Some do have a numeric value of '-1', I think. The truth of the matter is
that, as usual with Unicode, things are rather complicated.
'Numeric character' is a (general) category, while 'has numeric value'
is a separate property of a code point that may or may not correlate
directly with the category.

Thus I think (looking ahead into your other post) that isNumber is 
correct as it follows its documented behavior.
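
To make the category vs. numeric value distinction concrete, here is a minimal sketch. It uses Python's unicodedata module rather than std.uni, purely because it exposes both properties side by side; the describe() helper and the sample characters are my own illustration, not anything from the pull request.

```python
import math
import unicodedata


def describe(ch):
    """Return (general category, numeric value) for a single code point.

    The numeric value is nan when the Unicode data defines none.
    describe() is a hypothetical helper for this illustration only.
    """
    return unicodedata.category(ch), unicodedata.numeric(ch, math.nan)


# Sample code points chosen for illustration.
samples = {
    "DIGIT FIVE": "5",                    # a number by category
    "VULGAR FRACTION ONE HALF": "\u00bd",  # a number, but not a digit
    "CJK IDEOGRAPH FIVE": "\u4e94",        # a letter that has a numeric value
}

for label, ch in samples.items():
    category, value = describe(ch)
    print(f"U+{ord(ch):04X} {label}: category={category}, numeric={value}")
```

The last sample is the interesting one: its general category is a letter category, so a category-based isNumber would reject it, yet it still carries a numeric value.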

>
> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
> input file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> It doesn't have a separate field for isNumber/numericValue, so it is
> forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?
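
For reference, the convention agreed earlier in the thread (ascii: int, not found => -1; uni: double, not found => nan) can be sketched as below. This is Python rather than D, and both function names are hypothetical stand-ins, not the signatures from the pull request. The Unicode variant simply defers to Python's unicodedata and falls back to nan.

```python
import math
import unicodedata


def ascii_numeric_value(ch):
    """Hypothetical ASCII variant: int result, -1 when ch is not '0'..'9'."""
    return ord(ch) - ord("0") if "0" <= ch <= "9" else -1


def uni_numeric_value(ch):
    """Hypothetical Unicode variant: float result, nan when no value is defined."""
    return unicodedata.numeric(ch, math.nan)


print(ascii_numeric_value("7"))     # a digit yields its value
print(ascii_numeric_value("A"))     # not a digit, so the -1 sentinel
print(uni_numeric_value("\u2167"))  # ROMAN NUMERAL EIGHT
print(uni_numeric_value("A"))       # no numeric value, so nan
```

If some code points really are listed with -1 in the data, as discussed above, then -1 cannot double as the "not found" marker for the Unicode version, which is one argument for nan there.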

> //----
>
> Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it
> if we have numericValue.

-- 
Dmitry Olshansky