numericValue for (unicode) characters

monarch_dodra monarchdodra at gmail.com
Fri Jan 4 12:51:20 PST 2013


On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
> 04-Jan-2013 21:48, monarch_dodra wrote:
>>
>> I finished an implementation:
>>
>> https://github.com/D-Programming-Language/phobos/pull/1052
>>
>> It is not "pull ready", so we can still discuss it.
>>
>
> Well, for a start, it features tons of code duplication. But I'm
> replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some
duplication than a warped single implementation. I could also
make the extra effort. My main concern was to first have
an implementation that is Unicode-correct.

I also thought that, at worst, you could use my parsed data ;) to
submit your own (superior?) pull.

>> * There's a couple characters in tableLo that have numeric values. These
>> aren't considered in isNumber either. I think this might be a bug though.
>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
>> SIGN". These return wild values, and in particular two of them return
>> -1. I *think* this should actually return nan for us, because (AFAIK),
>> -1 is just wild for invalid :/
>
> Some have a numeric value of '-1', I think. The truth of the
> matter is that, as usual with Unicode, things are rather complicated.
> So 'numeric character' is a (general) category, and 'has numeric
> value' is another property of a codepoint that may or may not
> correlate directly with the category.
>
> Thus I think (looking ahead into your other post) that isNumber 
> is correct as it follows its documented behavior.
>
>>
>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
>> input file:
>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>> It doesn't have a separate field for isNumber/numericValue, so it is
>> forced to write a wild number. Maybe these four chars should return nan?
>
> Nope. Does letter 'A' return a wild number?
>

Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.
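
For reference, here is a minimal Python sketch (not the D code from the
pull request) of how one semicolon-separated record of UnicodeData.txt
splits into a general category (field 2) and a numeric value (field 8).
The sample line is reconstructed from the values quoted above (Nl, -1);
its remaining fields and the helper name parse_unicode_data_line are
assumptions made for illustration only.

```python
from fractions import Fraction

def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record into the fields discussed here."""
    fields = line.rstrip("\n").split(";")
    numeric = fields[8]  # field 8: numeric value, e.g. "5", "1/2", or empty
    return {
        "codepoint": int(fields[0], 16),  # field 0: code point in hex
        "name": fields[1],                # field 1: character name
        "category": fields[2],            # field 2: general category
        "numeric": Fraction(numeric) if numeric else None,
    }

# Hypothetical sample record: category and numeric value follow the post,
# the other fields are assumed.
SAMPLE = "12456;CUNEIFORM NUMERIC SIGN NIGIDAMIN;Nl;0;L;;;;-1;N;;;;;"

entry = parse_unicode_data_line(SAMPLE)
print(entry["category"], entry["numeric"])
```

The point of the sketch is only that category and numeric value sit in
separate fields, so a record can be in Nl and still carry an odd value.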

I've found a couple of people on the internet discussing this, but no
hard conclusion :/

****

Anyway, those 4 CUNEIFORM signs aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true for isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
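
The category-versus-numeric-value question can be poked at with
Python's standard unicodedata module. It is used here only as an
independent view of the Unicode data, not as a statement about std.uni.
The helper name describe is made up for this sketch, and the values
printed depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(ch):
    """Return (general category, numeric value or None) for one character."""
    return (unicodedata.category(ch), unicodedata.numeric(ch, None))

# U+F96B is the Lo entry linked above; "7" and "A" are included for contrast.
for ch in ("\uF96B", "7", "A"):
    print("U+%04X" % ord(ch), describe(ch))
```

A character whose category is Lo but whose numeric value is not None is
exactly the case in question: a test based only on Nd/No/Nl would miss it.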

