numericValue for (unicode) characters

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Jan 4 14:00:01 PST 2013


05-Jan-2013 00:51, monarch_dodra wrote:
> On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 21:48, monarch_dodra wrote:
>>>
>>> I finished an implementation:
>>>
>>> https://github.com/D-Programming-Language/phobos/pull/1052
>>>
>>> It is not "pull ready", so we can still discuss it.
>>>
>>
>> Well, for a start it features tons of code duplication. But I'm
>> replacing the whole std.uni anyway...
>
> Well, I wrote that with duplication, keeping in mind you would
> probably replace both. I thought it would be cleaner to have some
> duplication than a warped single implementation. I could also make the
> extra effort. I was really concerned with first having an implementation
> that is Unicode-correct.
>
> I also thought that, at worst, you could use my parsed data ;) to submit
> your module that is well due for peer review.

Fixed ;)

>>> * There are a couple of characters in tableLo that have numeric values. These
>>> aren't considered in isNumber either. I think this might be a bug
>>> though.
>>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
>>> SIGN". These return wild values, and in particular two of them return
>>> -1. I *think* this should actually return nan for us, because (AFAIK),
>>> -1 is just wild for invalid :/
>>
>> Some have a numeric value of '-1', I think. The truth of the matter is
>> that, as usual with Unicode, things are rather complicated.
>> So 'numeric character' is a category (general) and 'has numeric value'
>> is some other property of codepoint that may or may not correlate
>> directly with category.
>>
>> Thus I think (looking ahead into your other post) that isNumber is
>> correct as it follows its documented behavior.
>>
>>>
>>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
>>> input file:
>>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> It doesn't have a separate field for isNumber/numericValue, so it is
>>> forced to write a wild number. Maybe these four chars should return nan?
>>
>> Nope. Does letter 'A' return a wild number?
>>
>
> Well, the thing is that I'm getting contradictory info from the
> consortium itself:
> Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
> According to the "UnicodeData.txt", its numeric value is -1.
> According to the "Unicode utilities", it is not a numeric type,
> and its value is null:
> http://unicode.org/cldr/utility/character.jsp?a=12456
>
> Also according to the consortium: "-1" is an illegal numeric
> value.
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]
>
> Really, all the info seems to indicate a bug in UnicodeData.txt:
> They really seem like 4 entries in Nl that aren't numbers.
>
> I've found a couple of people on the internet discussing this, but no
> hard conclusion :/

Basically check the bottom of that page:
....
See also: Unicode Display Problems.
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So the utility page is not up to date (it is still on Unicode 6.1.0),
while the UnicodeData.txt file is. I can test with ICU 51 to see what
it reports.

>
> ****
>
> Anyways, those 4 CUNEIFORM aside, what do you make of the
> entries in Lo:
> http://unicode.org/cldr/utility/character.jsp?a=F96B
> These appear to be numeric, but aren't inside Nd/No/Nl. They
> should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the general
category of a symbol and not some other property. Let any mismatches
between the numeric properties and the general category be the
consortium's headache.
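
To illustrate the distinction, here is a rough sketch in Python rather than D, using the standard unicodedata module as a stand-in for std.uni. The helper names is_number_by_category and numeric_value are hypothetical and are not part of any proposed Phobos API. The sketch only shows that "is in a Number category" and "has a numeric value" are two separate lookups that can disagree.

```python
import unicodedata


def is_number_by_category(ch: str) -> bool:
    """True if the General_Category is one of Nd, Nl or No."""
    return unicodedata.category(ch).startswith("N")


def numeric_value(ch: str):
    """Numeric_Value as a float, or None when the code point has none."""
    return unicodedata.numeric(ch, None)


if __name__ == "__main__":
    # '5' and 'A' are baselines; U+F96B is the Lo entry and U+12456 the
    # cuneiform sign discussed in this thread.
    for ch in ("5", "A", "\uF96B", "\U00012456"):
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            is_number_by_category(ch),
            numeric_value(ch),
        )
```

The value reported for U+12456 depends on the Unicode database version bundled with the interpreter, so it may not match the -1 found in the UnicodeData.txt under discussion.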

>
> Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some other
behavior. Maybe this one: [^[:Numeric_Type=None:]], i.e. every code
point whose Numeric_Type is not None.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=
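
As a rough way to see how far that set differs from the Number categories, the sketch below (again Python with the standard unicodedata module, not std.uni) lists code points that carry a numeric value but whose general category is not Nd, Nl or No. It assumes that "unicodedata.numeric returns a value" is a fair approximation of "Numeric_Type is not None"; the function names are hypothetical.

```python
import sys
import unicodedata


def has_numeric_type(ch: str) -> bool:
    """Approximates Numeric_Type != None: the code point has a numeric value."""
    return unicodedata.numeric(ch, None) is not None


def numeric_outside_number_category():
    """Code points with a numeric value whose category is not Nd/Nl/No."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if has_numeric_type(ch) and not unicodedata.category(ch).startswith("N"):
            result.append(cp)
    return result


if __name__ == "__main__":
    outliers = numeric_outside_number_category()
    print(len(outliers), "code points have a numeric value outside Nd/Nl/No")
    for cp in outliers[:10]:
        print(f"U+{cp:04X}", unicodedata.category(chr(cp)),
              unicodedata.numeric(chr(cp)))
```

The exact count depends on the Unicode version of the interpreter's database, so no figure is quoted here.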


-- 
Dmitry Olshansky

