numericValue for (unicode) characters

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Jan 10 10:09:23 PST 2013


10-Jan-2013 03:21, H. S. Teoh wrote:
> On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
>> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
>>> [...]
>>> I, for one, would love to know why isNumeric != hasNumericValue.
> [...]
>> I guess it's just bad wording from the standard.
>>
>> The standard defined 3 groups that make up Number:
>> [Nd] 	Number, Decimal Digit
>> [Nl] 	Number, Letter
>> [No] 	Number, Other
>>
>> However, there are a couple of characters that *are* numbers, but
>> aren't in those groups.
>>
>> The "Good" news is that the standard, *does* define number_types to
>> classify the kind of number a char is:
>> * Null: Not a number
>> * Digit: Obvious
>> * Decimal: Any decimal number that is NOT a digit
>> * Numeric: Everything else.
>>
>> So they used "Numeric" as wild, and "Number" as their general
>> category.
>>
>> This leaves us with ambiguity when choosing our word:
>> Technically '5' does not classify as "numeric", although you could
>> consider it "has a numeric value".
>>
>> I hope that makes sense.
>
> Hmph. I guess we need to differentiate between the unicode category
> called "numeric", and the property of having a numerical value. So we'd
> need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
> what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition 
of Unicode properties)

And that's all, correct and to the letter.
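
To make the distinction concrete, here is a minimal sketch in Python 
using its standard unicodedata module. It is only an illustration, not 
the std.uni API; the helper names is_number and is_numeric are made up 
for this example.

```python
import unicodedata

def is_number(ch):
    """True if ch is in General Category Number (Nd, Nl or No)."""
    return unicodedata.category(ch).startswith("N")

def is_numeric(ch):
    """True if ch has a numeric value, i.e. Numeric_Type != None."""
    return unicodedata.numeric(ch, None) is not None

# DIGIT FIVE, ROMAN NUMERAL EIGHT, VULGAR FRACTION ONE HALF,
# the CJK ideograph for five, and a plain letter.
samples = ["5", "\u2167", "\u00bd", "\u4e94", "A"]
for ch in samples:
    print("U+%04X" % ord(ch), unicodedata.category(ch),
          is_number(ch), is_numeric(ch))
```

The CJK ideograph U+4E94 is the interesting case: its General Category 
is a letter category, yet it carries a numeric value, so the two 
predicates disagree on it.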

>
> Anyway, I'd love to see std.uni cover all unicode categories.
>
> Offhanded note: should we unify the various isX() functions into:
>
> 	bool inCategory(string category)(dchar ch)
>

No, no, no! It's a horrible idea. The main problem with it is that a huge 
catalog of data has to be stored in Phobos (object code), most of it of 
no use, not even niche use. Also, to be practical for anything other than 
casual observation it has to be fast, and it can't be for any of the 
useful cases.

Just count the number of bits to store per codepoint, and consider the 
fairly irregular structure of the whole set of properties (unlike 
individual combinations, which do have a nice distribution, e.g. Scripts 
such as Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading 
through the TR-xx algorithms, and *none* of them requires queries that 
test all (or even more than 1-2) of the properties.

In all cases the algorithm itself defines the set(s) of codepoints with 
different meanings/values for its use case. These (useful) sets can be 
compressed into a fast multi-stage table; the whole catalog of 
properties can't, as it packs enormous heaps of unused junk (Unicode_Age, 
anyone??). This junk is not fit for the std library, but the goal is to 
provide a tool for the user to work with sets/data beyond the commonly 
useful ones in std.
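
For readers unfamiliar with multi-stage tables, a rough two-stage sketch 
in Python follows. It is not the std.uni implementation; the block size 
of 256 and the names build_two_stage and lookup are assumptions made for 
the illustration. The idea is that identical blocks of the codepoint 
space are stored once and shared through an index.

```python
import unicodedata

BLOCK = 256          # codepoints per second-stage block (arbitrary choice)
MAX_CP = 0x110000    # one past the last Unicode codepoint

def build_two_stage(pred):
    """Compress the set {cp : pred(chr(cp))} into (stage1, stage2)."""
    stage1 = []      # for each page of BLOCK codepoints: index into stage2
    stage2 = []      # deduplicated blocks of membership flags
    seen = {}        # block contents -> its index in stage2
    for start in range(0, MAX_CP, BLOCK):
        block = tuple(pred(chr(cp)) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(tables, cp):
    """Membership test: two array indexings, no searching."""
    stage1, stage2 = tables
    return stage2[stage1[cp // BLOCK]][cp % BLOCK]

# Build the table for a single useful set: decimal digits (Nd).
nd = build_two_stage(lambda ch: unicodedata.category(ch) == "Nd")
print(len(nd[0]), "pages share", len(nd[1]), "distinct blocks")
```

A single well-behaved set leaves most pages identical (all false), which 
is why it compresses; a table holding every property at once would have 
far fewer identical pages to share.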

> where category is the Unicode designation, say "Nl", "Nd", etc.? That
> way, it's more future-proof in case the Unicode guys add more
> categories.

I'm posting my work on std.uni as ready for review today or tomorrow.
It includes a type for a set of codepoints and a ton of predefined sets 
for Nl, Nd and almost everything sensible (blocks, scripts, properties).
The user can then conjure whatever combination required.
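
As a rough picture of what combining predefined sets looks like, here is 
a Python sketch that uses plain frozensets in place of a dedicated 
codepoint-set type. The helper category_set and the variable names are 
hypothetical and say nothing about the actual std.uni interface.

```python
import unicodedata

def category_set(cat):
    """All codepoints whose General Category equals cat, e.g. "Nd"."""
    return frozenset(cp for cp in range(0x110000)
                     if unicodedata.category(chr(cp)) == cat)

# Stand-ins for "predefined" sets.
Nd = category_set("Nd")
Nl = category_set("Nl")
ascii_range = frozenset(range(0x80))

# The user combines them with ordinary set algebra.
digits_or_letter_numbers = Nd | Nl       # union
non_ascii_digits = Nd - ascii_range      # difference

print(len(Nd), len(Nl), len(non_ascii_digits))
```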

And it's still way smaller than having the full 'query the database' 
thing. To see the full madness of all of the properties, just use the 
web interface at unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, 
after all, also part of the Unicode standard (and character database).

-- 
Dmitry Olshansky

