[review] new string type

Ali Çehreli acehreli at yahoo.com
Fri Dec 3 14:11:35 PST 2010


spir wrote:

 > What I have in mind is a "UText" type that provides the right abstraction
 > for text processing / string manipulation, as one has when dealing with
 > ASCII (in fact, any legacy character set). All that is needed is a true
 > one-to-one mapping between characters (in the common sense) and elements
 > of strings (what I call "code stacks"); one given stack unambiguously
 > denotes one character. To reach this point, in addition to decoding (e.g.
 > from UTF-8 to code points), we must:

 > * group codes into stacks
 > * normalize (into 'NFD')

Are those operations independent of the context? Is stacking always desired?

I guess one would use one of the D string types when grouping or 
normalization is not desired, right? Makes sense.

 > * sort points in stacks

Ok, I see that it is possible with NFD. I am not experienced with
Unicode, but I think there will be issues with other types of Unicode
normalization. (Judging from your posts, I know that you know all of
this. :) )
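
To make the quoted steps concrete (normalize to NFD, group codes into
stacks, rely on a fixed order of points inside a stack), here is a rough
sketch. It is in Python rather than D only because the standard
unicodedata module makes it short. The name to_stacks is hypothetical,
and treating "a base code point plus the combining marks that follow it"
as one stack is my simplifying assumption; it is not spir's actual design
and it ignores cases such as Hangul jamo or joiner sequences.

```python
import unicodedata

def to_stacks(text):
    """Split text into 'code stacks' (hypothetical helper).

    Each stack is a base code point followed by its combining marks,
    taken from the NFD form of the input.
    """
    stacks = []
    for cp in unicodedata.normalize("NFD", text):
        # combining() returns the canonical combining class;
        # 0 marks a starter (base), non-zero marks a combining mark.
        if stacks and unicodedata.combining(cp) != 0:
            stacks[-1] += cp
        else:
            stacks.append(cp)
    return stacks

# Precomposed and decomposed spellings of "ça" give the same stacks.
assert to_stacks("\u00e7a") == to_stacks("c\u0327a")

# NFD also puts marks of different combining classes in a canonical
# order, so the two spellings below end up as the same single stack.
assert to_stacks("a\u0307\u0323") == to_stacks("a\u0323\u0307")

# Indexing the list of stacks addresses whole characters.
stacks = to_stacks("c\u0327a")
print(len("c\u0327a"), len(stacks))  # 3 code points, 2 stacks
```

Once the stacks are stored in an array, indexing and slicing work on
characters rather than on code points, which is the O(1) property
discussed in this thread.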

 > Then, we can for instance index or slice in O(1) as usual, and get a
 > consistent substring of _characters_ [...] I do not want to deal with
 > anything related to script-, language-, locale-specific issues.

Is the concept of _character_ well defined in Unicode outside of the
context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of 
code points as characters. When would it be correct to do so? I think 
the answer is when we start treating the string as a piece of text.

For a string to be considered as text, it must be based on an alphabet. 
ASCII strings are pieces of text, because they are based on the 
26-letter alphabet.

I hope I don't sound like I am arguing against anything that you said. I
am also thinking about the other common operations that work on pieces of
text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)
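
As a rough illustration of why both operations depend on the alphabet,
here is a small sketch for the Turkish alphabet. It is in Python only for
brevity. The tables and the names tr_upper, tr_lower and tr_sort_key are
made up for this example and are not taken from any existing module; the
letters are assumed to be in precomposed form.

```python
# The 29 letters of the Turkish alphabet, in alphabet order.
# Both strings are aligned so that position i in one maps to
# position i in the other (note the ı/I and i/İ pairs).
TR_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TR_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

_TO_UPPER = str.maketrans(TR_LOWER, TR_UPPER)
_TO_LOWER = str.maketrans(TR_UPPER, TR_LOWER)
_RANK = {ch: i for i, ch in enumerate(TR_LOWER)}

def tr_upper(s):
    """Uppercase using the Turkish letter pairs."""
    return s.translate(_TO_UPPER)

def tr_lower(s):
    """Lowercase using the Turkish letter pairs."""
    return s.translate(_TO_LOWER)

def tr_sort_key(s):
    """Sort key that follows alphabet order, not code point order.

    Characters outside the alphabet sort after all letters.
    """
    return [_RANK.get(ch, len(_RANK) + ord(ch)) for ch in tr_lower(s)]

words = ["dal", "çay", "cam"]
print(sorted(words))                   # code point order puts "çay" last
print(sorted(words, key=tr_sort_key))  # alphabet order: cam, çay, dal
print("ali".upper(), tr_upper("ali"))  # ALI ALİ
```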

As a part of the Turkish D community, we've played with the idea of such 
a text type. It took advantage of D's support for Unicode encoded source 
code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and 
producing the base forms of the letters of the alphabets:

     http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread.

It is written with the older D2 operator overloading, doesn't support 
ranges, etc. But it currently supports ten alphabets (including the 
26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type 
that contains pieces of text from different alphabets to make a "text", 
where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".
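
The idea can be sketched as a sequence of pieces, each tagged with its
alphabet. This is only an illustration in Python, not the type we
actually wrote: the "en"/"tr" tags, upper_tr, UPPERCASERS and
capitalize_text are hypothetical names, and only the dotted/dotless i
pairs are handled specially for Turkish.

```python
def upper_tr(s):
    """Uppercase with the Turkish i/İ and ı/I pairs (simplified)."""
    return s.replace("i", "İ").replace("ı", "I").upper()

# One uppercasing function per alphabet tag (hypothetical tags).
UPPERCASERS = {
    "en": str.upper,
    "tr": upper_tr,
}

def capitalize_text(pieces):
    """Uppercase each (alphabet, piece) pair with its own rules."""
    return "".join(UPPERCASERS[alphabet](piece) for alphabet, piece in pieces)

# "jim & " is tagged as English, "ali" as Turkish.
text = [("en", "jim & "), ("tr", "ali")]
print(capitalize_text(text))  # JIM & ALİ
```

Tagging pieces instead of the whole string keeps "jim" from turning
into "JİM" while "ali" still gets its dotted capital.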

I am thinking of more than what you describe. But your string would be
useful for implementing ours, as we don't have normalization or stacking
support at all.

Thanks,
Ali

