Today's programming challenge - How's your Range-Fu ?

Sat Apr 18 04:55:30 PDT 2015

On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
> On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
>> On 4/18/2015 12:58 AM, John Colvin wrote:
>>> On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
>>>> On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
>>>>> So either you have to throw out all pretenses of
>>>>> Unicode-correctness and just stick with ASCII-style
>>>>> per-character line-wrapping, or you have to live with
>>>>> byGrapheme with all the complexity that it entails. The
>>>>> former is quite easy to write -- I could throw it together
>>>>> in a couple o' hours max, but the latter is a pretty big
>>>>> project (cf. Unicode line-breaking algorithm, which is one
>>>>> of the TR's).
>>>>
>>>> It'd be good enough to duplicate the existing behavior, which
>>>> is to treat decoded unicode characters as one column.
>>>
>>> Code points aren't equivalent to characters. They're not the
>>> same thing in most European languages,
>>
>> I know a bit of German, for what characters is that not true?
>
> Umlauts, if combined characters are used. Also words that still 
> have their accents left after import from foreign languages. 
> E.g. Café
>
> Getting all unicode correct seems a daunting task with a severe 
> performance impact, esp. if we need to assume that a string 
> might have any normalization form or none at all.
>
> See also: http://unicode.org/reports/tr15/#Norm_Forms
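
To make the Café case concrete, here is a minimal sketch in Python
(chosen only for illustration, it is not D or Phobos code). The
helper column_count is a hypothetical name, and it is only a rough
estimate that skips combining marks; it is not real grapheme
segmentation and ignores wide characters entirely.

```python
import unicodedata

composed = "Caf\u00e9"      # é as one precomposed code point (NFC)
decomposed = "Cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT (NFD)

# Both strings render the same, but they differ in code point count
# and do not compare equal until they are normalized.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_composed = unicodedata.normalize("NFD", composed)


def column_count(text):
    """Rough column estimate: count code points that are not combining marks.

    Hypothetical helper for illustration only. It does not implement
    grapheme cluster segmentation and does not handle wide characters.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)
```

Counting code points gives 4 for the first string and 5 for the
second, while the rough column estimate gives 4 for both.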

Another issue is that the lower-case and upper-case forms of a
letter might need a different number of characters, and some
letters look different depending on where in the word they appear.

For example, German ß is upper-cased as SS, and Greek lower-case
sigma is written σ inside a word but ς at the end of one. I know
Turkish also has similar cases.
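
A short sketch of what this means for string lengths, again in
Python purely for illustration (the behaviour shown is that of
Python's built-in str methods, not of any D library). The Turkish
line uses the dotted capital İ as my own example of such a case;
Python's case mapping is locale-independent, so it does not apply
Turkish-specific rules.

```python
# German: upper-casing ß yields two letters, so the string grows.
street_upper = "Straße".upper()      # 6 code points in, 7 out

# Greek: a capital sigma lower-cases to ς at the end of a word
# and to σ elsewhere.
greek_word_lower = "ΣΑΣ".lower()
lone_sigma_lower = "Σ".lower()

# Turkish dotted capital I (U+0130): the default lower-casing is
# i followed by U+0307 COMBINING DOT ABOVE, i.e. two code points.
dotted_i_lower = "\u0130".lower()
```

So a wrapping routine cannot assume that changing case preserves
either the code point count or the column count.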

--
Paulo