unicode confusion--An answer

Mon Apr 4 17:21:21 PDT 2016

I was writing my output to two different files.  Only one of them was 
set to UTF-8; the other must have been using some other encoding, 
because when I set its encoding to UTF-8 everything cleared up.
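
For anyone who hits the same symptom: the pattern in the quoted output 
is what UTF-8 bytes look like when they are read back as a single-byte 
encoding.  The sketch below is in Python rather than D, purely for 
illustration.  The word used and the choice of Latin-1 as the "other" 
encoding are assumptions, picked because they appear to reproduce one 
of the quoted lines (the one with len = 17); the encoding the second 
file actually used is not known.

```python
# Illustrative sketch only. Assumptions: the original word was the Greek
# "zetousin" (written with escapes so the source stays ASCII), and the
# mis-set file was interpreted as Latin-1.
word = "\u03b6\u03b7\u03c4\u03bf\u1fe6\u03c3\u03b9\u03bd"

utf8_bytes = word.encode("utf-8")
print(len(word))        # number of code points
print(len(utf8_bytes))  # number of UTF-8 bytes (what a byte-length count reports)

# Reading those same bytes as Latin-1 turns every byte into one character.
garbled = utf8_bytes.decode("latin-1")

# Some of the resulting characters are C1 control characters, which most
# viewers silently drop; filter them out to mimic what ends up on screen.
printable = "".join(ch for ch in garbled if ch.isprintable())
print(printable)

# The damage is only in the interpretation: the bytes themselves are intact,
# so reversing the wrong decode restores the original text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == word)
```

This would also explain why the strings validated fine and the ASCII 
words were untouched: the bytes were valid UTF-8 all along, and ASCII 
is encoded identically in both encodings.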

On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:
> Well, at least I think that it's Unicode confusion.  When I store a 
> value into a string (in an array of structs) and then compare it 
> against itself, it compares fine, and if I write it out at that point 
> it writes out fine.  And validate says it's good Unicode.
>
> But later...
> valid = true, len = 17, wrd = true , cnt =     2, txt = Gesammtabentheuer
> valid = true, len = 27, wrd = true , cnt =     1, txt = 
> Î½ÎµÏÎµÎ»Î·Î³ÎµÏá½³ÏÎ·Ï
> valid = true, len = 17, wrd = true , cnt =     1, txt = Î¶Î·ÏÎ¿á¿¦ÏÎ¹Î½
> valid = true, len = 36, wrd = true , cnt =     1, txt = 
> Î±á¼±Î¼Î¿ÏÏÎ¿ÏÎ´Î¿ÎºÎ±á½»ÏÏÎ·Ï
> valid = true, len = 18, wrd = true , cnt =     2, txt = 
> Î´ÏÎ½Î·Î¸ÏÎ¼ÎµÎ½
> valid = true, len = 20, wrd = true , cnt =     1, txt = 
> ÏÏÎ¿ÏÎºÏÎ¿ÏÏÏ
> valid = true, len = 20, wrd = true , cnt =     1, txt = 
> ÏÎºÎ¿ÏÏÎ¼ÎÎ½Î·Î½
> valid = true, len = 18, wrd = true , cnt =     1, txt = 
> á¼Î³Î±ÏÎ·ÏÎ¿á½·
> valid = true, len = 28, wrd = true , cnt =     1, txt = 
> ×Ö½Ö·×Ö°×Ö´×Ö¼Ö¸×ªÖ¸×Ö¼
> valid = true, len = 19, wrd = true , cnt =     1, txt = 
> Î¤ÏÏÏÎ·Î½Î¹Îºá½±
> valid = true, len = 17, wrd = true , cnt =     2, txt = IODOHYDRARGYRATIS
> valid = true, len = 21, wrd = true , cnt =     1, txt = 
> ÏÎ¿Î¹Î½Î¹Îºá½·ÏÎ¹Î½
> valid = true, len = 17, wrd = true , cnt =     1, txt = Spectrophotometer
> valid = true, len = 26, wrd = true , cnt =     1, txt = 
> Î±á¼°Î½Î¹ÏÏá½¹Î¼ÎµÎ½Î¿Î¹
> valid = true, len = 70, wrd = 
> true , cnt =     1, txt = ÎÎÎ£Î 
> ÎÎÎÎ¡ÎÎÎÎ£Î§ÎÎÎÎÎÎ¡ÎÎÎ£ÎÎÎÎ¥ÎÎ ÎÎ¤ÎÎ£ÎÎÎ
> valid = true, len = 18, wrd = true , cnt =     1, txt = 
> Î¼Î¹ÎºÏÏÏÎ±ÏÎ±
> valid = true, len = 23, wrd = true , cnt =     1, txt = 
> á¼ÏÎ¿Ïá½±ÏÎ·Ïá½·Î½
> valid = true, len = 18, wrd = true , cnt =     1, txt = 
> ××Ö¹×§Ö°×©×Öµ×
> valid = true, len = 17, wrd = true , cnt =     1, txt = Î´Î¹Î±Î¼á½³Î½ÏÎ½
>      . . . (etc. for 39599 lines)
> (And it looks worse than that, actually, because control characters 
> aren't coming through).
> Based on an earlier test, I think the originals were mostly Greek 
> letters (why there should be so many Greek words I don't know...but if 
> they're there I want them to be handled properly), but the corrupted 
> text is such a small part of the original file that I can't be 
> certain.  valid = true means that the string passed validation right 
> before being printed.  wrd = true means that the only characters in it 
> should be isAlpha, hyphen, apostrophe, or underscore.  cnt = n means 
> that it was detected n times in the dataset (of 8013 text files). And 
> the string in each struct is only written once in the execution of the 
> program.
>
> I was scanning the dataset looking to see what long words were 
> valid...I didn't expect THIS at all.  And as you can see from, e.g., 
> "Spectrophotometer", ASCII values don't seem to be damaged at all.
>
> FWIW, I was expecting to encounter an occasional Greek, French, or 
> Chinese word...but nothing like this.  I'd think it was the conversion 
> from string to dchar[] and back that was the problem, but when I test 
> immediately after writing to the string, everything looks right.  So 
> I'm guessing it's something about how Unicode is handled.
>