unicode confusion--An answer

Charles Hixson via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Apr 4 17:21:21 PDT 2016


I was writing my output to two different files.  Only one of them was 
set to UTF-8; the other must have been in some other encoding, because 
once I set its encoding to UTF-8 as well, everything cleared up.

On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:
> Well, at least I think that it's unicode confusion.  When I store 
> values into a string (in an array of structs) and then compare it 
> against itself, it compares fine, and if I write it out at that point 
> it writes out fine.  And validate says it's good unicode.
>
> But later...
> valid = true, len = 17, wrd = true , cnt =     2, txt = Gesammtabentheuer
> valid = true, len = 27, wrd = true , cnt =     1, txt = νεφεληγερέτης
> valid = true, len = 17, wrd = true , cnt =     1, txt = ζητοῦσιν
> valid = true, len = 36, wrd = true , cnt =     1, txt = αἱμορροϊδοκαύστης
> valid = true, len = 18, wrd = true , cnt =     2, txt = δυνηθώμεν
> valid = true, len = 20, wrd = true , cnt =     1, txt = προσκρούσω
> valid = true, len = 20, wrd = true , cnt =     1, txt = σκοτωμένην
> valid = true, len = 18, wrd = true , cnt =     1, txt = ἀγαπητοί
> valid = true, len = 28, wrd = true , cnt =     1, txt = הַֽמְזִמָּתָהּ
> valid = true, len = 19, wrd = true , cnt =     1, txt = Τυρρηνικά
> valid = true, len = 17, wrd = true , cnt =     2, txt = IODOHYDRARGYRATIS
> valid = true, len = 21, wrd = true , cnt =     1, txt = χοινικίσιν
> valid = true, len = 17, wrd = true , cnt =     1, txt = Spectrophotometer
> valid = true, len = 26, wrd = true , cnt =     1, txt = αἰνιττόμενοι
> valid = true, len = 70, wrd = true , cnt =     1, txt = ΓΗΣΠΛΕΘΡΑΔΙΣΧΙΛΙΑΕΡΓΑΣΙΜΟΥΑΠΟΤΗΣΟΜΟ
> valid = true, len = 18, wrd = true , cnt =     1, txt = μικρότατα
> valid = true, len = 23, wrd = true , cnt =     1, txt = ἀποπάτησίν
> valid = true, len = 18, wrd = true , cnt =     1, txt = מוֹקְשֵׁי
> valid = true, len = 17, wrd = true , cnt =     1, txt = διαμένων
>      . . . (etc. for 39599 lines)
> (And it looks worse than that, actually, because control characters 
> aren't coming through).
> Based on an earlier test, I think the originals were mostly Greek 
> letters (why there should be so many Greek words I don't know...but if 
> they're there I want them to be handled properly), but the corrupted 
> text is such a small part of the original file that I can't be 
> certain.  valid = true means that the string passed validation right 
> before being printed.  wrd = true means that the only characters in it 
> should be isAlpha, hyphen, apostrophe, or underscore.  cnt = n means 
> that it was detected n times in the dataset (of 8013 text files).  And 
> the string in each struct is only written once in the execution of the 
> program.
>
> I was scanning the dataset looking to see what long words were 
> valid...I didn't expect THIS at all.  And as you can see from, e.g., 
> "Spectrophotometer", ASCII values don't seem to be damaged at all.
>
> FWIW, I was expecting to encounter an occasional Greek, French, or 
> Chinese word...but nothing like this.  I'd think it was the conversion 
> from string to dchar[] and back that was the problem, but when I test 
> immediately after writing to the string, everything looks right.  So 
> I'm guessing it's something about how unicode is handled.
>
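
A side note on the len column in the quoted output: the values are
consistent with counting UTF-8 code units (bytes) rather than
characters, which is what .length reports for a D string.  That the
program printed .length is an assumption on my part; the check below
is only a rough sketch in Python, and utf8_len is a made-up helper
name.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Two words from the output, both reported with len = 17.
# The Greek word is spelled with escapes to pin down the exact code points.
samples = [
    "Gesammtabentheuer",
    "\u03b6\u03b7\u03c4\u03bf\u1fe6\u03c3\u03b9\u03bd",
]

for word in samples:
    # len(word) counts code points; utf8_len(word) counts encoded bytes.
    print(word, len(word), utf8_len(word))
```

The ASCII word has 17 code points and 17 bytes.  The Greek word has
only 8 code points, but seven of them take 2 bytes and one takes 3, so
it also comes to 17 bytes.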


