unicode confusion--An answer
Charles Hixson via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Apr 4 17:21:21 PDT 2016
I was writing my output to two different files. Only one of them was
set to UTF-8. The other must have been in some other encoding, because
when I set its encoding to UTF-8 everything cleared up.
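
For anyone who hits the same symptom, here is a minimal sketch of what
such a mismatch does. It is in Python for brevity, not the language of
the original program. It assumes the wrong encoding was Latin-1, which
is not confirmed above, and the sample word is only a guess at one of
the originals.

```python
# Hypothetical sample word, spelled with escapes so the exact code points are unambiguous.
word = "\u03b6\u03b7\u03c4\u03bf\u1fe6\u03c3\u03b9\u03bd"
ascii_word = "Spectrophotometer"

# The bytes that actually land in the output file.
utf8_bytes = word.encode("utf-8")
print(len(utf8_bytes))  # 17 bytes for 8 characters

# Simulate a reader that assumes a single-byte encoding (Latin-1 is an assumption).
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# ASCII bytes mean the same thing in both encodings, so plain English words survive.
assert ascii_word.encode("utf-8").decode("latin-1") == ascii_word

# The data itself is intact: reversing the wrong decode restores the word.
restored = garbled.encode("latin-1").decode("utf-8")
assert restored == word
```

Under these assumptions every non-ASCII character turns into two or
three unrelated characters, some of them invisible control characters,
while ASCII text is unaffected.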
On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:
> Well, at least I think that it's unicode confusion. When I store
> values into a string (in an array of structs) and then compare it
> against itself, it compares fine, and if I write it out at that point
> it writes out fine. And validate says it's good unicode.
>
> But later...
> valid = true, len = 17, wrd = true , cnt = 2, txt = Gesammtabentheuer
> valid = true, len = 27, wrd = true , cnt = 1, txt =
> νεÏεληγεÏá½³ÏηÏ
> valid = true, len = 17, wrd = true , cnt = 1, txt = ζηÏοῦÏιν
> valid = true, len = 36, wrd = true , cnt = 1, txt =
> αἱμοÏÏοÏδοκαύÏÏηÏ
> valid = true, len = 18, wrd = true , cnt = 2, txt =
> δÏνηθÏμεν
> valid = true, len = 20, wrd = true , cnt = 1, txt =
> ÏÏοÏκÏοÏÏÏ
> valid = true, len = 20, wrd = true , cnt = 1, txt =
> ÏκοÏÏμÎνην
> valid = true, len = 18, wrd = true , cnt = 1, txt =
> á¼Î³Î±ÏηÏοί
> valid = true, len = 28, wrd = true , cnt = 1, txt =
> ×Ö½Ö·×Ö°×Ö´×ָּתָ×Ö¼
> valid = true, len = 19, wrd = true , cnt = 1, txt =
> ΤÏÏÏηνικά
> valid = true, len = 17, wrd = true , cnt = 2, txt = IODOHYDRARGYRATIS
> valid = true, len = 21, wrd = true , cnt = 1, txt =
> ÏοινικίÏιν
> valid = true, len = 17, wrd = true , cnt = 1, txt = Spectrophotometer
> valid = true, len = 26, wrd = true , cnt = 1, txt =
> αἰνιÏÏόμενοι
> valid = true, len = 70, wrd =
> true , cnt = 1, txt = ÎÎΣÎ
> ÎÎÎΡÎÎÎΣΧÎÎÎÎÎΡÎÎΣÎÎÎÎ¥ÎÎ ÎΤÎΣÎÎÎ
> valid = true, len = 18, wrd = true , cnt = 1, txt =
> μικÏÏÏαÏα
> valid = true, len = 23, wrd = true , cnt = 1, txt =
> á¼ÏοÏá½±ÏηÏίν
> valid = true, len = 18, wrd = true , cnt = 1, txt =
> ××ֹקְש×Öµ×
> valid = true, len = 17, wrd = true , cnt = 1, txt = διαμένÏν
> . . . (etc. for 39599 lines)
> (And it looks worse than that, actually, because control characters
> aren't coming through).
> I think the originals were usually Greek letters due to an earlier
> test (why there should be so many greek words I don't know...but if
> they're there I want them to be handled properly), but the corrupted
> text is such a small part of the original file that I can't be
> certain. Valid = true means that it passed string validation right
> before being printed. wrd = true means that the only characters in it
> should be isAlpha, hyphen, apostrophe, or underscore. cnt = n means
> that it was detected n times in the dataset (of 8013 text files). And
> the string in each struct is only written once in the execution of the
> program.
>
> I was scanning the dataset looking to see what long words were
> valid...I didn't expect THIS at all. And as you can see from, e.g.,
> "Spectrophotometer", ASCII values don't seem to be damaged at all.
>
> FWIW, I was expecting to encounter an occasional Greek, French, or
> Chinese word...but nothing like this. I'd think it was the conversion
> from string to dchar[] and back that was the problem, but when I test
> immediately after I know I've written to the string everything looks
> right. So I'm guessing it's something about how unicode is handled.
>
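
The checks described in the quoted message can be sketched roughly as
follows, again in Python rather than the original language. The helper
is_valid_utf8 and the sample word are hypothetical, and the list of
code points only stands in for a dchar[]. The sketch suggests why
in-program validation and the round trip both pass when the strings
are fine and the problem lies in how the file is read.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample word, written with escapes to pin down the code points.
word = "\u03b6\u03b7\u03c4\u03bf\u1fe6\u03c3\u03b9\u03bd"
utf8 = word.encode("utf-8")

# Round trip through fixed-width code points, the analogue of string -> dchar[] -> string.
code_points = [ord(ch) for ch in word]
round_tripped = "".join(chr(cp) for cp in code_points).encode("utf-8")
assert round_tripped == utf8

# Well-formed UTF-8 passes validation.
assert is_valid_utf8(utf8)

# Dropping the final byte leaves a dangling lead byte, which validation rejects.
assert not is_valid_utf8(utf8[:-1])
```

A validity check of this kind only looks at the bytes in memory. It
cannot tell which encoding a file viewer will assume later.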