TDPL: Foreach over Unicode string

Sean Kelly sean at invisibleduck.org
Tue Jul 27 16:43:42 PDT 2010


Andrej Mitrovic Wrote:

> On Wed, Jul 28, 2010 at 12:34 AM, Sean Kelly <sean at invisibleduck.org> wrote:
> 
> > Sean Kelly Wrote:
> > >
> > > I think it's Windows integration that's the problem, on OSX I get:
> > >
> > > [H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
> > > [H][a][l][l][å][,][ ][V][ä][r][l][d][!]
> > >
> > > which is essentially correct.  The only difference between this and doing
> > > the same thing in C and using printf() in place of write() is that both
> > > lines display correctly in C.  I think printf() must be detecting partial
> > > UTF-8 characters and buffering until the complete chunk has arrived.
> > > Interestingly, the C output can't even be broken by badly timed calls to
> > > fflush(), so the buffering is happening at a fairly high level.  I'd be
> > > interested in seeing the same thing in write() at some point.
> >
> > Ah, write() already works that way.  It was the brackets that were screwing
> > things up.
> >
> 
> You are right about printf(), I'm getting the correct output with this code:
> 
> import std.stdio, std.stream;
> 
> void main() {
>     string str = "Hall\u00E5, V\u00E4rld!";
>     foreach (dchar c; str) {
>         printf("%c", c);
>     }
>     writeln();
> }
> 
> Hallå, Värld!
> 
> Should I file this as a Windows bug for DMD?

Yes.  I looked into this briefly, and after a bit of googling, it looks like fwide() isn't implemented on Windows (unless Walter has done this himself in the DMC libraries).  See here:

http://blogs.msdn.com/b/michkap/archive/2009/06/23/9797156.aspx
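
For anyone unfamiliar with fwide(), here is a minimal sketch of how it is meant to behave according to the C standard.  The helper name put_wide_checked is made up for illustration and is not from Phobos or DMC, and whether a particular Windows runtime actually honours the orientation is exactly the open question above.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: write one wide character, but only if the stream
 * is (or can be made) wide-oriented.  Returns 0 on success, -1 otherwise.
 *
 * fwide(fp, 0) only queries the orientation: > 0 wide, < 0 byte, 0 unset.
 * fwide(fp, 1) requests wide orientation, which can only succeed on a
 * stream that has no orientation yet. */
int put_wide_checked(FILE *fp, wchar_t wc)
{
    if (fwide(fp, 0) <= 0 && fwide(fp, 1) <= 0)
        return -1;              /* stream is already byte-oriented */
    return fputwc(wc, fp) == WEOF ? -1 : 0;
}
```

On a conforming implementation the orientation sticks once set, so a later fwide(fp, -1) on the same stream should not switch it back to byte mode.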

If I change std.stdio.LockingTextWriter.put(C)(C c) to always use the version(Windows) code for a 32-bit argument, it *almost* works correctly.  Instead of garbage, the Unicode characters come out as a lowercase o with an accent above (U+01A1, I believe) and an uppercase sigma (U+01A9).  I'll have to spend some more time later figuring out why it's these characters and not the intended ones.  I wouldn't expect endianness to be relevant, but that's the only explanation I've come up with so far.
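
As a reference point while debugging, here is a small C sketch of the byte-level picture behind this thread.  The names utf8_encode and bracket_units are made up for illustration; nothing here is taken from Phobos or the DMC runtime.  It shows which bytes should reach the terminal for U+00E5 and U+00E4, and why wrapping every code unit in brackets (as in the first test output) splits each of those characters into two bytes that are not valid UTF-8 on their own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8 into out (up to 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Copy the UTF-8 string s into dst, wrapping each unit in brackets.
 * by_code_point == 0: every byte gets its own brackets.
 * by_code_point != 0: continuation bytes (10xxxxxx) stay with their
 * lead byte, so each bracket pair holds one whole character.
 * dst must have room for at least 3 * strlen(s) + 1 bytes. */
void bracket_units(const char *s, int by_code_point, char *dst)
{
    size_t n = 0;
    size_t i = 0;
    while (s[i] != '\0') {
        dst[n++] = '[';
        dst[n++] = s[i++];
        while (by_code_point && ((unsigned char)s[i] & 0xC0) == 0x80)
            dst[n++] = s[i++];
        dst[n++] = ']';
    }
    dst[n] = '\0';
}
```

With this, utf8_encode(0xE5, buf) yields the two bytes C3 A5 and utf8_encode(0xE4, buf) yields C3 A4.  Bracketing per byte puts C3 and A5 in separate bracket pairs, which matches the [?][?] in the first line of the OSX output, while bracketing per code point keeps them together.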

