Should this work?

Marco Leise Marco.Leise at gmx.de
Mon Jan 20 21:31:06 PST 2014


On Mon, 20 Jan 2014 11:14:46 +0000,
"Dicebot" <public at dicebot.lv> wrote:

> On Monday, 20 January 2014 at 10:22:03 UTC, Regan Heath wrote:
> > I was thinking in a very Windows centric way when I wrote my 
> > comment but it doesn't surprise me that other platforms can be 
> > configured to other locales.  What do they default to?
> 
> UTF-8 for most "user-friendly" distros I know. Region/locale is 
> usually entered by the user during installation, but the encoding 
> is always UTF-8. That said, changing it system-wide is just a 
> matter of tweaking a config file and regenerating the locale, so 
> that can't be relied upon in the standard library.
> 
> > The last Linux install I did was for my Raspberry Pi and UTF-8 
> > was recommended, and I selected it, and yet still I had to 
> > break out some weird console magic to fully realise that choice 
> > (I think there was a disjoint component which had not been 
> > configured correctly.. some part of the installation dropped 
> > the ball).
> 
> Most likely just an installer issue in the specific distro you 
> used.

I asked on #linux how encodings are handled. I figured it must
be a complicated process from the file system to the kernel to
the C library to your D program. If I understood correctly, the
kernel exposes the names from the file systems as they are,
unless they are in UTF-16 as with Joliet, in which case they
need to be converted (in that case to an ISO charset, not UTF-8).
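
If you want to look at those raw bytes from D without any UTF-8
assumption, one way (a sketch only; the helper name rawNames is
mine, and it is POSIX-specific) is to go through druntime's
readdir binding instead of std.file, which hands you names as
strings assumed to be UTF-8:

  const(ubyte)[][] rawNames(string dir)
  {
      import core.sys.posix.dirent : DIR, dirent, opendir,
                                     readdir, closedir;
      import std.string : fromStringz, toStringz;

      const(ubyte)[][] names;
      DIR* d = opendir(dir.toStringz);
      if (d is null)
          return names;
      scope(exit) closedir(d);
      for (dirent* e = readdir(d); e !is null; e = readdir(d))
      {
          // d_name holds whatever byte sequence the file system
          // stored; copy it without interpreting it as UTF-8.
          names ~= cast(const(ubyte)[]) fromStringz(e.d_name.ptr).dup;
      }
      return names;
  }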

Wondering how I could make sure file names are always
represented as UTF-8, I was told that passing iocharset=utf8
as the mount option, where applicable, should do the trick. I
haven't looked into that yet as I don't currently have any
issues, but what I took away from it is that this is the only
place a conversion will happen, and my C locale does not
influence it.

Yet the mere possibility that a C string could be in any
encoding is unsettling, especially when strings from a proxy
library and an implementation library are concatenated, as in
"Standard-Audiogerät using DirectSound": the two parts may end
up in different encodings (not in this case, but imagine
Cyrillic or Greek). So to work reliably, all interacting
components need to agree on the charset.
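
One defensive check (a sketch, not something from the libraries
involved; the helper name isUtf8 is mine) is to verify that a C
string really decodes as UTF-8 before treating it as a D string:

  bool isUtf8(const(ubyte)* cstr)
  {
      import std.string : fromStringz;
      import std.utf : validate, UTFException;

      try
      {
          // validate throws UTFException on the first invalid sequence.
          validate(fromStringz(cast(const(char)*) cstr));
          return true;
      }
      catch (UTFException)
      {
          return false;
      }
  }

Anything that fails the check can stay a ubyte*, as described in
the list below.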

When you ask the OpenAL devs why they didn't enforce UTF-8 for
strings, they say that it is just a spec and implementations
are free to use the language default for characters. For most
people that means working with "C strings", whatever they
represent, since most programming languages will just use a C
implementation directly or via bindings. A Haskell developer
also complained about it. Modern programming languages
generally use some Unicode encoding, which makes the C string
issue more obvious.

My current "best practice" is this:

* If a C string represents an identifier in C (e.g. a variable
  or function name), assume it is ASCII and thus a valid UTF-8
  char* in D.

* Otherwise keep it as a ubyte*. Chances are we need to pass
  it back into the C API as-is or won't need to print it on
  screen anyway.

* When a C string has to be displayed, on Windows use:

  wstring ansiToString(const(ubyte)* ansi)
  {
      import core.sys.windows.windows;      // MultiByteToWideChar, CP_ACP
      import std.exception : assumeUnique;

      // First call: query the required buffer size in wchars.
      // Because cbMultiByte is -1, the length includes the NUL.
      auto utf16Len = MultiByteToWideChar(CP_ACP, 0,
                                          cast(const(char)*) ansi, -1,
                                          null, 0);
      if (utf16Len == 0) { /* handle error */ }
      wchar[] utf16 = new wchar[](utf16Len);
      // Second call: perform the actual ANSI -> UTF-16 conversion.
      utf16Len = MultiByteToWideChar(CP_ACP, 0,
                                     cast(const(char)*) ansi, -1,
                                     utf16.ptr, cast(int) utf16.length);
      if (utf16Len == 0) { /* handle error */ }
      // Drop the terminating NUL before returning.
      return assumeUnique(utf16)[0 .. $ - 1];
  }

  I don't know the best practice for other systems yet (one
  possible sketch follows after this list).

* Don't store only a converted UTF-8 version of a C string if
  you intend to pass it back into the C API. If the original
  C string contained invalid sequences, data loss would occur.
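
As for other systems, here is one possible approach, as a sketch
only: it assumes the C strings are Latin-1 (ISO-8859-1), which is
a big assumption, and uses std.encoding to transcode them to
UTF-8. The helper name latin1ToString is mine.

  string latin1ToString(const(ubyte)* cstr)
  {
      import std.string : fromStringz;
      import std.encoding : Latin1String, transcode;

      // Reinterpret the NUL-terminated bytes as a Latin-1 string...
      auto bytes  = fromStringz(cast(const(char)*) cstr).idup;
      auto latin1 = cast(Latin1String) bytes;
      // ...and let std.encoding convert it to UTF-8.
      string result;
      transcode(latin1, result);
      return result;
  }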


