Reading and writing Unicode files

jicman cabrera_ at _wrc.xerox.com
Tue Mar 3 08:41:22 PST 2009


Daniel Keep Wrote:

> 
> 
> jicman wrote:
> > OK, the only reason I say Unicode is that when I open the file in Notepad and do a Save As, the Encoding says Unicode.  So, when I read this file and write it back to another file, the Encoding turns to UTF8.  I want to keep it as Unicode.
> 
> There is no such thing as a Unicode file format.  There just isn't.  I
> know the option you speak of, and I have no idea what it's supposed to
> be; probably UCS-2 or UTF-16.
> 
> > I will give the suggestion a try.  I have not tried it yet.  Maybe Phobos should think about taking care of the BOM bytes and providing support for these encodings.  I am a big fan of Phobos. :-)  I have not tried Tango yet, because I would have to uninstall Phobos, and I have just spent two years using Phobos.  We already have an application based on Phobos, and switching to Tango would slow us down and set us back.  Maybe version 2.0.
> 
> There's std.stream.EndianStream, which looks like it can read and write
> BOMs.  As for converting between UTF encodings, std.utf.

Thanks, Daniel...

I could not get the above to work.  Maybe for lack of understanding and examples, but the code below is working.  The 1000+ XML files I have all carry the same BOM (UTF16_le), so it will work at least for now.  However, I need to fill in the rest later.  I have a question about the code below, which works for UTF16_le:

import std.stdio;
import std.file;
import std.string;
import std.utf;

char[] getBOM(ubyte[] t)
{
  ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
  ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
  ubyte[] UTF8 = [0xef,0xbb,0xbf];
  ubyte[] UTF16_be = [0xfe,0xff];
  ubyte[] UTF16_le = [0xff,0xfe];
  if(t == UTF32_be)
    return "UTF32_be";
  if(t == UTF32_le)
    return "UTF32_le";
  if(t[0 .. 3] == UTF8)
    return "UTF8";
  if(t[0 .. 2] == UTF16_be)
    return "UTF16_be";
  if(t[0 .. 2] == UTF16_le)
    return "UTF16_le";
  return "NO_BOM";
}

void main()
{
  char[] f0 = "Unicode.ttx.xml";
  char[] f1 = "UnicodeNew.ttx.xml";
  auto text = cast(string) f0.read();
  ubyte[4] b = cast(ubyte[]) text[0 .. 4];
  char[] bom = getBOM(b);
  char[] wText;
  writefln(bom);
  if (bom[0 .. 5] == "UTF16")
  {
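    // The BOM is kept (not sliced off).  The cast reinterprets the raw bytes
    // in the machine's native byte order, so this only handles UTF16_le on a
    // little-endian machine.
    wchar[] temp = cast(wchar[]) text; //text[2 .. $];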
    wText = std.utf.toUTF8(temp);
  }
  else if (bom[0 .. 5] == "UTF32")
  {
  }
  else if (bom == "UTF8")
  {
  }
  
  // std.string.find returns -1 when the substring is not found
  if (std.string.find(wText,"DisplayText=\"TrixieTag\">") != -1)
  {
    writefln("Found Trixie Tags in " ~ f0);
    char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
    wText = nt[0];
    nt = std.string.split(nt[1],"</ut>");
    wText ~= nt[1];
    wchar[] eText = std.utf.toUTF16(wText);
    f1.write(cast(void[]) eText);
  }
}


The question: what happens when I get a UTF16_be (big endian) file?  Will the call to std.utf.toUTF16(wText) take care of the BOM?
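
A hedged note on the question: as far as I can tell, std.utf.toUTF16 only converts code units and returns them in the machine's native byte order.  It neither writes nor interprets a BOM.  In the code above the BOM survives only because it is never stripped from the text before conversion.  A UTF16_be file would therefore need its bytes swapped on a little-endian machine.  The sketch below shows one way to do that by hand.  decodeUTF16 and encodeUTF16 are hypothetical helper names, not Phobos functions, and the code assumes the caller has already identified and sliced off the 2-byte BOM.

```d
// Hypothetical helpers, not part of Phobos: a sketch of byte-order-aware
// UTF-16 decoding and encoding that does not depend on the host's endianness.

// Turn raw bytes (BOM already removed by the caller) into UTF-16 code units.
// A trailing odd byte, if any, is ignored.
wchar[] decodeUTF16(ubyte[] raw, bool bigEndian)
{
    wchar[] units = new wchar[raw.length / 2];
    for (size_t i = 0; i < units.length; i++)
    {
        uint first  = raw[2 * i];
        uint second = raw[2 * i + 1];
        // Big endian stores the high byte first, little endian the low byte.
        uint u = bigEndian ? ((first << 8) | second)
                           : ((second << 8) | first);
        units[i] = cast(wchar) u;
    }
    return units;
}

// Turn UTF-16 code units back into bytes in the requested byte order,
// with the BOM (U+FEFF) written in front.
ubyte[] encodeUTF16(wchar[] units, bool bigEndian)
{
    ubyte[] raw = new ubyte[2 * (units.length + 1)];
    for (size_t i = 0; i <= units.length; i++)
    {
        // Slot 0 holds the BOM; the text follows from slot 1 on.
        uint u = 0xFEFF;
        if (i > 0)
            u = units[i - 1];
        ubyte hi = cast(ubyte)(u >> 8);
        ubyte lo = cast(ubyte)(u & 0xFF);
        raw[2 * i]     = bigEndian ? hi : lo;
        raw[2 * i + 1] = bigEndian ? lo : hi;
    }
    return raw;
}
```

With helpers like these, the UTF16 branch could decode the file bytes after the BOM with decodeUTF16(bytes[2 .. $], bom == "UTF16_be"), and the result of std.utf.toUTF16 could go through encodeUTF16 with the same flag before being written, so the output keeps the byte order of the input.  Surrogate pairs need no special handling here, since only the byte order inside each 16-bit unit changes.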

Thanks so much for all the support.

josé


More information about the Digitalmars-d-learn mailing list