Is there a native function to detect if file is UTF encoding?
Kiith-Sa via Digitalmars-d
digitalmars-d at puremagic.com
Fri Aug 22 07:59:00 PDT 2014
On Friday, 22 August 2014 at 13:53:04 UTC, MacAsm wrote:
> To call decode() from std.encoding I need to make sure it is an
> UTF (may be ASCII too) otherwise it will skip over ASCII
> values. Is there any D native for it or I need to check byte
> order mark and write one myself?
This may be simpler for reference:
https://github.com/kiith-sa/tinyendian/blob/master/source/tinyendian.d
Note that you _can't_ reliably differentiate between UTF-8 and
plain ASCII,
because not all UTF-8 files start with a UTF-8 BOM.
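
If a BOM does happen to be present, checking for it yourself is only a
few lines. Below is a rough sketch, not taken from tinyendian: BomKind,
hasPrefix and detectBOM are names made up for this example. The byte
sequences are the standard BOM encodings. UTF-32LE is tested before
UTF-16LE because its BOM begins with the same two bytes.

```d
/// Encodings a byte order mark can identify (hypothetical enum for this sketch).
enum BomKind { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// True if `data` begins with the bytes in `prefix` (hypothetical helper).
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix) @safe pure nothrow @nogc
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Returns the encoding indicated by a BOM at the start of `data`,
/// or BomKind.none if no known BOM is found.
BomKind detectBOM(const(ubyte)[] data) @safe pure nothrow @nogc
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    // The 4-byte BOMs go first: FF FE 00 00 would otherwise be
    // reported as UTF-16LE.
    if (hasPrefix(data, utf32le)) { return BomKind.utf32le; }
    if (hasPrefix(data, utf32be)) { return BomKind.utf32be; }
    if (hasPrefix(data, utf8))    { return BomKind.utf8; }
    if (hasPrefix(data, utf16le)) { return BomKind.utf16le; }
    if (hasPrefix(data, utf16be)) { return BomKind.utf16be; }
    return BomKind.none;
}
```

A result of BomKind.none only means there was no BOM; the data may
still be UTF-8 or plain ASCII.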
However, you can (relatively) quickly determine if a UTF-8/ASCII
buffer contains only ASCII characters: every byte of a multi-byte
UTF-8 sequence has the topmost bit set, and ASCII bytes never do,
so you can use a 64-bit bitmask and check 8 bytes at a time.
See
https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d,
specifically the countASCII() function - it should be easy to
change it into 'detectNonASCII':
/// Counts the number of ASCII characters in buffer until the
/// first UTF-8 sequence.
///
/// Used to determine how many characters we can process without
/// decoding.
size_t countASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
    size_t count = 0;
    // The topmost bit in ASCII characters is always 0
    enum ulong Mask8 = 0x7f7f7f7f7f7f7f7f;
    enum uint Mask4 = 0x7f7f7f7f;
    enum ushort Mask2 = 0x7f7f;

    // Start by checking in 8-byte chunks.
    while(buffer.length >= Mask8.sizeof)
    {
        const block = *cast(typeof(Mask8)*)buffer.ptr;
        const masked = Mask8 & block;
        // Masking changed the block, so some byte has its top bit set.
        if(masked != block) { break; }
        count += Mask8.sizeof;
        buffer = buffer[Mask8.sizeof .. $];
    }

    // If 8 bytes didn't match, try 4, 2 bytes.
    import std.typetuple;
    foreach(Mask; TypeTuple!(Mask4, Mask2))
    {
        if(buffer.length < Mask.sizeof) { continue; }
        const block = *cast(typeof(Mask)*)buffer.ptr;
        const masked = Mask & block;
        if(masked != block) { continue; }
        count += Mask.sizeof;
        buffer = buffer[Mask.sizeof .. $];
    }

    // If even a 2-byte chunk didn't match, test just one byte.
    // Checked via length so the snippet needs no std.range import.
    if(buffer.length == 0 || buffer[0] >= 0x80) { return count; }
    ++count;
    return count;
}
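
For illustration, a 'detectNonASCII' along those lines might look
roughly like this. It is only a sketch and is not part of D-YAML. It
tests the top bit of each byte with a 0x80 mask rather than comparing
against a 0x7f-masked copy, and like countASCII() it assumes a
platform where unaligned 8-byte loads are acceptable (e.g. x86).

```d
/// Returns true if `buffer` contains at least one byte with the
/// topmost bit set, i.e. at least one non-ASCII byte.
/// Hypothetical function, sketched for this example only.
bool detectNonASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
    // One 0x80 per byte: selects the topmost bit of each of 8 bytes.
    enum ulong HighBits = 0x8080808080808080;

    // Check 8 bytes at a time while enough data remains.
    while (buffer.length >= ulong.sizeof)
    {
        // Assumption: unaligned reads are fine on the target platform.
        const block = *cast(const(ulong)*) buffer.ptr;
        if ((block & HighBits) != 0) { return true; }
        buffer = buffer[ulong.sizeof .. $];
    }

    // At most 7 bytes are left; check them one by one.
    foreach (c; buffer)
    {
        if (c >= 0x80) { return true; }
    }
    return false;
}
```

If this returns false the buffer is pure ASCII and needs no decoding at
all. If it returns true the buffer still has to be validated as UTF-8,
since any 8-bit encoding also sets the top bit.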