Parsing a UTF-16LE file line by line, BUG?
Nestor via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Tue Jan 17 03:40:15 PST 2017
On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
>> I see. So correcting my original doubt:
>>
>> How could I parse an UTF16LE file line by line (producing a
>> proper string in each iteration) without loading the entire
>> file into memory?
>
> Could... roll your own? Although if you wanted UTF-8 output
> instead, that would require a second pass or, better yet,
> changing how i is iterated.
>
> char[] getLine16LE(File inp = stdin) {
>     static char[1024*4] buffer; // 4k reusable buffer, NOT thread safe
>     int i;
>     while(inp.rawRead(buffer[i .. i+2]) != null) {
>         if (buffer[i] == '\n')
>             break;
>
>         i+=2;
>     }
>
>     return buffer[0 .. i];
> }
Thanks, but unfortunately this function does not produce proper
UTF-8 strings; as a matter of fact, the output even starts with the
BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines
it doesn't seem to work for lines other than the first.
I guess I have to code encoding detection, buffered reading, and
transcoding by hand. The only problem is that the result could be
sub-optimal, which is why I was looking for a built-in solution.
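
For reference, a hand-rolled version along those lines might look roughly like the sketch below. It is only an illustration under some assumptions: the input really is UTF-16LE (no encoding detection is attempted), a BOM is skipped only if it is the very first code unit, and a stray odd byte at the end of the file is ignored. The name eachLineUtf16LE and its callback parameter are made up for this example and are not part of Phobos; only File.rawRead and std.utf.toUTF8 are existing library calls.

```d
import std.stdio : File;
import std.utf : toUTF8;

/// Hypothetical helper (not a Phobos function): reads `inp` in fixed-size
/// chunks and calls `sink` once per line, transcoded to UTF-8, without the
/// LF or CRLF terminator.
void eachLineUtf16LE(File inp, scope void delegate(string) sink)
{
    ubyte[4096] chunk;   // even size, so a full read never splits a code unit
    wchar[] line;        // UTF-16 code units of the line being assembled
    bool atStart = true; // true until the first code unit has been seen

    void emit()
    {
        // Drop the CR of a CRLF pair, then transcode the line to UTF-8.
        if (line.length && line[$ - 1] == '\r')
            line = line[0 .. $ - 1];
        sink(toUTF8(line)); // toUTF8 returns a fresh string
        line.length = 0;
    }

    while (true)
    {
        auto got = inp.rawRead(chunk[]);
        if (got.length == 0)
            break; // end of file

        foreach (k; 0 .. got.length / 2)
        {
            // Assemble one little-endian code unit from two bytes.
            wchar u = cast(wchar)(got[2 * k] | (got[2 * k + 1] << 8));

            if (atStart)
            {
                atStart = false;
                if (u == 0xFEFF)
                    continue; // skip the byte order mark
            }

            if (u == '\n')
                emit();
            else
                line ~= u;
        }
    }

    // Last line without a trailing newline.
    if (line.length)
        emit();
}
```

It could be called as eachLineUtf16LE(File("input.txt", "rb"), (string s) { writeln(s); }); where the file name is a placeholder. Splitting on the 0x000A code unit is safe with surrogate pairs, since neither half of a pair can have that value. Only one chunk and the current line are held in memory at a time, although each line still costs an allocation in toUTF8, so this is probably not the optimal approach.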