Parsing a UTF-16LE file line by line, BUG?

Nestor via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue Jan 17 03:40:15 PST 2017


On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
>> I see. So correcting my original doubt:
>>
>> How could I parse an UTF16LE file line by line (producing a 
>> proper string in each iteration) without loading the entire 
>> file into memory?
>
> Could... roll your own? Although if you wanted UTF-8 output 
> instead, that would require a second pass or, better yet, 
> changing how i is iterated.
>
> char[] getLine16LE(File inp = stdin) {
>     static char[1024*4] buffer;  // 4k reusable buffer, NOT thread safe
>     int i;
>     while(inp.rawRead(buffer[i .. i+2]) != null) {
>         if (buffer[i] == '\n')
>             break;
>
>         i+=2;
>     }
>
>     return buffer[0 .. i];
> }

Thanks, but unfortunately this function does not produce proper 
UTF-8 strings; in fact, the output even starts with the BOM. It 
also doesn't handle CRLF, and even for LF-terminated lines it 
doesn't seem to work for any line other than the first.
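
To illustrate the BOM point: a UTF-16LE file normally starts with the 
bytes FF FE, which a reader has to recognise and skip rather than 
return as text. A minimal sketch of such a check follows. The names 
Bom and detectBom are made up for illustration and are not part of 
Phobos, and only three marks are covered.

```d
/// Encodings recognisable from a byte order mark (illustrative subset).
enum Bom { none, utf8, utf16le, utf16be }

/// Inspects the first bytes of a file. Returns the mark that was found
/// and stores its length in bytes in `skip` (0 when there is no mark).
/// Caveat: a UTF-32LE mark (FF FE 00 00) would be reported as UTF-16LE,
/// because this sketch does not look at UTF-32 at all.
Bom detectBom(const(ubyte)[] head, out size_t skip)
{
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
    {
        skip = 3;
        return Bom.utf8;
    }
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
    {
        skip = 2;
        return Bom.utf16le;
    }
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
    {
        skip = 2;
        return Bom.utf16be;
    }
    return Bom.none; // `skip` is already 0 because it is an `out` parameter
}
```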

I guess I have to code encoding detection, buffered reading, and 
transcoding by hand. The only problem is that the result could be 
sub-optimal, which is why I was looking for a built-in solution.
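
For the record, the hand-rolled version I have in mind would look 
roughly like the sketch below. Utf16LeLines is a made-up name, not a 
Phobos type. It assumes the input really is UTF-16LE, relies on the C 
stdio buffering behind File.rawRead instead of managing its own 
buffer, and appends one code unit at a time, so it is probably far 
from optimal.

```d
import std.stdio : File;
import std.utf : toUTF8;

/// Hypothetical input range yielding each line of a UTF-16LE file as a
/// UTF-8 string, without loading the whole file into memory.
struct Utf16LeLines
{
    private File file;
    private string current;
    private bool done;
    private bool atStart = true;

    this(File f)
    {
        file = f;
        popFront(); // prime the first line
    }

    @property bool empty() { return done; }
    @property string front() { return current; }

    void popFront()
    {
        wchar[] units;
        ubyte[2] pair;
        bool sawAny = false;

        // Read one 16-bit code unit at a time, low byte first.
        while (file.rawRead(pair[]).length == 2)
        {
            sawAny = true;
            immutable wchar u = cast(wchar)(pair[0] | (pair[1] << 8));
            if (u == '\n')
                break;
            units ~= u;
        }

        if (!sawAny) // nothing left to read: end of file
        {
            done = true;
            return;
        }

        // Drop the BOM, but only at the very start of the file.
        if (atStart && units.length && units[0] == 0xFEFF)
            units = units[1 .. $];
        atStart = false;

        // Treat CRLF like LF by dropping a trailing carriage return.
        if (units.length && units[$ - 1] == '\r')
            units = units[0 .. $ - 1];

        current = toUTF8(units); // throws UTFException on invalid UTF-16
    }
}
```

Usage would be something like 
foreach (line; Utf16LeLines(File("input.txt", "rb"))) writeln(line); 
where input.txt is a placeholder file name and writeln comes from 
std.stdio.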

