Using lazy code to process large files

Steven Schveighoffer via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Aug 2 08:52:13 PDT 2017


On 8/2/17 11:02 AM, kdevel wrote:
> On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:
>> As Daniel said, using byCodeUnit will help.
> 
> stripLeft seems to autodecode even when fed with CodeUnits. How do I 
> prevent this?
> 
>        1 void main ()
>        2 {
>        3    import std.stdio;
>        4    import std.string;
>        5    import std.conv;
>        6    import std.utf;
>        7    import std.algorithm;
>        8
>        9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
>       10    auto result = src
>       11       .map!(a => a.byCodeUnit)
>       12       .map!(a => a.stripLeft);
>       13    result.writeln;
>       14 }
> 
> Crashes with a C++-like dump.
> 

First, as a tip, please either post a link to a paste site or leave out 
the line numbers. It's much easier to copy-paste your code into an 
editor if it doesn't have line numbers.

What has happened is that you injected a byte that is not valid UTF-8. In 
UTF-8, any code point above 0x7f must be encoded as a sequence of two or 
more code units, so a lone 0xfc (the Latin-1 encoding of ü) is not a 
valid string. See the table on this page: https://en.wikipedia.org/wiki/%C3%9C
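
To see the difference between a raw Latin-1 byte and the encoded form, 
something like the following should do. It is only a sketch: utf8Bytes 
and isValidUtf8 are helper names made up for this example, built on 
std.utf.encode and std.utf.validate.

```d
import std.stdio : writefln, writeln;
import std.utf : encode, validate, UTFException;

// Hypothetical helper: the UTF-8 code units for a single code point.
ubyte[] utf8Bytes(dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);
    return cast(ubyte[]) buf[0 .. len].dup;
}

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    writefln("%(%02x %)", utf8Bytes('\u00FC')); // lowercase u umlaut: c3 bc
    writefln("%(%02x %)", utf8Bytes('\u00DC')); // uppercase U umlaut: c3 9c
    writeln(isValidUtf8(" \xfc"));              // false: lone Latin-1 byte
    writeln(isValidUtf8(" \xc3\xbc"));          // true: properly encoded
}
```

Running validate over the input before handing it to range code is one 
way to find out early whether the data is really UTF-8.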

If we use a correctly encoded code unit sequence (0xc3 0x9c, which is Ü, 
U+00DC; the lowercase ü, U+00FC, would be 0xc3 0xbc), then it works: 
https://run.dlang.io/is/4umQoo
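
In case the link goes away, here is a sketch of the corrected program 
without line numbers. It is a reconstruction from the code quoted above 
and may not match the paste exactly; strippedUnits is a name made up 
for this example, and the assert is an addition to make the result 
checkable.

```d
import std.algorithm : equal, map;
import std.stdio : writeln;
import std.string : stripLeft;
import std.utf : byCodeUnit;

// Hypothetical helper: strip leading whitespace from each string,
// working on code units instead of auto-decoded code points.
auto strippedUnits(string[] src)
{
    return src
        .map!(a => a.byCodeUnit)
        .map!(a => a.stripLeft);
}

void main()
{
    // blank + UTF-8 encoded U umlaut (0xc3 0x9c)
    string[] src = [" \xc3\x9c"];
    auto result = strippedUnits(src);
    result.writeln;

    // The leading blank is gone; the two code units remain.
    assert(result.front.equal("\xc3\x9c".byCodeUnit));
}
```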

-Steve

