readln() returns new line character

Dmitry Olshansky dmitry.olsh at gmail.com
Sun Dec 29 11:41:46 PST 2013


29-Dec-2013 23:28, Vladimir Panteleev wrote:
> On Sunday, 29 December 2013 at 18:45:36 UTC, Dmitry Olshansky wrote:
>> I've come to the conclusion that the only sane line ending behavior is to
>> do what the Unicode standard says, and detect the following pattern as
>> a line separator:
>>
>> \r\n | \r | \f | \v | \n | \u0085 | \u2028 | \u2029
>>
>> This includes never breaking a line in the middle of a \r\n sequence.
>
> I don't think something as basic as a line-splitting function should do
> UTF decoding unless the user asks for it explicitly.

I haven't said decode :)
Just match the pattern as UTF-8 bytes explicitly. The bulk of these
separators is sidestepped after a single test instruction plus a
conditional branch (one that is fairly predictable, as it is almost
never taken).
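
To make the idea concrete, here is a minimal sketch of byte-level matching
with no decoding step. It is written in Python purely for illustration and
is not the Phobos implementation; the name split_lines is hypothetical. It
relies on the UTF-8 encodings of the non-ASCII separators: U+0085 is C2 85,
U+2028 is E2 80 A8, and U+2029 is E2 80 A9. The ASCII separators all fall in
the range 0x0A..0x0D, so a low-level version could reject most bytes with
very few comparisons; the exact "single test" is left open here.

```python
def split_lines(data: bytes) -> list:
    """Split raw bytes on the separator pattern without decoding UTF-8.

    Hypothetical helper for illustration only. Bytes that are not part of
    a separator are passed through untouched, so invalid UTF-8 never
    raises an error.
    """
    lines = []
    start = 0  # index where the current line begins
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        sep_len = 0
        if b == 0x0D:
            # \r, or \r\n treated as one separator (never split in between)
            sep_len = 2 if i + 1 < n and data[i + 1] == 0x0A else 1
        elif 0x0A <= b <= 0x0C:
            # \n, \v, \f
            sep_len = 1
        elif b == 0xC2 and i + 1 < n and data[i + 1] == 0x85:
            # U+0085 (NEL) encoded as C2 85
            sep_len = 2
        elif b == 0xE2 and data[i + 1:i + 3] in (b"\x80\xa8", b"\x80\xa9"):
            # U+2028 / U+2029 encoded as E2 80 A8 / E2 80 A9
            sep_len = 3
        if sep_len:
            lines.append(data[start:i])
            i += sep_len
            start = i
        else:
            i += 1
    if start < n:
        # trailing text with no final separator
        lines.append(data[start:])
    return lines
```

For example, split_lines(b"caf\xe9\nx") yields the two lines b"caf\xe9" and
b"x" even though 0xE9 on its own is not valid UTF-8, because no byte is ever
decoded.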

> Getting UTF-8
> decoding errors in splitLines when working with ASCII files has caused
> me enough frustration to stop using that function altogether (unless I
> *KNOW* the text is valid UTF-8). I've yet to encounter a need to split
> by anything other than \n and \r\n.

I would argue there is a way to do that almost as cheaply as handling
the trio of \r | \n | \r\n would be. Personal experience notwithstanding,
it would be better to do the right thing.

P.S. What I know for sure is that there is a strong need for having 
better support for other encodings. Raw ASCII included, but encoding 
assumptions must be explicit.

-- 
Dmitry Olshansky

