Major performance problem with std.array.front()
Nick Sabalausky
SeeWebsiteToContactMe at semitwist.com
Mon Mar 10 00:09:08 PDT 2014
On 3/10/2014 12:23 AM, Walter Bright wrote:
> On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
>> On 3/9/2014 6:31 PM, Walter Bright wrote:
>>> On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm at gmx.net>" wrote:
>>>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>>>> than `raw` and `decode`, to match the already existing `byGrapheme`
>>>> in std.uni.
>>>
>>> I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string,
>>> wstring, dstring, and InputRange!char, etc.
>>
>> 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is
>> completely different from anything else:
>>
>> string str;
>> wstring wstr;
>> dstring dstr;
>>
>> (str|wstr|dstr).byChar // Always range of char
>> (str|wstr|dstr).byWchar // Always range of wchar
>> (str|wstr|dstr).byDchar // Always range of dchar
>>
>> str.representation // Range of ubyte
>> wstr.representation // Range of ushort
>> dstr.representation // Range of uint
>>
>> str.byCodeUnit // Range of char
>> wstr.byCodeUnit // Range of wchar
>> dstr.byCodeUnit // Range of dchar
>
> I don't see much point to the latter 3.
>
Do you mean:
1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have
'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?
Responses:
1. Iterating by code unit: Useful for tweaking performance anytime
decoding is unnecessary. For example, parsing a grammar where the bulk
of the keywords and operators are ASCII. (Occasional uses of Unicode,
like Unicode whitespace, can of course be handled easily enough by the
lexer FSM).
2. 'byCodeUnit' if we have 'representation': This one I have trouble
answering since I'm still unclear on the purpose of 'representation' (I
wasn't even aware of it until a few days ago.) I've been assuming
there's some specific use-case I've overlooked where it's useful to
iterate by code unit *while* treating the code units as if they weren't
UTF-8/16/32 at all. But since 'representation' is called *on* a
string/wstring/dstring, they should already be UTF-8/16/32 anyway, not
some other encoding that would necessitate using integer types. Or maybe
it's just for working around problems with the auto-verification being
too eager (I've run into those)? I admit I don't quite get 'representation'.
3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static
if" chain every time you want to use code units inside generic code.
Also, so in non-generic code you can change your data type without
updating instances of 'by*char'.
4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working
on code units doesn't have to special-case UTF-32.
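
To make points 3 and 4 concrete, here is a minimal sketch of the kind of
generic code I have in mind. It assumes a 'byCodeUnit' with the semantics
described above (the std.utf.byCodeUnit found in later Phobos releases
behaves this way); 'countSpaces' is a hypothetical helper made up purely
for illustration.

```d
import std.string : representation;
import std.traits : isSomeString;
import std.utf : byCodeUnit; // assumption: a Phobos release that ships std.utf.byCodeUnit

// Hypothetical helper: counts ASCII spaces without decoding anything.
// This is safe on code units because an ASCII value never occurs inside
// a multi-unit UTF-8 or UTF-16 sequence.
size_t countSpaces(S)(S s)
if (isSomeString!S)
{
    size_t n = 0;
    // cu is a char, wchar or dchar depending on S; no "static if" chain
    // and no special case for UTF-32.
    foreach (cu; s.byCodeUnit)
    {
        if (cu == ' ')
            ++n;
    }
    return n;
}

unittest
{
    assert(countSpaces("a b c") == 2);  // string: iterates char
    assert(countSpaces("a b c"w) == 2); // wstring: iterates wchar
    assert(countSpaces("a b c"d) == 2); // dstring: iterates dchar

    // By contrast, 'representation' exposes plain integers, not character types.
    static assert(is(typeof("abc".representation) == immutable(ubyte)[]));
}
```

With only 'byChar/byWchar/byDchar', the body of 'countSpaces' would need to
pick one of the three based on the element type of S, and changing the string
type in non-generic code would mean updating each call site.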