Major performance problem with std.array.front()

Mon Mar 10 07:05:03 PDT 2014

In Italian we need Unicode too. We have several accented letters, 
and programming languages often don't handle UTF-8 and other 
encodings very well...

In D I never had any problem with this, and I work a lot on text 
processing.

So my question: is there a problem with D's Unicode support that 
I'm missing, or is this just a performance problem in the algorithms?

If the problem is the performance of algorithms that use .front() 
but don't care what the data means, why don't we add a .rawFront() 
property, implemented only where it makes sense, and then a 
"fallback" like:

auto rawFront(R)(R range)
    if ( ... isrange ... && !__traits(compiles, range.rawFront))
{
    return range.front;
}

This way copy() and other algorithms can use rawFront(), and it 
stays backward compatible with other ranges too.
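
To make the idea concrete, here is a rough sketch of how it might 
look. CodePoints and eachRaw are names I made up for illustration, 
not anything in Phobos, and I'm assuming rawFront has to return 
exactly the code units that popFront() skips, otherwise a raw copy 
would lose data. I also used __traits(hasMember, ...) in the 
constraint instead of __traits(compiles, ...); that is my own 
substitution.

```d
import std.range : empty, front, isInputRange, popFront;
import std.utf : decode, stride;

// Hypothetical range (not in Phobos): walks a UTF-8 string one code
// point at a time and also exposes the undecoded front element.
struct CodePoints
{
    string data;

    @property bool empty() { return data.length == 0; }

    // Decoding path: validates and decodes the first code point.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Raw path: the code units of the first code point, not decoded.
    @property string rawFront() { return data[0 .. stride(data, 0)]; }

    void popFront() { data = data[stride(data, 0) .. $]; }
}

// Fallback: ranges without their own rawFront just forward to front.
@property auto rawFront(R)(R range)
    if (isInputRange!R && !__traits(hasMember, R, "rawFront"))
{
    return range.front;
}

// Hypothetical copy-like helper that never asks for a decoded element.
void eachRaw(R, F)(R range, F sink)
{
    for (; !range.empty; range.popFront())
        sink(range.rawFront); // member if present, else the fallback
}

unittest
{
    // The member rawFront is used: each piece is an undecoded slice.
    string[] pieces;
    eachRaw(CodePoints("però"), (string s) { pieces ~= s; });
    assert(pieces == ["p", "e", "r", "ò"]);
    assert(pieces[3].length == 2); // "ò" takes two UTF-8 code units

    // int[] has no rawFront member, so the fallback forwards to front.
    int sum;
    eachRaw([1, 2, 3], (int x) { sum += x; });
    assert(sum == 6);
}
```

Note that the raw path still calls stride() to find where the 
element ends, so it only skips the decoding and validation work, 
not the length lookup.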

But I guess I'm missing the point :)

On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
> On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
>> I'm not sure I understood the point of this (long) thread.
>> Is the main problem that decode() is called even when it isn't needed?
>>
>
> I'd like to offer up one D 'user' perspective; it's just a 
> single data point, but perhaps useful. I write applications that 
> process Arabic, and I'm thinking about converting one of those 
> apps from Python to D, for performance reasons.
>
> My app deals with Unicode Arabic text that is 'out there', and 
> the UnicodeTM support for Arabic is not that well thought out, 
> so the data is often (always) inconsistent in terms of 
> sequencing diacritics etc. Even the code page can vary. 
> Therefore my code has to cater to various ways that other 
> developers have sequenced the code points.
>
> So, my needs as a 'user' are:
> * I want to encode all incoming data immediately into Unicode, 
> usually UTF-8, if it isn't already.
> * I want to iterate over code points. I don't care about the 
> raw data.
> * When I get the length of my string it should be the number of 
> code points.
> * When I index my string it should return the nth code point.
> * When I manipulate my strings I want to work with code points
> ... you get the drift.
>
> If I want to access the raw data, which I don't, then I'm very 
> happy to cast to ubyte etc.
>
> If encode/decode is a performance issue then perhaps there 
> could be a cache for recently used strings where the code point 
> representation is held.
>
> BTW to answer a question in the thread, yes the data is 
> left-to-right and visualised right-to-left.