Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Fri May 24 22:50:52 PDT 2013


25-May-2013 02:42, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
>> 24-May-2013 21:05, Joakim wrote:
> [...]

> As far as Phobos is concerned, Dmitry's new std.uni module has powerful
> code-generation templates that let you write code that operates directly
> on UTF-8 without needing to convert to UTF-32 first.

As it stands there are no UTF-8-specific tables (yet), but there are 
tools to create the required abstraction by hand (a minimal sketch 
follows below). I plan to grow one for std.regex, where it will be 
field-tested before making it into the public interface. In fact, the 
needs of std.regex are what prompted me to provide more Unicode 
machinery in the standard library.
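
To give an idea of the hand-built route, here is a minimal sketch using 
the new std.uni primitives (unicode, CodepointSet and toTrie are real 
std.uni API; the variable names are just illustrative). Note that it 
still classifies decoded code points (dchar), not raw UTF-8 code units:

import std.uni;

void main()
{
    // A ready-made set of code points from the bundled Unicode data.
    auto cyrillic = unicode.Cyrillic;

    // Build a multi-stage packed trie from the set. This is the
    // code-generation part: done at run time here, but it can also
    // be computed at compile time via CTFE.
    auto isCyrillic = toTrie!1(cyrillic);

    assert(isCyrillic['я']);   // classifies a decoded code point
    assert(!isCyrillic['z']);
}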

> Well, OK, maybe
> we're not quite there yet, but the foundations are in place, and I'm
> looking forward to the day when string functions will no longer
> implicitly convert to UTF-32, but will directly manipulate UTF-8 using
> optimized state tables generated by std.uni.

Yup, but let's get the correctness part right first, then performance ;)
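
As a tiny illustration of the kind of correctness issue at stake (this 
is well-known D behavior, not something from the thread): the same 
string has different lengths depending on whether you count UTF-8 code 
units or decoded code points, and byte-level shortcuts that forget this 
are exactly the bugs to avoid:

import std.range : walkLength;

void main()
{
    string s = "naïve";        // 'ï' (U+00EF) takes two UTF-8 code units
    assert(s.length == 6);     // .length counts code units
    assert(s.walkLength == 5); // range primitives decode to code points
}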

>
>> Want small? Use compression schemes, which are perfectly fine, and
>> get to the precious 1 byte per code point with exceptional speed.
>> http://www.unicode.org/reports/tr6/
>
> +1.  Using your own encoding is perfectly fine. Just don't do that for
> data interchange. Unicode was created because we *want* a single
> standard to communicate with each other without stupid broken encoding
> issues that used to be rampant on the web before Unicode came along.
>

BTW the linked document describes a _standard_ compression scheme, so 
anybody can decode the result. The choice of compressor largely affects 
the compression ratio but not much beyond that (see the toy sketch 
below).
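
To make the "1 byte per code point" claim concrete, here is a toy 
sketch of the windowing idea behind SCSU (UTS #6). It is NOT conforming 
SCSU -- no tag bytes, no window switching, and toyWindowEncode is a 
made-up name -- just a model of why small-alphabet text shrinks to 
about one byte per code point:

import std.exception : enforce;

// Encodes text where every non-ASCII code point lies inside a single
// 128-code-point window starting at windowBase.
ubyte[] toyWindowEncode(dstring text, dchar windowBase)
{
    ubyte[] result;
    foreach (dchar c; text)
    {
        if (c < 0x80)
            result ~= cast(ubyte) c;   // ASCII: one byte, as-is
        else
        {
            enforce(c >= windowBase && c < windowBase + 0x80,
                    "code point outside the single toy window");
            // Window byte: offset within the window, high bit set.
            result ~= cast(ubyte)(0x80 + (c - windowBase));
        }
    }
    return result;
}

void main()
{
    // Russian text: every letter fits in a window based at U+0400.
    auto encoded = toyWindowEncode("привет, мир!"d, '\u0400');
    assert(encoded.length == 12);   // 12 code points -> 12 bytes
}

Real SCSU adds tag bytes to define and switch among eight dynamic 
windows, so mixed-script text still decodes unambiguously.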

> In the bad ole days, HTML could be served in any number of random
> encodings, often out-of-sync with what the server claimed the encoding
> was, and browsers would assume arbitrary default encodings that for the
> most part *appeared* to work but were actually fundamentally b0rken.
> Sometimes webpages would show up mostly-intact, but with a few
> characters mangled, because of deviations / variations on codepage
> interpretation, or non-standard characters being used in a particular
> encoding. It was a total, utter mess that wasted who knows how many
> man-hours of programming time on workarounds. For data interchange on
> the internet, we NEED a universal standard that everyone can agree on.

+1 on these points and the rest :)

-- 
Dmitry Olshansky

