Why UTF-8/16 character encodings?
Dmitry Olshansky
dmitry.olsh at gmail.com
Fri May 24 22:50:52 PDT 2013
On 25-May-2013 02:42, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
>> On 24-May-2013 21:05, Joakim wrote:
> [...]
> As far as Phobos is concerned, Dmitry's new std.uni module has powerful
> code-generation templates that let you write code that operates directly
> on UTF-8 without needing to convert to UTF-32 first.
As it stands there are no UTF-8-specific tables (yet), but there are
tools to create the required abstraction by hand. I plan to grow one for
std.regex, where it will be field-tested and then make its way into the
public interface. In fact it was the needs of std.regex that prompted me
to provide more Unicode machinery in the standard library.
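
To give a rough idea of what I mean, here is a sketch of the kind of
interface such an abstraction would expose. The utfMatcher/CodepointSet
names and signatures below are my assumption about where std.uni is
heading, not something to treat as the final public API:

import std.uni;

void main()
{
    // Set of code points we care about, e.g. the Cyrillic script.
    auto cyrillic = unicode.Cyrillic;

    // Assumed matcher built from generated tables that classifies
    // UTF-8 sequences in place, without decoding to UTF-32 (dchar) first.
    auto m = utfMatcher!char(cyrillic);

    string s = "Привет, world";
    assert(m.test(s));   // peek at the leading code point, do not advance
    assert(m.match(s));  // same check, but skip the matched code point
    assert(s == "ривет, world");
}

The important bit is that the matcher walks the raw UTF-8 code units
through a generated multi-stage table, so the hot path never has to
materialize a dchar.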
> Well, OK, maybe
> we're not quite there yet, but the foundations are in place, and I'm
> looking forward to the day when string functions will no longer have
> implicit conversion to UTF-32, but will directly manipulate UTF-8 using
> optimized state tables generated by std.uni.
Yup, but let's get the correctness part right first, then performance ;)
>
>> Want small - use compression schemes which are perfectly fine and
>> get to the precious 1byte per codepoint with exceptional speed.
>> http://www.unicode.org/reports/tr6/
>
> +1. Using your own encoding is perfectly fine. Just don't do that for
> data interchange. Unicode was created because we *want* a single
> standard to communicate with each other without stupid broken encoding
> issues that used to be rampant on the web before Unicode came along.
>
BTW the linked document describes a _standard_ compression scheme, so
anybody can decode the result. How you compress largely affects the
compression ratio, but not much beyond that.
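
Just to put numbers on the 1-byte-per-code-point point, a quick
back-of-the-envelope check in plain D (the SCSU byte count in the last
comment is worked out by hand, not produced by the code):

import std.stdio;

void main()
{
    // Cyrillic text: 2 bytes per code point in UTF-8 and one 16-bit
    // code unit per code point in UTF-16.
    string  u8  = "Привет";   // .length counts UTF-8 bytes
    wstring u16 = "Привет"w;  // .length counts UTF-16 code units
    writeln(u8.length);       // 12 bytes
    writeln(u16.length * 2);  // 12 bytes
    // An SCSU (UTR #6) encoder selects the predefined Cyrillic window
    // once (one tag byte) and then spends a single byte per code point:
    // roughly 7 bytes for the same 6 code points.
}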
> In the bad ole days, HTML could be served in any number of random
> encodings, often out of sync with what the server claimed the encoding
> was, and browsers would assume arbitrary default encodings that for the
> most part *appeared* to work but were actually fundamentally b0rken.
> Sometimes webpages would show up mostly-intact, but with a few
> characters mangled, because of deviations / variations on codepage
> interpretation, or non-standard characters being used in a particular
> encoding. It was a total, utter mess that wasted who knows how many
> man-hours of programming time to work around. For data interchange on
> the internet, we NEED a universal standard that everyone can agree on.
+1 on these and others :)
--
Dmitry Olshansky