String implementations

Jarrod qwerty at ytre.wq
Sun Jan 20 02:30:32 PST 2008


On Sun, 20 Jan 2008 08:04:01 +0000, Janice Caron wrote:

>> I'm allowing the user to edit config
>> files
> 
> How? With a GUI interface? With a program written in D? With their
> favorite text editor of choice?
> 
> If the latter, then you cannot be sure of the encoding, and that's
> hardly D's fault!

It is the latter.


> Right, but converting from one encoding to another is the job of
> specialised classes. Detecting whether a text file is in ISO-8859-1, or
> Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your
> application were going to do that, you'd have to provide the
> implementation. (Or possibly Tango or some other third party library
> already provides such converters - I don't know). In any case, it's not
> a common enough task to warrant built-in language support.
> 
> But I still don't see what this has got to do with whether or not a[n]
> should identify the (n+1)th character rather than the (n+1)th code unit.

Because this issue isn't really to do with the input file itself; it's to 
do with the potential input characters given in the file. As far as I can 
tell (I'm using a C library to parse the input) it should be ASCII or 
UTF-8 encoded.
Anything else would probably cause the C lexer to screw up.
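Worst case, the input can at least be checked up front before the lexer ever sees it. A rough sketch of what I mean, in D (assuming the Phobos std.utf.validate function; isWellFormedUtf8 is just my own name for the wrapper):

```d
import std.utf : UTFException, validate;

// Returns true if `text` is well-formed UTF-8. Plain ASCII is a strict
// subset of UTF-8, so pure-ASCII config files pass this check too.
bool isWellFormedUtf8(string text)
{
    try
    {
        validate(text);   // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Anything that fails the check can be rejected with a sane error message instead of confusing the C lexer downstream.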


> Cool. So what is the real world use case that necessitates that
> sequences of UTF-8 code units must be addressable by character index as
> the default?

The most important one right now is splicing. I'm allowing both user-
defined and program-defined macros in the input data. They can be 
anywhere within a string, so I need to splice them out and replace them 
with their correct counterparts. I hear the std lib provided with D is 
unreliable, so I'm unwilling to use it. And even if it does get fixed up, 
I'd hate to limit string manipulation to regular expressions.
I also want to cut off input at a certain character count, for spacing 
reasons in both the GUI and the webscript.
I'll have to convert certain characters to their URI equivalents too; 
that will probably take more splicing as well.
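For the record, the splicing itself is safe to do on code-unit indices: a valid UTF-8 substring can only ever match at a character boundary, so the offsets a search returns are safe to slice at. A sketch of the idea (assuming std.string.indexOf; spliceMacro is my own name for it):

```d
import std.string : indexOf;

// Replace every occurrence of `macroName` (e.g. "$(USER)") with `value`.
// indexOf returns a code-unit (byte) offset, but since a well-formed
// UTF-8 needle can only match on a character boundary, slicing the
// string at that offset never splits a multibyte sequence.
string spliceMacro(string text, string macroName, string value)
{
    string result;
    ptrdiff_t pos;
    while ((pos = text.indexOf(macroName)) != -1)
    {
        result ~= text[0 .. pos] ~ value;
        text = text[pos + macroName.length .. $];
    }
    return result ~ text;
}
```

So this particular case works without character indexing; it's the count-based operations below that don't.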

The other thing I'm doing is single-character replacement. Simple stuff 
like capitalising letters and replacing spaces with underscores.
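That part at least is painless if you decode on the fly: foreach over dchar walks a string one character at a time, and appending a dchar back onto a string re-encodes it as UTF-8 automatically. A sketch (normalize is my own name; assuming std.uni.toUpper):

```d
import std.uni : toUpper;

// Walk the UTF-8 string character-by-character (foreach over dchar
// decodes as it goes), uppercasing letters and mapping spaces to '_'.
// Appending a dchar to a string encodes it back to UTF-8 for us.
string normalize(string s)
{
    string result;
    foreach (dchar c; s)
        result ~= (c == ' ') ? '_' : toUpper(c);
    return result;
}
```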

I can think of a lot of other situations that would benefit from proper 
multibyte support too; practically any application that takes frequent 
user input, for instance. A text editor is a very good example. Any 
coders who don't natively deal with Latin text would probably gain a lot 
as well ( think of the poor Japanese coders :< ).
I've seen a lot of programs that print a specified number of characters 
before wrapping around or trailing off, too. The humble GNOME console is 
a good example of that. Character indexing is very handy in that case.
String tokenising and plain old character counting are two more 
operations that could probably be done more easily.
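To be fair, Phobos can already do the counting and a safe cut-off by characters today; it just isn't spelled a[n]. A sketch assuming std.utf.count and std.utf.toUTFindex (truncateChars is my own name):

```d
import std.utf : count, toUTFindex;

// Cut `s` down to at most `n` characters (code points), never splitting
// a multibyte sequence: toUTFindex maps a character index back to the
// corresponding code-unit (byte) index.
string truncateChars(string s, size_t n)
{
    if (count(s) <= n)
        return s;
    return s[0 .. toUTFindex(s, n)];
}
```

Note that count("héllo") is 5 even though "héllo".length is 6, which is exactly the code-unit/character mismatch this whole thread is about.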


In the end I think I'm just tired of having to jump through hoops when it 
comes to string manipulation. I want to be able to say 'this is a 
character, I don't care what it is. Store it, change it, splice it, print 
it.' But instead it seems that if I don't care what the character type 
is, it might not fit. Then I have to allocate then store it, find and 
change it, locate then splice it, convert then print it.
Small annoyances build up over time, and I'm pretty sure I'm not insured 
for blood vessels bursting in my eye.

I live in the hope that one day in the future I'll see something magical 
happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper 
UTF-8 character that I can print, insert into an array, and change.
What a beautiful day that will be.
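Granted, D already gets partway there: 'Δ' won't fit in a char (it's two UTF-8 code units, and char is one byte), but a dchar holds any full 32-bit code point, and a dchar[] really is one character per index. A small demonstration of that workaround:

```d
import std.stdio : writeln;

void main()
{
    // 'Δ' can't fit in a char (one UTF-8 code unit), but dchar
    // holds a complete code point.
    dchar chr = 'Δ';
    writeln(chr);

    // A dchar[] is addressable one character per index, so in-place
    // changes work without any splicing.
    dchar[] buf = "αβγ"d.dup;
    buf[1] = chr;
    writeln(buf);   // αΔγ
}
```

The catch, of course, is that it isn't the default string type, and converting back to UTF-8 for storage and output is still on you.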

Welp, I think I'm done ranting for now. Back to screwing around with 
strings. Or more accurately, procrastinating about screwing around with 
strings.



More information about the Digitalmars-d mailing list