array of randomly generated names

spir denis.spir at gmail.com
Sat Oct 16 02:24:11 PDT 2010


On Fri, 15 Oct 2010 17:46:03 -0700
Jonathan M Davis <jmdavisProg at gmx.com> wrote:

> On Friday, October 15, 2010 12:50:53 spir wrote:
> > Hello,
> > 
> > A few questions raised by a single func.
> > 
> > ===================
> > alias char[] Text ;
> > 
> > Text letters = ['a','b','c',...] ;
> > 
> > Text[] nameSet (uint count , uint size) {
> > 	/* set of count random names of size size */
> > 	Text[] names ; names.length = count ;
> > 	Text name ; name.length = size ;
> > 	for (int i=0 ; i<count ; i++) {
> > 		for (int j=0 ; j<size ; j++)
> > 		    name[j] = letters[uniform(0u,26u)] ;
> > 		names[i].length = size ;
> > 		names[i][] = name ;
> > 	}
> > 	return names ;
> > }
> > ===================

Thank you very much for this extensive reply; comments in-text below.

> First off, _never_ iterate over chars unless you're _sure_ that that's what you 
> want. char and wchar are code units, not code points, so you potentially need 
> multiple of them to have a code point. Letters are code points, not code units. 
> If you need to iterate over characters, use dchar, so dchar[] or dstring. If 
> you're just reading the string, you can have foreach do the conversion for you.
> [...]

Sure, I'm well aware of those issues (so other mentions of the same problem are stripped below).
Actually, even dchars / 32-bit codes representing UCS "abstract characters" do not solve the problem.
(If you wonder what I am alluding to, see the example in the appendix, since it's rather off-topic here.)

> > 1. In the inner loop generating name, I have found neither a way to feed
> > directly ints into name, nor a way to cast ints to chars using to! (also
> > found no chr()). So, I had to list letters. But this wouldn't work with a
> > wide range of unicode chars... How to build name directly from random
> > ints?
> 
> I would expect to!dchar(num) to work where num is an integral value [...].
> If, for some reason, to!dchar(num) does not work, then you can 
> simply cast it cast(dchar)(num), much as that's less desirable.

Read somewhere, I guess, that cast(type) is deprecated in favor of to!(type).
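
A minimal sketch of building a name directly from random ints, without the letters table. It assumes plain lowercase ASCII, so a cast to char is safe; for a wider range of code points one would fill a dchar[] instead. The name randomName is made up for illustration.

```d
import std.random : uniform;

// Hypothetical helper: one random name of `size` lowercase ASCII letters.
char[] randomName(size_t size) {
    auto name = new char[](size);
    foreach (ref c; name) {
        // uniform(0, 26) yields an int in [0, 26); adding 'a' gives a letter code.
        c = cast(char)('a' + uniform(0, 26));
    }
    return name;
}
```

Each call allocates a fresh array, so the result can be stored as is without further copying.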

> [...]

> > 2. I was surprised to get all names equal... Seems that "names[i] = name"
> > actually copies a ref to the name. Is there another way to produce a copy
> > than "names[i][] = name"?
> 
> You really should take a look at http://www.digitalmars.com/d/2.0/arrays.html .
> Static arrays are value types, but dynamic arrays are reference types.

My bad! I read this page, but did not realise that this applies to (flexible)
arrays of *char, like the ones needed for text processing.

> You can even slice them without making any copies. e.g.
> 
> string a = "hello world";
> string b = a[1 .. 7]; //it's a slice
> assert(b == "ello w");
> 
> No copying is taking place there.

Oh, yes. You mean since the string data is immutable, there is no risk in sharing strings?
Does this really mean (like lisp lists, for instance), that b really shares elements (chars) with a?
(So that, if they were mutable instead, changing chars in b would then change a?)
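
A quick check of the sharing question, as a sketch with mutable char[] instead of string. The function name sliceSharingDemo is only for illustration.

```d
// Hypothetical demo: a slice shares elements with the array it was taken from.
void sliceSharingDemo() {
    char[] a = "hello world".dup;   // mutable copy of the literal
    char[] b = a[1 .. 7];           // slice: b refers to a's elements
    assert(b == "ello w");

    b[0] = 'E';                     // writing through b ...
    assert(a == "hEllo world");     // ... is visible through a

    char[] c = a.dup;               // dup allocates an independent copy
    c[0] = 'J';
    assert(a[0] == 'h');            // a is unaffected by changes to c
}
```

So with mutable elements a change through the slice does show up in the original, and only dup (or idup) breaks the link.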

> If you want a copy of an array, you use dup (or 
> idup if you want an immutable copy). e.g.
> 
> string a = "hello world";
> string b = a.idup; //It's an immutable copy.
> 
> Or, if you want to copy an array into an array, you'd do
> 
> string a = "hello world";
> char[] b = new char[](a.length);
> b[] = a[]; //it's a copy.
> assert(b == "hello world");

Right. I'll use dup, seems more self-commenting for me.
By the way, is it possible to alias funcs/methods like types? (I'll try...). To me,
"copy" would be far more obvious than "dup" ;-)

> Notice the empty []. That indicates a slice of the whole array. You could do 
> partial slices instead:
> 
> string a = "hello world";
> char[] b = "silly string".dup; //literals are immutable on Linux, though I think
> //that they're mutable on Windows
> 
> b[2..5] = a[4..7]; //a copy of part of the array.
> assert(b == "sio w string");
> Regardless of how much of the array you copy with [], notice that the slices of 
> the arrays must be of the same length.

Right.

> > 3. As you see, I individually set the length of each names[i] in the outer
> > loop. (This, only to be able to copy, else the compiler complains about
> > unequal lengths.) How can I set the length of all elements of names once
> > and for all?
> 
> You're dealing with a multi-dimensional array. The inner array is empty until 
> you set it, so of course it won't work to index it until it's been set. If you 
> want to set the whole thing at once, then do
> 
> auto names = new dchar[][](numNames, nameLength);

Great, thank you. That's what I was looking for. I'm a bit lost with all the possible syntaxes for doing similar things (I wouldn't have thought of using "new" on an array).
Is "auto" used here because it looks stupid to repeat the type, which is necessary on the right side?
Also, let's say names are dstrings (by casting once built) instead of dchar[]. Is it still possible to dimension names at startup? I mean, how to tell D the size of the elements (names)?
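
Here is how I understand the original function could look with the preallocated form. This is only a sketch: it assumes lowercase ASCII letters, and the loops use foreach rather than my original index loops.

```d
import std.random : uniform;

// Sketch: `count` random names, each of `size` letters, allocated in one go.
dchar[][] nameSet(size_t count, size_t size) {
    // Outer length is count; every inner array already has length size.
    auto names = new dchar[][](count, size);
    foreach (name; names) {         // name is a slice referring to one row
        foreach (ref c; name)       // ref, so the write reaches the stored row
            c = cast(dchar)('a' + uniform(0, 26));
    }
    return names;
}
```

Since every row is filled in place, neither the temporary name nor the per-element length setting is needed any more. For dstring elements I suppose the inner size cannot usefully be preset, because immutable characters cannot be written after allocation; the name would be built in a dchar[] and converted afterwards.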

> Now, personally, I would argue that you really should be using string as much as 
> possible (or dstring when you have to) and avoid mutable arrays of char, wchar, 
> or dchar.

Right, I will try to reverse my point of view and follow your advice as much as possible. Thus, work basically with *string and use *char[] only for the parts of the code that really process text.

> That being the case, I'd advise doing this
> 
> auto names = new string[](numNames);
> 
> then use dchar[] in the for loop (maybe even make it a static one to avoid the 
> memory allocation) and then use to!string() to create a string from it and put it 
> in the list of names. e.g.
> 
> dchar[nameLength] name;
> //...
> names[i] = to!string(name[]); //since it's a static array in this case, you have
> //to slice it to pass it to to!()

Hum, I'm not sure this works because nameLength is a variable; or does it? (I'll try)
It looks strange to me to define a static array from a variable length ;-)
Is the memory allocation issue really relevant, since if it's not allocated for name, then it must be for names[i]?
Also, would idup work here, instead of (pseudo-)slicing? (I'll try this, too)
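
A sketch of the static-buffer variant, under the assumption that nameLength is a compile-time constant (an enum here, with a value chosen only for illustration), since the length of a static array has to be known at compile time.

```d
import std.conv : to;
import std.random : uniform;

enum nameLength = 6;    // assumed compile-time constant; the value is arbitrary

// Sketch: numNames random names as immutable strings.
string[] nameSet(size_t numNames) {
    auto names = new string[](numNames);
    dchar[nameLength] name;             // static buffer, reused for every name
    foreach (ref n; names) {
        foreach (ref c; name)
            c = cast(dchar)('a' + uniform(0, 26));
        n = to!string(name[]);          // slice the buffer; to! allocates the string
    }
    return names;
}
```

The static buffer only saves the scratch allocation for name; each stored string is still allocated by to!string. Using name[].idup instead would compile as well, but it yields a dstring rather than a string, so names would then have to be declared as dstring[].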

> > 4. Is there a kind of map(), or a syntax like list comprehension, to
> > generate array content from a formula? (This would here replace both
> > loops.)
> 
> Not that I'm aware of. Though, if you could define a range (IIRC it would need to 
> be an input range) which generates the next element in the array when popFront() 
> is called, then you could use std.array.array() to create an array from such a 
> range.

Right, later I'll see what D ranges are (I have only read mentions of them so far). I suspect they are more or less what is often called iterators in other languages (or cursors in Eiffel).
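
For the record, a sketch of what the range-based version might look like, replacing both loops. It assumes a recent compiler and Phobos (the lambda syntax and this combination of iota, map and array), and again lowercase ASCII letters only.

```d
import std.algorithm : map;
import std.array : array;
import std.random : uniform;
import std.range : iota;

// Sketch: build every name from a lazy range of random letters,
// then collect the results with std.array.array.
char[][] nameSet(size_t count, size_t size) {
    return iota(count)
        .map!(i => iota(size)
            .map!(j => cast(char)('a' + uniform(0, 26)))
            .array)                 // one freshly allocated name
        .array;                     // the array of names
}
```

The index parameters i and j are unused; iota only serves to say how many elements to generate.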

> [...]
> It's typical in D to just use strings everywhere rather than char[]. However, 
> there are definitely cases where you need mutable arrays of characters, so there 
> has been some discussion of making many (perhaps all) of the std.string 
> functions work with all string types, but that hasn't been done yet. [...]

Right.

> Also, as a side note, I wouldn't advise using an alias for char[] if you intend 
> other people to be reading your code. It's just going to confuse people.

Right. I agree with you for a general alias (like my Text). But it's also considered good practice (maybe not in the D community) to give specific type names to specific meanings (interface), even when the type itself is not changed (implementation). This makes for better self-commenting code. For instance, "alias dchar[] ProductCode" (if computed) or "alias dstring ProductCode" (if read).

> - Jonathan M Davis

Thank you again,
Denis



============ problem with unicode notion "abstract character" =============
import std.stdio, std.string ;

void main () {
    // single letter "â" encoded by combining
    // codes for base 'a' and "non-spacing" '^'
    dstring s = "\u0061\u0302" ;
    writeln(s) ;                // --> "â"    (if your terminal is unicode-aware)
    writeln(s[0]) ;             // --> "a"    (conceptually) WRONG
    writeln(s.indexOf("â")) ;   // --> -1     WRONG
    writeln(s.indexOf("a")) ;   // -->  0     WRONG
}

/* comments

These semantic problems are caused by 2 things:

* Text processing tools (types, libraries) operate in the best case
at the level of codes, which represent "abstract characters".
Those would be better called "marks": in the UCS sense, '^' and 'a'
inside the letter "â" are abstract characters. Indeed, a letter like "â"
is rather considered as a single character, even by a programmer
-- I guess.

* But this is not what UCS calls character; some docs use "grapheme" in
the sense of "perceived character", in an attempt to avoid confusion (1).
Anyway, "â" is for UCS basically 2 "marks", thus encoded with 2 codes.

* But, *some* composite characters like "â" have precomposed forms (2),
which means they can be encoded with a single code. Or, more generally,
with fewer codes than the corresponding decomposed form uses.

This leads to some problems, or rather errors:

1. An ordinary indexing routine returns a code representing a "mark".
Thus, we get 'a' instead of the letter "â". Which makes little sense (3).
Moreover, there is no way to be sure whether this code actually represents
a grapheme or instead is part of a composite character like "â".

2. When searching for "â", a routine searches for an argument
as encoded by *my editor*! If this form matches the form in source
text (here 1 or 2 codes), then the letter is found; else, no.
By pure chance... My editor (geany) happens to use a precomposed
form for "â", so the search fails ;-)

3. (not shown in code) If I explicitly encode one of the 2 possible
forms for "â" as the argument of the search routine,
then this form will be found, if any. The other, not.

4. (not shown in code) To search for "â" absolutely, I thus need
to know of all its possible forms, and search for all of them.
What about all possible forms of a word, of an arbitrary substring?
What about matching or parsing with regexes or patterns in general?
This is simply impossible. The only solution is to transcode
all strings into a canonical form (4), including literals, including
func arguments.

5. Worse: if searching for "a", the routine wrongly finds it because
the code for base 'a' in a composite grapheme is the same as the one
for the whole letter "a".

Hope I'm clear.

(1) In fact, they introduce more confusion, because "grapheme" has
a precise linguistic meaning: in english, 'sh' & 'ti' are graphemes
for a "voiceless postalveolar fricative" consonant written /ʃ/.

(2) Supposedly introduced to help coping with legacy character sets.
(and thus speed up UCS adoption when version 1 was published)
But I fail to see any real advantage, since legacy texts need to be
decoded anyway. Adding a line of code to map legacy codes to unicode
code points is not a big issue, I guess...
And now, we need to deal forever with an encoding scheme far more
complicated and obscure than needed. If all characters were coded
using the decomposed form, all would be simpler, and text processing
would hopefully operate at the relevant level.

(3) I can imagine linguistic apps, for instance, dealing at the level
of codes to count occurrences of given diacritics like eg '^'. But
such uses are rare, and would probably use dedicated tools.

(4) The decomposed form (called NFD) is the only sensible choice, since
it allows coding all kinds of graphemes uniformly. Precomposed applies only
to a subset of chars, thus we still get decomposed chars, and even
half-composed ones! Also, precomposition first requires decomposition.
(At least, the official algorithm.)
*/
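
A sketch of the canonical-form idea from point 4 and note (4), assuming a current Phobos in which std.uni provides normalize and byGrapheme. The helper names sameText and graphemeCount are made up for illustration.

```d
import std.range : walkLength;
import std.uni : NFD, byGrapheme, normalize;

// Hypothetical helper: compare two texts after bringing both to NFD,
// so precomposed and decomposed forms of the same letter compare equal.
bool sameText(dstring a, dstring b) {
    return normalize!NFD(a) == normalize!NFD(b);
}

// Hypothetical helper: count "perceived characters" rather than codes.
size_t graphemeCount(dstring s) {
    return s.byGrapheme.walkLength;
}

unittest {
    dstring decomposed  = "\u0061\u0302"d;  // 'a' followed by combining '^'
    dstring precomposed = "\u00E2"d;        // single code for "â"

    assert(decomposed != precomposed);      // raw comparison of codes fails
    assert(sameText(decomposed, precomposed));
    assert(graphemeCount(decomposed) == 1); // two codes, one letter
}
```

This only covers comparison and counting; searching would still need both the text and the searched argument normalized to the same form first.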
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


