Freedom, D and Chapel language

bearophile bearophileHUGS at lycos.com
Mon Apr 6 00:32:52 PDT 2009


Despite D being a system language, I use it mostly for numerical processing, data processing, and so on, where Python is too slow. I also appreciate D because it gives me _freedom_ to use memory almost as I want (especially from the C heap; the GC heap adds some constraints, because it allocates memory in powers of 2), so I can implement data structures like trees and other things based on enums and tagged pointers that are impossible to implement efficiently in Python. Such data structures are necessary when you want to write certain high-performance code that works on gigabytes of data.

By _freedom_ I mean the freedom to use, without too much complexity, most of the hardware of the PC. I have a powerful (but cheap enough) GPU and a 64-bit 2-core CPU with SSE3 vector instructions, and I still essentially have no language that lets me use all this powerful hardware. I want a language that runs my code on both cores and the GPU at the same time. And I don't want to spend weeks of my time writing such code. It's not good to have a lot of hardware and then have languages that allow you to use just 15% of it.

D may become a language suitable for numeric processing too, and seeing how well it manages floating point values (D2 is probably now one of the best languages around in that regard) this may even become one of its main purposes/usages. To do this I think D has to develop much better management of parallelism (data parallelism and otherwise) and array processing.

There's no single best way to do parallel processing; you need several, chosen according to the problem, and even to solve different parts of one problem. Chapel supports several ways: some of them can be mapped onto different cores, and others onto the vector parallelism of SSE4/AVX and Larrabee.

The Chapel language looks potentially able to give some of that freedom. To me it looks like a potentially very useful language.

In the past I have said something about Chapel, but this time I have taken a much closer look. I may even try using this language a bit.

Main site:
http://chapel.cs.washington.edu/

The following notes come mostly from the documents in this gz here:
http://chapel.cs.washington.edu/SC08/slides/SC08-S07-Chapel-Tutorial.tar.gz

Take a look in that document pack if you want more info, especially regarding the "task parallelism" and the "locale type".

Below I don't talk about things that I think are both badly designed AND not interesting.

------------------

Among its types there are real, imag, and complex. It's the same as the current design of the D language (which suggests not changing D's current design).


To specify sized variants of a type you use:
int(64), int(8), real(32)
These are equivalent to D's:
long, byte, float

If you don't add the () it uses the default size.

------------------

It has "configuration constants/variables". In a program you can write:
config const n = 10;
config const epsilon = 0.01;
config const verbose = false;

They are like a standard declaration of constants or variables, but they support command-line overrides. They must be declared at global scope:
> ./a.out --n=10000 --epsilon=0.0000001 --verbose=true
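
For comparison, here is a rough Python sketch of the same idea, written by hand with the standard argparse module (the flag names mirror the Chapel example; Chapel generates this command-line handling automatically from the config declarations):

```python
# A hand-written Python analogue of Chapel's config constants.
# Chapel derives the --n / --epsilon flags automatically;
# with argparse you have to declare each one yourself.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--n", type=int, default=10)
parser.add_argument("--epsilon", type=float, default=0.01)
args = parser.parse_args(["--n=10000", "--epsilon=0.0000001"])

assert args.n == 10000 and args.epsilon == 1e-7
```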

------------------

Implicit conversions are similar to the C# ones:
- reals do not implicitly convert to ints as in C
- ints and uints don't interconvert as handily as in C (uints can be implicitly converted only to larger ints/uints).

------------------

Among the operators:
**   power
<=>  swap (I think this is overkill; a library function template would be enough)

------------------

Among its types there are also:
  Ranges: regular integer sequences
  Domains: index sets (explained below)


Range examples; they are closed on the right:
1..6              // 1, 2, 3, 4, 5, 6
1..6 by -1        // 6, 5, 4, 3, 2, 1
6..1              // an empty sequence
1..6 by 2         // 1, 3, 5
1..6 by -4        // 6, 2
1..               // 1, 2, 3, 4, 5, 6, 7, 8, ...

I think Python slice syntax is much better:
start:stops:stride

# is used for open intervals:
0..#6            // 0, 1, 2, 3, 4, 5
0..#6 by 2       // 0, 2, 4
0..by 2 #6       // 0, 2, 4, 6, 8, 10
1..6 #3          // 1, 2, 3
1..6 by -1 # 3   // 6, 5, 4

I think it's better for intervals to be open on the right by default, as in Python and D.

I don't like this # syntax much, but it's better than the .. and ... of Ruby: a # is harder to overlook than a dot.
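
To make the closed/half-open difference concrete, here are the same sequences with Python's range(), which is half-open; a Chapel closed range 1..6 needs a +1 on the stop (a sketch for comparison, not Chapel code):

```python
# Python analogues of the Chapel ranges above; range() is half-open,
# so the closed Chapel range 1..6 becomes range(1, 7).
assert list(range(1, 7)) == [1, 2, 3, 4, 5, 6]        # 1..6
assert list(range(6, 0, -1)) == [6, 5, 4, 3, 2, 1]    # 1..6 by -1
assert list(range(1, 7, 2)) == [1, 3, 5]              # 1..6 by 2
# Chapel's counted form 0..#6 is already half-open, like range(6):
assert list(range(0, 6)) == [0, 1, 2, 3, 4, 5]        # 0..#6
assert list(range(0, 6, 2)) == [0, 2, 4]              # 0..#6 by 2
assert list(range(0, 12, 2)) == [0, 2, 4, 6, 8, 10]   # 0.. by 2 #6
```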

------------------

Arrays:

var A: [1..3] int,          // a 3-element array of ints
    B: [1..3, 1..5] string, // a 2D 3x5 array of strings
    C: [1..3] [1..5] real;  // a 3-element array of 5-element arrays of reals

------------------

For loops can be done in several ways; if you use a range you can't change the loop variable:

for i in 1..3 do i += 1; // illegal, ranges yield consts

------------------

Loops aren't much different from Python ones, but you can use () to "zip" the ranges together, and [] to get the cross product instead. Here I like the Python syntax more; its explicit "zip" is more readable.

for i in [0..1, 0..1] ...  // i = (0,0); (0,1); (1,0); (1,1)
for i in (0..1, 0..1) ...      // i = (0,0); (1,1)
for (x,y) in (0..1, 0..1) ...  // x=0, y=0;  x=1, y=1

------------------

It has Named Argument Passing. Arguments may be matched by name rather than position:

def foo(x: int = 2, y:int = x) { ... }
foo();          // equivalent to foo(2, 2);
foo(3);         // equivalent to foo(3, 3);
foo(y=3);       // equivalent to foo(2, 3);
foo(y=3, x=2);  // equivalent to foo(2, 3);
foo(y=3, 2);    // equivalent to foo(2, 3);

I am still waiting for this in D.
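
Python supports this directly; the Chapel example translates almost one-for-one. The one wrinkle is that Python evaluates default expressions at definition time, so Chapel's "y defaults to x" needs a sentinel (a sketch for comparison):

```python
# Chapel's named-argument example in Python. "y: int = x" has no
# direct equivalent, so a None sentinel emulates it.
def foo(x=2, y=None):
    if y is None:  # emulate Chapel's "y defaults to x"
        y = x
    return (x, y)

assert foo() == (2, 2)
assert foo(3) == (3, 3)
assert foo(y=3) == (2, 3)
assert foo(y=3, x=2) == (2, 3)
```

Chapel's last form, foo(y=3, 2), puts a named argument before a positional one; that is a syntax error in Python.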

------------------

It has built-in Iterators:

def RMO() {
  for i in 0..#m do
    for j in 0..#n do
      yield (i, j);
}

can be used to drive for loops:

for (i,j) in RMO() do
  use (i, j)...

I do this all the time in Python. This is one of those things that takes 5 minutes to understand and that you then use all the time, so it reduces the complexity of programming.
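
The Python equivalent is a plain generator (here with m and n as explicit parameters rather than globals):

```python
# The RMO iterator as a Python generator: yields row-major-order
# index pairs, and can drive a for loop just like the Chapel version.
def rmo(m, n):
    for i in range(m):
        for j in range(n):
            yield (i, j)

assert list(rmo(2, 3)) == [(0, 0), (0, 1), (0, 2),
                           (1, 0), (1, 1), (1, 2)]
```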

------------------

Classes/records support a standard method named these(), similar to D's opApply(), but its syntax is much nicer, because you just use "yield" inside it to yield items.

If you also mark the these() method with "var":

class Pair {
  var x, y: real;

  def these() var {
    yield x;
    yield y;
  }
}

Then, given a var p = new Pair(), you can even modify values as you iterate:

for x in p {
  writeln(x);
  x = -x;
}
writeln(p.x);
writeln(p.y);

It's like in D where you can modify x inside the loop:

foreach (ref x; somearray) x = -x;

I don't remember if this is also possible with D's opApply(), but Chapel's syntax is surely much better, and I don't need to look in the docs every time I use it, as I do in D.

------------------

Chapel has refined built-in ways to use both task parallelism and data parallelism. See document 3.

I tried to write a summary here of the task parallelism features of Chapel, but in the end I scrapped it, because I don't know enough about this. It uses the statements begin, single, sync, cobegin, coforall, and serial to do structured and unstructured task parallelism, and to perform parallel loops.

------------------

The section about Data Parallelism is even bigger; it's probably Chapel's strongest point.

A "domain" is a first-class index set. There three main classes of them:
- arithmetic: indices are Cartesian tuples. (rectilinear, multidimensional, optionally strided and/or sparse)
- associative: indices serve as hash keys (supports hash tables, dictionaries)
- opaque: indices are anonymous (supports sets, graph-based computations)

An example of arithmetic (2D) domain:
var m = 4, n = 8;
var D: domain(2) = [1..m, 1..n];

Subdomains, a part of a parent domain:
var m = 4, n = 8;
var D: domain(2) = [1..m, 1..n];
var InnerD: subdomain(D) = [2..m-1, 2..n-1];

It's like a generalization of Pascal's subrange types.
You can use them, for example, to define the type of a matrix,
and to loop over the indices:

forall (i, j) in InnerD do
    A(i, j) = i + j / 10.0;

Forall loops also support other forms:

To allocate a matrix and fill it (in parallel too, see below):
var A: [D] real = forall (i,j) in D do i + j/10.0;

A short syntax to fill a matrix:
[(i,j) in D] A(i,j) =  i + j/10.0;

To allocate it and fill it:
var A: [(i,j) in D] real = i + j/10.0;
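
All three forms amount to building the array from its index set. A sequential Python analogue is a comprehension over the 2D index set, here with a dict keyed by the index tuples (Chapel runs the same thing in parallel):

```python
# Sequential Python sketch of "[(i,j) in D] A(i,j) = i + j/10.0":
# build the 2D index set D, then fill an array keyed by it.
m, n = 4, 8
D = [(i, j) for i in range(1, m + 1) for j in range(1, n + 1)]
A = {(i, j): i + j / 10.0 for (i, j) in D}

assert len(A) == m * n
assert abs(A[(1, 1)] - 1.1) < 1e-12
assert abs(A[(4, 8)] - 4.8) < 1e-12
```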

------------------

Several types of for loops:

"for" loops: one task executes all iterations.

"forall" loops: some number of tasks executes the iterations.

"coforall" loops: one task per iteration.

-----------------

Arrays can be sliced with domains, and the sliced data copied:

A[InnerD] = B[InnerD];

-----------------

As I've said, domain indices can be more than just dense integer tuples like this:
var DnsDom: domain(2) = [1..10, 0..24],

strided:
StrDom: subdomain(DnsDom) = DnsDom by (2,4),

sparse:
SpsDom: subdomain(DnsDom) = genIndices();

associative:
NameDom: domain(string) = readNames();

Or even graphs (graphs are a built-in data type! This is higher level than Python):
var GrphDom: domain(opaque);


Then you can declare arrays in a very uniform way:

var DnsArr: [DnsDom] complex,
    StrArr: [StrDom] real(32),
    SpsArr: [SpsDom] real;

var GrphArr: [GrphDom] string,
    NameArr: [NameDom] int(64);

All domain types support iteration:
forall ij in StrDom {
  DnsArr(ij) += SpsArr(ij);
}

All domain types support array slicing too:
DnsArr[StrDom] += SpsArr[StrDom];

All domain types support array reallocation too (reassigning a domain reallocates the arrays declared over it):
StrDom = DnsDom by (2,2);
SpsDom += genEquator();

The syntax here is natural enough; I have no problem understanding such small programs even though it's the first time I've seen them. This is quite positive: it means the design is good.

Bounds checks are often not necessary, because the compiler knows that certain indices can't go outside certain ranges. This means you get both safety and speed.

Opaque Domains and Arrays can be used to create graphs, trees, sets in a nice way. This is new for me and I don't fully understand it at first sight (document 4, page 12).

-----------------

Promotion:

Functions/operators expecting scalar values can also take:

Arrays, causing each element to be passed in:
A = sin(B);
B = 2 * A;

Or domains, causing each index to be passed in:
def foo(x: (int, int)) { ... }
foo(SpsDom); // calls foo once per index in SpsDom

When multiple arguments are promoted, calls may use zippered promotion (this uses forall internally, so it's partially parallel):
X = pow(A, B); // X is 2D; X(i,j) = pow(A(i,j), B(i,j))
Or tensor promotion:
Y = pow[A, B]; // Y is 2x2D;
               // Y(i,j)(k,l) = pow(A(i,j), B(k,l))
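
The zippered/tensor distinction can be sketched with Python comprehensions over flat lists (Chapel does this over whole arrays, in parallel):

```python
# Zippered vs tensor promotion, using ** as the promoted scalar op.
A = [1.0, 2.0]
B = [3.0, 4.0]

zippered = [a ** b for a, b in zip(A, B)]   # like pow(A, B)
tensor = [[a ** b for b in B] for a in A]   # like pow[A, B]

assert zippered == [1.0, 16.0]
assert tensor == [[1.0, 1.0], [8.0, 16.0]]
```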

-----------------

Reduce is a built-in:
tot = + reduce A;  // tot is the sum of all elements in A
big = max reduce [i in InnerD] abs(A(i) + B(i)); // finds the max

In practice reduce isn't that useful once you have sum(), max(), min(), product(), and a few other functions that accept an iterable. Understanding reduce isn't always easy; that's why it's not in my dlibs, and why Python 3 has demoted it from a builtin to functools.reduce.
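
In Python terms, the two Chapel reductions above look like this; sum()/max() cover the common cases, and functools.reduce is the general form:

```python
# Chapel's "+ reduce A" and "max reduce [...] abs(A(i)+B(i))"
# expressed with Python's sum(), max(), and functools.reduce.
from functools import reduce
import operator

A = [1.5, -2.5, 3.0]
B = [0.5, -0.5, -4.0]

tot = sum(A)                                 # tot = + reduce A
assert tot == reduce(operator.add, A, 0.0)   # the general form

big = max(abs(a + b) for a, b in zip(A, B))  # max reduce abs(A+B)
assert big == 3.0
```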

-----------------

It also has "scan" as a built-in; this is less common, and I have seen something like it only in Mathematica:

var A, B, C: [1..5] real;
A = 1.1;                // A is: 1.1 1.1 1.1 1.1 1.1
B = + scan A;           // B is: 1.1 2.2 3.3 4.4 5.5
B(3) = -B(3);           // B is: 1.1 2.2 -3.3 4.4 5.5
C = min scan B;         // C is: 1.1 1.1 -3.3 -3.3 -3.3

I don't know how often I would use this scan.
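
Chapel's scan corresponds to itertools.accumulate in Python (integers are used for the + scan here so the expected values are exact):

```python
# "+ scan" and "min scan" via itertools.accumulate.
from itertools import accumulate

A = [1] * 5
assert list(accumulate(A)) == [1, 2, 3, 4, 5]  # + scan A

B = [1.1, 2.2, -3.3, 4.4, 5.5]
C = list(accumulate(B, min))                   # min scan B
assert C == [1.1, 1.1, -3.3, -3.3, -3.3]
```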

-----------------

A Jacobi iteration in Chapel; it's very readable:

config const n = 6,
             epsilon = 1.0e-5;

const BigD: domain(2) = [0..n+1, 0..n+1],
      D: subdomain(BigD) = [1..n, 1..n],
      LastRow: subdomain(BigD) = D.exterior(1,0);

var A, Temp : [BigD] real;

A[LastRow] = 1.0;

do {
    [(i,j) in D] Temp(i,j) = (A(i-1,j) + A(i+1,j) +
                              A(i,j-1) + A(i,j+1)) / 4;
    const delta = max reduce abs(A[D] - Temp[D]);
    A[D] = Temp[D];
} while (delta > epsilon);

writeln(A);


It uses a Temp matrix because [(i,j) in D] is partially parallel, so the end result is higher performance. This code can map to the GPU very well.
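
For comparison, here is a sequential pure-Python sketch of the same computation with nested lists (Chapel runs the stencil update in parallel over D; LastRow is taken as the bottom row just outside D, matching exterior(1,0)):

```python
# Sequential Python sketch of the Jacobi iteration above.
n, epsilon = 6, 1.0e-5

# BigD is the (n+2) x (n+2) grid, indices 0..n+1.
A = [[0.0] * (n + 2) for _ in range(n + 2)]
for j in range(1, n + 1):       # LastRow: bottom row outside D
    A[n + 1][j] = 1.0

while True:
    temp = [row[:] for row in A]
    for i in range(1, n + 1):   # interior domain D = [1..n, 1..n]
        for j in range(1, n + 1):
            temp[i][j] = (A[i - 1][j] + A[i + 1][j] +
                          A[i][j - 1] + A[i][j + 1]) / 4
    delta = max(abs(temp[i][j] - A[i][j])
                for i in range(1, n + 1) for j in range(1, n + 1))
    A = temp
    if delta <= epsilon:
        break

# The solution stays between the boundary values 0.0 and 1.0.
assert all(0.0 <= A[i][j] <= 1.0
           for i in range(n + 2) for j in range(n + 2))
```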

I have seen code in the Fortress language, and Chapel looks much more readable, more powerful, and much simpler to use. So simple that in about two hours I think I have already understood a large percentage of the language's usage.

exterior() is one of the built-in methods of arithmetic domains; here it yields a row just outside D, at the bottom.

If you change the declaration of BigD to:

const BigD: domain(2) = [0..n+1, 0..n+1] dist Block,

Then all the domains and all the matrices (because all of them in this program are built on the BigD domain) become sliced into 2D tiles, which are better managed by the CPU cache. This is nice :-) Such tiles can also be managed by different CPU cores, so they can also be a way to divide the work.

-----------------

Even more complex are Hierarchical Array Declarations:

config const n = 1024,
             numLevels = lg2(n);

const Levels = [0..#numLevels];
const ProblemSpace = [1..n, 1..n, 1..n];

var V: [ProblemSpace] real;
const HierSpace: [lvl in Levels] subdomain(ProblemSpace)
           = ProblemSpace by -2**lvl;
var U, R: [lvl in Levels] [HierSpace(lvl)] real;

This allows you to perform hierarchical processing, an operation that today is very common, for example in image processing. It also allows loops with complex mixes of such hierarchical processing. This is surely better than doing such things manually. But I think this may be a bit of overkill for D; it's good for a pure number-crunching language like Chapel, Fortress, etc.
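
The idea of a hierarchy of strided index sets can be sketched in Python, here in 1D with a forward stride rather than Chapel's negative (high-end-aligned) "by -2**lvl" (a rough analogue, not a translation):

```python
# A pyramid of index sets: level lvl keeps every 2**lvl-th index
# of the base 1D problem space, like image-pyramid levels.
n = 16
num_levels = n.bit_length() - 1          # lg2(16) = 4
hier_space = [range(1, n + 1, 2 ** lvl) for lvl in range(num_levels)]

assert len(hier_space) == 4
assert list(hier_space[0]) == list(range(1, 17))  # full resolution
assert list(hier_space[2]) == [1, 5, 9, 13]       # every 4th index
```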

-----------------

In document 5 they talk about the locale type, which allows you to control the placement of both data and processing, to split domain processing into chunks given to different CPUs. It's not easy stuff to understand; I have never seen something like it.

-----------------

Finally, this is an example of a complex usage of 3D stencils in Chapel:

def rprj3(S, R) {
    const Stencil = [-1..1, -1..1, -1..1],
          w: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
          w3d = [(i,j,k) in Stencil] w((i!=0) + (j!=0) + (k!=0));

    forall ijk in S.domain do
        S(ijk) = + reduce [offset in Stencil]
                            (w3d(offset) * R(ijk + offset*R.stride));
}

Its translation into (parallel) Fortran code requires hundreds of lines, and they say the performance is not much different.
I don't fully understand that code, but it looks clean, and after reading the docs it seems readable.

-----------------

Overall I see several design mistakes (I have not included most of them among my comments here; I have just skipped them), and I think in several points D is better designed. Still, there is a surprisingly large number of things that are powerful, easy to understand, and have a nice syntax, and I'd like to see some of this stuff (the simpler parts) in D. But they look hard to implement :-)

Bye,
bearophile



More information about the Digitalmars-d mailing list