pure or not pure?

Thu Apr 10 10:34:42 PDT 2008

Steven Schveighoffer wrote:
> "Georg Wrede" wrote
>>Steven Schveighoffer wrote:
>>>"Janice Caron" wrote
>>>
>>>So how is new char[] a pure function?  And why can't I 'wrap' new?  For 
>>>instance, if I have a function:
>>>
>>>int[] createIncreasingSequence(int s, int e, int step){...}
>>>
>>>which creates an array of ints that start at s and end at e, increasing 
>>>by step each time, why can't I make that pure?  i.e. I want to specify my 
>>>own way to initialize an array.  If I do that in 10 different pure 
>>>functions, you're saying I have to re-implement that code in each one?
>>
>>The return value must be immutable if you're going to use it several 
>>times.
> 
> Agreed.  But I don't want to use it several times, I want to use it once :)
> 
> I want it to act like new does.  For example, if I have
> 
> char[] c = new char[5];
> 
> in two different pure functions, new better not return the same exact memory 
> space for both!  It should be called every time.  But I can't do this with a 
> custom function because pure functions can only call other pure functions. 
> Basically, I'm asking if there will be a way to move a common piece of a 
> function that allocates and initializes data out of a pure function and into 
> another.
> 
> It would be cool if the compiler did not perform special optimizations (such 
> as memoization) on pure functions that return mutable heap data, but still 
> enforced the rules of purity.
> 
>>And if you want to cache the return values outside of the function (as in 
>>only returning a new array when new parameters are used, and otherwise 
>>just a reference to a previously returned one), then I guess I don't have 
>>an opinion. :-?
>>
>>>This was misleading on my part.  I should have asked instead of these two 
>>>questions, if substr is allowed to be pure, can it be called from a 
>>>non-pure function?  In the context of being called from a pure function, 
>>>src is unique, because pure has to have unique data.  But in the context 
>>>of calling from a non-pure function, it might not be...
>>
>>Ultimately, any function is called from main(). Which is a non-pure 
>>function.

I'll have to take that back. Of course main can be a pure function. Take 
the unix sort function for an example. It receives data (which by the 
way, is /assumed/ immutable, that is, the input file is assumed not to 
change while sort is running) and _without_ otherwise changing its 
environment (read, files on the file system), returns a value (the 
sorted input).

Actually a lot of unix functionality is based on this. (Almost (just so 
I'm not caught lying :-) )) all the unix commands that take stdin as 
input and stdout as output, can be considered pure functions. (That most 
of them also output to stderr doesn't count in this discussion. :-) Nor 
the fact that they can change files (like the tee command) if the user 
wants. Also, they're excellent examples of functions that can be pure or 
not pure, depending on the invocation.)

> This is a special case of pure function.  Let's say I have a pure function:
> 
> pure invariant(char)[] f()
> {
>     char[] c = new char[15];
>     g(c);
>     return assumeUnique!(c);
> }
> 
> Now, if g is a pure function that takes a char[] argument and accesses the 
> data referenced by the argument, it could be OK to call it from another pure 
> function because we could assume that the pure function can only use heap 
> data that is 'local' to the function.  But it might be invalid to call such 
> a pure function from a non-pure function because data held by a non-pure 
> function is not always considered 'local'.  On the other hand, if a pure 
> function can hold pointers to heap data that is shared, but just can't use 
> it, then a pure function that takes mutable heap data might not be able to 
> use it, so a function like g() isn't possible.

I'd say that

  - you can call a pure function from a non-pure function, anytime

  - you can call a non-pure function from a pure function, BUT THAT 
makes your pure function lose its purity. It'll still work, however. 
(And of course, the compiler could be polite enough to remind you of 
losing purity -- if the compiler can see that you're actually trying to 
have a pure function, as opposed to the function being pure just by 
happenstance.)

For a program none of this matters. I'd say that you can mix freely 
whatever you like. BUT, if you wish the compiler has the possibility to 
do serious optimizing, the you better not soil your pure functions.

I'd say, that is your only punishment for frivolous mixing. The data 
doesn't get corrupted, and the sky doesn't turn checkered above you.

> If pure functions only allow invariant heap parameters and invariant heap 
> returns, then this whole discussion becomes moot, but if they accept 
> non-invariant heap parameters, or allow non-invariant heap returns, I want 
> to know what the rules for those are.  It seems to me that only allowing 
> invariant heap data is a severe limitation that will leave D functional 
> programming crippled in the face of an actual FP language.

One of the dreamed benefits of deciding that _D_ pure functions can only 
access invariant data is to ensure that the compiler can do memoisation 
when it wants to. That is, then the programmer doesn't have to do 
memoisation by hand. This lessens bugs, eases coding, and gives the 
compiler potentially big possibilities of very serious optimizing. (Not 
that DMD probably will achieve such in some time, but still. After all, 
we're designing a laguage here, too, which should make it _possible_ to 
write such compilers.)

If we define

  - pure functions can only receive immutable data as arguments
  - pure functions can only access immutable other data
  - pure functions can only return immutable data
  - pure functions can't change or output other than the return "value"
  - pure functions can only use other pure functions

and then we define

  - "return value" can also be a reference to immutable heap data
  - pure functions can use the heap as their local storage

and we decide that the last one means that pure functions can create and 
manipulate mutable data on the heap, but only so that nobody else has 
access to this data (including previous, later, or concurrent instances 
of the same function),

then we should be half-way to the goals A/W are trying to reach.

---

Now, why all this heap stuff?

Simply because even Real Functional Languages, that make it look like 
the stack is the only place in the world, can actually use the heap all 
they want, only they do it behind the back of the programmer. It's not 
about *literally* using the stack only, it's about the metaphor.

This metaphor is dictated by the fact that pure functionality means that 
stuff is passed via return values only.

Now, in D (which is a multi-paradigm, practical language, for Systems 
Programming), we don't want to restrict ourselves any more by purity 
than we'd want to do that with OO, for example.

Before C++ was invented, we had languages that were OO-only, and 
languages that were non-OO. (Even Tiobe still uses this distinction, and 
I always wonder to which category they've put D.) The reason for this 
either-or was simply that people were new to OO, and hadn't yet figured 
out how (or whether) they can mix OO and procedural.

To take another example, when I got a VIC-20 in the early eighties, it 
only had a Microsoft Basic interpreter. You couldn't do structured 
programming with it simply because the facilities were missing in the 
language. This made the programming effort more than exponential into 
the number of lines written. Today, I only do structural programming (as 
in versus unstructural), and admire Walter, who still mixes structured 
and unstructured, quite fluently. (Just see Phobos source, especially 
the C files.)

Now, with D we are charting new ground here. We are (arguably) the first 
to make a serious effort of mixing functional (as in pure) with 
procedural programming.

And this is where the heap stuff comes along. It would be impractical 
for D to pass everything via the stack. First of all, non-trivial 
progams would potentially need huge stack spaces, and in many 
environments the stack space allocated for a program is given at startup 
time. So, either we need to allocate stack just-in-case, or risk running 
out of stack in mid-run. Wasteful, and impolite, especially in a 
multiuser environment.

But to achieve this, we, at this point, have to decide that both input 
to pure functions and output (i.e. the return "value") have to be 
invariant. Only this lets us get away with only "pretending" to use the 
stack, while in essence we pass references to heap stuff via the stack.

---

Later, when we're more at ease with mixing functional with 
non-functional, we might well find ways to ease these restrictions. But 
that might be several years after we've actually started using this in 
real projects.

PS, oh Bob, I wish Walter would comment on this post, in any way. 
Suppose my understanding on this differs from that of A/W, then I'd only 
be misleading everybody here.