Follow-up post explaining research rationale

Joe Duarte via Digitalmars-d digitalmars-d at puremagic.com
Mon May 9 12:09:35 PDT 2016


Hi all,

As I mentioned on the other thread where I asked about D syntax, 
I'm a social scientist about to launch some studies of the 
effects of PL syntax on learnability, motivation to pursue 
programming, and differential gender effects on these factors. 
This is a long post – some of you wanted to know more about my 
research goals and rationale, and I also said I would post 
separately on the gender issue, so here we go...

As you know, women are starkly underrepresented in software 
engineering roles. I'm interested in zooming back to the decisions 
people make at 16 or 19 about programming as a career – in people's 
*first encounters* with programming, in high school or college, and 
in how and why men and women might differentially assess programming 
as a career option.

Let me note a few things: Someone on the other thread thought 
that my hypothesis was that women don't become programmers 
because of the semicolons and curly braces in PL syntax. That's 
not one of my hypotheses. I do think PL syntax is a large 
problem, and I have some hypotheses about how it 
disproportionately deters qualified women, but the issues I see 
go much deeper than what I've called the "punctuation noise" of 
semicolons and curly braces. (I definitely don't have any 
hypotheses about female perceptions of the aesthetics of curly 
braces, which some posters had inferred – none of this is about 
female aesthetic preferences.)

Also, I don't think D is particularly problematic – it has 
cleaner and clearer syntax than its contemporaries (well, we'll 
need careful research to know if it truly is clearer to a 
targeted population). I plan to use D as a presumptive *clearer 
syntax* condition in some studies – we'll see how it goes. 
Lastly, I'm not approaching the gender issue from an ideological 
or PC Principal perspective. My work will focus mostly on 
cognitive science and pedagogical factors – as you'll see below, 
I'm interested in diversity issues from lots of angles, but I 
don't subscribe to the diversity ideology that is fashionable in 
American academia.

One D-specific question I do have: Have any women ever posted 
here? I scoured a bunch of threads here recently and couldn't 
find a female poster. By this I mean a poster who supplied a proper 
name and whose name was female (some people just go by usernames). 
Of course we don't really know who is posting, and there could be 
some George Eliot situations, but the presence or absence of 
self-identified women is a useful enough signal. Women 
are underrepresented in programming, but the skew in online 
programming communities is even more extreme – we're seeing 
near-zero percent in lots of boards. This is not a D-specific 
problem. Does anyone know of occasions where women posted here? 
Links?

Getting back to the research, recent studies have argued that one 
reason women are underrepresented in certain STEM fields is that 
smart women have more options than smart men. So think of the 
right tail of the bell curve, the men and women in that region on 
the relevant aptitudes for STEM fields. There's some evidence 
that smart women have a broader set of skills -- *on average* -- 
than equivalently smart men, perhaps including better social 
skills (or more interest in social interaction). This probably 
fits with stereotypes and intuitions a lot of people already held 
(lots of stereotypes are accurate when read as claims about 
probability distributions and the like).

I'm interested in monocultures and diversity issues in a number 
of domains. I've done some recent work on the lack of 
philosophical and political diversity in social science, 
particularly in social psychology, and how this has undermined 
the quality and validity of our research (here's a recent paper 
by me and my colleagues in Behavioral and Brain Sciences: 
http://dx.doi.org/10.1017/S0140525X14000430). My interest in the 
lack of gender diversity in programming is an entirely different 
research area, but there isn't much rigorous social science and 
cognitive psychology research on this topic, which surprised me. 
I think it's an important and interesting issue. I also think a 
lot of the diversity efforts that are salient in tech right now 
are acting far too late in the cycle, sort of just waiting for 
women and minorities to show up. The skew starts long before 
people graduate with a CS degree, and I think Google, Microsoft, 
Apple, Facebook, et al. should think deeply about how programming 
language design might be contributing to these effects 
(especially before they roll out any more C-like programming 
languages).

Informally, I think what's happening in many cases is that when 
smart women are exposed to programming, it looks ridiculous and 
they think something like "Screw this – I'm going to med school", 
or any of a thousand permutations of that sentiment.

Mainstream PL syntax is extremely unintuitive and poorly designed by 
known pedagogical, epistemological, and communication-science 
standards. The vast majority of people who are introduced to 
programming do not pursue it (likely true of many fields, but 
programming may see a smaller share than most – this point requires 
a lot more context). I'm open to the possibility that 
the need to master the bizarre syntax of incumbent programming 
languages might serve as a useful filter for qualities valuable 
in a programmer, but I'm not sure how good or precise the filter 
is.

Let me give you a sense of the sorts of issues I'm thinking of. 
Here is a C sample from ProgrammingSimplified.com. It finds the 
frequency of characters in a string:

#include <stdio.h>   /* needed for printf and gets */

int main()
{
    char string[100];
    int c = 0, count[26] = {0};

    printf("Enter a string\n");
    gets(string);   /* reads a line of input (gets is unsafe and now deprecated) */

    while (string[c] != '\0')
    {
        /** Considering characters from 'a' to 'z' only
            and ignoring others */

        if (string[c] >= 'a' && string[c] <= 'z')
            count[string[c] - 'a']++;

        c++;
    }

    for (c = 0; c < 26; c++)
    {
        /** Printing only those characters
            whose count is at least 1 */

        if (count[c] != 0)
            printf("%c occurs %d times in the entered string.\n",
                   c + 'a', count[c]);
    }

    return 0;
}

There's a lot going on here from learning, cognitive-science, and 
linguistic-encoding standpoints. I'll walk through some of it below, 
and I've put a rough D sketch of the same program right after the 
list for comparison.

1. There's no clear distinction between types and names. It's 
just plain text run-on phrases like "char string". string is an 
unfortunate name here, and reminds us that this would be a type 
in many modern languages, but my point here is that there's 
nothing to visually distinguish types from names. I would make 
types parenthetical or use a hashtag, so: MyString (char) or 
MyString #char (and definitely with types at the end of the 
declaration, with names and values up front and uninterrupted by 
type names – I'll be testing my hunches here).

2. There's some stuff about an integer c that equals 0, then 
something called count – it's not clear whether that's a type or a 
name, since it stands alone and doesn't follow the pattern we saw 
with int main and char string. It also seems to equal zero. Things 
that equal zero are strange in this context: we often see bizarre 
x = 0 statements in programming when we don't mean x to actually 
equal zero, or not for long. Yet PL syntax usually doesn't include 
an explicit concept of a *starting value*, even though that's what 
these often are. We see this again further down, in the for loop.

3. The word *print* is being used to mean display on the screen. 
That's odd. Actually, the non-word printf is being used. We'd 
probably want to just say: display "Enter a string"

4. We switch person or voice from an imperative "do this", as in 
printf, to some sort of third-person narrator voice with "gets". Who 
are we talking to? Who are we talking about? Who is getting? It sits 
at the same level as printf, yet there's no apparent actor or 
procedure it could be referring to. (Relatedly, the third-person 
puts command that is so common in Ruby always makes me think of 
Silence of the Lambs – "It puts the lotion on its skin"... Or, more 
recently, the third-person style of the Faceless Men: "a girl has no 
name", etc.)

5. Punctuation characters that already have strong semantics in 
English are used in ways that are inconsistent with, and unrelated 
to, those semantics. For example, an exclamation mark is jarring 
next to an equals sign, and it's not clear why such syntax is 
desirable. The same goes for percent signs used to insert variables 
into a string rather than to express a percentage. (I predict that 
the curly-brace style of variable insertion in some HTML templating 
languages will be more intuitive for learners – it isolates the 
insertion and carries no conflicting semantics from normal English.)

I realize that some of this sprouted from the need to overload 
English punctuation in the ASCII-constrained computing world of 
the 1970s. The historical rationales for PL syntax decisions 
don't bear much on my research questions on learnability and the 
cognitive models people form when programming.

6. There are a bunch of semicolons and curly braces, and it's not 
clear why they're needed. Compilation will fail or the program 
will be broken if any of these characters are missing.

7. There are many other things going on here – plenty of further 
observations one could make from pedagogical, logical-representation, 
and engineering standpoints.
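
For comparison – and as a preview of the presumptive *clearer 
syntax* condition I mentioned above – here's a rough sketch of how 
the same character-counting program might look in D. This is my own 
illustrative rendering, not something from the original sample, and 
D regulars may well write it differently:

import std.stdio;   // writeln, writefln, and readln live here

void main()
{
    int[26] count;          // every element starts at 0 by default in D

    writeln("Enter a string");
    auto line = readln();   // auto infers the type; the name comes first

    // Considering characters from 'a' to 'z' only and ignoring others
    foreach (ch; line)
    {
        if (ch >= 'a' && ch <= 'z')
            count[ch - 'a']++;
    }

    // Printing only those characters whose count is at least 1
    foreach (i, n; count)
    {
        if (n != 0)
            writefln("%s occurs %s times in the entered string.",
                     cast(char)('a' + i), n);
    }
}

Note what changes and what doesn't: auto gets the type out of the 
way of the name for line (point 1), the count array's starting 
values are implicit (point 2), writeln/writefln replace printf 
(point 3), and readln replaces gets (point 4) – but the semicolons, 
braces, and % format specifiers are all still there, which is 
exactly the kind of thing I want to test.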


Now, there are some reasonable hypotheses having to do with 
programming/tech culture and its effects on gender diversity. I 
think some of those can intertwine with PL design issues. I also 
think there might be an issue with how compelling today's computing 
platforms are, and with the perceived power of computers to do 
amazing and interesting things. I don't think the 
platforms people are introduced to in CS education are very good 
at generating excitement about what computers can do. It would be 
interesting to gauge what sorts of things people think they might 
be able to create, what sorts of problems they think they could 
solve, or new interfaces they could implement, after their 
introduction to programming. What horizons do they see? For 
example, there used to be a lot of excitement about what 
computers could do for education. Those visions have not 
materialized, and it's not clear that computing is doing anything 
non-trivial in education for reasoning ability, unlocking math 
aptitude, writing creativity, etc. It might actually be a net 
harm, with its effects on attention spans and language 
development, though this will be very complicated to assess.

Mobile has reinvigorated some idealism and creativity about 
computing. But the platforms people are introduced to or forced 
to use when learning programming are not mobile platforms, since 
you can't build complex applications on the devices themselves. 
Unix and Linux are extremely popular in CS, but they're terrible 
examples for blue-sky thinking about computing. Forcing people to 
learn Vim or Emacs, grep, and poorly designed command-line 
interfaces that dump a bunch of unformatted text at you is a 
disastrous decision from a pedagogical standpoint. (See the BlueJ 
project for an effort to do something about this.) These tools do 
nothing to illustrate what new and exciting things you could build 
with computers, and they seem to mold students into a rigid, 
conformist *nix, git, and markdown monoculture where computing is 
reduced to bizarre manipulations of ASCII text on a black 1980s 
DOS-like screen – and to constantly fiddling with and repairing 
one's operating system just to keep working on that screen. 
(Unix/Linux carries a lot of maintenance and troubleshooting 
overhead, especially for beginners; if they have to deal with that 
while also learning to program, programming itself could become 
associated with a life of never-ending, maddening hassles and 
frustrations.) The debugging experience on Unix/Linux will be 
painful. From a pedagogical standpoint, this looks like a doomsday 
scenario, about the worst CS education approach we could devise.

The nuisance/hassle overhead of programming is probably worth a 
few studies in conjunction with my studies on syntax, and I'd 
guess the issues are related – the chance of success in 
programming, in getting a simple program to just work, is pretty 
low. It's not clear that it *needs* to be so low, and I want to 
isolate any platform/toolchain factors from any PL syntax 
factors. (The factors may not exist – I could be wrong across the 
board.)

That's all I've got for now. This isn't as well-organized as I'd 
like, but I wanted to get something out now or I'd likely let it 
slip for weeks.

