Rosetta Commatizing numbers

Wed May 31 06:27:24 PDT 2017

On Wednesday, 31 May 2017 at 04:31:14 UTC, Ivan Kazmenko wrote:
> On Tuesday, 30 May 2017 at 10:54:49 UTC, Solomon E wrote:
>> I ran into a Rosetta code solution in D that had obvious 
>> errors. It's like the author or the previous editor wasn't 
>> even trying to do it right, like a protest against how many 
>> detailed rules the task had. I assumed that's not the way we 
>> want to do things in D.
>> ...
>> Does anyone have any thoughts about this? Did I do right by D?
>
> I'd say the previous version (by bearophile) suited the task 
> much better, but both aren't perfect.
>
> As a general note, consider the following paragraph of the 
> problem statement:
>
> "Some of the commatizing rules (specified below) are arbitrary, 
> but they'll be a part of this task requirements, if only to 
> make the results consistent amongst national preferences and 
> other disciplines."
>
> This literally means that, while there are complex rules in the 
> real world for commatizing numbers, the problem is kept simple 
> by enforcing strict rules.  The minute concerns of the Real 
> World, like "Current New Zealand dollar format overrides old 
> Zimbabwe dollar format", are irrelevant to the formal problem 
> being solved.  Perhaps the example inputs section ("Strings to 
> be used as a minimum") gets misleading, but that's what they 
> are: examples, not general rules.  By the way, as it's a wiki 
> page, problem statement text could also be improved ;) .
>
> Why?  For example, look at Indian numbering system where 
> commatizing is visibly different 
> (https://en.wikipedia.org/wiki/Indian_numbering_system) - and 
> we don't know whether the string should use it or not without 
> the context.  Or consider that hexadecimal numbers are usually 
> split in groups of four digits, not three - and we don't know 
> whether a [0-9]+ number is decimal or hexadecimal without the 
> context.  See, trying to provide an ultimate solution to 
> real-world commatizing, while keeping it a single function 
> without the context, can't possibly succeed.
>
> What can be done, then?  Well, the page authors already did the 
> difficult part for us: they extracted the essence of a complex 
> real-world problem into a small set of formal rules, which are 
> now the formal problem statement.  Now comes the easy part: to 
> do exactly what is asked in the problem statement.  The 
> flexibility comes from having function parameters.  If we have 
> a solution to a formal problem, using it for the real-world 
> version of the problem is either just specifying the right 
> parameters (hopefully), or changing the function if the real 
> world gets too complex for it.  In the latter case, the more 
> short and readable the existing solution is, the faster can we 
> change the function to suit our real-world case.
>
> -----
>
> Now, where is the old version wrong?  Turns out it just calls 
> the function with default parameters for every line of input - 
> which is wrong since the first two input lines need to be 
> handled specially.  Well, that's what the function parameters 
> are for.  To have a correct solution, we have to use custom 
> parameters for the first two lines of input.  The function 
> itself is fine.
>
> Your solution addresses this problem by special-casing the 
> inputs inside the function, perhaps because of the misleading 
> inputs section in the problem statement.  That's a wrong 
> approach.  First, it introduces magic numbers 33 and 36 into 
> the code, which is a bad programming practice (see here: 
> https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants).
> Second, it's plain wrong.  According to the problem statement,
> we don't have these rules for every possible line of >33
> standalone decimals, or >36 characters in total.  We just have
> to call our function with a concrete set of custom parameters
> for one concrete example, and other set of parameters for
> another example.  That's to demonstrate that our function
> accepts and makes proper use of custom parameters!
> Special-casing example inputs inside the function is not a
> solution: if we go down this path, the perfect solution would
> be a bunch of "if" statements for every possible example input
> producing the respective example outputs, and empty function
> for all other possible inputs.
>
> So, how do we call with special parameters?  Currently, we can 
> look at every other language except C# as inspiration: ALGOL 
> 68, J, Java, Perl 6, Phix, Racket, and REXX.  Your solution 
> also has a good way to check example inputs: a unittest block.  
> It even shows one of D's strengths compared to other languages.
> And there, you do use custom parameters to check that the
> function works.  A good approach would be to put all the 
> examples in the unittest instead of reading them from a file.  
> This way, the program will be immediately usable and runnable: 
> no need to create an additional arbitrarily-named file just to 
> test it.
>
> -----
>
> All in all, the only thing I'd change in bearophile's solution 
> is to remove the file reading loop, add the unittest block from 
> your solution instead, and place all the examples there.  
> Printing the result does not seem imperative on Rosettacode, 
> and there are at least some entries in D which already use 
> unittest for checking the problem requirements (for example, 
> https://rosettacode.org/wiki/Sorting_algorithms/Cocktail_sort#D).
>
> Lastly, please note that Rosettacode supports multiple versions 
> in a single language (example: 
> http://rosettacode.org/wiki/99_Bottles_of_Beer#D).  As 
> bearophile's version certainly has its merits, I strongly 
> suggest to keep it available, either merged with your current 
> version to produce the right solution, or as a second version.
>
> Ivan Kazmenko.

I appreciate getting a code review, and I want to improve. What I
did was done with a sense of humor, so I guess I can find a way to
make it more serious.

First I want to explain why I didn't just make minimal changes, 
although at first I wanted to make just minimal changes. This is 
the output from bearophile's version:

pi=3.14,159,265,358,979,323,846,264,338,327,950,288,419,716,939,937,510,582,097,494,459,231The author has two Z$100,000,000,000,000 Zimbabwe notes (100 trillion)."-in Aus$+1,411.8millions"===US$0017,440 millions=== (in 2,000 dollars)123.e8,000 is pretty big.The land area of the earth is 57,268,900(29% of the surface) square miles.Ain't no numbers in this here words, nohow, no way, Jose.James was never known as 0000000007Arthur Eddington wrote: I believe there are 15,747,724,136,275,002,577,605,653,961,181,555,468,044,717,914,527,116,709,366,231,425,076,185,631,031,296 protons in the universe.   $-140,000±100 millions.6/9/1,946 was a good year for some.

1. pi has the commas start at the wrong digit, and doesn't follow
the explicit instructions to use spaces as the separator and a
grouping of 5.
2. There are no newlines, although the input is a list of lines
to be "commatized", not concatenated.
3. Zimbabwe dollars are given commas, against the explicit 
request to have dots. (That would be undesirable in the real 
world, not just in this silly example, because comma is used as a 
decimal point in the Zimbabwe press, and spaces for thousands 
separators.)
4. The second number in the line
===US$0017,440 millions=== (in 2,000 dollars)
is "commatized" which is against the explicit instructions to 
"commatize" the first number only, given in the task description 
and explained on the task's talk page.
5. The exponent in 123.e8000 is "commatized" which is against 
explicit and repeated instructions not to "commatize" exponents.
6. (The commas in the Eddington number are acceptable enough.)
7. The year in 6/9/1946 is "commatized" against explicit 
instructions to "commatize" only the first number field. It was 
discussed in the task's talk page that years shouldn't be 
commatized, and that's easy to avoid by never "commatizing" past 
the first number.

Overall, the original function gets it wrong in a simplistic way:
it attacks every run of digits and inserts a comma every three
digits, counting from the rightmost digit.
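
As a rough sketch of the alternative (not the code from the Rosetta
page), the helper below groups only the first run of digits that
starts with a nonzero digit and leaves everything after it alone, so
a later number, an exponent or a year is never touched. The name
groupFirstNumber and its parameters are hypothetical, and the sketch
deliberately ignores fraction digits and any start offset.

```d
import std.ascii : isDigit;

/// Hypothetical helper: groups only the first run of digits that starts
/// with a nonzero digit and leaves the rest of the line untouched.
/// Assumes step > 0. Fraction digits are not handled here.
string groupFirstNumber(string s, string sep = ",", size_t step = 3)
{
    // Find the first nonzero digit; leading zeros are skipped.
    size_t start = 0;
    while (start < s.length && !(s[start].isDigit && s[start] != '0'))
        ++start;
    if (start == s.length)
        return s; // no number to group

    // Find the end of that run of digits.
    size_t end = start;
    while (end < s.length && s[end].isDigit)
        ++end;

    // Insert separators, counting from the right end of the run.
    string grouped;
    foreach (i; start .. end)
    {
        if (i > start && (end - i) % step == 0)
            grouped ~= sep;
        grouped ~= s[i];
    }
    return s[0 .. start] ~ grouped ~ s[end .. $];
}

unittest
{
    // Only the first number is grouped; "2000" stays as it is.
    assert(groupFirstNumber("===US$0017440 millions=== (in 2000 dollars)")
        == "===US$0017,440 millions=== (in 2000 dollars)");
    // The year and the exponent are past the first number.
    assert(groupFirstNumber("6/9/1946 was a good year for some.")
        == "6/9/1946 was a good year for some.");
    assert(groupFirstNumber("123.e8000 is pretty big.")
        == "123.e8000 is pretty big.");
}
```

The unittest block runs when the file is compiled with
dmd -unittest -main.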

For the Eddington number, the task didn't explicitly state to use 
spaces in that long a number, but the task does say there should 
be spaces in the digits of pi, which leaves open to 
interpretation whether that's a special request or a rule that 
could apply to any sufficiently long number, AND the task 
includes a reference to a Wikipedia page on the number that does 
use spaces. The task doesn't say that solutions shouldn't provide 
options to produce results that are a little better (more 
conventional looking and useful) than what the task explicitly 
asks. So when I was adding the part for the requested format for 
pi, I made detecting all long numbers part of the 
humorously-named "smart" option. It's humorous because, like most
consumer "smart" options, it doesn't use AI; it just detects
certain cases, assumes what you want, and applies that,
overriding other options for some lines.

I totally get the abhorrence of magic numbers. I usually use named
constants in place of literals. I didn't think those were magic
numbers that needed a declared constant, since each of them was
used only once and each had a three-line comment explaining its
value and the rationale for applying it. (It only applies in the
humorous "smart option" anyway.)

Those magic numbers were hard to figure out. At first I thought
they should both be 33, then later realized that my explanation of
the values required one to be 36. In a more serious program, I would
want to calculate such numbers so that any changes in the
requirements would change the result.

So can we compromise that a user of a function gets to have lots
of extra options, as long as those are optional arguments and
don't affect the result in any way if you don't touch them? Is
that normal for D code? I think it should be normal for some
languages, but thinking about it right now, because D doesn't
have named optional arguments, optional arguments are sometimes
troublesome to use: you have to fill in the earlier optional
arguments in the argument list, and sometimes you have to know the
argument at compile time. So D isn't designed to be exactly that
sort of language where extra arguments are piled on with abandon.
A good API for D as it is should expose just the more useful
arguments; another named function could then take the more
specialized ones.
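
One way to soften the positional-defaults problem, sketched here as
an assumption rather than as anything taken from the Rosetta entry,
is to bundle the options in a struct with default field values.
Callers then set only the field they care about. GroupOptions and
groupInteger are hypothetical names used for illustration.

```d
/// Hypothetical options bundle; the field names are illustrative only.
struct GroupOptions
{
    string sep = ",";  // separator inserted between groups
    size_t step = 3;   // digits per group, assumed to be > 0
}

/// Groups a run of digits from the right, as for an integer part.
string groupInteger(string digits, GroupOptions opt = GroupOptions.init)
{
    string result;
    foreach (i, c; digits)
    {
        // A separator goes wherever the remaining length is a multiple of step.
        if (i > 0 && (digits.length - i) % opt.step == 0)
            result ~= opt.sep;
        result ~= c;
    }
    return result;
}

unittest
{
    // Defaults: nothing to spell out.
    assert(groupInteger("1234567") == "1,234,567");

    // Change only the group size; the separator keeps its default.
    GroupOptions opt;
    opt.step = 4;
    assert(groupInteger("1234567", opt) == "123,4567");
}
```

The trade-off is an extra type in the API, in exchange for calls
that don't have to restate defaults just to reach a later option.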

I think it's ugly to solve a problem by special casing a function 
call with different arguments for each line of processing a file, 
in a case where a single call that abstracts what you want to do 
would be shorter to write and more reliable. Of course it's even 
uglier to get more than half the answers wrong on a test and 
present that as a solution. It looked like irony to me, and it 
still does.

The other language solutions to Rosetta tasks may be
"inspirational" in some ways, but there are also errors in them,
at least for this task, that would be found if they were fully
tested. They're made by human beings, and Rosetta code is just a
game. It hasn't been around nearly as long as the older languages
used there, so there's no reason to look up to the solutions in
old languages with awe, as if they were time-worn and carved in
stone.

The original code can't pass the unittests, so I can't add them 
to it. It's short because it's fundamentally flawed, taking the 
task as unidirectional and not involving recognizing decimal 
points, when the task is bidirectional and centered around the 
decimal point.
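
To make "bidirectional" concrete, here is a minimal sketch that
assumes the number has already been isolated as a plain run of
digits with at most one '.' in it. The integer part is grouped
counting leftwards from the decimal point, and the fraction part
counting rightwards from it. groupAroundPoint is a hypothetical
name, and locating the number inside a line and skipping exponents
are left out.

```d
import std.string : indexOf;

/// Hypothetical sketch: `number` is assumed to be a plain run of digits
/// with at most one '.', e.g. "1234567.7654321". Assumes step > 0.
string groupAroundPoint(string number, string sep = ",", size_t step = 3)
{
    immutable dot = number.indexOf('.');
    immutable intEnd = dot < 0 ? number.length : cast(size_t) dot;

    string result;
    // Integer part: positions are counted from the decimal point leftwards.
    foreach (i; 0 .. intEnd)
    {
        if (i > 0 && (intEnd - i) % step == 0)
            result ~= sep;
        result ~= number[i];
    }
    if (dot < 0)
        return result;

    result ~= '.';
    // Fraction part: positions are counted from the decimal point rightwards.
    foreach (i; intEnd + 1 .. number.length)
    {
        immutable k = i - intEnd - 1; // fraction digits already emitted
        if (k > 0 && k % step == 0)
            result ~= sep;
        result ~= number[i];
    }
    return result;
}

unittest
{
    assert(groupAroundPoint("1234567") == "1,234,567");
    assert(groupAroundPoint("1234567.7654321") == "1,234,567.765,432,1");
    // Spaces and groups of five, as requested for the digits of pi.
    assert(groupAroundPoint("3.14159265358979", " ", 5)
        == "3.14159 26535 8979");
}
```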

I'll try to improve the code again, based on the comments here.