Rosetta Commatizing numbers
Solomon E via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed May 31 06:27:24 PDT 2017
On Wednesday, 31 May 2017 at 04:31:14 UTC, Ivan Kazmenko wrote:
> On Tuesday, 30 May 2017 at 10:54:49 UTC, Solomon E wrote:
>> I ran into a Rosetta code solution in D that had obvious
>> errors. It's like the author or the previous editor wasn't
>> even trying to do it right, like a protest against how many
>> detailed rules the task had. I assumed that's not the way we
>> want to do things in D.
>> ...
>> Does anyone have any thoughts about this? Did I do right by D?
>
> I'd say the previous version (by bearophile) suited the task
> much better, but both aren't perfect.
>
> As a general note, consider the following paragraph of the
> problem statement:
>
> "Some of the commatizing rules (specified below) are arbitrary,
> but they'll be a part of this task requirements, if only to
> make the results consistent amongst national preferences and
> other disciplines."
>
> This literally means that, while there are complex rules in the
> real world for commatizing numbers, the problem is kept simple
> by enforcing strict rules. The minute concerns of the Real
> World, like "Current New Zealand dollar format overrides old
> Zimbabwe dollar format", are irrelevant to the formal problem
> being solved. Perhaps the example inputs section ("Strings to
> be used as a minimum") gets misleading, but that's what they
> are: examples, not general rules. By the way, as it's a wiki
> page, problem statement text could also be improved ;) .
>
> Why? For example, look at Indian numbering system where
> commatizing is visibly different
> (https://en.wikipedia.org/wiki/Indian_numbering_system) - and
> we don't know whether the string should use it or not without
> the context. Or consider that hexadecimal numbers are usually
> split in groups of four digits, not three - and we don't know
> whether a [0-9]+ number is decimal or hexadecimal without the
> context. See, trying to provide an ultimate solution to
> real-world commatizing, while keeping it a single function
> without the context, can't possibly succeed.
>
> What can be done, then? Well, the page authors already did the
> difficult part for us: they extracted the essence of a complex
> real-world problem into a small set of formal rules, which are
> now the formal problem statement. Now comes the easy part: to
> do exactly what is asked in the problem statement. The
> flexibility comes from having function parameters. If we have
> a solution to a formal problem, using it for the real-world
> version of the problem is either just specifying the right
> parameters (hopefully), or changing the function if the real
> world gets too complex for it. In the latter case, the more
> short and readable the existing solution is, the faster can we
> change the function to suit our real-world case.
>
> -----
>
> Now, where is the old version wrong? Turns out it just calls
> the function with default parameters for every line of input -
> which is wrong since the first two input lines need to be
> handled specially. Well, that's what the function parameters
> are for. To have a correct solution, we have to use custom
> parameters for the first two lines of input. The function
> itself is fine.
>
> Your solution addresses this problem by special-casing the
> inputs inside the function, perhaps because of the misleading
> inputs section in the problem statement. That's a wrong
> approach. First, it introduces magic numbers 33 and 36 into
> the code, which is a bad programming practice (see here:
> https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants).
> Second, it's plain wrong. According to the problem statement,
> we don't have these rules for every possible line of >33
> standalone decimals, or >36 characters in total. We just have
> to call our function with a concrete set of custom parameters
> for one concrete example, and other set of parameters for
> another example. That's to demonstrate that our function
> accepts and makes proper use of custom parameters!
> Special-casing example inputs inside the function is not a
> solution: if we go down this path, the perfect solution would
> be a bunch of "if" statements for every possible example input
> producing the respective example outputs, and empty function
> for all other possible inputs.
>
> So, how do we call with special parameters? Currently, we can
> look at every other language except C# as inspiration: ALGOL
> 68, J, Java, Perl 6, Phix, Racket, and REXX. Your solution
> also has a good way to check example inputs: a unittest block.
> It even shows one of D's strengths compared to other languages.
> And there, you do use custom parameters to check that the
> function works. A good approach would be to put all the
> examples in the unittest instead of reading them from a file.
> This way, the program will be immediately usable and runnable:
> no need to create an additional arbitrarily-named file just to
> test it.
>
> -----
>
> All in all, the only thing I'd change in bearophile's solution
> is to remove the file reading loop, add the unittest block from
> your solution instead, and place all the examples there.
> Printing the result does not seem imperative on Rosettacode,
> and there are at least some entries in D which already use
> unittest for checking the problem requirements (for example,
> https://rosettacode.org/wiki/Sorting_algorithms/Cocktail_sort#D).
>
> Lastly, please note that Rosettacode supports multiple versions
> in a single language (example:
> http://rosettacode.org/wiki/99_Bottles_of_Beer#D). As
> bearophile's version certainly has its merits, I strongly
> suggest to keep it available, either merged with your current
> version to produce the right solution, or as a second version.
>
> Ivan Kazmenko.
I appreciate getting a code review, and I want to improve. I
wrote my version with a sense of humor, so I guess I can find a
way to make it more serious.
First I want to explain why I didn't just make minimal changes,
although at first I wanted to make just minimal changes. This is
the output from bearophile's version:
pi=3.14,159,265,358,979,323,846,264,338,327,950,288,419,716,939,937,510,582,097,494,459,231The author has two Z$100,000,000,000,000 Zimbabwe notes (100 trillion)."-in Aus$+1,411.8millions"===US$0017,440 millions=== (in 2,000 dollars)123.e8,000 is pretty big.The land area of the earth is 57,268,900(29% of the surface) square miles.Ain't no numbers in this here words, nohow, no way, Jose.James was never known as 0000000007Arthur Eddington wrote: I believe there are 15,747,724,136,275,002,577,605,653,961,181,555,468,044,717,914,527,116,709,366,231,425,076,185,631,031,296 protons in the universe. $-140,000±100 millions.6/9/1,946 was a good year for some.
1. pi has the commas start at the wrong digit, and the output
doesn't follow the explicit instructions to use spaces as the
separator and a grouping of 5.
2. There are no newlines, although the input is a list of lines
to be "commatized", not concatenated.
3. Zimbabwe dollars are given commas, against the explicit
request to have dots. (That would be undesirable in the real
world, not just in this silly example, because comma is used as a
decimal point in the Zimbabwe press, and spaces for thousands
separators.)
4. The second number in the line
===US$0017,440 millions=== (in 2,000 dollars)
is "commatized" which is against the explicit instructions to
"commatize" the first number only, given in the task description
and explained on the task's talk page.
5. The exponent in 123.e8000 is "commatized" which is against
explicit and repeated instructions not to "commatize" exponents.
6. (The commas in the Eddington number are acceptable enough.)
7. The year in 6/9/1946 is "commatized" against explicit
instructions to "commatize" only the first number field. It was
discussed in the task's talk page that years shouldn't be
commatized, and that's easy to avoid by never "commatizing" past
the first number.
Overall, the original function was simplistic: it attacks every
series of digits and inserts a comma every three digits, counting
from the rightmost.
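
A minimal sketch of that simplistic behavior, to show why the
exponent and the year end up with commas. This is not bearophile's
actual code: naiveCommatize is a hypothetical name, and the sketch
does not reproduce how the original treats leading zeros.

```d
import std.ascii : isDigit;

/// Hypothetical illustration: group EVERY run of digits from its
/// right end, with no notion of "first number", decimal point,
/// or exponent.
string naiveCommatize(string s, string sep = ",", size_t step = 3)
{
    assert(step > 0, "step must be positive");
    string result;
    size_t i = 0;
    while (i < s.length)
    {
        if (!isDigit(s[i]))
        {
            // Copy non-digit code units through unchanged.
            result ~= s[i];
            ++i;
            continue;
        }
        // Find the end of this run of digits.
        size_t j = i;
        while (j < s.length && isDigit(s[j]))
            ++j;
        string run = s[i .. j];
        foreach (k, c; run)
        {
            // Separator whenever the digits left are a multiple of step.
            if (k > 0 && (run.length - k) % step == 0)
                result ~= sep;
            result ~= c;
        }
        i = j;
    }
    return result;
}
```

With this sketch, "123.e8000 is pretty big." becomes
"123.e8,000 is pretty big." and "6/9/1946" becomes "6/9/1,946",
the same kind of mistakes as in items 5 and 7 above.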
For the Eddington number, the task didn't explicitly state to use
spaces in that long a number, but the task does say there should
be spaces in the digits of pi, which leaves open to
interpretation whether that's a special request or a rule that
could apply to any sufficiently long number, AND the task
includes a reference to a Wikipedia page on the number that does
use spaces. The task doesn't say that solutions shouldn't provide
options to produce results that are a little better (more
conventional looking and useful) than what the task explicitly
asks. So when I was adding the part for the requested format for
pi, I made detecting all long numbers part of the
humorously-named "smart" option. It's humorous because, like most
consumer "smart" options, it doesn't use AI; it just detects some
cases and applies assumptions about what you want, overriding
other options for some lines.
I totally get the abhorrence of magic numbers. I use named
constants in place of literals usually. I didn't think those were
magic numbers that needed a constant declared if each of those
was only used once and each had a three line comment explaining
its value and the rationale for applying it. (It only applies in
the humorous "smart option" anyway.)
Those magic numbers were hard to figure out. At first I thought
they should both be 33, then later realized my explanation of the
values required one to be 36. In a more serious program, I would
want to calculate such numbers so that any changes in the
requirements would change the result.
So can we compromise that a user of a function gets lots of extra
options, as long as those are optional arguments that don't
affect the result in any way if you don't touch them? Is that
normal for D code? I think it should be normal for some
languages, but thinking about it right now, D doesn't have named
optional arguments, so optional arguments are sometimes trouble
to use: you have to fill in the earlier optional arguments in the
argument list, and sometimes you have to know the argument at
compile time. So D isn't designed to be exactly the sort of
language where extra arguments are piled on with abandon. A good
API for D as it is should have just the more useful arguments,
and then there could be another named function that takes the
more specialized arguments.
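
A small sketch of that trade-off, using made-up names (group and
groupWithDots are hypothetical, not part of either Rosetta
solution). To change only the last optional argument, the caller
has to spell out the earlier one as well, so a separately named
wrapper can be easier to use:

```d
/// Hypothetical helper: groups a string of digits from the right.
/// Both step and sep are optional, positional parameters.
string group(string digits, size_t step = 3, string sep = ",")
{
    assert(step > 0, "step must be positive");
    string result;
    foreach (k, c; digits)
    {
        // Separator whenever the digits left are a multiple of step.
        if (k > 0 && (digits.length - k) % step == 0)
            result ~= sep;
        result ~= c;
    }
    return result;
}

/// A separately named function for one specialized case, so that
/// callers don't have to repeat the default step just to reach sep.
string groupWithDots(string digits)
{
    return group(digits, 3, ".");
}
```

Here group("1234567") gives "1,234,567", while getting dots
requires either group("1234567", 3, ".") or the wrapper
groupWithDots("1234567").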
I think it's ugly to solve a problem by special casing a function
call with different arguments for each line of processing a file,
in a case where a single call that abstracts what you want to do
would be shorter to write and more reliable. Of course it's even
uglier to get more than half the answers wrong on a test and
present that as a solution. It looked like irony to me, and it
still does.
The other language solutions to Rosetta tasks may be
"inspirational" in some ways, but there are also errors in them,
at least for this task, that would be found if they were fully
tested. They're made by human beings, and Rosetta code is just a
game. It hasn't been around nearly as long as the older languages
used there, so there's no reason to look up to the solutions in
old languages with awe, as if they were time-worn and carved in
stone.
The original code can't pass the unittests, so I can't add them
to it. It's short because it's fundamentally flawed, taking the
task as unidirectional and not involving recognizing decimal
points, when the task is bidirectional and centered around the
decimal point.
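
To illustrate what "bidirectional" means here, a rough sketch
under simplifying assumptions (it is not my submitted solution,
and commatizeFirst and its helpers are hypothetical names). It
handles only the first number that starts with a non-zero digit,
groups the integer part from the right and the fraction from the
left, and leaves everything after that field alone. It has no
start-position parameter and does not treat a fraction such as
"0.123456" specially.

```d
import std.ascii : isDigit;

/// Group digits counting from the right end (integer part).
string groupFromRight(string digits, string sep, size_t step)
{
    string result;
    foreach (k, c; digits)
    {
        if (k > 0 && (digits.length - k) % step == 0)
            result ~= sep;
        result ~= c;
    }
    return result;
}

/// Group digits counting from the left end (fractional part).
string groupFromLeft(string digits, string sep, size_t step)
{
    string result;
    foreach (k, c; digits)
    {
        if (k > 0 && k % step == 0)
            result ~= sep;
        result ~= c;
    }
    return result;
}

/// Hypothetical sketch: commatize only the first numeric field.
string commatizeFirst(string s, string sep = ",", size_t step = 3)
{
    assert(step > 0, "step must be positive");
    // Skip ahead to the first non-zero digit (leading zeros stay as-is).
    size_t start = 0;
    while (start < s.length && (!isDigit(s[start]) || s[start] == '0'))
        ++start;
    if (start == s.length)
        return s; // no number to commatize

    // Integer part: grouped from the right, toward the decimal point.
    size_t intEnd = start;
    while (intEnd < s.length && isDigit(s[intEnd]))
        ++intEnd;
    string result = s[0 .. start]
        ~ groupFromRight(s[start .. intEnd], sep, step);

    // Fraction: only if a '.' is directly followed by a digit.
    size_t pos = intEnd;
    if (pos + 1 < s.length && s[pos] == '.' && isDigit(s[pos + 1]))
    {
        size_t fracEnd = pos + 1;
        while (fracEnd < s.length && isDigit(s[fracEnd]))
            ++fracEnd;
        result ~= "." ~ groupFromLeft(s[pos + 1 .. fracEnd], sep, step);
        pos = fracEnd;
    }
    // Everything after the first field (exponents, years) is untouched.
    return result ~ s[pos .. $];
}
```

Because the sketch stops after the first field, "123.e8000" and
"6/9/1946" come back unchanged, and calling it with " " and 5 on
"pi=3.14159265358979" gives "pi=3.14159 26535 8979".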
I'll try to improve the code again, based on the comments here.