Warning: long post ahead!

On 24/10/2013 22.39, Roberto Ierusalimschy wrote:
[...]
> The problem is that the specification is not very clear, and in my
> understanding the implementation in libc is buggy... For instance,
> consider this example (in Linux):
> 
>> print(io.read('*n', '*l'))
> 3.4e-                                 <<< input
> 3.4	                              <<< output
> 
The same happens for me on Windows! It also happens with a simple C
program compiled in C99 mode with TDM-GCC 4.8.1, although that may be
due to the underlying Windows C runtime.

> What happened with "e-"? Did fscanf accepted "3.4e-" as a number,
> or it simply thrown away the "e-"? In both cases, it seems the wrong
> behavior. How to document it? (It seems like the read operation
> consumed "an unspecified number of characters" even though the
> operation did not fail...)
> 

Mmmh, I did some investigating and I'm not completely sure it is a bug,
although the C89 standard is *really* murky on the subject. According to
the standard (I am quoting a draft I found long ago on the Internet,
which carries no document number; I hope you have a copy of the actual
C89 standard to check my references):

------------------------------------------------------------------------
4.9.6.2 The fscanf function

[...]

The fscanf function executes each directive of the format in turn. If a
directive fails, as detailed below, the fscanf function returns.
Failures are described as input failures (due to the unavailability of
input characters), or matching failures (due to inappropriate input).

[...]

A directive that is a conversion specification defines a set of matching
input sequences, as described below for each specifier. A conversion
specification is executed in the following steps:

[...]

An input item is read from the stream, unless the specification includes
an n specifier. An input item is defined as the longest sequence of
input characters (up to any specified maximum field width) which is an
initial subsequence of a matching sequence. The first character, if any,
after the input item remains unread. If the length of the input item is
zero, the execution of the directive fails: this condition is a matching
failure, unless an error prevented input from the stream, in which case
it is an input failure.

[...]


e,f,g Matches an optionally signed floating-point number, whose format
is the same as expected for the subject string of the strtod function.
The corresponding argument shall be a pointer to floating.
------------------------------------------------------------------------

The key sentence seems to be:

 "An input item is defined as the longest sequence of input characters
(up to any specified maximum field width) which is an initial
subsequence of a matching sequence."

where the matching sequence is defined by the conversion %lf as:

"Matches an optionally signed floating-point number, whose format is the
same as expected for the subject string of the strtod function."

So it states that the input item is the "longest *initial subsequence*
of what strtod deems a representation of a double", and "3.4e-" complies
with this definition. But how is the input item actually converted? The
only relevant paragraph in the fscanf description is the following:

------------------------------------------------------------------------
Except in the case of a % specifier, the input item (or, in the case of
a %n directive, the count of input characters) is converted to a type
appropriate to the conversion specifier. If the input item is not a
matching sequence, the execution of the directive fails: this condition
is a matching failure. Unless assignment suppression was indicated by a
* , the result of the conversion is placed in the object pointed to by
the first argument following the format argument that has not already
received a conversion result. If this object does not have an
appropriate type, or if the result of the conversion cannot be
represented in the space provided, the behavior is undefined.
------------------------------------------------------------------------

Thus it says simply that in our case "[...] the input item [...] is
converted to a type appropriate to the conversion specifier. [...]"

Sadly nothing is said about the semantics of the conversion, but it
seems that "3.4e-" doesn't trigger a matching failure, thus it is
amenable to conversion!


The C99 standard (draft N1256) is a bit clearer:

------------------------------------------------------------------------
7.19.6.2 The fscanf function
[...]
9
An input item is read from the stream, unless the
specification includes an n specifier. An input item is
defined as the longest sequence of input characters which
does not exceed any specified field width and which is, or is
a prefix of, a matching input sequence. The first
character, if any, after the input item remains unread. If
the length of the input item is zero, the execution of the
directive fails; this condition is a matching failure unless
end-of-file, an encoding error, or a read error prevented
input from the stream, in which case it is an input failure.


10
Except in the case of a % specifier, the input item (or, in
the case of a %n directive, the count of input characters) is
converted to a type appropriate to the conversion specifier.
If the input item is not a matching sequence, the execution
of the directive fails: this condition is a matching failure.
Unless assignment suppression was indicated by a *, the
result of the conversion is placed in the object pointed to
by the first argument following the format argument that has
not already received a conversion result. [...]

[...]

a,e,f,g Matches an optionally signed floating-point number,
infinity, or NaN, whose format is the same as expected for
the subject sequence of the strtod function. The
corresponding argument shall be a pointer to floating.
------------------------------------------------------------------------

The wording is a bit better since an input item is defined as:
"the longest sequence of input characters which
does not exceed any specified field width and which is, or is
a prefix of, a matching input sequence". Thus, again, "3.4e-" seems a
valid input item.

In both cases the only reasonable hint is that the conversion is done
using strtod, but this is not stated clearly! This simple C program
confirms this hypothesis (on my system):

------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
int main( void )
{
    // c will hold a ptr to the char after the last interpreted char
    char* c;
    double d = strtod( "3.4e-", &c );
    printf( "%g\n", d );   // --> 3.4
    printf( "%c\n", *c );  // --> e
    return 0;
}
------------------------------------------------------------------------

This is consistent with what strtod is expected to do by the C89 standard:

------------------------------------------------------------------------
4.10.1.4 The strtod function
[...]

 The strtod function converts the initial portion of the string pointed
to by nptr to double representation. First it decomposes the input
string into three parts: an initial, possibly empty, sequence of
white-space characters (as specified by the isspace function), a subject
sequence resembling a floating-point constant; and a final string of one
or more unrecognized characters, including the terminating null
character of the input string. Then it attempts to convert the subject
sequence to a floating-point number, and returns the result.

The expected form of the subject sequence is an optional plus or minus
sign, then a nonempty sequence of digits optionally containing a
decimal-point character, then an optional exponent part as defined in
3.1.3.1, but no floating suffix. The subject sequence is defined as the
longest subsequence of the input string, starting with the first
non-white-space character, that is an initial subsequence of a sequence
of the expected form. The subject sequence contains no characters if the
input string is empty or consists entirely of white space, or if the
first non-white-space character is other than a sign, a digit, or a
decimal-point character.
[...]
------------------------------------------------------------------------

Note: "[...]converts the initial portion[...]".

BTW, see also these docs for strtod[1] which are much clearer.

Therefore, assuming the %lf conversion is implemented using strtod under
the hood (which admittedly is not clear from either the C89 or the C99
standard), the result we are seeing is actually correct.

The problem with describing Lua's "*n" is then not limited to the case
where it fails. The documentation should probably state that "*n" can
consume an unspecified number of characters (>=0) and that it is "safe"
to use only when the number representation to be parsed is surrounded by
whitespace, since strtod skips leading whitespace and certainly stops
when it reaches whitespace.

Good wording is wanted (now I'm too tired to think of something good ;-)

Cheers!

-- Lorenzo

[1] http://en.cppreference.com/w/c/string/byte/strtof

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments