faster printf


faster printf

Theo Buehler-3
There is a simplification and optimization for __vfprintf() from android
pointed out by tedu:

https://github.com/aosp-mirror/platform_bionic/commit/5305a4d4a723b06494b93f2df81733b83a0c46d3

If we only support UTF-8 and ASCII, we do not need complicated multibyte
decoding to recognize a '%' in the format string.

In his commit message, enh claims that there is a 10x speedup. In my own
benchmarking on amd64, a speedup between 1.5x and 5x seems to be more
realistic. The code does get significantly simpler, so I think it might
be worth it.
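
For reference, a micro-benchmark along the following lines (format
string, buffer size, and iteration count are arbitrary) is enough to
see the difference between the old and the patched libc:

#include <stdio.h>
#include <time.h>

/*
 * Rough sketch of a snprintf() micro-benchmark: format a plain ASCII
 * string many times and report the elapsed wall-clock time.  Run it
 * once against the old __vfprintf() and once against the patched one.
 */
int
main(void)
{
    struct timespec start, end;
    char buf[256];
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < 10000000; i++)
        snprintf(buf, sizeof(buf), "iteration %ld of %s", i, "benchmark");
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("%.3f seconds\n", (end.tv_sec - start.tv_sec) +
        (end.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}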

The android commit has lots of style noise. There is a second part to it
that fixes some missing error checking in vfwprintf similar to some of
schwarze's fixes in vfprintf.c r1.71.  I'll send that one out in a
second mail as it's completely independent.

Do we want this?

Index: lib/libc/stdio/vfprintf.c
===================================================================
RCS file: /var/cvs/src/lib/libc/stdio/vfprintf.c,v
retrieving revision 1.77
diff -u -p -r1.77 vfprintf.c
--- lib/libc/stdio/vfprintf.c 29 Aug 2016 12:20:57 -0000 1.77
+++ lib/libc/stdio/vfprintf.c 14 Nov 2017 08:07:27 -0000
@@ -279,8 +279,6 @@ __vfprintf(FILE *fp, const char *fmt0, _
  int width; /* width from format (%8d), or 0 */
  int prec; /* precision from format; <0 for N/A */
  char sign; /* sign prefix (' ', '+', '-', or \0) */
- wchar_t wc;
- mbstate_t ps;
 #ifdef FLOATING_POINT
  /*
  * We can decompose the printed representation of floating
@@ -481,25 +479,13 @@ __vfprintf(FILE *fp, const char *fmt0, _
  convbuf = NULL;
 #endif
 
- memset(&ps, 0, sizeof(ps));
  /*
  * Scan the format for conversions (`%' character).
  */
  for (;;) {
- size_t len;
+ for (cp = fmt; (ch = *fmt) != '\0' && ch != '%'; fmt++)
+ continue;
 
- cp = fmt;
- while ((len = mbrtowc(&wc, fmt, MB_CUR_MAX, &ps)) != 0) {
- if (len == (size_t)-1 || len == (size_t)-2) {
- ret = -1;
- goto error;
- }
- fmt += len;
- if (wc == '%') {
- fmt--;
- break;
- }
- }
  if (fmt != cp) {
  ptrdiff_t m = fmt - cp;
  if (m < 0 || m > INT_MAX - ret)
@@ -507,7 +493,7 @@ __vfprintf(FILE *fp, const char *fmt0, _
  PRINT(cp, m);
  ret += m;
  }
- if (len == 0)
+ if (ch == '\0')
  goto done;
  fmt++; /* skip over '%' */
 
@@ -852,7 +838,9 @@ fp_common:
  xdigs = xdigs_lower;
  ox[1] = 'x';
  goto nosign;
- case 's':
+ case 's': {
+ size_t len;
+
 #ifdef PRINTF_WIDE_CHAR
  if (flags & LONGINT) {
  wchar_t *wcp;
@@ -893,6 +881,7 @@ fp_common:
  goto overflow;
  size = (int)len;
  sign = '\0';
+ }
  break;
  case 'U':
  flags |= LONGINT;
@@ -1156,8 +1145,6 @@ __find_arguments(const char *fmt0, va_li
  int tablemax; /* largest used index in table */
  int nextarg; /* 1-based argument index */
  int ret = 0; /* return value */
- wchar_t wc;
- mbstate_t ps;
 
  /*
  * Add an argument type to the table, expanding if necessary.
@@ -1211,25 +1198,14 @@ __find_arguments(const char *fmt0, va_li
  tablemax = 0;
  nextarg = 1;
  memset(typetable, T_UNUSED, STATIC_ARG_TBL_SIZE);
- memset(&ps, 0, sizeof(ps));
 
  /*
  * Scan the format for conversions (`%' character).
  */
  for (;;) {
- size_t len;
-
- cp = fmt;
- while ((len = mbrtowc(&wc, fmt, MB_CUR_MAX, &ps)) != 0) {
- if (len == (size_t)-1 || len == (size_t)-2)
- return (-1);
- fmt += len;
- if (wc == '%') {
- fmt--;
- break;
- }
- }
- if (len == 0)
+ for (cp = fmt; (ch = *fmt) != '\0' && ch != '%'; fmt++)
+ continue;
+ if (ch == '\0')
  goto done;
  fmt++; /* skip over '%' */
 


Re: faster printf

Todd C. Miller
On Tue, 14 Nov 2017 09:26:47 +0100, Theo Buehler wrote:

> If we only support UTF-8 and ASCII, we do not need complicated multibyte
> decoding to recognize a '%' in the format string.
>
> In his commit message, enh claims that there is a 10x speedup. In my own
> benchmarking on amd64, a speedup between 1.5x and 5x seems to be more
> realistic. The code does get significantly simpler, so I think it might
> be worth it.

I think we do want this change.  I've seen run-time failures before
due to that code, when snprintf() was used to write binary data to
a string (with the data passed directly as the format string) and
mbrtowc() returned an error.
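
A minimal reproducer of that failure mode looks roughly like this
(a sketch; the locale name and the bytes are only examples):

#include <locale.h>
#include <stdio.h>

int
main(void)
{
    char buf[16];
    int ret;

    setlocale(LC_ALL, "en_US.UTF-8");

    /* Binary data passed as the format string; 0xff is never valid UTF-8. */
    ret = snprintf(buf, sizeof(buf), "\xff\xfe\xfd");

    /* On a libc that validates the format, this prints -1 (errno EILSEQ);
     * elsewhere it prints 3 and buf holds the three raw bytes. */
    printf("ret = %d\n", ret);
    return 0;
}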

 - todd


Re: faster printf

Ingo Schwarze
In reply to this post by Theo Buehler-3
Hi Theo,

it's bad that i slacked on this, causing people to spend so much effort,
but i failed to make up my mind at first. I think now i see clearly.

Theo Buehler wrote on Tue, Nov 14, 2017 at 09:26:47AM +0100:

> There is a simplification and optimization for __vfprintf()
> from android pointed out by tedu:
>
> https://github.com/aosp-mirror/platform_bionic/commit/5305a4d4a723b06494b93f2df81733b83a0c46d3
>
> If we only support UTF-8 and ASCII, we do not need complicated multibyte
> decoding to recognize a '%' in the format string.

While that is true, the multibyte decoding is still required to
validate the format string, which is in turn required for correct
operation.  Consequently, the diff is incorrect and introduces
a bug.

POSIX says:
(http://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html)

  DESCRIPTION
    The format is a character string, [...]

  ERRORS
    In addition, all forms of fprintf() shall fail if:

    [EILSEQ]
      A wide-character code that does not correspond to a valid
      character has been detected.

That means that the functions are *required* to fail ("shall fail")
if encoding errors can be detected, that -1 must be returned, and
that errno must be set.
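
Concretely, under that reading a program can observe the mandated
failure like this (a sketch; the locale name and the broken byte are
only examples):

#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    int ret;

    setlocale(LC_ALL, "en_US.UTF-8");

    /* 0xc3 starts a two-byte UTF-8 sequence but is not followed by
     * a continuation byte, so the format is invalidly encoded. */
    errno = 0;
    ret = printf("bad \xc3 byte\n");
    if (ret < 0)
        fprintf(stderr, "printf: %s\n", strerror(errno));
    else
        fprintf(stderr, "printed %d bytes\n", ret);
    return 0;
}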

Even if someone finds a loophole to evade this reading of the
standard, after some thought, i consider the change reckless.
If somebody is using *printf(3) to format text output under a UTF-8
locale, the output is likely intended to be consumed by arbitrary
other multibyte-character-handling software, and such software,
in practice, is often of miserable quality regarding the handling
of encoding errors, and often misbehaves in scary ways when fed
invalidly encoded input.  DOS-like exploitability is a common
consequence, and arbitrary data corruption and other unpredictable
behaviour may also ensue.

Invalid UTF-8 encoding in a printf format string under the UTF-8
locale is invariably a severe coding error.  It is the job of the
operating system to catch dangerous coding errors as early as
possible, even if the standards would not absolutely require it,
and in particular if even the standards allow the interpretation
that the system is required to catch such errors.

While variable format strings are a thoroughly bad idea in the
vast majority of cases, they are not illegal, and there might
be rare valid use cases where this validation is even more
important than in the case of constant format strings.

> In his commit message, enh claims that there is a 10x speedup.

That's completely irrelevant.  Three things are at stake here:
correctness, simplicity, and performance.  Simplicity and performance
obviously point into the same direction, so there is no need to
even measure performance.  But correctness unconditionally
trumps both.

Sorry again,
  Ingo


Re: faster printf

Ingo Schwarze
In reply to this post by Todd C. Miller
Hi Todd,

Todd C. Miller wrote on Tue, Nov 14, 2017 at 09:09:13AM -0700:
> On Tue, 14 Nov 2017 09:26:47 +0100, Theo Buehler wrote:

>> If we only support UTF-8 and ASCII, we do not need complicated multibyte
>> decoding to recognize a '%' in the format string.
>>
>> In his commit message, enh claims that there is a 10x speedup. In my own
>> benchmarking on amd64, a speedup between 1.5x and 5x seems to be more
>> realistic. The code does get significantly simpler, so I think it might
>> be worth it.

> I think we do want this change.  I've seen run-time failures before
> due to that code, when snprintf() was used to write binary data to
> a string (with the data passed directly as the format string) and
> mbrtowc() returned an error.

As explained in my other message, that's not a libc bug in *printf(),
but a potentially security-relevant feature of the interfaces.

Software that uses *printf() in a non-POSIX locale to write binary
data (without explicitly using %c or %s) is broken, and such bugs
must not be swept under the carpet.

The fact that %c and %s can, depending on the context, easily be
misused to cause similarly dangerous consequences is an unfortunate
but unavoidable consequence of how the standard is written; it does
not excuse making the danger even worse by adding yet more
opportunity for havoc.

Yours,
  Ingo


Re: faster printf

Ted Unangst-6
In reply to this post by Ingo Schwarze
Ingo Schwarze wrote:
>     [EILSEQ]
>       A wide-character code that does not correspond to a valid
>       character has been detected.
>
> That means that the functions are *required* to fail ("shall fail")
> if encoding errors can be detected, that -1 must be returned, and
> that errno must be set.

I understand what you're saying, but I think strictly following the standard
is a bit silly here. This is a very poor way for the fastidious programmer to
check for valid byte sequences. Also, nobody is actually that careful. I'm
actually kind of surprised that such a check is performed. Usually the rule is
that the caller is responsible for checking. strdup doesn't return an error
for a NULL pointer. I would expect if I printf "\xc0\x80" that I will get the
two bytes c0 80 as output.

There's some danger of slippery slope here, which parts of the standard are
safe to ignore and which aren't. It's reasonable to argue it's best not to
ignore anything, but I feel pretty good about ignoring this rule as an
exception.


Re: faster printf

Stefan Sperling-5
On Thu, Nov 16, 2017 at 09:57:06AM -0500, Ted Unangst wrote:

> Ingo Schwarze wrote:
> >     [EILSEQ]
> >       A wide-character code that does not correspond to a valid
> >       character has been detected.
> >
> > That means that the functions are *required* to fail ("shall fail")
> > if encoding errors can be detected, that -1 must be returned, and
> > that errno must be set.
>
> I understand what you're saying, but I think strictly following the standard
> is a bit silly here. This is a very poor way for the fastidious programmer to
> check for valid byte sequences. Also, nobody is actually that careful. I'm
> actually kind of surprised that such a check is performed. Usually the rule is
> that the caller is responsible for checking. strdup doesn't return an error
> for a NULL pointer. I would expect if I printf "\xc0\x80" that I will get the
> two bytes c0 80 as output.
>
> There's some danger of slippery slope here, which parts of the standard are
> safe to ignore and which aren't. It's reasonable to argue it's best not to
> ignore anything, but I feel pretty good about ignoring this rule as an
> exception.
>

I agree with Ted.

I would expect EILSEQ during %lc and %ls conversions which explicitly
expect wide characters as arguments, but not for arbitrary data that
happens to be part of the format string.
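
A sketch of that distinction (the unencodable wide-character value and
the locale name are only examples; whether the second call fails is
exactly the point in question):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int
main(void)
{
    /* A wide-character value the locale presumably cannot encode;
     * 0x7fffffff is only an example of an unencodable value. */
    wchar_t ws[] = { (wchar_t)0x7fffffff, L'\0' };
    int ret;

    setlocale(LC_ALL, "en_US.UTF-8");

    /* EILSEQ expected here: the %ls argument has to be converted
     * with wcrtomb(), which cannot encode the bogus wide character. */
    ret = printf("%ls\n", ws);
    fprintf(stderr, "%%ls conversion: ret %d\n", ret);

    /* Under this reading, no failure here: the bytes are ordinary
     * characters in the format string and should just be copied out. */
    ret = printf("raw \xc0\x80 bytes\n");
    fprintf(stderr, "raw format: ret %d\n", ret);
    return 0;
}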


Re: faster printf

Todd C. Miller
On Thu, 16 Nov 2017 16:19:52 +0100, Stefan Sperling wrote:

> I would expect EILSEQ during %lc and %ls conversions which explicitly
> expect wide characters as arguments, but not for arbitrary data that
> happens to be part of the format string.

It is worth noting that this restriction is a POSIX extension not
present in the C standard.

Also, POSIX isn't explicit as to whether that restriction applies
to the format string or just the arguments to %lc and %ls conversions.

What it does say is:

    The format is composed of zero or more directives: ordinary
    characters, which are simply copied to the output stream, and
    conversion specifications, each of which shall result in the
    fetching of zero or more arguments.

I don't think the proposed change makes us non-compliant.

 - todd


Re: faster printf

Theo de Raadt-2
> Also, POSIX isn't explicit as to whether that restriction applies
> to the format string or just the arguments to %lc and %ls conversions.
>
> What it does say is:
>
>     The format is composed of zero or more directives: ordinary
>     characters, which are simply copied to the output stream, and
>     conversion specifications, each of which shall result in the
>     fetching of zero or more arguments.

Well that says the format string is a string, not a wide string.

I think EILSEQ and -1 are intended to apply entirely to failed
conversions, and these checks were mistakenly added to printf a while
ago.


Re: faster printf

Todd C. Miller
On Thu, 16 Nov 2017 09:52:39 -0700, "Theo de Raadt" wrote:

> > Also, POSIX isn't explicit as to whether that restriction applies
> > to the format string or just the arguments to %lc and %ls conversions.
> >
> > What it does say is:
> >
> >     The format is composed of zero or more directives: ordinary
> >     characters, which are simply copied to the output stream, and
> >     conversion specifications, each of which shall result in the
> >     fetching of zero or more arguments.
>
> Well that says the format string is a string, not a wide string.
>
> I think EILSEQ and -1 are intended to apply entirely to failed
> conversions, and these checks were mistakenly added to printf a while
> ago.

Yes, that's what I was trying to get across.

 - todd


Re: faster printf

Ingo Schwarze
In reply to this post by Theo de Raadt-2
Hi Theo,

Quick answer, more later:

Theo de Raadt wrote on Thu, Nov 16, 2017 at 09:52:39AM -0700:
> Todd Miller wrote:

>> Also, POSIX isn't explicit as to whether that restriction applies
>> to the format string or just the arguments to %lc and %ls conversions.
>>
>> What it does say is:
>>
>>     The format is composed of zero or more directives: ordinary
>>     characters, which are simply copied to the output stream, and
>>     conversion specifications, each of which shall result in the
>>     fetching of zero or more arguments.

> Well that says the format string is a string, not a wide string.

There are three kinds of strings, not two.  You are confusing wide
strings and multibyte strings.  It is certainly not a wide string.
It is a multibyte string, that is what the use of the word "character"
indicates.  If it were a byte string, it would talk about bytes
instead, see for example how POSIX describes %s:

  The argument shall be a pointer to an array of char.  Bytes from
  the array shall be written up to (but not including) any terminating
  null byte.


There isn't the slightest doubt that passing a format containing
non-UTF-8 bytes under a UTF-8 locale is invalid.  The only questions
are whether the standard says that is merely undefined, or whether
it requires failure.  If it requires failure, some think we should
ignore the standard; i say that isn't safe in this case (i'll explain
in more detail later why it isn't).  If it is undefined behaviour,
some seem to say it doesn't matter; i say failing closed is safer
even then.

> I think EILSEQ and -1 are intended to apply entirely to failed
> conversions,

That is not true.  For example, the function mblen(3) is specified
to return EILSEQ, and it does so.  So EILSEQ is also used for
validation even without conversion, even elsewhere.

> and these checks were mistakenly added to printf a while
> ago.

The *printf() functions set EILSEQ in these cases since revision 1.1
in 1995.

Yours,
  Ingo


Re: faster printf

Theo de Raadt-2
> Quick answer, more later:
>
> Theo de Raadt wrote on Thu, Nov 16, 2017 at 09:52:39AM -0700:
> > Todd Miller wrote:
>
> >> Also, POSIX isn't explicit as to whether that restriction applies
> >> to the format string or just the arguments to %lc and %ls conversions.
> >>
> >> What it does say is:
> >>
> >>     The format is composed of zero or more directives: ordinary
> >>     characters, which are simply copied to the output stream, and
> >>     conversion specifications, each of which shall result in the
> >>     fetching of zero or more arguments.
>
> > Well that says the format string is a string, not a wide string.
>
> There are three kinds of strings, not two.  You are confusing wide
> strings and multibyte strings.  It is certainly not a wide string.
> It is a multibyte string, that is what the use of the word "character"
> indicates.  If it were a byte string, it would talk about bytes
> instead, see for example how POSIX describes %s:
>
>   The argument shall be a pointer to an array of char.  Bytes from
>   the array shall be written up to (but not including) any terminating
>   null byte.

Doesn't make sense to me.  It says

   which are simply copied to the output stream

What part of "simply copied" involves calling a function which makes
a decision and decides to error instead?


Have you found a single example which needs to do the check?  Surely
if this is important, there will be at least one in ports.

So I'm positive they should be copied out byte-for-byte.  Without a
check.  "Simply" can mean without additional labour.  I also feel adding
-1 handling would only serve one purpose: increasing fragility.

Let us recall that the source tree used to contain no checks for
sprintf/snprintf returning -1.  Only checks for < size.  Solaris was
the first system to do this -1 stuff, and it took 15 years to complete
the work of adding the clumsy checks to our tree.

15 years of nearly wasted work, I suspect.

> > I think EILSEQ and -1 are intended to apply entirely to failed
> > conversions,
>
> That is not true.  For example, the function mblen(3) is specified
> to return EILSEQ, and it does so.  So EILSEQ is also used for
> validation even without conversion, even elsewhere.

The suggestion is that it should not be called.  If it is not called,
then it is irrelevant.

> > and these checks were mistakenly added to printf a while
> > ago.
>
> The *printf() functions set EILSEQ in these cases since revision 1.1
> in 1995.

Yes, I already proposed that someone made a mistake a while ago.



Re: faster printf

Todd C. Miller
On Thu, 16 Nov 2017 11:27:45 -0700, "Theo de Raadt" wrote:

> Yes, I already proposed that someone made a mistake a while ago.

This was added in NetBSD in 1995:

----------------------------
revision 1.17
date: 1995/05/02 19:52:41;  author: jtc;  state: Exp;  lines: +15 -8;
The C Standard says that printf's format string is a multi-byte
character string.  NA1 says that the 99 characters required by the
Standard have representations in the initial state which are one byte
long and do not alter the state.

Thus we can safely break apart the format string with mbtowc() until
we reach a '%' character, and then process format directive characters
one by one.

We really shouldn't be using mbtowc(), rather mbrtowc() (which takes a
mbstate-t argument) but we don't have the NA1 functions implemented
yet.  This is safe, because even when we do we're not likely to
support multi-byte character encodings that use shift states.
----------------------------

The change was never adopted by FreeBSD, and modern NetBSD doesn't
include it either.  There is nothing in C99 that I can find to
indicate that the format string is multi-byte.  Either this part
of NA1 was not adopted as part of C99 or jtc misread the standard.

I've done a brief survey using the test program at the end of
this message.  Here are the results:

OpenBSD:
    mbrtowc fail (expected)
    printf fail (unexpected)
Linux:
    mbrtowc fail (expected)
    printf OK, ret 1 (expected)
macOS:
    mbrtowc fail (expected)
    printf OK, ret 1 (expected)
Solaris 11:
    mbrtowc fail (expected)
    printf OK, ret 1 (expected)

OpenBSD is the outlier here.  If everyone else interprets the
standard differently than we do, I think it is reasonable to say
that our interpretation is incorrect.  Furthermore, the source of
that change (NetBSD) no longer includes it.

 - todd


Re: faster printf

Theo de Raadt-2
> On Thu, 16 Nov 2017 11:27:45 -0700, "Theo de Raadt" wrote:
>
> > Yes, I already proposed that someone made a mistake a while ago.
>
> This was added in NetBSD in 1995:
>
> ----------------------------
> revision 1.17
> date: 1995/05/02 19:52:41;  author: jtc;  state: Exp;  lines: +15 -8;
> The C Standard says that printf's format string is a multi-byte
> character string.  NA1 says that the 99 characters required by the
> Standard have representations in the initial state which are one byte
> long and do not alter the state.
>
> Thus we can safely break apart the format string with mbtowc() until
> we reach a '%' character, and then process format directive characters
> one by one.
>
> We really shouldn't be using mbtowc(), rather mbrtowc() (which takes a
> mbstate-t argument) but we don't have the NA1 functions implemented
> yet.  This is safe, because even when we do we're not likely to
> support multi-byte character encodings that use shift states.
> ----------------------------
>
> The change was never adopted by FreeBSD, and modern NetBSD doesn't
> include it either.  There is nothing in C99 that I can find to
> indicate that the format string is multi-byte.  Either this part
> of NA1 was not adopted as part of C99 or jtc misread the standard.
>
> I've done a brief survey using the test program at the end of
> this message.  Here are the results:
>
> OpenBSD:
>     mbrtowc fail (expected)
>     printf fail (unexpected)
> Linux:
>     mbrtowc fail (expected)
>     printf OK, ret 1 (expected)
> macOS:
>     mbrtowc fail (expected)
>     printf OK, ret 1 (expected)
> Solaris 11:
>     mbrtowc fail (expected)
>     printf OK, ret 1 (expected)
>
> OpenBSD is the outlier here.  If everyone else interprets the
> standard differently than we do, I think it is reasonable to say
> that our interpretation is incorrect.  Furthermore, the source of
> that change (NetBSD) no longer includes it.

I agree completely with that assessment.

The POSIX and ANSI committees are often unsafe and careless.  But
if they had suddenly decided that the format string needed to be
PARSED, they would have risked all sorts of crazy embedded shit
with format strings using \0NNN breaking in crazy ways.



Re: faster printf

Ingo Schwarze
In reply to this post by Theo de Raadt-2
Ingo Schwarze wrote on Fri, Nov 17, 2017 at 03:07:48PM +0100:

[ regarding cases where this may matter in practice ]
>  (2) Programs legitimately calling *printf() with a variable format
>      string in any non-POSIX locale, even if it's just UTF-8.

Whoa.  I just realized there is a very widespread subclass of this:
Programs using gettext(3) for printf(3) format strings.
They are easy to grep for with /printf\(_/.

Here are a few examples:

  aspell-0.60.6.1/prog/aspell.cpp: setlocale (LC_ALL, "");
  aspell-0.60.6.1/prog/aspell.cpp: printf(_("\n  %s filter: %s\n"), ...
    and many other instances, but i don't see any return value checks

  avahi-0.7/avahi-utils/avahi-browse.c: setlocale(LC_ALL, "");
  avahi-0.7/avahi-utils/avahi-browse.c: printf(_(": Cache exhausted\n"));
    and some others, again no return value checks

  e2fsprogs-1.42.12/resize/main.c: setlocale(LC_CTYPE, "");
  e2fsprogs-1.42.12/resize/online.c: printf(_("Filesystem at %s is mounted...
    and hundreds of others, no return value checks

  geeqie-1.3/src/main.c: setlocale(LC_ALL, "");
  geeqie-1.3/src/main.c: log_printf(_("Creating %s dir:%s\n"), ...
    dozens of them, no return value checks in sight

  git-2.14.1/gettext.c: setlocale(LC_CTYPE, "");
  git-2.14.1/builtin/merge.c: printf(_("Wonderful.\n"));
    hundreds of them, hard to say whether there are any return value
    checks, quite possibly not

  gnupg-2.1.23/common/i18n.c: setlocale (LC_ALL, "" );
  gnupg-2.1.23/g10/keygen.c: tty_printf(_("Invalid selection.\n"));
    dozens of them, no return value checks in sight

There are also many ports using g_strdup_printf(3) in this way; no
idea what that does, but it seems likely to call *printf(3) internally
in some way.

The show goes on (without checking for setlocale(3) to save time):
gnutls, libiconv, libv4l, mutt, openjp2, openldap, postgresql, vlc,
xz, ...

These are just some ports that i happened to build from source
recently.  So basically, almost *everybody* is using this, but
hardly anybody ever checks for success or failure.

When the return value is not checked, the change still makes the
following difference: Without the change, an invalidly encoded
format string prints nothing.  With the change, an invalidly encoded
format string prints invalidly encoded output.   The former may
sometimes be safer, but the missing information might sometimes
lead to trouble.


That said, i just checked what glibc and commercial Solaris 11 do,
and lo and behold:

  schwarze@donnerwolke:~/Test/printf$ uname -a
  Linux donnerwolke.asta-kit.de 4.9.0-0.bpo.3-686 #1 SMP \
    Debian 4.9.30-2+deb9u2~bpo8+1 (2017-06-27) i686 GNU/Linux

  schwarze@donnerwolke:~/Test/printf$ cat printf.c
  #include <err.h>
  #include <locale.h>
  #include <stdio.h>
 
  int
  main(void)
  {
        int irc;

        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
                errx(1, "setlocale");
        irc = printf("start\xc3\xa9middle\xc3%s\n", "end");
        printf("%d\n", irc);
        return 0;
  }

  schwarze@donnerwolke:~/Test/printf$ make printf
  cc     printf.c   -o printf

  schwarze@donnerwolke:~/Test/printf$ ./printf > tmp.txt ; echo $?
  0

  schwarze@donnerwolke:~/Test/printf$ hexdump -C tmp.txt
  00000000  73 74 61 72 74 c3 a9 6d  69 64 64 6c 65 c3 65 6e |start..middle.en|
  00000010  64 0a 31 38 0a                                   |d.18.|
  00000015

Same thing on Solaris.

Judging from a superficial look at the FreeBSD and NetBSD sources,
they don't appear to validate the format either.

So even if my reading of the standard should be correct (which some
here have challenged), given that everybody else here intuitively
expected the function to behave differently, that the stuff is
actually used a lot in practice, and given that most (if not all)
other implementations appear to behave the way that people intuitively
expect, i think i should stand down and no longer object to the
change.  Maybe i should consider a small clarification in the manual
page afterwards, not yet sure whether that is needed.

I don't think, though, that the commit message should advertise
this as a performance improvement.  It should be called an intentional
change of behaviour, now using the format string as a byte string
like everyone else, no matter whether POSIX explicitly specifies
it as a character string instead.

Yours,
  Ingo


Re: faster printf

Todd C. Miller
In reply to this post by Theo Buehler-3
On Fri, 17 Nov 2017 10:20:49 -0700, "Todd C. Miller" wrote:

> I've done a brief survey using the test program at the end of
> this message.  Here are the results:

Here's the missing test program.  It compares how mbrtowc() and
snprintf() treat an invalid UTF-8 sequence.  I chose a simple one.

 - todd

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(int argc, char *argv[])
{
    const char *fmt = "\xc0 ";
    char buf[1024];
    wchar_t wc;
    mbstate_t ps;
    size_t len;
    int ret;

    setlocale(LC_ALL, "en_US.UTF-8");

    memset(&ps, 0, sizeof(ps));
    len = mbrtowc(&wc, fmt, MB_CUR_MAX, &ps);
    if (len == (size_t)-1 || len == (size_t)-2)
        printf("mbrtowc fail (expected)\n");
    else
        printf("mbrtowc OK, len %zu (unexpected)\n", len);

    ret = snprintf(buf, sizeof(buf), "\xc0");
    if ((size_t)ret >= sizeof(buf))
        printf("printf fail (unexpected)\n");
    else
        printf("printf OK, ret %d (expected)\n", ret);
}


Re: faster printf

Theo de Raadt-2
In reply to this post by Ingo Schwarze
> I don't think, though, that the commit message should advertise
> this as a performance improvement.  It should be called an intentional
> change of behaviour, now using the format string as a byte string
> like everyone else, no matter whether POSIX explicitly specifies
> it as a character string instead.

I do not agree with your position that POSIX calls this a character
string.

Pure and simple, it is a legacy char *, and any attempt to re-standardize
it from that would crash cars and other heavy equipment.

I'd like to mention in a previous email you used the term fail-closed
incorrectly.

Returning an error number and changing behaviour (truncating)
when there potentially isn't a caller check isn't fail-closed.

Fail-closed implies the software crashes or terminates, so that the
bug can be inspected and fixed.

Returning an error in a newer version of a standard, when a past
standard passed data straight through?

With no way for authors to know the fallout, and requiring them to
audit all code for the broad practice of missing caller-checks?
Not realistic.

So it would be irresponsible to assume a check can be here.  I think
you have misread POSIX, or someone did 's/char/character/' too broadly.



Re: faster printf

Ingo Schwarze
Hi Theo,

Theo de Raadt wrote on Fri, Nov 17, 2017 at 10:43:10AM -0700:
> Ingo Schwarze wrote:

>> I don't think, though, that the commit message should advertise
>> this as a performance improvement.  It should be called an intentional
>> change of behaviour, now using the format string as a byte string
>> like everyone else, no matter whether POSIX explicitly specifies
>> it as a character string instead.

> I do not agree with your position that POSIX calls this a character
> string.

  http://pubs.opengroup.org/onlinepubs/9699919799/functions/fprintf.html

  "The format is a character string, ..."

Well, it does, and...

> Pure and simple, it is a legacy char *, and any attempt to re-standardize
> it from that would crash cars and other heavy equipment.

 ... given that POSIX-2008 is explicitly based on C99 and both C89
and C99 are even more explicit (see below), it follows that people
thirty years ago probably intended to crash cars and other heavy
equipment, even though you may quite possibly be right that it was
likely a bad idea back then.

> I'd like to mention in a previous email you used the term fail-closed
> incorrectly.
>
> Returning an error number and changing behaviour (truncating)
> when there potentially isn't a caller check isn't fail-closed.
>
> Fail-closed implies the software crashes or terminates, so that the
> bug can be inspected and fixed.
>
> Returning an error in a newer version of a standard, when a past
> standard passed data straight through?
>
> With no way for authors to know the fallout, and requiring them to
> audit all code for the broad practice of missing caller-checks?
> Not realistic.

Point taken.

> So it would be irresponsible to assume a check can be here.  I think
> you have misread POSIX, or someone did 's/char/character/' too broadly.

Todd's research revealed that jtc@ got the information from the
C standard in 1995, so i just checked what C89 (sic!) says:

  4.9.6.1 The fprintf function
  [...]
  The format shall be a multibyte character sequence, beginning and
  ending in its initial shift state.  The format is composed of
  zero or more directives: ordinary multibyte characters (not % ),
  which are copied unchanged to the output stream; and conversion
  specifications, each of which results in fetching zero or more
  subsequent arguments.

C99 says exactly the same in 7.19.6.1.3.

That teaches me that i should stop trusting POSIX to accurately and
clearly paraphrase the C standard.

In 7.19.6.1.14, C99 then goes on to say:

  The fprintf function returns the number of characters transmitted,
  or a negative value if an output or encoding error occurred.

That is even more explicit than POSIX, which only says:

  If an output error was encountered, these functions shall return a
  negative value and set errno to indicate the error.

So C99 explicitly requires failure *for encoding errors* and
explicitly requires multibyte encoding for the format string.
So it appears that *everybody* (except us) is in blatant violation
of C99.

To hell with multibyte characters!  How on earth do so many dragons
fit into such a small rabbit hole!

Sigh,
  Ingo


Re: faster printf

Theo de Raadt-2
> Todd's research revealed that jtc@ got the information from the
> C standard in 1995, so i just checked what C89 (sic!) says:
>
>   4.9.6.1 The fprintf function
>   [...]
>   The format shall be a multibyte character sequence, beginning and
>   ending in its initial shift state.  The format is composed of
>   zero or more directives: ordinary multibyte characters (not % ),
>   which are copied unchanged to the output stream; and conversion
>   specifications, each of which results in fetching zero or more
>   subsequent arguments.

I still don't understand, Ingo.

This means something to me:

    which are copied unchanged to the output stream

Perhaps it should more clearly say "unchecked".

> In 7.19.6.1.14, C99 then goes on to say:
>
>   The fprintf function returns the number of characters transmitted,
>   or a negative value if an output or encoding error occurred.

If the format characters are "copied unchanged to the output stream"
without checks, then there are no errors to worry about from them
and that point is irrelevant.

>   If an output error was encountered, these functions shall return a
>   negative value and set errno to indicate the error.
>
> So C99 explicitly requires failure *for encoding errors* and
> explicitly requires multibyte encoding for the format string.
> So it appears that *everybody* (except us) is in blatant violation
> of C99.

I still see no words saying it must check the bytes in the
format string.

> To hell with multibyte characters!  How on earth do so many dragons
> fit into such a small rabbit hole!

Breaking old software is unacceptable.


Re: faster printf

Theo de Raadt-2
In reply to this post by Ingo Schwarze
> So C99 explicitly requires failure *for encoding errors* and
> explicitly requires multibyte encoding for the format string.
> So it appears that *everybody* (except us) is in blatant violation
> of C99.
>
> To hell with multibyte characters!  How on earth do so many dragons
> fit into such a small rabbit hole!
>
> Sigh,

Humans make mistakes.  Those can be fixed.

Doubling down on dangerous weasel words of some document isn't
a mistake; it is foolish endangerment.

Can you please step back from your interpretation of the sloppy words
in the documents, and think HOW SHOULD THIS WORK AND WHAT WOULD BE THE
BEST DIRECTION.


THAT IS ENGINEERING RATHER THAN DOGMA.


Re: faster printf

Ingo Schwarze
Hi Theo,

Theo de Raadt wrote on Fri, Nov 17, 2017 at 11:59:47AM -0700:

> how should this work and what would be the best direction

The following two aspects provide no clear guidance on what is better:

 1. Both printing invalid bytes (in particular to terminals)
    and losing information that was intended to be printed
    can be dangerous.

 2. My brief look into the ports tree indicates that, while
    this is used a lot, people hardly ever check for any kind
    of failure or invalid data.

Consequently, it may be best to follow what other systems do such
that we at least get the unintended application program behaviour
in the same places, whatever the standards may say.

Even if not much pre-ANSI-C code should be left today, it seems
that most people intuitively expect the format string to be a
byte string rather than a multibyte character string, so it is
likely that new code was written post-ANSI-C and even post-C99
that assumes the format string can contain arbitrary bytes.

Conversely, while the look at the ports tree seems to indicate that
many people do (unintentionally?) rely on *printf(3) for the
validation of their variable format strings, as long as they don't
check for errors, neither of the options is clearly better than
the other in that respect.

So i withdraw my objection to the patch.

Yours,
  Ingo
