ksh(1): don't output invalid UTF-8 characters

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
23 messages Options
12
Reply | Threaded
Open this post in threaded view
|

ksh(1): don't output invalid UTF-8 characters

Anton Lindqvist
Hi,
I did submit this problem[1] earlier but with an incomplete analysis and
fix. Here's a second attempt.

This does only occur when running ksh with emacs mode under tmux. How to
re-produce:

1. Run ksh under tmux.

2. Input the following characters, without spaces:

   a (any character) ^B (backward-char) ö (any UTF-8 character)

3. At this point, the prompt gets overwritten.

Since ksh read a single byte of input, it will display a partial UTF-8
character before the whole character has been read. This is especially
troublesome when the cursor is not placed at the end of the line. In the
scenario above, after reading the first byte of 'ö' the following
sequence will be displayed:

  0xc3 0x61 0x08

That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
does the right thing here, since 0xc3 is a valid UTF-8 start byte it
expects it to be followed by a UTF-8 continuation byte which is not the
case. The two first bytes (0xc3, 0x61) are discarded and the parser is
reset to its initial state causing the backspace to be accepted and the
first character in the prompt to be overwritten.

After the second byte of 'ö' (0xb6) is read by ksh, the following
sequence will be displayed:

   0x08 0xc3 0xb6 0x61 0x08

That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
ksh assumes the cursor is correctly positioned it displays a leading
backspace in order to move passed the first character. This is however
not true causing another character in the prompt to be overwritten.

Below is diff that make sure to read a whole UTF-8 character in
x_emacs() prior doing another iteration of the main-loop which solves
the problem. It does not validate UTF-8 input but instead assumes every
such character is valid.

Comments and feedback are much appreciated.

[1] http://marc.info/?l=openbsd-misc&m=148509346310901&w=2

Index: emacs.c
===================================================================
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.67
diff -u -p -r1.67 emacs.c
--- emacs.c 12 May 2017 14:37:52 -0000 1.67
+++ emacs.c 14 May 2017 08:21:26 -0000
@@ -98,6 +98,7 @@ static int x_col;
 static int x_displen;
 static int x_arg; /* general purpose arg */
 static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
+static int x_getc_again;
 
 static int xlp_valid;
 /* end from 4.9 edit.h } */
@@ -142,6 +143,7 @@ static int x_fold_case(int);
 static char *x_lastcp(void);
 static void do_complete(int, Comp_type);
 static int isu8cont(unsigned char);
+static int u8len(unsigned char);
 
 /* proto's for keybindings */
 static int x_abort(int);
@@ -272,6 +274,21 @@ isu8cont(unsigned char c)
  return (c & (0x80 | 0x40)) == 0x80;
 }
 
+static int
+u8len(unsigned char c)
+{
+ switch (c & 0xF0) {
+ case 0xF0:
+ return 4;
+ case 0xE0:
+ return 3;
+ case 0xC0:
+ return 2;
+ default:
+ return 1;
+ }
+}
+
 int
 x_emacs(char *buf, size_t len)
 {
@@ -318,10 +335,12 @@ x_emacs(char *buf, size_t len)
  x_last_command = NULL;
  while (1) {
  x_flush();
- if ((c = x_e_getc()) < 0)
- return 0;
+ do {
+ if ((c = x_e_getc()) < 0)
+ return 0;
 
- line[at++] = c;
+ line[at++] = c;
+ } while (x_getc_again > 0);
  line[at] = '\0';
 
  if (x_arg == -1) {
@@ -364,7 +383,10 @@ x_emacs(char *buf, size_t len)
  } else {
  if (submatch)
  continue;
- if (at == 1)
+ if (at > 1) {
+ x_ins(line);
+ ret = KSTD;
+ } else if (at == 1)
  ret = x_insert(c);
  else
  ret = x_error(c); /* not matched meta sequence */
@@ -1887,8 +1909,12 @@ x_e_getc(void)
  macro_args = NULL;
  c = x_getc();
  }
- } else
+ } else {
  c = x_getc();
+ if (x_getc_again == 0)
+ x_getc_again = u8len(c);
+ x_getc_again--;
+ }
 
  return c;
 }

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Lucas Gabriel Vuotto
Hi,

On 19/05/17 03:42, Anton Lindqvist wrote:

> Hi,
> I did submit this problem[1] earlier but with an incomplete analysis and
> fix. Here's a second attempt.
>
> This does only occur when running ksh with emacs mode under tmux. How to
> re-produce:
>
> 1. Run ksh under tmux.
>
> 2. Input the following characters, without spaces:
>
>    a (any character) ^B (backward-char) ö (any UTF-8 character)
>
> 3. At this point, the prompt gets overwritten.
>
> Since ksh read a single byte of input, it will display a partial UTF-8
> character before the whole character has been read. This is especially
> troublesome when the cursor is not placed at the end of the line. In the
> scenario above, after reading the first byte of 'ö' the following
> sequence will be displayed:
>
>   0xc3 0x61 0x08
>
> That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> expects it to be followed by a UTF-8 continuation byte which is not the
> case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> reset to its initial state causing the backspace to be accepted and the
> first character in the prompt to be overwritten.
>
> After the second byte of 'ö' (0xb6) is read by ksh, the following
> sequence will be displayed:
>
>    0x08 0xc3 0xb6 0x61 0x08
>
> That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
> ksh assumes the cursor is correctly positioned it displays a leading
> backspace in order to move passed the first character. This is however
> not true causing another character in the prompt to be overwritten.
>
> Below is diff that make sure to read a whole UTF-8 character in
> x_emacs() prior doing another iteration of the main-loop which solves
> the problem. It does not validate UTF-8 input but instead assumes every
> such character is valid.
>
> Comments and feedback are much appreciated.
>
> [1] http://marc.info/?l=openbsd-misc&m=148509346310901&w=2
>
> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.67
> diff -u -p -r1.67 emacs.c
> --- emacs.c 12 May 2017 14:37:52 -0000 1.67
> +++ emacs.c 14 May 2017 08:21:26 -0000
> @@ -98,6 +98,7 @@ static int x_col;
>  static int x_displen;
>  static int x_arg; /* general purpose arg */
>  static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
> +static int x_getc_again;
>  
>  static int xlp_valid;
>  /* end from 4.9 edit.h } */
> @@ -142,6 +143,7 @@ static int x_fold_case(int);
>  static char *x_lastcp(void);
>  static void do_complete(int, Comp_type);
>  static int isu8cont(unsigned char);
> +static int u8len(unsigned char);
>  
>  /* proto's for keybindings */
>  static int x_abort(int);
> @@ -272,6 +274,21 @@ isu8cont(unsigned char c)
>   return (c & (0x80 | 0x40)) == 0x80;
>  }
>  
> +static int
> +u8len(unsigned char c)
> +{
> + switch (c & 0xF0) {
> + case 0xF0:
> + return 4;
> + case 0xE0:
> + return 3;
> + case 0xC0:
> + return 2;
> + default:
> + return 1;
> + }
> +}
> +

This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater than U+0400) would be interpreted as being 1 character long instead of 2.

>  int
>  x_emacs(char *buf, size_t len)
>  {
> @@ -318,10 +335,12 @@ x_emacs(char *buf, size_t len)
>   x_last_command = NULL;
>   while (1) {
>   x_flush();
> - if ((c = x_e_getc()) < 0)
> - return 0;
> + do {
> + if ((c = x_e_getc()) < 0)
> + return 0;
>  
> - line[at++] = c;
> + line[at++] = c;
> + } while (x_getc_again > 0);
>   line[at] = '\0';
>  
>   if (x_arg == -1) {
> @@ -364,7 +383,10 @@ x_emacs(char *buf, size_t len)
>   } else {
>   if (submatch)
>   continue;
> - if (at == 1)
> + if (at > 1) {
> + x_ins(line);
> + ret = KSTD;
> + } else if (at == 1)
>   ret = x_insert(c);
>   else
>   ret = x_error(c); /* not matched meta sequence */
> @@ -1887,8 +1909,12 @@ x_e_getc(void)
>   macro_args = NULL;
>   c = x_getc();
>   }
> - } else
> + } else {
>   c = x_getc();
> + if (x_getc_again == 0)
> + x_getc_again = u8len(c);
> + x_getc_again--;
> + }
>  
>   return c;
>  }
>

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Anton Lindqvist
On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote:

> Hi,
>
> On 19/05/17 03:42, Anton Lindqvist wrote:
> > Hi,
> > I did submit this problem[1] earlier but with an incomplete analysis and
> > fix. Here's a second attempt.
> >
> > This does only occur when running ksh with emacs mode under tmux. How to
> > re-produce:
> >
> > 1. Run ksh under tmux.
> >
> > 2. Input the following characters, without spaces:
> >
> >    a (any character) ^B (backward-char) ö (any UTF-8 character)
> >
> > 3. At this point, the prompt gets overwritten.
> >
> > Since ksh read a single byte of input, it will display a partial UTF-8
> > character before the whole character has been read. This is especially
> > troublesome when the cursor is not placed at the end of the line. In the
> > scenario above, after reading the first byte of 'ö' the following
> > sequence will be displayed:
> >
> >   0xc3 0x61 0x08
> >
> > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > expects it to be followed by a UTF-8 continuation byte which is not the
> > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > reset to its initial state causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
> >
> > After the second byte of 'ö' (0xb6) is read by ksh, the following
> > sequence will be displayed:
> >
> >    0x08 0xc3 0xb6 0x61 0x08
> >
> > That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
> > ksh assumes the cursor is correctly positioned it displays a leading
> > backspace in order to move passed the first character. This is however
> > not true causing another character in the prompt to be overwritten.
> >
> > Below is diff that make sure to read a whole UTF-8 character in
> > x_emacs() prior doing another iteration of the main-loop which solves
> > the problem. It does not validate UTF-8 input but instead assumes every
> > such character is valid.
> >
> > Comments and feedback are much appreciated.
> >
> > [1] http://marc.info/?l=openbsd-misc&m=148509346310901&w=2
> >
> > Index: emacs.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/emacs.c,v
> > retrieving revision 1.67
> > diff -u -p -r1.67 emacs.c
> > --- emacs.c 12 May 2017 14:37:52 -0000 1.67
> > +++ emacs.c 14 May 2017 08:21:26 -0000
> > @@ -98,6 +98,7 @@ static int x_col;
> >  static int x_displen;
> >  static int x_arg; /* general purpose arg */
> >  static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
> > +static int x_getc_again;
> >  
> >  static int xlp_valid;
> >  /* end from 4.9 edit.h } */
> > @@ -142,6 +143,7 @@ static int x_fold_case(int);
> >  static char *x_lastcp(void);
> >  static void do_complete(int, Comp_type);
> >  static int isu8cont(unsigned char);
> > +static int u8len(unsigned char);
> >  
> >  /* proto's for keybindings */
> >  static int x_abort(int);
> > @@ -272,6 +274,21 @@ isu8cont(unsigned char c)
> >   return (c & (0x80 | 0x40)) == 0x80;
> >  }
> >  
> > +static int
> > +u8len(unsigned char c)
> > +{
> > + switch (c & 0xF0) {
> > + case 0xF0:
> > + return 4;
> > + case 0xE0:
> > + return 3;
> > + case 0xC0:
> > + return 2;
> > + default:
> > + return 1;
> > + }
> > +}
> > +
>
> This is wrong: most codepoints in the range U+0080-U+07ff (the ones greater than U+0400) would be interpreted as being 1 character long instead of 2.

Thanks for the heads-up. Maybe a more reliable solution would be to call
mbtowc(3) repeatedly as new input arrives until it returns successfully.
Assuming the first read byte is a UTF-8 start byte.

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
In reply to this post by Anton Lindqvist
Hi Anton,

Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:

> 1. Run ksh under tmux.
>
> 2. Input the following characters, without spaces:
>
>    a (any character) ^B (backward-char) ö (any UTF-8 character)
>
> 3. At this point, the prompt gets overwritten.
>
> Since ksh read a single byte of input, it will display a partial UTF-8
> character before the whole character has been read. This is especially
> troublesome when the cursor is not placed at the end of the line. In the
> scenario above, after reading the first byte of 'ö' the following
> sequence will be displayed:
>
>   0xc3 0x61 0x08
>
> That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> expects it to be followed by a UTF-8 continuation byte which is not the
> case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> reset to its initial state

I call that a bug in tmux.  At least for UTF-8, tmux should never reset
its parser.  Depending on the keyboard configuration, it may be possible
to enter single non-UTF-8 bytes.  For example, at the console, i can do
this:

 - type "printf x"
 - press Ctrl-V, Ctrl-Alt-Z
 - type "printf x | hexdump -C"

The result is:

  00000000  78 9a 78   |x.x|
  00000003

All of the shell, the console, and xterm more or less handle such
stunts, even though admittedly, that is not the usual way of typing
in a binary file.  tmux is a terminal emulator, so it ought to cope,
too, and not reset its state.

Note that this is yet another instance where the concept of arbitrary
locales is utterly broken and insecure.  On some non-OpenBSD system
supporting such insecure stuff, if the user has set an arbitrary,
non-UTF-8, state-dependent locale and somehow manages to insert an
invalid byte into the input stream, the state of the input stream
becomes invalid, no further characters can be read from that terminal,
and *there is no way to recover*.  The only remaining half-secure
option is for the shell to exit, which may also not be very secure,
on a different level.

But with UTF-8, there is no problem whatsoever dealing with invalid
bytes, so tmux ought to cope.

> causing the backspace to be accepted and the
> first character in the prompt to be overwritten.

That's part of the reason why tmux must not reset its state:
The shell won't know about it, and the screen will become garbled.

> Below is diff that make sure to read a whole UTF-8 character in
> x_emacs() prior doing another iteration of the main-loop

I don't think we can do that.  What if there is no next byte?
Then the shell will hang until, maybe considerably later, the next
character is typed.  Also, we cannot rely on parsing the UTF-8 start
byte (even if done correctly).  It tells the byte length of the
character only for valid byte sequences, and the byte sequence need
not be valid.

Besides, i think patching the shell to work around terminal bugs
(or terminal emulator bugs) is the wrong approach in the first place.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott-2
Hi

Perhaps I haven't understood what you are saying correctly, but I don't
think it is possible to send control characters or any other invalid
UTF-8 bytes inside UTF-8 characters and safely predict what the terminal
will do. How about these examples:

printf '\343\203\010\217a\n'
printf '\343\203\r\217b\n'
printf '\343\203\n\217c\n'

Or with a width 1 character:

printf '\342\211\010\240\n'
printf '\342\211\r\240b\n'
printf '\342\211\n\240c\n'

I have four UTF-8 capable X terminals here and between them they do
three different things (xterm, rxvt-unicode, gnome-terminal & konsole).

tmux can't forward this stuff to the terminal outside, because if it
can't predict what that terminal will do, tmux's idea of the display
will get out of sync with reality.

Having tmux ignore the whole lot seems like a relatively sensible
course.

The only other alternative would be to substitute U+FFFD. But that seems
iffy too - U+FFFD is width 1, but what if the application is expecting
a width 2?




On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote:

> Hi Anton,
>
> Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:
>
> > 1. Run ksh under tmux.
> >
> > 2. Input the following characters, without spaces:
> >
> >    a (any character) ^B (backward-char) ?? (any UTF-8 character)
> >
> > 3. At this point, the prompt gets overwritten.
> >
> > Since ksh read a single byte of input, it will display a partial UTF-8
> > character before the whole character has been read. This is especially
> > troublesome when the cursor is not placed at the end of the line. In the
> > scenario above, after reading the first byte of '??' the following
> > sequence will be displayed:
> >
> >   0xc3 0x61 0x08
> >
> > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > expects it to be followed by a UTF-8 continuation byte which is not the
> > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > reset to its initial state
>
> I call that a bug in tmux.  At least for UTF-8, tmux should never reset
> its parser.  Depending on the keyboard configuration, it may be possible
> to enter single non-UTF-8 bytes.  For example, at the console, i can do
> this:
>
>  - type "printf x"
>  - press Ctrl-V, Ctrl-Alt-Z
>  - type "printf x | hexdump -C"
>
> The result is:
>
>   00000000  78 9a 78   |x.x|
>   00000003
>
> All of the shell, the console, and xterm more or less handle such
> stunts, even though admittedly, that is not the usual way of typing
> in a binary file.  tmux is a terminal emulator, so it ought to cope,
> too, and not reset its state.
>
> Note that this is yet another instance where the concept of arbitrary
> locales is utterly broken and insecure.  On some non-OpenBSD system
> supporting such insecure stuff, if the user has set an arbitrary,
> non-UTF-8, state-dependent locale and somehow manages to insert an
> invalid byte into the input stream, the state of the input stream
> becomes invalid, no further characters can be read from that terminal,
> and *there is no way to recover*.  The only remaining half-secure
> option is for the shell to exit, which may also not be very secure,
> on a different level.
>
> But with UTF-8, there is no problem whatsoever dealing with invalid
> bytes, so tmux ought to cope.
>
> > causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
>
> That's part of the reason why tmux must not reset its state:
> The shell won't know about it, and the screen will become garbled.
>
> > Below is diff that make sure to read a whole UTF-8 character in
> > x_emacs() prior doing another iteration of the main-loop
>
> I don't think we can do that.  What if there is no next byte?
> Then the shell will hang until, maybe considerably later, the next
> character is typed.  Also, we cannot rely on parsing the UTF-8 start
> byte (even if done correctly).  It tells the byte length of the
> character only for valid byte sequences, and the byte sequence need
> not be valid.
>
> Besides, i think patching the shell to work around terminal bugs
> (or terminal emulator bugs) is the wrong approach in the first place.
>
> Yours,
>   Ingo
>

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott-2
ksh has problems for me with Anton's example in several terminals, not
just in tmux. Mostly the cursor seems to end up one character off rather
than in the prompt, which is less visibly incorrect perhaps, but still
wrong.

I don't know that ksh will be able to predict this reliably (not
uncommon with shells).

FWIW bash seems to do the replacement to U+FFFD itself before it sends
it to the terminal, which means it is (more) predictable. I don't know
if this is a sensible option for ksh.




On Fri, May 19, 2017 at 07:04:53PM +0100, Nicholas Marriott wrote:

> Hi
>
> Perhaps I haven't understood what you are saying correctly, but I don't
> think it is possible to send control characters or any other invalid
> UTF-8 bytes inside UTF-8 characters and safely predict what the terminal
> will do. How about these examples:
>
> printf '\343\203\010\217a\n'
> printf '\343\203\r\217b\n'
> printf '\343\203\n\217c\n'
>
> Or with a width 1 character:
>
> printf '\342\211\010\240\n'
> printf '\342\211\r\240b\n'
> printf '\342\211\n\240c\n'
>
> I have four UTF-8 capable X terminals here and between them they do
> three different things (xterm, rxvt-unicode, gnome-terminal & konsole).
>
> tmux can't forward this stuff to the terminal outside, because if it
> can't predict what that terminal will do, tmux's idea of the display
> will get out of sync with reality.
>
> Having tmux ignore the whole lot seems like a relatively sensible
> course.
>
> The only other alternative would be to substitute U+FFFD. But that seems
> iffy too - U+FFFD is width 1, but what if the application is expecting
> a width 2?
>
>
>
>
> On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote:
> > Hi Anton,
> >
> > Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:
> >
> > > 1. Run ksh under tmux.
> > >
> > > 2. Input the following characters, without spaces:
> > >
> > >    a (any character) ^B (backward-char) ?? (any UTF-8 character)
> > >
> > > 3. At this point, the prompt gets overwritten.
> > >
> > > Since ksh read a single byte of input, it will display a partial UTF-8
> > > character before the whole character has been read. This is especially
> > > troublesome when the cursor is not placed at the end of the line. In the
> > > scenario above, after reading the first byte of '??' the following
> > > sequence will be displayed:
> > >
> > >   0xc3 0x61 0x08
> > >
> > > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > > expects it to be followed by a UTF-8 continuation byte which is not the
> > > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > > reset to its initial state
> >
> > I call that a bug in tmux.  At least for UTF-8, tmux should never reset
> > its parser.  Depending on the keyboard configuration, it may be possible
> > to enter single non-UTF-8 bytes.  For example, at the console, i can do
> > this:
> >
> >  - type "printf x"
> >  - press Ctrl-V, Ctrl-Alt-Z
> >  - type "printf x | hexdump -C"
> >
> > The result is:
> >
> >   00000000  78 9a 78   |x.x|
> >   00000003
> >
> > All of the shell, the console, and xterm more or less handle such
> > stunts, even though admittedly, that is not the usual way of typing
> > in a binary file.  tmux is a terminal emulator, so it ought to cope,
> > too, and not reset its state.
> >
> > Note that this is yet another instance where the concept of arbitrary
> > locales is utterly broken and insecure.  On some non-OpenBSD system
> > supporting such insecure stuff, if the user has set an arbitrary,
> > non-UTF-8, state-dependent locale and somehow manages to insert an
> > invalid byte into the input stream, the state of the input stream
> > becomes invalid, no further characters can be read from that terminal,
> > and *there is no way to recover*.  The only remaining half-secure
> > option is for the shell to exit, which may also not be very secure,
> > on a different level.
> >
> > But with UTF-8, there is no problem whatsoever dealing with invalid
> > bytes, so tmux ought to cope.
> >
> > > causing the backspace to be accepted and the
> > > first character in the prompt to be overwritten.
> >
> > That's part of the reason why tmux must not reset its state:
> > The shell won't know about it, and the screen will become garbled.
> >
> > > Below is diff that make sure to read a whole UTF-8 character in
> > > x_emacs() prior doing another iteration of the main-loop
> >
> > I don't think we can do that.  What if there is no next byte?
> > Then the shell will hang until, maybe considerably later, the next
> > character is typed.  Also, we cannot rely on parsing the UTF-8 start
> > byte (even if done correctly).  It tells the byte length of the
> > character only for valid byte sequences, and the byte sequence need
> > not be valid.
> >
> > Besides, i think patching the shell to work around terminal bugs
> > (or terminal emulator bugs) is the wrong approach in the first place.
> >
> > Yours,
> >   Ingo
> >

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 07:27:36PM +0100:

> ksh has problems for me with Anton's example in several terminals,
> not just in tmux. Mostly the cursor seems to end up one character
> off rather than in the prompt, which is less visibly incorrect
> perhaps, but still wrong.

Can't reproduce.  At least when having LC_CTYPE=en_US.UTF-8 set
when starting the terminal, there is no problem for me with
Anton's example in xterm, urxvt, gnome-terminal, or konsole.

If i type

  echo xx

then left-arrow, a width-one non-ASCII character, a dot, and hit
enter, the dot appears at the right place both on the command line
and in the output.  Moving around with arrow-left or arrow-right
after typing the non-ASCII character to insert the dot at a different
place doesn't break anything for me either.

The output is correct even with tmux, all that tmux garbles is the
command line display.


On a side note, i don't think gnome-terminal and konsole are relevant.
I never installed them before and did so now for the first time for
testing, but they installed so many libraries that i feel uncomfortable
and unsafe using them even briefly and purely for testing purposes.
I will certainly pkg_delete them, and the libraries they pulled in, in
the near future.

I can't believe anyone in their right mind would use a terminal
depending on webkit, json, javascript, geoclue, pulseaudio, polkit,
libssh and the like for any serious work (and on top of that, konsole
is spewing huge amounts of scary error messages to stderr and/or
stdout).

> FWIW bash seems to do the replacement to U+FFFD itself before it sends
> it to the terminal, which means it is (more) predictable. I don't know
> if this is a sensible option for ksh.

That is not really an option.  This matters most while the multibyte
character is being typed, when the first byte is already being
processed but the second one not yet.  Replacing the byte with
U+FFFD, then later substituting the actual bytes back when all have
arrived, doesn't really make much sense to me.

Hanging the shell until all expected bytes have arrived seems like
a bad idea, too.  You can see the misbehaviour that is causing in
programs like uniname(1) from the misc/uniutils package:

 $ printf '\377turd' | uniname

Now the program hangs indefinitely.  I wouldn't want the shell to
hang in such a manner.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
In reply to this post by Nicholas Marriott-2
Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:

> Perhaps I haven't understood what you are saying correctly,

What matters most is that sending an incomplete character
followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
that it doesn't change the line being edited and that it doesn't
change the display.  All terminals you mentioned seem to conform
to that according to my testing, except tmux.

> but I don't think it is possible to send control characters or
> any other invalid UTF-8 bytes inside UTF-8 characters and safely
> predict what the terminal will do. How about these examples:

If the invalid bytes are still present by the time the line is sent
off for processing (like in your example printf '\343\203\n'), then
it is indeed hard to predict what random terminals will do, though
i would argue that xterm's behaviour is correct (print one substitution
glyph for each incomplete character, or if bytes don't form even
incomplete characters, then one for each such invalid byte.

urxvt is clearly broken:

 $ printf '\343\203x\n'

prints U+00E3 x linefeed; i have no idea what it does to garble
0xe383 into 0xc3a3.  Maybe some naive misparsing, or spewing out
incomplete parsing state in some inconsistent way.

gnome-terminal and konsole print one replacement character for each
invalid byte, even if bytes form an incomplete character.  Maybe
not outright wrong, but arguably a bit confusing.

So yeah, if lines containing incomplete sequences *when they are
sent off* misformat with tmux and gnome-terminal or konsole, i
wouldn't call that tmux'es fault, and i agree there is little that
can be done about it.

> Having tmux ignore the whole lot seems like a relatively sensible
> course.

Well, what tmux currently does is making sure that everything gets
broken in the maximum possible way on every terminal, even if the
line that is finally sent off is completely correct.

> The only other alternative would be to substitute U+FFFD.

Why not just pass the bytes through?  I don't think it's the job
of a terminal mutiplexer to mess with individual bytes.  It's the
job of the final terminal doing the display to select glyphs and
place them, for printable characters, for non-printable characters,
for incomplete characters, and for invalid bytes.

> But that seems iffy too - U+FFFD is width 1, but what if the
> application is expecting a width 2?

By definition, incomplete characters and invalid bytes don't
have a width, so it doesn't matter what the application wanted
(for example, which character the user intended to type but
didn't finish).  What matters is how wide the replacement
glyph will look on the final terminal.  In that respect, we
cannot help making an assumption, and "incomplete sequences
and invalid bytes are displayed as U+FFFD (i.e. width 1)
seems about the best we can do.  That may be slightly off
for gnome-terminal and konsole, but i don't see how that can
be helped.

Anyway, these subtleties of invalid bytes that *remain* are not
the main inconvenience in practice.  What matters more is that
tmux breaks even if incomplete characters are deleted again
with backspaces and never sent off.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott-2
Hi

On Fri, May 19, 2017 at 10:23:08PM +0200, Ingo Schwarze wrote:

> Hi Nicholas,
>
> Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:
>
> > Perhaps I haven't understood what you are saying correctly,
>
> What matters most is that sending an incomplete character
> followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
> that it doesn't change the line being edited and that it doesn't
> change the display.  All terminals you mentioned seem to conform
> to that according to my testing, except tmux.

They don't all do the same thing for me. I am doing this, the same as
ksh does:

printf 'a\010\342a\010\010\342\211a\010\010\342\211\240a\010\n'

And the terminals behave differently.

> > but I don't think it is possible to send control characters or
> > any other invalid UTF-8 bytes inside UTF-8 characters and safely
> > predict what the terminal will do. How about these examples:
>
> If the invalid bytes are still present by the time the line is sent
> off for processing (like in your example printf '\343\203\n'), then
> it is indeed hard to predict what random terminals will do, though

I don't know what you mean by "sent off for processing".

I think you might be under some misapprehension.

\010 (ASCII BS) is just a cursor positioning sequence, all it does is
move the cursor one position to the left. \n the same, it moves the
cursor one line down.

The problem here is that ksh assumes a partial UTF-8 character (whether
one byte or two) will also move the cursor by one position, but that is
not always the case. Not for tmux, and - for me - not for other
terminals.

Are you thinking of ICANON? That is a kernel mechanism, and when it is
in use, the backspace will never reach the tmux. But ksh doesn't appear
to use it.

> i would argue that xterm's behaviour is correct (print one substitution
> glyph for each incomplete character, or if bytes don't form even
> incomplete characters, then one for each such invalid byte.

So you think the correct behaviour is to replace invalid sequences by
U+FEFF?

tmux could do that, but it won't fix ksh for other terminals. ksh is
making an assumption about how terminals will behave that is not
correct.

> urxvt is clearly broken:
>
>  $ printf '\343\203x\n'
>
> prints U+00E3 x linefeed; i have no idea what it does to garble
> 0xe383 into 0xc3a3.  Maybe some naive misparsing, or spewing out
> incomplete parsing state in some inconsistent way.
>
> gnome-terminal and konsole print one replacement character for each
> invalid byte, even if bytes form an incomplete character.  Maybe
> not outright wrong, but arguably a bit confusing.

This doesn't sound like all the terminals doing the same thing.

> So yeah, if lines containing incomplete sequences *when they are
> sent off* misformat with tmux and gnome-terminal or konsole, i
> wouldn't call that tmux'es fault, and i agree there is little that
> can be done about it.

There is no "sent off". What you send is what you get, invalid
characters, backspaces, everything.

>
> > Having tmux ignore the whole lot seems like a relatively sensible
> > course.
>
> Well, what tmux currently does is making sure that everything gets
> broken in the maximum possible way on every terminal, even if the
> line that is finally sent off is completely correct.
>
> > The only other alternative would be to substitute U+FFFD.
>
> Why not just pass the bytes through?  I don't think it's the job
> of a terminal mutiplexer to mess with individual bytes.  It's the
> job of the final terminal doing the display to select glyphs and
> place them, for printable characters, for non-printable characters,
> for incomplete characters, and for invalid bytes.

No no, it is the job of tmux to interpret what it receives from the
application and make sure it shows the same no matter what terminal is
outside.

If terminals outside behave differently for the same output, tmux can't
keep its own internal state correct, so the display will be
mangled. tmux has to understand everything it receives, there is almost
no case where we can just pass through.

>
> > But that seems iffy too - U+FFFD is width 1, but what if the
> > application is expecting a width 2?
>
> By definition, incomplete characters and invalid bytes don't
> have a width, so it doesn't matter what the application wanted
> (for example, which character the user intended to type but
> didn't finish).  What matters is how wide the replacement
> glyph will look on the final terminal.  In that respect, we
> cannot help making an assumption, and "incomplete sequences
> and invalid bytes are displayed as U+FFFD (i.e. width 1)
> seems about the best we can do.  That may be slightly off
> for gnome-terminal and konsole, but i don't see how that can
> be helped.
>
> Anyway, these subtleties of invalid bytes that *remain* are not
> the main inconvenience in practice.  What matters more is that
> tmux breaks even if incomplete characters are deleted again
> with backspaces and never sent off.

Backspace does not delete anything, it just moves the cursor.

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott-2
In reply to this post by Ingo Schwarze
On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote:
> On a side note, i don't think gnome-terminal and konsole are relevant.
> I never installed them before and did so now for the first time for
> testing, but they installed so many libraries that i feel uncomfortable
> and unsafe using them even briefly and purely for testing purposes.
> I will certainly pkg_delete them, and the libraries they pulled in, in
> the near future.

You might not use them, and I don't use them, but they are very popular,
as are half a dozen others that may well have different behaviours.

> > FWIW bash seems to do the replacement to U+FFFD itself before it sends
> > it to the terminal, which means it is (more) predictable. I don't know
> > if this is a sensible option for ksh.
>
> That is not really an option.  This matters most while the multibyte
> character is being typed, when the first byte is already being
> processed but the second one not yet.  Replacing the byte with
> U+FFFD, then later substituting the actual bytes back when all have
> arrived, doesn't really make much sense to me.

It would help because terminals would know how to deal with the
character.

So ksh is doing this now - if we pretend A,B,C are UTF-8 characters:

A,cursor left,B,cursor left,C

So first A appears, then B overwrites A, then C overwrites B.

Except A and B are invalid so the terminal may not do the right thing
with them. It may draw no characters and not move the cursor (like
tmux), or draw two and move the cursor by two (like rxvt-unicode appears
to).

If you sent U+FFFD instead of the invalid characters, that is a known
character and terminals should be guaranteed to do what ksh wants (move
the cursor back one position to the right), until ksh overwrites it
again.

> Hanging the shell until all expected bytes have arrived seems like
> a bad idea, too.  You can see the misbehaviour that is causing in
> programs like uniname(1) from the misc/uniutils package:

I don't really understand this. How does ksh handle cursor keys? What if
when I press Left, the three bytes (\033OA) are split across two
separate read()s? How does this avoid hanging ksh?

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott-2
Having a look at ksh, I don't see how Anton's original diff is much
different from x_emacs() looping around x_e_getc() until it finishes a
long key input?

It would be better to stop reading early if an invalid UTF-8 byte is
input rather than always requiring exactly N bytes; he needs to fix his
u8len(); and I would want to test the other calls to x_e_getc(), such as
in x_search_char_forw() -- but I the general idea seems reasonable.



On Sat, May 20, 2017 at 12:04:35AM +0100, Nicholas Marriott wrote:

> On Fri, May 19, 2017 at 09:29:06PM +0200, Ingo Schwarze wrote:
> > On a side note, i don't think gnome-terminal and konsole are relevant.
> > I never installed them before and did so now for the first time for
> > testing, but they installed so many libraries that i feel uncomfortable
> > and unsafe using them even briefly and purely for testing purposes.
> > I will certainly pkg_delete them, and the libraries they pulled in, in
> > the near future.
>
> You might not use them, and I don't use them, but they are very popular,
> as are half a dozen others that may well have different behaviours.
>
> > > FWIW bash seems to do the replacement to U+FFFD itself before it sends
> > > it to the terminal, which means it is (more) predictable. I don't know
> > > if this is a sensible option for ksh.
> >
> > That is not really an option.  This matters most while the multibyte
> > character is being typed, when the first byte is already being
> > processed but the second one not yet.  Replacing the byte with
> > U+FFFD, then later substituting the actual bytes back when all have
> > arrived, doesn't really make much sense to me.
>
> It would help because terminals would know how to deal with the
> character.
>
> So ksh is doing this now - if we pretend A,B,C are UTF-8 characters:
>
> A,cursor left,B,cursor left,C
>
> So first A appears, then B overwrites A, then C overwrites B.
>
> Except A and B are invalid so the terminal may not do the right thing
> with them. It may draw no characters and not move the cursor (like
> tmux), or draw two and move the cursor by two (like rxvt-unicode appears
> to).
>
> If you sent U+FFFD instead of the invalid characters, that is a known
> character and terminals should be guaranteed to do what ksh wants (move
> the cursor back one position to the right), until ksh overwrites it
> again.
>
> > Hanging the shell until all expected bytes have arrived seems like
> > a bad idea, too.  You can see the misbehaviour that is causing in
> > programs like uniname(1) from the misc/uniutils package:
>
> I don't really understand this. How does ksh handle cursor keys? What if
> when I press Left, the three bytes (\033OA) are split across two
> separate read()s? How does this avoid hanging ksh?

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Boudewijn Dijkstra-3
In reply to this post by Anton Lindqvist
Op Fri, 19 May 2017 15:17:55 +0200 schreef Anton Lindqvist  
<[hidden email]>:

> On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote:
>> On 19/05/17 03:42, Anton Lindqvist wrote:
>> >
>> > +static int
>> > +u8len(unsigned char c)
>> > +{
>> > + switch (c & 0xF0) {
>> > + case 0xF0:
>> > + return 4;
>> > + case 0xE0:
>> > + return 3;
>> > + case 0xC0:
>> > + return 2;
>> > + default:
>> > + return 1;
>> > + }
>> > +}
>> > +
>>
>> This is wrong: most codepoints in the range U+0080-U+07ff (the ones  
>> greater than U+0400) would be interpreted as being 1 character long  
>> instead of 2.
>
> Thanks for the heads-up. Maybe a more reliable solution would be to call
> mbtowc(3) repeatedly as new input arrives until it returns successfully.
> Assuming the first read byte is a UTF-8 start byte.

Not needed. Only case 0xD0 is missing.

case 0xC0: case 0xD0:
  return 2;



--
Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Anton Lindqvist
In reply to this post by Nicholas Marriott-2
Hi,
Although this discussion hasn't settled, here's a new diff trying to
address the previously raised issues:

- The new function x_e_getu8() tries to read a complete UTF-8 character.
  When a continuation byte is expected but not received, it resets its
  state and retries.

  The fix to u8len() from Boudewijn Dijkstra is incorporated in this new
  function, thanks!

- In x_emacs(), In order to distinguish between when the line buffer
  contains more than one character due to a UTF-8 character being read
  or caused by calling x_e_getu8() repeatedly the number of calls to
  x_e_getu8() is stored ntries. This is of importance to preserve the
  existing behavior where none matched escape sequences should not be
  inserted but instead ring the bell.

Index: emacs.c
===================================================================
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.67
diff -u -p -r1.67 emacs.c
--- emacs.c 12 May 2017 14:37:52 -0000 1.67
+++ emacs.c 4 Jun 2017 09:06:54 -0000
@@ -135,6 +135,7 @@ static void x_push(int);
 static void x_adjust(void);
 static void x_e_ungetc(int);
 static int x_e_getc(void);
+static int x_e_getu8(char *, int);
 static void x_e_putc(int);
 static void x_e_puts(const char *);
 static int x_comment(int);
@@ -277,7 +278,7 @@ x_emacs(char *buf, size_t len)
 {
  struct kb_entry *k, *kmatch = NULL;
  char line[LINE + 1];
- int at = 0, submatch, ret, c;
+ int at = 0, ntries = 0, submatch, ret;
  const char *p;
 
  xbp = xbuf = buf; xend = buf + len;
@@ -318,11 +319,9 @@ x_emacs(char *buf, size_t len)
  x_last_command = NULL;
  while (1) {
  x_flush();
- if ((c = x_e_getc()) < 0)
+ if ((at = x_e_getu8(line, at)) < 0)
  return 0;
-
- line[at++] = c;
- line[at] = '\0';
+ ntries++;
 
  if (x_arg == -1) {
  x_arg = 1;
@@ -360,14 +359,18 @@ x_emacs(char *buf, size_t len)
  macro_args = kmatch->args;
  ret = KSTD;
  } else
- ret = kmatch->ftab->xf_func(c);
+ ret = kmatch->ftab->xf_func(line[at - 1]);
  } else {
  if (submatch)
  continue;
- if (at == 1)
- ret = x_insert(c);
- else
- ret = x_error(c); /* not matched meta sequence */
+ if (ntries > 1) {
+ ret = x_error(0); /* not matched meta sequence */
+ } else if (at > 1) {
+ x_ins(line);
+ ret = KSTD;
+ } else {
+ ret = x_insert(line[0]);
+ }
  }
 
  switch (ret) {
@@ -392,7 +395,7 @@ x_emacs(char *buf, size_t len)
  }
 
  /* reset meta sequence */
- at = 0;
+ at = ntries = 0;
  line[0] = '\0';
  if (x_arg_set)
  x_arg_set = 0; /* reset args next time around */
@@ -1891,6 +1894,49 @@ x_e_getc(void)
  c = x_getc();
 
  return c;
+}
+
+static int
+x_e_getu8(char *buf, int off)
+{
+ int c, len;
+
+again:
+ c = x_e_getc();
+ if (c == -1)
+ return -1;
+ buf[off++] = c;
+
+ switch (c & 0xF0) {
+ case 0xF0:
+ len = 4;
+ break;
+ case 0xE0:
+ len = 3;
+ break;
+ case 0xC0:
+ case 0xD0:
+ len = 2;
+ break;
+ default:
+ len = 1;
+ }
+
+ for (; len > 1; len--) {
+ c = x_e_getc();
+ if (c == -1)
+ return -1;
+ if (!isu8cont(c)) {
+ off = 0;
+ x_e_ungetc(c);
+ x_error(0);
+ goto again;
+ }
+ buf[off++] = c;
+ }
+ buf[off] = '\0';
+
+ return off;
 }
 
 static void

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
Hi Anton,

Anton Lindqvist wrote on Sun, Jun 04, 2017 at 11:09:35AM +0200:

> Although this discussion hasn't settled,

True.  I think nicm@ has convinced me that the shell *can* try to be
nicer towards terminals, without risking hangs if done very carefully.
Probably that's worth doing, it makes the system more robust.

The goal would be to only send invalid sequences to the terminal if
the user actually typed invalid input, but to not send invalid
intermediate states to the terminal if what the user types is
correct.

> here's a new diff trying to
> address the previously raised issues:
>
> - The new function x_e_getu8() tries to read a complete UTF-8 character.
>   When a continuation byte is expected but not received, it resets its
>   state and retries.
>
>   The fix to u8len() from Boudewijn Dijkstra is incorporated in this
>    new function, thanks!

No, no, no.  You are thanking him for really bad advice.
That function is still outrageously wrong.

> - In x_emacs(), In order to distinguish between when the line buffer
>   contains more than one character

I assume you mean "byte" here.

>   due to a UTF-8 character being read
>   or caused by calling x_e_getu8() repeatedly the number of calls to
>   x_e_getu8() is stored ntries. This is of importance to preserve the
>   existing behavior where none matched escape sequences should not be
>   inserted but instead ring the bell.

I didn't do full testing and review of your changes yet, but nothing
springs to my eye as being obviously wrong, except the UTF-8 parsing.

> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.67
> diff -u -p -r1.67 emacs.c
> --- emacs.c 12 May 2017 14:37:52 -0000 1.67
> +++ emacs.c 4 Jun 2017 09:06:54 -0000
[...]

> @@ -1891,6 +1894,49 @@ x_e_getc(void)
>   c = x_getc();
>  
>   return c;
> +}
> +
> +static int
> +x_e_getu8(char *buf, int off)
> +{
> + int c, len;
> +
> +again:
> + c = x_e_getc();
> + if (c == -1)
> + return -1;
> + buf[off++] = c;
> +
> + switch (c & 0xF0) {

That is very WRONG.  You cannot identify a UTF-8 start byte by only
looking at its first half.  At the very least, you must also make
sure that bit 5 is not set.

> + case 0xF0:
> + len = 4;
> + break;

That is very WRONG.  The bytes 0xf8 to 0xff must *not* result in
len = 4.  They are not UTF-8 start bytes, and not UTF-8 at all, so
they must result in len = 1.

Also, we do know that the last three bits must be 100.  If bit 6
was zero, the codepoint would have to be encoded in a three-byte
sequence instead.  If bit 7 or 8 would be set in addition to bit
6, that would take us beyond U+10FFFF (specifically, U+1C0000 to
U+1FFFFF, which are invalid).  So why not check bits 6 to 8 right
away, too?

> + case 0xE0:
> + len = 3;
> + break;

Hmmm.
Let me think whether we can already catch surrogates at this point.

Valid low three-byte:   U+0800  11100000 10100000 10000000  0xe0a080
                        U+D000  11101101 10000000 10000000  0xed8080
                        U+D7FF  11101101 10011111 10111111  0xed9fbf
Invalid surrogates:     U+D800  11101101 10100000 10000000  0xeda080
                        U+DFFF  11101101 10111111 10111111  0xedbfbf
Valid high three-byte:  U+E000  11101110 10000000 10000000  0xee8080
                        U+FFFF  11101111 10111111 10111111  0xefbfbf

So the start byte 0xed includes both the valid range U+D000 to U+D7FF
and the invalid surrogate range U+D800 to U+DFFF.  The decisive bit
is the third bit of the *second* byte, so surrogates cannot be
excluded when only the first bit is known.

> + case 0xC0:
> + case 0xD0:
> + len = 2;
> + break;

That is also (somewhat) wrong.  Arguably, it would be better
for 0xc0 and 0xc1 to result in len = 1, because we already know
that there is no valid continuation.

So, here is a better start byte parser (only partially tested):

        if (c == 0xf4)
                len = 4;
        else if ((c & 0xf0) == 0xe0)
                len = 3;
        else if ((c & 0xe0) == 0xc0 && c > 0xc1)
                len = 2;
        else
                len = 1;

> + default:
> + len = 1;
> + }
> +
> + for (; len > 1; len--) {
> + c = x_e_getc();

Better use a different variable, say cc, such that we retain the
start byte for subsequent validation.

> + if (c == -1)
> + return -1;
> + if (!isu8cont(c)) {
> + off = 0;
> + x_e_ungetc(c);
> + x_error(0);
> + goto again;

That looks wrong.  At "again:", the first thing you do is x_e_getc(),
undoing the effect of the (reasonable) x_e_ungetc() you just did.
That looks like an endless loop to me.

Also, off = 0 and x_error() seem wrong because that discards a byte.

Instead, wouldn't you have to return successfully, to process the
invalid byte just read?

> + }

At this point, we might consider validating the second byte.

Something like (only partially tested):

        if (isu8cont(cc) == 0 ||
            (c == 0xe0 && len == 3 && cc < 0xa0) ||
            (c == 0xed && len == 3 && cc & 0x20) ||
            (c == 0xf4 && len == 4 && cc & 0x30)) {
                x_e_ungetc(cc);
                break;
        }

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
In reply to this post by Nicholas Marriott-2
Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 11:46:52PM +0100:
> On Fri, May 19, 2017 at 10:23:08PM +0200, Ingo Schwarze wrote:

>> What matters most is that sending an incomplete character
>> followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
>> that it doesn't change the line being edited and that it doesn't
>> change the display.  All terminals you mentioned seem to conform
>> to that according to my testing, except tmux.

> They don't all do the same thing for me. I am doing this, the same as
> ksh does:
>
> printf 'a\010\342a\010\010\342\211a\010\010\342\211\240a\010\n'
>
> And the terminals behave differently.

Hmmm.  My testing must have been off.
xterm and urxvt behave the way ksh expects (printing the not equal
sign followed by the letter a), but you are right, gnome-terminal
and konsole differ.  For those two, "printf '\342\211\n'" results
in *two* replacement characters, and hence your test sequence
above ends up with one remaining replacement character before
the not equal sign.

So you are right, the shell should try to avoid sending invalid
intermediate states when it can, that will make the system more
robust.

Anton has a candidate diff for ksh/emacs.c now, but review and
testing are not finished yet.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Walter Alejandro Iglesias-3
In reply to this post by Anton Lindqvist
Just to applogize to developers here,

I'm still not skilled enough to make a proper patch or a clear bug
report (I'm on chapter 2 of K&R :-)).  I wish with time I'll learn how
to do it.  I came to the ksh utf8 discussion because I've been playing
with some mail mime encoder just to learn C and recognizing valid utf-8
was the first challenge I ecountered.

The code pasted below is what I got so far in recognizing valid utf-8.
I'm showing it to make my point, I realize it isn't easy; and from my
poor C I'm not able to figure out how you can do such test byte by byte
while the user is typing at command line.  (Don't bother in explaining
me how, I know this is not the place to take C lessons.)

By the way, something the last paragraph of the new utf8(7) man page
isn't clear enough (I mentioned this to tedu@).

Thanks to all of you for your work.  Now I know how hard it is.


#include <stdio.h>

#define ASCII 0x7f
#define YES 1
#define NO 0

int
main()
{
        int c, ch, wd, ln, col, isutf8;

        ch = wd = ln = col = 1;
        isutf8 = YES;

        while ((c = getchar()) != EOF) {
                if (c > ASCII) {
                        if ((ch == 1 && (c < 0xc2 || c > 0xf7)) ||
                                ((ch > 1 && c <= 4) &&
                                ch <= wd && (c < 0x80 || c > 0xbf)))
                                isutf8 = NO;
                /* 110..... */
                        else if (ch == 1 && c >= 0xc2 && c <= 0xdf) {
                                wd = 2;
                                ++ch;
                /* 1110.... */
                        } else if (ch == 1 && c >= 0xe0 && c <= 0xef) {
                                wd = 3;
                                ++ch;
                /* 11110... */
                        } else if (ch == 1 && c >= 0xf0 && c <= 0xf7) {
                                wd = 4;
                                ++ch;
                        } else if (ch > 1 && c <= 4 &&
                                ch == wd && c >= 0x80 && c <= 0xbf)
                                ch = 1;
                        else if (ch > 1 && c <= 4 &&
                                ch < wd && c >= 0x80 && c <= 0xbf)
                                ++ch;
                        else
                                ++ch;
                } else if (ch > 1 && ch <= 4 && ch <= wd)
                        isutf8= NO;
                else
                        ch = 1;

                if (isutf8 == NO) {
                        printf("Invalid utf-8 character");
                        printf(" at line %d col %d.\n", ln, col);
                        return 1;
                }
                if (c == '\n') {
                        col = 1;
                        ++ln;
                } else
                        ++col;
        }

        return 0;
}

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze
Hi Walter,

Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:

> I'm still not skilled enough to make a proper patch or a clear bug
> report (I'm on chapter 2 of K&R :-)).  I wish with time I'll learn how
> to do it.

IIRC, you said you saw some undesirable behaviour with ksh input.

I assume you have a sequence of key presses on your keyboard that
demonstrate the undesirable behaviour.  To capture the sequence,

 1. Type

    printf '

    (without hitting enter)

 2. Now type the sequence of characters, BUT
     - before any control character, press Ctrl-V
     - before any single quote, press a backslash

 3. Type

    ' > input.txt

    and hit enter.

You can look at the sequence with

  hexdump -C input.txt

There should be at least one byte for each key pressed, sometimes
more.  If some key presses do not show up or the order is mixed up,
you likely forgot the Ctrl-V before it.  In that case, retry.

If you want to save people who look at your report some work,
you can craft a printf(1) statement that generates the same
output as the above, but using only ASCII letters, and send
that instead of the output of hexdump -C input.txt.

Here is a simple example.
Test insertion of a non-ASCII character between two ASCII characters.

What i type on my keyboard:

  x y Ctrl-B e-accent-aigu

What i type to capture the input:

  p r i n t f blank ' x y Ctrl-V Ctrl-B e-accent-aigu ' > i n p u t . t x t

This can be used to create the same input file:

   $ printf 'xy\x02\xc3\xa9' > input2.txt
   $ cmp input.txt input2.txt                                    
   $ echo $?
  0

For testing, go to the regress directory:

   $ cd /usr/src/regress/bin/ksh
   $ cvs up -dP
   $ cd edit
   $ make obj
   $ make cleandir
   $ make regress
   $ ./obj/edit < input.txt | hexdump -C
  00000000  24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
  0000000a

Note that the above output is already an improved (uncommitted)
version containing a private patch.

Here is what the -current ksh does:

   $ PATH=/obin ./obj/edit < input.txt | hexdump -C
  00000000  24 20 78 79 08 c3 79 08  08 c3 a9 79 08 0a   |$ xy..y....y..|
  0000000e

Also describe in words why you are typing that exact sequence,
what you would like to happen, and how the actual events
differ from that.

Should be feasible, right?


> I came to the ksh utf8 discussion because I've been playing
> with some mail mime encoder just to learn C and recognizing
> valid utf-8 was the first challenge I ecountered.

In a normal non-threaded application program, you should probably
use mbtowc(3) rather than coding your own UTF-8 parser.

> The code pasted below is what I got so far in recognizing valid utf-8.

That code looks confusing.  I'm not studying it in detail right now
because we already have at least two working UTF-8 decoders in the
tree.

One is /usr/src/lib/libc/citrus/citrus_utf8.c,
function _citrus_utf8_ctype_mbrtowc().  It is relatively long,
but quite easy to understand.

Another one is /usr/src/usr.bin/mandoc/preconv.c,
function preconv_encode().  It is substantially shorter,
but admittedly harder to understand.

This is OpenBSD.  Read the fantastic source to learn something.

> I'm showing it to make my point, I realize it isn't easy; and from my
> poor C I'm not able to figure out how you can do such test byte by byte
> while the user is typing at command line.  (Don't bother in explaining
> me how, I know this is not the place to take C lessons.)

Well, it seems likely the something doing just that will be committed
to ksh/emacs.c during the next week or so - not a full parser, but
something showing the incremental nature.  Then you can inspect it.

If you report actual bugs or propose fixes, then developers are often
willing to help with C issues related to it as well, even though
teaching C is indeed not the point of this list.

> By the way, something the last paragraph of the new utf8(7) man page
> isn't clear enough (I mentioned this to tedu@).

Which paragraph exactly, and what is unclear?  Maybe we can fix it
quickly.

> #define YES 1
> #define NO 0

Either just use 1 and 0 in the code or use <stdbool.h>.

[...]
> }

Don't forget checking ferror(3) after falling out of the main loop.

> return 0;
> }

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Walter Alejandro Iglesias-3
On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote:

> Hi Walter,
>
> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:
>
> > report (I'm on chapter 2 of K&R :-)).  I wish with time I'll learn how
> > to do it.
>
> IIRC, you said you saw some undesirable behaviour with ksh input.
>
> I assume you have a sequence of key presses on your keyboard that
> demonstrate the undesirable behaviour.  To capture the sequence,
>

I will *study* all the indications you gave me.  But this time
I don't think you need a capture of the sequence.  Just use *any*
latin-1 character whose hex value is smaller than \xc0.

To facilitate you the test, in xterm after setting "setxkbmap de":

  AltGr + Shift + 1

prints me the opening exclamation mark (\xa1) we also use in Spanish.
In console or a C xterm, type that merged among random ascii characters,
then move the cursor from the first to the last column passing over that
character.  Assuming you're running current, see what happens.


Anyway, to be honest, these bugs don't hurt, you can live with them.
What I'm trying to say with these reports is I'm not truly convinced
utf8 support in console is a good idea.

Another test you can do, this time in a utf-8 xterm: if you activate the
bell and go with the cursor to the end of the line it'll beep.  Now type
some utf-8 character at the end and do the same, it won't beep, because
the cursor is in the first byte of the utf-8 character, *it can't reach
the real end of the line*.  Nobody will die because this issue or the
other above.  My point is utf8 will always be a mess.  KEN, DO YOU HEAR
ME?, IT WAS YOUR OWN CHILD, KEN! :-)

I wonder how plan9 handle utf8.


[...]

>
>
> For testing, go to the regress directory:
>
>    $ cd /usr/src/regress/bin/ksh
>    $ cvs up -dP
>    $ cd edit
>    $ make obj
>    $ make cleandir
>    $ make regress
>    $ ./obj/edit < input.txt | hexdump -C
>   00000000  24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
>   0000000a


I've been wondering how to work with this.  Thanks!


[...]

> > By the way, something the last paragraph of the new utf8(7) man page
> > isn't clear enough (I mentioned this to tedu@).
>
> Which paragraph exactly, and what is unclear?  Maybe we can fix it
> quickly.

As I told you, the _last_ one:

   Encodings using more bytes than required are invalid.  In particular,
   11000000 and 11000001 are not valid start bytes, the byte after
   11100000 must be at least 10100000, and the byte after 11110000 must
   be at least 10010000.

I don't understand the 'at least' assumptions.  Some examples in which
the byte after 1110.... is *smaller* than 1010....:

    Euro sign:
    11100010 10000010 10101100

    Em dash:
    11100010 10000000 10010100

    Double quotes:
    11100010 10000000 10011100
    11100010 10000000 10011101

You can find examples where the byte after 1110.... is *grater* than
1010.... here:

http://www.utf8-chartable.de/



Thank you for your advices I'll study your whole message carefuly.


>
> Yours,
>   Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Kurt H Maier
On Mon, Jun 05, 2017 at 09:21:31PM +0200, Walter Alejandro Iglesias wrote:
>
> I wonder how plan9 handle utf8.
>

In general, by getting rid of TTYs and character-addressed interfaces
almost entirely.  Probably not the best fit for OpenBSD.

khm

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): don't output invalid UTF-8 characters

Walter Alejandro Iglesias-3
In reply to this post by Walter Alejandro Iglesias-3
In article <[hidden email]> you wrote:
>
>    Encodings using more bytes than required are invalid.  In particular,
>    11000000 and 11000001 are not valid start bytes, the byte after
>    11100000 must be at least 10100000, and the byte after 11110000 must
>    be at least 10010000.
>

Someone off list explained me this is true just for the exact 11100000
byte.  I'd assumed the man page was referring to any 1110.... sequence.
Now I understand.

12