ksh(1): vi mode UTF-8 bug

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

ksh(1): vi mode UTF-8 bug

Anton Lindqvist
Hi,
Another UTF-8 related bug reported by tb@. How to re-produce:

1. Enable vi mode:

   $ set -o vi

2. Input the following characters: öa

3. Press escape and then x twice.

4. An invalid UTF-8 character is displayed.

Similar to one of my previous diffs, looks like the column counter is
wrong. The diff below fixes the problem and includes a regression test.
I'm not running vi mode myself so further testing would be appreciated.

Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
retrieving revision 1.44
diff -u -p -r1.44 vi.c
--- bin/ksh/vi.c 17 Oct 2016 18:39:43 -0000 1.44
+++ bin/ksh/vi.c 19 May 2017 10:47:44 -0000
@@ -1155,7 +1155,9 @@ vi_cmd(int argcnt, const char *cmd)
  break;
  }
  if (insert == 0 && es->cursor != 0 && es->cursor >= es->linelen)
- es->cursor--;
+ while (es->cursor > 0)
+ if (!isu8cont(es->cbuf[--es->cursor]))
+ break;
  }
  return 0;
 }
Index: regress/bin/ksh/vi/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
retrieving revision 1.1
diff -u -p -r1.1 vi.sh
--- regress/bin/ksh/vi/vi.sh 12 Jan 2016 09:00:39 -0000 1.1
+++ regress/bin/ksh/vi/vi.sh 19 May 2017 10:47:44 -0000
@@ -168,6 +168,7 @@ testseq "abcd\00332X" " $ abcd\b\b\bd  \
 # x: Delete character.
 # |: Move to column.
 testseq "abcd\00332|2x" " $ abcd\b\b\bd  \b\b\b\r\nad"
+testseq "\0303\0266a\0033xx" " $ \0303\0266a\b \b\b  \b\b\r"
 
 # Y: Yank to end of line.
 testseq "abcd\0033hYp" " $ abcd\b\bccdd\b\b\r\nabccdd"

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Ingo Schwarze
Hi Anton,

Anton Lindqvist wrote on Fri, May 19, 2017 at 02:11:37PM +0200:

> Hi,
> Another UTF-8 related bug reported by tb@. How to re-produce:
>
> 1. Enable vi mode:
>
>    $ set -o vi
>
> 2. Input the following characters: öa
>
> 3. Press escape and then x twice.
>
> 4. An invalid UTF-8 character is displayed.
>
> Similar to one of my previous diffs, looks like the column counter is
> wrong. The diff below fixes the problem and includes a regression test.
> I'm not running vi mode myself so further testing would be appreciated.

Yes, that is correct, from both testing and code inspection.

I'd prefer to also delete the condition that becomes obsolete,
see below.

OK?
  Ingo


Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
retrieving revision 1.44
diff -u -p -r1.44 vi.c
--- vi.c 17 Oct 2016 18:39:43 -0000 1.44
+++ vi.c 19 May 2017 14:15:27 -0000
@@ -1154,8 +1154,10 @@ vi_cmd(int argcnt, const char *cmd)
  expand_word(1);
  break;
  }
- if (insert == 0 && es->cursor != 0 && es->cursor >= es->linelen)
- es->cursor--;
+ if (insert == 0 && es->cursor >= es->linelen)
+ while (es->cursor > 0)
+ if (!isu8cont(es->cbuf[--es->cursor]))
+ break;
  }
  return 0;
 }

> Index: regress/bin/ksh/vi/vi.sh
> ===================================================================
> RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
> retrieving revision 1.1
> diff -u -p -r1.1 vi.sh
> --- regress/bin/ksh/vi/vi.sh 12 Jan 2016 09:00:39 -0000 1.1
> +++ regress/bin/ksh/vi/vi.sh 19 May 2017 10:47:44 -0000
> @@ -168,6 +168,7 @@ testseq "abcd\00332X" " $ abcd\b\b\bd  \
>  # x: Delete character.
>  # |: Move to column.
>  testseq "abcd\00332|2x" " $ abcd\b\b\bd  \b\b\b\r\nad"
> +testseq "\0303\0266a\0033xx" " $ \0303\0266a\b \b\b  \b\b\r"
>  
>  # Y: Yank to end of line.
>  testseq "abcd\0033hYp" " $ abcd\b\bccdd\b\b\r\nabccdd"
>

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Theo Buehler
On Fri, May 19, 2017 at 04:19:57PM +0200, Ingo Schwarze wrote:

> Hi Anton,
>
> Anton Lindqvist wrote on Fri, May 19, 2017 at 02:11:37PM +0200:
>
> > Hi,
> > Another UTF-8 related bug reported by tb@. How to re-produce:
> >
> > 1. Enable vi mode:
> >
> >    $ set -o vi
> >
> > 2. Input the following characters: öa
> >
> > 3. Press escape and then x twice.
> >
> > 4. An invalid UTF-8 character is displayed.
> >
> > Similar to one of my previous diffs, looks like the column counter is
> > wrong. The diff below fixes the problem and includes a regression test.
> > I'm not running vi mode myself so further testing would be appreciated.
>
> Yes, that is correct, from both testing and code inspection.
>
> I'd prefer to also delete the condition that becomes obsolete,
> see below.
>
> OK?

What's the status of this diff? I sent my ok to Ingo, but it seems to
have been forgotten. Ingo or Anton, are you going to commit it?

>   Ingo
>
>
> Index: vi.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/vi.c,v
> retrieving revision 1.44
> diff -u -p -r1.44 vi.c
> --- vi.c 17 Oct 2016 18:39:43 -0000 1.44
> +++ vi.c 19 May 2017 14:15:27 -0000
> @@ -1154,8 +1154,10 @@ vi_cmd(int argcnt, const char *cmd)
>   expand_word(1);
>   break;
>   }
> - if (insert == 0 && es->cursor != 0 && es->cursor >= es->linelen)
> - es->cursor--;
> + if (insert == 0 && es->cursor >= es->linelen)
> + while (es->cursor > 0)
> + if (!isu8cont(es->cbuf[--es->cursor]))
> + break;
>   }
>   return 0;
>  }
>
> > Index: regress/bin/ksh/vi/vi.sh
> > ===================================================================
> > RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
> > retrieving revision 1.1
> > diff -u -p -r1.1 vi.sh
> > --- regress/bin/ksh/vi/vi.sh 12 Jan 2016 09:00:39 -0000 1.1
> > +++ regress/bin/ksh/vi/vi.sh 19 May 2017 10:47:44 -0000
> > @@ -168,6 +168,7 @@ testseq "abcd\00332X" " $ abcd\b\b\bd  \
> >  # x: Delete character.
> >  # |: Move to column.
> >  testseq "abcd\00332|2x" " $ abcd\b\b\bd  \b\b\b\r\nad"
> > +testseq "\0303\0266a\0033xx" " $ \0303\0266a\b \b\b  \b\b\r"
> >  
> >  # Y: Yank to end of line.
> >  testseq "abcd\0033hYp" " $ abcd\b\bccdd\b\b\r\nabccdd"
> >

Reply | Threaded
Open this post in threaded view
|

[wai@roquesor.com: Re: ksh(1): vi mode UTF-8 bug]

Walter Alejandro Iglesias-3
In reply to this post by Anton Lindqvist

I forgot to Cc this here.


----- Forwarded message from Walter Alejandro Iglesias <[hidden email]> -----

From: Walter Alejandro Iglesias <[hidden email]>
To: Anton Lindqvist <[hidden email]>
Subject: Re: ksh(1): vi mode UTF-8 bug
In-Reply-To: <[hidden email]>
X-Newsgroups: gmane.os.openbsd.tech
User-Agent: tin/2.4.1-20161224 ("Daill") (UNIX) (OpenBSD/6.1 (amd64))
Status: RO
Content-Length: 703
Lines: 25

Hi Anton,

In article <[hidden email]> you wrote:

> Hi,
> Another UTF-8 related bug reported by tb@. How to re-produce:
>
> 1. Enable vi mode:
>
>    $ set -o vi
>
> 2. Input the following characters: ??a
>
> 3. Press escape and then x twice.
>
> 4. An invalid UTF-8 character is displayed.
>
> Similar to one of my previous diffs, looks like the column counter is
> wrong. The diff below fixes the problem and includes a regression test.
> I'm not running vi mode myself so further testing would be appreciated.
>

There is still a similar issue when you try to "replace" a utf-8
character (in command mode press 'r' to replace a single character or
'R' to replace a string).


----- End forwarded message -----

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Anton Lindqvist
Hi Walter,

On Sun, May 28, 2017 at 10:56:19AM +0200, Walter Alejandro Iglesias wrote:

>
> I forgot to Cc this here.
>
>
> ----- Forwarded message from Walter Alejandro Iglesias <[hidden email]> -----
>
> From: Walter Alejandro Iglesias <[hidden email]>
> To: Anton Lindqvist <[hidden email]>
> Subject: Re: ksh(1): vi mode UTF-8 bug
> In-Reply-To: <[hidden email]>
> X-Newsgroups: gmane.os.openbsd.tech
> User-Agent: tin/2.4.1-20161224 ("Daill") (UNIX) (OpenBSD/6.1 (amd64))
> Status: RO
> Content-Length: 703
> Lines: 25
>
> Hi Anton,
>
> In article <[hidden email]> you wrote:
> > Hi,
> > Another UTF-8 related bug reported by tb@. How to re-produce:
> >
> > 1. Enable vi mode:
> >
> >    $ set -o vi
> >
> > 2. Input the following characters: ??a
> >
> > 3. Press escape and then x twice.
> >
> > 4. An invalid UTF-8 character is displayed.
> >
> > Similar to one of my previous diffs, looks like the column counter is
> > wrong. The diff below fixes the problem and includes a regression test.
> > I'm not running vi mode myself so further testing would be appreciated.
> >
>
> There is still a similar issue when you try to "replace" a utf-8
> character (in command mode press 'r' to replace a single character or
> 'R' to replace a string).

Thanks for the report, please try out the diff below.
As I understand the problem: the current code assumes that the character
to replace consists of a single byte, which is not true for Unicode
characters. When replacing such a character, delete the continuation
bytes and then replace the start byte with the replacement.
This ensures no continuation bytes are left behind.
I made use of putbuf() since it has the side-effect of advancing the
cursor.
Lastly, adjust the cursor to be positioned on the last replaced
character.

NUL-terminating the line buffer is necessary in order for the following
to work:

1. Insert ö

2. Press esc, h (back one char), ro (replace with o), ax (append x)

Note that replacing a character with a Unicode character does not work
either.

Comments? OK?

Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
retrieving revision 1.45
diff -u -p -r1.45 vi.c
--- bin/ksh/vi.c 28 May 2017 07:27:01 -0000 1.45
+++ bin/ksh/vi.c 28 May 2017 15:59:59 -0000
@@ -926,13 +926,22 @@ vi_cmd(int argcnt, const char *cmd)
  if (cmd[1] == 0)
  vi_error();
  else {
- int n;
-
  if (es->cursor + argcnt > es->linelen)
  return -1;
- for (n = 0; n < argcnt; ++n)
- es->cbuf[es->cursor + n] = cmd[1];
- es->cursor += n - 1;
+
+ while (argcnt-- > 0) {
+ for (cur = es->cursor + 1;
+    cur < es->linelen; cur++)
+ if (!isu8cont(es->cbuf[cur]))
+ break;
+ if (cur > 1)
+ del_range(es->cursor, cur - 1);
+ putbuf(&cmd[1], 1, 1);
+ }
+ while (es->cursor > 0)
+ if (!isu8cont(es->cbuf[--es->cursor]))
+ break;
+ es->cbuf[es->linelen] = '\0';
  }
  break;
 
Index: regress/bin/ksh/vi/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
retrieving revision 1.2
diff -u -p -r1.2 vi.sh
--- regress/bin/ksh/vi/vi.sh 28 May 2017 07:27:01 -0000 1.2
+++ regress/bin/ksh/vi/vi.sh 28 May 2017 15:59:59 -0000
@@ -134,6 +134,7 @@ testseq "abcdef\00334h2Rxy\0033" " $ abc
 
 # r: Replace character.
 testseq "abcd\00332h2rxiy" " $ abcd\b\b\bxx\byxd\b\b\r\naxyxd"
+testseq "\0303\0266\0033ro" " $ \0303\0266\bo \b\b\r\no"
 
 # S: Substitute whole line.
 testseq "oldst\0033Snew" " $ oldst\b\b\b\b\b     \r $ new\r\nnew"

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Walter Alejandro Iglesias-3
In article <[hidden email]> you wrote:
> Hi Walter,
>
> Thanks for the report, please try out the diff below.
>

The diff works as you explained.  It replaces correctly utf-8 with
ascii chars.


Thank you.


Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Ingo Schwarze
In reply to this post by Anton Lindqvist
Hi,

Anton Lindqvist wrote on Sun, May 28, 2017 at 06:07:00PM +0200:
> On Sun, May 28, 2017 at 10:56:19AM +0200, Walter Alejandro Iglesias wrote:

>> There is still a similar issue when you try to "replace" a utf-8
>> character (in command mode press 'r' to replace a single character or
>> 'R' to replace a string).

> Thanks for the report, please try out the diff below.
> As I understand the problem: the current code assumes that the character
> to replace consists of a single byte, which is not true for Unicode
> characters.

Correct.  That needs to be improved.

> When replacing such a character, delete the continuation
> bytes and then replace the start byte with the replacement.
> This ensures no continuation bytes are left behind.
> I made use of putbuf() since it has the side-effect of advancing the
> cursor.
> Lastly, adjust the cursor to be positioned on the last replaced
> character.
>
> NUL-terminating the line buffer is necessary in order for the following
> to work:
>
> 1. Insert ö
>
> 2. Press esc, h (back one char), ro (replace with o), ax (append x)
>
> Note that replacing a character with a Unicode character does not work
> either.
>
> Comments? OK?
>
> Index: bin/ksh/vi.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/vi.c,v
> retrieving revision 1.45
> diff -u -p -r1.45 vi.c
> --- bin/ksh/vi.c 28 May 2017 07:27:01 -0000 1.45
> +++ bin/ksh/vi.c 28 May 2017 15:59:59 -0000
> @@ -926,13 +926,22 @@ vi_cmd(int argcnt, const char *cmd)
>   if (cmd[1] == 0)
>   vi_error();
>   else {
> - int n;
> -
>   if (es->cursor + argcnt > es->linelen)
>   return -1;

These two lines are no longer accurate.  They try to make sure there
are enough characters under and to the right of the cursor to match
the number you want to replace (for example, with "2r"), and beep
otherwise - but they count bytes, which is wrong.

To catch the error condition of an excessive argument, i think you
first need to iterate to the right, using the c1 variable and isu8cont(),
and return -1 if you hit the end prematurely.  Do not change anything
in that case.

If so far, you succeed, you know you have to replace the range
[es->cursor, c1].

> - for (n = 0; n < argcnt; ++n)
> - es->cbuf[es->cursor + n] = cmd[1];
> - es->cursor += n - 1;
> +
> + while (argcnt-- > 0) {
> + for (cur = es->cursor + 1;
> +    cur < es->linelen; cur++)
> + if (!isu8cont(es->cbuf[cur]))
> + break;
> + if (cur > 1)
> + del_range(es->cursor, cur - 1);

Given that you don't know the length (in bytes) of the character
to insert yet, i think it may be simpler to delete the byte under the
cursor as well, even though that is slightly inefficient for the ASCII
case.

> + putbuf(&cmd[1], 1, 1);

It seems that here, you may need to measure the length of the character
to insert in bytes and then call something like

  putbuf(cmd + 1, #bytes, 0);


My impression is that the 's' command is likely also affected, but that
can be fixed in a separate patch.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Ingo Schwarze
Hi,

ooops, correcting myself...

Ingo Schwarze wrote on Mon, May 29, 2017 at 04:16:06PM +0200:

> It seems that here, you may need to measure the length of the character
> to insert in bytes and then call something like
>
>   putbuf(cmd + 1, #bytes, 0);

It isn't that simple.  At this point, only the first byte of the
character has been read, so we don't even know the number of bytes
yet, and the subsequent ones are *not* available in cmd+2.

Implementing "r<non-ASCII>" looks like a major undertaking, in particular
since "r" uses the same state machine that is also used for "@", "F", "T",
"f", "t", and none of those have support for dealing with more than
one byte yet.

So handling multi-byte "r" should probably be treated as a separate
issue.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Walter Alejandro Iglesias-3
On Mon, May 29, 2017 at 05:59:34PM +0200, Ingo Schwarze wrote:
> So handling multi-byte "r" should probably be treated as a separate
> issue.
>

I'm just a beginner with C, what I'm about to say is purely intuitive.

As far as I can understand you're trying to adapt the code that works
with ascii to handle utf-8, what requires to *guess* how to deal with
the next character the user will type at any time.

Are those wide char versions of C functions consistent enough to write a
separate implementation to be loaded when LC_TYPE is set to utf-8?

If I'm telling nonsense just ignore me. :)

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Ingo Schwarze
Hi Walter,

Walter Alejandro Iglesias wrote on Mon, May 29, 2017 at 06:44:40PM +0200:

> Are those wide char versions of C functions consistent enough to write
> a separate implementation to be loaded when LC_TYPE is set to utf-8?

Sure, you can rewrite the complete shell to use wchar_t * rather
than char *, and if you do that, you can use the new code to handle
ASCII as well, no need to have two copies.  But that would be a
huge effort, even more error-prone than the small, careful adjustments
we are doing now, and would have a number of additional downsides;
among others, losing the ability to handle arbitrary bytes, while
in UTF-8 mode.

For an editor, going wchar_t might be better because having substantial
amounts of UTF-8 in user input is a common case in some files that
people edit.

For a shell, editing strings that contain non-ASCII is not the main
purpose.  Sure, it is nice if the command line is able to handle
strings containing an occasional UTF-8 character.  But the main
purpose of the shell remains to safely input and execute Unix-style
command lines, where non-ASCII characters are a non-essential addition
at best.

Yours,
  Ingo


For more details, see
https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Walter Alejandro Iglesias-3
On Mon, May 29, 2017 at 07:28:37PM +0200, Ingo Schwarze wrote:

> Hi Walter,
>
> Walter Alejandro Iglesias wrote on Mon, May 29, 2017 at 06:44:40PM +0200:
>
> > Are those wide char versions of C functions consistent enough to write
> > a separate implementation to be loaded when LC_TYPE is set to utf-8?
>
> Sure, you can rewrite the complete shell to use wchar_t * rather
> than char *, and if you do that, you can use the new code to handle
> ASCII as well, no need to have two copies.  But that would be a
> huge effort, even more error-prone than the small, careful adjustments
> we are doing now, and would have a number of additional downsides;
> among others, losing the ability to handle arbitrary bytes, while
> in UTF-8 mode.
>
> For an editor, going wchar_t might be better because having substantial
> amounts of UTF-8 in user input is a common case in some files that
> people edit.
>
> For a shell, editing strings that contain non-ASCII is not the main
> purpose.  Sure, it is nice if the command line is able to handle
> strings containing an occasional UTF-8 character.  But the main
> purpose of the shell remains to safely input and execute Unix-style
> command lines, where non-ASCII characters are a non-essential addition
> at best.

I totally agree with you and that's exactly why I value you're
preserving the ascii version, not only ksh, even the editor, I mostly
use vi and have nvi from packages at hand just for when I want to send
mail to family or edit my web site.

Thanks for your kind explanation.


>
> Yours,
>   Ingo
>
>
> For more details, see
> https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Theo de Raadt-2
In reply to this post by Ingo Schwarze
> Sure, you can rewrite the complete shell to use wchar_t * rather
> than char *, and if you do that, you can use the new code to handle
> ASCII as well, no need to have two copies.  But that would be a
> huge effort, even more error-prone than the small, careful adjustments
> we are doing now, and would have a number of additional downsides;
> among others, losing the ability to handle arbitrary bytes, while
> in UTF-8 mode.

yes, that's the situation.

Code which only uses "char" is less error prone, since there is no
smaller C type.  As a result a whole class of errors occur less often.
Move to a multi-char or larger integer type, and down-conversion
errors silently and subtly show up, causing damage.  It could be managed.
It isn't easy.

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Anton Lindqvist
In reply to this post by Ingo Schwarze
Hi,

On Mon, May 29, 2017 at 04:16:06PM +0200, Ingo Schwarze wrote:

> Hi,
>
> Anton Lindqvist wrote on Sun, May 28, 2017 at 06:07:00PM +0200:
> > On Sun, May 28, 2017 at 10:56:19AM +0200, Walter Alejandro Iglesias wrote:
>
> >> There is still a similar issue when you try to "replace" a utf-8
> >> character (in command mode press 'r' to replace a single character or
> >> 'R' to replace a string).
>
> > Thanks for the report, please try out the diff below.
> > As I understand the problem: the current code assumes that the character
> > to replace consists of a single byte, which is not true for Unicode
> > characters.
>
> Correct.  That needs to be improved.
>
> > When replacing such a character, delete the continuation
> > bytes and then replace the start byte with the replacement.
> > This ensures no continuation bytes are left behind.
> > I made use of putbuf() since it has the side-effect of advancing the
> > cursor.
> > Lastly, adjust the cursor to be positioned on the last replaced
> > character.
> >
> > NUL-terminating the line buffer is necessary in order for the following
> > to work:
> >
> > 1. Insert ö
> >
> > 2. Press esc, h (back one char), ro (replace with o), ax (append x)
> >
> > Note that replacing a character with a Unicode character does not work
> > either.
> >
> > Comments? OK?
> >
> > Index: bin/ksh/vi.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/vi.c,v
> > retrieving revision 1.45
> > diff -u -p -r1.45 vi.c
> > --- bin/ksh/vi.c 28 May 2017 07:27:01 -0000 1.45
> > +++ bin/ksh/vi.c 28 May 2017 15:59:59 -0000
> > @@ -926,13 +926,22 @@ vi_cmd(int argcnt, const char *cmd)
> >   if (cmd[1] == 0)
> >   vi_error();
> >   else {
> > - int n;
> > -
> >   if (es->cursor + argcnt > es->linelen)
> >   return -1;
>
> These two lines are no longer accurate.  They try to make sure there
> are enough characters under and to the right of the cursor to match
> the number you want to replace (for example, with "2r"), and beep
> otherwise - but they count bytes, which is wrong.

Correct, replacing 'ö' with 2ro is currently valid which is wrong. This
is fixed in the diff below and I added a test capturing this behavior.

>
> To catch the error condition of an excessive argument, i think you
> first need to iterate to the right, using the c1 variable and isu8cont(),
> and return -1 if you hit the end prematurely.  Do not change anything
> in that case.
>
> If so far, you succeed, you know you have to replace the range
> [es->cursor, c1].

Thanks for the pointers. I made use of c1 to count the number of
characters and cur denotes the upper limit for the range to replace
expressed in bytes. I've also added another test replacing the Euro sign
which consists of 3 bytes.

> > - for (n = 0; n < argcnt; ++n)
> > - es->cbuf[es->cursor + n] = cmd[1];
> > - es->cursor += n - 1;
> > +
> > + while (argcnt-- > 0) {
> > + for (cur = es->cursor + 1;
> > +    cur < es->linelen; cur++)
> > + if (!isu8cont(es->cbuf[cur]))
> > + break;
> > + if (cur > 1)
> > + del_range(es->cursor, cur - 1);
>
> Given that you don't know the length (in bytes) of the character
> to insert yet, i think it may be simpler to delete the byte under the
> cursor as well, even though that is slightly inefficient for the ASCII
> case.
>
> > + putbuf(&cmd[1], 1, 1);
>
> It seems that here, you may need to measure the length of the character
> to insert in bytes and then call something like
>
>   putbuf(cmd + 1, #bytes, 0);
>
>
> My impression is that the 's' command is likely also affected, but that
> can be fixed in a separate patch.

As for replacement using a UTF-8 character, I made the same conclusion
as stated in your follow-up email. Since we don't know the length of the
character, we can't do the right thing right now.

Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
retrieving revision 1.45
diff -u -p -r1.45 vi.c
--- bin/ksh/vi.c 28 May 2017 07:27:01 -0000 1.45
+++ bin/ksh/vi.c 30 May 2017 07:54:04 -0000
@@ -926,13 +926,24 @@ vi_cmd(int argcnt, const char *cmd)
  if (cmd[1] == 0)
  vi_error();
  else {
- int n;
-
- if (es->cursor + argcnt > es->linelen)
+ c1 = 0;
+ for (cur = es->cursor;
+    cur < es->linelen; cur++) {
+ if (!isu8cont(es->cbuf[cur]))
+ c1++;
+ if (c1 > argcnt)
+ break;
+ }
+ if (argcnt > c1)
  return -1;
- for (n = 0; n < argcnt; ++n)
- es->cbuf[es->cursor + n] = cmd[1];
- es->cursor += n - 1;
+
+ del_range(es->cursor, cur);
+ while (argcnt-- > 0)
+ putbuf(&cmd[1], 1, 0);
+ while (es->cursor > 0)
+ if (!isu8cont(es->cbuf[--es->cursor]))
+ break;
+ es->cbuf[es->linelen] = '\0';
  }
  break;
 
Index: regress/bin/ksh/vi/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
retrieving revision 1.2
diff -u -p -r1.2 vi.sh
--- regress/bin/ksh/vi/vi.sh 28 May 2017 07:27:01 -0000 1.2
+++ regress/bin/ksh/vi/vi.sh 30 May 2017 07:54:04 -0000
@@ -134,6 +134,9 @@ testseq "abcdef\00334h2Rxy\0033" " $ abc
 
 # r: Replace character.
 testseq "abcd\00332h2rxiy" " $ abcd\b\b\bxx\byxd\b\b\r\naxyxd"
+testseq "\0303\0266\0033ro" " $ \0303\0266\bo \b\b\r\no"
+testseq "\0342\0202\0254\0033ro" " $ \0342\0202\0254\bo  \b\b\b\r\no"
+testseq "\0303\0266\00332ro" " $ \0303\0266\b\a\r\n\0303\0266"
 
 # S: Substitute whole line.
 testseq "oldst\0033Snew" " $ oldst\b\b\b\b\b     \r $ new\r\nnew"

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

TAKAHASHI Tamotsu
In reply to this post by Ingo Schwarze
Hi Ingo,

On Tue, May 30, 2017 at 2:28 AM, Ingo Schwarze <[hidden email]> wrote:

> Walter Alejandro Iglesias wrote on Mon, May 29, 2017 at 06:44:40PM +0200:
>
>> Are those wide char versions of C functions consistent enough to write
>> a separate implementation to be loaded when LC_TYPE is set to utf-8?
>
> Sure, you can rewrite the complete shell to use wchar_t * rather
> than char *, and if you do that, you can use the new code to handle
> ASCII as well, no need to have two copies.  But that would be a
> huge effort, even more error-prone than the small, careful adjustments
> we are doing now, and would have a number of additional downsides;
> among others, losing the ability to handle arbitrary bytes, while
> in UTF-8 mode.

You are right.

With wchar_t, you could avoid isu8cont() and other dirty hacks,
but you might face much more troubles.
I have a joke diff for ksh: https://pastebin.com/HC48zS1q
In fact, I didn't figure out why my diff didn't show what I typed,
while Ctrl-L worked. ;)

Multibyte is simply crazy if you have to calculate the width.
I don't want /bin/sh to be a complete multibyte shell.
When I need a multilingual shell, I use other shells like zsh.
The current implementation of ksh has a perfect balance
between complexity and usefulness.

Thanks,
tamo

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Anton Lindqvist
In reply to this post by Anton Lindqvist
On Tue, May 30, 2017 at 04:16:15PM +0200, Theo Buehler wrote:
> On Tue, May 30, 2017 at 03:39:50PM +0200, Ingo Schwarze wrote:
> > Hi Anton,
> >
> > OK schwarze@ for the diff below (which, i think, is sufficient for
> > commit in this case).
>
> I agree. This is good for commit. ok tb

Thanks, committed.

Reply | Threaded
Open this post in threaded view
|

[wai@roquesor.com: Re: ksh(1): vi mode UTF-8 bug]

Walter Alejandro Iglesias-3
In reply to this post by Ingo Schwarze

I realized the issue I describe in the message below (sent to Ingo in
private) happens in the tty console (no X11) indifferently of what you
set in LC_CTYPE.  The problem comes when you have a non english
keyboard; you can easily type by accident some non ascii character
smaller than 0xc0, then your line is lost.  So, I think it *is* a bug.

Vi editing mode users can workaround the problem using vi-show8.

As far as I can understand there isn't an easy solid way to know if a
non ascii character is utf-8, so (with all respect others work deserve
and please correct me if I'm wrong) each time you fix some utf8 input
issue there is a big chance you're unfixing the non-utf8 non-ascii
input. :-)  To preserve a safe ascii implementation you should consider
to keep utf8 hacks aside.  You could do the same you do with nvi,
diverting the effort developers are now putting in this hacks in
implementing some utf-8 version of ksh as a package for those who think
they need utf-8 support in console (don't count me among them).


----- Forwarded message from Walter Alejandro Iglesias <[hidden email]> -----

Date: Fri, 2 Jun 2017 14:47:55 +0200
From: Walter Alejandro Iglesias <[hidden email]>
To: Ingo Schwarze <[hidden email]>
Subject: Re: ksh(1): vi mode UTF-8 bug
User-Agent: Mutt/1.8.2hg (2017-04-18)

Hi Ingo,

On Mon, May 29, 2017 at 07:28:37PM +0200, Ingo Schwarze wrote:

> Hi Walter,
>
> Walter Alejandro Iglesias wrote on Mon, May 29, 2017 at 06:44:40PM +0200:
>
> > Are those wide char versions of C functions consistent enough to write
> > a separate implementation to be loaded when LC_TYPE is set to utf-8?
>
> Sure, you can rewrite the complete shell to use wchar_t * rather
> than char *, and if you do that, you can use the new code to handle
> ASCII as well, no need to have two copies.  But that would be a
> huge effort, even more error-prone than the small, careful adjustments
> we are doing now, and would have a number of additional downsides;
> among others, losing the ability to handle arbitrary bytes, while
> in UTF-8 mode.
>
> For an editor, going wchar_t might be better because having substantial
> amounts of UTF-8 in user input is a common case in some files that
> people edit.
>
> For a shell, editing strings that contain non-ASCII is not the main
> purpose.  Sure, it is nice if the command line is able to handle
> strings containing an occasional UTF-8 character.  But the main
> purpose of the shell remains to safely input and execute Unix-style
> command lines, where non-ASCII characters are a non-essential addition
> at best.
>
> Yours,
>   Ingo
>
>
> For more details, see
> https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf


There is an issue I ignore since when it is present (regression?).  I
suppose it's caused by the way you test if non ascii characteres are
utf8.  It happens with both vi and emacs editing modes.

With LC_CTYPE=C if you type non ascii characters smaller than 0xc0 and
pass over them with the cursor you'll see how the cursor thinks it's a
wide character.  This overrides characters, commands as x, r or s get
confused and calling the line from the history file get screwed too.

Given opensbsd formally support only utf8, are you aware and accept this
issue as part of the deal to handle utf8 or may I report it as a bug?


----- End forwarded message -----

Reply | Threaded
Open this post in threaded view
|

Re: ksh(1): vi mode UTF-8 bug

Ingo Schwarze
Hi Walter,

Walter Alejandro Iglesias wrote on Sun, Jun 04, 2017 at 11:45:04AM +0200:

> I realized the issue I describe in the message below

I don't understand at all what you are trying to talk about.

For a proper bug report, please send the exact byte sequence that
you are passing in to the shell on standard input (preferably
expressed as a printf(1) statement using ASCII-only characters, or
alternatively just the byte sequence encoded with "hexdump -C",
such that it doesn't get mangled in mail transit), the exact byte
sequence that the shell is passing on to the terminal for display
(again, preferably with "hexdump -C"), and state exactly which byte
sequence you would expect the shell to send to the terminal instead.

If you don't know how to obtain this information, wait a bit until
anton@ has committed his new /usr/src/regress/bin/ksh/edit/ suite.
That will make obtaining such information easier.

Yours,
  Ingo