[Patch] mg(1): Experimental UTF-8 support

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

[Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
Hi!

Diff is taken from here:
https://github.com/hboetes/mg/compare/display-wide-characters
with minor modifications (didn't patch some files and updated mg.1)

I got two not critical cursor movement bugs
(like I have to press C-f twice or M-f moves cursor subword forward)

mg.patch (36K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
On Tue, May 29, 2018 at 03:33:08PM +0200, Henning Brauer wrote:
> Hi,
>
> very welcome!
>
> I have applied the diff and don't notice immediate breakage. Pls poke

You didn't notice cursor movement bugs? o_O
Well, I'm giving example: авыавыавы
To move from start to end of that word, you have to press M-f 3 times.

Also you might notice you have to press C-f twice to move one character
forward.

https://github.com/hboetes/mg/commits/display-wide-characters

There are new commits, I'll test them later and send a new diff,
so I hope mg is ready to support UTF-8, yeah <3

> me in a few days to give an ok (and hope same brave soul takes
> committing on his plate) assuming I don't run into trouble.
>
> ciao
>
> Henning
>

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Hiltjo Posthuma
On Tue, May 29, 2018 at 05:22:43PM +0300, Leonid Bobrov wrote:

> On Tue, May 29, 2018 at 03:33:08PM +0200, Henning Brauer wrote:
> > Hi,
> >
> > very welcome!
> >
> > I have applied the diff and don't notice immediate breakage. Pls poke
>
> You didn't notice cursor movement bugs? o_O
> Well, I'm giving example: авыавыавы
> To move from start to end of that word, you have to press M-f 3 times.
>
> Also you might notice you have to press C-f twice to move one character
> forward.
>
> https://github.com/hboetes/mg/commits/display-wide-characters
>

The viewing looks good, but I can confirm to also notice the "character"
movement bug.

locale is all set to "en_US.UTF-8".

A good test file is I think:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

Thanks for the work so far. UTF-8 support would be really nice! :)

> There are new commits, I'll test them later and send a new diff,
> so I hope mg is ready to support UTF-8, yeah <3
>
> > me in a few days to give an ok (and hope same brave soul takes
> > committing on his plate) assuming I don't run into trouble.
> >
> > ciao
> >
> > Henning
> >
>

--
Kind regards,
Hiltjo

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
On Wed, May 30, 2018 at 09:05:12AM +0200, Hiltjo Posthuma wrote:

> On Tue, May 29, 2018 at 05:22:43PM +0300, Leonid Bobrov wrote:
> > On Tue, May 29, 2018 at 03:33:08PM +0200, Henning Brauer wrote:
> > > Hi,
> > >
> > > very welcome!
> > >
> > > I have applied the diff and don't notice immediate breakage. Pls poke
> >
> > You didn't notice cursor movement bugs? o_O
> > Well, I'm giving example: авыавыавы
> > To move from start to end of that word, you have to press M-f 3 times.
> >
> > Also you might notice you have to press C-f twice to move one character
> > forward.
> >
> > https://github.com/hboetes/mg/commits/display-wide-characters
> >
>
> The viewing looks good, but I can confirm to also notice the "character"
> movement bug.
>
> locale is all set to "en_US.UTF-8".
>
> A good test file is I think:
> https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
>
> Thanks for the work so far. UTF-8 support would be really nice! :)
>

Thanks goes to S. Gilles <sgilles at math.umd.edu>, he made it :)

Recent commits fixed cursor movement bugs, my work is to send fresh
patches to this list to make sure mg merges UTF-8 support <3

> > There are new commits, I'll test them later and send a new diff,
> > so I hope mg is ready to support UTF-8, yeah <3
> >
> > > me in a few days to give an ok (and hope same brave soul takes
> > > committing on his plate) assuming I don't run into trouble.
> > >
> > > ciao
> > >
> > > Henning
> > >
> >
>
> --
> Kind regards,
> Hiltjo
>

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Reyk Floeter-2

> Am 30.05.2018 um 10:10 schrieb Leonid Bobrov <[hidden email]>:
>
>> On Wed, May 30, 2018 at 09:05:12AM +0200, Hiltjo Posthuma wrote:
>>> On Tue, May 29, 2018 at 05:22:43PM +0300, Leonid Bobrov wrote:
>>>> On Tue, May 29, 2018 at 03:33:08PM +0200, Henning Brauer wrote:
>>>> Hi,
>>>>
>>>> very welcome!
>>>>
>>>> I have applied the diff and don't notice immediate breakage. Pls poke
>>>
>>> You didn't notice cursor movement bugs? o_O
>>> Well, I'm giving example: авыавыавы
>>> To move from start to end of that word, you have to press M-f 3 times.
>>>
>>> Also you might notice you have to press C-f twice to move one character
>>> forward.
>>>
>>> https://github.com/hboetes/mg/commits/display-wide-characters
>>>
>>
>> The viewing looks good, but I can confirm to also notice the "character"
>> movement bug.
>>
>> locale is all set to "en_US.UTF-8".
>>
>> A good test file is I think:
>> https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
>>
>> Thanks for the work so far. UTF-8 support would be really nice! :)
>>
>
> Thanks goes to S. Gilles <sgilles at math.umd.edu>, he made it :)
>
> Recent commits fixed cursor movement bugs, my work is to send fresh
> patches to this list to make sure mg merges UTF-8 support <3
>

I‘m in favor of this, mg is my number one productivity tool and UTF-8 support would be really really great. I will test it as well.

Reyk

>>> There are new commits, I'll test them later and send a new diff,
>>> so I hope mg is ready to support UTF-8, yeah <3
>>>
>>>> me in a few days to give an ok (and hope same brave soul takes
>>>> committing on his plate) assuming I don't run into trouble.
>>>>
>>>> ciao
>>>>
>>>> Henning
>>>>
>>>
>>
>> --
>> Kind regards,
>> Hiltjo
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Stefan Sperling-5
In reply to this post by Leonid Bobrov
Comments inline. I think this still needs a lot of work...

I've done one pass over this to weed out some obvious problems.
There might be many I've missed.

Not going to do any run-time testing because emacs/mg break my fingers ;)

On Mon, May 28, 2018 at 09:32:10PM +0300, Leonid Bobrov wrote:

> Index: basic.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/basic.c,v
> retrieving revision 1.47
> diff -u -p -u -p -r1.47 basic.c
> --- basic.c 10 Oct 2015 09:13:14 -0000 1.47
> +++ basic.c 28 May 2018 18:08:10 -0000
> @@ -18,6 +18,7 @@
>  #include <signal.h>
>  #include <stdio.h>
>  #include <stdlib.h>
> +#include <wchar.h>
>  
>  #include "def.h"
>  
> @@ -269,12 +270,25 @@ setgoal(void)
>  int
>  getgoal(struct line *dlp)
>  {
> - int c, i, col = 0;
> - char tmp[5];
> + return getbyteofcol(dlp, 0, curgoal);
> +}
>  
> +/*
> + * Return the byte offset within lp that is targetcol columns beyond
> + * startbyte
> + */
> +size_t
> +getbyteofcol(const struct line *lp, const size_t startbyte,
> +             const size_t targetcol)
> +{
> + int c;
> + size_t i, col = 0;
> + char tmp[5];
> + size_t advance_by = 1;
>  
> - for (i = 0; i < llength(dlp); i++) {
> - c = lgetc(dlp, i);
> + for (i = startbyte; i < llength(lp); i += advance_by) {
> + advance_by = 1;
> + c = lgetc(lp, i);
>   if (c == '\t'
>  #ifdef NOTAB
>      && !(curbp->b_flag & BFNOTAB)
> @@ -284,16 +298,84 @@ getgoal(struct line *dlp)
>   col++;
>   } else if (ISCTRL(c) != FALSE) {
>   col += 2;
> - } else if (isprint(c))
> + } else if (isprint(c)) {
>   col++;
> - else {
> + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> +                          llength(lp) - i, &mbs);
> + int width = -1;
> + if (consumed < (size_t) -2) {

I think this should say: consumed > 0

Since both (size_t)-1 and (size_t)-2 indicate special error conditions
they should only be compared the with == operator, in my opinion.

Please read the mbrtowc man page if you haven't already.
It covers error conditions extensively.

A good idiom would be something like:

        size_t ret = mbrtowc(...)
        if (ret == (size_t)-1) {
                /* handle decoding error */
        } else if (ret == (size_t)-2) {
                /* handle incomplete sequence */
        } else if (ret == 0) {
                /* handle end of line */
        } else {
                /* handle character which was parsed */
        }

> + width = wcwidth(wc);
> + }
> + if (width >= 0) {

You should explicitly deal with width == -1 here as well.
A negative width means this line contains a non-printable character.
But that doesn't mean this character won't have any effect when printed.
You might want to narrow down the truth about such a character
with iswcntrl() and friends and take some appropriate action.

> + col += width;
> + advance_by = consumed;

What if consumed == (size_t)-1 or (size_t)-2 here?

In particular, you need to decide what to do with illegal byte sequences.
An editor should probably display them in some escaped form and continue
processing valid text beyond. UTF-8 is self-synchronizing so this should
not be too difficult. Printing such bytes in an escaped form might
require more columns than a valid byte sequence would.

Can it happen at any step during editing with mg that a valid multi-byte
character gets split by a newline? That would cause (size_t)-2 here since
this function is processing only one line. An editor should avoid introducing
this problem, but should be able to cope with files which already have this
problem when they are opened.
A good design would process a file such that any existing invalid multi-byte
characters will at least remain as they were when the file was opened, or
at best can be repaired by the user (do not get tempted to try to magically
"fix" the data).

> + } else {
> + col += snprintf(tmp, sizeof(tmp), "\\%o", c);

What if snprintf returns -1 here?

> + }
> + } else {
>   col += snprintf(tmp, sizeof(tmp), "\\%o", c);

What if snprintf returns -1 here?

>   }
> - if (col > curgoal)
> + if (col > targetcol)
>   break;
>   }
>   return (i);
>  }
> +
> +/*
> + * Return the column at which specified offset byte would appear, if
> + * this were part of a longer string printed by vtputs, starting at
> + * intial_col
> + */
> +size_t
> +getcolofbyte(const struct line *lp, const size_t startbyte,
> +             const size_t initial_col, const size_t targetoffset)
> +{
> + int c;
> + size_t i, col = initial_col;
> + char tmp[5];
> + size_t advance_by = 1;
> +
> + for (i = startbyte; i < llength(lp); i += advance_by) {
> + if (i >= targetoffset)
> + break;
> + advance_by = 1;
> + c = lgetc(lp, i);
> + if (c == '\t'
> +#ifdef NOTAB
> +    && !(curbp->b_flag & BFNOTAB)
> +#endif
> + ) {
> + col |= 0x07;
> + col++;
> + } else if (ISCTRL(c) != FALSE) {
> + col += 2;
> + } else if (isprint(c)) {
> + col++;
> + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> +                          llength(lp) - i, &mbs);
> + int width = -1;
> + if (consumed < (size_t) -2) {

Same concern as above: consumed > 0

> + width = wcwidth(wc);
> + }
> + if (width >= 0) {
> + col += width;
> +     while (cp1 != vvp->v_text && !isprint(*cp1) &&
> +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
        > + advance_by = consumed;

Agaile (cp1 != vvp->v_text && !isprint(*cp1) &&
> +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
> +     while (cp1 != vvp->v_text && !isprint(*cp1) &&
> +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
n, consumed could represent a -1 or -2 error from mbrtowc() at this point.

> + } else {
> + col += snprintf(tmp, sizeof(tmp), "\\%o", c);

Again, snprintf can return -1.

> + }
> + } else {
> + col += snprintf(tmp, sizeof(tmp), "\\%o", c);

Again, snprintf can return -1.

> + }
> + }
> + return (col);
> +}
> +
>  
>  /*
>   * Scroll forward by a specified number
> Index: cmode.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/cmode.c,v
> retrieving revision 1.16
> diff -u -p -u -p -r1.16 cmode.c
> --- cmode.c 26 Sep 2015 21:51:58 -0000 1.16
> +++ cmode.c 28 May 2018 18:08:10 -0000
> @@ -14,6 +14,7 @@
>  #include <ctype.h>
>  #include <signal.h>
>  #include <stdio.h>
> +#include <wchar.h>
>  
>  #include "def.h"
>  #include "funmap.h"
> @@ -419,10 +420,25 @@ findcolpos(const struct buffer *bp, cons
>   ) {
>   col |= 0x07;
>   col++;
> - } else if (ISCTRL(c) != FALSE)
> + } else if (ISCTRL(c) != FALSE) {
>   col += 2;
> - else if (isprint(c)) {
> + } else if (isprint(c)) {
>   col++;
> + } else if (!(bp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> +                          llength(lp) - i, &mbs);
> + int width = -1;
> + if (consumed < (size_t) -2) {

Same.

> + width = wcwidth(wc);
> + }
> + if (width >= 0) {
> + col += width;
> + i += (consumed - 1);
> + } else {
> + col += snprintf(tmp, sizeof(tmp), "\\%o", c);

Same.

> + }
>   } else {
>   col += snprintf(tmp, sizeof(tmp), "\\%o", c);

Same.

This pattern has now occured 3 times in this diff.
Consider adding a function to avoid code duplication so that
bugs can be fixed in one place.

>   }
> Index: def.h
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/def.h,v
> retrieving revision 1.155
> diff -u -p -u -p -r1.155 def.h
> --- def.h 14 Apr 2016 17:05:32 -0000 1.155
> +++ def.h 28 May 2018 18:08:10 -0000
> @@ -263,7 +263,7 @@ struct buffer {
>   int b_marko; /* ditto for the "mark" */
>   short b_nmodes; /* number of non-fundamental modes */
>   char b_nwnd; /* Count of windows on buffer */
> - char b_flag; /* Flags */
> + short b_flag; /* Flags */
>   char b_fname[NFILEN]; /* File name */
>   char b_cwd[NFILEN]; /* working directory */
>   struct fileinfo b_fi; /* File attributes */
> @@ -290,6 +290,7 @@ struct buffer {
>  #define BFDIRTY     0x20 /* Buffer was modified elsewhere */
>  #define BFIGNDIRTY  0x40 /* Ignore modifications */
>  #define BFDIREDDEL  0x80 /* Dired has a deleted 'D' file */
> +#define BFSHOWRAW   0x100 /* Show unprintable as octal */
>  /*
>   * This structure holds information about recent actions for the Undo command.
>   */
> @@ -490,6 +491,7 @@ int digit_argument(int, int);
>  int negative_argument(int, int);
>  int selfinsert(int, int);
>  int quote(int, int);
> +int insert_char(int, int);
>  
>  /* main.c */
>  int ctrlg(int, int);
> @@ -512,6 +514,8 @@ int forwline(int, int);
>  int backline(int, int);
>  void setgoal(void);
>  int getgoal(struct line *);
> +size_t getbyteofcol(const struct line *, size_t, size_t);
> +size_t getcolofbyte(const struct line *, size_t, size_t, size_t);
>  int forwpage(int, int);
>  int backpage(int, int);
>  int forw1page(int, int);
> @@ -658,6 +662,7 @@ int notabmode(int, int);
>  #endif /* NOTAB */
>  int overwrite_mode(int, int);
>  int set_default_mode(int,int);
> +int show_raw_mode(int, int);
>  
>  #ifdef REGEX
>  /* re_search.c X */
> Index: display.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/display.c,v
> retrieving revision 1.48
> diff -u -p -u -p -r1.48 display.c
> --- display.c 6 Jul 2017 19:27:37 -0000 1.48
> +++ display.c 28 May 2018 18:08:10 -0000
> @@ -7,7 +7,7 @@
>   * redisplay system knows almost nothing about the editing
>   * process; the editing functions do, however, set some
>   * hints to eliminate a lot of the grinding. There is more
> - * that can be done; the "vtputc" interface is a real
> + * that can be done; the "vtputs" interface is a real
>   * pig.
>   */
>  
> @@ -18,6 +18,7 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <term.h>
> +#include <wchar.h>
>  
>  #include "def.h"
>  #include "kbd.h"
> @@ -52,11 +53,10 @@ struct score {
>  };
>  
>  void vtmove(int, int);
> -void vtputc(int);
>  void vtpute(int);
> -int vtputs(const char *);
> +int vtputs(const char *, size_t, size_t);
>  void vteeol(void);
> -void updext(int, int);
> +int updext(int, int);
>  void modeline(struct mgwin *, int);
>  void setscores(int, int);
>  void traceback(int, int, int, int);
> @@ -216,8 +216,8 @@ vtresize(int force, int newrow, int newc
>   }
>   if (rowchanged || colchanged || first_run) {
>   for (i = 0; i < 2 * (newrow - 1); i++)
> - TRYREALLOC(video[i].v_text, newcol);
> - TRYREALLOC(blanks.v_text, newcol);
> + TRYREALLOC(video[i].v_text, newcol * MB_CUR_MAX);
> + TRYREALLOC(blanks.v_text, newcol * MB_CUR_MAX);

I am not quite sure why the blanks array needs to be grown.
Doesn't the array size relate to the screen size, and doesn't
each ' ' in the array occupy an entire column on the screen?

Then again, the code involved in updating the display is too hairy
for me to digest right now, so I might be missing something.

>   }
>  
>   nrow = newrow;
> @@ -260,7 +260,7 @@ vtinit(void)
>   */
>  
>   blanks.v_color = CTEXT;
> - for (i = 0; i < ncol; ++i)
> + for (i = 0; i < ncol * MB_CUR_MAX; ++i)
>   blanks.v_text[i] = ' ';
>  }
>  
> @@ -287,7 +287,7 @@ vttidy(void)
>   * Move the virtual cursor to an origin
>   * 0 spot on the virtual display screen. I could
>   * store the column as a character pointer to the spot
> - * on the line, which would make "vtputc" a little bit
> + * on the line, which would make "vtputs" a little bit
>   * more efficient. No checking for errors.
>   */
>  void
> @@ -298,88 +298,6 @@ vtmove(int row, int col)
>  }
>  
>  /*
> - * Write a character to the virtual display,
> - * dealing with long lines and the display of unprintable
> - * things like control characters. Also expand tabs every 8
> - * columns. This code only puts printing characters into
> - * the virtual display image. Special care must be taken when
> - * expanding tabs. On a screen whose width is not a multiple
> - * of 8, it is possible for the virtual cursor to hit the
> - * right margin before the next tab stop is reached. This
> - * makes the tab code loop if you are not careful.
> - * Three guesses how we found this.
> - */
> -void
> -vtputc(int c)
> -{
> - struct video *vp;
> -
> - c &= 0xff;
> -
> - vp = vscreen[vtrow];
> - if (vtcol >= ncol)
> - vp->v_text[ncol - 1] = '$';
> - else if (c == '\t'
> -#ifdef NOTAB
> -    && !(curbp->b_flag & BFNOTAB)
> -#endif
> -    ) {
> - do {
> - vtputc(' ');
> - } while (vtcol < ncol && (vtcol & 0x07) != 0);
> - } else if (ISCTRL(c)) {
> - vtputc('^');
> - vtputc(CCHR(c));
> - } else if (isprint(c))
> - vp->v_text[vtcol++] = c;
> - else {
> - char bf[5];
> -
> - snprintf(bf, sizeof(bf), "\\%o", c);
> - vtputs(bf);
> - }
> -}
> -
> -/*
> - * Put a character to the virtual screen in an extended line.  If we are not
> - * yet on left edge, don't print it yet.  Check for overflow on the right
> - * margin.
> - */
> -void
> -vtpute(int c)
> -{
> - struct video *vp;
> -
> - c &= 0xff;
> -
> - vp = vscreen[vtrow];
> - if (vtcol >= ncol)
> - vp->v_text[ncol - 1] = '$';
> - else if (c == '\t'
> -#ifdef NOTAB
> -    && !(curbp->b_flag & BFNOTAB)
> -#endif
> -    ) {
> - do {
> - vtpute(' ');
> - } while (((vtcol + lbound) & 0x07) != 0 && vtcol < ncol);
> - } else if (ISCTRL(c) != FALSE) {
> - vtpute('^');
> - vtpute(CCHR(c));
> - } else if (isprint(c)) {
> - if (vtcol >= 0)
> - vp->v_text[vtcol] = c;
> - ++vtcol;
> - } else {
> - char bf[5], *cp;
> -
> - snprintf(bf, sizeof(bf), "\\%o", c);
> - for (cp = bf; *cp != '\0'; cp++)
> - vtpute(*cp);
> - }
> -}
> -
> -/*
>   * Erase from the end of the software cursor to the end of the line on which
>   * the software cursor is located. The display routines will decide if a
>   * hardware erase to end of line command should be used to display this.
> @@ -390,7 +308,7 @@ vteeol(void)
>   struct video *vp;
>  
>   vp = vscreen[vtrow];
> - while (vtcol < ncol)
> + while (vtcol < ncol * MB_CUR_MAX)
>   vp->v_text[vtcol++] = ' ';
>  }
>  
> @@ -410,7 +328,7 @@ update(int modelinecolor)
>   struct mgwin *wp;
>   struct video *vp1;
>   struct video *vp2;
> - int c, i, j;
> + int c, i;
>   int hflag;
>   int currow, curcol;
>   int offs, size;
> @@ -485,8 +403,9 @@ update(int modelinecolor)
>   vscreen[i]->v_color = CTEXT;
>   vscreen[i]->v_flag |= (VFCHG | VFHBAD);
>   vtmove(i, 0);
> - for (j = 0; j < llength(lp); ++j)
> - vtputc(lgetc(lp, j));
> + if (llength(lp)) {
> + vtputs(lp->l_text, llength(lp), 0);
> + }
>   vteeol();
>   } else if ((wp->w_rflag & (WFEDIT | WFFULL)) != 0) {
>   hflag = TRUE;
> @@ -495,8 +414,10 @@ update(int modelinecolor)
>   vscreen[i]->v_flag |= (VFCHG | VFHBAD);
>   vtmove(i, 0);
>   if (lp != wp->w_bufp->b_headp) {
> - for (j = 0; j < llength(lp); ++j)
> - vtputc(lgetc(lp, j));
> + if (llength(lp)) {
> + vtputs(lp->l_text, llength(lp),
> +    0);
> + }
>   lp = lforw(lp);
>   }
>   vteeol();
> @@ -514,32 +435,53 @@ update(int modelinecolor)
>   ++currow;
>   lp = lforw(lp);
>   }
> +
>   curcol = 0;
>   i = 0;
>   while (i < curwp->w_doto) {
> - c = lgetc(lp, i++);
> + char tmp[5];
> + c = lgetc(lp, i);
>   if (c == '\t'
>  #ifdef NOTAB
>      && !(curbp->b_flag & BFNOTAB)
>  #endif
>   ) {
> - curcol |= 0x07;
>   curcol++;
> + while ((curcol - lbound) & 0x07) {
> + curcol++;
> + }
>   } else if (ISCTRL(c) != FALSE)
>   curcol += 2;
> - else if (isprint(c))
> + else if (isprint(c)) {
>   curcol++;
> - else {
> - char bf[5];
> -
> - snprintf(bf, sizeof(bf), "\\%o", c);
> - curcol += strlen(bf);
> + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> +                          llength(lp) - i, &mbs);
> + int width = -1;
> + if (consumed < (size_t) -2) {

I feel like I'm repeating myself :)

> + width = wcwidth(wc);
> + } else {
> + memset(&mbs, 0, sizeof mbs);
> + }
> + if (width >= 0) {
> + curcol += width;
> + i += (consumed - 1);
> + } else {
> + snprintf(tmp, sizeof(tmp), "\\%o", c);

Missing check for error return (< 0) from snprintf().

> + curcol += strlen(tmp);

You should check for truncation in snprintf: If it returns >= sizeof(tmp)
the string was truncated which you should treat as an error condition
because would mean your chosen tmp buffer size is too small.

And if there was no truncation, you can use snprintf's return value
directly instead of calling strlen() to calculate the same number again.

> + }
> + } else {
> + snprintf(tmp, sizeof(tmp), "\\%o", c);

Same.

> + curcol += strlen(tmp);

Same.

>   }
> + i++;
>   }
>   if (curcol >= ncol - 1) { /* extended line. */
>   /* flag we are extended and changed */
>   vscreen[currow]->v_flag |= VFEXT | VFCHG;
> - updext(currow, curcol); /* and output extended line */
> + curcol = updext(currow, curcol);
>   } else
>   lbound = 0; /* not extended line */
>  
> @@ -558,8 +500,10 @@ update(int modelinecolor)
>   if ((wp != curwp) || (lp != wp->w_dotp) ||
>      (curcol < ncol - 1)) {
>   vtmove(i, 0);
> - for (j = 0; j < llength(lp); ++j)
> - vtputc(lgetc(lp, j));
> + if (llength(lp)) {
> + vtputs(lp->l_text, llength(lp),
> +    0);
> + }
>   vteeol();
>   /* this line no longer is extended */
>   vscreen[i]->v_flag &= ~VFEXT;
> @@ -661,39 +605,44 @@ ucopy(struct video *vvp, struct video *p
>   pvp->v_hash = vvp->v_hash;
>   pvp->v_cost = vvp->v_cost;
>   pvp->v_color = vvp->v_color;
> - bcopy(vvp->v_text, pvp->v_text, ncol);
> + bcopy(vvp->v_text, pvp->v_text, ncol * MB_CUR_MAX);
>  }
>  
>  /*
> - * updext: update the extended line which the cursor is currently on at a
> - * column greater than the terminal width. The line will be scrolled right or
> - * left to let the user see where the cursor is.
> + * updext: update the extended line which the cursor is currently on
> + * at a column greater than the terminal width. The line will be
> + * scrolled right or left to let the user see where the cursor
> + * is. curcol may need to be adjusted, depending on how wide
> + * characters and lbound interact, that adjusted position is returned.
>   */
> -void
> +int
>  updext(int currow, int curcol)
>  {
> - struct line *lp; /* pointer to current line */
> - int j; /* index into line */
> + struct line *lp = curwp->w_dotp; /* pointer to current line */
> + size_t startbyte;
> + int bettercol = curcol;
> + size_t fullextent;
>  
>   if (ncol < 2)
> - return;
> + return curcol;
>  
>   /*
> - * calculate what column the left bound should be
> - * (force cursor into middle half of screen)
> + * calculate what column the left bound should be (force
> + * cursor into middle half of screen). Ensuring that it is at
> + * a tabstop allows update() to calculate curcol without
> + * wondering how tabstops are calculated before the first '$'.
>   */
>   lbound = curcol - (curcol % (ncol >> 1)) - (ncol >> 2);
> + lbound = (lbound | 0x07) + 1;
> + vscreen[currow]->v_text[0] = '$';
> + vtmove(currow, 1);
> + startbyte = getbyteofcol(lp, 0, lbound + 1);
> + fullextent = getbyteofcol(lp, startbyte, ncol + 1);

This might be a good place for some error checking and falling back
to displaying the line with invalid characeters in an escaped form
if necessary.

> + vtputs(lp->l_text + startbyte, fullextent - startbyte, 1);
> + vteeol();
>  
> - /*
> - * scan through the line outputing characters to the virtual screen
> - * once we reach the left edge
> - */
> - vtmove(currow, -lbound); /* start scanning offscreen */
> - lp = curwp->w_dotp; /* line to output */
> - for (j = 0; j < llength(lp); ++j) /* until the end-of-line */
> - vtpute(lgetc(lp, j));
> - vteeol(); /* truncate the virtual line */
> - vscreen[currow]->v_text[0] = '$'; /* and put a '$' in column 1 */
> + bettercol = lbound + getcolofbyte(lp, startbyte, 1, curwp->w_doto);
> + return (bettercol);
>  }
>  
>  /*
> @@ -708,12 +657,35 @@ updext(int currow, int curcol)
>  void
>  uline(int row, struct video *vvp, struct video *pvp)
>  {
> - char  *cp1;
> - char  *cp2;
> - char  *cp3;
> - char  *cp4;
> - char  *cp5;
> + char  *cp1; /* Pointer to the start of dirty region */
> + char  *cp2; /* pvp's counterpart for cp1 */
> + char  *cp3; /* Pointer to end of dirty region */
> + char  *cp4; /* pvp's counterpart for cp3 */
> + char  *cp5; /* After this, within dirty region, all is ' ' */
> + char  *mbcounter;
>   int    nbflag;
> + int    startcol; /* onscreen column matching cp1 */
> + char  *lastbyte; /* byte which handles last onscreen column  */
> + int    seencols = 0;
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> +
> + lastbyte = vvp->v_text;
> + while (seencols < ncol && *lastbyte) {
> + size_t consumed = mbrtowc(&wc, lastbyte,
> +    (vvp->v_text + ncol * MB_CUR_MAX - lastbyte), &mbs);
> + if (consumed < (size_t) -2) {

Again, the error checking as it is done here isn't idiomatic.
See above.

> + lastbyte += consumed;
> + seencols += wcwidth(wc);

wcwidth() could return -1 here.

> + } else {
> + lastbyte++;
> + seencols++;
> + memset(&mbs, 0, sizeof mbs);
> + }
> + }
> + if (lastbyte - vvp->v_text < ncol) {
> + lastbyte = &vvp->v_text[ncol];
> + }
>  
>   if (vvp->v_color != pvp->v_color) { /* Wrong color, do a */
>   ttmove(row, 0); /* full redraw. */
> @@ -729,11 +701,12 @@ uline(int row, struct video *vvp, struct
>   * putting the invisible glitch character on the next line.
>   * (Hazeltine executive 80 model 30)
>   */
> - cp2 = &vvp->v_text[ncol - (magic_cookie_glitch >= 0 ?
> -    (magic_cookie_glitch != 0 ? magic_cookie_glitch : 1) : 0)];
> + cp2 = lastbyte -
> +    (magic_cookie_glitch >= 0 ? (magic_cookie_glitch != 0 ?
> +    magic_cookie_glitch : 1) : 0);
>  #else
>   cp1 = &vvp->v_text[0];
> - cp2 = &vvp->v_text[ncol];
> + cp2 = lastbyte;
>  #endif
>   while (cp1 != cp2) {
>   ttputc(*cp1++);
> @@ -744,21 +717,31 @@ uline(int row, struct video *vvp, struct
>   }
>   cp1 = &vvp->v_text[0]; /* Compute left match. */
>   cp2 = &pvp->v_text[0];
> - while (cp1 != &vvp->v_text[ncol] && cp1[0] == cp2[0]) {
> + while (cp1 != lastbyte && cp1[0] == cp2[0]) {
>   ++cp1;
>   ++cp2;
>   }
> - if (cp1 == &vvp->v_text[ncol]) /* All equal. */
> + if (cp1 == lastbyte) /* All equal. */
>   return;
> + while (cp1 != vvp->v_text && !isprint(*cp1) &&
> +       mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {

NEVER call mbrtowc in a loop header like that.
Call it inside the loop body and handle errors.

> + --cp1;
> + --cp2;
> + }
>   nbflag = FALSE;
> - cp3 = &vvp->v_text[ncol]; /* Compute right match.  */
> - cp4 = &pvp->v_text[ncol];
> + cp3 = lastbyte; /* Compute right match.  */
> + cp4 = &pvp->v_text[lastbyte - vvp->v_text];
>   while (cp3[-1] == cp4[-1]) {
>   --cp3;
>   --cp4;
>   if (cp3[0] != ' ') /* Note non-blanks in */
>   nbflag = TRUE; /* the right match. */
>   }
> + while (cp3 != lastbyte && !isprint(*cp3) &&
> +        mbrtowc(&wc, cp3, (lastbyte - cp3), &mbs) >= (size_t) -2) {

Same.

> + ++cp3;
> + ++cp4;
> + }
>   cp5 = cp3; /* Is erase good? */
>   if (nbflag == FALSE && vvp->v_color == CTEXT) {
>   while (cp5 != cp1 && cp5[-1] == ' ')
> @@ -768,13 +751,27 @@ uline(int row, struct video *vvp, struct
>   cp5 = cp3;
>   }
>   /* Alcyon hack */
> - ttmove(row, (int) (cp1 - &vvp->v_text[0]));
> + startcol = 0;
> + mbcounter = vvp->v_text;
> + while ((cp1 - mbcounter) > 0) {
> + size_t consumed = mbrtowc(&wc, mbcounter, (cp1 - mbcounter),
> +                          &mbs);
> + if (consumed < (size_t) -2) {

Again, please use a better idiom.

> + mbcounter += consumed;
> + startcol += wcwidth(wc);

wcwidth() could return -1.

> + } else {
> + mbcounter++;
> + startcol++;
> + memset(&mbs, 0, sizeof mbs);
> + }
> + }
> + ttmove(row, startcol);
>  #ifdef STANDOUT_GLITCH
>   if (vvp->v_color != CTEXT && magic_cookie_glitch > 0) {
>   if (cp1 < &vvp->v_text[magic_cookie_glitch])
>   cp1 = &vvp->v_text[magic_cookie_glitch];
> - if (cp5 > &vvp->v_text[ncol - magic_cookie_glitch])
> - cp5 = &vvp->v_text[ncol - magic_cookie_glitch];
> + if (cp5 > lastbyte - magic_cookie_glitch)
> + cp5 = lastbyte - magic_cookie_glitch;
>   } else if (magic_cookie_glitch < 0)
>  #endif
>   ttcolor(vvp->v_color);
> @@ -807,46 +804,39 @@ modeline(struct mgwin *wp, int modelinec
>   vscreen[n]->v_flag |= (VFCHG | VFHBAD); /* Recompute, display. */
>   vtmove(n, 0); /* Seek to right line. */
>   bp = wp->w_bufp;
> - vtputc('-');
> - vtputc('-');
> + n = vtputs("--", 0, 0);
>   if ((bp->b_flag & BFREADONLY) != 0) {
> - vtputc('%');
> + n += vtputs("%", 0, n);
>   if ((bp->b_flag & BFCHG) != 0)
> - vtputc('*');
> + n += vtputs("*", 0, n);
>   else
> - vtputc('%');
> + n += vtputs("%", 0, n);
>   } else if ((bp->b_flag & BFCHG) != 0) { /* "*" if changed. */
> - vtputc('*');
> - vtputc('*');
> + n += vtputs("**", 0, n);
>   } else {
> - vtputc('-');
> - vtputc('-');
> + n += vtputs("--", 0, n);
>   }
> - vtputc('-');
> + n += vtputs("-", 0, n);
>   n = 5;

n is being reset to 5 here, overwriting what vtputs() returned.
That looks wrong.

> - n += vtputs("Mg: ");
> + n += vtputs("Mg: ", 0, n);
>   if (bp->b_bname[0] != '\0')
> - n += vtputs(&(bp->b_bname[0]));
> + n += vtputs(&(bp->b_bname[0]), 0, n);
>   while (n < 42) { /* Pad out with blanks. */
> - vtputc(' ');
> - ++n;
> + n += vtputs(" ", 0, n);
>   }
> - vtputc('(');
> - ++n;
> + n += vtputs("(", 0, n);
>   for (md = 0; ; ) {
> - n += vtputs(bp->b_modes[md]->p_name);
> + n += vtputs(bp->b_modes[md]->p_name, 0, n);
>   if (++md > bp->b_nmodes)
>   break;
> - vtputc('-');
> - ++n;
> + n += vtputs("-", 0, n);
>   }
>   /* XXX These should eventually move to a real mode */
>   if (macrodef == TRUE)
> - n += vtputs("-def");
> + n += vtputs("-def", 0, n);
>   if (globalwd == TRUE)
> - n += vtputs("-gwd");
> - vtputc(')');
> - ++n;
> + n += vtputs("-gwd", 0, n);
> + n += vtputs(")", 0, n);
>  
>   if (linenos && colnos)
>   len = snprintf(sl, sizeof(sl), "--L%d--C%d", wp->w_dotline,
> @@ -856,27 +846,132 @@ modeline(struct mgwin *wp, int modelinec
>   else if (colnos)
>   len = snprintf(sl, sizeof(sl), "--C%d", getcolpos(wp));
>   if ((linenos || colnos) && len < sizeof(sl) && len != -1)
> - n += vtputs(sl);
> + n += vtputs(sl, 0, n);
>  
>   while (n < ncol) { /* Pad out. */
> - vtputc('-');
> - ++n;
> + n += vtputs("-", 0, n);
>   }
>  }
>  
>  /*
> - * Output a string to the mode line, report how long it was.
> + * Output a string to the mode line, report how long it was,
> + * dealing with long lines and the display of unprintable
> + * things like control characters. Also expand tabs every 8
> + * columns. This code only puts printing characters into
> + * the virtual display image. Special care must be taken when
> + * expanding tabs. On a screen whose width is not a multiple
> + * of 8, it is possible for the virtual cursor to hit the
> + * right margin before the next tab stop is reached. This
> + * makes the tab code loop if you are not careful.
> + * Three guesses how we found this.
>   */
>  int
> -vtputs(const char *s)
> +vtputs(const char *s, const size_t max_bytes, const size_t initial_col)
>  {
> - int n = 0;
> + const unsigned char *us = (const unsigned char *) s;
> + struct video *vp = vscreen[vtrow];
> + size_t bytes_handled = 0;
> + size_t last_full_byte_start = vtcol;
> + size_t space_printed = 0;
> +
> + if (!s) {
> + return (0);
> + }
> +
> + while (*us && (!max_bytes || bytes_handled < max_bytes)) {
> + if (space_printed + initial_col >= ncol) {
> + break;
> + } else if (*us == '\t'
> +#ifdef NOTAB
> +           && !(curbp->b_flag & BFNOTAB)
> +#endif
> +           ) {
> + last_full_byte_start = vtcol;
> + do {
> + if (vtcol >= 0) {
> + last_full_byte_start = vtcol;
> + vp->v_text[vtcol] = ' ';
> + }
> + vtcol++;
> + space_printed++;
> + } while (space_printed + initial_col < ncol &&
> +         ((space_printed + initial_col) & 0x07));
> + us++;
> + bytes_handled++;
> + } else if (ISCTRL(*us)) {

Won't this interpret multi-byte characters and only look at the first byte?
It seems you'll need some mbtrtowc action in this function, too.

> + last_full_byte_start = vtcol;
> + if (vtcol >= 0) {
> + vp->v_text[vtcol] = '^';
> + }
> + vtcol++;
> + if (vtcol >= 0) {
> + vp->v_text[vtcol] = CCHR(*us);
> + }
> + vtcol++;
> + bytes_handled++;
> + space_printed += 2;
> + us++;
> + } else if (isprint(*us)) {

Same problem.

> + last_full_byte_start = vtcol;
> + if (vtcol >= 0) {
> + vp->v_text[vtcol] = *us++;
> + }
> + vtcol++;
> + bytes_handled++;
> + space_printed++;
> + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumable = max_bytes ?
> +        (max_bytes - bytes_handled) : -1;
> + size_t consumed = mbrtowc(&wc, (const char *)us,
> +                          consumable, &mbs);
> + int width = -1;
> + last_full_byte_start = vtcol;
> + if (consumed < (size_t) -2) {

Same mbrtowc return value handling problem.

> + width = wcwidth(wc);
> + }

Check width == -1 here?

> + if (width >= 0) {
> + bytes_handled += consumed;
> + space_printed += width;
> + do {
> + if (vtcol >= 0) {
> + vp->v_text[vtcol] = *us++;
> + }
> + vtcol++;
> + } while (--consumed);
> + } else {

Cause otherwise this block runs with a non-printable multi-byte
character. Is that what you want?

> + char bf[5];
> + snprintf(bf, sizeof(bf), "\\%o", *us);
> + bytes_handled++;
> + space_printed += vtputs(bf, 0,
> +    space_printed + initial_col);
> + us++;
> + }
> + } else {
> + char bf[5];
> + last_full_byte_start = vtcol;
> + snprintf(bf, sizeof(bf), "\\%o", *us);
> + bytes_handled++;
> + space_printed += vtputs(bf, 0,
> +    space_printed + initial_col);
> + us++;
> + }
> + }
>  
> - while (*s != '\0') {
> - vtputc(*s++);
> - ++n;
> + if ((space_printed + initial_col > ncol) ||
> +    (space_printed + initial_col == ncol &&
> +    (*us && (!max_bytes || bytes_handled < max_bytes)))) {
> + vp->v_text[last_full_byte_start] = '$';
> + while (++last_full_byte_start <= vtcol) {
> + vp->v_text[last_full_byte_start] = ' ';
> + }
> + bytes_handled++;
> + space_printed++;
> + us++;
>   }
> - return (n);
> +
> + return (space_printed);
>  }
>  
>  /*
> @@ -894,11 +989,11 @@ hash(struct video *vp)
>   char   *s;
>  
>   if ((vp->v_flag & VFHBAD) != 0) { /* Hash bad. */
> - s = &vp->v_text[ncol - 1];
> - for (i = ncol; i != 0; --i, --s)
> + s = &vp->v_text[ncol * MB_CUR_MAX - 1];
> + for (i = ncol * MB_CUR_MAX; i != 0; --i, --s)
>   if (*s != ' ')
>   break;
> - n = ncol - i; /* Erase cheaper? */
> + n = ncol * MB_CUR_MAX - i; /* Erase cheaper? */
>   if (n > tceeol)
>   n = tceeol;
>   vp->v_cost = i + n; /* Bytes + blanks. */
> Index: echo.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/echo.c,v
> retrieving revision 1.66
> diff -u -p -u -p -r1.66 echo.c
> --- echo.c 24 Oct 2016 17:18:42 -0000 1.66
> +++ echo.c 28 May 2018 18:08:10 -0000

It looks like the changes made to this file could be split off into
a separate diff that could be reviewed ahead of any changes for UTF-8.

> @@ -844,9 +844,11 @@ ewprintf(const char *fmt, ...)
>   * %k prints the name of the current key (and takes no arguments).
>   * %d prints a decimal integer
>   * %o prints an octal integer
> + * %x prints a hexadecimal integer
>   * %p prints a pointer
>   * %s prints a string
>   * %ld prints a long word
> + * %lx prints a hexadecimal long word
>   * Anything else is echoed verbatim
>   */
>  static void
> @@ -885,6 +887,10 @@ eformat(const char *fp, va_list ap)
>   eputi(va_arg(ap, int), 8);
>   break;
>  
> + case 'x':
> + eputi(va_arg(ap, int), 16);
> + break;
> +
>   case 'p':
>   snprintf(tmp, sizeof(tmp), "%p",
>      va_arg(ap, void *));
> @@ -902,6 +908,9 @@ eformat(const char *fp, va_list ap)
>   case 'd':
>   eputl(va_arg(ap, long), 10);
>   break;
> + case 'x':
> + eputl(va_arg(ap, long), 16);
> + break;
>   default:
>   eputc(c);
>   break;
> @@ -939,6 +948,7 @@ static void
>  eputl(long l, int r)
>  {
>   long q;
> + int c;
>  
>   if (l < 0) {
>   eputc('-');
> @@ -946,7 +956,10 @@ eputl(long l, int r)
>   }
>   if ((q = l / r) != 0)
>   eputl(q, r);
> - eputc((int)(l % r) + '0');
> + c = (int)(l % r) + '0';
> + if (c > '9')
> + c += 'a' - '9' - 1;
> + eputc(c);
>  }
>  
>  /*
> Index: fileio.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/fileio.c,v
> retrieving revision 1.105
> diff -u -p -u -p -r1.105 fileio.c
> --- fileio.c 13 Apr 2018 14:11:37 -0000 1.105
> +++ fileio.c 28 May 2018 18:08:10 -0000
> @@ -1,4 +1,4 @@
> -/* $OpenBSD: fileio.c,v 1.105 2018/04/13 14:11:37 florian Exp $ */
> +/* $OpenBSD: fileio.c,v 1.104 2017/05/30 07:05:22 florian Exp $ */
>  
>  /* This file is in the public domain. */
>  

What are the changes in this file for? Are they unrelated changes?
Looks like they're backing out the r1.105 commit? Why?

> @@ -722,12 +722,15 @@ expandtilde(const char *fn)
>   return (NULL);
>   return(ret);
>   }
> - if (ulen == 0) /* ~/ or ~ */
> - pw = getpwuid(geteuid());
> - else { /* ~user/ or ~user */
> + pw = getpwuid(geteuid());
> + if (ulen == 0) { /* ~/ or ~ */
> + if (pw != NULL)
> + (void)strlcpy(user, pw->pw_name, sizeof(user));
> + else
> + user[0] = '\0';
> + } else { /* ~user/ or ~user */
>   memcpy(user, &fn[1], ulen);
>   user[ulen] = '\0';
> - pw = getpwnam(user);
>   }
>   if (pw != NULL) {
>   plen = strlcpy(path, pw->pw_dir, sizeof(path));
> Index: funmap.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/funmap.c,v
> retrieving revision 1.53
> diff -u -p -u -p -r1.53 funmap.c
> --- funmap.c 14 Apr 2016 17:05:32 -0000 1.53
> +++ funmap.c 28 May 2018 18:08:10 -0000
> @@ -114,6 +114,7 @@ static struct funmap functnames[] = {
>   {bufferinsert, "insert-buffer",},
>   {fileinsert, "insert-file",},
>   {fillword, "insert-with-wrap",},
> + {insert_char, "insert-char",},
>   {backisearch, "isearch-backward",},
>   {forwisearch, "isearch-forward",},
>   {joinline, "join-line",},
> @@ -191,6 +192,7 @@ static struct funmap functnames[] = {
>   {shellcommand, "shell-command",},
>   {piperegion, "shell-command-on-region",},
>   {shrinkwind, "shrink-window",},
> + {show_raw_mode, "show-raw-mode",},
>  #ifdef NOTAB
>   {space_to_tabstop, "space-to-tabstop",},
>  #endif /* NOTAB */
> Index: kbd.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/kbd.c,v
> retrieving revision 1.30
> diff -u -p -u -p -r1.30 kbd.c
> --- kbd.c 26 Sep 2015 21:51:58 -0000 1.30
> +++ kbd.c 28 May 2018 18:08:10 -0000
> @@ -9,6 +9,8 @@
>  #include <sys/queue.h>
>  #include <signal.h>
>  #include <stdio.h>
> +#include <stdlib.h>
> +#include <wchar.h>
>  
>  #include "def.h"
>  #include "kbd.h"
> @@ -403,6 +405,43 @@ quote(int f, int n)
>   ungetkey(c);
>   }
>   return (selfinsert(f, n));
> +}
> +
> +/*
> + * Prompt for a codepoint in whatever the native system's encoding is,
> + * insert it into the file
> + */
> +int
> +insert_char(int f, int n)
> +{
> + char *bufp;
> + char inpbuf[32];
> + wchar_t wc;
> + char mb[MB_CUR_MAX + 1];

The + 1 here is not needed, unless you intend to use mb as
a NUL terminated string. Which doesn't seem to be the case
because the loop below is bound by mbslen.

> + mbstate_t mbs = { 0 };
> + size_t mbslen;
> + size_t i;
> +
> + if ((bufp = eread("Insert character (hex): ", inpbuf, sizeof inpbuf,
> +     EFNEW)) == NULL) {
> + return (ABORT);
> + } else if (bufp[0] == '\0') {
> + return (FALSE);
> + }
> +
> + wc = (wchar_t) strtoll(bufp, NULL, 16);
> + mbslen = wcrtomb(mb, wc, &mbs);
> + if (mbslen == (size_t) -1) {
> + return (FALSE);
> + }

In case you wanted mb to be NUL terminated that would have to be done here.
wcrtomb(3) will write at most MB_CUR_MAX bytes.

> +
> + for (i = 0; i < mbslen; ++i) {
> + if (linsert(1, mb[i]) == FALSE) {
> + return (FALSE);
> + }
> + }
> +
> + return (TRUE);
>  }
>  
>  /*
> Index: keymap.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/keymap.c,v
> retrieving revision 1.58
> diff -u -p -u -p -r1.58 keymap.c
> --- keymap.c 29 Dec 2015 19:44:32 -0000 1.58
> +++ keymap.c 28 May 2018 18:08:10 -0000
> @@ -120,6 +120,21 @@ static struct KEYMAPE (2) cX4map = {
>   }
>  };
>  
> +static PF cX8J[] = {
> + insert_char /* ^M */
> +};
> +
> +static struct KEYMAPE (1) cX8map = {
> + 1,
> + 1,
> + rescan,
> + {
> + {
> + CCHR('M'), CCHR('M'), cX8J, NULL
> + }
> + }
> +};
> +
>  static PF cXcB[] = {
>   listbuffers, /* ^B */
>   quit, /* ^C */
> @@ -158,6 +173,10 @@ static PF cX0[] = {
>   NULL /* 4 */
>  };
>  
> +static PF cX8[] = {
> + NULL /* 4 */
> +};
> +
>  static PF cXeq[] = {
>   showcpos /* = */
>  };
> @@ -189,9 +208,9 @@ static PF cXcar[] = {
>   undo /* u */
>  };
>  
> -struct KEYMAPE (6) cXmap = {
> - 6,
> - 6,
> +struct KEYMAPE (7) cXmap = {
> + 7,
> + 7,
>   rescan,
>   {
>   {
> @@ -207,6 +226,9 @@ struct KEYMAPE (6) cXmap = {
>   '0', '4', cX0, (KEYMAP *) & cX4map
>   },
>   {
> + '8', '8', cX8, (KEYMAP *) & cX8map
> + },
> + {
>   '=', '=', cXeq, NULL
>   },
>   {
> @@ -491,6 +513,18 @@ static struct KEYMAPE (1) overwmap = {
>   }
>  };
>  
> +static struct KEYMAPE (1) rawmap = {
> + 0,
> + 1, /* 1 to avoid 0 sized array */
> + rescan,
> + {
> + /* unused dummy entry for VMS C */
> + {
> + (KCHAR)0, (KCHAR)0, NULL, NULL
> + }
> + }
> +};
> +
>  
>  /*
>   * The basic (root) keyboard map
> @@ -513,6 +547,7 @@ static struct maps_s map_table[] = {
>   {(KEYMAP *) &notabmap, "notab",},
>  #endif /* NOTAB */
>   {(KEYMAP *) &overwmap, "overwrite",},
> + {(KEYMAP *) &rawmap, "raw",},
>   {(KEYMAP *) &metamap, "esc prefix",},
>   {(KEYMAP *) &cXmap, "c-x prefix",},
>   {(KEYMAP *) &cX4map, "c-x 4 prefix",},
> Index: mg.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/mg.1,v
> retrieving revision 1.106
> diff -u -p -u -p -r1.106 mg.1
> --- mg.1 11 Dec 2017 07:27:07 -0000 1.106
> +++ mg.1 28 May 2018 18:08:10 -0000
> @@ -1,7 +1,7 @@
>  .\" $OpenBSD: mg.1,v 1.106 2017/12/11 07:27:07 jmc Exp $
>  .\" This file is in the public domain.
>  .\"
> -.Dd $Mdocdate: December 11 2017 $
> +.Dd $Mdocdate: May 28 2018 $
>  .Dt MG 1
>  .Os
>  .Sh NAME
> @@ -849,7 +849,7 @@ This is the default.
>  .It set-default-mode
>  Append the supplied mode to the list of default modes
>  used by subsequent buffer creation.
> -Built in modes include: fill, indent and overwrite.
> +Built in modes include: fill, indent, overwrite, and raw.
>  .It set-fill-column
>  Prompt the user for a fill column.
>  Used by auto-fill-mode.
> @@ -861,6 +861,8 @@ Sets the prefix string to be used by the
>  Execute external command from mini-buffer.
>  .It shell-command-on-region
>  Provide the text in region to the shell command as input.
> +.It show-raw-mode
> +ASCII encoding mode.
>  .It shrink-window
>  Shrink current window by one line.
>  The window immediately below is expanded to pick up the slack.
> @@ -1091,5 +1093,3 @@ In order to use 8-bit characters (such a
>  needs to be disabled via the
>  .Dq meta-key-mode
>  command.
> -.Pp
> -Multi-byte character sets, such as UTF-8, are not supported.
> Index: modes.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/modes.c,v
> retrieving revision 1.21
> diff -u -p -u -p -r1.21 modes.c
> --- modes.c 30 May 2017 07:05:22 -0000 1.21
> +++ modes.c 28 May 2018 18:08:10 -0000
> @@ -111,6 +111,23 @@ overwrite_mode(int f, int n)
>  }
>  
>  int
> +show_raw_mode(int f, int n)
> +{
> + if (changemode(f, n, "raw") == FALSE)
> + return (FALSE);
> + if (f & FFARG) {
> + if (n <= 0)
> + curbp->b_flag &= ~BFSHOWRAW;
> + else
> + curbp->b_flag |= BFSHOWRAW;
> + } else
> + curbp->b_flag ^= BFSHOWRAW;
> +
> + sgarbf = TRUE;
> + return (TRUE);
> +}
> +
> +int
>  set_default_mode(int f, int n)
>  {
>   int i;
> @@ -170,5 +187,11 @@ set_default_mode(int f, int n)
>   defb_flag |= BFNOTAB;
>   }
>  #endif /* NOTAB */
> + if (strcmp(modebuf, "raw") == 0) {
> + if (n <= 0)
> + defb_flag &= ~BFSHOWRAW;
> + else
> + defb_flag |= BFSHOWRAW;
> + }
>   return (TRUE);
>  }
> Index: util.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mg/util.c,v
> retrieving revision 1.38
> diff -u -p -u -p -r1.38 util.c
> --- util.c 18 Nov 2015 18:21:06 -0000 1.38
> +++ util.c 28 May 2018 18:08:10 -0000
> @@ -13,6 +13,9 @@
>  #include <ctype.h>
>  #include <signal.h>
>  #include <stdio.h>
> +#include <string.h>
> +#include <stdlib.h>
> +#include <wchar.h>
>  
>  #include "def.h"
>  
> @@ -33,6 +36,9 @@ showcpos(int f, int n)
>   int nline, row;
>   int cline, cbyte; /* Current line/char/byte */
>   int ratio;
> + char ismb = 0;
> + wchar_t wc = 0;
> + char mbc[MB_CUR_MAX + 1];
>  
>   /* collect the data */
>   clp = bfirstlp(curbp);
> @@ -69,8 +75,43 @@ showcpos(int f, int n)
>   clp = lforw(clp);
>   }
>   ratio = nchar ? (100L * cchar) / nchar : 100;
> - ewprintf("Char: %c (0%o)  point=%ld(%d%%)  line=%d  row=%d  col=%d",
> -    cbyte, cbyte, cchar, ratio, cline, row, getcolpos(curwp));
> +
> + if (!(curbp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + size_t consumed = 0;
> + size_t offset = 0;
> + while (cbyte != '\n' && offset <= curwp->w_doto) {
> + int c = lgetc(clp, curwp->w_doto - offset);
> + if (isprint(c) || (ISCTRL(c) != FALSE)) {
> + break;
> + }
> + consumed = mbrtowc(&wc,
> +                   &clp->l_text[curwp->w_doto - offset],
> +                   llength(clp) - curwp->w_doto + offset,
> +                   &mbs);
> + if (consumed < (size_t) -2) {

Again, please use a better mbrtowc() idiom.

> + ismb = (offset < consumed);
> + snprintf(mbc, consumed + 1, "%s",
> +         &clp->l_text[curwp->w_doto - offset]);
> + mbc[consumed + 1] = '\0';
> + break;
> + } else {
> + memset(&mbs, 0, sizeof mbs);
> + }
> + offset++;
> + }
> + }
> +
> + if (ismb) {
> + ewprintf("Char: %s (codepoint 0x%lx) Byte: %c (0%o)  "
> +         "point=%ld(%d%%)  line=%d  row=%d col=%d", mbc,
> +         (long) wc, cbyte, cbyte, cchar, ratio, cline, row,
> +         getcolpos(curwp));
> + } else {
> + ewprintf("Char: %c (0%o)  point=%ld(%d%%)  line=%d  row=%d"
> +         "col=%d", cbyte, cbyte, cchar, ratio, cline, row,
> +         getcolpos(curwp));
> + }
>   return (TRUE);
>  }
>  
> @@ -96,6 +137,22 @@ getcolpos(struct mgwin *wp)
>   col += 2;
>   else if (isprint(c)) {
>   col++;
> + } else if (!(wp->w_bufp->b_flag & BFSHOWRAW)) {
> + mbstate_t mbs = { 0 };
> + wchar_t wc = 0;
> + size_t consumed = mbrtowc(&wc, &wp->w_dotp->l_text[i],
> +                          llength(wp->w_dotp) - i,
> +                          &mbs);
> + int width = -1;
> + if (consumed < (size_t) -2) {

Same.

> + width = wcwidth(wc);
> + }

width == -1 ?

> + if (width >= 0) {
> + col += width;
> + i += (consumed - 1);
> + } else {
> + col += snprintf(tmp, sizeof(tmp), "\\%o", c);

snprintf can return -1

> + }
>   } else {
>   col += snprintf(tmp, sizeof(tmp), "\\%o", c);

snprintf can return -1

>   }

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
Hi Stefan!

I'm sorry, fileio.c was out of sync that day, that's my bad, I forgot
to remove its diff.

Sorry for not reading diffs before sending them (I read your comments),
that's why I call it experimental: I blindly applied diffs and tested
mg in runtime. I'll send your comments to S. Gilles, after all he made
all that huge work :)

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Han Boetes-3
For anyone interested in contributing to this specific branch,
Please have a look on github:
https://github.com/hboetes/mg/tree/display-wide-characters

On Wed, May 30, 2018 at 12:55 PM Leonid Bobrov <[hidden email]> wrote:

> Hi Stefan!
>
> I'm sorry, fileio.c was out of sync that day, that's my bad, I forgot
> to remove its diff.
>
> Sorry for not reading diffs before sending them (I read your comments),
> that's why I call it experimental: I blindly applied diffs and tested
> mg in runtime. I'll send your comments to S. Gilles, after all he made
> all that huge work :)
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
I'm updating a diff, but it is just a fix for cursor movement,
it is still experimental and obvious problems are not fixed yet,
this is just a diff taken from latest commit at github:

mg.patch (76K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Theo de Raadt-2
In reply to this post by Stefan Sperling-5
This approach seems misguided.  Let me tell a story.

More than two decades ago, I made a fork of mg which was 100% byte
clean.  Unfortunately I lost the source code of that change.  mg's data
buffers are a linked list of lines, with a \n implied by the end of each
string.  0 bytes are unsupported.  Supporting multibyte risks a string
merger getting confused by these problems and creating junk.  My fork
changed mg to embed \n and \0 in the strings, and have the display code
understand that.  All buffer-search functionalities also learned of this
change.  The change had to be made quite incrementally and carefully,
but I recall the end result deleted far more code than it added.  mg became
100% 8-bit clean, I could edit binaries with it.

I think trying to shoehorn utf8 in before mg is 8-bit clean is a recipe
for disaster.  There are too many functions (imagine search-replace)
which already have nasty special cases for \n, and now will need nasty
special cases for utf8 and I don't think this will work out nice.



Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

S. Gilles
In reply to this post by Stefan Sperling-5
On 2018-05-30T11:50:37+0200, Stefan Sperling wrote:
> Comments inline. I think this still needs a lot of work...

Thanks for the review; replies inline (and omitted where the reply
is the same as one above). By the time you read this, I'll have
pushed the changes I mention to my branch in hboetes' repo.

> I've done one pass over this to weed out some obvious problems.
> There might be many I've missed.
>
> Not going to do any run-time testing because emacs/mg break my fingers ;)

Understood :). A couple of your suggestions are actually already
handled by mg's behavior, though.

> On Mon, May 28, 2018 at 09:32:10PM +0300, Leonid Bobrov wrote:
> > Index: basic.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/basic.c,v
> > retrieving revision 1.47
> > diff -u -p -u -p -r1.47 basic.c
> > --- basic.c 10 Oct 2015 09:13:14 -0000 1.47
> > +++ basic.c 28 May 2018 18:08:10 -0000
> > @@ -18,6 +18,7 @@
> >  #include <signal.h>
> >  #include <stdio.h>
> >  #include <stdlib.h>
> > +#include <wchar.h>
> >  
> >  #include "def.h"
> >  
> > @@ -269,12 +270,25 @@ setgoal(void)
> >  int
> >  getgoal(struct line *dlp)
> >  {
> > - int c, i, col = 0;
> > - char tmp[5];
> > + return getbyteofcol(dlp, 0, curgoal);
> > +}
> >  
> > +/*
> > + * Return the byte offset within lp that is targetcol columns beyond
> > + * startbyte
> > + */
> > +size_t
> > +getbyteofcol(const struct line *lp, const size_t startbyte,
> > +             const size_t targetcol)
> > +{
> > + int c;
> > + size_t i, col = 0;
> > + char tmp[5];
> > + size_t advance_by = 1;
> >  
> > - for (i = 0; i < llength(dlp); i++) {
> > - c = lgetc(dlp, i);
> > + for (i = startbyte; i < llength(lp); i += advance_by) {
> > + advance_by = 1;
> > + c = lgetc(lp, i);
> >   if (c == '\t'
> >  #ifdef NOTAB
> >      && !(curbp->b_flag & BFNOTAB)
> > @@ -284,16 +298,84 @@ getgoal(struct line *dlp)
> >   col++;
> >   } else if (ISCTRL(c) != FALSE) {
> >   col += 2;
> > - } else if (isprint(c))
> > + } else if (isprint(c)) {
> >   col++;
> > - else {
> > + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> > +                          llength(lp) - i, &mbs);
> > + int width = -1;
> > + if (consumed < (size_t) -2) {
>
> I think this should say: consumed > 0

If size_t is unsigned, using "consumed > 0" would be a rather large
semantic difference, but I believe I understand your meaning: compare
explicitly with -1 and -2, then handle normal path in else.

> Since both (size_t)-1 and (size_t)-2 indicate special error conditions
> they should only be compared the with == operator, in my opinion.
>
> Please read the mbrtowc man page if you haven't already.
> It covers error conditions extensively.

I have, many times.  mg makes no distinction (that I know of) between
an incomplete sequence and a decoding error.  Either error should
cause mg to throw up its hands and display an octal sequence for
the offending byte.

That said, I see your point about comparing errors only with ==. I
don't personally agree with it, but neither do I disagree. In the
interest of line length, I'll wrap "not equal to -2 or -1" into a
MBRTOWC_SUCCESSFUL macro, and use that in place of < -2.

> A good idiom would be something like:
>
> size_t ret = mbrtowc(...)
> if (ret == (size_t)-1) {
> /* handle decoding error */
> } else if (ret == (size_t)-2) {
> /* handle incomplete sequence */
> } else if (ret == 0) {
> /* handle end of line */

As I read the manual, ret = 0 corresponds to returning L'\0'. The
"end of line" condition is actually handled above, with the llength(lp)
comparision. For the L'\0' case, mg currently deals with NUL bytes
via the control check (also above), so consumed should never be 0.

That said, it's pretty trivial to add another layer ensuring
advance_by is never 0, so I'll add that.

> } else {
> /* handle character which was parsed */
> }

That idiom handles four separate cases. Of those, mg sees no
distinction between the -1 and -2 case, and the special handling
for the L'\0' case makes more sense elsewhere (where advance_by is
set), folded into the else case.

> > + width = wcwidth(wc);
> > + }
> > + if (width >= 0) {
>
> You should explicitly deal with width == -1 here as well.
> A negative width means this line contains a non-printable character.
> But that doesn't mean this character won't have any effect when printed.
> You might want to narrow down the truth about such a character
> with iswcntrl() and friends and take some appropriate action.

I would have no intrinsic problem accepting a patch for such
appropriate action, but I'm not going to write it myself. This is
because there isn't a single non-ASCII control character that I'd
know what to do with: Bidirectional text, Ruby, "variational
selectors", etc. We're displaying those characters in octal, so
there's at least a clue to the user that mg doesn't handle their
text. Correctly handling that text correctly is a couple orders of
magnitude beyond my goals, and quite possibly technically impossible
for mg's use case.

> > + col += width;
> > + advance_by = consumed;
>
> What if consumed == (size_t)-1 or (size_t)-2 here?

Then "width = wcwidth(wc)" was never called above, so width remains
at "-1", so this line won't be reached.

> In particular, you need to decide what to do with illegal byte sequences.
> An editor should probably display them in some escaped form and continue
> processing valid text beyond. UTF-8 is self-synchronizing so this should
> not be too difficult. Printing such bytes in an escaped form might
> require more columns than a valid byte sequence would.

Completely agreed! That's exactly what mg does, actually (and the
reason for all the snprintf stuff). I'm actually not making any
real decisions about editors here, just following what's already present.

mg already has a lot of current behavior around invalid byte
sequences. The behavior is just ASCII (or DEC in some cases) specific,
and sort of broken with long lines.

> Can it happen at any step during editing with mg that a valid multi-byte
> character gets split by a newline? That would cause (size_t)-2 here since
> this function is processing only one line. An editor should avoid introducing
> this problem, but should be able to cope with files which already have this
> problem when they are opened.

If the multibyte character gets split by a '\n', it's no longer a
multibyte character. We'll read up to the '\n', classify the bytes
before as garbage, and see that the next line starts with some
garbage as well. I don't think it's worth trying to do any sort of
error recovery on damaged bytes: the file contains invalid UTF-8
now, let the user deal with it.

> A good design would process a file such that any existing invalid multi-byte
> characters will at least remain as they were when the file was opened, or
> at best can be repaired by the user (do not get tempted to try to magically
> "fix" the data).
> > + } else {
> > + col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> What if snprintf returns -1 here?

I can't actually think of how that's possible. There aren't any
dereferencing problems with tmp, there's no allocation with snprintf,
and there should be no encoding problems by rendering an int to
octal. Old versions of glibc had a bug returning -1 on truncation,
but c is small enough that "\%o" can only use 4 characters.

If it somehow does return -1, that will simply mess with the
calculation of "col > targetcol" below.  We'll return a wrong offset
(but nothing beyond string length).

> > + }
> > + } else {
> >   col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> What if snprintf returns -1 here?

Even to be especially guardful, I don't think there's anything
intelligent we can do here. The goal of this function is to synchronize
offsets with the display onscreen. If snprintf into a fixed, local
buffer fails, did the corresponding display print fail? Did the
flush to screen fail?

As an aside, the idiom of using snprintf isn't from my diff. I would
prefer to print something fixed-width, so that this could just be
"col += 4" or so. But if I were to follow all my desires about
changing internals, I'd end up completely rewriting the editor, or
at the very least fielding lots of unwanted "why change this?"
questions.

> >   }
> > - if (col > curgoal)
> > + if (col > targetcol)
> >   break;
> >   }
> >   return (i);
> >  }
> > +
> > +/*
> > + * Return the column at which specified offset byte would appear, if
> > + * this were part of a longer string printed by vtputs, starting at
> > + * intial_col
> > + */
> > +size_t
> > +getcolofbyte(const struct line *lp, const size_t startbyte,
> > +             const size_t initial_col, const size_t targetoffset)
> > +{
> > + int c;
> > + size_t i, col = initial_col;
> > + char tmp[5];
> > + size_t advance_by = 1;
> > +
> > + for (i = startbyte; i < llength(lp); i += advance_by) {
> > + if (i >= targetoffset)
> > + break;
> > + advance_by = 1;
> > + c = lgetc(lp, i);
> > + if (c == '\t'
> > +#ifdef NOTAB
> > +    && !(curbp->b_flag & BFNOTAB)
> > +#endif
> > + ) {
> > + col |= 0x07;
> > + col++;
> > + } else if (ISCTRL(c) != FALSE) {
> > + col += 2;
> > + } else if (isprint(c)) {
> > + col++;
> > + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> > +                          llength(lp) - i, &mbs);
> > + int width = -1;
> > + if (consumed < (size_t) -2) {
>
> Same concern as above: consumed > 0
>
> > + width = wcwidth(wc);
> > + }
> > + if (width >= 0) {
> > + col += width;
> > +     while (cp1 != vvp->v_text && !isprint(*cp1) &&
> > +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
> > + advance_by = consumed;
>
> Agaile (cp1 != vvp->v_text && !isprint(*cp1) &&
> > +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
> > +     while (cp1 != vvp->v_text && !isprint(*cp1) &&
> > +            mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
> n, consumed could represent a -1 or -2 error from mbrtowc() at this point.

I don't believe it can -- same logic as above about the "width >=
0" check.

>
> > + } else {
> > + col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Again, snprintf can return -1.
>
> > + }
> > + } else {
> > + col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Again, snprintf can return -1.
>
> > + }
> > + }
> > + return (col);
> > +}
> > +
> >  
> >  /*
> >   * Scroll forward by a specified number
> > Index: cmode.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/cmode.c,v
> > retrieving revision 1.16
> > diff -u -p -u -p -r1.16 cmode.c
> > --- cmode.c 26 Sep 2015 21:51:58 -0000 1.16
> > +++ cmode.c 28 May 2018 18:08:10 -0000
> > @@ -14,6 +14,7 @@
> >  #include <ctype.h>
> >  #include <signal.h>
> >  #include <stdio.h>
> > +#include <wchar.h>
> >  
> >  #include "def.h"
> >  #include "funmap.h"
> > @@ -419,10 +420,25 @@ findcolpos(const struct buffer *bp, cons
> >   ) {
> >   col |= 0x07;
> >   col++;
> > - } else if (ISCTRL(c) != FALSE)
> > + } else if (ISCTRL(c) != FALSE) {
> >   col += 2;
> > - else if (isprint(c)) {
> > + } else if (isprint(c)) {
> >   col++;
> > + } else if (!(bp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> > +                          llength(lp) - i, &mbs);
> > + int width = -1;
> > + if (consumed < (size_t) -2) {
>
> Same.
>
> > + width = wcwidth(wc);
> > + }
> > + if (width >= 0) {
> > + col += width;
> > + i += (consumed - 1);
> > + } else {
> > + col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Same.
>
> > + }
> >   } else {
> >   col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Same.
>
> This pattern has now occured 3 times in this diff.
> Consider adding a function to avoid code duplication so that
> bugs can be fixed in one place.

Indeed. My reasoning is currently as follows: I actually want to
make minimal structural changes to mg (believe it or not). In the
areas where I'm adding all this multibyte gunk, there's a lot of
surrounding duplication. All that "col |= 0x07" stuff is also
duplication, and should also be factored out.

In the scale of choice between "preserve and add to duplication"
and "wholly rewrite huge chunks of each file", I chose to stick
closer to the former -- too close in this case. I'll factor this
stepping pattern out.

> >   }
> > Index: def.h
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/def.h,v
> > retrieving revision 1.155
> > diff -u -p -u -p -r1.155 def.h
> > --- def.h 14 Apr 2016 17:05:32 -0000 1.155
> > +++ def.h 28 May 2018 18:08:10 -0000
> > @@ -263,7 +263,7 @@ struct buffer {
> >   int b_marko; /* ditto for the "mark" */
> >   short b_nmodes; /* number of non-fundamental modes */
> >   char b_nwnd; /* Count of windows on buffer */
> > - char b_flag; /* Flags */
> > + short b_flag; /* Flags */
> >   char b_fname[NFILEN]; /* File name */
> >   char b_cwd[NFILEN]; /* working directory */
> >   struct fileinfo b_fi; /* File attributes */
> > @@ -290,6 +290,7 @@ struct buffer {
> >  #define BFDIRTY     0x20 /* Buffer was modified elsewhere */
> >  #define BFIGNDIRTY  0x40 /* Ignore modifications */
> >  #define BFDIREDDEL  0x80 /* Dired has a deleted 'D' file */
> > +#define BFSHOWRAW   0x100 /* Show unprintable as octal */
> >  /*
> >   * This structure holds information about recent actions for the Undo command.
> >   */
> > @@ -490,6 +491,7 @@ int digit_argument(int, int);
> >  int negative_argument(int, int);
> >  int selfinsert(int, int);
> >  int quote(int, int);
> > +int insert_char(int, int);
> >  
> >  /* main.c */
> >  int ctrlg(int, int);
> > @@ -512,6 +514,8 @@ int forwline(int, int);
> >  int backline(int, int);
> >  void setgoal(void);
> >  int getgoal(struct line *);
> > +size_t getbyteofcol(const struct line *, size_t, size_t);
> > +size_t getcolofbyte(const struct line *, size_t, size_t, size_t);
> >  int forwpage(int, int);
> >  int backpage(int, int);
> >  int forw1page(int, int);
> > @@ -658,6 +662,7 @@ int notabmode(int, int);
> >  #endif /* NOTAB */
> >  int overwrite_mode(int, int);
> >  int set_default_mode(int,int);
> > +int show_raw_mode(int, int);
> >  
> >  #ifdef REGEX
> >  /* re_search.c X */
> > Index: display.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/display.c,v
> > retrieving revision 1.48
> > diff -u -p -u -p -r1.48 display.c
> > --- display.c 6 Jul 2017 19:27:37 -0000 1.48
> > +++ display.c 28 May 2018 18:08:10 -0000
> > @@ -7,7 +7,7 @@
> >   * redisplay system knows almost nothing about the editing
> >   * process; the editing functions do, however, set some
> >   * hints to eliminate a lot of the grinding. There is more
> > - * that can be done; the "vtputc" interface is a real
> > + * that can be done; the "vtputs" interface is a real
> >   * pig.
> >   */
> >  
> > @@ -18,6 +18,7 @@
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <term.h>
> > +#include <wchar.h>
> >  
> >  #include "def.h"
> >  #include "kbd.h"
> > @@ -52,11 +53,10 @@ struct score {
> >  };
> >  
> >  void vtmove(int, int);
> > -void vtputc(int);
> >  void vtpute(int);
> > -int vtputs(const char *);
> > +int vtputs(const char *, size_t, size_t);
> >  void vteeol(void);
> > -void updext(int, int);
> > +int updext(int, int);
> >  void modeline(struct mgwin *, int);
> >  void setscores(int, int);
> >  void traceback(int, int, int, int);
> > @@ -216,8 +216,8 @@ vtresize(int force, int newrow, int newc
> >   }
> >   if (rowchanged || colchanged || first_run) {
> >   for (i = 0; i < 2 * (newrow - 1); i++)
> > - TRYREALLOC(video[i].v_text, newcol);
> > - TRYREALLOC(blanks.v_text, newcol);
> > + TRYREALLOC(video[i].v_text, newcol * MB_CUR_MAX);
> > + TRYREALLOC(blanks.v_text, newcol * MB_CUR_MAX);
>
> I am not quite sure why the blanks array needs to be grown.
> Doesn't the array size relate to the screen size, and doesn't
> each ' ' in the array occupy an entire column on the screen?

blanks is used by uline at some point, and uline expects that both
vvp->v_text and pvp->v_text have the same length. You'll get a
pretty nasty segfault without that multiplication.

> Then again, the code involved in updating the display is too hairy
> for me to digest right now, so I might be missing something.

Agreed 100%. This was the part of the code that really made me
decide against refactoring.

> >   }
> >  
> >   nrow = newrow;
> > @@ -260,7 +260,7 @@ vtinit(void)
> >   */
> >  
> >   blanks.v_color = CTEXT;
> > - for (i = 0; i < ncol; ++i)
> > + for (i = 0; i < ncol * MB_CUR_MAX; ++i)
> >   blanks.v_text[i] = ' ';
> >  }
> >  
> > @@ -287,7 +287,7 @@ vttidy(void)
> >   * Move the virtual cursor to an origin
> >   * 0 spot on the virtual display screen. I could
> >   * store the column as a character pointer to the spot
> > - * on the line, which would make "vtputc" a little bit
> > + * on the line, which would make "vtputs" a little bit
> >   * more efficient. No checking for errors.
> >   */
> >  void
> > @@ -298,88 +298,6 @@ vtmove(int row, int col)
> >  }
> >  
> >  /*
> > - * Write a character to the virtual display,
> > - * dealing with long lines and the display of unprintable
> > - * things like control characters. Also expand tabs every 8
> > - * columns. This code only puts printing characters into
> > - * the virtual display image. Special care must be taken when
> > - * expanding tabs. On a screen whose width is not a multiple
> > - * of 8, it is possible for the virtual cursor to hit the
> > - * right margin before the next tab stop is reached. This
> > - * makes the tab code loop if you are not careful.
> > - * Three guesses how we found this.
> > - */
> > -void
> > -vtputc(int c)
> > -{
> > - struct video *vp;
> > -
> > - c &= 0xff;
> > -
> > - vp = vscreen[vtrow];
> > - if (vtcol >= ncol)
> > - vp->v_text[ncol - 1] = '$';
> > - else if (c == '\t'
> > -#ifdef NOTAB
> > -    && !(curbp->b_flag & BFNOTAB)
> > -#endif
> > -    ) {
> > - do {
> > - vtputc(' ');
> > - } while (vtcol < ncol && (vtcol & 0x07) != 0);
> > - } else if (ISCTRL(c)) {
> > - vtputc('^');
> > - vtputc(CCHR(c));
> > - } else if (isprint(c))
> > - vp->v_text[vtcol++] = c;
> > - else {
> > - char bf[5];
> > -
> > - snprintf(bf, sizeof(bf), "\\%o", c);
> > - vtputs(bf);
> > - }
> > -}
> > -
> > -/*
> > - * Put a character to the virtual screen in an extended line.  If we are not
> > - * yet on left edge, don't print it yet.  Check for overflow on the right
> > - * margin.
> > - */
> > -void
> > -vtpute(int c)
> > -{
> > - struct video *vp;
> > -
> > - c &= 0xff;
> > -
> > - vp = vscreen[vtrow];
> > - if (vtcol >= ncol)
> > - vp->v_text[ncol - 1] = '$';
> > - else if (c == '\t'
> > -#ifdef NOTAB
> > -    && !(curbp->b_flag & BFNOTAB)
> > -#endif
> > -    ) {
> > - do {
> > - vtpute(' ');
> > - } while (((vtcol + lbound) & 0x07) != 0 && vtcol < ncol);
> > - } else if (ISCTRL(c) != FALSE) {
> > - vtpute('^');
> > - vtpute(CCHR(c));
> > - } else if (isprint(c)) {
> > - if (vtcol >= 0)
> > - vp->v_text[vtcol] = c;
> > - ++vtcol;
> > - } else {
> > - char bf[5], *cp;
> > -
> > - snprintf(bf, sizeof(bf), "\\%o", c);
> > - for (cp = bf; *cp != '\0'; cp++)
> > - vtpute(*cp);
> > - }
> > -}
> > -
> > -/*
> >   * Erase from the end of the software cursor to the end of the line on which
> >   * the software cursor is located. The display routines will decide if a
> >   * hardware erase to end of line command should be used to display this.
> > @@ -390,7 +308,7 @@ vteeol(void)
> >   struct video *vp;
> >  
> >   vp = vscreen[vtrow];
> > - while (vtcol < ncol)
> > + while (vtcol < ncol * MB_CUR_MAX)
> >   vp->v_text[vtcol++] = ' ';
> >  }
> >  
> > @@ -410,7 +328,7 @@ update(int modelinecolor)
> >   struct mgwin *wp;
> >   struct video *vp1;
> >   struct video *vp2;
> > - int c, i, j;
> > + int c, i;
> >   int hflag;
> >   int currow, curcol;
> >   int offs, size;
> > @@ -485,8 +403,9 @@ update(int modelinecolor)
> >   vscreen[i]->v_color = CTEXT;
> >   vscreen[i]->v_flag |= (VFCHG | VFHBAD);
> >   vtmove(i, 0);
> > - for (j = 0; j < llength(lp); ++j)
> > - vtputc(lgetc(lp, j));
> > + if (llength(lp)) {
> > + vtputs(lp->l_text, llength(lp), 0);
> > + }
> >   vteeol();
> >   } else if ((wp->w_rflag & (WFEDIT | WFFULL)) != 0) {
> >   hflag = TRUE;
> > @@ -495,8 +414,10 @@ update(int modelinecolor)
> >   vscreen[i]->v_flag |= (VFCHG | VFHBAD);
> >   vtmove(i, 0);
> >   if (lp != wp->w_bufp->b_headp) {
> > - for (j = 0; j < llength(lp); ++j)
> > - vtputc(lgetc(lp, j));
> > + if (llength(lp)) {
> > + vtputs(lp->l_text, llength(lp),
> > +    0);
> > + }
> >   lp = lforw(lp);
> >   }
> >   vteeol();
> > @@ -514,32 +435,53 @@ update(int modelinecolor)
> >   ++currow;
> >   lp = lforw(lp);
> >   }
> > +
> >   curcol = 0;
> >   i = 0;
> >   while (i < curwp->w_doto) {
> > - c = lgetc(lp, i++);
> > + char tmp[5];
> > + c = lgetc(lp, i);
> >   if (c == '\t'
> >  #ifdef NOTAB
> >      && !(curbp->b_flag & BFNOTAB)
> >  #endif
> >   ) {
> > - curcol |= 0x07;
> >   curcol++;
> > + while ((curcol - lbound) & 0x07) {
> > + curcol++;
> > + }
> >   } else if (ISCTRL(c) != FALSE)
> >   curcol += 2;
> > - else if (isprint(c))
> > + else if (isprint(c)) {
> >   curcol++;
> > - else {
> > - char bf[5];
> > -
> > - snprintf(bf, sizeof(bf), "\\%o", c);
> > - curcol += strlen(bf);
> > + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumed = mbrtowc(&wc, &lp->l_text[i],
> > +                          llength(lp) - i, &mbs);
> > + int width = -1;
> > + if (consumed < (size_t) -2) {
>
> I feel like I'm repeating myself :)

You are :)

> > + width = wcwidth(wc);
> > + } else {
> > + memset(&mbs, 0, sizeof mbs);
> > + }
> > + if (width >= 0) {
> > + curcol += width;
> > + i += (consumed - 1);
> > + } else {
> > + snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Missing check for error return (< 0) from snprintf().
>
> > + curcol += strlen(tmp);
>
> You should check for truncation in snprintf: If it returns >= sizeof(tmp)
> the string was truncated which you should treat as an error condition
> because would mean your chosen tmp buffer size is too small.
>
> And if there was no truncation, you can use snprintf's return value
> directly instead of calling strlen() to calculate the same number again.

Same comments as above for snprintf, but completely agreed about
the superfluous strlen(). I'll change that.

> > + }
> > + } else {
> > + snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> Same.
>
> > + curcol += strlen(tmp);
>
> Same.
>
> >   }
> > + i++;
> >   }
> >   if (curcol >= ncol - 1) { /* extended line. */
> >   /* flag we are extended and changed */
> >   vscreen[currow]->v_flag |= VFEXT | VFCHG;
> > - updext(currow, curcol); /* and output extended line */
> > + curcol = updext(currow, curcol);
> >   } else
> >   lbound = 0; /* not extended line */
> >  
> > @@ -558,8 +500,10 @@ update(int modelinecolor)
> >   if ((wp != curwp) || (lp != wp->w_dotp) ||
> >      (curcol < ncol - 1)) {
> >   vtmove(i, 0);
> > - for (j = 0; j < llength(lp); ++j)
> > - vtputc(lgetc(lp, j));
> > + if (llength(lp)) {
> > + vtputs(lp->l_text, llength(lp),
> > +    0);
> > + }
> >   vteeol();
> >   /* this line no longer is extended */
> >   vscreen[i]->v_flag &= ~VFEXT;
> > @@ -661,39 +605,44 @@ ucopy(struct video *vvp, struct video *p
> >   pvp->v_hash = vvp->v_hash;
> >   pvp->v_cost = vvp->v_cost;
> >   pvp->v_color = vvp->v_color;
> > - bcopy(vvp->v_text, pvp->v_text, ncol);
> > + bcopy(vvp->v_text, pvp->v_text, ncol * MB_CUR_MAX);
> >  }
> >  
> >  /*
> > - * updext: update the extended line which the cursor is currently on at a
> > - * column greater than the terminal width. The line will be scrolled right or
> > - * left to let the user see where the cursor is.
> > + * updext: update the extended line which the cursor is currently on
> > + * at a column greater than the terminal width. The line will be
> > + * scrolled right or left to let the user see where the cursor
> > + * is. curcol may need to be adjusted, depending on how wide
> > + * characters and lbound interact, that adjusted position is returned.
> >   */
> > -void
> > +int
> >  updext(int currow, int curcol)
> >  {
> > - struct line *lp; /* pointer to current line */
> > - int j; /* index into line */
> > + struct line *lp = curwp->w_dotp; /* pointer to current line */
> > + size_t startbyte;
> > + int bettercol = curcol;
> > + size_t fullextent;
> >  
> >   if (ncol < 2)
> > - return;
> > + return curcol;
> >  
> >   /*
> > - * calculate what column the left bound should be
> > - * (force cursor into middle half of screen)
> > + * calculate what column the left bound should be (force
> > + * cursor into middle half of screen). Ensuring that it is at
> > + * a tabstop allows update() to calculate curcol without
> > + * wondering how tabstops are calculated before the first '$'.
> >   */
> >   lbound = curcol - (curcol % (ncol >> 1)) - (ncol >> 2);
> > + lbound = (lbound | 0x07) + 1;
> > + vscreen[currow]->v_text[0] = '$';
> > + vtmove(currow, 1);
> > + startbyte = getbyteofcol(lp, 0, lbound + 1);
> > + fullextent = getbyteofcol(lp, startbyte, ncol + 1);
>
> This might be a good place for some error checking and falling back
> to displaying the line with invalid characeters in an escaped form
> if necessary.

As above: mg already displays invalid characters. I don't disagree
with you on placement of error checking, but I don't want to touch
mg's control flow logic any more than I have to (I was already quite
uncomfortable changing this to return an int).

> > + vtputs(lp->l_text + startbyte, fullextent - startbyte, 1);
> > + vteeol();
> >  
> > - /*
> > - * scan through the line outputing characters to the virtual screen
> > - * once we reach the left edge
> > - */
> > - vtmove(currow, -lbound); /* start scanning offscreen */
> > - lp = curwp->w_dotp; /* line to output */
> > - for (j = 0; j < llength(lp); ++j) /* until the end-of-line */
> > - vtpute(lgetc(lp, j));
> > - vteeol(); /* truncate the virtual line */
> > - vscreen[currow]->v_text[0] = '$'; /* and put a '$' in column 1 */
> > + bettercol = lbound + getcolofbyte(lp, startbyte, 1, curwp->w_doto);
> > + return (bettercol);
> >  }
> >  
> >  /*
> > @@ -708,12 +657,35 @@ updext(int currow, int curcol)
> >  void
> >  uline(int row, struct video *vvp, struct video *pvp)
> >  {
> > - char  *cp1;
> > - char  *cp2;
> > - char  *cp3;
> > - char  *cp4;
> > - char  *cp5;
> > + char  *cp1; /* Pointer to the start of dirty region */
> > + char  *cp2; /* pvp's counterpart for cp1 */
> > + char  *cp3; /* Pointer to end of dirty region */
> > + char  *cp4; /* pvp's counterpart for cp3 */
> > + char  *cp5; /* After this, within dirty region, all is ' ' */
> > + char  *mbcounter;
> >   int    nbflag;
> > + int    startcol; /* onscreen column matching cp1 */
> > + char  *lastbyte; /* byte which handles last onscreen column  */
> > + int    seencols = 0;
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > +
> > + lastbyte = vvp->v_text;
> > + while (seencols < ncol && *lastbyte) {
> > + size_t consumed = mbrtowc(&wc, lastbyte,
> > +    (vvp->v_text + ncol * MB_CUR_MAX - lastbyte), &mbs);
> > + if (consumed < (size_t) -2) {
>
> Again, the error checking as it is done here isn't idiomatic.
> See above.
>
> > + lastbyte += consumed;
> > + seencols += wcwidth(wc);
>
> wcwidth() could return -1 here.

I'm pretty sure that would require a raw control character in
vvp->v_text, and that vvp->v_text is the nice, "escaped" form (i.e.
it would contain '^' and '@' for a NUL). That said, it's pretty
easy case to handle, so I'll handle it.

> > + } else {
> > + lastbyte++;
> > + seencols++;
> > + memset(&mbs, 0, sizeof mbs);
> > + }
> > + }
> > + if (lastbyte - vvp->v_text < ncol) {
> > + lastbyte = &vvp->v_text[ncol];
> > + }
> >  
> >   if (vvp->v_color != pvp->v_color) { /* Wrong color, do a */
> >   ttmove(row, 0); /* full redraw. */
> > @@ -729,11 +701,12 @@ uline(int row, struct video *vvp, struct
> >   * putting the invisible glitch character on the next line.
> >   * (Hazeltine executive 80 model 30)
> >   */
> > - cp2 = &vvp->v_text[ncol - (magic_cookie_glitch >= 0 ?
> > -    (magic_cookie_glitch != 0 ? magic_cookie_glitch : 1) : 0)];
> > + cp2 = lastbyte -
> > +    (magic_cookie_glitch >= 0 ? (magic_cookie_glitch != 0 ?
> > +    magic_cookie_glitch : 1) : 0);
> >  #else
> >   cp1 = &vvp->v_text[0];
> > - cp2 = &vvp->v_text[ncol];
> > + cp2 = lastbyte;
> >  #endif
> >   while (cp1 != cp2) {
> >   ttputc(*cp1++);
> > @@ -744,21 +717,31 @@ uline(int row, struct video *vvp, struct
> >   }
> >   cp1 = &vvp->v_text[0]; /* Compute left match. */
> >   cp2 = &pvp->v_text[0];
> > - while (cp1 != &vvp->v_text[ncol] && cp1[0] == cp2[0]) {
> > + while (cp1 != lastbyte && cp1[0] == cp2[0]) {
> >   ++cp1;
> >   ++cp2;
> >   }
> > - if (cp1 == &vvp->v_text[ncol]) /* All equal. */
> > + if (cp1 == lastbyte) /* All equal. */
> >   return;
> > + while (cp1 != vvp->v_text && !isprint(*cp1) &&
> > +       mbrtowc(&wc, cp1, (lastbyte - cp1), &mbs) >= (size_t) -2) {
>
> NEVER call mbrtowc in a loop header like that.
> Call it inside the loop body and handle errors.

I strongly disagree on placing mbrtowc inside the loop body. The
condition IS "while there's a multibyte error at this point in the
text". Blindly following a rule for never putting certain functions
in a loop header would be like saying "Never use i++ at the end of
for statements. Always increment explicitly and check for overflow".

...that said, since I'm already changing all comparisions to (size_t)
-2, the loop header would be quite unreadable if I didn't pull
things inside the loop body. So I'll move it, but I'm not compromising
my principles on it. :)

> > + --cp1;
> > + --cp2;
> > + }
> >   nbflag = FALSE;
> > - cp3 = &vvp->v_text[ncol]; /* Compute right match.  */
> > - cp4 = &pvp->v_text[ncol];
> > + cp3 = lastbyte; /* Compute right match.  */
> > + cp4 = &pvp->v_text[lastbyte - vvp->v_text];
> >   while (cp3[-1] == cp4[-1]) {
> >   --cp3;
> >   --cp4;
> >   if (cp3[0] != ' ') /* Note non-blanks in */
> >   nbflag = TRUE; /* the right match. */
> >   }
> > + while (cp3 != lastbyte && !isprint(*cp3) &&
> > +        mbrtowc(&wc, cp3, (lastbyte - cp3), &mbs) >= (size_t) -2) {
>
> Same.
>
> > + ++cp3;
> > + ++cp4;
> > + }
> >   cp5 = cp3; /* Is erase good? */
> >   if (nbflag == FALSE && vvp->v_color == CTEXT) {
> >   while (cp5 != cp1 && cp5[-1] == ' ')
> > @@ -768,13 +751,27 @@ uline(int row, struct video *vvp, struct
> >   cp5 = cp3;
> >   }
> >   /* Alcyon hack */
> > - ttmove(row, (int) (cp1 - &vvp->v_text[0]));
> > + startcol = 0;
> > + mbcounter = vvp->v_text;
> > + while ((cp1 - mbcounter) > 0) {
> > + size_t consumed = mbrtowc(&wc, mbcounter, (cp1 - mbcounter),
> > +                          &mbs);
> > + if (consumed < (size_t) -2) {
>
> Again, please use a better idiom.
>
> > + mbcounter += consumed;
> > + startcol += wcwidth(wc);
>
> wcwidth() could return -1.

As above, I don't think it can: everything in vvp should be nice,
printable characters. This section of code is already double-checking
things that should never happen, though, and this is an easy condition
to add, so I'll do so.

> > + } else {
> > + mbcounter++;
> > + startcol++;
> > + memset(&mbs, 0, sizeof mbs);
> > + }
> > + }
> > + ttmove(row, startcol);
> >  #ifdef STANDOUT_GLITCH
> >   if (vvp->v_color != CTEXT && magic_cookie_glitch > 0) {
> >   if (cp1 < &vvp->v_text[magic_cookie_glitch])
> >   cp1 = &vvp->v_text[magic_cookie_glitch];
> > - if (cp5 > &vvp->v_text[ncol - magic_cookie_glitch])
> > - cp5 = &vvp->v_text[ncol - magic_cookie_glitch];
> > + if (cp5 > lastbyte - magic_cookie_glitch)
> > + cp5 = lastbyte - magic_cookie_glitch;
> >   } else if (magic_cookie_glitch < 0)
> >  #endif
> >   ttcolor(vvp->v_color);
> > @@ -807,46 +804,39 @@ modeline(struct mgwin *wp, int modelinec
> >   vscreen[n]->v_flag |= (VFCHG | VFHBAD); /* Recompute, display. */
> >   vtmove(n, 0); /* Seek to right line. */
> >   bp = wp->w_bufp;
> > - vtputc('-');
> > - vtputc('-');
> > + n = vtputs("--", 0, 0);
> >   if ((bp->b_flag & BFREADONLY) != 0) {
> > - vtputc('%');
> > + n += vtputs("%", 0, n);
> >   if ((bp->b_flag & BFCHG) != 0)
> > - vtputc('*');
> > + n += vtputs("*", 0, n);
> >   else
> > - vtputc('%');
> > + n += vtputs("%", 0, n);
> >   } else if ((bp->b_flag & BFCHG) != 0) { /* "*" if changed. */
> > - vtputc('*');
> > - vtputc('*');
> > + n += vtputs("**", 0, n);
> >   } else {
> > - vtputc('-');
> > - vtputc('-');
> > + n += vtputs("--", 0, n);
> >   }
> > - vtputc('-');
> > + n += vtputs("-", 0, n);
> >   n = 5;
>
> n is being reset to 5 here, overwriting what vtputs() returned.
> That looks wrong.

I vaguely remember having the same feeling about that "n = 5" back
with the vtputc interface, but I can't find any justification now
for why I kept it. I'll delete it.

> > - n += vtputs("Mg: ");
> > + n += vtputs("Mg: ", 0, n);
> >   if (bp->b_bname[0] != '\0')
> > - n += vtputs(&(bp->b_bname[0]));
> > + n += vtputs(&(bp->b_bname[0]), 0, n);
> >   while (n < 42) { /* Pad out with blanks. */
> > - vtputc(' ');
> > - ++n;
> > + n += vtputs(" ", 0, n);
> >   }
> > - vtputc('(');
> > - ++n;
> > + n += vtputs("(", 0, n);
> >   for (md = 0; ; ) {
> > - n += vtputs(bp->b_modes[md]->p_name);
> > + n += vtputs(bp->b_modes[md]->p_name, 0, n);
> >   if (++md > bp->b_nmodes)
> >   break;
> > - vtputc('-');
> > - ++n;
> > + n += vtputs("-", 0, n);
> >   }
> >   /* XXX These should eventually move to a real mode */
> >   if (macrodef == TRUE)
> > - n += vtputs("-def");
> > + n += vtputs("-def", 0, n);
> >   if (globalwd == TRUE)
> > - n += vtputs("-gwd");
> > - vtputc(')');
> > - ++n;
> > + n += vtputs("-gwd", 0, n);
> > + n += vtputs(")", 0, n);
> >  
> >   if (linenos && colnos)
> >   len = snprintf(sl, sizeof(sl), "--L%d--C%d", wp->w_dotline,
> > @@ -856,27 +846,132 @@ modeline(struct mgwin *wp, int modelinec
> >   else if (colnos)
> >   len = snprintf(sl, sizeof(sl), "--C%d", getcolpos(wp));
> >   if ((linenos || colnos) && len < sizeof(sl) && len != -1)
> > - n += vtputs(sl);
> > + n += vtputs(sl, 0, n);
> >  
> >   while (n < ncol) { /* Pad out. */
> > - vtputc('-');
> > - ++n;
> > + n += vtputs("-", 0, n);
> >   }
> >  }
> >  
> >  /*
> > - * Output a string to the mode line, report how long it was.
> > + * Output a string to the mode line, report how long it was,
> > + * dealing with long lines and the display of unprintable
> > + * things like control characters. Also expand tabs every 8
> > + * columns. This code only puts printing characters into
> > + * the virtual display image. Special care must be taken when
> > + * expanding tabs. On a screen whose width is not a multiple
> > + * of 8, it is possible for the virtual cursor to hit the
> > + * right margin before the next tab stop is reached. This
> > + * makes the tab code loop if you are not careful.
> > + * Three guesses how we found this.
> >   */
> >  int
> > -vtputs(const char *s)
> > +vtputs(const char *s, const size_t max_bytes, const size_t initial_col)
> >  {
> > - int n = 0;
> > + const unsigned char *us = (const unsigned char *) s;
> > + struct video *vp = vscreen[vtrow];
> > + size_t bytes_handled = 0;
> > + size_t last_full_byte_start = vtcol;
> > + size_t space_printed = 0;
> > +
> > + if (!s) {
> > + return (0);
> > + }
> > +
> > + while (*us && (!max_bytes || bytes_handled < max_bytes)) {
> > + if (space_printed + initial_col >= ncol) {
> > + break;
> > + } else if (*us == '\t'
> > +#ifdef NOTAB
> > +           && !(curbp->b_flag & BFNOTAB)
> > +#endif
> > +           ) {
> > + last_full_byte_start = vtcol;
> > + do {
> > + if (vtcol >= 0) {
> > + last_full_byte_start = vtcol;
> > + vp->v_text[vtcol] = ' ';
> > + }
> > + vtcol++;
> > + space_printed++;
> > + } while (space_printed + initial_col < ncol &&
> > +         ((space_printed + initial_col) & 0x07));
> > + us++;
> > + bytes_handled++;
> > + } else if (ISCTRL(*us)) {
>
> Won't this interpret multi-byte characters and only look at the first byte?
> It seems you'll need some mbtrtowc action in this function, too.

ISCTRL doesn't fire on the first byte of multibyte UTF-8, so control
drops down to the multibyte-handling bit. That part of the code
eats an entire multibyte character at a time, so ISCTRL never gets
a chance to check trailing bytes of a multibyte.

As an aside, I've had to replace these ctype-style macros with
normal iscntrl/iswcntrl-style ones. One of them, ISWORD, was having
exactly the problem you predict (spuriously recognizing bits of
multibyte text as non-words), which was partially causing the
movement bug that some have reported.

> > + last_full_byte_start = vtcol;
> > + if (vtcol >= 0) {
> > + vp->v_text[vtcol] = '^';
> > + }
> > + vtcol++;
> > + if (vtcol >= 0) {
> > + vp->v_text[vtcol] = CCHR(*us);
> > + }
> > + vtcol++;
> > + bytes_handled++;
> > + space_printed += 2;
> > + us++;
> > + } else if (isprint(*us)) {
>
> Same problem.

Same solution :)

> > + last_full_byte_start = vtcol;
> > + if (vtcol >= 0) {
> > + vp->v_text[vtcol] = *us++;
> > + }
> > + vtcol++;
> > + bytes_handled++;
> > + space_printed++;
> > + } else if (!(curbp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumable = max_bytes ?
> > +        (max_bytes - bytes_handled) : -1;
> > + size_t consumed = mbrtowc(&wc, (const char *)us,
> > +                          consumable, &mbs);
> > + int width = -1;
> > + last_full_byte_start = vtcol;
> > + if (consumed < (size_t) -2) {
>
> Same mbrtowc return value handling problem.
>
> > + width = wcwidth(wc);
> > + }
>
> Check width == -1 here?
>
> > + if (width >= 0) {
> > + bytes_handled += consumed;
> > + space_printed += width;
> > + do {
> > + if (vtcol >= 0) {
> > + vp->v_text[vtcol] = *us++;
> > + }
> > + vtcol++;
> > + } while (--consumed);
> > + } else {
>
> Cause otherwise this block runs with a non-printable multi-byte
> character. Is that what you want?

Yes, it is (and it will trigger this block for all following bytes
in the character as well, since they'll fail isprint() and iscntrl()).

> > + char bf[5];
> > + snprintf(bf, sizeof(bf), "\\%o", *us);
> > + bytes_handled++;
> > + space_printed += vtputs(bf, 0,
> > +    space_printed + initial_col);
> > + us++;
> > + }
> > + } else {
> > + char bf[5];
> > + last_full_byte_start = vtcol;
> > + snprintf(bf, sizeof(bf), "\\%o", *us);
> > + bytes_handled++;
> > + space_printed += vtputs(bf, 0,
> > +    space_printed + initial_col);
> > + us++;
> > + }
> > + }
> >  
> > - while (*s != '\0') {
> > - vtputc(*s++);
> > - ++n;
> > + if ((space_printed + initial_col > ncol) ||
> > +    (space_printed + initial_col == ncol &&
> > +    (*us && (!max_bytes || bytes_handled < max_bytes)))) {
> > + vp->v_text[last_full_byte_start] = '$';
> > + while (++last_full_byte_start <= vtcol) {
> > + vp->v_text[last_full_byte_start] = ' ';
> > + }
> > + bytes_handled++;
> > + space_printed++;
> > + us++;
> >   }
> > - return (n);
> > +
> > + return (space_printed);
> >  }
> >  
> >  /*
> > @@ -894,11 +989,11 @@ hash(struct video *vp)
> >   char   *s;
> >  
> >   if ((vp->v_flag & VFHBAD) != 0) { /* Hash bad. */
> > - s = &vp->v_text[ncol - 1];
> > - for (i = ncol; i != 0; --i, --s)
> > + s = &vp->v_text[ncol * MB_CUR_MAX - 1];
> > + for (i = ncol * MB_CUR_MAX; i != 0; --i, --s)
> >   if (*s != ' ')
> >   break;
> > - n = ncol - i; /* Erase cheaper? */
> > + n = ncol * MB_CUR_MAX - i; /* Erase cheaper? */
> >   if (n > tceeol)
> >   n = tceeol;
> >   vp->v_cost = i + n; /* Bytes + blanks. */
> > Index: echo.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/echo.c,v
> > retrieving revision 1.66
> > diff -u -p -u -p -r1.66 echo.c
> > --- echo.c 24 Oct 2016 17:18:42 -0000 1.66
> > +++ echo.c 28 May 2018 18:08:10 -0000
>
> It looks like the changes made to this file could be split off into
> a separate diff that could be reviewed ahead of any changes for UTF-8.

I think this actually was, originally (this patch is >2 years old).
I originally developed this as a somewhat incremental series of
local patches, but when I submitted it to hboetes and/or upstream
(I forget which, maybe both), I had to format it for CVS-style
patching, at which point I think I blobbed it. Then hboetes put it
in a branch on Github, and after a year or so I lost my local
history.

Between now and then, a couple upstream updates have been merged
in, so you're looking at a blob of a blob of a blob of a series
originally applied to the portable version. :)

> > @@ -844,9 +844,11 @@ ewprintf(const char *fmt, ...)
> >   * %k prints the name of the current key (and takes no arguments).
> >   * %d prints a decimal integer
> >   * %o prints an octal integer
> > + * %x prints a hexadecimal integer
> >   * %p prints a pointer
> >   * %s prints a string
> >   * %ld prints a long word
> > + * %lx prints a hexadecimal long word
> >   * Anything else is echoed verbatim
> >   */
> >  static void
> > @@ -885,6 +887,10 @@ eformat(const char *fp, va_list ap)
> >   eputi(va_arg(ap, int), 8);
> >   break;
> >  
> > + case 'x':
> > + eputi(va_arg(ap, int), 16);
> > + break;
> > +
> >   case 'p':
> >   snprintf(tmp, sizeof(tmp), "%p",
> >      va_arg(ap, void *));
> > @@ -902,6 +908,9 @@ eformat(const char *fp, va_list ap)
> >   case 'd':
> >   eputl(va_arg(ap, long), 10);
> >   break;
> > + case 'x':
> > + eputl(va_arg(ap, long), 16);
> > + break;
> >   default:
> >   eputc(c);
> >   break;
> > @@ -939,6 +948,7 @@ static void
> >  eputl(long l, int r)
> >  {
> >   long q;
> > + int c;
> >  
> >   if (l < 0) {
> >   eputc('-');
> > @@ -946,7 +956,10 @@ eputl(long l, int r)
> >   }
> >   if ((q = l / r) != 0)
> >   eputl(q, r);
> > - eputc((int)(l % r) + '0');
> > + c = (int)(l % r) + '0';
> > + if (c > '9')
> > + c += 'a' - '9' - 1;
> > + eputc(c);
> >  }
> >  
> >  /*
> > Index: fileio.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/fileio.c,v
> > retrieving revision 1.105
> > diff -u -p -u -p -r1.105 fileio.c
> > --- fileio.c 13 Apr 2018 14:11:37 -0000 1.105
> > +++ fileio.c 28 May 2018 18:08:10 -0000
> > @@ -1,4 +1,4 @@
> > -/* $OpenBSD: fileio.c,v 1.105 2018/04/13 14:11:37 florian Exp $ */
> > +/* $OpenBSD: fileio.c,v 1.104 2017/05/30 07:05:22 florian Exp $ */
> >  
> >  /* This file is in the public domain. */
> >  
>
> What are the changes in this file for? Are they unrelated changes?
> Looks like they're backing out the r1.105 commit? Why?

I honestly don't know. Perhaps Leonid made this diff while my branch
was behind the merges from upstream? The file I'm looking at right
now, on my local branch, appears to match the 1.105 version, so
whatever this is, it's gone now.

> > @@ -723,12 +722,15 @@ expandtilde(const char *fn)
> >   return (NULL);
> >   return(ret);
> >   }
> > - if (ulen == 0) /* ~/ or ~ */
> > - pw = getpwuid(geteuid());
> > - else { /* ~user/ or ~user */
> > + pw = getpwuid(geteuid());
> > + if (ulen == 0) { /* ~/ or ~ */
> > + if (pw != NULL)
> > + (void)strlcpy(user, pw->pw_name, sizeof(user));
> > + else
> > + user[0] = '\0';
> > + } else { /* ~user/ or ~user */
> >   memcpy(user, &fn[1], ulen);
> >   user[ulen] = '\0';
> > - pw = getpwnam(user);
> >   }
> >   if (pw != NULL) {
> >   plen = strlcpy(path, pw->pw_dir, sizeof(path));
> > Index: funmap.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/funmap.c,v
> > retrieving revision 1.53
> > diff -u -p -u -p -r1.53 funmap.c
> > --- funmap.c 14 Apr 2016 17:05:32 -0000 1.53
> > +++ funmap.c 28 May 2018 18:08:10 -0000
> > @@ -114,6 +114,7 @@ static struct funmap functnames[] = {
> >   {bufferinsert, "insert-buffer",},
> >   {fileinsert, "insert-file",},
> >   {fillword, "insert-with-wrap",},
> > + {insert_char, "insert-char",},
> >   {backisearch, "isearch-backward",},
> >   {forwisearch, "isearch-forward",},
> >   {joinline, "join-line",},
> > @@ -191,6 +192,7 @@ static struct funmap functnames[] = {
> >   {shellcommand, "shell-command",},
> >   {piperegion, "shell-command-on-region",},
> >   {shrinkwind, "shrink-window",},
> > + {show_raw_mode, "show-raw-mode",},
> >  #ifdef NOTAB
> >   {space_to_tabstop, "space-to-tabstop",},
> >  #endif /* NOTAB */
> > Index: kbd.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/kbd.c,v
> > retrieving revision 1.30
> > diff -u -p -u -p -r1.30 kbd.c
> > --- kbd.c 26 Sep 2015 21:51:58 -0000 1.30
> > +++ kbd.c 28 May 2018 18:08:10 -0000
> > @@ -9,6 +9,8 @@
> >  #include <sys/queue.h>
> >  #include <signal.h>
> >  #include <stdio.h>
> > +#include <stdlib.h>
> > +#include <wchar.h>
> >  
> >  #include "def.h"
> >  #include "kbd.h"
> > @@ -403,6 +405,43 @@ quote(int f, int n)
> >   ungetkey(c);
> >   }
> >   return (selfinsert(f, n));
> > +}
> > +
> > +/*
> > + * Prompt for a codepoint in whatever the native system's encoding is,
> > + * insert it into the file
> > + */
> > +int
> > +insert_char(int f, int n)
> > +{
> > + char *bufp;
> > + char inpbuf[32];
> > + wchar_t wc;
> > + char mb[MB_CUR_MAX + 1];
>
> The + 1 here is not needed, unless you intend to use mb as
> a NUL terminated string. Which doesn't seem to be the case
> because the loop below is bound by mbslen.

Sounds good. I think I may have wanted to use it as a string at one
point, but later decided not to. I'll remove the +1.

> > + mbstate_t mbs = { 0 };
> > + size_t mbslen;
> > + size_t i;
> > +
> > + if ((bufp = eread("Insert character (hex): ", inpbuf, sizeof inpbuf,
> > +     EFNEW)) == NULL) {
> > + return (ABORT);
> > + } else if (bufp[0] == '\0') {
> > + return (FALSE);
> > + }
> > +
> > + wc = (wchar_t) strtoll(bufp, NULL, 16);
> > + mbslen = wcrtomb(mb, wc, &mbs);
> > + if (mbslen == (size_t) -1) {
> > + return (FALSE);
> > + }
>
> In case you wanted mb to be NUL terminated that would have to be done here.
> wcrtomb(3) will write at most MB_CUR_MAX bytes.

Perhaps this is why I gave up on treating it as a string? That
sounds like something I might have done.

> > +
> > + for (i = 0; i < mbslen; ++i) {
> > + if (linsert(1, mb[i]) == FALSE) {
> > + return (FALSE);
> > + }
> > + }
> > +
> > + return (TRUE);
> >  }
> >  
> >  /*
> > Index: keymap.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/keymap.c,v
> > retrieving revision 1.58
> > diff -u -p -u -p -r1.58 keymap.c
> > --- keymap.c 29 Dec 2015 19:44:32 -0000 1.58
> > +++ keymap.c 28 May 2018 18:08:10 -0000
> > @@ -120,6 +120,21 @@ static struct KEYMAPE (2) cX4map = {
> >   }
> >  };
> >  
> > +static PF cX8J[] = {
> > + insert_char /* ^M */
> > +};
> > +
> > +static struct KEYMAPE (1) cX8map = {
> > + 1,
> > + 1,
> > + rescan,
> > + {
> > + {
> > + CCHR('M'), CCHR('M'), cX8J, NULL
> > + }
> > + }
> > +};
> > +
> >  static PF cXcB[] = {
> >   listbuffers, /* ^B */
> >   quit, /* ^C */
> > @@ -158,6 +173,10 @@ static PF cX0[] = {
> >   NULL /* 4 */
> >  };
> >  
> > +static PF cX8[] = {
> > + NULL /* 4 */
> > +};
> > +
> >  static PF cXeq[] = {
> >   showcpos /* = */
> >  };
> > @@ -189,9 +208,9 @@ static PF cXcar[] = {
> >   undo /* u */
> >  };
> >  
> > -struct KEYMAPE (6) cXmap = {
> > - 6,
> > - 6,
> > +struct KEYMAPE (7) cXmap = {
> > + 7,
> > + 7,
> >   rescan,
> >   {
> >   {
> > @@ -207,6 +226,9 @@ struct KEYMAPE (6) cXmap = {
> >   '0', '4', cX0, (KEYMAP *) & cX4map
> >   },
> >   {
> > + '8', '8', cX8, (KEYMAP *) & cX8map
> > + },
> > + {
> >   '=', '=', cXeq, NULL
> >   },
> >   {
> > @@ -491,6 +513,18 @@ static struct KEYMAPE (1) overwmap = {
> >   }
> >  };
> >  
> > +static struct KEYMAPE (1) rawmap = {
> > + 0,
> > + 1, /* 1 to avoid 0 sized array */
> > + rescan,
> > + {
> > + /* unused dummy entry for VMS C */
> > + {
> > + (KCHAR)0, (KCHAR)0, NULL, NULL
> > + }
> > + }
> > +};
> > +
> >  
> >  /*
> >   * The basic (root) keyboard map
> > @@ -513,6 +547,7 @@ static struct maps_s map_table[] = {
> >   {(KEYMAP *) &notabmap, "notab",},
> >  #endif /* NOTAB */
> >   {(KEYMAP *) &overwmap, "overwrite",},
> > + {(KEYMAP *) &rawmap, "raw",},
> >   {(KEYMAP *) &metamap, "esc prefix",},
> >   {(KEYMAP *) &cXmap, "c-x prefix",},
> >   {(KEYMAP *) &cX4map, "c-x 4 prefix",},
> > Index: mg.1
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/mg.1,v
> > retrieving revision 1.106
> > diff -u -p -u -p -r1.106 mg.1
> > --- mg.1 11 Dec 2017 07:27:07 -0000 1.106
> > +++ mg.1 28 May 2018 18:08:10 -0000
> > @@ -1,7 +1,7 @@
> >  .\" $OpenBSD: mg.1,v 1.106 2017/12/11 07:27:07 jmc Exp $
> >  .\" This file is in the public domain.
> >  .\"
> > -.Dd $Mdocdate: December 11 2017 $
> > +.Dd $Mdocdate: May 28 2018 $
> >  .Dt MG 1
> >  .Os
> >  .Sh NAME
> > @@ -849,7 +849,7 @@ This is the default.
> >  .It set-default-mode
> >  Append the supplied mode to the list of default modes
> >  used by subsequent buffer creation.
> > -Built in modes include: fill, indent and overwrite.
> > +Built in modes include: fill, indent, overwrite, and raw.
> >  .It set-fill-column
> >  Prompt the user for a fill column.
> >  Used by auto-fill-mode.
> > @@ -861,6 +861,8 @@ Sets the prefix string to be used by the
> >  Execute external command from mini-buffer.
> >  .It shell-command-on-region
> >  Provide the text in region to the shell command as input.
> > +.It show-raw-mode
> > +ASCII encoding mode.
> >  .It shrink-window
> >  Shrink current window by one line.
> >  The window immediately below is expanded to pick up the slack.
> > @@ -1091,5 +1093,3 @@ In order to use 8-bit characters (such a
> >  needs to be disabled via the
> >  .Dq meta-key-mode
> >  command.
> > -.Pp
> > -Multi-byte character sets, such as UTF-8, are not supported.
> > Index: modes.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/modes.c,v
> > retrieving revision 1.21
> > diff -u -p -u -p -r1.21 modes.c
> > --- modes.c 30 May 2017 07:05:22 -0000 1.21
> > +++ modes.c 28 May 2018 18:08:10 -0000
> > @@ -111,6 +111,23 @@ overwrite_mode(int f, int n)
> >  }
> >  
> >  int
> > +show_raw_mode(int f, int n)
> > +{
> > + if (changemode(f, n, "raw") == FALSE)
> > + return (FALSE);
> > + if (f & FFARG) {
> > + if (n <= 0)
> > + curbp->b_flag &= ~BFSHOWRAW;
> > + else
> > + curbp->b_flag |= BFSHOWRAW;
> > + } else
> > + curbp->b_flag ^= BFSHOWRAW;
> > +
> > + sgarbf = TRUE;
> > + return (TRUE);
> > +}
> > +
> > +int
> >  set_default_mode(int f, int n)
> >  {
> >   int i;
> > @@ -170,5 +187,11 @@ set_default_mode(int f, int n)
> >   defb_flag |= BFNOTAB;
> >   }
> >  #endif /* NOTAB */
> > + if (strcmp(modebuf, "raw") == 0) {
> > + if (n <= 0)
> > + defb_flag &= ~BFSHOWRAW;
> > + else
> > + defb_flag |= BFSHOWRAW;
> > + }
> >   return (TRUE);
> >  }
> > Index: util.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/mg/util.c,v
> > retrieving revision 1.38
> > diff -u -p -u -p -r1.38 util.c
> > --- util.c 18 Nov 2015 18:21:06 -0000 1.38
> > +++ util.c 28 May 2018 18:08:10 -0000
> > @@ -13,6 +13,9 @@
> >  #include <ctype.h>
> >  #include <signal.h>
> >  #include <stdio.h>
> > +#include <string.h>
> > +#include <stdlib.h>
> > +#include <wchar.h>
> >  
> >  #include "def.h"
> >  
> > @@ -33,6 +36,9 @@ showcpos(int f, int n)
> >   int nline, row;
> >   int cline, cbyte; /* Current line/char/byte */
> >   int ratio;
> > + char ismb = 0;
> > + wchar_t wc = 0;
> > + char mbc[MB_CUR_MAX + 1];
> >  
> >   /* collect the data */
> >   clp = bfirstlp(curbp);
> > @@ -69,8 +75,43 @@ showcpos(int f, int n)
> >   clp = lforw(clp);
> >   }
> >   ratio = nchar ? (100L * cchar) / nchar : 100;
> > - ewprintf("Char: %c (0%o)  point=%ld(%d%%)  line=%d  row=%d  col=%d",
> > -    cbyte, cbyte, cchar, ratio, cline, row, getcolpos(curwp));
> > +
> > + if (!(curbp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + size_t consumed = 0;
> > + size_t offset = 0;
> > + while (cbyte != '\n' && offset <= curwp->w_doto) {
> > + int c = lgetc(clp, curwp->w_doto - offset);
> > + if (isprint(c) || (ISCTRL(c) != FALSE)) {
> > + break;
> > + }
> > + consumed = mbrtowc(&wc,
> > +                   &clp->l_text[curwp->w_doto - offset],
> > +                   llength(clp) - curwp->w_doto + offset,
> > +                   &mbs);
> > + if (consumed < (size_t) -2) {
>
> Again, please use a better mbrtowc() idiom.
>
> > + ismb = (offset < consumed);
> > + snprintf(mbc, consumed + 1, "%s",
> > +         &clp->l_text[curwp->w_doto - offset]);
> > + mbc[consumed + 1] = '\0';
> > + break;
> > + } else {
> > + memset(&mbs, 0, sizeof mbs);
> > + }
> > + offset++;
> > + }
> > + }
> > +
> > + if (ismb) {
> > + ewprintf("Char: %s (codepoint 0x%lx) Byte: %c (0%o)  "
> > +         "point=%ld(%d%%)  line=%d  row=%d col=%d", mbc,
> > +         (long) wc, cbyte, cbyte, cchar, ratio, cline, row,
> > +         getcolpos(curwp));
> > + } else {
> > + ewprintf("Char: %c (0%o)  point=%ld(%d%%)  line=%d  row=%d"
> > +         "col=%d", cbyte, cbyte, cchar, ratio, cline, row,
> > +         getcolpos(curwp));
> > + }
> >   return (TRUE);
> >  }
> >  
> > @@ -96,6 +137,22 @@ getcolpos(struct mgwin *wp)
> >   col += 2;
> >   else if (isprint(c)) {
> >   col++;
> > + } else if (!(wp->w_bufp->b_flag & BFSHOWRAW)) {
> > + mbstate_t mbs = { 0 };
> > + wchar_t wc = 0;
> > + size_t consumed = mbrtowc(&wc, &wp->w_dotp->l_text[i],
> > +                          llength(wp->w_dotp) - i,
> > +                          &mbs);
> > + int width = -1;
> > + if (consumed < (size_t) -2) {
>
> Same.
>
> > + width = wcwidth(wc);
> > + }
>
> width == -1 ?

As in the earlier cases, I want nothing to do with handling multibyte
control characters, so treating them as binary garbage is intended
behavior.

> > + if (width >= 0) {
> > + col += width;
> > + i += (consumed - 1);
> > + } else {
> > + col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> snprintf can return -1
>
> > + }
> >   } else {
> >   col += snprintf(tmp, sizeof(tmp), "\\%o", c);
>
> snprintf can return -1
>
> >   }
>

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

S. Gilles
In reply to this post by Theo de Raadt-2
On 2018-05-30T09:17:22-0600, Theo de Raadt wrote:

> This approach seems misguided.  Let me tell a story.
>
> More than two decades ago, I made a fork of mg which was 100% byte
> clean.  Unfortunately I lost the source code of that change.  mg's data
> buffers are a linked list of lines, with a \n implied by the end of each
> string.  0 bytes are unsupported.  Supporting multibyte risks a string
> merger getting confused by these problems and creating junk.  My fork
> changed mg to embed \n and \0 in the strings, and have the display code
> understand that.  All buffer-search functionalities also learned of this
> change.  The change had to be made quite incrementally and carefully,
> but I recall the end result deleted far more code than it added.  mg became
> 100% 8-bit clean, I could edit binaries with it.
>
> I think trying to shoehorn utf8 in before mg is 8-bit clean is a recipe
> for disaster.  There are too many functions (imagine search-replace)
> which already have nasty special cases for \n, and now will need nasty
> special cases for utf8 and I don't think this will work out nice.

Fair enough. I am decidely not up for reworking all of mg. It seems
better that this patch, as it is, remain off to the side in case
someone wants to use it themselves.

--
S. Gilles

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
In reply to this post by Leonid Bobrov
To make testing easier, I temporarily committed a port to WIP repo:
https://github.com/jasperla/openbsd-wip/tree/master/editors/umg

I won't submit it to ports@, this is experimental and only for testing.

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Stefan Sperling-5
In reply to this post by S. Gilles
On Wed, May 30, 2018 at 04:01:57PM -0400, S. Gilles wrote:

> On 2018-05-30T09:17:22-0600, Theo de Raadt wrote:
> > This approach seems misguided.  Let me tell a story.
> >
> > More than two decades ago, I made a fork of mg which was 100% byte
> > clean.  Unfortunately I lost the source code of that change.  mg's data
> > buffers are a linked list of lines, with a \n implied by the end of each
> > string.  0 bytes are unsupported.  Supporting multibyte risks a string
> > merger getting confused by these problems and creating junk.  My fork
> > changed mg to embed \n and \0 in the strings, and have the display code
> > understand that.  All buffer-search functionalities also learned of this
> > change.  The change had to be made quite incrementally and carefully,
> > but I recall the end result deleted far more code than it added.  mg became
> > 100% 8-bit clean, I could edit binaries with it.
> >
> > I think trying to shoehorn utf8 in before mg is 8-bit clean is a recipe
> > for disaster.  There are too many functions (imagine search-replace)
> > which already have nasty special cases for \n, and now will need nasty
> > special cases for utf8 and I don't think this will work out nice.
>
> Fair enough. I am decidely not up for reworking all of mg. It seems
> better that this patch, as it is, remain off to the side in case
> someone wants to use it themselves.

The feedback you got to your patch may seem like a setback.
But framing it as people asking you to "rework all of mg" by yourself
doesn't really align with what's happening here.

What is really being asked for is work which improves the overall quality
of this editor and makes UTF-8 support easier to add in the long term.
Ideally, this isn't done by just one person, There needs to be a review
process and pooling of expertise held in the minds of various people.

I'll note that someone else (Leonid), and not you, posted this patch of
yours to this list and thereby started a review process which you might
not even want to be part of. And that's fine, you don't need to be here.
I don't know if you were even asked before your patch was posted.

Generally we expect people who post patches to also take care of any
feedback they get for those patches themselves. But Lenoid just funnelled
our feedback back to you, which isn't how this is supposed to work.

Nobody expects anyone to get anything done within any particular
amount of time. If there is no fun in it for you anymore, just stop.
Somebody who cares (or several people who care) enough about mg might
eventually put in the work to make happen what the community wants.

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

Leonid Bobrov
On Fri, Jun 01, 2018 at 11:02:57AM +0200, Stefan Sperling wrote:

> On Wed, May 30, 2018 at 04:01:57PM -0400, S. Gilles wrote:
> > On 2018-05-30T09:17:22-0600, Theo de Raadt wrote:
> > > This approach seems misguided.  Let me tell a story.
> > >
> > > More than two decades ago, I made a fork of mg which was 100% byte
> > > clean.  Unfortunately I lost the source code of that change.  mg's data
> > > buffers are a linked list of lines, with a \n implied by the end of each
> > > string.  0 bytes are unsupported.  Supporting multibyte risks a string
> > > merger getting confused by these problems and creating junk.  My fork
> > > changed mg to embed \n and \0 in the strings, and have the display code
> > > understand that.  All buffer-search functionalities also learned of this
> > > change.  The change had to be made quite incrementally and carefully,
> > > but I recall the end result deleted far more code than it added.  mg became
> > > 100% 8-bit clean, I could edit binaries with it.
> > >
> > > I think trying to shoehorn utf8 in before mg is 8-bit clean is a recipe
> > > for disaster.  There are too many functions (imagine search-replace)
> > > which already have nasty special cases for \n, and now will need nasty
> > > special cases for utf8 and I don't think this will work out nice.
> >
> > Fair enough. I am decidely not up for reworking all of mg. It seems
> > better that this patch, as it is, remain off to the side in case
> > someone wants to use it themselves.
>
> The feedback you got to your patch may seem like a setback.
> But framing it as people asking you to "rework all of mg" by yourself
> doesn't really align with what's happening here.
>
> What is really being asked for is work which improves the overall quality
> of this editor and makes UTF-8 support easier to add in the long term.
> Ideally, this isn't done by just one person, There needs to be a review
> process and pooling of expertise held in the minds of various people.
>
> I'll note that someone else (Leonid), and not you, posted this patch of
> yours to this list and thereby started a review process which you might
> not even want to be part of. And that's fine, you don't need to be here.
> I don't know if you were even asked before your patch was posted.
>

I asked: https://github.com/hboetes/mg/pull/2#issue-190977781

> Generally we expect people who post patches to also take care of any
> feedback they get for those patches themselves. But Lenoid just funnelled
> our feedback back to you, which isn't how this is supposed to work.
>

I am happy to be a tester, this is the best what I can do right now. I
didn't study mg's code, nor did I check the diffs and ask why there
are so many weird things in code. I do run time testing and report run
time bugs. I started this thread because I'm sure I'm not the only one
who needs UTF-8 support and doesn't want to run GNU Emacs.

> Nobody expects anyone to get anything done within any particular
> amount of time. If there is no fun in it for you anymore, just stop.
> Somebody who cares (or several people who care) enough about mg might
> eventually put in the work to make happen what the community wants.
>

Reply | Threaded
Open this post in threaded view
|

Re: [Patch] mg(1): Experimental UTF-8 support

S. Gilles
In reply to this post by Stefan Sperling-5
On 2018-06-01T11:02:57+0200, Stefan Sperling wrote:

> On Wed, May 30, 2018 at 04:01:57PM -0400, S. Gilles wrote:
> > On 2018-05-30T09:17:22-0600, Theo de Raadt wrote:
> > > This approach seems misguided.  Let me tell a story.
> > >
> > > More than two decades ago, I made a fork of mg which was 100% byte
> > > clean.  Unfortunately I lost the source code of that change.  mg's data
> > > buffers are a linked list of lines, with a \n implied by the end of each
> > > string.  0 bytes are unsupported.  Supporting multibyte risks a string
> > > merger getting confused by these problems and creating junk.  My fork
> > > changed mg to embed \n and \0 in the strings, and have the display code
> > > understand that.  All buffer-search functionalities also learned of this
> > > change.  The change had to be made quite incrementally and carefully,
> > > but I recall the end result deleted far more code than it added.  mg became
> > > 100% 8-bit clean, I could edit binaries with it.
> > >
> > > I think trying to shoehorn utf8 in before mg is 8-bit clean is a recipe
> > > for disaster.  There are too many functions (imagine search-replace)
> > > which already have nasty special cases for \n, and now will need nasty
> > > special cases for utf8 and I don't think this will work out nice.
> >
> > Fair enough. I am decidely not up for reworking all of mg. It seems
> > better that this patch, as it is, remain off to the side in case
> > someone wants to use it themselves.
>
> The feedback you got to your patch may seem like a setback.
> But framing it as people asking you to "rework all of mg" by yourself
> doesn't really align with what's happening here.

My apologies -- I wasn't entirely clear in my reply. I know that
nobody's asking me to do rework all of mg (either alone or working
with others), I just wanted to make it clear that it's not in my
plans to do so. If someone with an itch for a better mg wants to
recreate a nice, 8-bit clean version, I don't want to hold them
back with thoughts that their work would be redundant (or, worse,
that I might resent it).

> What is really being asked for is work which improves the overall quality
> of this editor and makes UTF-8 support easier to add in the long term.
> Ideally, this isn't done by just one person, There needs to be a review
> process and pooling of expertise held in the minds of various people.
>
> I'll note that someone else (Leonid), and not you, posted this patch of
> yours to this list and thereby started a review process which you might
> not even want to be part of. And that's fine, you don't need to be here.
> I don't know if you were even asked before your patch was posted.
>
> Generally we expect people who post patches to also take care of any
> feedback they get for those patches themselves. But Lenoid just funnelled
> our feedback back to you, which isn't how this is supposed to work.
>
> Nobody expects anyone to get anything done within any particular
> amount of time. If there is no fun in it for you anymore, just stop.
> Somebody who cares (or several people who care) enough about mg might
> eventually put in the work to make happen what the community wants.

I haven't really thought about this code since I worked on it 2+
years ago. Even then, my motivation was rather more selfish than
improving the overall quality of mg -- I just wanted to write emails
to a guy with "á" in his name and not deal with "\303\241". Back
then, I got what I wanted and made the upstream maintainers aware
that 1) at least one user wanted that functionality, and 2) code
existed, either to take wholesale or use as inspiration. Doing so
carries a lot more weight than a simple "please add X", and is a
pre-emptive answer to the reply "patches welcome". :)

Now, as someone who published a bit of code to the world, I do have
moderate responsibility to consider feedback and to fix easy bugs
with it; I don't begrudge that at all. But I don't have any special
motivation beyond that. That's pretty much it.

--
S. Gilles