[patch] ls + utf-8 support

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[patch] ls + utf-8 support

Martijn van Duren-5
Hello tech@,

My last mail about ksh in vi mode got me thinking about how UTF-8 is
implemented in ls. The question marks, while useful for human readers
have a big downside.

I've come across a fair amount of malformed file names by all sorts of
causes. Be it malware or just human error. When such a malformed
character is in an inconvenient place and can't be auto-completed I
usually fix this by something like de following:
$ cd "`ls | tail -1`"
ksh: cd: /home/martijn/Muziek/Motörhead/N?? Sleep at All - No such file
or directory
$ cd "`/usr/src/bin/ls/ls | tail -1`"
$

My patch maintains the the question marks when stdout is a tty, but
returns the original byte otherwise. Afaik the only logical use for the
length is when doing formatted output, which is only when printing to a tty.

This doesn't solve the case when ls is run over ssh -t and the content
is redirected client-side, but you can't win them all.

Sincerely,

Martijn van Duren

ls_utf8.diff (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Ingo Schwarze
Hi Martijn,

Martijn van Duren wrote on Sun, Jan 17, 2016 at 12:58:38PM +0100:

> I've come across a fair amount of malformed file names by all sorts
> of causes.  Be it malware or just human error.  When such a malformed
> character is in an inconvenient place and can't be auto-completed
> I usually fix this by something like the following:
>
> $ cd "`ls | tail -1`"
> ksh: cd: /home/martijn/Muziek/Mot????rhead/N?? Sleep at All - No such file
> or directory
> $ cd "`/usr/src/bin/ls/ls | tail -1`"

Why not just

 $ cd N*Sleep*

That seems simpler to me.
Sure, if you have more than one file in the same directory showing
this problem, and the names are too similar to be independently
globbed, and you want to keep them, it's a bit more work:

 $ i=1; for f in N*Sleep*; do mv $f tmp$i; i=$((i+1)); done

I don't see a real reason to use ls(1) in such situations.

> My patch maintains the question marks when stdout is a tty, but returns
> the original byte otherwise.  Afaik the only logical use for the length
> is when doing formatted output, which is only when printing to a tty.

The rationale for weeding out invalid bytes and non-printable
characters is not that we don't know how many display columns the
terminal might use to represent them.  The rationale is that they
might cause the terminal to change state, to interpret them as
in-band control codes.

> This doesn't solve the case when ls is run over ssh -t and the content
> is redirected client-side, but you can't win them all.

Indeed.  I worry that might result in security violations.

The old ls(1) also weeded out non-printable bytes, in particular
control codes.

Even though this question is only tangentially related to UTF-8:
Do we want to change that?  Should ls(1) sometimes pass through
bytes that might be control codes for some terminals?  Imposing
the responsibility that non-isatty(3) ls(1) output never ends up
on a terminal on the user?

I tend to think "no".  I wouldn't want my ls(1) to sometimes generate
output that isn't safe on a terminal.  I doubt i would get into the
habit of considering whether or not ls(1) is safe to run in a given
situation.  I'd rather have it just be safe, always.  And deal with
the occasional insane file name with different tools.

What do people think?
  Ingo


> Index: ls.c
> ===================================================================
> RCS file: /cvs/src/bin/ls/ls.c,v
> retrieving revision 1.44
> diff -u -p -r1.44 ls.c
> --- ls.c 1 Dec 2015 18:36:13 -0000 1.44
> +++ ls.c 17 Jan 2016 10:57:03 -0000
> @@ -94,6 +94,7 @@ int f_type; /* add type character for
>  int f_typedir; /* add type character for directories */
>  
>  int rval;
> +int istty = 0;
>  
>  int
>  ls_main(int argc, char *argv[])
> @@ -110,6 +111,7 @@ ls_main(int argc, char *argv[])
>  
>   /* Terminal defaults to -Cq, non-terminal defaults to -1. */
>   if (isatty(STDOUT_FILENO)) {
> + istty = 1;
>   if ((p = getenv("COLUMNS")) != NULL)
>   width = strtonum(p, 1, INT_MAX, NULL);
>   if (width == 0 &&
> Index: utf8.c
> ===================================================================
> RCS file: /cvs/src/bin/ls/utf8.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 utf8.c
> --- utf8.c 1 Dec 2015 18:36:13 -0000 1.1
> +++ utf8.c 17 Jan 2016 10:57:03 -0000
> @@ -21,6 +21,8 @@
>  #include <stdlib.h>
>  #include <wchar.h>
>  
> +extern int istty;
> +
>  int
>  mbsprint(const char *mbs, int print)
>  {
> @@ -33,12 +35,12 @@ mbsprint(const char *mbs, int print)
>   if ((len = mbtowc(&wc, mbs, MB_CUR_MAX)) == -1) {
>   (void)mbtowc(NULL, NULL, MB_CUR_MAX);
>   if (print)
> - putchar('?');
> + putchar(istty ? '?' : *mbs);
>   total_width++;
>   len = 1;
>   } else if ((width = wcwidth(wc)) == -1) {
>   if (print)
> - putchar('?');
> + putchar(istty ? '?' : *mbs);
>   total_width++;
>   } else {
>   if (print)

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Martin Natano
In reply to this post by Martijn van Duren-5
I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless
of whether the output is to a terminal or some other file.

cheers,
natano

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Ted Unangst-6
In reply to this post by Ingo Schwarze
Ingo Schwarze wrote:
> The old ls(1) also weeded out non-printable bytes, in particular
> control codes.

The old ls only had this behavior for terminals however. Redirecting to a file
or pipe would always output the original bytes.

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Stuart Henderson-6
On 2016/01/17 14:29, Ted Unangst wrote:
> Ingo Schwarze wrote:
> > The old ls(1) also weeded out non-printable bytes, in particular
> > control codes.
>
> The old ls only had this behavior for terminals however. Redirecting to a file
> or pipe would always output the original bytes.

I've used this a few times in the past, for example "ls | hexdump -C"
or .."| vis", to find out what the characters used in some filename are.
I'd find it surprising for this to not work.

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Ingo Schwarze
Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +0000:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or .."| vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.  What we currently have in the tree is broken in that respect,
i broke it, including the -q option.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
    - LC_CTYPE=C on a tty or with -q: does '?', ok
    - LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok
    - LC_CTYPE=C neither tty nor -q: does '?', wrong
    - LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong

The following patch fixes the last two cases.
It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of the
    codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Ingo Schwarze
In reply to this post by Martijn van Duren-5
And now with the patch...

----- Forwarded message from Ingo Schwarze <[hidden email]> -----

From: Ingo Schwarze <[hidden email]>
Date: Sun, 17 Jan 2016 21:37:56 +0100
To: Stuart Henderson <[hidden email]>, Ted Unangst <[hidden email]>
Cc: Martijn van Duren <[hidden email]>, [hidden email]
Subject: Re: [patch] ls + utf-8 support

Hi,

Stuart Henderson wrote on Sun, Jan 17, 2016 at 07:46:23PM +0000:
> On 2016/01/17 14:29, Ted Unangst wrote:
>> Ingo Schwarze wrote:

>>> The old ls(1) also weeded out non-printable bytes, in particular
>>> control codes.

>> The old ls only had this behavior for terminals however.
>> Redirecting to a file or pipe would always output the original bytes.

> I've used this a few times in the past, for example "ls | hexdump -C"
> or .."| vis", to find out what the characters used in some filename are.
> I'd find it surprising for this to not work.

Oops.  What we currently have in the tree is broken in that respect,
i broke it, including the -q option.

Current behaviour is:

 * SMALL: fully works, but no UTF-8 support
 * not SMALL:
    - LC_CTYPE=C on a tty or with -q: does '?', ok
    - LC_CTYPE=en_US.UTF-8 on a tty or with -q: does '?', ok
    - LC_CTYPE=C neither tty nor -q: does '?', wrong
    - LC_CTYPE=en_US.UTF-8 neither tty nor -q: does '?', wrong

The following patch fixes the last two cases.
It is similar in spirit to what Martijn originally sent,
but fixes two issues with his patch:

 1) Do not invent a new global variable, use the existing f_nonprint.
 2) For valid, but non-printable codepoints, print all bytes of the
    codepoint's encoding rather than just the first byte.

Should i commit this?

Yours,
  Ingo

----- End forwarded message -----

Index: utf8.c
===================================================================
RCS file: /cvs/src/bin/ls/utf8.c,v
retrieving revision 1.1
diff -u -p -r1.1 utf8.c
--- utf8.c 1 Dec 2015 18:36:13 -0000 1.1
+++ utf8.c 17 Jan 2016 20:13:51 -0000
@@ -21,6 +21,8 @@
 #include <stdlib.h>
 #include <wchar.h>
 
+extern int f_nonprint;
+
 int
 mbsprint(const char *mbs, int print)
 {
@@ -33,12 +35,16 @@ mbsprint(const char *mbs, int print)
  if ((len = mbtowc(&wc, mbs, MB_CUR_MAX)) == -1) {
  (void)mbtowc(NULL, NULL, MB_CUR_MAX);
  if (print)
- putchar('?');
+ putchar(f_nonprint ? '?' : *mbs);
  total_width++;
  len = 1;
  } else if ((width = wcwidth(wc)) == -1) {
- if (print)
- putchar('?');
+ if (print) {
+ if (f_nonprint)
+ putchar('?');
+ else
+ fwrite(mbs, 1, len, stdout);
+ }
  total_width++;
  } else {
  if (print)

Reply | Threaded
Open this post in threaded view
|

Re: [patch] ls + utf-8 support

Michael McConville-3
In reply to this post by Martin Natano
Martin Natano wrote:
> I agree with Ingo, ls(1) shouldn't generate unsafe output. Regardless
> of whether the output is to a terminal or some other file.

While POSIX is not holy law, doing what you suggest would probably be a
violation. See the description of the -q option:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls.html

That combined with Stuart's mentioned legitimate uses makes me think
that this could cause subtle and unexpected problems. I do see the
danger of allowing raw bytestring redirection, though.