sed: Fix up y command

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

sed: Fix up y command

Martijn van Duren-5
Hello tech@,

Our sed command goes a bit bit bonkers on the y command in a couple of  
different scenarios and was found thanks to the sed.1 diff from kshe.

1) is a posix[0] violation, which states:
If a <backslash> followed by an 'n' appear in string1 or string2, the
two characters shall be handled as a single <newline>.
and
If the delimiter is not 'n', within string1 and string2, the delimiter
itself can be used as a literal character if it is preceded by a
<backslash>.
But our implementation (same as Free, Net, and gsed) replace \n with n
if the delimiter is n.
$ echo a | sed 'ynan\nn'
n
$ # With patch
$ echo a | ./patchedsed 'ynan\nn'


$

2) is a broken delimiter parser:
y needs to take all characters literally (except for <backslash>), but
all the *BSD implementations pass the delimiter parsing through
compile_delimited, which does an extra '[' pair matching, where it
should not do this. The handling of said character is done as expected
once the closing tag is found. gsed does this correct:
$ echo [ | sed 'y/[/a/'
sed: 1: "y/[/a/": unbalanced brackets ([])
$ echo [ | sed 'y/[a]/bcd/'
b
$
$ echo [ | ./patchedsed 'y/[/a/'  
a

3) is inconsistent behaviour between implementations:
According to posix:
The meaning of a <backslash> followed by any character that is not 'n',
a <backslash>, or the delimiter character is undefined.
and
If the number of characters in string1 and string2 are not equal, or if
any of the characters in string1 appear more than once, the results are
undefined.

Right now we generate an error if the strings are of unequal length, but
double characters and unmatched <backslash> treated literal. This also
contradicts our manpage:
Within string1 and string2, a backslash followed by any character other
than a newline is that literal character.

But the code does something different:
$ echo a | sed 'y/ab/\c/'
\
gsed is a little mostly more accurate to our manpage, but awefully
inconsistent:
$ echo a | gsed 'y/a/\c/'
gsed: -e expression #1, char 7: strings for `y' command are different lengths
$ echo a | gsed 'y/a/\d/'  
d

As for the repeating lines we always choose the last matching character,
where gsed chooses the first:
$ echo a | sed 'y/aa/bc/'
c
$ echo a | gsed 'y/aa/bc/'
b

To make portability issues more obvious I've chosen to turn these undefined
scenarios into an error to make these non-obvious issues more visible:
$ echo a | ./patchedsed 'y/ab/\c/'
sed: 1: "y/ab/\c/": Unexpected character after backslash
$ echo a | ./patchedsed 'y/aa/bc/'
sed: 1: "y/aa/bc/": Repeated character in source string

Since this changes functional behaviour and can cause backwards
incompatible changes for people with scripts that live in this unspecified
area: I would like to know if someone is against this change.

The manpage bits were helped by jmc@, but this code makes his head spin,
so extra eyes on that part are welcome as well.

OK?

Sincerely,

martijn@

[0]http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

Index: compile.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/compile.c,v
retrieving revision 1.42
diff -u -r1.42 compile.c
--- compile.c 1 Aug 2017 18:05:53 -0000 1.42
+++ compile.c 7 Dec 2017 14:57:04 -0000
@@ -617,46 +617,62 @@
  * Compile a translation set of strings into a lookup table.
  */
 static char *
-compile_tr(char *p, char **transtab)
+compile_tr(char *old, char **transtab)
 {
  int i;
- char *lt, *op, *np;
- char *old = NULL, *new = NULL;
+ char delimiter, check[UCHAR_MAX];
+ char *new, *end;
 
- if (*p == '\0' || *p == '\\')
- error(COMPILE,
-"transform pattern can not be delimited by newline or backslash");
- old = xmalloc(strlen(p) + 1);
- p = compile_delimited(p, old, 1);
- if (p == NULL) {
- error(COMPILE, "unterminated transform source string");
- goto bad;
- }
- new = xmalloc(strlen(p) + 1);
- p = compile_delimited(--p, new, 1);
- if (p == NULL) {
- error(COMPILE, "unterminated transform target string");
- goto bad;
+ bzero(check, sizeof(check));
+ delimiter = *old++;
+ if (delimiter == '\0' || delimiter == '\\')
+ error(COMPILE, "transform pattern can not be delimited by "
+    "newline or backslash");
+
+ new = old;
+ do {
+ if ((new = strchr(new + 1, delimiter)) == NULL)
+ error(COMPILE, "unterminated transform source string");
+ } while (*(new - 1) == '\\' && *(new -2) != '\\');
+ *new++ = '\0';
+ end = new;
+ do {
+ if ((end = strchr(end + 1, delimiter)) == NULL)
+ error(COMPILE, "unterminated transform target string");
+ } while (*(end -1) == '\\' && *(end -2) != '\\');
+ *end = '\0';
+
+ // We assume characters are 8 bits
+ *transtab = xmalloc(UCHAR_MAX + 1);
+ for (i = 0; i <= UCHAR_MAX; i++)
+ (*transtab)[i] = (char)i;
+
+ while (*old != '\0' && *new != '\0') {
+ if (*old == '\\') {
+ old++;
+ if (*old == 'n')
+ *old = '\n';
+ else if (*old != delimiter && *old != '\\')
+ error(COMPILE, "Unexpected character after "
+    "backslash");
+
+ }
+ if (*new == '\\') {
+ new++;
+ if (*new == 'n')
+ *new = '\n';
+ else if (*new != delimiter && *new != '\\')
+ error(COMPILE, "Unexpected character after "
+    "backslash");
+ }
+ if (check[*old] == 1)
+ error(COMPILE, "Repeated character in source string");
+ check[*old] = 1;
+ (*transtab)[*old++] = *new++;
  }
- EATSPACE();
- if (strlen(new) != strlen(old)) {
+ if (*old != '\0' || *new != '\0')
  error(COMPILE, "transform strings are not the same length");
- goto bad;
- }
- /* We assume characters are 8 bits */
- lt = xmalloc(UCHAR_MAX + 1);
- for (i = 0; i <= UCHAR_MAX; i++)
- lt[i] = (char)i;
- for (op = old, np = new; *op; op++, np++)
- lt[(u_char)*op] = *np;
- *transtab = lt;
- free(old);
- free(new);
- return (p);
-bad:
- free(old);
- free(new);
- return (NULL);
+ return end + 1;
 }
 
 /*
Index: sed.1
===================================================================
RCS file: /cvs/src/usr.bin/sed/sed.1,v
retrieving revision 1.51
diff -u -r1.51 sed.1
--- sed.1 7 Dec 2017 09:52:26 -0000 1.51
+++ sed.1 7 Dec 2017 14:57:04 -0000
@@ -482,14 +482,29 @@
 .Ar string2 .
 Any character other than a backslash or newline can be used instead of
 a slash to delimit the strings.
+.Pp
 Within
 .Ar string1
 and
 .Ar string2 ,
-a backslash followed by any character other than a newline is that literal
-character, and a backslash followed by an
+a backslash followed by another backslash
+is replaced by a single backslash,
+a backslash followed by an
 .Sq n
-is replaced by a newline character.
+is replaced by a newline character,
+and a backslash followed by the delimiting character
+is replaced by that character,
+causing it to be treated literally,
+with the exception of the
+.Sq n
+character,
+which will still be treated like a newline character.
+It is an error for a backslash to not be followed by another backslash,
+.Sq n ,
+or the delimiting character,
+or for
+.Ar string1
+to contain repeating characters,
 .It [0addr] Ns Ic \&: Ns Ar label
 This function does nothing; it bears a
 .Ar label

Reply | Threaded
Open this post in threaded view
|

Re: sed: Fix up y command

Todd C. Miller-2
On Thu, 07 Dec 2017 15:59:16 +0100, Martijn van Duren wrote:

> Our sed command goes a bit bit bonkers on the y command in a couple of  
> different scenarios and was found thanks to the sed.1 diff from kshe.
>
> 1) is a posix[0] violation, which states:
> If a <backslash> followed by an 'n' appear in string1 or string2, the
> two characters shall be handled as a single <newline>.
> and
> If the delimiter is not 'n', within string1 and string2, the delimiter
> itself can be used as a literal character if it is preceded by a
> <backslash>.
> But our implementation (same as Free, Net, and gsed) replace \n with n
> if the delimiter is n.

That is clearly a bug.

> 2) is a broken delimiter parser:
> y needs to take all characters literally (except for <backslash>), but
> all the *BSD implementations pass the delimiter parsing through
> compile_delimited, which does an extra '[' pair matching, where it
> should not do this. The handling of said character is done as expected
> once the closing tag is found. gsed does this correct:
> $ echo [ | sed 'y/[/a/'
> sed: 1: "y/[/a/": unbalanced brackets ([])
> $ echo [ | sed 'y/[a]/bcd/'
> b
> $
> $ echo [ | ./patchedsed 'y/[/a/'  
> a

Same.

> 3) is inconsistent behaviour between implementations:
> According to posix:
> The meaning of a <backslash> followed by any character that is not 'n',
> a <backslash>, or the delimiter character is undefined.
> and
> If the number of characters in string1 and string2 are not equal, or if
> any of the characters in string1 appear more than once, the results are
> undefined.
>
> Right now we generate an error if the strings are of unequal length, but
> double characters and unmatched <backslash> treated literal. This also
> contradicts our manpage:
> Within string1 and string2, a backslash followed by any character other
> than a newline is that literal character.
>
> But the code does something different:
> $ echo a | sed 'y/ab/\c/'
> \
> gsed is a little mostly more accurate to our manpage, but awefully
> inconsistent:
> $ echo a | gsed 'y/a/\c/'
> gsed: -e expression #1, char 7: strings for `y' command are different lengths
> $ echo a | gsed 'y/a/\d/'  
> d

Here I would naively expect the escaped character to be treated as
if no escape was present.  This is what the Solaris sed does.

> As for the repeating lines we always choose the last matching character,
> where gsed chooses the first:
> $ echo a | sed 'y/aa/bc/'
> c
> $ echo a | gsed 'y/aa/bc/'
> b

Solaris sed also chooses the last match.

> To make portability issues more obvious I've chosen to turn these undefined
> scenarios into an error to make these non-obvious issues more visible:
> $ echo a | ./patchedsed 'y/ab/\c/'
> sed: 1: "y/ab/\c/": Unexpected character after backslash
> $ echo a | ./patchedsed 'y/aa/bc/'
> sed: 1: "y/aa/bc/": Repeated character in source string

I'm a bit leary of changing these into errors.  However, these are
undefined behaviors and implementations do vary in how they handle
them, so throwing an error is probably best.  Otherwise we are just
creating a trap for the user.

> Since this changes functional behaviour and can cause backwards
> incompatible changes for people with scripts that live in this unspecified
> area: I would like to know if someone is against this change.
>
> The manpage bits were helped by jmc@, but this code makes his head spin,
> so extra eyes on that part are welcome as well.
>
> OK?
>
> Sincerely,
>
> martijn@
>
> [0]http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html
>
> Index: compile.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/sed/compile.c,v
> retrieving revision 1.42
> diff -u -r1.42 compile.c
> --- compile.c 1 Aug 2017 18:05:53 -0000 1.42
> +++ compile.c 7 Dec 2017 14:57:04 -0000
> @@ -617,46 +617,62 @@
>   * Compile a translation set of strings into a lookup table.
>   */
>  static char *
> -compile_tr(char *p, char **transtab)
> +compile_tr(char *old, char **transtab)
>  {
>   int i;
> - char *lt, *op, *np;
> - char *old = NULL, *new = NULL;
> + char delimiter, check[UCHAR_MAX];
> + char *new, *end;
>  
> - if (*p == '\0' || *p == '\\')
> - error(COMPILE,
> -"transform pattern can not be delimited by newline or backslash");
> - old = xmalloc(strlen(p) + 1);
> - p = compile_delimited(p, old, 1);
> - if (p == NULL) {
> - error(COMPILE, "unterminated transform source string");
> - goto bad;
> - }
> - new = xmalloc(strlen(p) + 1);
> - p = compile_delimited(--p, new, 1);
> - if (p == NULL) {
> - error(COMPILE, "unterminated transform target string");
> - goto bad;
> + bzero(check, sizeof(check));

I'd prefer memset over bzero here.

> + delimiter = *old++;
> + if (delimiter == '\0' || delimiter == '\\')
> + error(COMPILE, "transform pattern can not be delimited by "
> +    "newline or backslash");
> +
> + new = old;
> + do {
> + if ((new = strchr(new + 1, delimiter)) == NULL)
> + error(COMPILE, "unterminated transform source string");
> + } while (*(new - 1) == '\\' && *(new -2) != '\\');

Given "y/", new + 1 will point outside the string (one past the
NUL).  Perhaps add an explicit check for *new == '\0' or just enforce
a minimum length of 3 characters?

> + *new++ = '\0';
> + end = new;
> + do {
> + if ((end = strchr(end + 1, delimiter)) == NULL)
> + error(COMPILE, "unterminated transform target string");
> + } while (*(end -1) == '\\' && *(end -2) != '\\');
> + *end = '\0';

Similar problem here.

> +
> + // We assume characters are 8 bits
> + *transtab = xmalloc(UCHAR_MAX + 1);
> + for (i = 0; i <= UCHAR_MAX; i++)
> + (*transtab)[i] = (char)i;

Can we preserve the old-style C comment for consistency?

> +
> + while (*old != '\0' && *new != '\0') {
> + if (*old == '\\') {
> + old++;
> + if (*old == 'n')
> + *old = '\n';
> + else if (*old != delimiter && *old != '\\')
> + error(COMPILE, "Unexpected character after "
> +    "backslash");
> +
> + }
> + if (*new == '\\') {
> + new++;
> + if (*new == 'n')
> + *new = '\n';
> + else if (*new != delimiter && *new != '\\')
> + error(COMPILE, "Unexpected character after "
> +    "backslash");
> + }
> + if (check[*old] == 1)
> + error(COMPILE, "Repeated character in source string");
> + check[*old] = 1;
> + (*transtab)[*old++] = *new++;
>   }
> - EATSPACE();
> - if (strlen(new) != strlen(old)) {
> + if (*old != '\0' || *new != '\0')
>   error(COMPILE, "transform strings are not the same length");
> - goto bad;
> - }
> - /* We assume characters are 8 bits */
> - lt = xmalloc(UCHAR_MAX + 1);
> - for (i = 0; i <= UCHAR_MAX; i++)
> - lt[i] = (char)i;
> - for (op = old, np = new; *op; op++, np++)
> - lt[(u_char)*op] = *np;
> - *transtab = lt;
> - free(old);
> - free(new);
> - return (p);
> -bad:
> - free(old);
> - free(new);
> - return (NULL);
> + return end + 1;
>  }

I can't see any problems with the rest.  All errors are fatal so
there's no need to return NULL (and returning NULL would have
resulted in a crash anyway).

When this goes in you can also remove the "is_tr" function argument
from compile_delimited() as it is now unused.

The manual change looks good as well.

 - todd

Reply | Threaded
Open this post in threaded view
|

Re: sed: Fix up y command

Martijn van Duren-5
Here's a new diff with the issues addressed.
But instead of a check I fixed the starting offset, which also caused
issues with y///.

OK?

Index: compile.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/compile.c,v
retrieving revision 1.43
diff -u -p -u -r1.43 compile.c
--- compile.c 8 Dec 2017 18:41:59 -0000 1.43
+++ compile.c 11 Dec 2017 09:05:56 -0000
@@ -59,7 +59,7 @@ static struct labhash {
 
 static char *compile_addr(char *, struct s_addr *);
 static char *compile_ccl(char **, char *);
-static char *compile_delimited(char *, char *, int);
+static char *compile_delimited(char *, char *);
 static char *compile_flags(char *, struct s_subst *);
 static char *compile_re(char *, regex_t **);
 static char *compile_subst(char *, struct s_subst *);
@@ -351,7 +351,7 @@ nonsel: /* Now parse the command */
  * with the processed string.
  */
 static char *
-compile_delimited(char *p, char *d, int is_tr)
+compile_delimited(char *p, char *d)
 {
  char c;
 
@@ -376,10 +376,7 @@ compile_delimited(char *p, char *d, int
  p += 2;
  continue;
  } else if (*p == '\\' && p[1] == '\\') {
- if (is_tr)
- p++;
- else
- *d++ = *p++;
+ *d++ = *p++;
  } else if (*p == c) {
  *d = '\0';
  return (p + 1);
@@ -436,7 +433,7 @@ compile_re(char *p, regex_t **repp)
  char *re;
 
  re = xmalloc(strlen(p) + 1); /* strlen(re) <= strlen(p) */
- p = compile_delimited(p, re, 0);
+ p = compile_delimited(p, re);
  if (p && strlen(re) == 0) {
  *repp = NULL;
  free(re);
@@ -603,46 +600,63 @@ compile_flags(char *p, struct s_subst *s
  * Compile a translation set of strings into a lookup table.
  */
 static char *
-compile_tr(char *p, char **transtab)
+compile_tr(char *old, char **transtab)
 {
  int i;
- char *lt, *op, *np;
- char *old = NULL, *new = NULL;
+ char delimiter, check[UCHAR_MAX];
+ char *new, *end;
+
+ memset(check, 0, sizeof(check));
+ delimiter = *old;
+ if (delimiter == '\\')
+ error(COMPILE, "\\ can not be used as a string delimiter");
+ else if (delimiter == '\n' || delimiter == '\0')
+ error(COMPILE, "newline can not be used as a string delimiter");
+
+ new = old++;
+ do {
+ if ((new = strchr(new + 1, delimiter)) == NULL)
+ error(COMPILE, "unterminated transform source string");
+ } while (*(new - 1) == '\\' && *(new -2) != '\\');
+ *new = '\0';
+ end = new++;
+ do {
+ if ((end = strchr(end + 1, delimiter)) == NULL)
+ error(COMPILE, "unterminated transform target string");
+ } while (*(end -1) == '\\' && *(end -2) != '\\');
+ *end = '\0';
 
- if (*p == '\0' || *p == '\\')
- error(COMPILE,
-"transform pattern can not be delimited by newline or backslash");
- old = xmalloc(strlen(p) + 1);
- p = compile_delimited(p, old, 1);
- if (p == NULL) {
- error(COMPILE, "unterminated transform source string");
- goto bad;
- }
- new = xmalloc(strlen(p) + 1);
- p = compile_delimited(--p, new, 1);
- if (p == NULL) {
- error(COMPILE, "unterminated transform target string");
- goto bad;
- }
- EATSPACE();
- if (strlen(new) != strlen(old)) {
- error(COMPILE, "transform strings are not the same length");
- goto bad;
- }
  /* We assume characters are 8 bits */
- lt = xmalloc(UCHAR_MAX + 1);
+ *transtab = xmalloc(UCHAR_MAX + 1);
  for (i = 0; i <= UCHAR_MAX; i++)
- lt[i] = (char)i;
- for (op = old, np = new; *op; op++, np++)
- lt[(u_char)*op] = *np;
- *transtab = lt;
- free(old);
- free(new);
- return (p);
-bad:
- free(old);
- free(new);
- return (NULL);
+ (*transtab)[i] = (char)i;
+
+ while (*old != '\0' && *new != '\0') {
+ if (*old == '\\') {
+ old++;
+ if (*old == 'n')
+ *old = '\n';
+ else if (*old != delimiter && *old != '\\')
+ error(COMPILE, "Unexpected character after "
+    "backslash");
+
+ }
+ if (*new == '\\') {
+ new++;
+ if (*new == 'n')
+ *new = '\n';
+ else if (*new != delimiter && *new != '\\')
+ error(COMPILE, "Unexpected character after "
+    "backslash");
+ }
+ if (check[*old] == 1)
+ error(COMPILE, "Repeated character in source string");
+ check[*old] = 1;
+ (*transtab)[*old++] = *new++;
+ }
+ if (*old != '\0' || *new != '\0')
+ error(COMPILE, "transform strings are not the same length");
+ return end + 1;
 }
 
 /*
Index: sed.1
===================================================================
RCS file: /cvs/src/usr.bin/sed/sed.1,v
retrieving revision 1.52
diff -u -p -u -r1.52 sed.1
--- sed.1 8 Dec 2017 18:41:59 -0000 1.52
+++ sed.1 11 Dec 2017 09:05:56 -0000
@@ -482,14 +482,29 @@ in the pattern space with the correspond
 .Ar string2 .
 Any character other than a backslash or newline can be used instead of
 a slash to delimit the strings.
+.Pp
 Within
 .Ar string1
 and
 .Ar string2 ,
-a backslash followed by any character other than a newline is that literal
-character, and a backslash followed by an
+a backslash followed by another backslash
+is replaced by a single backslash,
+a backslash followed by an
 .Sq n
-is replaced by a newline character.
+is replaced by a newline character,
+and a backslash followed by the delimiting character
+is replaced by that character,
+causing it to be treated literally,
+with the exception of the
+.Sq n
+character,
+which will still be treated like a newline character.
+It is an error for a backslash to not be followed by another backslash,
+.Sq n ,
+or the delimiting character,
+or for
+.Ar string1
+to contain repeating characters,
 .It [0addr] Ns Ic \&: Ns Ar label
 This function does nothing; it bears a
 .Ar label

Reply | Threaded
Open this post in threaded view
|

Re: sed: Fix up y command

Todd C. Miller-2
On Mon, 11 Dec 2017 10:07:50 +0100, Martijn van Duren wrote:

> Here's a new diff with the issues addressed.
> But instead of a check I fixed the starting offset, which also caused
> issues with y///.

Looks good now, OK millert@

 - todd