locale in locate(1)

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

locale in locate(1)

Jan Stary
Does locate(1) need to setlocale(3)?

        Jan

Index: locate/locate.c
===================================================================
RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v
retrieving revision 1.31
diff -u -p -r1.31 locate.c
--- locate/locate.c 19 Nov 2015 21:46:05 -0000 1.31
+++ locate/locate.c 10 Jan 2019 20:34:16 -0000
@@ -73,7 +73,6 @@
 #include <fnmatch.h>
 #include <libgen.h>
 #include <limits.h>
-#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -122,7 +121,6 @@ main(int argc, char *argv[])
 {
  int ch;
  char **dbv = NULL;
- (void) setlocale(LC_ALL, "");
 
  if (pledge("stdio rpath", NULL) == -1)
  err(1, "pledge");

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Scott Cheloha
> On Jan 10, 2019, at 14:35, Jan Stary <[hidden email]> wrote:
>
> Does locate(1) need to setlocale(3)?

locate(1) uses tolower(3), the results of which can be affected by
locale settings.  locate(1)'s database is, at least in theory, portable.
If you don't explicitly set the locale to POSIX/C I think it might be
possible to create a corrupted database, or read a valid database
as corrupted.

But tedu says we have a policy so maybe I am mistaken about this.
The ctype wizard may also have a very elegant explanation clarifying
whether this is correct or not.

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Theo de Raadt-2
Scott Cheloha <[hidden email]> wrote:

> The ctype wizard may also have a very elegant explanation clarifying
> whether this is correct or not.

The diffs received are annoying.

What problem does it solve?

What are the downsides?

Is there any education in the diffs as to why the changes are needed,
so that people not freshly in the game can participate?

No.

Questions are not asked nor answered.

Like, seriously come on.  If you don't know why you are writing a diff
or can't be bothered to articulate why you are writing the diff why are
you sending it?

I suggest we all sit this out until there is another try.

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Theo de Raadt-2
In reply to this post by Jan Stary
> Does locate(1) need to setlocale(3)?

Since you ask the question, yes.

Oh wait, was your question rhetorical?

It lacks facts.

Who's time are you wasting?


> Index: locate/locate.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v
> retrieving revision 1.31
> diff -u -p -r1.31 locate.c
> --- locate/locate.c 19 Nov 2015 21:46:05 -0000 1.31
> +++ locate/locate.c 10 Jan 2019 20:34:16 -0000
> @@ -73,7 +73,6 @@
>  #include <fnmatch.h>
>  #include <libgen.h>
>  #include <limits.h>
> -#include <locale.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
> @@ -122,7 +121,6 @@ main(int argc, char *argv[])
>  {
>   int ch;
>   char **dbv = NULL;
> - (void) setlocale(LC_ALL, "");
>  
>   if (pledge("stdio rpath", NULL) == -1)
>   err(1, "pledge");
>

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Ted Unangst-6
In reply to this post by Scott Cheloha
Scott Cheloha wrote:
> > On Jan 10, 2019, at 14:35, Jan Stary <[hidden email]> wrote:
> >
> > Does locate(1) need to setlocale(3)?
>
> locate(1) uses tolower(3), the results of which can be affected by
> locale settings.  locate(1)'s database is, at least in theory, portable.
> If you don't explicitly set the locale to POSIX/C I think it might be
> possible to create a corrupted database, or read a valid database
> as corrupted.

Speaking of portability...

man locate says:

     The locate database is not byte order independent.  It is not possible to
     share the databases between machines with different byte order.  The
     current locate implementation understands databases in host byte order or
     network byte order.  So a little-endian machine can't understand a locate
     database which was built on a big-endian machine.

That doesn't really make sense. If it understands network byte order, then it
can read big endian databases. And then there's this comment in the source:

 * Convert network byte order to host byte order if necessary.
 * So we can read on FreeBSD/i386 (little endian) a locate database
 * which was built on SunOS/sparc (big endian).

So there's a little mystery to unravel for somebody with multiple CPUs.
Does it work or not?

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Ingo Schwarze
In reply to this post by Scott Cheloha
Hi Scott,

Scott Cheloha wrote on Thu, Jan 10, 2019 at 07:18:43PM -0600:

> locate(1) uses tolower(3), the results of which can be affected by
> locale settings.

In general, yes.  On OpenBSD, not really:

   $ man tolower | grep -A 1 always  
     OpenBSD always uses the C locale for these functions, ignoring the
     global locale, the thread-specific locale, and the locale argument.

That makes sense because:

   $ man tolower | grep -A 1 representable
     The argument c must be EOF or representable as an unsigned char;
     otherwise, the result is undefined.

So locale(1) support for tolower(3) really only makes sense for
single-byte non-ASCII locales (like LATIN-1) and we don't support
any of those.  The wchar_t characters needed for UTF-8 cannot be
represented as unsigned chars.

> locate(1)'s database is, at least in theory, portable.

I guess you are trying to say the right thing...
More precisely:  Filenames are byte strings, *NOT* character strings,
not even multibyte character strings.  Consequently, a filename
does *NOT* have a character encoding.  The locate datebase simply
stores these byte strings.  That it uses its own, home-grown encoding
scheme does not matter because that is not a *character* encoding
but rather some very naive compression method.  Since it does *NOT*
store characters, the locate database format is locale-independent
in the first place.

> If you don't explicitly set the locale to POSIX/C I think it might
> be possible to create a corrupted database,

No.  The database creation tools - updatedb, mklocatedb, bigram,
code - all explicitly ignore the locale.  There is no way to create
a corrupt database.

You will have fun copying a database from one machine to another
machine where sizeof(int) is different, though.  But that's not
corrupt, just incompatible, and unrelated to character encoding.

> or read a valid database as corrupted.

No, that isn't possible either.  Even if the tolower(3) function were
locale-dependent on OpenBSD, the worst thing that could happen would
be partially broken support for the locale(1) -i option.  But it isn't,
so whether you run locate(1) with the ASCII or with the UTF-8 locale
makes no difference.

That said, something else *can* go wrong.  If filenames contain
terminal control codes, running locate(1) can screw up your terminal:
as far as i can see, it doesn't have the protections that ls(1) has,
but that's a separate matter.


A brief look into /usr/src/usr.bin/locale/ makes it obvious that
it's a rathole of messy code requiring some love gently applied
with Bob's whale flensing knife.

For example, the multiple layers of wrappers around tolower(3)
are totally pointless because native tolower(3) already does table
lookup in the first place.

So, here is some cleanup:
 * garbage collect useless hand-rolled lookup tables
 * merge one-line helper functions into callers
 * garbage collect obfuscating macros
 * fix one format string error (%d used for u_int)
 * garbage collect setlocale(3) and <locale.h>
 * garbage collect several unused constants
 * minus 45 LOC, no functional change

OK?
  Ingo


Index: locate/fastfind.c
===================================================================
RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v
retrieving revision 1.14
diff -u -p -r1.14 fastfind.c
--- locate/fastfind.c 8 Dec 2017 17:26:42 -0000 1.14
+++ locate/fastfind.c 13 Jan 2019 02:07:47 -0000
@@ -126,10 +126,8 @@ fastfind_mmap
  u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX];
 
 #ifdef FF_ICASE
- /* use a lookup table for case insensitive search */
- u_char table[UCHAR_MAX + 1];
-
- tolower_word(pathpart);
+ for (p = pathpart; *p != '\0'; p++)
+ *p = tolower(*p);
 #endif /* FF_ICASE*/
 
  /* init bigram table */
@@ -156,11 +154,8 @@ fastfind_mmap
  p = pathpart;
  patend = patprep(p);
  cc = *patend;
-
 #ifdef FF_ICASE
- /* set patend char to true */
- table[TOLOWER(*patend)] = 1;
- table[toupper(*patend)] = 1;
+ cc = tolower(cc);
 #endif /* FF_ICASE */
 
 
@@ -173,10 +168,11 @@ fastfind_mmap
 
  /* go forward or backward */
  if (c == SWITCH) { /* big step, an integer */
- if (len < INTSIZE)
+ if (len < sizeof(int))
  break;
  count += getwm(paddr) - OFFSET;
- len -= INTSIZE; paddr += INTSIZE;
+ len -= sizeof(int);
+ paddr += sizeof(int);
  } else {   /* slow step, =< 14 chars */
  count += c - OFFSET;
  }
@@ -207,7 +203,7 @@ fastfind_mmap
  break; /* SWITCH */
  }
 #ifdef FF_ICASE
- if (table[c])
+ if (tolower(c) == cc)
 #else
  if (c == cc)
 #endif /* FF_ICASE */
@@ -215,17 +211,15 @@ fastfind_mmap
  *p++ = c;
  } else {
  /* bigrams are parity-marked */
- TO7BIT(c);
-
-#ifndef FF_ICASE
+ c &= ASCII_MAX;
+#ifdef FF_ICASE
+ if (tolower(bigram1[c]) == cc ||
+    tolower(bigram2[c]) == cc)
+#else
  if (bigram1[c] == cc ||
     bigram2[c] == cc)
-#else
-
- if (table[bigram1[c]] ||
-    table[bigram2[c]])
 #endif /* FF_ICASE */
- foundchar = p + 1;
+ foundchar = p + 1;
 
  *p++ = bigram1[c];
  *p++ = bigram2[c];
@@ -246,14 +240,14 @@ fastfind_mmap
  for (s = foundchar; s >= cutoff; s--) {
  if (*s == cc
 #ifdef FF_ICASE
-    || TOLOWER(*s) == cc
+    || tolower(*s) == cc
 #endif /* FF_ICASE */
     ) { /* fast first char check */
  for (p = patend - 1, q = s - 1; *p != '\0';
     p--, q--)
  if (*q != *p
 #ifdef FF_ICASE
-    && TOLOWER(*q) != *p
+    && tolower(*q) != *p
 #endif /* FF_ICASE */
     )
  break;
@@ -281,7 +275,7 @@ fastfind_mmap
  if (f_limit >= counter)
  (void)puts(path);
  else  {
- (void)fprintf(stderr, "[show only %d lines]\n", counter - 1);
+ fprintf(stderr, "[show only %u lines]\n", counter - 1);
  exit(0);
  }
  } else
Index: locate/locate.c
===================================================================
RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v
retrieving revision 1.31
diff -u -p -r1.31 locate.c
--- locate/locate.c 19 Nov 2015 21:46:05 -0000 1.31
+++ locate/locate.c 13 Jan 2019 02:07:47 -0000
@@ -73,7 +73,6 @@
 #include <fnmatch.h>
 #include <libgen.h>
 #include <limits.h>
-#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -109,10 +108,8 @@ void search_statistic(char *, char **);
 unsigned long cputime(void);
 
 extern char     **colon(char **, char*, char*);
-extern void     print_matches(u_int);
 extern int      getwm(caddr_t);
 extern int      getwf(FILE *);
-extern u_char   *tolower_word(u_char *);
 extern int check_bigram_char(int);
 extern char *patprep(char *);
 
@@ -122,7 +119,6 @@ main(int argc, char *argv[])
 {
  int ch;
  char **dbv = NULL;
- (void) setlocale(LC_ALL, "");
 
  if (pledge("stdio rpath", NULL) == -1)
  err(1, "pledge");
@@ -168,10 +164,6 @@ main(int argc, char *argv[])
  dbv = colon(dbv, path_fcodes, _PATH_FCODES);
  }
 
- if (f_icase && UCHAR_MAX < 4096) /* init tolower lookup table */
- for (ch = 0; ch < UCHAR_MAX + 1; ch++)
- myctype[ch] = tolower(ch);
-
  /* foreach database ... */
  while ((path_fcodes = *dbv) != NULL) {
  dbv++;
@@ -183,7 +175,7 @@ main(int argc, char *argv[])
  }
 
  if (f_silent)
- print_matches(counter);
+ printf("%u\n", counter);
  exit(0);
 }
 
Index: locate/locate.h
===================================================================
RCS file: /cvs/src/usr.bin/locate/locate/locate.h,v
retrieving revision 1.10
diff -u -p -r1.10 locate.h
--- locate/locate.h 29 Sep 2003 16:03:16 -0000 1.10
+++ locate/locate.h 13 Jan 2019 02:07:47 -0000
@@ -36,36 +36,10 @@
 #define NBG 128 /* number of bigrams considered */
 #define OFFSET 14 /* abs value of max likely diff */
 #define PARITY 0200 /* parity bit */
-#define SWITCH 30 /* switch code */
-#define UMLAUT          31              /* an 8 bit char followed */
-
-/* 0-28 likeliest differential counts + offset to make nonnegative */
-#define LDC_MIN         0
-#define LDC_MAX        28
-
-#undef CHAR_MAX
-#define CHAR_MAX 127
-/* 128-255 bigram codes (128 most common, as determined by 'updatedb') */
-#define BIGRAM_MIN    (UCHAR_MAX - CHAR_MAX)
-#define BIGRAM_MAX    UCHAR_MAX
-
-/* 32-127  single character (printable) ascii residue (ie, literal) */
-#define ASCII_MIN      32
-#define ASCII_MAX     CHAR_MAX
-
-/* #define TO7BIT(x)     (x = ( ((u_char)x) & CHAR_MAX )) */
-#define TO7BIT(x)     (x = x & CHAR_MAX )
 
-
-#if UCHAR_MAX >= 4096
-   define TOLOWER(ch)  tolower(ch)
-#else
-
-u_char myctype[UCHAR_MAX + 1];
-#define TOLOWER(ch) (myctype[ch])
-#endif
-
-#define INTSIZE (sizeof(int))
+#define SWITCH 30 /* switch code */
+#define UMLAUT 31 /* a byte < 32 or > 127 follows */
+#define ASCII_MIN 32 /* printable ASCII characters */
+#define ASCII_MAX 127
 
 #define LOCATE_REG "*?[]\\"  /* fnmatch(3) meta characters */
-
Index: locate/util.c
===================================================================
RCS file: /cvs/src/usr.bin/locate/locate/util.c,v
retrieving revision 1.14
diff -u -p -r1.14 util.c
--- locate/util.c 1 Sep 2016 09:48:20 -0000 1.14
+++ locate/util.c 13 Jan 2019 02:07:47 -0000
@@ -45,8 +45,6 @@
 
 char **colon(char **, char*, char*);
 char *patprep(char *);
-void print_matches(u_int);
-u_char *tolower_word(u_char *);
 int getwm(caddr_t);
 int getwf(FILE *);
 int check_bigram_char(int);
@@ -136,13 +134,6 @@ colon(dbv, path, dot)
  return (dbv);
 }
 
-void
-print_matches(counter)
- u_int counter;
-{
- (void)printf("%d\n", counter);
-}
-
 
 /*
  * extract last glob-free subpattern in name for fast pre-match; prepend
@@ -203,19 +194,6 @@ patprep(name)
  return(--subp);
 }
 
-/* tolower word */
-u_char *
-tolower_word(word)
- u_char *word;
-{
- u_char *p;
-
- for (p = word; *p != '\0'; p++)
- *p = TOLOWER(*p);
-
- return(word);
-}
-
 
 /*
  * Read integer from mmap pointer.
@@ -230,12 +208,12 @@ getwm(p)
  caddr_t p;
 {
  union {
- char buf[INTSIZE];
+ char buf[sizeof(int)];
  int i;
  } u;
  int i;
 
- for (i = 0; i < INTSIZE; i++)
+ for (i = 0; i < sizeof(int); i++)
  u.buf[i] = *p++;
 
  i = u.i;

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Ted Unangst-6
Ingo Schwarze wrote:

> So, here is some cleanup:
>  * garbage collect useless hand-rolled lookup tables
>  * merge one-line helper functions into callers
>  * garbage collect obfuscating macros
>  * fix one format string error (%d used for u_int)
>  * garbage collect setlocale(3) and <locale.h>
>  * garbage collect several unused constants
>  * minus 45 LOC, no functional change
>
> OK?
>   Ingo

I think you may be fixing a bug here...

> Index: locate/fastfind.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v
> retrieving revision 1.14
> diff -u -p -r1.14 fastfind.c
> --- locate/fastfind.c 8 Dec 2017 17:26:42 -0000 1.14
> +++ locate/fastfind.c 13 Jan 2019 02:07:47 -0000
> @@ -126,10 +126,8 @@ fastfind_mmap
>   u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX];
>  
>  #ifdef FF_ICASE
> - /* use a lookup table for case insensitive search */
> - u_char table[UCHAR_MAX + 1];
> -
> - tolower_word(pathpart);
> + for (p = pathpart; *p != '\0'; p++)
> + *p = tolower(*p);
>  #endif /* FF_ICASE*/
>  
>   /* init bigram table */
> @@ -156,11 +154,8 @@ fastfind_mmap
>   p = pathpart;
>   patend = patprep(p);
>   cc = *patend;
> -
>  #ifdef FF_ICASE
> - /* set patend char to true */
> - table[TOLOWER(*patend)] = 1;
> - table[toupper(*patend)] = 1;
> + cc = tolower(cc);
>  #endif /* FF_ICASE */

I cannot see where that table is zeroed. It lives on the stack, it could
contain anything.

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Ingo Schwarze
Hi Ted,

Ted Unangst wrote on Sun, Jan 13, 2019 at 03:22:41AM -0500:
> Ingo Schwarze wrote:

>> So, here is some cleanup:
>>  * garbage collect useless hand-rolled lookup tables
>>  * merge one-line helper functions into callers
>>  * garbage collect obfuscating macros
>>  * fix one format string error (%d used for u_int)
>>  * garbage collect setlocale(3) and <locale.h>
>>  * garbage collect several unused constants
>>  * minus 45 LOC, no functional change
>>
>> OK?

> I think you may be fixing a bug here...

Is that an OK?

>> Index: locate/fastfind.c
>> ===================================================================
>> RCS file: /cvs/src/usr.bin/locate/locate/fastfind.c,v
>> retrieving revision 1.14
>> diff -u -p -r1.14 fastfind.c
>> --- locate/fastfind.c 8 Dec 2017 17:26:42 -0000 1.14
>> +++ locate/fastfind.c 13 Jan 2019 02:07:47 -0000
>> @@ -126,10 +126,8 @@ fastfind_mmap
>>   u_char bigram1[NBG], bigram2[NBG], path[PATH_MAX];
>>  
>>  #ifdef FF_ICASE
>> - /* use a lookup table for case insensitive search */
>> - u_char table[UCHAR_MAX + 1];
>> -
>> - tolower_word(pathpart);
>> + for (p = pathpart; *p != '\0'; p++)
>> + *p = tolower(*p);
>>  #endif /* FF_ICASE*/
>>  
>>   /* init bigram table */
>> @@ -156,11 +154,8 @@ fastfind_mmap
>>   p = pathpart;
>>   patend = patprep(p);
>>   cc = *patend;
>> -
>>  #ifdef FF_ICASE
>> - /* set patend char to true */
>> - table[TOLOWER(*patend)] = 1;
>> - table[toupper(*patend)] = 1;
>> + cc = tolower(cc);
>>  #endif /* FF_ICASE */

> I cannot see where that table is zeroed.
> It lives on the stack, it could contain anything.

Good point.  Then again, if that bug stings, it only causes bogus
matches in the pre-match phase (and only with -i).  The real
comparison is done later on the complete strings, see the calls to
fnmatch(3) and strstr(3) bwlow.  So the bug probably only ruined -i
performance, but did not lead to invalid results.

Yours,
  Ingo

Reply | Threaded
Open this post in threaded view
|

Re: locale in locate(1)

Ingo Schwarze
In reply to this post by Jan Stary
Hi,

Jan Stary wrote on Thu, Jan 10, 2019 at 09:35:48PM +0100:

> Does locate(1) need to setlocale(3)?

Committed as part of a much larger diff.

Note that answering that question required substantial code review
and code cleanup.

When auditing, don't just mechanically look for suspicious patterns.
You have to actually inspect and understand the code to make an
an audit useful.  Something like the patch below is merely a
starting point for an audit (trivial to find), not a result.

Yours,
  Ingo


> Index: locate/locate.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/locate/locate/locate.c,v
> retrieving revision 1.31
> diff -u -p -r1.31 locate.c
> --- locate/locate.c 19 Nov 2015 21:46:05 -0000 1.31
> +++ locate/locate.c 10 Jan 2019 20:34:16 -0000
> @@ -73,7 +73,6 @@
>  #include <fnmatch.h>
>  #include <libgen.h>
>  #include <limits.h>
> -#include <locale.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
> @@ -122,7 +121,6 @@ main(int argc, char *argv[])
>  {
>   int ch;
>   char **dbv = NULL;
> - (void) setlocale(LC_ALL, "");
>  
>   if (pledge("stdio rpath", NULL) == -1)
>   err(1, "pledge");