clean up LC_NUMERIC in sort(1)

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

clean up LC_NUMERIC in sort(1)

Ingo Schwarze
Hi,

it was noticed years ago that our implementation of sort(1) is very
messy.  It contains huge amounts of dead code -- many hundreds of
lines -- and many wrapper functions around standard C library
functions that are in very bad taste and mostly pointless.  However,
nobody ever came round to cleaning it up.

As a first simple step, this patch deletes LC_NUMERIC support.
LC_NUMERIC is an absolutely terrible idea that we should never
support anywhere, neither now nor in the future.

The benefit is getting rid of

 * Four global variables (extern, not even file static).
 * Two static variables in sort.c.
 * One pointless wrapper function, conv_mbtowc().
 * Substantial parts of the code in set_locale().
 * One obscure environment variable.
 * Some so-called "debug" output that is always the same.

In order to not do too much in a single diff, this diff leaves
LC_NUMERIC and LC_CTYPE alone even though both are NOOPs on OpenBSD.
That's work for another day.

However, let's already drop LC_MESSAGES and LC_TIME from the
manual page together with LC_NUMERIC: they are of course NOOPs
and don't even seem to have dead tentacles in the code.

The test suite is still fine.

OK?
  Ingo


Index: coll.c
===================================================================
RCS file: /cvs/src/usr.bin/sort/coll.c,v
retrieving revision 1.11
diff -u -p -r1.11 coll.c
--- coll.c 11 Dec 2015 21:41:51 -0000 1.11
+++ coll.c 6 May 2019 11:59:44 -0000
@@ -46,12 +46,6 @@
 struct key_specs *keys;
 size_t keys_num = 0;
 
-wint_t symbol_decimal_point = L'.';
-/* there is no default thousands separator in collate rules: */
-wint_t symbol_thousands_sep = 0;
-wint_t symbol_negative_sign = L'-';
-wint_t symbol_positive_sign = L'+';
-
 static int wstrcoll(struct key_value *kv1, struct key_value *kv2, size_t offset);
 static int gnumcoll(struct key_value*, struct key_value *, size_t offset);
 static int monthcoll(struct key_value*, struct key_value *, size_t offset);
@@ -701,7 +695,7 @@ read_number(struct bwstring *s0, int *si
  while (iswblank(bws_get_iter_value(s)))
  s = bws_iterator_inc(s, 1);
 
- if (bws_get_iter_value(s) == (wchar_t)symbol_negative_sign) {
+ if (bws_get_iter_value(s) == L'-') {
  *sign = -1;
  s = bws_iterator_inc(s, 1);
  }
@@ -716,16 +710,13 @@ read_number(struct bwstring *s0, int *si
  smain[*main_len] = bws_get_iter_value(s);
  s = bws_iterator_inc(s, 1);
  *main_len += 1;
- } else if (symbol_thousands_sep &&
-    (bws_get_iter_value(s) == (wchar_t)symbol_thousands_sep))
- s = bws_iterator_inc(s, 1);
- else
+ } else
  break;
  }
 
  smain[*main_len] = 0;
 
- if (bws_get_iter_value(s) == (wchar_t)symbol_decimal_point) {
+ if (bws_get_iter_value(s) == L'.') {
  s = bws_iterator_inc(s, 1);
  while (iswdigit(bws_get_iter_value(s)) &&
     *frac_len < MAX_NUM_SIZE) {
Index: coll.h
===================================================================
RCS file: /cvs/src/usr.bin/sort/coll.h,v
retrieving revision 1.1
diff -u -p -r1.1 coll.h
--- coll.h 17 Mar 2015 17:45:13 -0000 1.1
+++ coll.h 6 May 2019 11:59:44 -0000
@@ -123,14 +123,6 @@ typedef int (*listcoll_t)(struct sort_li
 extern struct key_specs *keys;
 extern size_t keys_num;
 
-/*
- * Main localised symbols. These must be wint_t as they may hold WEOF.
- */
-extern wint_t symbol_decimal_point;
-extern wint_t symbol_thousands_sep;
-extern wint_t symbol_negative_sign;
-extern wint_t symbol_positive_sign;
-
 /* funcs */
 
 cmpcoll_t get_sort_func(struct sort_mods *sm);
Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
retrieving revision 1.58
diff -u -p -r1.58 sort.1
--- sort.1 24 Feb 2019 09:57:43 -0000 1.58
+++ sort.1 6 May 2019 11:59:44 -0000
@@ -177,10 +177,6 @@ Unknown strings are considered smaller t
 .It Fl n , Fl Fl numeric-sort , Fl Fl sort Ns = Ns Cm numeric
 An initial numeric string, consisting of optional blank space, optional
 minus sign, and zero or more digits (including decimal point)
-.\" with
-.\" optional radix character and thousands
-.\" separator
-.\" (as defined in the current locale),
 is sorted by arithmetic value.
 Leading blank characters are ignored.
 .It Fl R , Fl Fl random-sort , Fl Fl sort Ns = Ns Cm random
@@ -498,18 +494,6 @@ which has no
 equivalent.
 .Sh ENVIRONMENT
 .Bl -tag -width Fl
-.It Ev GNUSORT_NUMERIC_COMPATIBILITY
-If defined
-.Fl t
-will not override the locale numeric symbols, that is, thousand
-separators and decimal separators.
-By default, if we specify
-.Fl t
-with the same symbol as the thousand separator or decimal point,
-the symbol will be treated as the field separator.
-Older behavior was less definite: the symbol was treated as both field
-separator and numeric separator, simultaneously.
-This environment variable enables the old behavior.
 .It Ev LANG
 Used as a last resort to determine different kinds of locale-specific
 behavior if neither the respective environment variable nor
@@ -526,15 +510,6 @@ sorting records.
 Locale settings to be used to case conversion and classification
 of characters, that is, which characters are considered
 whitespaces, etc.
-.It Ev LC_MESSAGES
-Locale settings that determine the language of output messages
-that
-.Nm
-prints out.
-.It Ev LC_NUMERIC
-Locale settings that determine the number format used in numeric sort.
-.It Ev LC_TIME
-Locale settings that determine the month format used in month sort.
 .It Ev TMPDIR
 Path to the directory in which temporary files will be stored.
 Note that
Index: sort.c
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.c,v
retrieving revision 1.87
diff -u -p -r1.87 sort.c
--- sort.c 4 Jan 2017 15:30:58 -0000 1.87
+++ sort.c 6 May 2019 11:59:44 -0000
@@ -70,13 +70,9 @@ struct sort_opts sort_opts_vals;
 bool debug_sort;
 bool need_hint;
 
-static bool gnusort_numeric_compatibility;
-
 static struct sort_mods default_sort_mods_object;
 struct sort_mods * const default_sort_mods = &default_sort_mods_object;
 
-static bool print_symbols_on_debug;
-
 /*
  * Arguments from file (when file0-from option is used:
  */
@@ -239,45 +235,14 @@ set_hw_params(void)
 }
 
 /*
- * Convert "plain" symbol to wide symbol, with default value.
- */
-static void
-conv_mbtowc(wchar_t *wc, const char *c, const wchar_t def)
-{
- int res;
-
- res = mbtowc(wc, c, MB_CUR_MAX);
- if (res < 1)
- *wc = def;
-}
-
-/*
  * Set current locale symbols.
  */
 static void
 set_locale(void)
 {
- struct lconv *lc;
  const char *locale;
 
- setlocale(LC_ALL, "");
-
- /* Obtain LC_NUMERIC info */
- lc = localeconv();
-
- /* Convert to wide char form */
- conv_mbtowc(&symbol_decimal_point, lc->decimal_point,
-    symbol_decimal_point);
- conv_mbtowc(&symbol_thousands_sep, lc->thousands_sep,
-    symbol_thousands_sep);
- conv_mbtowc(&symbol_positive_sign, lc->positive_sign,
-    symbol_positive_sign);
- conv_mbtowc(&symbol_negative_sign, lc->negative_sign,
-    symbol_negative_sign);
-
- if (getenv("GNUSORT_NUMERIC_COMPATIBILITY"))
- gnusort_numeric_compatibility = true;
-
+ setlocale(LC_CTYPE, "");
  locale = setlocale(LC_COLLATE, NULL);
  if (locale != NULL) {
  char *tmpl;
@@ -518,7 +483,6 @@ set_sort_modifier(struct sort_mods *sm,
  case 'n':
  sm->nflag = true;
  need_hint = true;
- print_symbols_on_debug = true;
  break;
  case 'r':
  sm->rflag = true;
@@ -529,7 +493,6 @@ set_sort_modifier(struct sort_mods *sm,
  case 'h':
  sm->hflag = true;
  need_hint = true;
- print_symbols_on_debug = true;
  break;
  default:
  return false;
@@ -963,16 +926,6 @@ main(int argc, char *argv[])
  errno = EINVAL;
  err(2, NULL);
  }
- if (!gnusort_numeric_compatibility) {
- if (symbol_decimal_point == sort_opts_vals.field_sep)
- symbol_decimal_point = WEOF;
- if (symbol_thousands_sep == sort_opts_vals.field_sep)
- symbol_thousands_sep = WEOF;
- if (symbol_negative_sign == sort_opts_vals.field_sep)
- symbol_negative_sign = WEOF;
- if (symbol_positive_sign == sort_opts_vals.field_sep)
- symbol_positive_sign = WEOF;
- }
  break;
  case 'u':
  sort_opts_vals.uflag = true;
@@ -1167,14 +1120,6 @@ main(int argc, char *argv[])
     setlocale(LC_COLLATE, NULL));
  if (byte_sort)
  printf("Byte sort is used\n");
- if (print_symbols_on_debug) {
- printf("Decimal Point: <%lc>\n", symbol_decimal_point);
- if (symbol_thousands_sep)
- printf("Thousands separator: <%lc>\n",
-    symbol_thousands_sep);
- printf("Positive sign: <%lc>\n", symbol_positive_sign);
- printf("Negative sign: <%lc>\n", symbol_negative_sign);
- }
  }
 
  if (sort_opts_vals.cflag)

Reply | Threaded
Open this post in threaded view
|

clean up LC_COLLATE and LC_CTYPE in sort(1)

Ingo Schwarze
Hi,

after my LC_NUMERIC cleanup for sort(1) went in (thanks to tb@ for
the review), i'd like to adress the rest of locale dependency.

Large amounts of extremely ugly code in sort(1) - many hundreds of
lines - deal with LC_COLLATE, which we don't support now and have
no intention to support in the future.

The code is very repetitive and currently written to handle three cases:

 1. byte_sort == true && sort_mb_cur_max == 1

    That is the only mode currently supported on OpenBSD.
    It means everything uses the POSIX locale and ASCII.

 2. byte_sort == false && sort_mb_cur_max == 1

    That will never be supported on OpenBSD.
    It handles 8-bit single byte character encodings which are
    incompatible with UTF-8, for example ISO-LATIN-1.

 3. byte_sort == false && sort_mb_cur_max > 1

    Even though i doubt we will ever do it, that could theoretically
    happen on OpenBSD in the remote future, if we ever choose to
    implement collation support for UTF-8 locales.

Handling case 3 would be a massive undertaking - not just a matter
of improving Unicode support, but also forcing us to maintain many
different UTF-8 locales for many different languages, which means
extremely messy stuff invading the C library.  During the Belgrade
EuroBSDCon a few years ago, i talked to Baptiste Daroussin who had
just implemented LC_COLLATE in FreeBSD libc and who was utterly
scared by the complexity.  Knowing ourselves, we would be scared
even more once we got there.  So it will definitely not happen
quickly.  Then again, ruling that out for good is maybe not a
decision to make in this particular patch.

Consequently, the byte_sort variable can be deleted immediately,
killing case 2 for good, but i'm keeping the sort_mb_cur_max variable
as a global constant for now, even though more than half of the
code it controls is currently dead code.

Since none of our single-byte character and string functions are
locale dependent, we can also zap LC_CTYPE while here.

After committing this patch, i shall re-indent bwscoll() properly
in a separate commit, but i'm not including that in the patch sent
out here because it would make the patch unreadable.

OK?
  Ingo


Index: bwstring.c
===================================================================
RCS file: /cvs/src/usr.bin/sort/bwstring.c,v
retrieving revision 1.7
diff -u -p -r1.7 bwstring.c
--- bwstring.c 1 Apr 2015 22:38:08 -0000 1.7
+++ bwstring.c 14 May 2019 15:19:54 -0000
@@ -40,9 +40,6 @@
 #include "bwstring.h"
 #include "sort.h"
 
-bool byte_sort;
-size_t sort_mb_cur_max = 1;
-
 static wchar_t **wmonths;
 static char **cmonths;
 
@@ -686,22 +683,20 @@ bwscoll(const struct bwstring *bws1, con
 
  if (len1 <= offset)
  return (len2 <= offset) ? 0 : -1;
- else {
+
  if (len2 <= offset)
  return 1;
- else {
+
  len1 -= offset;
  len2 -= offset;
 
  if (sort_mb_cur_max == 1) {
  const unsigned char *s1, *s2;
+ int res;
 
  s1 = bws1->data.cstr + offset;
  s2 = bws2->data.cstr + offset;
 
- if (byte_sort) {
- int res = 0;
-
  if (len1 > len2) {
  res = memcmp(s1, s2, len2);
  if (!res)
@@ -714,66 +709,6 @@ bwscoll(const struct bwstring *bws1, con
  res = memcmp(s1, s2, len1);
 
  return res;
-
- } else {
- int res = 0;
- size_t i, maxlen;
-
- i = 0;
- maxlen = len1;
-
- if (maxlen > len2)
- maxlen = len2;
-
- while (i < maxlen) {
- /* goto next non-zero part: */
- while ((i < maxlen) &&
-    !s1[i] && !s2[i])
- ++i;
-
- if (i >= maxlen)
- break;
-
- if (s1[i] == 0) {
- if (s2[i] == 0)
- /* NOTREACHED */
- err(2, "bwscoll error 01");
- else
- return -1;
- } else if (s2[i] == 0)
- return 1;
-
- res = strcoll((const char *)(s1 + i), (const char *)(s2 + i));
- if (res)
- return res;
-
- while ((i < maxlen) &&
-    s1[i] && s2[i])
- ++i;
-
- if (i >= maxlen)
- break;
-
- if (s1[i] == 0) {
- if (s2[i] == 0) {
- ++i;
- continue;
- } else
- return -1;
- } else if (s2[i] == 0)
- return 1;
- else
- /* NOTREACHED */
- err(2, "bwscoll error 02");
- }
-
- if (len1 < len2)
- return -1;
- else if (len1 > len2)
- return 1;
-
- return 0;
- }
  } else {
  const wchar_t *s1, *s2;
  size_t i, maxlen;
@@ -834,8 +769,6 @@ bwscoll(const struct bwstring *bws1, con
  return 0;
  return len1 < len2 ? -1 : 1;
  }
- }
- }
 }
 
 /*
Index: bwstring.h
===================================================================
RCS file: /cvs/src/usr.bin/sort/bwstring.h,v
retrieving revision 1.2
diff -u -p -r1.2 bwstring.h
--- bwstring.h 31 Dec 2015 16:09:31 -0000 1.2
+++ bwstring.h 14 May 2019 15:19:54 -0000
@@ -37,8 +37,7 @@
 
 #include "mem.h"
 
-extern bool byte_sort;
-extern size_t sort_mb_cur_max;
+static const size_t sort_mb_cur_max = 1;
 
 /* wchar_t is of 4 bytes: */
 #define SIZEOF_WCHAR_STRING(LEN) ((LEN)*sizeof(wchar_t))
Index: file.c
===================================================================
RCS file: /cvs/src/usr.bin/sort/file.c,v
retrieving revision 1.21
diff -u -p -r1.21 file.c
--- file.c 17 Oct 2016 02:58:29 -0000 1.21
+++ file.c 14 May 2019 15:19:54 -0000
@@ -1078,7 +1078,8 @@ sort_list_to_file(struct sort_list *list
 
  if (!sm->Mflag && !sm->Rflag && !sm->Vflag &&
     !sm->gflag && !sm->hflag && !sm->nflag) {
- if ((sort_opts_vals.sort_method == SORT_DEFAULT) && byte_sort)
+ if (sort_opts_vals.sort_method == SORT_DEFAULT &&
+    sort_mb_cur_max == 1)
  sort_opts_vals.sort_method = SORT_RADIXSORT;
 
  } else if (sort_opts_vals.sort_method == SORT_RADIXSORT)
Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
retrieving revision 1.59
diff -u -p -r1.59 sort.1
--- sort.1 13 May 2019 17:00:12 -0000 1.59
+++ sort.1 14 May 2019 15:19:54 -0000
@@ -60,9 +60,8 @@ option
 A record can contain any printable or unprintable characters.
 Comparisons are based on one or more sort keys extracted from
 each line of input, and are performed lexicographically,
-according to the current locale's collating rules and the
-specified command-line options that can tune the actual
-sorting behavior.
+according to the specified command-line options
+that can tune the actual sorting behavior.
 By default, if keys are not given,
 .Nm
 uses entire lines for comparison.
@@ -201,7 +200,6 @@ The files are compared by their prefixes
 zeros are ignored in version numbers, see example below).
 If an input string does not match the pattern, then it is compared
 using the byte compare function.
-All string comparisons are performed in the C locale.
 .Pp
 For example:
 .Bd -literal -offset indent
@@ -494,22 +492,6 @@ which has no
 equivalent.
 .Sh ENVIRONMENT
 .Bl -tag -width Fl
-.It Ev LANG
-Used as a last resort to determine different kinds of locale-specific
-behavior if neither the respective environment variable nor
-.Ev LC_ALL
-are set.
-.It Ev LC_ALL
-Locale settings that override all of the other locale settings.
-This environment variable can be used to set all these settings
-to the same value at once.
-.It Ev LC_COLLATE
-Locale settings to be used to determine the collation for
-sorting records.
-.It Ev LC_CTYPE
-Locale settings to be used to case conversion and classification
-of characters, that is, which characters are considered
-whitespaces, etc.
 .It Ev TMPDIR
 Path to the directory in which temporary files will be stored.
 Note that
@@ -553,7 +535,10 @@ The
 .Nm
 utility is compliant with the
 .St -p1003.1-2008
-specification.
+specification, except that it ignores the user's
+.Xr locale 1
+and always assumes
+.Ev LC_ALL Ns =C.
 .Pp
 The flags
 .Op Fl gHhiMRSsTVz
@@ -603,13 +588,10 @@ This implementation of
 has no limits on input line length (other than imposed by available
 memory) or any restrictions on bytes allowed within lines.
 .Pp
-The performance depends highly on locale settings,
+The performance depends highly on
 efficient choice of sort keys and key complexity.
-The fastest sort is with the C locale, on whole lines, with option
+The fastest sort is on whole lines, with option
 .Fl s .
-In general, the C locale is the fastest, followed by single-byte
-locales with multi-byte locales being the slowest.
-The correct collation order respected in all cases.
 For the key specification, the simpler to process the
 lines the faster the search will be.
 .Pp
Index: sort.c
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.c,v
retrieving revision 1.88
diff -u -p -r1.88 sort.c
--- sort.c 13 May 2019 17:00:12 -0000 1.88
+++ sort.c 14 May 2019 15:19:55 -0000
@@ -36,7 +36,6 @@
 #include <errno.h>
 #include <getopt.h>
 #include <limits.h>
-#include <locale.h>
 #include <md5.h>
 #include <regex.h>
 #include <signal.h>
@@ -235,37 +234,6 @@ set_hw_params(void)
 }
 
 /*
- * Set current locale symbols.
- */
-static void
-set_locale(void)
-{
- const char *locale;
-
- setlocale(LC_CTYPE, "");
- locale = setlocale(LC_COLLATE, NULL);
- if (locale != NULL) {
- char *tmpl;
- const char *byteclocale;
-
- tmpl = sort_strdup(locale);
- byteclocale = setlocale(LC_COLLATE, "C");
- if (byteclocale && strcmp(byteclocale, tmpl) == 0) {
- byte_sort = true;
- } else {
- byteclocale = setlocale(LC_COLLATE, "POSIX");
- if (byteclocale && strcmp(byteclocale, tmpl) == 0)
- byte_sort = true;
- else
- setlocale(LC_COLLATE, tmpl);
- }
- sort_free(tmpl);
- }
- if (!byte_sort)
- sort_mb_cur_max = MB_CUR_MAX;
-}
-
-/*
  * Set directory temporary files.
  */
 static void
@@ -846,7 +814,6 @@ main(int argc, char *argv[])
 
  atexit(clear_tmp_files);
 
- set_locale();
  set_tmpdir();
  set_sort_opts();
 
@@ -1113,14 +1080,9 @@ main(int argc, char *argv[])
  ks->sm.func = get_sort_func(&(ks->sm));
  }
 
- if (debug_sort) {
+ if (debug_sort)
  printf("Memory to be used for sorting: %llu\n",
     available_free_memory);
- printf("Using collate rules of %s locale\n",
-    setlocale(LC_COLLATE, NULL));
- if (byte_sort)
- printf("Byte sort is used\n");
- }
 
  if (sort_opts_vals.cflag)
  return check(argc ? *argv : "-");

Reply | Threaded
Open this post in threaded view
|

Re: clean up LC_COLLATE and LC_CTYPE in sort(1)

Todd C. Miller-3
On Tue, 14 May 2019 17:55:21 +0200, Ingo Schwarze wrote:

> Consequently, the byte_sort variable can be deleted immediately,
> killing case 2 for good, but i'm keeping the sort_mb_cur_max variable
> as a global constant for now, even though more than half of the
> code it controls is currently dead code.
>
> Since none of our single-byte character and string functions are
> locale dependent, we can also zap LC_CTYPE while here.
>
> After committing this patch, i shall re-indent bwscoll() properly
> in a separate commit, but i'm not including that in the patch sent
> out here because it would make the patch unreadable.

OK millert@

 - todd

Reply | Threaded
Open this post in threaded view
|

Re: clean up LC_COLLATE and LC_CTYPE in sort(1)

Theo Buehler-5
In reply to this post by Ingo Schwarze
On Tue, May 14, 2019 at 05:55:21PM +0200, Ingo Schwarze wrote:

> Hi,
>
> after my LC_NUMERIC cleanup for sort(1) went in (thanks to tb@ for
> the review), i'd like to adress the rest of locale dependency.
>
> Large amounts of extremely ugly code in sort(1) - many hundreds of
> lines - deal with LC_COLLATE, which we don't support now and have
> no intention to support in the future.
>
> The code is very repetitive and currently written to handle three cases:
>
>  1. byte_sort == true && sort_mb_cur_max == 1
>
>     That is the only mode currently supported on OpenBSD.
>     It means everything uses the POSIX locale and ASCII.
>
>  2. byte_sort == false && sort_mb_cur_max == 1
>
>     That will never be supported on OpenBSD.
>     It handles 8-bit single byte character encodings which are
>     incompatible with UTF-8, for example ISO-LATIN-1.
>
>  3. byte_sort == false && sort_mb_cur_max > 1
>
>     Even though i doubt we will ever do it, that could theoretically
>     happen on OpenBSD in the remote future, if we ever choose to
>     implement collation support for UTF-8 locales.
>
> Handling case 3 would be a massive undertaking - not just a matter
> of improving Unicode support, but also forcing us to maintain many
> different UTF-8 locales for many different languages, which means
> extremely messy stuff invading the C library.  During the Belgrade
> EuroBSDCon a few years ago, i talked to Baptiste Daroussin who had
> just implemented LC_COLLATE in FreeBSD libc and who was utterly
> scared by the complexity.  Knowing ourselves, we would be scared
> even more once we got there.  So it will definitely not happen
> quickly.  Then again, ruling that out for good is maybe not a
> decision to make in this particular patch.
>
> Consequently, the byte_sort variable can be deleted immediately,
> killing case 2 for good, but i'm keeping the sort_mb_cur_max variable
> as a global constant for now, even though more than half of the
> code it controls is currently dead code.
>
> Since none of our single-byte character and string functions are
> locale dependent, we can also zap LC_CTYPE while here.
>
> After committing this patch, i shall re-indent bwscoll() properly
> in a separate commit, but i'm not including that in the patch sent
> out here because it would make the patch unreadable.
>
> OK?

ok