Cannot set locale categories independently with POSIX 2008 locales

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Cannot set locale categories independently with POSIX 2008 locales

Karl Williamson
This has been reported from multiple sources, and we at Perl 5 have
diagnosed the problem

If LC_CTYPE is set to C.UTF-8, it is not possible to set any other
category independently to either C or C.UTF-8 without inadvertently
setting LC_CTYPE back to C.  The attached program demonstrates the problem.

Other operating systems don't have this problem because they don't
ignore the third parameter to newlocale().  Perhaps you didn't consider
this scenario when you made the decision to ignore it.

But you can still ignore it, and get proper behavior, I believe.  From
looking at execution traces, it appears that there are two underlying
locales supported by OpenBSD: 1 and 2.  The locale objects are ints.

Locale 1 is the C locale for all categories
Locale 2 is C.UTF-8 for LC_CTYPE and C for all other categories.

The problem is that calling newlocale() with a non-LC_CTYPE locale and
the name "C" causes it to return 1, regardless of the state of the
LC_CTYPE locale.  Then uselocale() is called with 1, and LC_CTYPE gets
changed to C

I believe the following behavior in newlocale() would fix the problem:

If the mask contains LC_CTYPE_MASK, set the locale to 1 or 2 as appropriate.

If the mask doesn't contain LC_CTYPE_MASK, do nothing, and return 1 or 2
depending on LC_CTYPE.

I have a suspicion, without having looked at your code, that you already
do all this except for considering LC_CTYPE when deciding what
newlocale's return value should be.

I don't think you have to save the name of the locale the user used in
the call to newlocale(), unlike setlocale() where you have to
regurgitate it if asked, so I don't know why you suggest that it's best
to call newlocale with a third parameter of zero.  That harms portability.


openbsd_bug.c (815 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Ted Unangst-6
Karl Williamson wrote:
> This has been reported from multiple sources, and we at Perl 5 have
> diagnosed the problem
>
> If LC_CTYPE is set to C.UTF-8, it is not possible to set any other
> category independently to either C or C.UTF-8 without inadvertently
> setting LC_CTYPE back to C.  The attached program demonstrates the problem.

Thanks for a very thorough report. I believe your guesses about the
implementation were quite accurate. :)

This should fix things.

Index: newlocale.c
===================================================================
RCS file: /home/cvs/src/lib/libc/locale/newlocale.c,v
retrieving revision 1.1
diff -u -p -r1.1 newlocale.c
--- newlocale.c 5 Sep 2017 03:16:13 -0000 1.1
+++ newlocale.c 27 Mar 2019 06:25:01 -0000
@@ -22,8 +22,7 @@
 #include "rune.h"
 
 locale_t
-newlocale(int mask, const char *locname,
-    locale_t oldloc __attribute__((__unused__)))
+newlocale(int mask, const char *locname, locale_t oldloc)
 {
  int ic, flag;
 
@@ -45,7 +44,7 @@ newlocale(int mask, const char *locname,
 
  /* Only character encoding has thread-specific effects. */
  if ((mask & LC_CTYPE_MASK) == 0)
- return _LOCALE_C;
+ return oldloc;
 
  /* The following may initialize UTF-8 for later use. */
  if ((locname = _get_locname(LC_CTYPE, locname)) == NULL) {

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Karl Williamson
On 3/27/19 12:26 AM, Ted Unangst wrote:

> Karl Williamson wrote:
>> This has been reported from multiple sources, and we at Perl 5 have
>> diagnosed the problem
>>
>> If LC_CTYPE is set to C.UTF-8, it is not possible to set any other
>> category independently to either C or C.UTF-8 without inadvertently
>> setting LC_CTYPE back to C.  The attached program demonstrates the problem.
>
> Thanks for a very thorough report. I believe your guesses about the
> implementation were quite accurate. :)
>
> This should fix things.

Does it work if 'oldloc' is 0? a legal value

>
> Index: newlocale.c
> ===================================================================
> RCS file: /home/cvs/src/lib/libc/locale/newlocale.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 newlocale.c
> --- newlocale.c 5 Sep 2017 03:16:13 -0000 1.1
> +++ newlocale.c 27 Mar 2019 06:25:01 -0000
> @@ -22,8 +22,7 @@
>   #include "rune.h"
>  
>   locale_t
> -newlocale(int mask, const char *locname,
> -    locale_t oldloc __attribute__((__unused__)))
> +newlocale(int mask, const char *locname, locale_t oldloc)
>   {
>   int ic, flag;
>  
> @@ -45,7 +44,7 @@ newlocale(int mask, const char *locname,
>  
>   /* Only character encoding has thread-specific effects. */
>   if ((mask & LC_CTYPE_MASK) == 0)
> - return _LOCALE_C;
> + return oldloc;
>  
>   /* The following may initialize UTF-8 for later use. */
>   if ((locname = _get_locname(LC_CTYPE, locname)) == NULL) {
>

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Ted Unangst-6
Karl Williamson wrote:
> > This should fix things.
>
> Does it work if 'oldloc' is 0? a legal value

Nope. Good catch.


Index: newlocale.c
===================================================================
RCS file: /home/cvs/src/lib/libc/locale/newlocale.c,v
retrieving revision 1.1
diff -u -p -r1.1 newlocale.c
--- newlocale.c 5 Sep 2017 03:16:13 -0000 1.1
+++ newlocale.c 27 Mar 2019 22:01:12 -0000
@@ -22,8 +22,7 @@
 #include "rune.h"
 
 locale_t
-newlocale(int mask, const char *locname,
-    locale_t oldloc __attribute__((__unused__)))
+newlocale(int mask, const char *locname, locale_t oldloc)
 {
  int ic, flag;
 
@@ -45,7 +44,7 @@ newlocale(int mask, const char *locname,
 
  /* Only character encoding has thread-specific effects. */
  if ((mask & LC_CTYPE_MASK) == 0)
- return _LOCALE_C;
+ return oldloc ? oldloc : _LOCALE_C;
 
  /* The following may initialize UTF-8 for later use. */
  if ((locname = _get_locname(LC_CTYPE, locname)) == NULL) {

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Ingo Schwarze
Hi Ted and Karl,

Ted Unangst wrote on Wed, Mar 27, 2019 at 06:03:29PM -0400:
> Karl Williamson wrote:
>> Ted Unangst wrote:

>>> This should fix things.

There is no bug and nothing to fix.

Our existing implementation fully conforms to the POSIX specification.
POSIX explicitly says:

  It is unspecified whether the locale object pointed to by base
  shall be modified, or freed and a new locale object created.

I chose to take the second option: free the object pointed to by
base and create a new locale object, because that is the simpler
one of the two conforming possibilities, and we generally prefer
simplicity.  Note that nothing in POSIX requires copying anything
from base when creating a new locale object.

That said, if you think it helps compatibility with other
implementations, i'm OK with changing the behaviour and implementing
the other conforming possibility, i.e. modify the object pointed
to by base instead.  While the complication of the code is minimal,
the manual page requires substantial rewriting and becomes noticeably
harder to understand, but oh well, if it improves compatibility...

As you already observed, the first patch sent by Ted was wrong.
The second patch is correct, by i dislike the behaviour change for
the case base == LC_GLOBAL_LOCALE and for the case of an invalid
base.  Yes, POSIX explicity stipulates those two cases result in
undefined behaviour, but returning invalid results from newlocale(3)
is not nice.  I'd much prefer to always return a valid locale, even
for input causing undefined behaviour.

In any case, in the commit message, do not call it a "bug fix",
describe it as a change for compatibility with other systems.

Ted, feel free to either commit this version or tell me to commit.

Yours,
  Ingo


Index: locale/newlocale.3
===================================================================
RCS file: /cvs/src/lib/libc/locale/newlocale.3,v
retrieving revision 1.1
diff -u -p -r1.1 newlocale.3
--- locale/newlocale.3 5 Sep 2017 03:16:13 -0000 1.1
+++ locale/newlocale.3 28 Mar 2019 13:56:37 -0000
@@ -46,11 +46,23 @@ creates a new locale object for use with
 and with many functions that accept
 .Vt locale_t
 arguments.
+Locale categories not contained in the
+.Fa mask
+are copied from
+.Fa oldloc
+to the new locale object, or from the
+.Qq C
+locale if
+.Fa oldloc
+is
+.Po Vt locale_t Pc Ns 0 .
 .Pp
 On
 .Ox ,
+.Fa locname
+only affects the return value if
 .Fa mask
-is only meaningful if it includes
+includes
 .Dv LC_CTYPE_MASK ,
 and
 .Fa locname
@@ -60,18 +72,11 @@ or
 .Qq POSIX ,
 if it ends with
 .Qq .UTF-8 ,
-or if it is an empty string; otherwise,
-.Fn newlocale
-always returns the C locale.
-.Pp
-On
-.Ox ,
-.Fn newlocale
-ignores
-.Fa oldloc ,
-and passing
-.Po Vt locale_t Pc Ns 0
-is recommended.
+or if it is an empty string.
+Other
+.Fa locname
+arguments have the same effect as
+.Qq C .
 .Pp
 The function
 .Fn duplocale
@@ -159,3 +164,11 @@ These functions conform to
 .Sh HISTORY
 These functions have been available since
 .Ox 6.2 .
+.Sh CAVEATS
+Calling
+.Fn newlocale
+with an
+.Fa oldloc
+argument of
+.Dv LC_GLOBAL_LOCALE
+results in undefined behaviour.
Index: locale/newlocale.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/newlocale.c,v
retrieving revision 1.1
diff -u -p -r1.1 newlocale.c
--- locale/newlocale.c 5 Sep 2017 03:16:13 -0000 1.1
+++ locale/newlocale.c 28 Mar 2019 13:56:37 -0000
@@ -22,8 +22,7 @@
 #include "rune.h"
 
 locale_t
-newlocale(int mask, const char *locname,
-    locale_t oldloc __attribute__((__unused__)))
+newlocale(int mask, const char *locname, locale_t oldloc)
 {
  int ic, flag;
 
@@ -45,7 +44,7 @@ newlocale(int mask, const char *locname,
 
  /* Only character encoding has thread-specific effects. */
  if ((mask & LC_CTYPE_MASK) == 0)
- return _LOCALE_C;
+ return oldloc == _LOCALE_UTF8 ? _LOCALE_UTF8 : _LOCALE_C;
 
  /* The following may initialize UTF-8 for later use. */
  if ((locname = _get_locname(LC_CTYPE, locname)) == NULL) {

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Karl Williamson
On 3/28/19 8:03 AM, Ingo Schwarze wrote:

> Hi Ted and Karl,
>
> Ted Unangst wrote on Wed, Mar 27, 2019 at 06:03:29PM -0400:
>> Karl Williamson wrote:
>>> Ted Unangst wrote:
>
>>>> This should fix things.
>
> There is no bug and nothing to fix.
>
> Our existing implementation fully conforms to the POSIX specification.
> POSIX explicitly says:
>
>    It is unspecified whether the locale object pointed to by base
>    shall be modified, or freed and a new locale object created.

I can see how you might be able to argue for your interpretation.  I
believe the wording in the spec is poor in this case (and it isn't the
only such place).  I, and every other implementer takes the above text
to mean, that it's up to the implementation as to how to combine 'base'
with the new locale specified by the other parameters.  'base' can be
modified and returned, or 'base' can be freed with some new locale which
is the combination of 'base' and the other parameters.  I don't take
that wording to mean that 'base' can be irrelevant.  That can't be the
intent, as that would mean wildly unportable code, and no way for the
program to specify an update to a category in isolation from the others.
  The errata page yields a 404.

There is further support for your position later on when it says "If a
completely new locale object is created, the data for all sections not
requested by category_mask shall be taken from the default locale."  I
think they mean a "completely new locale" where there was no component
from 'base'.

But, it doesn't really matter here, since you're planning to change in
the interests of portability.  Thank you from me, us, and everyone who
wants to use this function on other distros.

>
> I chose to take the second option: free the object pointed to by
> base and create a new locale object, because that is the simpler
> one of the two conforming possibilities, and we generally prefer
> simplicity.  Note that nothing in POSIX requires copying anything
> from base when creating a new locale object.
>
> That said, if you think it helps compatibility with other
> implementations, i'm OK with changing the behaviour and implementing
> the other conforming possibility, i.e. modify the object pointed
> to by base instead.  While the complication of the code is minimal,
> the manual page requires substantial rewriting and becomes noticeably
> harder to understand, but oh well, if it improves compatibility...
>
> As you already observed, the first patch sent by Ted was wrong.
> The second patch is correct, by i dislike the behaviour change for
> the case base == LC_GLOBAL_LOCALE and for the case of an invalid
> base.  Yes, POSIX explicity stipulates those two cases result in
> undefined behaviour, but returning invalid results from newlocale(3)
> is not nice.  I'd much prefer to always return a valid locale, even
> for input causing undefined behaviour.
>
> In any case, in the commit message, do not call it a "bug fix",
> describe it as a change for compatibility with other systems.
>
> Ted, feel free to either commit this version or tell me to commit.
>
> Yours,
>    Ingo
>
>
> Index: locale/newlocale.3
> ===================================================================
> RCS file: /cvs/src/lib/libc/locale/newlocale.3,v
> retrieving revision 1.1
> diff -u -p -r1.1 newlocale.3
> --- locale/newlocale.3 5 Sep 2017 03:16:13 -0000 1.1
> +++ locale/newlocale.3 28 Mar 2019 13:56:37 -0000
> @@ -46,11 +46,23 @@ creates a new locale object for use with
>   and with many functions that accept
>   .Vt locale_t
>   arguments.
> +Locale categories not contained in the
> +.Fa mask
> +are copied from
> +.Fa oldloc
> +to the new locale object, or from the
> +.Qq C
> +locale if
> +.Fa oldloc
> +is
> +.Po Vt locale_t Pc Ns 0 .
>   .Pp
>   On
>   .Ox ,
> +.Fa locname
> +only affects the return value if
>   .Fa mask
> -is only meaningful if it includes
> +includes
>   .Dv LC_CTYPE_MASK ,
>   and
>   .Fa locname
> @@ -60,18 +72,11 @@ or
>   .Qq POSIX ,
>   if it ends with
>   .Qq .UTF-8 ,
> -or if it is an empty string; otherwise,
> -.Fn newlocale
> -always returns the C locale.
> -.Pp
> -On
> -.Ox ,
> -.Fn newlocale
> -ignores
> -.Fa oldloc ,
> -and passing
> -.Po Vt locale_t Pc Ns 0
> -is recommended.
> +or if it is an empty string.
> +Other
> +.Fa locname
> +arguments have the same effect as
> +.Qq C .
>   .Pp
>   The function
>   .Fn duplocale
> @@ -159,3 +164,11 @@ These functions conform to
>   .Sh HISTORY
>   These functions have been available since
>   .Ox 6.2 .
> +.Sh CAVEATS
> +Calling
> +.Fn newlocale
> +with an
> +.Fa oldloc
> +argument of
> +.Dv LC_GLOBAL_LOCALE
> +results in undefined behaviour.
> Index: locale/newlocale.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/locale/newlocale.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 newlocale.c
> --- locale/newlocale.c 5 Sep 2017 03:16:13 -0000 1.1
> +++ locale/newlocale.c 28 Mar 2019 13:56:37 -0000
> @@ -22,8 +22,7 @@
>   #include "rune.h"
>  
>   locale_t
> -newlocale(int mask, const char *locname,
> -    locale_t oldloc __attribute__((__unused__)))
> +newlocale(int mask, const char *locname, locale_t oldloc)
>   {
>   int ic, flag;
>  
> @@ -45,7 +44,7 @@ newlocale(int mask, const char *locname,
>  
>   /* Only character encoding has thread-specific effects. */
>   if ((mask & LC_CTYPE_MASK) == 0)
> - return _LOCALE_C;
> + return oldloc == _LOCALE_UTF8 ? _LOCALE_UTF8 : _LOCALE_C;
>  
>   /* The following may initialize UTF-8 for later use. */
>   if ((locname = _get_locname(LC_CTYPE, locname)) == NULL) {
>

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Ted Unangst-6
Karl Williamson wrote:

> On 3/28/19 8:03 AM, Ingo Schwarze wrote:
> >    It is unspecified whether the locale object pointed to by base
> >    shall be modified, or freed and a new locale object created.
>
> I can see how you might be able to argue for your interpretation.  I
> believe the wording in the spec is poor in this case (and it isn't the
> only such place).  I, and every other implementer takes the above text
> to mean, that it's up to the implementation as to how to combine 'base'
> with the new locale specified by the other parameters.  'base' can be
> modified and returned, or 'base' can be freed with some new locale which
> is the combination of 'base' and the other parameters.  I don't take
> that wording to mean that 'base' can be irrelevant.  That can't be the
> intent, as that would mean wildly unportable code, and no way for the
> program to specify an update to a category in isolation from the others.

I would agree.

> > In any case, in the commit message, do not call it a "bug fix",
> > describe it as a change for compatibility with other systems.
> >
> > Ted, feel free to either commit this version or tell me to commit.

Your version is good to commit. If you like, you may call it a workaround for a
specification bug that permitted the current behavior. :)

Reply | Threaded
Open this post in threaded view
|

Re: Cannot set locale categories independently with POSIX 2008 locales

Ingo Schwarze
Hi Ted and Karl,

Ted Unangst wrote on Fri, Mar 29, 2019 at 12:34:20AM -0400:
> Karl Williamson wrote:
>> On 3/28/19 8:03 AM, Ingo Schwarze wrote:

>>>    It is unspecified whether the locale object pointed to by base
>>>    shall be modified, or freed and a new locale object created.

>> I can see how you might be able to argue for your interpretation.
>> I believe the wording in the spec is poor in this case (and it
>> isn't the only such place).

Deplorably so.

>> I, and every other implementer takes the above text
>> to mean, that it's up to the implementation as to how to combine 'base'
>> with the new locale specified by the other parameters.  'base' can be
>> modified and returned, or 'base' can be freed with some new locale which
>> is the combination of 'base' and the other parameters.  I don't take
>> that wording to mean that 'base' can be irrelevant.  That can't be the
>> intent, as that would mean wildly unportable code, and no way for the
>> program to specify an update to a category in isolation from the others.

> I would agree.

Fair enough, you convinced me that, if my interpretation of the standard
is valid, then parts of the specification of newlocale(3) aren't useful.
So i now agree it's probably a bug in the standard, and i filed

  http://austingroupbugs.net/view.php?id=1243

>>> In any case, in the commit message, do not call it a "bug fix",
>>> describe it as a change for compatibility with other systems.
>>>
>>> Ted, feel free to either commit this version or tell me to commit.

> Your version is good to commit.

Committed, thanks for analyzing and reporting the issue
and for drafting and checking diffs.

> If you like, you may call it a workaround for a
> specification bug that permitted the current behavior. :)

While that would sound witty and humorous commit messages can be fun,
they can also provoke misunderstandings, so i opted for a neutral
wording of the commit message instead, see below.

Yours,
  Ingo


CVSROOT: /cvs
Module name: src
Changes by: [hidden email] 2019/03/29 06:34:44

Modified files:
        lib/libc/locale: newlocale.3 newlocale.c

Log message:
Copy categories outside "mask" from "oldloc" to the new locale object.
While POSIX appears to allow the old behaviour of ignoring "oldloc",
Ted and Karl convinced me that is a bug in the spec and the Austin
group almost certainly intended to require the new behaviour.
Anyway, compatibility strongly suggests the new behaviour because
most (or maybe even all?) other systems do not ignore "oldloc",
and some software appears to depend on the copying from "oldloc"
to the new locale.

Issue analyzed and reported by Karl Williamson <public at
khwilliamson dot com> with support from the Perl 5 community.
This final diff is similar to two earlier diffs from Ted,
but handles invalid input in a mode robust way.
OK tedu@.