[sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Stuart Henderson-6
Widening the audience to tech in case anyone with an idea missed it
on ports: it seems some people are having a lot more trouble than just
needing to restart build 2 or 3 times.


----- Forwarded message from Stuart Henderson <[hidden email]> -----

From: Stuart Henderson <[hidden email]>
Date: Tue, 15 Dec 2015 12:48:21 +0000
To: ports <[hidden email]>
User-Agent: Mutt/1.5.24 (2015-08-30)
Subject: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list
Mail-Followup-To: ports <[hidden email]>

The VLC build often fails with a bus error or segfault in ld.so when
running vlc-cache-gen. This program loads and unloads modules (via
dlopen) to produce a pregenerated list. It used to fail quite often;
then I figured out that using LD_PRELOAD to force loading of
libgobject-2.0.so worked around the crash, but more recently it has
started failing again whether or not this is done.
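
For readers unfamiliar with the pattern, the load/unload cycle described
above looks roughly like the sketch below. This is not VLC's code:
toy_probe_plugin is a hypothetical helper, and the use of RTLD_LAZY is an
assumption. Only dlopen(), dlerror() and dlclose() are real interfaces.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: load one plugin, then unload it again.
 * Returns 0 if both steps succeed, -1 otherwise.
 */
int
toy_probe_plugin(const char *path)
{
    void *handle;

    assert(path != NULL);

    handle = dlopen(path, RTLD_LAZY);   /* RTLD_LAZY is an assumption */
    if (handle == NULL) {
        const char *msg = dlerror();
        fprintf(stderr, "dlopen: %s\n", msg != NULL ? msg : "unknown error");
        return -1;
    }

    /* A real cache generator would read the module's metadata here. */

    if (dlclose(handle) != 0) {
        const char *msg = dlerror();
        fprintf(stderr, "dlclose: %s\n", msg != NULL ? msg : "unknown error");
        return -1;
    }
    return 0;
}
```

Calling such a helper once per file under the plugins directory gives the
many dlopen()/dlclose() pairs that the crash depends on.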

Typically it takes me 2 or 3 attempts to get this to work without
crashing, which is not ideal for bulk builds.

I wondered if it was the sdl2 fix in ld.so (which is actually
a fix for a similar situation with load/unload), but backing that
out doesn't seem to help.

Does anyone have an idea what might be up?

To replicate:

cd /usr/ports/x11/vlc
make fake
cd /usr/ports/pobj/vlc-2.2.1/vlc-2.2.1
PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/bin:$PATH" LD_LIBRARY_PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib:$LD_LIBRARY_PATH" /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/vlc-cache-gen -f /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins

repeat the last step until it crashes, e.g.

/usr/ports/pobj/vlc-2.2.1/vlc-2.2.1/bin/.libs/vlc-cache-gen:/usr/local/lib/libebml.so.3.0: undefined symbol '_ZNSs4_Rep10_M_destroyERKSaIcE'
lazy binding failed!
Segmentation fault (core dumped)

or

Bus error (core dumped)
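
Since the crash is intermittent, retrying by hand gets tedious. A minimal
sketch of automating the retry, assuming nothing beyond POSIX fork(),
execv() and waitpid(): toy_run_until_signal is a hypothetical name, the
child inherits the caller's environment (so PATH and LD_LIBRARY_PATH must
already be set as above), and argv[0] must be a full path.

```c
#include <assert.h>
#include <signal.h>     /* SIGSEGV, SIGBUS etc. for callers to compare against */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] up to max_tries times.
 * Returns the number of the signal that killed the child on the first
 * abnormal termination, 0 if no run was killed by a signal, and -1 if
 * fork() or waitpid() fails.
 */
int
toy_run_until_signal(char *const argv[], int max_tries)
{
    assert(argv != NULL && argv[0] != NULL);

    for (int i = 0; i < max_tries; i++) {
        int status;
        pid_t pid = fork();

        if (pid == -1)
            return -1;
        if (pid == 0) {
            execv(argv[0], argv);
            _exit(127);                 /* only reached if exec failed */
        }
        if (waitpid(pid, &status, 0) == -1)
            return -1;
        if (WIFSIGNALED(status))
            return WTERMSIG(status);    /* e.g. SIGSEGV or SIGBUS */
    }
    return 0;
}
```

A caller would pass the vlc-cache-gen command line from the step above and
compare the result against SIGSEGV and SIGBUS.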

An example backtrace, using ports gdb because base gdb gives "Dwarf
Error: Cannot find type of die":

$ egdb /usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/vlc-cache-gen vlc-cache-gen.core
GNU gdb (GDB) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-openbsd5.8".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/vlc-cache-gen...done.
[New process 18294]
Core was generated by `vlc-cache-gen'.
Program terminated with signal SIGBUS, Bus error.
#0  _dl_cache_grpsym_list (object=0xdfe3fdf7000) at /usr/src/libexec/ld.so/library_subr.c:554
554                     _dl_link_grpsym(n->data, 0);
(gdb) bt
#0  _dl_cache_grpsym_list (object=0xdfe3fdf7000) at /usr/src/libexec/ld.so/library_subr.c:554
#1  0x00000dfd96c062ed in _dl_cache_grpsym_list (object=0xdfe3fdf8a00) at /usr/src/libexec/ld.so/library_subr.c:557
#2  0x00000dfd96c062ed in _dl_cache_grpsym_list (object=0xdfe0d967000) at /usr/src/libexec/ld.so/library_subr.c:557
#3  0x00000dfd96c062ed in _dl_cache_grpsym_list (object=0xdfe3fdfb800) at /usr/src/libexec/ld.so/library_subr.c:557
#4  0x00000dfd96c013b6 in _dl_load_dep_libs (object=0xdfe3fdfb800, flags=0, booting=0) at /usr/src/libexec/ld.so/loader.c:348
#5  0x00000dfd96c04159 in dlopen (
    libname=0xdfdcb9d8e00 "/usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins/video_output/libgl_plugin.so",
    flags=<optimized out>) at /usr/src/libexec/ld.so/dlfcn.c:106
#6  0x00000dfe0333fef1 in module_Load (p_this=0xdfe4d913650,
    path=0xdfdcb9d8e00 "/usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins/video_output/libgl_plugin.so",
    p_handle=0x7f7fffff1190, lazy=64) at posix/plugin.c:60
#7  0x00000dfe0332a7f7 in module_InitDynamic (obj=0xdfe4d913650,
    path=0xdfdcb9d8e00 "/usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins/video_output/libgl_plugin.so",
    fast=<optimized out>) at modules/bank.c:583
#8  0x00000dfe0332ac7e in AllocatePluginFile (bank=<optimized out>, abspath=<optimized out>, relpath=<optimized out>, st=<optimized out>)
    at modules/bank.c:526
#9  AllocatePluginDir (bank=0x7f7fffff13a8, maxdepth=3, absdir=<optimized out>, reldir=0xdfd6af74c20 "video_output")
    at modules/bank.c:488
#10 0x00000dfe0332abaf in AllocatePluginDir (bank=0x7f7fffff13a8, maxdepth=4, absdir=<optimized out>, reldir=0x0) at modules/bank.c:492
#11 0x00000dfe0332a98f in AllocatePluginPath (p_this=0xdfe4d913650,
    path=0xdfdfb3ede40 "/usr/obj/ports/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins", mode=CACHE_RESET) at modules/bank.c:401
#12 0x00000dfe0332a45e in AllocateAllPlugins (p_this=0xdfe4d913650) at modules/bank.c:346
#13 module_LoadPlugins (obj=0xdfe4d913650) at modules/bank.c:184
#14 0x00000dfe032d2efe in libvlc_InternalInit (p_libvlc=0xdfe4d913650, i_argc=4, ppsz_argv=0x7f7fffff1590) at libvlc.c:151
#15 0x00000dfd970d7931 in libvlc_new (argc=<optimized out>, argv=<optimized out>) at core.c:59
#16 0x00000dfb52f010b6 in main (argc=3, argv=0x7f7fffff16a8) at cachegen.c:99


----- End forwarded message -----

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Philip Guenther-2
On Sun, Dec 27, 2015 at 9:13 AM, Stuart Henderson <[hidden email]> wrote:
> Widening the audience to tech in case anyone with an idea missed it
> on ports: it seems some people are having a lot more trouble than just
> needing to restart build 2 or 3 times.

Can you drop the LD_DEBUG=1 output from a crash somewhere?


Philip Guenther

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Stuart Henderson-6
On 2015/12/27 20:36, Philip Guenther wrote:

> On Sun, Dec 27, 2015 at 9:13 AM, Stuart Henderson <[hidden email]> wrote:
> > Widening the audience to tech in case anyone with an idea missed it
> > on ports: it seems some people are having a lot more trouble than just
> > needing to restart build 2 or 3 times.
>
> Can you drop the LD_DEBUG=1 output from a crash somewhere?
>
>
> Philip Guenther
>

Yes, here are two examples:

https://junkpile.org/vlc-cache-gen-ld.so-crash1.txt 2117KB
https://junkpile.org/vlc-cache-gen-ld.so-crash2.txt 700KB

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Stuart Henderson-6
On 2015/12/28 16:11, Stuart Henderson wrote:

> On 2015/12/27 20:36, Philip Guenther wrote:
> > On Sun, Dec 27, 2015 at 9:13 AM, Stuart Henderson <[hidden email]> wrote:
> > > Widening the audience to tech in case anyone with an idea missed it
> > > on ports: it seems some people are having a lot more trouble than just
> > > needing to restart build 2 or 3 times.
> >
> > Can you drop the LD_DEBUG=1 output from a crash somewhere?
> >
> >
> > Philip Guenther
> >
>
> Yes, here are two examples:
>
> https://junkpile.org/vlc-cache-gen-ld.so-crash1.txt 2117KB
> https://junkpile.org/vlc-cache-gen-ld.so-crash2.txt 700KB
>

More examples are in http://junkpile.org/vlc-cache-gen-lddebugs.tar.gz;
quite a few (though not all) of the crashes occur while loading
libnotify_plugin.so.

I'm going to re-add and extend the dirty hack for now.

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Philip Guenther-2
In reply to this post by Stuart Henderson-6
On Sun, 27 Dec 2015, Stuart Henderson wrote:
> Widening the audience to tech in case anyone with an idea missed it on
> ports: it seems some people are having a lot more trouble than just
> needing to restart build 2 or 3 times.
...

> To replicate
>
> cd /usr/ports/x11/vlc
> make fake
> cd /usr/ports/pobj/vlc-2.2.1/vlc-2.2.1
> PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/bin:$PATH" LD_LIBRARY_PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib:$LD_LIBRARY_PATH" /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/vlc-cache-gen -f /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins
>
> repeat the last step until it crashes, e.g.
>
> /usr/ports/pobj/vlc-2.2.1/vlc-2.2.1/bin/.libs/vlc-cache-gen:/usr/local/lib/libebml.so.3.0: undefined symbol '_ZNSs4_Rep10_M_destroyERKSaIcE'
> lazy binding failed!
> Segmentation fault (core dumped)
>
> or
>
> Bus error (core dumped)

Okay.

The problem is that vlc-cache-gen dlopen()s a plugin that has a dependency
(libgio) which is marked nodelete.  When the plugin is dlclose()d, that
dependency is correctly kept around...but other parts of the load group
*are* unmapped and their elf_object_t structures freed.

You Can't Do That: the objects that make up a load group must be deleted
all at once or not at all, as they may have resolved relocations to each
other and there are certainly pointers between their elf_object_t
structures via the grpref_list, grpsym_list, and load_object members.

So, once a nodelete object is brought in, the entire load group needs to
be locked in.  The diff below does this by changing the nodelete bits to
add an "open" reference on the root of the load group that pulled in the
nodelete object instead of a "child" reference on the nodelete itself.  
(The type of reference was always wrong and may have permitted nodelete
modules being deleted even in simpler cases.)
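
The difference between the two kinds of reference can be illustrated with
a toy model. Everything below is a simplified stand-in and an assumption
for illustration: struct toy_object is not elf_object_t, the TOY_* flags
only mimic DF_1_NODELETE and STAT_NODELETE, and the unload rule in
toy_dlclose is far cruder than what ld.so really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DF_NODELETE   0x1   /* stands in for DF_1_NODELETE */
#define TOY_STAT_NODELETE 0x2   /* stands in for STAT_NODELETE */

/* Toy stand-in for elf_object_t; only the fields the argument needs. */
struct toy_object {
    struct toy_object *load_object; /* root of this object's load group */
    unsigned int obj_flags;
    unsigned int status;
    int refcount;                   /* "child" references */
    int opencount;                  /* "open" references on a group root */
};

/* Old behaviour: pin only the nodelete object itself. */
void
toy_add_object_old(struct toy_object *o)
{
    if ((o->obj_flags & TOY_DF_NODELETE) &&
        (o->status & TOY_STAT_NODELETE) == 0) {
        o->refcount++;
        o->status |= TOY_STAT_NODELETE;
    }
}

/* New behaviour: pin the root of the load group instead. */
void
toy_add_object_new(struct toy_object *o)
{
    assert(o->load_object != NULL);
    if ((o->obj_flags & TOY_DF_NODELETE) &&
        (o->load_object->status & TOY_STAT_NODELETE) == 0) {
        o->load_object->opencount++;
        o->load_object->status |= TOY_STAT_NODELETE;
    }
}

/* Toy dlclose(): drop one open reference; true means the group is torn down. */
bool
toy_dlclose(struct toy_object *root)
{
    root->opencount--;
    return root->opencount == 0;
}
```

With the old variant, a plugin opened once and closed once is torn down
even though its nodelete dependency stays pinned, which is the partial
teardown described above. With the new variant the root keeps an extra
open reference, so the same dlclose() leaves the whole group in place.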


Ports question: does libgio still need to be marked nodelete, or was that
just from when pthread_atfork() handlers weren't unregistered on
dlclose()?


Ok?

Philip Guenther


Index: resolve.c
===================================================================
RCS file: /data/src/openbsd/src/libexec/ld.so/resolve.c,v
retrieving revision 1.69
diff -u -p -r1.69 resolve.c
--- resolve.c 2 Nov 2015 07:02:53 -0000 1.69
+++ resolve.c 16 Jan 2016 23:06:46 -0000
@@ -54,12 +54,15 @@ elf_object_t *_dl_loading_object;
 void
 _dl_add_object(elf_object_t *object)
 {
-       /* if a .so is marked nodelete, then add a reference */
+       /*
+        * If a .so is marked nodelete, then the entire load group that it's
+        * in needs to be kept around forever, so add a reference there.
+        */
        if (object->obj_flags & DF_1_NODELETE &&
-           (object->status & STAT_NODELETE) == 0) {
+           (object->load_object->status & STAT_NODELETE) == 0) {
                DL_DEB(("objname %s is nodelete\n", object->load_name));
-               object->refcount++;
-               object->status |= STAT_NODELETE;
+               object->load_object->opencount++;
+               object->load_object->status |= STAT_NODELETE;
        }
 
        /*

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Stuart Henderson-6
On 2016/01/16 15:10, Philip Guenther wrote:

> On Sun, 27 Dec 2015, Stuart Henderson wrote:
> > Widening the audience to tech in case anyone with an idea missed it on
> > ports: it seems some people are having a lot more trouble than just
> > needing to restart build 2 or 3 times.
> ...
> > To replicate
> >
> > cd /usr/ports/x11/vlc
> > make fake
> > cd /usr/ports/pobj/vlc-2.2.1/vlc-2.2.1
> > PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/bin:$PATH" LD_LIBRARY_PATH="/usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib:$LD_LIBRARY_PATH" /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/vlc-cache-gen -f /usr/ports/pobj/vlc-2.2.1/fake-amd64/usr/local/lib/vlc/plugins
> >
> > repeat the last step until it crashes, e.g.
> >
> > /usr/ports/pobj/vlc-2.2.1/vlc-2.2.1/bin/.libs/vlc-cache-gen:/usr/local/lib/libebml.so.3.0: undefined symbol '_ZNSs4_Rep10_M_destroyERKSaIcE'
> > lazy binding failed!
> > Segmentation fault (core dumped)
> >
> > or
> >
> > Bus error (core dumped)
>
> Okay.

Thanks for looking at this, which is obviously a tricky area.

> The problem is that vlc-cache-gen dlopen()s a plugin that has a dependency
> (libgio) which is marked nodelete.  When the plugin is dlclose()d, that
> dependency is correctly kept around...but other parts of the load group
> *are* unmapped and their elf_object_t structures freed.
>
> You Can't Do That: the objects that make up a load group must be deleted
> all at once or not at all, as they may have resolved relocations to each
> other and there are certainly pointers between their elf_object_t
> structures via the grpref_list, grpsym_list, and load_object members.
>
> So, once a nodelete object is brought in, the entire load group needs to
> be locked in.  The diff below does this by changing the nodelete bits to
> add an "open" reference on the root of the load group that pulled in the
> nodelete object instead of a "child" reference on the nodelete itself.  
> (The type of reference was always wrong and may have permitted nodelete
> modules being deleted even in simpler cases.)

This makes complete sense.

> Ports question: does libgio still need to be marked nodelete, or was that
> just from when pthread_atfork() handlers weren't unregistered on
> dlclose()?

I'm unsure about libgio, but at least gobject (which is another
implicated library) apparently does: "Since the type system does not
support reloading its data and assumes that libgobject remains loaded
for the lifetime of the process, we should link libgobject with a flag
indicating that it can't be unloaded."

https://mail.gnome.org/archives/commits-list/2014-April/msg02316.html
https://bugzilla.gnome.org/show_bug.cgi?id=707298

> Ok?

Yes, OK.

There's a small side-effect: with LD_DEBUG, now only the first object
from the load group is reported as being 'nodelete'.

$ LD_DEBUG=1 /usr/local/lib/vlc/vlc-cache-gen -f . 2>&1 | grep nodel | uniq
objname /usr/lib/libpthread.so.20.1 is nodelete
objname /usr/local/lib/libgthread-2.0.so.4200.2 is nodelete

I don't know if that's considered important, but if it's desirable to
keep it then this diff relative to yours would do so:

$ LD_DEBUG=1 /usr/local/lib/vlc/vlc-cache-gen -f . 2>&1 | grep nodel | uniq
objname /usr/lib/libpthread.so.20.1 is nodelete
objname /usr/local/lib/libgthread-2.0.so.4200.2 is nodelete
objname /usr/local/lib/libgmodule-2.0.so.4200.2 is nodelete
objname /usr/local/lib/libgio-2.0.so.4200.2 is nodelete
objname /usr/local/lib/libgobject-2.0.so.4200.2 is nodelete
objname /usr/local/lib/libglib-2.0.so.4200.2 is nodelete

--- resolve.c, Mon Jan 18 12:38:27 2016
+++ resolve.c Mon Jan 18 12:47:03 2016
@@ -59,8 +59,12 @@ _dl_add_object(elf_object_t *object)
         * in needs to be kept around forever, so add a reference there.
         */
        if (object->obj_flags & DF_1_NODELETE &&
-           (object->load_object->status & STAT_NODELETE) == 0) {
+           (object->status & STAT_NODELETE) == 0) {
+               object->status |= STAT_NODELETE;
                DL_DEB(("objname %s is nodelete\n", object->load_name));
+       }
+       if (object->obj_flags & DF_1_NODELETE &&
+           (object->load_object->status & STAT_NODELETE) == 0) {
                object->load_object->opencount++;
                object->load_object->status |= STAT_NODELETE;
        }
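
The idea is to separate two things: a per-object flag that only controls
the debug message, and a per-group pin that is taken at most once. A
minimal sketch of that split, using a toy struct rather than elf_object_t
(the TOY_* names and toy_add_object_report are hypothetical, and the
return value replaces the DL_DEB call so the effect can be checked):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DF_NODELETE   0x1   /* stands in for DF_1_NODELETE */
#define TOY_STAT_NODELETE 0x2   /* stands in for STAT_NODELETE */

/* Toy stand-in for elf_object_t. */
struct toy_object {
    struct toy_object *load_object; /* root of this object's load group */
    unsigned int obj_flags;
    unsigned int status;
    int opencount;
};

/* Returns true when this call would have printed the "is nodelete" line. */
bool
toy_add_object_report(struct toy_object *o)
{
    bool reported = false;

    assert(o->load_object != NULL);
    if ((o->obj_flags & TOY_DF_NODELETE) == 0)
        return false;

    /* Step 1: per-object flag, controls the debug message only. */
    if ((o->status & TOY_STAT_NODELETE) == 0) {
        o->status |= TOY_STAT_NODELETE;
        reported = true;
    }
    /* Step 2: per-group pin, taken at most once per load group. */
    if ((o->load_object->status & TOY_STAT_NODELETE) == 0) {
        o->load_object->opencount++;
        o->load_object->status |= TOY_STAT_NODELETE;
    }
    return reported;
}
```

Two nodelete objects in the same group are each reported once while the
root is pinned only once. One thing the sketch makes visible: because both
steps test the same status bit, an object that is its own group root is
flagged in step 1 and then skips step 2. Whether that case matters in
ld.so is not something I have checked.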

> Philip Guenther
>
>
> Index: resolve.c
> ===================================================================
> RCS file: /data/src/openbsd/src/libexec/ld.so/resolve.c,v
> retrieving revision 1.69
> diff -u -p -r1.69 resolve.c
> --- resolve.c 2 Nov 2015 07:02:53 -0000 1.69
> +++ resolve.c 16 Jan 2016 23:06:46 -0000
> @@ -54,12 +54,15 @@ elf_object_t *_dl_loading_object;
>  void
>  _dl_add_object(elf_object_t *object)
>  {
> -       /* if a .so is marked nodelete, then add a reference */
> +       /*
> +        * If a .so is marked nodelete, then the entire load group that it's
> +        * in needs to be kept around forever, so add a reference there.
> +        */
>         if (object->obj_flags & DF_1_NODELETE &&
> -           (object->status & STAT_NODELETE) == 0) {
> +           (object->load_object->status & STAT_NODELETE) == 0) {
>                 DL_DEB(("objname %s is nodelete\n", object->load_name));
> -               object->refcount++;
> -               object->status |= STAT_NODELETE;
> +               object->load_object->opencount++;
> +               object->load_object->status |= STAT_NODELETE;
>         }
>  
>         /*

Re: [sthen@openbsd.org: vlc, ld.so sigsegv/sigbus: _dl_cache_grpsym_list]

Philip Guenther-2
I've committed my diff.  However, based on my memory of a blog post by
the Solaris ld.so guru and a quick test against Linux, I expanded the
comment further to explain what a better solution (more like Solaris
and glibc) would involve.  That means anyone with the original diff in
their tree will get a conflict when they "cvs update".  Sorry, but I
wanted to record the info before I forgot it.


On Mon, Jan 18, 2016 at 4:52 AM, Stuart Henderson <[hidden email]> wrote:
> There's a small side-effect: with LD_DEBUG, now only the first object
> from the load group is reported as being 'nodelete'.
...

> --- resolve.c,  Mon Jan 18 12:38:27 2016
> +++ resolve.c   Mon Jan 18 12:47:03 2016
> @@ -59,8 +59,12 @@ _dl_add_object(elf_object_t *object)
>          * in needs to be kept around forever, so add a reference there.
>          */
>         if (object->obj_flags & DF_1_NODELETE &&
> -           (object->load_object->status & STAT_NODELETE) == 0) {
> +           (object->status & STAT_NODELETE) == 0) {
> +               object->status |= STAT_NODELETE;
>                 DL_DEB(("objname %s is nodelete\n", object->load_name));
> +       }
> +       if (object->obj_flags & DF_1_NODELETE &&
> +           (object->load_object->status & STAT_NODELETE) == 0) {
>                 object->load_object->opencount++;
>                 object->load_object->status |= STAT_NODELETE;
>         }

Sure.


Philip Guenther