collectd on 6.3/octeon segfault - write_graphite

Jonathon Sisson
Hey ports,

I seem to have tripped over a bug where collectd segfaults on 6.3/octeon
whenever write_graphite is enabled in the config.  If I instead use the
network plugin to relay metrics to a 6.3/amd64 machine running a
collectd relay (with write_graphite), the crashes go away.  Since I
have a reasonable workaround, this isn't a pressing issue, but I thought
I'd report it to see if anyone else is experiencing it.  I'm able to
reproduce the crash on multiple ERLs running OpenBSD 6.3.
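
For readers who want to try the same workaround, here is a minimal sketch
of what the two configurations might look like.  It is not the actual
config from the affected machines: the relay address (192.0.2.10), the
Graphite host (graphite.example.org) and the node name ("relay") are
placeholders, and the ports are the usual collectd and Graphite defaults.

```
# On the octeon machine: send metrics over the network plugin
# instead of loading write_graphite locally.
# 192.0.2.10 is a placeholder for the amd64 relay's address.
LoadPlugin network
<Plugin network>
  Server "192.0.2.10" "25826"
</Plugin>

# On the amd64 relay: receive from the network plugin and forward
# to Graphite.  Host and node name below are placeholders.
LoadPlugin network
LoadPlugin write_graphite
<Plugin network>
  Listen "0.0.0.0" "25826"
</Plugin>
<Plugin write_graphite>
  <Node "relay">
    Host "graphite.example.org"
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```

With this split, write_graphite only ever runs on amd64, which matches
the setup where the crash does not occur.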

ktrace/kdump didn't provide a lot of useful information to my untrained
eye, but I'd be more than happy to supply that info to anyone who wants
additional detail.

-Jonathon

Re: collectd on 6.3/octeon segfault - write_graphite

Jonathon Sisson
I poked at this a bit more at the request of an off-list message,
and found the following:

(gdb) r -f -C /etc/collectd-segfault.conf
Starting program: /usr/local/sbin/collectd -f -C /etc/collectd-segfault.conf

Program received signal SIGSEGV, Segmentation fault.
0x00000005422d40bc in _thread_sys_nanosleep () at {standard input}:6
6       {standard input}: No such file or directory.
        in {standard input}
Current language:  auto; currently asm
(gdb) bt
#0  0x00000005422d40bc in _thread_sys_nanosleep () at {standard input}:6
#1  0x0000000542282c6c in *_libc_nanosleep_cancel (timeout=0xfffffcf470, remainder=0xfffffcf470)
    at /usr/src/lib/libc/sys/w_nanosleep.c:27
#2  0x00000000d4aa119c in main (argc=Variable "argc" is not available.
) at collectd.c:328
(gdb)

I swapped the config out and tried it without the write_graphite plugin,
running under gdb with 'r -f -C /etc/collectd.conf', and it ran without
crashing.  (As expected, I did get the same output as above when I hit
Ctrl-C on the working config running under gdb.)

Running gdb with the core dump revealed the following:

(gdb) bt
#0  0x000000048e25c000 in ?? ()
warning: GDB can't find the start of the function at 0x48e25c000.

    GDB is unable to find the start of the function at 0x48e25c000
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
    This problem is most likely caused by an invalid program counter or
stack pointer.
    However, if you think GDB should simply search farther back
from 0x48e25c000 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
#1  0x000000048e25c000 in ?? ()
warning: GDB can't find the start of the function at 0x48e25c000
Previous frame identical to this frame (corrupt stack?)
(gdb) set heuristic-fence-post  0x48e000000
(gdb) bt
#0  0x000000048e25c000 in ?? ()
Cannot access memory at address 0x48e25bffc
(gdb) set heuristic-fence-post  0x48ffff000
(gdb) bt
#0  0x000000048e25c000 in ?? ()
Cannot access memory at address 0x48e25bffc

If anyone has further debugging suggestions, I'm all ears.

-Jonathon