How to debug hanging machines / proc: table is full

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

How to debug hanging machines / proc: table is full

Raimo Niskanen-7
Hi misc@!

If anyone has got some tips about how to debug two hanging machines we have
in our test lab I am eager to learn.

The machines runs 6.5, amd64 and are patched up to 005_libssl using M:Tier's
openup.  Other than that they are rather different, one small Zotac
ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
R230 with Intel Xeon E3-1220.

The overall symptoms are that it is possible to switch screens using
Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
have to long press the power button to force power off.

This happens during our nightly tests, that are quite resource intesive.

In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
possibly before the machines become inresponsive, but these entries appear
many more times before that point.  And after this "table is full" message
there are many syslog entries; on one machine smartd constatly complains about
an unreadable (pending) sector and atascsi_passthru_done timeout, and on
the other the kernel complains about a probed monitor but no|invalid EDID.

So it seems the machine is out of some resource and fails to spawn a login
shell.  Any clues to how I can find more details and a remedy?  I suspect a
full process table, but wonder how to detect and|or avoid that.

I have considered having systat running on a console screen but do not know
which systat display that might tell me anything.

Best regards
--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Stuart Henderson
On 2019-07-02, Raimo Niskanen <[hidden email]> wrote:

> Hi misc@!
>
> If anyone has got some tips about how to debug two hanging machines we have
> in our test lab I am eager to learn.
>
> The machines runs 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> openup.  Other than that they are rather different, one small Zotac
> ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> R230 with Intel Xeon E3-1220.
>
> The overall symptoms are that it is possible to switch screens using
> Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> have to long press the power button to force power off.
>
> This happens during our nightly tests, that are quite resource intesive.
>
> In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> possibly before the machines become inresponsive, but these entries appear
> many more times before that point.  And after this "table is full" message
> there are many syslog entries; on one machine smartd constatly complains about
> an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> the other the kernel complains about a probed monitor but no|invalid EDID.
>
> So it seems the machine is out of some resource and fails to spawn a login
> shell.  Any clues to how I can find more details and a remedy?  I suspect a
> full process table, but wonder how to detect and|or avoid that.
>
> I have considered having systat running on a console screen but do not know
> which systat display that might tell me anything.
>
> Best regards

"/bsd: proc: table is full" means that the process table is full, but it doesn't
tell you what caused this.

The process table size is controlled by kern.maxproc, it is possible
that the default is insufficient for your needs, but it's also possible
that there was a build-up of processes that didn't exit due to another
problem on the system.

I would leave top(1) running on the system, and also save "ps ax" output
regularly, then look at that output in the run-up to a failure, to see
if that gives clues.


Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:

> On 2019-07-02, Raimo Niskanen <[hidden email]> wrote:
> > Hi misc@!
> >
> > If anyone has got some tips about how to debug two hanging machines we have
> > in our test lab I am eager to learn.
> >
> > The machines runs 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > openup.  Other than that they are rather different, one small Zotac
> > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > R230 with Intel Xeon E3-1220.
> >
> > The overall symptoms are that it is possible to switch screens using
> > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> > have to long press the power button to force power off.
> >
> > This happens during our nightly tests, that are quite resource intesive.
> >
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become inresponsive, but these entries appear
> > many more times before that point.  And after this "table is full" message
> > there are many syslog entries; on one machine smartd constatly complains about
> > an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> > the other the kernel complains about a probed monitor but no|invalid EDID.
> >
> > So it seems the machine is out of some resource and fails to spawn a login
> > shell.  Any clues to how I can find more details and a remedy?  I suspect a
> > full process table, but wonder how to detect and|or avoid that.
> >
> > I have considered having systat running on a console screen but do not know
> > which systat display that might tell me anything.
> >
> > Best regards
>
> "/bsd: proc: table is full" means that the process table is full, but it doesn't
> tell you what caused this.
>
> The process table size is controlled by kern.maxproc, it is possible
> that the default is insufficient for your needs, but it's also possible
> that there was a build-up of processes that didn't exit due to another
> problem on the system.
>
> I would leave top(1) running on the system, and also save "ps ax" output
> regularly, then look at that output in the run-up to a failure, to see
> if that gives clues.
>

Great!  I will do that...
--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Kenneth Gober
In reply to this post by Raimo Niskanen-7
On Tue, Jul 2, 2019 at 10:06 AM Raimo Niskanen <
[hidden email]> wrote:

> In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> possibly before the machines become inresponsive, but these entries appear
> many more times before that point.  And after this "table is full" message
> there are many syslog entries; on one machine smartd constatly complains
> about
> an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> the other the kernel complains about a probed monitor but no|invalid EDID.
>

In addition to Stuart's suggestion to leave top(1) running, and
periodically save "ps ax"
output, it might also be a good idea to start up a bunch of nested shells
and just leave
them running.  This will reserve a bunch of process table slots, which you
will be able to
use via "exec", the idea being that if you can't fork new processes, you
can at least use
exec to replace an existing ksh process with something else.  This will
hopefully give you
some limited ability to run a few post-mortem diagnostic commands before
you run out
of reserved process table slots.

-ken
Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Tue, Jul 09, 2019 at 10:33:46AM -0400, Kenneth Gober wrote:

> On Tue, Jul 2, 2019 at 10:06 AM Raimo Niskanen <
> [hidden email]> wrote:
>
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become inresponsive, but these entries appear
> > many more times before that point.  And after this "table is full" message
> > there are many syslog entries; on one machine smartd constatly complains
> > about
> > an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> > the other the kernel complains about a probed monitor but no|invalid EDID.
> >
>
> In addition to Stuart's suggestion to leave top(1) running, and
> periodically save "ps ax"
> output, it might also be a good idea to start up a bunch of nested shells
> and just leave
> them running.  This will reserve a bunch of process table slots, which you
> will be able to
> use via "exec", the idea being that if you can't fork new processes, you
> can at least use
> exec to replace an existing ksh process with something else.  This will
> hopefully give you
> some limited ability to run a few post-mortem diagnostic commands before
> you run out
> of reserved process table slots.
>
> -ken

That's a nice one.  Thank you!

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
In reply to this post by Stuart Henderson
On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:

> On 2019-07-02, Raimo Niskanen <[hidden email]> wrote:
> > Hi misc@!
> >
> > If anyone has got some tips about how to debug two hanging machines we have
> > in our test lab I am eager to learn.
> >
> > The machines runs 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > openup.  Other than that they are rather different, one small Zotac
> > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > R230 with Intel Xeon E3-1220.
> >
> > The overall symptoms are that it is possible to switch screens using
> > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> > have to long press the power button to force power off.
> >
> > This happens during our nightly tests, that are quite resource intesive.
> >
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become inresponsive, but these entries appear
> > many more times before that point.  And after this "table is full" message
> > there are many syslog entries; on one machine smartd constatly complains about
> > an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> > the other the kernel complains about a probed monitor but no|invalid EDID.
> >
> > So it seems the machine is out of some resource and fails to spawn a login
> > shell.  Any clues to how I can find more details and a remedy?  I suspect a
> > full process table, but wonder how to detect and|or avoid that.
> >
> > I have considered having systat running on a console screen but do not know
> > which systat display that might tell me anything.
> >
> > Best regards
>
> "/bsd: proc: table is full" means that the process table is full, but it doesn't
> tell you what caused this.
>
> The process table size is controlled by kern.maxproc, it is possible
> that the default is insufficient for your needs, but it's also possible
> that there was a build-up of processes that didn't exit due to another
> problem on the system.
>
> I would leave top(1) running on the system, and also save "ps ax" output
> regularly, then look at that output in the run-up to a failure, to see
> if that gives clues.
>

It seems that the full process table is a secondary symptom, and that there
is something else that happens on the machines a few hours before the
process table fills...

On one machine I hade left "systat pigs" running, and the last thing it
showed was about 90% for softnet and the rest <idle>, IIRC.

I have now corrected a presumably unrelated error in our nightly tests that
occured just before the freeze.  The test started a child process that was
abandoned, and when it noticed its controlling socket close it started to
write an error log.  Previously that froze sometimes and a few hours later
the process table got full.  Now the child process is not abandoned, and
I have not seen the freeze since...

Still chasing ghosts, this can simply not be over yet.

Best Regards
--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Mon, Jul 15, 2019 at 08:55:35AM +0200, Raimo Niskanen wrote:

> On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:
> > On 2019-07-02, Raimo Niskanen <[hidden email]> wrote:
> > > Hi misc@!
> > >
> > > If anyone has got some tips about how to debug two hanging machines we have
> > > in our test lab I am eager to learn.
> > >
> > > The machines runs 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > > openup.  Other than that they are rather different, one small Zotac
> > > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > > R230 with Intel Xeon E3-1220.
> > >
> > > The overall symptoms are that it is possible to switch screens using
> > > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > > prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> > > have to long press the power button to force power off.
> > >
> > > This happens during our nightly tests, that are quite resource intesive.
> > >
> > > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > > possibly before the machines become inresponsive, but these entries appear
> > > many more times before that point.  And after this "table is full" message
> > > there are many syslog entries; on one machine smartd constatly complains about
> > > an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> > > the other the kernel complains about a probed monitor but no|invalid EDID.
> > >
> > > So it seems the machine is out of some resource and fails to spawn a login
> > > shell.  Any clues to how I can find more details and a remedy?  I suspect a
> > > full process table, but wonder how to detect and|or avoid that.
> > >
> > > I have considered having systat running on a console screen but do not know
> > > which systat display that might tell me anything.
> > >
> > > Best regards
> >
> > "/bsd: proc: table is full" means that the process table is full, but it doesn't
> > tell you what caused this.
> >
> > The process table size is controlled by kern.maxproc, it is possible
> > that the default is insufficient for your needs, but it's also possible
> > that there was a build-up of processes that didn't exit due to another
> > problem on the system.
> >
> > I would leave top(1) running on the system, and also save "ps ax" output
> > regularly, then look at that output in the run-up to a failure, to see
> > if that gives clues.
> >
>
> It seems that the full process table is a secondary symptom, and that there
> is something else that happens on the machines a few hours before the
> process table fills...
>
> On one machine I hade left "systat pigs" running, and the last thing it
> showed was about 90% for softnet and the rest <idle>, IIRC.
>
> I have now corrected a presumably unrelated error in our nightly tests that
> occured just before the freeze.  The test started a child process that was
> abandoned, and when it noticed its controlling socket close it started to
> write an error log.  Previously that froze sometimes and a few hours later
> the process table got full.  Now the child process is not abandoned, and
> I have not seen the freeze since...
>
> Still chasing ghosts, this can simply not be over yet.

A new hang, I tried to invstigate:

At July 19 the last log entry from my 'ps' log was from 14:55, which is
also the time on the 'systat vmstat' screen when it froze.  Then the machine
hums along but just after midnight at 00:42:01 the first "/bsd: process:
table is full" entry appears.  That message repeats until I rebooted it
today at July 29 10:48.

I had a terminal with top running.  It was still updating.  It showed about
98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
the process table had gotten full) it had 1282 idle processes and 1 on
processor, which was 'top' itself.
Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.

I had 8 shells under tmux ready for debugging.  'ls worked.
'systat' on one hung.  'top' on another failed with "cannot fork".
'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
and then hung.  'exec reboot' did not succeed.  Neither did a short power
button, that at least caused a printout "stopping daemon nginx(failed)",
but got no further.  I had to do a hard power off.

My theory now is that our daily tests right before 14:55 started a process
(this process is the top 'top' process with 10:14 execution time) that
triggers a lock or other contention problem in the kernel which causes
one CPU to spin in the system, and blocks processes from dying.
About 10 hours later the process table gets full.

Any, ANY ideas of how to proceed would be appreciated!

Best Regards
--
Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Stuart Henderson
On 2019-07-29, Raimo Niskanen <[hidden email]> wrote:

> A new hang, I tried to invstigate:
>
> At July 19 the last log entry from my 'ps' log was from 14:55, which is
> also the time on the 'systat vmstat' screen when it froze.  Then the machine
> hums along but just after midnight at 00:42:01 the first "/bsd: process:
> table is full" entry appears.  That message repeats until I rebooted it
> today at July 29 10:48.
>
> I had a terminal with top running.  It was still updating.  It showed about
> 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> the process table had gotten full) it had 1282 idle processes and 1 on
> processor, which was 'top' itself.
> Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
>
> I had 8 shells under tmux ready for debugging.  'ls worked.
> 'systat' on one hung.  'top' on another failed with "cannot fork".
> 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> and then hung.  'exec reboot' did not succeed.  Neither did a short power
> button, that at least caused a printout "stopping daemon nginx(failed)",
> but got no further.  I had to do a hard power off.
>
> My theory now is that our daily tests right before 14:55 started a process
> (this process is the top 'top' process with 10:14 execution time) that
> triggers a lock or other contention problem in the kernel which causes
> one CPU to spin in the system, and blocks processes from dying.
> About 10 hours later the process table gets full.
>
> Any, ANY ideas of how to proceed would be appreciated!
>
> Best Regards

Did you notice any odd waitchan's (WAIT in top output)?

Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
set), then try to break into DDB during a hang and see how things look
in ps there. (Test breaking into DDB before a hang first so you know
that you can do it .. you can just "c" to continue).

There might also be clues in things like "sh malloc" or "sh all pools".

Perhaps you could also get clues from running a kernel built with
'option WITNESS', you may get some messages in dmesg, or it adds commands
to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
details).

Can you provoke a hang by running this process manually?

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:

> On 2019-07-29, Raimo Niskanen <[hidden email]> wrote:
> > A new hang, I tried to invstigate:
> >
> > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > also the time on the 'systat vmstat' screen when it froze.  Then the machine
> > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > table is full" entry appears.  That message repeats until I rebooted it
> > today at July 29 10:48.
> >
> > I had a terminal with top running.  It was still updating.  It showed about
> > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> > the process table had gotten full) it had 1282 idle processes and 1 on
> > processor, which was 'top' itself.
> > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> >
> > I had 8 shells under tmux ready for debugging.  'ls worked.
> > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > and then hung.  'exec reboot' did not succeed.  Neither did a short power
> > button, that at least caused a printout "stopping daemon nginx(failed)",
> > but got no further.  I had to do a hard power off.
> >
> > My theory now is that our daily tests right before 14:55 started a process
> > (this process is the top 'top' process with 10:14 execution time) that
> > triggers a lock or other contention problem in the kernel which causes
> > one CPU to spin in the system, and blocks processes from dying.
> > About 10 hours later the process table gets full.
> >
> > Any, ANY ideas of how to proceed would be appreciated!
> >
> > Best Regards
>
> Did you notice any odd waitchan's (WAIT in top output)?

I do not think so:
  select (for the possibly triggering process), - (for 'top'), kqread, netlock,
  bpf, wait, piperd.

>
> Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> set), then try to break into DDB during a hang and see how things look
> in ps there. (Test breaking into DDB before a hang first so you know
> that you can do it .. you can just "c" to continue).
>
> There might also be clues in things like "sh malloc" or "sh all pools".

Sounds like fun - will try that!

>
> Perhaps you could also get clues from running a kernel built with
> 'option WITNESS', you may get some messages in dmesg, or it adds commands
> to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> details).

Maybe later.  I have gotten used to not compiling my kernel...

>
> Can you provoke a hang by running this process manually?

Might be worth a try to repeat the suspected test case many times.
I will try.

Thanks for the hints!
--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
In reply to this post by Stuart Henderson
On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:

> On 2019-07-29, Raimo Niskanen <[hidden email]> wrote:
> > A new hang, I tried to invstigate:
> >
> > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > also the time on the 'systat vmstat' screen when it froze.  Then the machine
> > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > table is full" entry appears.  That message repeats until I rebooted it
> > today at July 29 10:48.
> >
> > I had a terminal with top running.  It was still updating.  It showed about
> > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> > the process table had gotten full) it had 1282 idle processes and 1 on
> > processor, which was 'top' itself.
> > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> >
> > I had 8 shells under tmux ready for debugging.  'ls worked.
> > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > and then hung.  'exec reboot' did not succeed.  Neither did a short power
> > button, that at least caused a printout "stopping daemon nginx(failed)",
> > but got no further.  I had to do a hard power off.
> >
> > My theory now is that our daily tests right before 14:55 started a process
> > (this process is the top 'top' process with 10:14 execution time) that
> > triggers a lock or other contention problem in the kernel which causes
> > one CPU to spin in the system, and blocks processes from dying.
> > About 10 hours later the process table gets full.
> >
> > Any, ANY ideas of how to proceed would be appreciated!
> >
> > Best Regards
>
> Did you notice any odd waitchan's (WAIT in top output)?
>
> Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> set), then try to break into DDB during a hang and see how things look
> in ps there. (Test breaking into DDB before a hang first so you know
> that you can do it .. you can just "c" to continue).
>
> There might also be clues in things like "sh malloc" or "sh all pools".
>
> Perhaps you could also get clues from running a kernel built with
> 'option WITNESS', you may get some messages in dmesg, or it adds commands
> to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> details).

I have enabled Witness, it went so-so.  We'll see what it catches.

I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
applied all patches for stable 001-006 and built a kernel with:
  include "arch/amd64/conf/GENERIC"
  option MULTIPROCESSOR
  option MP_LOCKDEBUG
  option WITNESS

Then I activated in /etc/sysctl.conf:
  ddb.console=1
  kern.witness.locktrace=1
  kern.witness.watch=3

For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
"show witness".  It printed lots of info, I scrolled down to the end, but
during the printout there was an UVM fault:

  Spin locks:
  /usr/src/sys/....
  :
  bla bla bla
  :
  uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
  kernel: page fault trap, code=0
  Faulted in DDB: continuing...

Then I typed "cont" and it panicked.
If anybody want details I took a picture.

Have I combined too many debugging options, or is this sh*t that happens?

Nevertheless, now the machine is running again, with Witness...

I'll be back.


>
> Can you provoke a hang by running this process manually?

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Visa Hankala-2
On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:

> I have enabled Witness, it went so-so.  We'll see what it catches.
>
> I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> applied all patches for stable 001-006 and built a kernel with:
>   include "arch/amd64/conf/GENERIC"
>   option MULTIPROCESSOR
>   option MP_LOCKDEBUG
>   option WITNESS
>
> Then I activated in /etc/sysctl.conf:
>   ddb.console=1
>   kern.witness.locktrace=1
>   kern.witness.watch=3
>
> For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
> "show witness".  It printed lots of info, I scrolled down to the end, but
> during the printout there was an UVM fault:
>
>   Spin locks:
>   /usr/src/sys/....
>   :
>   bla bla bla
>   :
>   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
>   kernel: page fault trap, code=0
>   Faulted in DDB: continuing...

The output of "show witness" is unlikely to be useful in your case.
It is more of a tool for debugging witness. You can ignore it.
However, "show all locks" might display interesting information
after a witness-related panic.

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Wed, Jul 31, 2019 at 04:20:12PM +0000, Visa Hankala wrote:

> On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > I have enabled Witness, it went so-so.  We'll see what it catches.
> >
> > I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > applied all patches for stable 001-006 and built a kernel with:
> >   include "arch/amd64/conf/GENERIC"
> >   option MULTIPROCESSOR
> >   option MP_LOCKDEBUG
> >   option WITNESS
> >
> > Then I activated in /etc/sysctl.conf:
> >   ddb.console=1
> >   kern.witness.locktrace=1
> >   kern.witness.watch=3
> >
> > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
> > "show witness".  It printed lots of info, I scrolled down to the end, but
> > during the printout there was an UVM fault:
> >
> >   Spin locks:
> >   /usr/src/sys/....
> >   :
> >   bla bla bla
> >   :
> >   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> >   kernel: page fault trap, code=0
> >   Faulted in DDB: continuing...
>
> The output of "show witness" is unlikely to be useful in your case.
> It is more of a tool for debugging witness. You can ignore it.
> However, "show all locks" might display interesting information
> after a witness-related panic.

Ok, great!

It is just that an uvm_fault during show witness felt like a bad thing...

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
In reply to this post by Raimo Niskanen-7
On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:

> On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > On 2019-07-29, Raimo Niskanen <[hidden email]> wrote:
> > > A new hang, I tried to invstigate:
> > >
> > > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > > also the time on the 'systat vmstat' screen when it froze.  Then the machine
> > > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > > table is full" entry appears.  That message repeats until I rebooted it
> > > today at July 29 10:48.
> > >
> > > I had a terminal with top running.  It was still updating.  It showed about
> > > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> > > the process table had gotten full) it had 1282 idle processes and 1 on
> > > processor, which was 'top' itself.
> > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> > >
> > > I had 8 shells under tmux ready for debugging.  'ls worked.
> > > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > > and then hung.  'exec reboot' did not succeed.  Neither did a short power
> > > button, that at least caused a printout "stopping daemon nginx(failed)",
> > > but got no further.  I had to do a hard power off.
> > >
> > > My theory now is that our daily tests right before 14:55 started a process
> > > (this process is the top 'top' process with 10:14 execution time) that
> > > triggers a lock or other contention problem in the kernel which causes
> > > one CPU to spin in the system, and blocks processes from dying.
> > > About 10 hours later the process table gets full.
> > >
> > > Any, ANY ideas of how to proceed would be appreciated!
> > >
> > > Best Regards
> >
> > Did you notice any odd waitchan's (WAIT in top output)?
> >
> > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > set), then try to break into DDB during a hang and see how things look
> > in ps there. (Test breaking into DDB before a hang first so you know
> > that you can do it .. you can just "c" to continue).
> >
> > There might also be clues in things like "sh malloc" or "sh all pools".
> >
> > Perhaps you could also get clues from running a kernel built with
> > 'option WITNESS', you may get some messages in dmesg, or it adds commands
> > to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> > details).
>
> I have enabled Witness, it went so-so.  We'll see what it catches.
>
> I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> applied all patches for stable 001-006 and built a kernel with:
>   include "arch/amd64/conf/GENERIC"
>   option MULTIPROCESSOR
>   option MP_LOCKDEBUG
>   option WITNESS
>
> Then I activated in /etc/sysctl.conf:
>   ddb.console=1
>   kern.witness.locktrace=1
>   kern.witness.watch=3
>
> For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
> "show witness".  It printed lots of info, I scrolled down to the end, but
> during the printout there was an UVM fault:
>
>   Spin locks:
>   /usr/src/sys/....
>   :
>   bla bla bla
>   :
>   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
>   kernel: page fault trap, code=0
>   Faulted in DDB: continuing...
>
> Then I typed "cont" and it panicked.
> If anybody want details I took a picture.
>
> Have I combined too many debugging options, or is this sh*t that happens?
>
> Nevertheless, now the machine is running again, with Witness...
>
> I'll be back.

I have encountered some kind of stop, oddly enought not a panic - it
just sat in ddb and I missed it for a week (or more).  Then I did not
remember what I had planned to do so I "improvised" X-| , but anyway:

ddb{0}> ps
shows about 350 processes from cron, half of them in state netlock, half
in state piperd.  Then I have my test processes beam.smp: 6 in netlock, 6
in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
Then about 100 more ordinary looking processes...

ddb{0}> trace
db_enter()...
softclock(0)...
softintr_dispatch(0)...
Xsoftclock(0,0,1388,....)...
acpicpu_idle()...
shed_idle(ffffffff81ceff0)...
end trace frame: 0x0, count: -6

ddb{0}> show locks
exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
/usr/src/sys/arch/amd64/amd64/softintr.c:87
#0  witness_lock+0x41f
#1  softintr_dispatc+0x56
#2  Xsoftclock+0x1f
#3  acpicpu_idle+0x271
#4  sched_idle+0x235
#5  proc_trampoline+0x1c

ddb{0}> show nfsnode
size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152

(I think the size looks strange)

Then I tried show map and got a protection fault trap, gave up and
rebooted.

That was it!  Next time I will try:
  trace
  ps
  show malloc
  show all pools
  show locks
  show all locks
unless anyone has got more or better suggestions...

Best Regards
--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: How to debug hanging machines / proc: table is full

Raimo Niskanen-7
On Mon, Sep 09, 2019 at 05:42:02PM +0200, Raimo Niskanen wrote:

> On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > > On 2019-07-29, Raimo Niskanen <[hidden email]> wrote:
> > > > A new hang, I tried to invstigate:
> > > >
> > > > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > > > also the time on the 'systat vmstat' screen when it froze.  Then the machine
> > > > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > > > table is full" entry appears.  That message repeats until I rebooted it
> > > > today at July 29 10:48.
> > > >
> > > > I had a terminal with top running.  It was still updating.  It showed about
> > > > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> > > > the process table had gotten full) it had 1282 idle processes and 1 on
> > > > processor, which was 'top' itself.
> > > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> > > >
> > > > I had 8 shells under tmux ready for debugging.  'ls worked.
> > > > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > > > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > > > and then hung.  'exec reboot' did not succeed.  Neither did a short power
> > > > button, that at least caused a printout "stopping daemon nginx(failed)",
> > > > but got no further.  I had to do a hard power off.
> > > >
> > > > My theory now is that our daily tests right before 14:55 started a process
> > > > (this process is the top 'top' process with 10:14 execution time) that
> > > > triggers a lock or other contention problem in the kernel which causes
> > > > one CPU to spin in the system, and blocks processes from dying.
> > > > About 10 hours later the process table gets full.
> > > >
> > > > Any, ANY ideas of how to proceed would be appreciated!
> > > >
> > > > Best Regards
> > >
> > > Did you notice any odd waitchan's (WAIT in top output)?
> > >
> > > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > > set), then try to break into DDB during a hang and see how things look
> > > in ps there. (Test breaking into DDB before a hang first so you know
> > > that you can do it .. you can just "c" to continue).
> > >
> > > There might also be clues in things like "sh malloc" or "sh all pools".
> > >
> > > Perhaps you could also get clues from running a kernel built with
> > > 'option WITNESS', you may get some messages in dmesg, or it adds commands
> > > to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> > > details).
> >
> > I have enabled Witness, it went so-so.  We'll see what it catches.
> >
> > I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > applied all patches for stable 001-006 and built a kernel with:
> >   include "arch/amd64/conf/GENERIC"
> >   option MULTIPROCESSOR
> >   option MP_LOCKDEBUG
> >   option WITNESS
> >
> > Then I activated in /etc/sysctl.conf:
> >   ddb.console=1
> >   kern.witness.locktrace=1
> >   kern.witness.watch=3
> >
> > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
> > "show witness".  It printed lots of info, I scrolled down to the end, but
> > during the printout there was an UVM fault:
> >
> >   Spin locks:
> >   /usr/src/sys/....
> >   :
> >   bla bla bla
> >   :
> >   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> >   kernel: page fault trap, code=0
> >   Faulted in DDB: continuing...
> >
> > Then I typed "cont" and it panicked.
> > If anybody want details I took a picture.
> >
> > Have I combined too many debugging options, or is this sh*t that happens?
> >
> > Nevertheless, now the machine is running again, with Witness...
> >
> > I'll be back.
>
> I have encountered some kind of stop, oddly enought not a panic - it
> just sat in ddb and I missed it for a week (or more).  Then I did not
> remember what I had planned to do so I "improvised" X-| , but anyway:
>
> ddb{0}> ps
> shows about 350 processes from cron, half of them in state netlock, half

Sorry, that should have been about 1350 processes...

> in state piperd.  Then I have my test processes beam.smp: 6 in netlock, 6
> in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
> Then about 100 more ordinary looking processes...
>
> ddb{0}> trace
> db_enter()...
> softclock(0)...
> softintr_dispatch(0)...
> Xsoftclock(0,0,1388,....)...
> acpicpu_idle()...
> shed_idle(ffffffff81ceff0)...
> end trace frame: 0x0, count: -6
>
> ddb{0}> show locks
> exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
> /usr/src/sys/arch/amd64/amd64/softintr.c:87
> #0  witness_lock+0x41f
> #1  softintr_dispatc+0x56
> #2  Xsoftclock+0x1f
> #3  acpicpu_idle+0x271
> #4  sched_idle+0x235
> #5  proc_trampoline+0x1c
>
> ddb{0}> show nfsnode
> size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152
>
> (I think the size looks strange)
>
> Then I tried show map and got a protection fault trap, gave up and
> rebooted.
>
> That was it!  Next time I will try:
>   trace
>   ps
>   show malloc
>   show all pools
>   show locks
>   show all locks
> unless anyone has got more or better suggestions...
>
> Best Regards
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB