The case of the phantom reboot


David Newman-2
OpenBSD 6.8 GENERIC#5 i386

One of my systems rebooted at 03:01 local time today. I've seen kernel
panics and bad hardware but I've never seen OpenBSD "just reboot" by
itself, ever.

There's no cron job that would do this. last(1) is no help; it shows the
reboot command but not the shutdown that preceded it:

root@ns ~ 4# last -f /var/log/wtmp.0
reboot    ~                                 Sat Mar 27 03:01
root      ttyp0    192.168.0.132            Wed Mar 24 11:23 - 11:23  (00:00)

wtmp.0 begins Wed Mar 24 11:23 2021
root@ns ~ 5# last -f /var/log/wtmp.1
root      ttyp0    192.168.0.132            Tue Mar 16 21:30 - 21:30  (00:00)
root      ttyp0    75.82.86.131             Tue Mar 16 13:14 - 21:30  (08:15)
root      ttyp0    75.82.86.131             Sun Mar 14 21:20 - 21:29  (00:08)
root      ttyp0    75.82.86.131             Sat Mar 13 17:42 - 21:13  (03:31)

The date gaps seem odd. I've ssh'd into this system multiple times
between March 16-27. I don't see other signs of trouble in /var/log.

I could use some help in looking for evidence of foul play, or "just" a
hardware or software problem.

Thanks in advance for further troubleshooting clues.

dn

Re: The case of the phantom reboot

root-7
On 3/27/21 10:27 PM, David Newman wrote:

> [...]
>
What kind of a machine is it running on? I remember having reboot
problems on certain HP and Supermicro servers with hardware watchdogs.

--
Kristjan Komloši

Re: The case of the phantom reboot

David Newman-2
On 3/28/21 4:58 AM, Kristjan Komloši wrote:

> On 3/27/21 10:27 PM, David Newman wrote:
>> [...]
>>
> What kind of a machine is it running on? I remember having reboot
> problems on certain HP and Supermicro servers with hardware watchdogs.

This is a 10+-year-old Dell 1U server with a 2-GHz Celeron 440, part of
a pair running CARP. Aside from having to replace spinning disks with
SSDs a couple of years ago, they've been rock solid.

I too have seen issues with Supermicros, but with other OSs. I've
never had a spontaneous reboot on this system, and I'm concerned from
the wtmp output above that this *may* have been triggered externally. I
could use some clues about other things to check. Thanks.

dn

Re: The case of the phantom reboot

Stuart Henderson
On 2021-03-28, David Newman wrote:

> On 3/28/21 4:58 AM, Kristjan Komloši wrote:
>
>> On 3/27/21 10:27 PM, David Newman wrote:
>>> [...]
>>>
>> What kind of a machine is it running on? I remember having reboot
>> problems on certain HP and Supermicro servers with hardware watchdogs.
>
> This is a 10+-year-old Dell 1U server with a 2-GHz Celeron 440, part of
> a pair running CARP. Aside from having to replace spinning disks with
> SSDs a couple of years ago, they've been rock solid.
>
> I too have seen issues with Supermicros but that's with other OSs. I've
> never had a spontaneous reboot, on this system, and am concerned from
> the wtmp stuff above that this *may* have been triggered externally. I
> could use some clues in other things to check. Thanks.
>
> dn
>
>

The "reboot" wtmp entry is written by init(8).

This could be caused by bad hardware or a glitch in the power feed,
among other options (the latter may affect some machines differently
than others).

It may be worth enabling accounting in rc.conf.local so that, if it
happens again, you can see whether any commands were executed around
that time.
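
A minimal sketch of that setup in shell, assuming a stock OpenBSD install where rcctl(8), accton(8) and lastcomm(1) behave as their manual pages describe. The function names are made up for illustration, and the functions are meant to be run as root.

```shell
# Turn on process accounting now and at every boot.
enable_accounting() {
    rcctl enable accounting      # records accounting=YES in /etc/rc.conf.local
    touch /var/account/acct      # accton(8) expects the file to exist
    accton /var/account/acct     # start accounting without waiting for a reboot
}

# Show the most recently recorded commands (lastcomm(1) prints newest first).
# Optional argument: how many lines to show, 50 if omitted.
recent_commands() {
    lastcomm | head -n "${1:-50}"
}
```

Calling enable_accounting once, and recent_commands after the next unexplained reboot, should show what ran shortly before it. Records from the last moments before a power loss may never reach the disk, so an empty result does not rule anything out.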


Re: The case of the phantom reboot

Rick Aliwalas-3

On Sun, 28 Mar 2021, Stuart Henderson wrote:

> It is something that could possibly be caused by bad hardware or a
> glitch in the power feed amongst other options (the latter may affect
> some machines differently than others)..

I've had a string of power "blips" over the last year or so. Oddly
enough, the OpenBSD machine always stays up and a Debian machine
next to it on the same power strip reboots. I always figured it was
due to the superior operating system ;)


Re: The case of the phantom reboot

Nick Holland
In reply to this post by David Newman-2
On 3/28/21 12:13 PM, David Newman wrote:
> On 3/28/21 4:58 AM, Kristjan Komloši wrote:
>
>> On 3/27/21 10:27 PM, David Newman wrote:
>>> OpenBSD 6.8 GENERIC#5 i386
>>>
>>> One of my systems rebooted at 03:01 local time today. I've seen kernel
>>> panics and bad hardware but I've never seen OpenBSD "just reboot" by
>>> itself, ever.

OpenBSD, not usually.  Hardware OpenBSD is running on? Sure.

>>> There's no cron job that would do this. last(1) is no help; it shows the
>>> reboot command but not the shutdown that preceded it:
>>>
>>> root@ns ~ 4# last -f /var/log/wtmp.0
>>> reboot    ~                                 Sat Mar 27 03:01
>>> root      ttyp0    192.168.0.132            Wed Mar 24 11:23 - 11:23
>>> (00:00)
>>>
>>> wtmp.0 begins Wed Mar 24 11:23 2021
>>> root@ns ~ 5# last -f /var/log/wtmp.1
>>> root      ttyp0    192.168.0.132            Tue Mar 16 21:30 - 21:30
>>> (00:00)
>>> root      ttyp0    75.82.86.131             Tue Mar 16 13:14 - 21:30
>>> (08:15)
>>> root      ttyp0    75.82.86.131             Sun Mar 14 21:20 - 21:29
>>> (00:08)
>>> root      ttyp0    75.82.86.131             Sat Mar 13 17:42 - 21:13
>>> (03:31)
>>>
>>> The date gaps seem odd. I've ssh'd into this system multiple times
>>> between March 16-27. I don't see other signs of trouble in /var/log.
>>>
>>> I could use some help in looking for evidence of foul play, or "just" a
>>> hardware or software problem.
>>>
>>> Thanks in advance for further troubleshooting clues.
>>>
>>> dn
>>>
>> What kind of a machine is it running on? I remember having reboot
>> problems on certain HP and Supermicro servers with hardware watchdogs.
>
> This is a 10+-year-old Dell 1U server with a 2-GHz Celeron 440, part of
> a pair running CARP. Aside from having to replace spinning disks with
> SSDs a couple of years ago, they've been rock solid.

A basic machine that worked for a long time and then starts giving problems
almost certainly has a hardware problem, unless you can tie the trouble to a
recent upgrade. And that's not terribly likely on "basic" hardware.

Every broken device was "rock solid" ... until it wasn't.  That's
the definition of "broken".

> I too have seen issues with Supermicros but that's with other OSs. I've
> never had a spontaneous reboot, on this system, and am concerned from
> the wtmp stuff above that this *may* have been triggered externally. I
> could use some clues in other things to check. Thanks.

As Stuart pointed out, that comes from the boot process, not the shutdown.
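
One way to check this from the wtmp data: reboot(8) and shutdown(8) enter a "shutdown" record before the system goes down, and the "reboot" record is written on the way back up. last(1) lists the newest entries first, so after an orderly restart the line directly below a "reboot" line should be a "shutdown" line. A rough sketch in shell and awk, where classify_reboots is a made-up name and the layout of last(1) output is assumed to match the listings earlier in the thread:

```shell
# Read last(1) output on stdin and label every "reboot" record:
#   clean   - the next (older) record is a "shutdown" record
#   unclean - the next record is something else (login, another reboot, ...)
#   unknown - the reboot is the oldest record available
classify_reboots() {
    awk '
        pending != "" {
            if ($1 == "shutdown") verdict = "clean"
            else if (NF == 0)     verdict = "unknown"
            else                  verdict = "unclean"
            print verdict ": " pending
            pending = ""
        }
        $1 == "reboot" { pending = $0 }
        END { if (pending != "") print "unknown: " pending }
    '
}

# Usage: last -f /var/log/wtmp.0 | classify_reboots
```

For the wtmp.0 listing in the original post, this would label the 03:01 reboot as unclean, because the next record is a root login rather than a shutdown. That fits a power or hardware event better than someone running reboot(8), though it does not prove it.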

If you are really curious, you could put a serial console on it and wait
for the next event.  PROBABLY won't see much, however.

Believe me, I'm all in favor of recycling computers -- in fact, as I
often tell skeptical employers, I'd rather have two ten-year-old systems
than one brand new system with a service contract, but computers don't
last as long as they used to, and curiously, some big-name servers seem
to sometimes have a shorter life than some desktops.  A ten-year-old
computer that does the job reliably is good, but not an expectation.

Nick.

Re: The case of the phantom reboot

David Newman-2


On 3/29/21 5:28 AM, Nick Holland wrote:

> On 3/28/21 12:13 PM, David Newman wrote:
>> On 3/28/21 4:58 AM, Kristjan Komloši wrote:
>>
>>> On 3/27/21 10:27 PM, David Newman wrote:
>>>> OpenBSD 6.8 GENERIC#5 i386
>>>>
>>>> One of my systems rebooted at 03:01 local time today. I've seen kernel
>>>> panics and bad hardware but I've never seen OpenBSD "just reboot" by
>>>> itself, ever.
>
> OpenBSD, not usually.  Hardware OpenBSD is running on? Sure.
>
>>>> There's no cron job that would do this. last(1) is no help; it shows the
>>>> reboot command but not the shutdown that preceded it:
>>>>
>>>> root@ns ~ 4# last -f /var/log/wtmp.0
>>>> reboot    ~                                 Sat Mar 27 03:01
>>>> root      ttyp0    192.168.0.132            Wed Mar 24 11:23 - 11:23  (00:00)
>>>>
>>>> wtmp.0 begins Wed Mar 24 11:23 2021
>>>> root@ns ~ 5# last -f /var/log/wtmp.1
>>>> root      ttyp0    192.168.0.132            Tue Mar 16 21:30 - 21:30  (00:00)
>>>> root      ttyp0    75.82.86.131             Tue Mar 16 13:14 - 21:30  (08:15)
>>>> root      ttyp0    75.82.86.131             Sun Mar 14 21:20 - 21:29  (00:08)
>>>> root      ttyp0    75.82.86.131             Sat Mar 13 17:42 - 21:13  (03:31)
>>>>
>>>> The date gaps seem odd. I've ssh'd into this system multiple times
>>>> between March 16-27. I don't see other signs of trouble in /var/log.
>>>>
>>>> I could use some help in looking for evidence of foul play, or "just" a
>>>> hardware or software problem.
>>>>
>>>> Thanks in advance for further troubleshooting clues.
>>>>
>>>> dn
>>>>
>>> What kind of a machine is it running on? I remember having reboot
>>> problems on certain HP and Supermicro servers with hardware watchdogs.
>>
>> This is a 10+-year-old Dell 1U server with a 2-GHz Celeron 440, part of
>> a pair running CARP. Aside from having to replace spinning disks with
>> SSDs a couple of years ago, they've been rock solid.
>
> basic machine, worked for a long time, then starts giving problems, almost
> certainly a hw problem unless you can tie the problem to a recent upgrade.
> And that's not terribly likely on a "basic" hardware.
>
> Every broken device started out "rock solid" ... until it isn't.  That's
> the definition of "Broken".
>
>> I too have seen issues with Supermicros but that's with other OSs. I've
>> never had a spontaneous reboot, on this system, and am concerned from
>> the wtmp stuff above that this *may* have been triggered externally. I
>> could use some clues in other things to check. Thanks.
>
> As Stuart pointed out, that comes from the boot process, not the shutdown.
>
> If you are really curious, you could put a serial console on it and wait
> for the next event.  PROBABLY won't see much, however.
>
> Believe me, I'm all in favor of recycling computers -- in fact, as I
> often tell skeptical employers, I'd rather have two ten year old systems
> than one brand new system with a service contract, but computers don't
> last as long as they used to, and curiously, some big-name servers seem
> to sometimes have a shorter life than some desktops,  A ten year old
> computer that does the job reliably is good, but not an expectation.

I hope it is "just" a hardware problem. These ancient machines don't owe
me anything. If anything they've been a testament to how well OpenBSD
just works, year in, year out.

Until I can swap in a replacement (the unit in question is in a colo in
another state), I may try Stuart's suggestion of enabling accounting.
The only concern I have about an external actor is that there seem to be
some missing entries in wtmp, but I don't know enough about init or wtmp
to rule out a hardware glitch.

Someone else suggested a battery problem, which seems plausible for a
unit this old.

Appreciate all the feedback -- many thanks.

dn

Re: The case of the phantom reboot

Marco Scholz
In reply to this post by Stuart Henderson
On Sun, Mar 28, 2021 at 08:05:58PM -0000, Stuart Henderson wrote:
[...]
> It is something that could possibly be caused by bad hardware or a
> glitch in the power feed amongst other options (the latter may affect
> some machines differently than others)..

Power glitch, bad power supply, bad RAM, ...
Do you have a UPS? If so, I bet it's a hardware problem.

Re: The case of the phantom reboot

Rafael Possamai-2
In reply to this post by David Newman-2
>One of my systems rebooted at 03:01 local time today.

Do you happen to have a cat nearby?

Re: The case of the phantom reboot

David Newman-2
On 4/1/21 2:51 PM, Rafael Possamai wrote:

>> One of my systems rebooted at 03:01 local time today.
>
> Do you happen to have a cat nearby?

:-)

I'm allergic, and this box is in a colo.

Appreciate all the feedback. I've enabled accounting per Stuart's
suggestion and am pretty sure this is a hiccup on old hardware.

dn