kernel crash in setrunqueue

Mike Larkin-2
Hi,

 I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
on GENERIC.MP regardless of whether or not the VM has one cpu or more than
one. It does not happen on GENERIC kernels.

 The crash will happen fairly quickly after the kernel starts executing
processes. Sometimes it crashes instantly, sometimes it lasts for a minute
or two. It rarely makes it to the login prompt. The problem is 100%
reproducible on two different VMs I have, running on two different
hypervisors (Hyper-V and ESXi6.7U2).

 I first started noticing the problem on the 24th July snap, but TBH these
machines were not frequently updated, so the previous snap I had installed
might have been a couple months old. Whatever older snap was on them before
worked fine.

 Since this is happening on two different machines with two different VMs,
I'm gonna rule out hardware issues.

 Crash:

kernel: protection fault trap, code=0
Stopped at setrunqueue+0xa2: addl $0x1,0x288(%r13)

 Trace:
ddb{2}> trace
setrunqueue(27b3d6c24c3fab80, ffff800015e874e0,32) at setrunqueue+0xa2
sched_barrier_task(ffff800015f1a168) at sched_barrier_task+0x6c
taskq_thread(ffffffff82121548) at taskq_thread+0x8d
end trace frame: 0x0, count: -3

 Registers:
ddb{2}> sh r
rdi     0xffffffff821ee728    sched_lock
rsi     0xffff800014cc6ff0
rbp     0xffff800015ea0e40
rbx     0
rdx     0x23ca94              acpi_pdirpa_0x2288fc
rcx     0xc
rax     0xc
r8      0x202
r9      0x2
r10     0
r11     0x57f79bf6968709d8
r12     0xffff800015e874e0
r13     0x27b3d6c24c3fab80
r14     0x32
r15     0x27b3d6c24c3fab80
rip     0xffffffff81b9df22    setrunqueue+0xa2
cs      0x8
rflags  0x10207               __ALIGN_SIZE+0xf207
rsp     0xffff800015ea0df0
ss      0x10


The offending instruction is in kern_sched.c:260:

        spc->spc_nrun++;

... which indicates 'spc' is trash (and it is, based on %r13 above). In my
tests, %r13 is always this same trash value. That comes from 'ci', which is
either passed in or chosen by sched_choosecpu. Neither of these functions
has changed recently, so I'm guessing this corruption is coming from something
else.

 Anyone have ideas where to start looking? I suppose I could start bisecting,
but does anyone know of any changes that would affect this area?

 I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
configured in them. 4 CPUs, 8GB RAM, etc.

-ml
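
The register reasoning in the report above can be checked mechanically. On
amd64, a memory access through a non-canonical address raises a general
protection fault rather than a page fault, which matches the "protection
fault trap" in the crash output. The sketch below is only an illustration:
it assumes 48-bit virtual addresses (4-level paging), and the helper names
is_canonical48, effective_address and check_report_values are made up for
this example; they are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging). An address is
 * canonical when bits 63..47 are all zeros or all ones.
 */
static bool
is_canonical48(uint64_t va)
{
	uint64_t top = va >> 47;	/* the 17 bits 63..47 */

	return top == 0 || top == 0x1ffff;
}

/* Effective address of a base+displacement operand such as 0x288(%r13). */
static uint64_t
effective_address(uint64_t base, int32_t disp)
{
	return base + (uint64_t)(int64_t)disp;
}

/* Checks the values quoted in the report; call it from main() to run. */
static void
check_report_values(void)
{
	uint64_t r13 = 0x27b3d6c24c3fab80ULL;	/* base register of the faulting insn */
	uint64_t r12 = 0xffff800015e874e0ULL;	/* second argument in the trace */

	/* addl $0x1,0x288(%r13) touches r13 + 0x288. */
	assert(effective_address(r13, 0x288) == 0x27b3d6c24c3fae08ULL);

	/* The target of the increment is not a canonical address... */
	assert(!is_canonical48(effective_address(r13, 0x288)));

	/* ...while an ordinary kernel pointer from the same dump is. */
	assert(is_canonical48(r12));
}
```

Under that assumption the value in %r13 cannot be a valid pointer at all,
which is consistent with 'ci' having been read from memory that never held
a cpu_info pointer, rather than from a stale or freed one.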

Re: kernel crash in setrunqueue

Mark Kettenis
> Date: Wed, 29 Jul 2020 13:03:43 -0700
> From: Mike Larkin <[hidden email]>

They're VMs and it turns out that many of the "PV" drivers are/were
using the intr_barrier() interface the wrong way.

For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
17 snapshot" thread on bugs@ from earlier today.

Cheers,

Mark

Re: kernel crash in setrunqueue

Mike Larkin-2
In reply to this post by Mike Larkin-2
Also I should note that the problem happens with snaps as well as kernels built
from source (-current), so this isn't likely something that is in snaps but not
yet in tree.

-ml

Re: kernel crash in setrunqueue

Mike Larkin-2
In reply to this post by Mark Kettenis
On Wed, Jul 29, 2020 at 10:14:11PM +0200, Mark Kettenis wrote:

> They're VMs and it turns out that many of the "PV" drivers are/were
> using the intr_barrier() interface the wrong way.
>
> For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
> 17 snapshot" thread on bugs@ from earlier today.
>
> Cheers,
>
> Mark
>

Thanks. I don't subscribe to bugs@ anymore, so that's why I likely missed it.

-ml