
Kernel panic on 6.1: init dies under load

Dan Cross-4
>Synopsis:      init dies causing kernel panic on virtualized hosts.
>Category:      system
>Environment:
        System      : OpenBSD 6.1
        Details     : OpenBSD 6.1 (GENERIC) #6: Sat May  6 09:33:26 CEST 2017
                         [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC

        Architecture: OpenBSD.amd64
        Machine     : amd64
>Description:
        Kernel panics under moderate/heavy load when running under a
        hypervisor (I believe my VPS provider is using Xen); init(8)
        dies and the machine panics. `boot sync` does not work and
        the filesystem requires manual fsck on reboot.

        I have not seen this on hardware.

        Console data from the panic is as follows:

        : tempest; cat panic
        coredump of syslogd(94574), write failed: errno 14
        coredump of init(1), write failed: errno 14
        panic: init died (signal 10, exit 0)
        Stopped at      Debugger+0x9:   leave
            TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
        *285197      1      0       0x802     0x2000    0  init
        Debugger() at Debugger+0x9
        panic() at panic+0xfe
        exit1() at exit1+0x58d
        trapsignal() at trapsignal+0x110
        trap() at trap+0x309
        --- trap (number 4) ---
        end of kernel
        end trace frame: 0xff, count: 10
        0x18057281cfdc
        https://www.openbsd.org/ddb.html describes the minimum info required in bug
        reports.  Insufficient info makes it difficult to find and fix bugs.
        ddb>
        : tempest;

>How-To-Repeat:
        Run some CPU/memory intensive workload; for example, rebuilding
        the Go compiler and toolchain.  Occasionally the system will survive,
        but gets into a state where processes are dying.
>Fix:
        Unknown.


dmesg:
OpenBSD 6.1 (GENERIC) #6: Sat May  6 09:33:26 CEST 2017
    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC
real mem = 520093696 (496MB)
avail mem = 499785728 (476MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xeb01f (10 entries)
bios0: vendor Xen version "3.4.4" date 07/15/2016
bios0: Xen HVM domU
acpi0 at bios0: rev 2
acpi0: sleep states S3 S4 S5
acpi0: tables DSDT FACP APIC
acpi0: wakeup devices
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
ioapic0 at mainbus0: apid 1 pa 0xfec00000, version 11, 48 pins
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU L5640 @ 2.27GHz, 2267.15 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,SSSE3,CX16,SSE4.1,SSE4.2,POPCNT,HV,NXE,LONG,LAHF
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
acpiprt0 at acpi0: bus 0 (PCI0)
acpicpu0 at acpi0: C1(@1 halt!)
"PNP0F13" at acpi0 not configured
"PNP0303" at acpi0 not configured
"PNP0700" at acpi0 not configured
"PNP0501" at acpi0 not configured
"PNP0400" at acpi0 not configured
pvbus0 at mainbus0: Xen 3.4
xen0 at pvbus0: features 0x5, 32 grant table frames, event channel 2
"vkbd" at xen0: device/vkbd/0 not configured
"vfb" at xen0: device/vfb/0 not configured
xbf0 at xen0 backend 0 channel 4: disk
scsibus1 at xbf0: 2 targets
sd0 at scsibus1 targ 0 lun 0: <Xen, file hda 768, 0000> SCSI3 0/direct fixed
sd0: 20480MB, 512 bytes/sector, 41943040 sectors
xnf0 at xen0 backend 0 channel 5: address 00:16:3e:15:9a:43
xnf1 at xen0 backend 0 channel 6: address 00:16:3e:48:5b:04
"console" at xen0: device/console/0 not configured
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "Intel 82441FX" rev 0x02
pcib0 at pci0 dev 1 function 0 "Intel 82371SB ISA" rev 0x00
pciide0 at pci0 dev 1 function 1 "Intel 82371SB IDE" rev 0x00: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
pciide0: channel 0 disabled (no drives)
pciide0: channel 1 disabled (no drives)
piixpm0 at pci0 dev 1 function 3 "Intel 82371AB Power" rev 0x01: SMBus disabled
vga1 at pci0 dev 2 function 0 "Cirrus Logic CL-GD5446" rev 0x00
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
xspd0 at pci0 dev 3 function 0 "XenSource Platform Device" rev 0x01: apic 1 int 28
isa0 at pcib0
isadma0 at isa0
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
fd0 at fdc0 drive 0: density unknown
fd1 at fdc0 drive 1: density unknown
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5 irq 1 irq 12
pckbd0 at pckbc0 (kbd slot)
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
wsmouse0 at pms0 mux 0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
vscsi0 at root
scsibus2 at vscsi0: 256 targets
softraid0 at root
scsibus3 at softraid0: 256 targets
root on sd0a (e0bfc277bba6b729.a) swap on sd0b dump on sd0b

usbdevs:
usbdevs: no USB controllers found

Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
Hi,

Thanks for reporting this, however there's not enough info to follow
up on this right now.  What is clear is that your provider is using
an ancient version of Xen that doesn't even support the callback
vector interrupt delivery (the emulated xspd0 device is delivering
all interrupts).  We have developed code for Xen 4.5+ platforms and
there was only some testing done by users on 3.x.  So, in a way, you
can consider Xen 3.x to not be officially supported at this point.

Having said that, I've got a few questions:

 - Do you see other write failures as well?

 - Do you have swap enabled? (pstat -s)

 - Do you see crashes when bsd.mp is used instead of a single processor
   kernel (that's right, even on the single processor VM)?

Regards,
Mike

On Mon, May 15, 2017 at 10:28 -0400, Dan Cross wrote:

> >Synopsis:      init dies causing kernel panic on virtualized hosts.
> [...]


Re: Kernel panic on 6.1: init dies under load

Dan Cross-4
On Mon, May 15, 2017 at 11:01 AM, Mike Belopuhov <[hidden email]> wrote:
>
> Thanks for reporting this, however there's not enough info to follow
> up on this right now.  What is clear is that your provider is using
> an ancient version of Xen that doesn't even support the callback
> vector interrupt delivery (the emulated xspd0 device is delivering
> all interrupts).  We have developed code for Xen 4.5+ platforms and
> there was only some testing done by users on 3.x.  So, in a way, you
> can consider Xen 3.x to not be officially supported at this point.
>

That's unfortunate. Sadly, this is common across two different providers
(Panix and rootbsd.net). The latter, I'm sure, would at least be interested
in coordinating with you guys to get a fix. I'll open a trouble ticket with
them.

> Having said that, I've got a few questions:
>
>  - Do you see other write failures as well?

Yes. E.g., syslogd had a similar write failure before the panic.

>  - Do you have swap enabled? (pstat -s)

Yes; a gig:

: jaan; pstat -s
Device      1K-blocks     Used    Avail Capacity  Priority
/dev/sd0b     1048249        0  1048249     0%    0
: jaan;

>  - Do you see crashes when bsd.mp is used instead of a single processor
>    kernel (that's right, even on the single processor VM)?

Yes; the panic happens whether using single- or multi-processor kernels.

        - Dan C.



Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
On Mon, May 15, 2017 at 11:18 -0400, Dan Cross wrote:

> On Mon, May 15, 2017 at 11:01 AM, Mike Belopuhov <[hidden email]> wrote:
> >
> > Thanks for reporting this, however there's not enough info to follow
> > up on this right now.  What is clear is that your provider is using
> > an ancient version of Xen that doesn't even support the callback
> > vector interrupt delivery (the emulated xspd0 device is delivering
> > all interrupts).  We have developed code for Xen 4.5+ platforms and
> > there was only some testing done by users on 3.x.  So, in a way, you
> > can consider Xen 3.x to not be officially supported at this point.
> >
>
> That's unfortunate. Sadly, this is common across two different providers
> (Panix and rootbsd.net). The latter, I'm sure, would at least be interested
> in coordinating with you guys to get a fix. I'll open a trouble ticket with
> them.
>
> > Having said that, I've got a few questions:
> >
> >  - Do you see other write failures as well?
> >
>
> Yes. E.g, syslogd had a similar write failure before panic.
>

Can you reproduce any of these write failures at will?

What happens when you just send a signal to dump the core?
You can test this by running "sleep 100", and then call
"pkill -ABRT -lf sleep".

> >  - Do you have swap enabled? (pstat -s)
>
>
> Yes; a gig:
>
> : jaan; pstat -s
> Device      1K-blocks     Used    Avail Capacity  Priority
> /dev/sd0b     1048249        0  1048249     0%    0
> : jaan;
>

Do you see swap being used under your load?

> >  - Do you see crashes when bsd.mp is used instead of a single processor
> >    kernel (that's right, even on the single processor VM)?
>
> Yes; the panic happens whether using single- or multi-processor kernels.
>

Good, nothing has slipped through those cracks again.


Re: Kernel panic on 6.1: init dies under load

Dan Cross-4
On Mon, May 15, 2017 at 11:28 AM, Mike Belopuhov <[hidden email]> wrote:

> On Mon, May 15, 2017 at 11:18 -0400, Dan Cross wrote:
> > > Having said that, I've got a few questions:
> > >
> > >  - Do you see other write failures as well?
> >
> > Yes. E.g, syslogd had a similar write failure before panic.
>
> Can you reproduce any of these write failures at will?
>

I'm not sure what you mean. If I induce the load conditions, then the VM
will panic fairly reliably.

> What happens when you just send a signal to dump the core?
> You can test this by running "sleep 100", and then call
> "pkill -ABRT -lf sleep".


I'm not sure what this shows, but sure I can do that:

: jaan; /bin/sleep 100&
[1] 20701
: jaan; pkill -ABRT -lf sleep
20701 sleep
: jaan;
[1]  + abort (core dumped)  /bin/sleep 100
: jaan; ls -l sleep.core
-rw-------  1 cross  staff  4208416 May 15 15:42 sleep.core
: jaan;
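
The same check can be scripted. The following is a minimal sketch, not something posted in the thread: it forks a child that stands in for "sleep 100", sends it SIGABRT the way pkill does, and inspects the wait status. The function names are made up for illustration, and WCOREDUMP is a common extension rather than a POSIX guarantee, so it is guarded.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that only sleeps, kill it with SIGABRT and reap it.
 * Returns 0 and stores the wait status in *status, or -1 on error.
 */
int
abort_child(int *status)
{
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: stand-in for "sleep 100". */
		for (;;)
			pause();
	}
	/* Parent: the equivalent of "pkill -ABRT". */
	if (kill(pid, SIGABRT) == -1)
		return -1;
	if (waitpid(pid, status, 0) == -1)
		return -1;
	return 0;
}

/* 1 if the child was terminated by SIGABRT, 0 otherwise. */
int
died_of_abort(int status)
{
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}

/* 1 if the kernel reports a core file was written, 0 if not or unknown. */
int
dumped_core(int status)
{
#ifdef WCOREDUMP
	return WIFSIGNALED(status) && WCOREDUMP(status) != 0;
#else
	return 0;
#endif
}
```

If dumped_core() returns 0 even though died_of_abort() returns 1, the core size limit (ulimit -c) may simply be zero; that by itself does not point at a write failure.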

The panic-inducing condition seems to be that, for whatever reason, the
kernel gets into a funny state where processes like init(8) die due to
having part of their VM image corrupted; the kernel then panics because
`init` dies.
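
For reference, the bare numbers in the panic output ("errno 14", "signal 10") can be decoded with a small table. This is only a sketch: the values are the ones OpenBSD uses as far as I know them from its errno and signal headers, the table is deliberately incomplete, and the helper names are hypothetical. Note that other systems number signals differently (on Linux, signal 10 is SIGUSR1), so do not rely on the local strsignal() when reading an OpenBSD console log elsewhere.

```c
#include <stddef.h>

struct code_name {
	int code;
	const char *name;
};

/* Assumed OpenBSD values; only a few entries relevant to this report. */
static const struct code_name openbsd_errnos[] = {
	{ 5, "EIO" },		/* Input/output error */
	{ 12, "ENOMEM" },	/* Cannot allocate memory */
	{ 14, "EFAULT" },	/* Bad address */
};

static const struct code_name openbsd_signals[] = {
	{ 6, "SIGABRT" },	/* abort(3), what pkill -ABRT sends */
	{ 10, "SIGBUS" },	/* bus error */
	{ 11, "SIGSEGV" },	/* segmentation violation */
};

/* Linear search; returns NULL for codes missing from the table. */
static const char *
lookup(const struct code_name *tab, size_t n, int code)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].code == code)
			return tab[i].name;
	return NULL;
}

const char *
openbsd_errno_name(int e)
{
	return lookup(openbsd_errnos,
	    sizeof(openbsd_errnos) / sizeof(openbsd_errnos[0]), e);
}

const char *
openbsd_signal_name(int s)
{
	return lookup(openbsd_signals,
	    sizeof(openbsd_signals) / sizeof(openbsd_signals[0]), s);
}
```

Read this way, "init died (signal 10, exit 0)" is init being killed by SIGBUS, and the failed coredump writes report EFAULT, which fits the description of processes losing part of their VM image.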

> > >  - Do you have swap enabled? (pstat -s)
> >
> >
> > Yes; a gig:
> >
> > : jaan; pstat -s
> > Device      1K-blocks     Used    Avail Capacity  Priority
> > /dev/sd0b     1048249        0  1048249     0%    0
> > : jaan;
> >
>
> Do you see swap being used under your load?


I'm not sure. I can try to crash a machine again and poke at a kernel
variable from ddb to see; is there anything in particular you want me to look at?

> > >  - Do you see crashes when bsd.mp is used instead of a single processor
> > >    kernel (that's right, even on the single processor VM)?
> >
> > Yes; the panic happens whether using single- or multi-processor kernels.
>
> Good, nothing has slipped through those cracks again.
>

I can see the value in narrowing down the search space. :-)

        - Dan C.

Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
On Mon, May 15, 2017 at 11:45 -0400, Dan Cross wrote:

> On Mon, May 15, 2017 at 11:28 AM, Mike Belopuhov <[hidden email]> wrote:
>
> > On Mon, May 15, 2017 at 11:18 -0400, Dan Cross wrote:
> > > > Having said that, I've got a few questions:
> > > >
> > > >  - Do you see other write failures as well?
> > >
> > > Yes. E.g, syslogd had a similar write failure before panic.
> >
> > Can you reproduce any of these write failures at will?
> >
>
> I'm not sure what you mean. If I induce the load conditions, then the VM
> will panic fairly reliably.
>

I was wondering if you have seen any other write errors apart
from those that cause the panic.

> > What happens when you just send a signal to dump the core?
> > You can test this by running "sleep 100", and then call
> > "pkill -ABRT -lf sleep".
>
>
> I'm not sure what this shows, but sure I can do that:
>

There are quite a number of different I/O codepaths in the
kernel and some are wonkier than others.

> : jaan; /bin/sleep 100&
> [1] 20701
> : jaan; pkill -ABRT -lf sleep
> 20701 sleep
> : jaan;
> [1]  + abort (core dumped)  /bin/sleep 100
> : jaan; ls -l sleep.core
> -rw-------  1 cross  staff  4208416 May 15 15:42 sleep.core
> : jaan;
>
> The panic-inducing condition seems to be that, for whatever reason, the
> kernel gets into a funny state where processes like init(8) die due to
> having part of their VM image corrupted; the kernel then panics because
> `init` dies.
>
> > > >  - Do you have swap enabled? (pstat -s)
> > >
> > >
> > > Yes; a gig:
> > >
> > > : jaan; pstat -s
> > > Device      1K-blocks     Used    Avail Capacity  Priority
> > > /dev/sd0b     1048249        0  1048249     0%    0
> > > : jaan;
> > >
> >
> > Do you see swap being used under your load?
>
>
> I'm not sure. I can try and crash a machine again and see poke at a kernel
> var from ddb to see; anything in particular you want me to look at?
>

Indeed.  You can run a "show uvmexp" DDB command.

Please try running with the diff below.  It will log all polled
and bounced transfers as well as some additional info.



diff --git sys/dev/pv/xbf.c sys/dev/pv/xbf.c
index d5c44770acb..29e7615d0fc 100644
--- sys/dev/pv/xbf.c
+++ sys/dev/pv/xbf.c
@@ -36,11 +36,11 @@
 #include <scsi/scsi_all.h>
 #include <scsi/cd.h>
 #include <scsi/scsi_disk.h>
 #include <scsi/scsiconf.h>
 
-/* #define XBF_DEBUG */
+#define XBF_DEBUG
 
 #ifdef XBF_DEBUG
 #define DPRINTF(x...) printf(x)
 #else
 #define DPRINTF(x...)
@@ -478,10 +478,11 @@ xbf_load_xs(struct scsi_xfer *xs, int desc)
  sge->sge_first = i > 0 ? 0 :
     ((vaddr_t)xs->data & PAGE_MASK) >> XBF_SEC_SHIFT;
  sge->sge_last = sge->sge_first +
     (map->dm_segs[i].ds_len >> XBF_SEC_SHIFT) - 1;
 
+ if (ISSET(xs->flags, SCSI_POLL))
  DPRINTF("%s:   seg %d/%d ref %lu len %lu first %u last %u\n",
     sc->sc_dev.dv_xname, i + 1, map->dm_nsegs,
     map->dm_segs[i].ds_addr, map->dm_segs[i].ds_len,
     sge->sge_first, sge->sge_last);
 
@@ -640,10 +641,11 @@ xbf_submit_cmd(struct scsi_xfer *xs)
  xrd->xrd_req.req_op = operation;
  xrd->xrd_req.req_unit = (uint16_t)sc->sc_unit;
  xrd->xrd_req.req_sector = lba;
 
  if (operation == XBF_OP_READ || operation == XBF_OP_WRITE) {
+ if (ISSET(xs->flags, SCSI_POLL))
  DPRINTF("%s: desc %d %s%s lba %llu nsec %u len %d\n",
     sc->sc_dev.dv_xname, desc, operation == XBF_OP_READ ?
     "read" : "write", ISSET(xs->flags, SCSI_POLL) ? "-poll" :
     "", lba, nblk, xs->datalen);
 
@@ -718,10 +720,11 @@ xbf_complete_cmd(struct scsi_xfer *xs, int desc)
     BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
  bus_dmamap_unload(sc->sc_dmat, map);
 
  sc->sc_xs[desc] = NULL;
 
+ if (ISSET(xs->flags, SCSI_POLL))
  DPRINTF("%s: completing desc %d(%llu) op %u with error %d\n",
     sc->sc_dev.dv_xname, desc, xrd->xrd_rsp.rsp_id,
     xrd->xrd_rsp.rsp_op, xrd->xrd_rsp.rsp_status);
 
  id = xrd->xrd_rsp.rsp_id;


Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
On Mon, May 15, 2017 at 19:24 +0200, Mike Belopuhov wrote:
> Indeed.  You can run a "show uvmexp" DDB command.
>
> Please try running with the diff below.  It will log all polled
> and bounced transfers as well as some additional info.

Hi,

While I'm still interested in the "show uvmexp" output, I'd like
to ask you to hold off on testing this diff.  I've identified
a few issues and am working on resolving them.

Cheers,
Mike


Re: Kernel panic on 6.1: init dies under load

Dan Cross-4
Thanks; sorry I've been sidetracked the last couple of days. Let me see if
I can get a machine to panic and grab the "show uvmexp" output.

On Thu, May 18, 2017 at 6:20 PM, Mike Belopuhov <[hidden email]> wrote:

> On Mon, May 15, 2017 at 19:24 +0200, Mike Belopuhov wrote:
> > Indeed.  You can run a "show uvmexp" DDB command.
> >
> > Please try running with the diff below.  It will log all polled
> > and bounced transfers as well as some additional info.
>
> Hi,
>
> While I'm still interested in the "show uvmexp" output , I'd like
> to ask you to hold off the testing of this diff.  I've identified
> a few issues and working on resolving them.
>
> Cheers,
> Mike
>

Re: Kernel panic on 6.1: init dies under load

Dan Cross-4
Okay, here is the output. I apologize for the screen shot; there's no other
particularly great way to capture the console output from the VPS and I
don't trust myself to type it all in without making a mistake of some kind.


On Thu, May 18, 2017 at 8:48 PM, Dan Cross <[hidden email]> wrote:

> Thanks; sorry I've been sidetracked the last couple of days. Let me see if
> I can get a machine to panic and grab the "show uvmexp" output.
>
> On Thu, May 18, 2017 at 6:20 PM, Mike Belopuhov <[hidden email]>
> wrote:
>
>> On Mon, May 15, 2017 at 19:24 +0200, Mike Belopuhov wrote:
>> > Indeed.  You can run a "show uvmexp" DDB command.
>> >
>> > Please try running with the diff below.  It will log all polled
>> > and bounced transfers as well as some additional info.
>>
>> Hi,
>>
>> While I'm still interested in the "show uvmexp" output , I'd like
>> to ask you to hold off the testing of this diff.  I've identified
>> a few issues and working on resolving them.
>>
>> Cheers,
>> Mike
>>
>
>

Attachment: Screen Shot 2017-05-18 at 9.11.08 PM.png (301K)

Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
On Thu, May 18, 2017 at 21:15 -0400, Dan Cross wrote:
> Okay, here is the output. I apologize for the screen shot; there's no other
> particularly great way to capture the console output from the VPS and I
> don't trust myself to type it all in without making a mistake of some kind.
>

That's OK, I can see that there's quite a bit of swapping going on.
I haven't finished investigating yet, but the first thing I've
noticed is that FFS read-ahead issues 64k read requests.  xbf(4)
cannot handle more than 45056 bytes at a time, so it fails the request.
This might be causing some serious problems.
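
To put numbers on the mismatch, here is a small sketch of the arithmetic. It assumes that the 45056-byte limit comes from 11 segments of one 4096-byte page each; only the product appears in this message, so both factors and all names below are assumptions made for illustration. It also assumes a transfer that is cut off at the limit rather than rejected outright.

```c
#include <stddef.h>

/* Assumed factors behind the 45056-byte limit quoted above. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_MAX_SEGMENTS	11u

/* Largest transfer a single request could carry under those assumptions. */
size_t
xfer_limit(void)
{
	return (size_t)SKETCH_MAX_SEGMENTS * SKETCH_PAGE_SIZE;
}

/*
 * Bytes left untransferred if a request of `requested` bytes were
 * truncated at the limit; 0 if the request fits.
 */
size_t
short_read_resid(size_t requested)
{
	size_t limit = xfer_limit();

	return requested > limit ? requested - limit : 0;
}
```

Under these assumptions a 64k (65536-byte) read-ahead request would leave a residual of 20480 bytes, while a plain 16k block read fits comfortably.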

Unfortunately, it turned out that our SCSI and VFS layers don't
implement proper handling of short reads (b_resid is ignored
on clustered reads by the buffercache and SCSI doesn't do
anything about it either), so I took a stab at getting it
working.

For now the most appropriate way to solve this that I've found is
to invalidate the read-ahead portion of a cluster read: when FFS asks
for a block, e.g. 16k, bread_cluster creates an array of bufs for
a MAXPHYS worth of I/O sliced into chunks of the block size (e.g.
16k).  Then (after the I/O is done) we can walk the array from the
end and invalidate all chunks that correspond to failed I/O.
For example, if b_resid is 20480 and we were using 16k chunks,
then we have to invalidate the last two bufs (32k).

Unfortunately, there's a major problem that this diff doesn't
solve: the case where we've read even less than what we were
initially asked for (excluding all of the read-ahead blocks).  This
is because the biodone for xbpp[0], aka "the bp", is done from
sd_buf_done directly *before* we can do buf_fix_mapping and restore
its intended bp->b_bcount.  In other words, when sd_buf_done calls
biodone you cannot correlate b_bcount and b_resid and mark the
buffer B_INVAL because you don't know its intended length.

This is not a final version, but as I won't get back to it
before Monday, I wanted to post it for a wider audience.


diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
index 95bc80bc0e6..1cc1943d752 100644
--- sys/kern/vfs_bio.c
+++ sys/kern/vfs_bio.c
@@ -534,11 +534,29 @@ bread_cluster_callback(struct buf *bp)
                 */
                buf_fix_mapping(bp, newsize);
                bp->b_bcount = newsize;
        }
 
-       for (i = 1; xbpp[i] != 0; i++) {
+       /* Invalidate read-ahead buffers if read short */
+       if (bp->b_resid > 0) {
+               for (i = 0; xbpp[i] != NULL; i++)
+                       continue;
+               for (i = i - 1; i != 0; i--) {
+                       if (xbpp[i]->b_bufsize <= bp->b_resid) {
+                               bp->b_resid -= xbpp[i]->b_bufsize;
+                               SET(xbpp[i]->b_flags, B_INVAL);
+                       } else if (bp->b_resid > 0) {
+                               bp->b_resid = 0;
+                               SET(xbpp[i]->b_flags, B_INVAL);
+                       } else
+                               break;
+               }
+               if (bp->b_resid > 0)
+                       printf("short read %ld\n", bp->b_resid);
+       }
+
+       for (i = 1; xbpp[i] != NULL; i++) {
                if (ISSET(bp->b_flags, B_ERROR))
                        SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
                biodone(xbpp[i]);
        }
 
@@ -605,11 +623,11 @@ bread_cluster(struct vnode *vp, daddr_t blkno, int size, struct buf **rbpp)
                }
        }
 
        bp = xbpp[0];
 
-       xbpp[howmany] = 0;
+       xbpp[howmany] = NULL;
 
        inc = btodb(size);
 
        for (i = 1; i < howmany; i++) {
                bcstats.pendingreads++;
diff --git sys/dev/pv/xbf.c sys/dev/pv/xbf.c
index d5c44770acb..9a94e3dc48f 100644
--- sys/dev/pv/xbf.c
+++ sys/dev/pv/xbf.c
@@ -448,29 +448,32 @@ xbf_load_xs(struct scsi_xfer *xs, int desc)
        struct xbf_softc *sc = xs->sc_link->adapter_softc;
        struct xbf_sge *sge;
        union xbf_ring_desc *xrd;
        bus_dmamap_t map;
        int i, error, mapflags;
+       bus_size_t datalen;
 
        xrd = &sc->sc_xr->xr_desc[desc];
        map = sc->sc_xs_map[desc];
 
+       datalen = MIN(xs->datalen, sc->sc_maxphys);
+
        mapflags = (sc->sc_domid << 16);
        if (ISSET(xs->flags, SCSI_NOSLEEP))
                mapflags |= BUS_DMA_NOWAIT;
        else
                mapflags |= BUS_DMA_WAITOK;
        if (ISSET(xs->flags, SCSI_DATA_IN))
                mapflags |= BUS_DMA_READ;
        else
                mapflags |= BUS_DMA_WRITE;
 
-       error = bus_dmamap_load(sc->sc_dmat, map, xs->data, xs->datalen,
+       error = bus_dmamap_load(sc->sc_dmat, map, xs->data, datalen,
            NULL, mapflags);
        if (error) {
-               DPRINTF("%s: failed to load %d bytes of data\n",
-                   sc->sc_dev.dv_xname, xs->datalen);
+               DPRINTF("%s: failed to load %ld bytes of data\n",
+                   sc->sc_dev.dv_xname, datalen);
                return (error);
        }
 
        for (i = 0; i < map->dm_nsegs; i++) {
                sge = &xrd->xrd_req.req_sgl[i];
@@ -726,11 +729,11 @@ xbf_complete_cmd(struct scsi_xfer *xs, int desc)
 
        id = xrd->xrd_rsp.rsp_id;
        memset(xrd, 0, sizeof(*xrd));
        xrd->xrd_req.req_id = id;
 
-       xs->resid = 0;
+       xs->resid = xs->datalen - MIN(xs->datalen, sc->sc_maxphys);
 
        xbf_reclaim_xs(xs, desc);
        xbf_scsi_done(xs, error);
 }
 


Re: Kernel panic on 6.1: init dies under load

Dan Cross-4
Thanks for the patch; I just got a few minutes today and I applied it,
rebuilt and installed the kernel and rebooted. Sadly, I get a similar
panic. Attached is a screenshot of console output. Note that 'boot sync'
from ddb hangs forever.

        - Dan C.


On Fri, May 19, 2017 at 3:58 PM, Mike Belopuhov <[hidden email]> wrote:

> On Thu, May 18, 2017 at 21:15 -0400, Dan Cross wrote:
> > Okay, here is the output. I apologize for the screen shot; there's no other
> > particularly great way to capture the console output from the VPS and I
> > don't trust myself to type it all in without making a mistake of some kind.
> >
>
> That's OK, I can see that there's quite a bit of swapping going on.
> I haven't finished investigating yet, but the first thing I've
> noticed is that FFS read-ahead issues 64k read requests.  xbf(4)
> cannot handle more than 45056 bytes at a time so it fails the
> request.  This might be causing some serious problems.
>
> Unfortunately, it turned out that our SCSI and VFS layers don't
> implement proper handling of short reads (b_resid is ignored
> on clustered reads by the buffercache and SCSI doesn't do
> anything about it either), so I took a stab at getting it
> working.
>
> For now the most appropriate way to solve this that I've found is
> to invalidate the read-ahead portion of a cluster read: when FFS asks
> for a block, e.g. 16k, bread_cluster creates an array of bufs for
> a MAXPHYS worth of I/O sliced in chunks of the block size (e.g.
> 16k).  Then (after the I/O is done) we can walk the array from the
> last chunk down and throw away all chunks that correspond to
> failed I/O.  For example, if b_resid is 20480 and we were using
> 16k chunks, then we have to invalidate the last two bufs (32k).
>
> Unfortunately, there's a major problem that this diff doesn't
> solve: the case where we've read even less than what we were
> initially asked for (excluding all of the read-ahead blocks).  This
> is because the biodone for xbpp[0], aka "the bp", is done from
> sd_buf_done directly *before* we can do buf_fix_mapping and restore
> its intended bp->b_bcount.  In other words, when sd_buf_done calls
> biodone you cannot correlate b_bcount and b_resid and mark the
> buffer B_INVAL because you don't know its intended length.
>
> This is not a final version, but as I won't get back to it
> before Monday, I wanted to post it for a wider audience.
>
>
> diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
> index 95bc80bc0e6..1cc1943d752 100644
> --- sys/kern/vfs_bio.c
> +++ sys/kern/vfs_bio.c
> @@ -534,11 +534,29 @@ bread_cluster_callback(struct buf *bp)
>                  */
>                 buf_fix_mapping(bp, newsize);
>                 bp->b_bcount = newsize;
>         }
>
> -       for (i = 1; xbpp[i] != 0; i++) {
> +       /* Invalidate read-ahead buffers if read short */
> +       if (bp->b_resid > 0) {
> +               for (i = 0; xbpp[i] != NULL; i++)
> +                       continue;
> +               for (i = i - 1; i != 0; i--) {
> +                       if (xbpp[i]->b_bufsize <= bp->b_resid) {
> +                               bp->b_resid -= xbpp[i]->b_bufsize;
> +                               SET(xbpp[i]->b_flags, B_INVAL);
> +                       } else if (bp->b_resid > 0) {
> +                               bp->b_resid = 0;
> +                               SET(xbpp[i]->b_flags, B_INVAL);
> +                       } else
> +                               break;
> +               }
> +               if (bp->b_resid > 0)
> +                       printf("short read %ld\n", bp->b_resid);
> +       }
> +
> +       for (i = 1; xbpp[i] != NULL; i++) {
>                 if (ISSET(bp->b_flags, B_ERROR))
>                         SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
>                 biodone(xbpp[i]);
>         }
>
> @@ -605,11 +623,11 @@ bread_cluster(struct vnode *vp, daddr_t blkno, int size, struct buf **rbpp)
>                 }
>         }
>
>         bp = xbpp[0];
>
> -       xbpp[howmany] = 0;
> +       xbpp[howmany] = NULL;
>
>         inc = btodb(size);
>
>         for (i = 1; i < howmany; i++) {
>                 bcstats.pendingreads++;
> diff --git sys/dev/pv/xbf.c sys/dev/pv/xbf.c
> index d5c44770acb..9a94e3dc48f 100644
> --- sys/dev/pv/xbf.c
> +++ sys/dev/pv/xbf.c
> @@ -448,29 +448,32 @@ xbf_load_xs(struct scsi_xfer *xs, int desc)
>         struct xbf_softc *sc = xs->sc_link->adapter_softc;
>         struct xbf_sge *sge;
>         union xbf_ring_desc *xrd;
>         bus_dmamap_t map;
>         int i, error, mapflags;
> +       bus_size_t datalen;
>
>         xrd = &sc->sc_xr->xr_desc[desc];
>         map = sc->sc_xs_map[desc];
>
> +       datalen = MIN(xs->datalen, sc->sc_maxphys);
> +
>         mapflags = (sc->sc_domid << 16);
>         if (ISSET(xs->flags, SCSI_NOSLEEP))
>                 mapflags |= BUS_DMA_NOWAIT;
>         else
>                 mapflags |= BUS_DMA_WAITOK;
>         if (ISSET(xs->flags, SCSI_DATA_IN))
>                 mapflags |= BUS_DMA_READ;
>         else
>                 mapflags |= BUS_DMA_WRITE;
>
> -       error = bus_dmamap_load(sc->sc_dmat, map, xs->data, xs->datalen,
> +       error = bus_dmamap_load(sc->sc_dmat, map, xs->data, datalen,
>             NULL, mapflags);
>         if (error) {
> -               DPRINTF("%s: failed to load %d bytes of data\n",
> -                   sc->sc_dev.dv_xname, xs->datalen);
> +               DPRINTF("%s: failed to load %ld bytes of data\n",
> +                   sc->sc_dev.dv_xname, datalen);
>                 return (error);
>         }
>
>         for (i = 0; i < map->dm_nsegs; i++) {
>                 sge = &xrd->xrd_req.req_sgl[i];
> @@ -726,11 +729,11 @@ xbf_complete_cmd(struct scsi_xfer *xs, int desc)
>
>         id = xrd->xrd_rsp.rsp_id;
>         memset(xrd, 0, sizeof(*xrd));
>         xrd->xrd_req.req_id = id;
>
> -       xs->resid = 0;
> +       xs->resid = xs->datalen - MIN(xs->datalen, sc->sc_maxphys);
>
>         xbf_reclaim_xs(xs, desc);
>         xbf_scsi_done(xs, error);
>  }
>
>

Screen Shot 2017-05-24 at 12.25.06 PM.png (37K)

Re: Kernel panic on 6.1: init dies under load

Mike Belopuhov-5
On Wed, May 24, 2017 at 12:27 -0400, Dan Cross wrote:
> Thanks for the patch; I just got a few minutes today and I applied it,
> rebuilt and installed the kernel and rebooted. Sadly, I get a similar
> panic. Attached is a screenshot of console output. Note that 'boot sync'
> from ddb hangs forever.
>
>         - Dan C.

That's OK. I've discovered more problems related to 64k transfers.
The reason why we didn't notice anything bad when aborting sleep
was because sleep has a small memory footprint, but if you dump
core of a larger (> 64k) program, you'd notice the issue because
the core dump routine, like some other places in the kernel,
assumes that 64k transfers always work.

I've attempted to attack this problem from a different angle:
ensure that xbf(4) can handle 64k transfers.  Solutions to this
problem are notoriously messy and complicated and so far this
one is no exception. Today I got to the point where the system
boots multiuser but couldn't test further. I've noticed however
that "boot dump" from ddb still crashes so I know it's not 100%
right just yet, but since I won't get around to doing anything
about this until early next week, I'd appreciate a quick test
if possible.

I'm not attaching the diff since it's rather large:

http://gir.theapt.org/~mike/xbf.diff

Cheers,
Mike
