xnf crash - repeatable

Jonathon Sisson
Hi,

First off, thank you for OpenBSD in general, and thank you specifically
for the PV drivers on OpenBSD =)  The day of migrating workloads to AWS
gets ever closer for me, and I appreciate everything the OpenBSD dev
team does.

I've found what appears to be a repeatable crash that results in this:

panic: xnf0: save vs spell: 214

Stopped at      Debugger+0x9:   leave
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 14532   9243      0         0x3  0x4000000    1  python2.7
* 7215   9243      0         0x3  0x4000000    0  python2.7

Debugger() at Debugger+0x9
panic() at panic+0xfe
xnf_encap() at xnf_encap+0x1a9
xnf_start() at xnf_start+0x7f
ifq_serialize() at ifq_serialize+0xd9
if_enqueue() at if_enqueue+0x71
ether_output() at ether_output+0x166
ip_output() at ip_output+0x6d3
tcp_output() at tcp_output+0x87e
tcp_usrreq() at tcp_usrreq+0x3fc
sosend() at sosend+0x3d8
dofilewritev() at dofilewritev+0x205
sys_write() at sys_write+0x89
syscall() at syscall+0x368
--- syscall (number 4) ---
end of kernel
end trace frame: 0x9a8c96a2800, count: 1
0x9a91790279a:
--db_more--

I'm unable to run further commands at the ddb prompt, as AWS does not
provide interactive console access.

I'm using this test machine to build CURRENT and upload it to an s3
bucket that I've been using for STABLE builds.  The python code is
the awscli installed via py-pip running on Python 2.7.11.  The precise
command is:

aws s3 sync /usr/rel/ s3://$AWS_BUCKET_NAME/path/

If there is any further testing I can provide, I am more than happy
to provide any details you need.

-Jonathon

dmesg:
OpenBSD 5.9-beta (GENERIC.MP) #0: Sun Jan 17 16:44:01 PST 2016
    [hidden email]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8036286464 (7664MB)
avail mem = 7788552192 (7427MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xeb01f (12 entries)
bios0: vendor Xen version "4.2.amazon" date 12/07/2015
bios0: Xen HVM domU
acpi0 at bios0: rev 2
acpi0: sleep states S3 S4 S5
acpi0: tables DSDT FACP APIC HPET WAET SSDT SSDT
acpi0: wakeup devices
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
ioapic0 at mainbus0: apid 1 pa 0xfec00000, version 11, 48 pins
ioapic0: misconfigured as apic 0, remapped to apid 1
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2500.42 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,LONG,LAHF,FSGSBASE,SMEP,ERMS
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 100MHz
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2499.35 MHz
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,LONG,LAHF,FSGSBASE,SMEP,ERMS
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 1, core 0, package 0
acpihpet0 at acpi0: 62500000 Hz
acpiprt0 at acpi0: bus 0 (PCI0)
acpicpu0 at acpi0: C1(@1 halt!)
acpicpu1 at acpi0: C1(@1 halt!)
pvbus0 at mainbus0: Xen 4.2
xen0 at pvbus0: features 0x705, 32 grant table frames, event channel 4
"vfb" at xen0: device/vfb/0 not configured
"vbd" at xen0: device/vbd/51712 not configured
xnf0 at xen0: event channel 6, address 12:dd:e7:6d:a9:49
"console" at xen0: device/console/0 not configured
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "Intel 82441FX" rev 0x02
pcib0 at pci0 dev 1 function 0 "Intel 82371SB ISA" rev 0x00
pciide0 at pci0 dev 1 function 1 "Intel 82371SB IDE" rev 0x00: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 0: <QEMU HARDDISK>
wd0: 16-sector PIO, LBA48, 30720MB, 62914560 sectors
wd0(pciide0:0:0): using PIO mode 0, DMA mode 2
pciide0: channel 1 disabled (no drives)
piixpm0 at pci0 dev 1 function 3 "Intel 82371AB Power" rev 0x01: SMBus disabled
vga1 at pci0 dev 2 function 0 "Cirrus Logic CL-GD5446" rev 0x00
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
xspd0 at pci0 dev 3 function 0 "XenSource Platform Device" rev 0x01
isa0 at pcib0
isadma0 at isa0
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
fd0 at fdc0 drive 0: density unknown
fd1 at fdc0 drive 1: density unknown
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
pckbc0 at isa0 port 0x60/5 irq 1 irq 12
pckbd0 at pckbc0 (kbd slot)
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
wsmouse0 at pms0 mux 0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
nvram: invalid checksum
vscsi0 at root
scsibus1 at vscsi0: 256 targets
softraid0 at root
scsibus2 at softraid0: 256 targets
root on wd0a (819eec0d7a88edce.a) swap on wd0b dump on wd0b
WARNING: / was not properly unmounted
clock: unknown CMOS layout



Re: xnf crash - repeatable

Mike Belopuhov-5
On Sun, Jan 17, 2016 at 20:46 -0800, Jonathon Sisson wrote:

> Hi,
>
> First off, thank you for OpenBSD in general, and thank you specifically
> for the PV drivers on OpenBSD =)  The day of migrating workloads to AWS
> gets ever closer for me, and I appreciate everything the OpenBSD dev
> team does.
>
> I've found what appears to be a repeatable crash that results in this:
>
> panic: xnf0: save vs spell: 214
>
> Stopped at      Debugger+0x9:   leave
>    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>  14532   9243      0         0x3  0x4000000    1  python2.7
> * 7215   9243      0         0x3  0x4000000    0  python2.7
>
> Debugger() at Debugger+0x9
> panic() at panic+0xfe
> xnf_encap() at xnf_encap+0x1a9
> xnf_start() at xnf_start+0x7f
> ifq_serialize() at ifq_serialize+0xd9
> if_enqueue() at if_enqueue+0x71
> ether_output() at ether_output+0x166
> ip_output() at ip_output+0x6d3
> tcp_output() at tcp_output+0x87e
> tcp_usrreq() at tcp_usrreq+0x3fc
> sosend() at sosend+0x3d8
> dofilewritev() at dofilewritev+0x205
> sys_write() at sys_write+0x89
> syscall() at syscall+0x368
> --- syscall (number 4) ---
> end of kernel
> end trace frame: 0x9a8c96a2800, count: 1
> 0x9a91790279a:
> --db_more--
>
> I'm unable to run further commands at the console, as AWS does not
> provide console.
>
> I'm using this test machine to build CURRENT and upload it to an s3
> bucket that I've been using for STABLE builds.  The python code is
> the awscli installed via py-pip running on Python 2.7.11.  The precise
> command is:
>
> aws s3 sync /usr/rel/ s3://$AWS_BUCKET_NAME/path/
>
> If there is any further testing I can provide, I am more than happy
> to provide any details you need.
>
> -Jonathon
>

Can you please try the diff below on top of a -current kernel
(I've pushed some additional Xen fixes just now).

You should be able to copy the kernel into the AWS instance.

My math wasn't correct here, so txeof could unload a chain before
all of its descriptors/fragments had been processed.

diff --git sys/dev/pv/if_xnf.c sys/dev/pv/if_xnf.c
index 02761d8..d5d5bb0 100644
--- sys/dev/pv/if_xnf.c
+++ sys/dev/pv/if_xnf.c
@@ -489,11 +489,11 @@ xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
  struct xnf_tx_ring *txr = sc->sc_tx_ring;
  union xnf_tx_desc *txd;
  bus_dmamap_t dmap;
  int error, i, n = 0;
 
- if (((txr->txr_cons - *prod - 1) & (XNF_TX_DESC - 1)) < XNF_TX_FRAG) {
+ if ((XNF_TX_DESC - (*prod - txr->txr_cons)) < XNF_TX_FRAG) {
  error = ENOENT;
  goto errout;
  }
 
  i = *prod & (XNF_TX_DESC - 1);
@@ -513,21 +513,22 @@ xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
  i = *prod & (XNF_TX_DESC - 1);
  if (sc->sc_tx_buf[i])
  panic("%s: save vs spell: %d\n", ifp->if_xname, i);
  txd = &txr->txr_desc[i];
  if (n == 0) {
- sc->sc_tx_buf[i] = m;
  if (0 && m->m_pkthdr.csum_flags & M_IPV4_CSUM_OUT)
  txd->txd_req.txq_flags = XNF_TXF_CSUM |
     XNF_TXF_VALID;
  txd->txd_req.txq_size = m->m_pkthdr.len;
  } else
  txd->txd_req.txq_size = dmap->dm_segs[n].ds_len;
  if (n != dmap->dm_nsegs - 1)
  txd->txd_req.txq_flags |= XNF_TXF_CHUNK;
  txd->txd_req.txq_ref = dmap->dm_segs[n].ds_addr;
  txd->txd_req.txq_offset = dmap->dm_segs[n].ds_offset;
+ sc->sc_tx_buf[i] = m;
+ m = m->m_next;
  }
 
  ifp->if_opackets++;
  return (0);
 
@@ -583,11 +584,11 @@ xnf_txeof(struct xnf_softc *sc)
  if (sc->sc_tx_buf[i]) {
  dmap = sc->sc_tx_dmap[i];
  bus_dmamap_unload(sc->sc_dmat, dmap);
  m = sc->sc_tx_buf[i];
  sc->sc_tx_buf[i] = NULL;
- m_freem(m);
+ m_free(m);
  }
  pkts++;
  }
 
  if (pkts > 0) {
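
Editorial aside: the first hunk of the diff replaces one free-space expression with another. The sketch below compares the two forms on a toy ring. It assumes free-running 32-bit producer/consumer indices (which the masking of `*prod` in the diff suggests) and uses hypothetical constants `TOY_TX_DESC` and `TOY_TX_FRAG`; the real values live in if_xnf.c. It illustrates the arithmetic only and does not claim to identify the root cause of the panic.

```c
#include <stdint.h>

/* Hypothetical sizes for illustration; not taken from if_xnf.c. */
#define TOY_TX_DESC	256u	/* ring size, must be a power of two */
#define TOY_TX_FRAG	8u	/* worst-case descriptors per packet */

/* Pre-patch form: masked distance from producer to consumer, minus one. */
static uint32_t
free_masked(uint32_t prod, uint32_t cons)
{
	return (cons - prod - 1) & (TOY_TX_DESC - 1);
}

/* Patched form: ring size minus the number of descriptors in flight. */
static uint32_t
free_counted(uint32_t prod, uint32_t cons)
{
	return TOY_TX_DESC - (prod - cons);
}

/* Nonzero when a worst-case packet still fits, per the patched check. */
static int
ring_has_room(uint32_t prod, uint32_t cons)
{
	return free_counted(prod, cons) >= TOY_TX_FRAG;
}
```

Under these assumptions the two forms agree up to an offset of one while the ring is partly used, but diverge when all slots are in flight: `free_counted(256, 0)` is 0, whereas `free_masked(256, 0)` wraps around to 255.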

Re: xnf crash - repeatable

Mike Belopuhov-5
On Mon, Jan 18, 2016 at 20:25 +0100, Mike Belopuhov wrote:

> Can you please try the diff below on top of a -current kernel
> (I've pushed some additional Xen fixes just now).
>
> You should be able to copy the kernel into the AWS instance.
>
> My math wasn't correct here and txeof would unload a chain before
> we would've processed all descriptors/fragments.
>

A slight amendment to the diff (forgot one chunk).

diff --git sys/dev/pv/if_xnf.c sys/dev/pv/if_xnf.c
index 02761d8..7c0e1fb 100644
--- sys/dev/pv/if_xnf.c
+++ sys/dev/pv/if_xnf.c
@@ -489,11 +489,11 @@ xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
  struct xnf_tx_ring *txr = sc->sc_tx_ring;
  union xnf_tx_desc *txd;
  bus_dmamap_t dmap;
  int error, i, n = 0;
 
- if (((txr->txr_cons - *prod - 1) & (XNF_TX_DESC - 1)) < XNF_TX_FRAG) {
+ if ((XNF_TX_DESC - (*prod - txr->txr_cons)) < XNF_TX_FRAG) {
  error = ENOENT;
  goto errout;
  }
 
  i = *prod & (XNF_TX_DESC - 1);
@@ -513,21 +513,22 @@ xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
  i = *prod & (XNF_TX_DESC - 1);
  if (sc->sc_tx_buf[i])
  panic("%s: save vs spell: %d\n", ifp->if_xname, i);
  txd = &txr->txr_desc[i];
  if (n == 0) {
- sc->sc_tx_buf[i] = m;
  if (0 && m->m_pkthdr.csum_flags & M_IPV4_CSUM_OUT)
  txd->txd_req.txq_flags = XNF_TXF_CSUM |
     XNF_TXF_VALID;
  txd->txd_req.txq_size = m->m_pkthdr.len;
  } else
  txd->txd_req.txq_size = dmap->dm_segs[n].ds_len;
  if (n != dmap->dm_nsegs - 1)
  txd->txd_req.txq_flags |= XNF_TXF_CHUNK;
  txd->txd_req.txq_ref = dmap->dm_segs[n].ds_addr;
  txd->txd_req.txq_offset = dmap->dm_segs[n].ds_offset;
+ sc->sc_tx_buf[i] = m;
+ m = m->m_next;
  }
 
  ifp->if_opackets++;
  return (0);
 
@@ -583,11 +584,11 @@ xnf_txeof(struct xnf_softc *sc)
  if (sc->sc_tx_buf[i]) {
  dmap = sc->sc_tx_dmap[i];
  bus_dmamap_unload(sc->sc_dmat, dmap);
  m = sc->sc_tx_buf[i];
  sc->sc_tx_buf[i] = NULL;
- m_freem(m);
+ m_free(m);
  }
  pkts++;
  }
 
  if (pkts > 0) {
@@ -934,11 +935,11 @@ xnf_tx_ring_destroy(struct xnf_softc *sc)
  if (sc->sc_tx_dmap[i] == NULL)
  continue;
  bus_dmamap_unload(sc->sc_dmat, sc->sc_tx_dmap[i]);
  if (sc->sc_tx_buf[i] == NULL)
  continue;
- m_freem(sc->sc_tx_buf[i]);
+ m_free(sc->sc_tx_buf[i]);
  sc->sc_tx_buf[i] = NULL;
  }
  for (i = 0; i < XNF_TX_DESC; i++) {
  if (sc->sc_tx_dmap[i] == NULL)
  continue;
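
Editorial aside: the diff moves from recording the packet once and freeing the whole chain (`m_freem`) to recording one buffer per descriptor and freeing each individually (`m_free`). The toy model below sketches that ownership pattern. All names (`toy_buf`, `toy_free`, `toy_freem`, `toy_encap`, `toy_txeof`) are hypothetical stand-ins, loosely modeled on mbuf(9), and the sketch assumes one ring slot per buffer in the chain.

```c
#include <stdlib.h>

/* Toy stand-in for an mbuf: one buffer in a singly linked chain. */
struct toy_buf {
	struct toy_buf *next;
};

/* Free one buffer and return its successor (the role m_free plays). */
static struct toy_buf *
toy_free(struct toy_buf *b)
{
	struct toy_buf *n = b->next;

	free(b);
	return n;
}

/* Free a whole chain starting at b (the role m_freem plays). */
static void
toy_freem(struct toy_buf *b)
{
	while (b != NULL)
		b = toy_free(b);
}

/* Build a chain of n buffers, recording each in its own slot. */
static int
toy_encap(struct toy_buf **slots, int n)
{
	struct toy_buf *prev = NULL;
	int i;

	for (i = 0; i < n; i++) {
		struct toy_buf *b = calloc(1, sizeof(*b));

		if (b == NULL) {
			/* Undo: nothing is in flight yet, so drop the chain. */
			if (i > 0)
				toy_freem(slots[0]);
			while (i-- > 0)
				slots[i] = NULL;
			return -1;
		}
		if (prev != NULL)
			prev->next = b;
		slots[i] = b;
		prev = b;
	}
	return 0;
}

/* Completion path: each slot owns one buffer, so release it alone. */
static int
toy_txeof(struct toy_buf **slots, int n)
{
	int i, freed = 0;

	for (i = 0; i < n; i++) {
		if (slots[i] == NULL)
			continue;
		(void)toy_free(slots[i]);
		slots[i] = NULL;
		freed++;
	}
	return freed;
}
```

In this model, calling `toy_freem` from `toy_txeof` would release buffers that later slots still point to, which is why a per-slot layout pairs with a single-buffer free.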

Re: xnf crash - repeatable

Jonathon Sisson
On Mon, Jan 18, 2016 at 08:30:21PM +0100, Mike Belopuhov wrote:

> On Mon, Jan 18, 2016 at 20:25 +0100, Mike Belopuhov wrote:
> > Can you please try the diff below on top of a -current kernel
> > (I've pushed some additional Xen fixes just now).
> >
> > You should be able to copy the kernel into the AWS instance.
> >
> > My math wasn't correct here and txeof would unload a chain before
> > we would've processed all descriptors/fragments.
> >
>
> A slight amendment to the diff (forgot one chunk).
>
Updated sources, applied the patch and recompiled kernel.

I can't reproduce now with the same upload commands.

Even with a bit heavier testing I've been unable to get the instance
to panic again.

Thanks!  I really appreciate the PV drivers =)

-Jonathon

Re: xnf crash - repeatable

Jonathon Sisson
On Mon, Jan 18, 2016 at 08:30:21PM +0100, Mike Belopuhov wrote:

> On Mon, Jan 18, 2016 at 20:25 +0100, Mike Belopuhov wrote:
> > Can you please try the diff below on top of a -current kernel
> > (I've pushed some additional Xen fixes just now).
> >
> > You should be able to copy the kernel into the AWS instance.
> >
> > My math wasn't correct here and txeof would unload a chain before
> > we would've processed all descriptors/fragments.
> >
>
> A slight amendment to the diff (forgot one chunk).
>
>
Of course, 2 minutes after I sent that last email I get this:

panic: xnf0: save vs spell: 129

Stopped at      Debugger+0x9:   leave
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*13456  16606      0         0x3  0x4000000    1  python2.7
 10910  10910      0     0x14000      0x210    0  softnet

Debugger() at Debugger+0x9
panic() at panic+0xfe
xnf_encap() at xnf_encap+0x1b7
xnf_start() at xnf_start+0x7f
ifq_serialize() at ifq_serialize+0xd9
if_enqueue() at if_enqueue+0x71
ether_output() at ether_output+0x166
ip_output() at ip_output+0x6d3
tcp_output() at tcp_output+0x87e
tcp_usrreq() at tcp_usrreq+0x3fc
sosend() at sosend+0x3d8
dofilewritev() at dofilewritev+0x205
sys_write() at sys_write+0x89
syscall() at syscall+0x368
--- syscall (number 4) ---
end of kernel
end trace frame: 0x2f508906d00, count: 1
0x2f4e69f079a:
--db_more--

This time took considerably longer, as multiple uploads were successful
prior to the panic.  This time I had to download install58.iso and then
upload it to a test bucket to get it to panic.

My apologies for the false confirmation.

-Jonathon

Re: xnf crash - repeatable

Mike Belopuhov-5
On 18 January 2016 at 23:20, Jonathon Sisson <[hidden email]> wrote:

> On Mon, Jan 18, 2016 at 08:30:21PM +0100, Mike Belopuhov wrote:
>> On Mon, Jan 18, 2016 at 20:25 +0100, Mike Belopuhov wrote:
>> > Can you please try the diff below on top of a -current kernel
>> > (I've pushed some additional Xen fixes just now).
>> >
>> > You should be able to copy the kernel into the AWS instance.
>> >
>> > My math wasn't correct here and txeof would unload a chain before
>> > we would've processed all descriptors/fragments.
>> >
>>
>> A slight amendment to the diff (forgot one chunk).
>>
>>
> Of course, 2 minutes after I sent that last email I get this:
>
> panic: xnf0: save vs spell: 129
>
> Stopped at      Debugger+0x9:   leave
>    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> *13456  16606      0         0x3  0x4000000    1  python2.7
>  10910  10910      0     0x14000      0x210    0  softnet
>
> Debugger() at Debugger+0x9
> panic() at panic+0xfe
> xnf_encap() at xnf_encap+0x1b7
> xnf_start() at xnf_start+0x7f
> ifq_serialize() at ifq_serialize+0xd9
> if_enqueue() at if_enqueue+0x71
> ether_output() at ether_output+0x166
> ip_output() at ip_output+0x6d3
> tcp_output() at tcp_output+0x87e
> tcp_usrreq() at tcp_usrreq+0x3fc
> sosend() at sosend+0x3d8
> dofilewritev() at dofilewritev+0x205
> sys_write() at sys_write+0x89
> syscall() at syscall+0x368
> --- syscall (number 4) ---
> end of kernel
> end trace frame: 0x2f508906d00, count: 1
> 0x2f4e69f079a:
> --db_more--
>
> This time took considerably longer, as multiple uploads were successful
> prior to the panic.  This time I had to download install58.iso and then
> upload it to a test bucket to get it to panic.
>
> My apologies for the false confirmation.
>
> -Jonathon

That's OK.  Thank you for taking your time to test it.
I can reproduce the problem with tcpbench as well and
hopefully will have a solution soon.

Re: xnf crash - repeatable

Mike Belopuhov-5
On 19 January 2016 at 17:02, Mike Belopuhov <[hidden email]> wrote:

> On 18 January 2016 at 23:20, Jonathon Sisson <[hidden email]> wrote:
>> On Mon, Jan 18, 2016 at 08:30:21PM +0100, Mike Belopuhov wrote:
>>> On Mon, Jan 18, 2016 at 20:25 +0100, Mike Belopuhov wrote:
>>> > Can you please try the diff below on top of a -current kernel
>>> > (I've pushed some additional Xen fixes just now).
>>> >
>>> > You should be able to copy the kernel into the AWS instance.
>>> >
>>> > My math wasn't correct here and txeof would unload a chain before
>>> > we would've processed all descriptors/fragments.
>>> >
>>>
>>> A slight amendment to the diff (forgot one chunk).
>>>
>>>
>> Of course, 2 minutes after I sent that last email I get this:
>>
>> panic: xnf0: save vs spell: 129
>>
>> Stopped at      Debugger+0x9:   leave
>>    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>> *13456  16606      0         0x3  0x4000000    1  python2.7
>>  10910  10910      0     0x14000      0x210    0  softnet
>>
>> Debugger() at Debugger+0x9
>> panic() at panic+0xfe
>> xnf_encap() at xnf_encap+0x1b7
>> xnf_start() at xnf_start+0x7f
>> ifq_serialize() at ifq_serialize+0xd9
>> if_enqueue() at if_enqueue+0x71
>> ether_output() at ether_output+0x166
>> ip_output() at ip_output+0x6d3
>> tcp_output() at tcp_output+0x87e
>> tcp_usrreq() at tcp_usrreq+0x3fc
>> sosend() at sosend+0x3d8
>> dofilewritev() at dofilewritev+0x205
>> sys_write() at sys_write+0x89
>> syscall() at syscall+0x368
>> --- syscall (number 4) ---
>> end of kernel
>> end trace frame: 0x2f508906d00, count: 1
>> 0x2f4e69f079a:
>> --db_more--
>>
>> This time took considerably longer, as multiple uploads were successful
>> prior to the panic.  This time I had to download install58.iso and then
>> upload it to a test bucket to get it to panic.
>>
>> My apologies for the false confirmation.
>>
>> -Jonathon
>
> That's OK.  Thank you for taking your time to test it.
> I can reproduce the problem with tcpbench as well and
> hopefully will have a solution soon.

OK, this should be fixed in current.  Please update and test :-)

Re: xnf crash - repeatable

Jonathon Sisson
On Tue, Jan 19, 2016 at 06:17:40PM +0100, Mike Belopuhov wrote:
> On 19 January 2016 at 17:02, Mike Belopuhov <[hidden email]> wrote:
> > That's OK.  Thank you for taking your time to test it.
> > I can reproduce the problem with tcpbench as well and
> > hopefully will have a solution soon.
>
> OK, this should be fixed in current.  Please update and test :-)
>
Mike,

I appreciate the update.  I've rebuilt with CURRENT code checked out
as of a few hours ago and now I'm 99% certain I can't reproduce =)

To test, I pushed 4 simultaneous uploads and the instance remained
responsive.  Afterwards, I noted this:

# netstat -sI xnf0
Name    Mtu   Network     Address              Ipkts Ierrs    Opkts Oerrs Colls
xnf0    1500  <Link>      12:dd:e7:6d:a9:49  1901330     0  4210541 48602     0
xnf0    1500  172.31.47/2 ip-172-31-47-76.e  1901330     0  4210541 48602     0

I don't know if the output errors are an issue, but I thought I'd
mention it regardless.

I really appreciate your quick answers and fix.  Thank you, and the
rest of the team, for OpenBSD.

-Jonathon

Re: xnf crash - repeatable

Mike Belopuhov-5
On 20 January 2016 at 00:46, Jonathon Sisson <[hidden email]> wrote:

> On Tue, Jan 19, 2016 at 06:17:40PM +0100, Mike Belopuhov wrote:
>> On 19 January 2016 at 17:02, Mike Belopuhov <[hidden email]> wrote:
>> > That's OK.  Thank you for taking your time to test it.
>> > I can reproduce the problem with tcpbench as well and
>> > hopefully will have a solution soon.
>>
>> OK, this should be fixed in current.  Please update and test :-)
>>
> Mike,
>
> I appreciate the update.  I've rebuilt with CURRENT code checked out
> as of a few hours ago and now I'm 99% certain I can't reproduce =)
>

Great!

> To test, I pushed 4 simultaneous uploads and the instance remained
> responsive.  Afterwards, I noted this:
>
> # netstat -sI xnf0
> Name    Mtu   Network     Address              Ipkts Ierrs    Opkts Oerrs Colls
> xnf0    1500  <Link>      12:dd:e7:6d:a9:49  1901330     0  4210541 48602     0
> xnf0    1500  172.31.47/2 ip-172-31-47-76.e  1901330     0  4210541 48602     0
>
> I don't know if the output errors are an issue, but I thought I'd
> mention it regardless.
>

Indeed.  While debugging, I was counting tx ring full events as
output errors.  That is no longer appropriate, and my last commit
should fix it.  Thanks for reporting!

> I really appreciate your quick answers and fix.  Thank you, and the
> rest of the team, for OpenBSD.
>
> -Jonathon