Crash in softnet on SGI

Jesse Darrone
Hello All,

Synopsis: spurious crash in softnet
Category: kernel sgi
Environment:

  System: OpenBSD 6.0
  Details: OpenBSD 6.0-beta (GENERIC-IP22) #664: Sun Jul 10 00:31:39 MDT 2016
  Architecture: SGI (MIPS64)
  Machine: Challenge S R5000

Description:

Machine seems to hang at (seemingly) random intervals.  This has
occurred on several recent snapshots including 10-Jul. I have
reproduced the issue on multiple systems, so it doesn't seem to be a
hardware issue.  It may not be relevant but the machines are running
an MTU of 1454 on sq1.

How-To-Repeat:

Seems to repeat itself given enough time, but I've not been able to tie it
to any specific sequence of events.  I will say that the machine typically does
not run longer than a day (though it has on occasion).

Fix: Unknown


sq1: receive FIFO overflow

Trap cause = 4 Frame 0xffffffff91f439b0
Trap PC 0xffffffff888b2be0 RA 0xffffffff888b2dbc fault 0xd97d3b7057b9cf7b
pool_put+0xa8 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
 ra 0xffffffff888d18f0 sp 0xffffffff91f43b08,0
m_extfree+0x110
(1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
ra 0xffffffff888d1fa0 sp 0xffffffff91f43ba2
m_free+0x138 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
 ra 0xffffffff888d20b0 sp 0xffffffff91f43bc8, 8
m_freem+0x28 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
 ra 0xffffffff88961b88 sp 0xffffffff91f43bf8, 2
in_arpinput+0x88
(1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
ra 0xffffffff8892169c sp 0xffffffff91f43c4
ether_input+0x334
(1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)
ra 0xffffffff8891df38 sp 0xffffffff91f432
User-level: pid 34898
stopped on non ddb fault
Stopped at      pool_put+0xa8:  ld      v0,8(v1)

ddb> trace
pool_put+0xa8 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff888d18f0 sp 0xffffffff91f43b08, sz 160
m_extfree+0x110 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff888d1fa0 sp 0xffffffff91f43ba8, sz 32
m_free+0x138 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff888d20b0 sp 0xffffffff91f43bc8, sz 48
m_freem+0x28 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff88961b88 sp 0xffffffff91f43bf8, sz 32
in_arpinput+0x88 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff8892169c sp 0xffffffff91f43c18, sz 144
ether_input+0x334 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff8891df38 sp 0xffffffff91f43ca8, sz 112
if_input_process+0xf8 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff888a3968 sp 0xffffffff91f43d18, sz 80
taskq_thread+0xd0 (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0xffffffff88a797fc sp 0xffffffff91f43d68, sz 80
proc_trampoline+0x1c (1becdf323dc0c775,c0000000030a2800,c0000000030f87e0,ffffffff888d2358)  ra 0x0 sp 0xffffffff91f43db8, sz 0
User-level: pid 34898

ddb> ps
   TID   PPID   PGRP    UID  S       FLAGS  WAIT          COMMAND
 74365      1  74365      0  3    0x100083  ttyin         getty
  2436      1   2436      0  3    0x100098  poll          cron
 35314  58087  58087    619  3        0x82  kqread        bandb
 40008  58087  58087    619  3        0x82  kqread        ssld
 55415  58087  58087    619  3        0x82  kqread        resolver
 58087      1  58087    619  3        0x90  kqread        ircd
 82599      1  82599      0  3        0x80  select        sshd
 65889  67599  99686     83  3    0x100090  poll          ntpd
 67599  99686  99686     83  3    0x100090  poll          ntpd
 99686      1  99686      0  3        0x80  poll          ntpd
 97827  46866  46866     74  3    0x100090  bpf           pflogd
 46866      1  46866      0  3        0x80  netio         pflogd
 94958  52247  52247     73  2    0x100090                syslogd
 52247      1  52247      0  3    0x100080  netio         syslogd
  8668      0      0      0  3     0x14200  pgzero        zerothread
 60775      0      0      0  3     0x14200  aiodoned      aiodoned
 87475      0      0      0  3     0x14200  syncer        update
 81321      0      0      0  3     0x14200  cleaner       cleaner
 75445      0      0      0  3     0x14200  reaper        reaper
 67147      0      0      0  3     0x14200  pgdaemon      pagedaemon
 26226      0      0      0  3     0x14200  bored         crynlk
 85686      0      0      0  3     0x14200  bored         crypto
 71123      0      0      0  3     0x14200  pftm          pfpurge
*34898      0      0      0  7     0x14210                softnet
 93986      0      0      0  3     0x14200  bored         systqmp
 52845      0      0      0  3     0x14200  bored         systq
 16345      0      0      0  3  0x40014200                idle0
 69755      0      0      0  3     0x14200  kmalloc       kmthread
     1      0      1      0  3        0x82  wait          init
     0     -1      0      0  3     0x10200  scheduler     swapper

ddb> show panic
the kernel did not panic

ddb> show registers
at                0xffffffff88b60000    sysent+0xec0
v0                0xd97d3b7057b9cf73
v1                0xd97d3b7057b9cf73
a0                0x1becdf323dc0c775
a1                0xc0000000030a2800
a2                0xc0000000030f87e0
a3                0xffffffff888d2358    m_extfree_pool
a4                0xffffffff91f43be6    end+0x92e34b6
a5                              0x14
a6                              0x18
a7                               0x8
t0                               0x4
t1                0xffffffff88c0e2f0    kernel_pmap_store
t2                                 0
t3                0xffffffff91f40000    end+0x92df8d0
s0                0xc0000000030f87e0
s1                0xc0000000030a2800
s2                0xffffffff88b88070    mclpools
s3                               0x1
s4                0xc0000000000de078
s5                                 0
s6                0xc0000000030a2818
s7                0xffffffff91f43c38    end+0x92e3508
t8                        0x52f2c064
t9                0xffffffff88a95188    int2_splx
k0                0xffffffff8894a114    rtable_match+0x84
k1                0xc000000002f40bc0
gp                0xffffffff88b64430    _gp
sp                0xffffffff91f43b08    end+0x92e33d8
s8                                 0
ra                0xffffffff888b2dbc    pool_put+0x284
sr                        0x1000cfa3
lo                        0x61861862
hi                                 0
bad               0xd97d3b7057b9cf7b
cs                              0x10
pc                0xffffffff888b2be0    pool_put+0xa8
pool_put+0xa8:  ld      v0,8(v1)

ddb> continue
panic: trap
Stopped at      Debugger+0x4:   jr      ra
Debugger+0x8:    nop
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*34898  34898      0     0x14000      0x210    0  softnet
Debugger+0x4 (73e2c57b1f779808,900000001fbd9880,900000001fbd9830,ffffffff91f43830)  ra 0xffffffff888b6040 sp 0xffffffff91f43868, sz 0
panic+0x100 (73e2c57b1f779808,ffffffff91f43af0,0,ffffffff88c0eb20)  ra 0xffffffff88a76aec sp 0xffffffff91f43868, sz 112
itsa+0xf4 (73e2c57b1f779808,ffffffff91f43af0,0,ffffffff88c0eb20)  ra 0xffffffff88a7a2fc sp 0xffffffff91f438d8, sz 176
k_general+0x114 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0x0 sp 0xffffffff91f43988, sz 0
(KERNEL TRAP)
pool_put+0xa8 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff888d18f0 sp 0xffffffff91f43b08, sz 160
m_extfree+0x110 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff888d1fa0 sp 0xffffffff91f43ba8, sz 32
m_free+0x138 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff888d20b0 sp 0xffffffff91f43bc8, sz 48
m_freem+0x28 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff88961b88 sp 0xffffffff91f43bf8, sz 32
in_arpinput+0x88 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff8892169c sp 0xffffffff91f43c18, sz 144
ether_input+0x334 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff8891df38 sp 0xffffffff91f43ca8, sz 112
if_input_process+0xf8 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff888a3968 sp 0xffffffff91f43d18, sz 80
taskq_thread+0xd0 (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0xffffffff88a797fc sp 0xffffffff91f43d68, sz 80
proc_trampoline+0x1c (ffffffff91f439b0,ffffffff91f43af0,0,ffffffff888b2be0)  ra 0x0 sp 0xffffffff91f43db8, sz 0
User-level: pid 34898
http://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.

ddb> boot reboot
panic: wd33c93_scsicmd: busy
Stopped at      Debugger+0x4:   jr      ra
Debugger+0x8:    nop
Debugger+0x4 (73e2c57b1f779808,900000001fbd9880,900000001fbd9830,ffffffff91f42be0)  ra 0xffffffff888b6040 sp 0xffffffff91f42c18, sz 0
panic+0x100 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88806960 sp 0xffffffff91f42c18, sz 112
wd33c93_scsi_cmd+0x280 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88aa9e18 sp 0xffffffff91f42c88, sz 64
scsi_xs_sync+0xb8 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88ab3454 sp 0xffffffff91f42cc8, sz 64
sd_flush+0x8c (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88ab54d8 sp 0xffffffff91f42d08, sz 48
sdactivate+0x140 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff888a54dc sp 0xffffffff91f42d38, sz 48
config_suspend+0x3c (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88aae184 sp 0xffffffff91f42d68, sz 48
scsi_activate_target+0x54 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff88aae20c sp 0xffffffff91f42d98, sz 64
scsi_activate_bus+0x44 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff888a54dc sp 0xffffffff91f42dd8, sz 64
config_suspend+0x3c (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff888a5398 sp 0xffffffff91f42e18, sz 48
config_activate_children+0x78 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra 0xffffffff888a5510 sp 0xffffffff91f42e48, sz 80
config_suspend+0x70 (73e2c57b1f779808,35,c000000002b9c16b,c000000000008700)  ra

ddb> boot reboot
System restart.
sc0,1,0: cmd=0x12 timeout after 2 sec.  Resetting SCSI bus

[ using 388944 bytes of bsd ELF symbol table ]
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2016 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 6.0-beta (GENERIC-IP22) #664: Sun Jul 10 00:31:39 MDT 2016
    [hidden email]:/usr/src/sys/arch/sgi/compile/GENERIC-IP22
real mem = 167772160 (160MB)
rsvd mem = 802816 (1MB)
avail mem = 160169984 (152MB)
mainbus0 at root: Challenge S
cpu0 at mainbus0: MIPS R5000 CPU rev 1.0 150 MHz, R5000 based FPC rev 1.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 512KB direct
int0 at mainbus0 addr 0x1fbd9880
imc0 at mainbus0: revision 3
gio0 at imc0
hpc0 at gio0 addr 0x1fb80000: SGI HPC3 (onboard)
zs0 at hpc0 offset 0x00059830 irq 29: 85230
zstty0 at zs0 channel 1: console
zstty1 at zs0 channel 0
sq0 at hpc0 offset 0x00054000 irq 3: Seeq 80c03, address 08:00:69:0a:34:09
wdsc0 at hpc0 offset 0x00044000 irq 1: WD33C93B, 20.0 MHz, burst DMA
wdsc0: microcode revision 0x0d, fast SCSI
scsibus0 at wdsc0: 8 targets, initiator 0
sd0 at scsibus0 targ 1 lun 0: <SEAGATE, ST39103LCSUN9.0G, 034A> SCSI2 0/direct fixed serial.SEAGATE_ST39103LCSUN9.0GLS4557570000101519ZQ
sd0: 8637MB, 512 bytes/sector, 17689267 sectors
pione at hpc0 offset 0x00059800 irq 5 not configured
panel0 at hpc0 offset 0x00059850 irq 9: power button
dsclock0 at hpc0 offset 0x00060000
hpc1 at gio0 addr 0x1fb00000: SGI HPC3 (IO+ mezzanine)
hpc1: using EXP1's DMA channel
sq1 at hpc1 offset 0x00054000 irq 0: Seeq 80c03, address 08:00:69:02:64:d1
clock0 at mainbus0: int 5
vscsi0 at root
scsibus1 at vscsi0: 256 targets
softraid0 at root
scsibus2 at softraid0: 256 targets
boot device: sd0
root on sd0a (ffbd62fcf39fc195.a) swap on sd0b dump on sd0b
WARNING: / was not properly unmounted

Re: Crash in softnet on SGI

Miod Vallat
> Machine seems to hang at (seemingly) random intervals.  This has
> occurred on several recent snapshots including 10-Jul. I have
> reproduced the issue on multiple systems, so it doesn't seem to be a
> hardware issue.  It may not be relevant but the machines are running
> an MTU of 1454 on sq1.

Are the other systems on which the issue has been reproduced also
Challenge S systems using the second Ethernet interface?

If so, the problem might be caused by bad timing settings for the second
interface.

Does the following diff help?

Index: hpc/hpc.c
===================================================================
RCS file: /OpenBSD/src/sys/arch/sgi/hpc/hpc.c,v
retrieving revision 1.18
diff -u -p -r1.18 hpc.c
--- hpc/hpc.c 18 Sep 2015 20:50:02 -0000 1.18
+++ hpc/hpc.c 15 Jul 2016 11:43:25 -0000
@@ -427,8 +427,8 @@ hpc_attach(struct device *parent, struct
        uint32_t hpctype;
        int isonboard;
        int isioplus;
-       int giofast;
-       int needprobe;
+       int giofast = 0;
+       int needprobe = 0;
        int sysmask = 0;
 
        sc->sc_base = ga->ga_addr;
@@ -496,9 +496,34 @@ hpc_attach(struct device *parent, struct
        isioplus = (sc->sc_base == HPC_BASE_ADDRESS_1 && hpctype == 3 &&
            (sysmask & HPCDEV_IP24) != 0);
 
-       printf(": SGI HPC%d%s (%s)\n", (hpctype ==  3) ? 3 : 1,
-           (hpctype == 15) ? ".5" : "", (isonboard) ? "onboard" :
-           (isioplus) ? "IO+ mezzanine" : "GIO slot");
+       if (hpctype == 3) {
+               if (sys_config.system_subtype == IP22_INDIGO2) {
+                       /* wild guess */
+                       giofast = 1;
+               } else {
+                       /*
+                        * According to IRIX hpc3.h, the fast GIO bit
+                        * is active high, but the register value has
+                        * been found to be 0xf8 on slow GIO systems
+                        * and 0xf1 on fast ones, which tends to prove
+                        * the opposite...
+                        */
+                       if ((bus_space_read_4(sc->sc_ct, sc->sc_ch,
+                           IOC_BASE + IOC_GCREG) & IOC_GCREG_GIO_33MHZ) == 0)
+                               giofast = 1;
+               }
+       }
+
+       if (hpctype == 3) {
+               printf(": SGI HPC3 (%s, %uMHz)\n",
+                   (isonboard) ? "onboard" :
+                   (isioplus) ? "IO+ mezzanine" : "GIO slot",
+                   25 + 8 * giofast);
+       } else {
+               printf(": SGI HPC1%s (%s)\n",
+                   (hpctype == 15) ? ".5" : "",
+                   (isonboard) ? "onboard" : "GIO slot");
+       }
 
        /*
         * Configure the IOC.
@@ -586,42 +611,16 @@ hpc_attach(struct device *parent, struct
 
        if (hpctype == 3) {
                hv = &hpc3_values;
-               if (isonboard) {
+               if (isonboard)
                        hd = hpc3_onboard;
-                       if (sys_config.system_subtype == IP22_INDIGO2) {
-                               /* wild guess */
-                               giofast = 1;
-                       } else {
-                               /*
-                                * According to IRIX hpc3.h, the fast GIO bit
-                                * is active high, but the register value has
-                                * been found to be 0xf8 on slow GIO systems
-                                * and 0xf1 on fast ones, which tends to prove
-                                * the opposite...
-                                */
-                               if (bus_space_read_4(sc->sc_ct, sc->sc_ch,
-                                   IOC_BASE + IOC_GCREG) & IOC_GCREG_GIO_33MHZ)
-                                       giofast = 0;
-                               else
-                                       giofast = 1;
-                       }
-               } else {
+               else
                        hd = hpc3_devices;
-                       /*
-                        * XXX should IO+ Mezzanine use the same settings as
-                        * XXX the onboard HPC3?
-                        */
-                       giofast = 0;
-               }
-               needprobe = 0;
        } else {
                hv = &hpc1_values;
                hv->revision = hpctype;
-               giofast = 0;
-               if (isonboard) {
+               if (isonboard)
                        hd = hpc1_onboard;
-                       needprobe = 0;
-               } else {
+               else {
                        hd = hpc1_devices;
                        /*
                         * Until a reliable way of telling E++ and GIO32 SCSI
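
For readers following along: the diff moves the GIO speed test out of the onboard-only branch, so every HPC3 (including the IO+ mezzanine) gets the same check, and the attach line now prints the result as 25 + 8 * giofast. A minimal standalone sketch of that mapping follows. The mask value 0x08 is an assumption made for illustration only (the real IOC_GCREG_GIO_33MHZ definition is not shown in this thread), and the sketch_* names are hypothetical; 0xf8 and 0xf1 are the register values quoted in the diff's comment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: illustrative mask only. The kernel's IOC_GCREG_GIO_33MHZ
 * value is not shown in this thread; 0x08 is chosen because it is set in
 * 0xf8 ("slow") and clear in 0xf1 ("fast"), as the diff's comment describes.
 */
#define SKETCH_GCREG_GIO_33MHZ	0x08u

/* Mirrors the patched test: the bit being clear is treated as fast GIO. */
int
sketch_giofast(uint32_t gcreg)
{
	return (gcreg & SKETCH_GCREG_GIO_33MHZ) == 0;
}

/* Same arithmetic as the new printf argument: 25 + 8 * giofast. */
unsigned int
sketch_gio_mhz(uint32_t gcreg)
{
	return 25 + 8 * (unsigned int)sketch_giofast(gcreg);
}

/* Checks the two register values mentioned in the diff's comment. */
void
sketch_selfcheck(void)
{
	assert(sketch_gio_mhz(0xf8) == 25);	/* slow GIO */
	assert(sketch_gio_mhz(0xf1) == 33);	/* fast GIO */
}
```

Under these assumptions a machine whose register reads 0xf8 would attach as 25MHz and one reading 0xf1 as 33MHz, which is the number the patched dmesg line is meant to expose for both hpc0 and hpc1.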

Re: Crash in softnet on SGI

Jesse Darrone
Hey Miod,

Thanks for the reply!

You are correct, I am using the second interface so that seems
plausible.  My Challenge Ss are the only SGIs I have that are
currently running OpenBSD, but I do have an Octane that I was going to
rebuild and add to the mix so that could be used for validation.

Theo suggested that a fix for ARP committed the other day might have
some impact on this, so I've been testing the latest snapshot.  The
machine has been up for 19:47, so it's looking good so far.  Interestingly
enough, I was getting "sq1: receive FIFO overflow" periodically with
the previous snapshot; on this boot that has not recurred.

If the box cores again I'll test your diff and see if that improves my
situation.

Thanks again, Miod!
-Jesse


Re: Crash in softnet on SGI

Miod Vallat
> Theo suggested that a fix for ARP committed the other day might have
> some impact on this, so I've been testing the latest snapshot.  The
> machine has been up for 19:47, so it's looking good so far.  Interestingly
> enough, I was getting "sq1: receive FIFO overflow" periodically with
> the previous snapshot; on this boot that has not recurred.
>
> If the box cores again I'll test your diff and see if that improves my
> situation.

If you still get `receive FIFO overflow' messages, even if the kernel
does not panic, please test this diff and tell me what speed gets
reported for hpc0 and hpc1 attachments.

Thanks,
Miod

Re: Crash in softnet on SGI

Jesse Darrone
Hey Miod,

It crashed again last night so I rebuilt the kernel with your patch.
Both hpc0 and hpc1 now report 25 MHz.  I've included the full dmesg
below for reference.

Thanks again, Miod!
-Jesse


[ using 388904 bytes of bsd ELF symbol table ]
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2016 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 6.0 (GENERIC-IP22) #0: Sat Jul 16 12:21:12 EDT 2016
    [hidden email]:/usr/src/sys/arch/sgi/compile/GENERIC-IP22
real mem = 167772160 (160MB)
rsvd mem = 802816 (1MB)
avail mem = 160169984 (152MB)
mainbus0 at root: Challenge S
cpu0 at mainbus0: MIPS R5000 CPU rev 1.0 150 MHz, R5000 based FPC rev 1.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 512KB direct
int0 at mainbus0 addr 0x1fbd9880
imc0 at mainbus0: revision 3
gio0 at imc0
hpc0 at gio0 addr 0x1fb80000: SGI HPC3 (onboard, 25MHz)
zs0 at hpc0 offset 0x00059830 irq 29: 85230
zstty0 at zs0 channel 1: console
zstty1 at zs0 channel 0
sq0 at hpc0 offset 0x00054000 irq 3: Seeq 80c03, address 08:00:69:0a:34:09
wdsc0 at hpc0 offset 0x00044000 irq 1: WD33C93B, 20.0 MHz, burst DMA
wdsc0: microcode revision 0x0d, fast SCSI
scsibus0 at wdsc0: 8 targets, initiator 0
sd0 at scsibus0 targ 1 lun 0: <SEAGATE, ST39103LCSUN9.0G, 034A> SCSI2 0/direct fixed serial.SEAGATE_ST39103LCSUN9.0GLS4557570000101519ZQ
sd0: 8637MB, 512 bytes/sector, 17689267 sectors
pione at hpc0 offset 0x00059800 irq 5 not configured
panel0 at hpc0 offset 0x00059850 irq 9: power button
dsclock0 at hpc0 offset 0x00060000
hpc1 at gio0 addr 0x1fb00000: SGI HPC3 (IO+ mezzanine, 25MHz)
hpc1: using EXP1's DMA channel
sq1 at hpc1 offset 0x00054000 irq 0: Seeq 80c03, address 08:00:69:02:64:d1
clock0 at mainbus0: int 5
vscsi0 at root
scsibus1 at vscsi0: 256 targets
softraid0 at root
scsibus2 at softraid0: 256 targets
boot device: sd0
root on sd0a (ffbd62fcf39fc195.a) swap on sd0b dump on sd0b


Re: Crash in softnet on SGI

Jesse Darrone
It crashed again, unfortunately. :(

ddb> trace
pool_put+0xa8 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff888d1560 sp 0xffffffff91f43b48, sz 160
m_extfree+0x110 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff888d1c10 sp 0xffffffff91f43be8, sz 32
m_free+0x138 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff888d1d20 sp 0xffffffff91f43c08, sz 48
m_freem+0x28 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff889619f0 sp 0xffffffff91f43c38, sz 32
in_arpinput+0x88 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff88961d8c sp 0xffffffff91f43c58, sz 144
arpintr+0x64 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff8891d9e8 sp 0xffffffff91f43ce8, sz 64
if_netisr+0x140 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff888a35d8 sp 0xffffffff91f43d28, sz 64
taskq_thread+0xd0 (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0xffffffff88a795ac sp 0xffffffff91f43d68, sz 80
proc_trampoline+0x1c (fe8328730110b586,c000000002f1b000,c0000000030fa060,ffffffff888d1fc8)  ra 0x0 sp 0xffffffff91f43db8, sz 0
User-level: pid 30639
ddb> ps
   TID   PPID   PGRP    UID  S       FLAGS  WAIT          COMMAND
 11563      1  11563      0  3    0x100083  ttyin         getty
 28919      1  28919      0  3    0x100098  poll          cron
 97586  60856  60856    619  3        0x82  kqread        bandb
 35239  60856  60856    619  3        0x82  kqread        ssld
 61725  60856  60856    619  3        0x82  kqread        resolver
 60856      1  60856    619  3        0x90  kqread        ircd
  2439      1   2439      0  3        0x80  select        sshd
 14087  40477  35467     83  3    0x100090  poll          ntpd
 40477  35467  35467     83  3    0x100090  poll          ntpd
 35467      1  35467      0  3        0x80  poll          ntpd
 13056  50432  50432     74  3    0x100090  bpf           pflogd
 50432      1  50432      0  3        0x80  netio         pflogd
 32095  51132  51132     73  2    0x100090                syslogd
 51132      1  51132      0  3    0x100080  netio         syslogd
 46894      0      0      0  3     0x14200  pgzero        zerothread
 37484      0      0      0  3     0x14200  aiodoned      aiodoned
 92831      0      0      0  3     0x14200  syncer        update
 63336      0      0      0  3     0x14200  cleaner       cleaner
 74532      0      0      0  3     0x14200  reaper        reaper
 94086      0      0      0  3     0x14200  pgdaemon      pagedaemon
 94815      0      0      0  3     0x14200  bored         crynlk
  3084      0      0      0  3     0x14200  bored         crypto
 81861      0      0      0  3     0x14200  pftm          pfpurge
*30639      0      0      0  7     0x14210                softnet
 56005      0      0      0  3     0x14200  bored         systqmp
 45756      0      0      0  3     0x14200  bored         systq
 92539      0      0      0  3  0x40014200                idle0
 96023      0      0      0  3     0x14200  kmalloc       kmthread
     1      0      1      0  3        0x82  wait          init
     0     -1      0      0  3     0x10200  scheduler     swapper
ddb> show panic
the kernel did not panic

ddb> show registers
at                0xffffffff88b60000    sysent+0x1320
v0                0xfe8dac10ee1eb5a0
v1                0xfe8dac10ee1eb5a0
a0                0xfe8328730110b586
a1                0xc000000002f1b000
a2                0xc0000000030fa060
a3                0xffffffff888d1fc8    m_extfree_pool
a4                0xffffffff91f43c26    end+0x92e3916
a5                              0x14
a6                              0x18
a7                               0x8
t0                               0x4
t1                0xffffffff88c0ded0    kernel_pmap_store
t2                                 0
t3                0xffffffff91f40000    end+0x92dfcf0
s0                0xc0000000030fa060
s1                0xc000000002f1b000
s2                0xffffffff88b87c50    mclpools
s3                               0x1
s4                0xc0000000000de078
s5                                 0
s6                0xc000000002f1b018
s7                0xffffffff91f43c78    end+0x92e3968
t8                        0x59605df7
t9                0xffffffff88a94f38    int2_splx
k0                0xffffffff91f43c20    end+0x92e3910
k1                0xc000000002f448c0
gp                0xffffffff88b63fd0    _gp
sp                0xffffffff91f43b48    end+0x92e3838
s8                                 0
ra                0xffffffff888b2a2c    pool_put+0x284
sr                        0x1000cfa3
lo                0x231285d0dc100a00
hi                                 0
bad               0xfe8dac10ee1eb5a8
cs                              0x10
pc                0xffffffff888b2850    pool_put+0xa8
pool_put+0xa8:  ld      v0,8(v1)

ddb> continue
panic: trap
Stopped at      Debugger+0x4:   jr      ra
Debugger+0x8:    nop
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*30639  30639      0     0x14000      0x210    0  softnet
Debugger+0x4 (f2b05ff317e288e3,900000001fbd9880,900000001fbd9830,ffffffff91f43870)  ra 0xffffffff888b5cb0 sp 0xffffffff91f438a8, sz 0
panic+0x100 (f2b05ff317e288e3,ffffffff91f43b30,0,ffffffff88c0e700)  ra 0xffffffff88a7689c sp 0xffffffff91f438a8, sz 112
itsa+0xf4 (f2b05ff317e288e3,ffffffff91f43b30,0,ffffffff88c0e700)  ra 0xffffffff88a7a0ac sp 0xffffffff91f43918, sz 176
k_general+0x114 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0x0 sp 0xffffffff91f439c8, sz 0
(KERNEL TRAP)
pool_put+0xa8 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff888d1560 sp 0xffffffff91f43b48, sz 160
m_extfree+0x110 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff888d1c10 sp 0xffffffff91f43be8, sz 32
m_free+0x138 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff888d1d20 sp 0xffffffff91f43c08, sz 48
m_freem+0x28 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff889619f0 sp 0xffffffff91f43c38, sz 32
in_arpinput+0x88 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff88961d8c sp 0xffffffff91f43c58, sz 144
arpintr+0x64 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff8891d9e8 sp 0xffffffff91f43ce8, sz 64
if_netisr+0x140 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff888a35d8 sp 0xffffffff91f43d28, sz 64
taskq_thread+0xd0 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffffffff88a795ac sp 0xffffffff91f43d68, sz 80
proc_trampoline+0x1c (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0x0 sp 0xffffffff91f43db8, sz 0
User-level: pid 30639
http://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
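
For anyone comparing this trace with the earlier one, here is a rough sketch (not part of the original report) that pulls the call chain out of pasted ddb trace text. The helper name call_chain and the regular expression are illustrative assumptions; the sample text is copied from the trace above, including the terminal line wrapping, which the pattern simply skips.

```python
import re

# Start of a ddb trace frame, e.g. "pool_put+0xa8 (": symbol name, hex offset,
# then the opening parenthesis of the argument list.
FRAME_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\+0x([0-9a-fA-F]+) \(")


def call_chain(trace_text):
    """Return (symbol, offset) pairs in the order printed (innermost first)."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(1), int(match.group(2), 16)))
    return frames


# Frames below the "(KERNEL TRAP)" marker, as posted (wrapped at 80 columns).
SAMPLE = """\
pool_put+0xa8 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xffff
ffff888d1560 sp 0xffffffff91f43b48, sz 160
m_extfree+0x110 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xff
ffffff888d1c10 sp 0xffffffff91f43be8, sz 32
m_free+0x138 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xfffff
fff888d1d20 sp 0xffffffff91f43c08, sz 48
m_freem+0x28 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xfffff
fff889619f0 sp 0xffffffff91f43c38, sz 32
in_arpinput+0x88 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xf
fffffff88961d8c sp 0xffffffff91f43c58, sz 144
arpintr+0x64 (ffffffff91f439f0,ffffffff91f43b30,0,ffffffff888b2850)  ra 0xfffff
fff8891d9e8 sp 0xffffffff91f43ce8, sz 64
"""

if __name__ == "__main__":
    # Print the chain from the faulting function outwards to its callers.
    print(" <- ".join(name for name, _ in call_chain(SAMPLE)))
```

Run against both traces in this thread, the frames from in_arpinput down to pool_put carry the same symbol+offset pairs, which is consistent with the same code path faulting each time.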


On Sat, Jul 16, 2016 at 1:06 PM, Jesse Darrone <[hidden email]> wrote:

> Hey Miod,
>
> It crashed again last night so I rebuilt the kernel with your patch.
> Both hpc0 and hpc1 now report 25 MHz.  I've attached the full dmesg
> below for reference.
>
> Thanks again, Miod!
> -Jesse
>
>
> [ using 388904 bytes of bsd ELF symbol table ]
> Copyright (c) 1982, 1986, 1989, 1991, 1993
>         The Regents of the University of California.  All rights reserved.
> Copyright (c) 1995-2016 OpenBSD. All rights reserved.  http://www.OpenBSD.org
>
> OpenBSD 6.0 (GENERIC-IP22) #0: Sat Jul 16 12:21:12 EDT 2016
>     [hidden email]:/usr/src/sys/arch/sgi/compile/GENERIC-IP22
> real mem = 167772160 (160MB)
> rsvd mem = 802816 (1MB)
> avail mem = 160169984 (152MB)
> mainbus0 at root: Challenge S
> cpu0 at mainbus0: MIPS R5000 CPU rev 1.0 150 MHz, R5000 based FPC rev 1.0
> cpu0: cache L1-I 32KB D 32KB 2 way, L2 512KB direct
> int0 at mainbus0 addr 0x1fbd9880
> imc0 at mainbus0: revision 3
> gio0 at imc0
> hpc0 at gio0 addr 0x1fb80000: SGI HPC3 (onboard, 25MHz)
> zs0 at hpc0 offset 0x00059830 irq 29: 85230
> zstty0 at zs0 channel 1: console
> zstty1 at zs0 channel 0
> sq0 at hpc0 offset 0x00054000 irq 3: Seeq 80c03, address 08:00:69:0a:34:09
> wdsc0 at hpc0 offset 0x00044000 irq 1: WD33C93B, 20.0 MHz, burst DMA
> wdsc0: microcode revision 0x0d, fast SCSI
> scsibus0 at wdsc0: 8 targets, initiator 0
> sd0 at scsibus0 targ 1 lun 0: <SEAGATE, ST39103LCSUN9.0G, 034A> SCSI2
> 0/direct fixed serial.SEAGATE_ST39103LCSUN9.0GLS4557570000101519ZQ
> sd0: 8637MB, 512 bytes/sector, 17689267 sectors
> pione at hpc0 offset 0x00059800 irq 5 not configured
> panel0 at hpc0 offset 0x00059850 irq 9: power button
> dsclock0 at hpc0 offset 0x00060000
> hpc1 at gio0 addr 0x1fb00000: SGI HPC3 (IO+ mezzanine, 25MHz)
> hpc1: using EXP1's DMA channel
> sq1 at hpc1 offset 0x00054000 irq 0: Seeq 80c03, address 08:00:69:02:64:d1
> clock0 at mainbus0: int 5
> vscsi0 at root
> scsibus1 at vscsi0: 256 targets
> softraid0 at root
> scsibus2 at softraid0: 256 targets
> boot device: sd0
> root on sd0a (ffbd62fcf39fc195.a) swap on sd0b dump on sd0b
>
> On Fri, Jul 15, 2016 at 1:00 PM, Miod Vallat <[hidden email]> wrote:
>>> Theo suggested that a fix for ARP committed the other day might have
>>> some impact on this so I've been testing the latest snapshot.  I've
>>> been up for 19:47, so it's looking good so far.  Interestingly
>>> enough, I was getting "sq1: receive FIFO overflow" periodically with
>>> the previous snapshot; on this boot that has not recurred so far.
>>>
>>> If the box cores again I'll test your diff and see if that improves my
>>> situation.
>>
>> If you still get `receive FIFO overflow' messages, even if the kernel
>> does not panic, please test this diff and tell me what speed gets
>> reported for hpc0 and hpc1 attachments.
>>
>> Thanks,
>> Miod


Re: Crash in softnet on SGI

Miod Vallat
In reply to this post by Jesse Darrone
> It crashed again last night so I rebuilt the kernel with your patch.
> Both hpc0 and hpc1 now report 25 MHz.  I've attached the full dmesg
> below for reference.

I was wondering if your system had a 33MHz GIO bus and was incorrectly
using the 25MHz settings. But since the diff now reports 25MHz, the
settings do not change and the diff doesn't change anything.

I am a bit surprised, because here my R5000 Indy has a 33MHz GIO bus
while all the R4000/R4400 Indys have 25MHz GIO buses, so I would naively
expect your R5000 Challenge S system to also use a 33MHz flavour; but
there might be good reasons for things to be this way (also, maybe SGI
used 25MHz buses for the R5000 model initially, and only switched to
33MHz months or years later).

Back to square one...
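
As an aside for anyone wanting to compare the reported GIO bus speed across several machines, the following is a minimal sketch that extracts it from saved dmesg text. It assumes the hpc attach lines look exactly like the ones in the dmesg quoted in this thread (a kernel built with the diff discussed here); the function name hpc_speeds and the regular expression are illustrative, not taken from any OpenBSD tool.

```python
import re

# Attach line as it appears in the posted dmesg, for example:
#   hpc0 at gio0 addr 0x1fb80000: SGI HPC3 (onboard, 25MHz)
HPC_RE = re.compile(
    r"^(hpc\d+) at gio\d+ addr 0x[0-9a-f]+: SGI HPC3 \((.+), (\d+)MHz\)$"
)


def hpc_speeds(dmesg_text):
    """Map each hpc device to (description, speed in MHz)."""
    speeds = {}
    for line in dmesg_text.splitlines():
        # Drop mail quote markers ("> ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        match = HPC_RE.match(line)
        if match:
            speeds[match.group(1)] = (match.group(2), int(match.group(3)))
    return speeds


# Lines copied from the dmesg posted earlier in the thread.
SAMPLE = """\
> gio0 at imc0
> hpc0 at gio0 addr 0x1fb80000: SGI HPC3 (onboard, 25MHz)
> zs0 at hpc0 offset 0x00059830 irq 29: 85230
> hpc1 at gio0 addr 0x1fb00000: SGI HPC3 (IO+ mezzanine, 25MHz)
> hpc1: using EXP1's DMA channel
"""

if __name__ == "__main__":
    for dev, (desc, mhz) in sorted(hpc_speeds(SAMPLE).items()):
        print(f"{dev}: {desc}, {mhz} MHz")
```

If the attach line format differs on another kernel, the pattern would need adjusting; lines that do not match are simply ignored.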


Re: Crash in softnet on SGI

Jesse Darrone
Hmm, that's curious.  I do have 4 of these with slightly different
configurations (even one that is Tandem-badged), so I could see if the
bus speed is the same or different across all of them.  I realize it
might not be all that valuable other than to satisfy some curiosity,
but it would be an interesting data point nonetheless.

As far as troubleshooting goes, I'll try to switch to the primary
interface to see if it makes a difference (maybe we can rule something
out?).  I was only using the secondary so I wouldn't need to employ an
external transceiver.

-Jesse


On Mon, Jul 18, 2016 at 2:46 AM, Miod Vallat <[hidden email]> wrote:

>> It crashed again last night so I rebuilt the kernel with your patch.
>> Both hpc0 and hpc1 now report 25 MHz.  I've attached the full dmesg
>> below for reference.
>
> I was wondering if your system had a 33MHz GIO bus and was incorrectly
> using the 25MHz settings. But since the diff now reports 25MHz, the
> settings do not change and the diff doesn't change anything.
>
> I am a bit surprised, because here my R5000 Indy has a 33MHz GIO bus
> while all the R4000/R4400 Indys have 25MHz GIO buses, so I would naively
> expect your R5000 Challenge S system to also use a 33MHz flavour; but
> there might be good reasons for things to be this way (also, maybe SGI
> used 25MHz buses for the R5000 model initially, and only switched to
> 33MHz months or years later).
>
> Back to square one...


Re: Crash in softnet on SGI

Jesse Darrone
So even using the sq0 interface it continues to crash the same way.

In the name of science, I've also checked the bus speed on my 3 other
Challenge S systems and they are all 25MHz (including the one R4400
system I have).  The others are R5000 150 and 180 MHz.

-Jesse

On Mon, Jul 18, 2016 at 9:43 AM, Jesse Darrone <[hidden email]> wrote:

> Hmm, that's curious.  I do have 4 of these with slightly different
> configurations (even one that is Tandem-badged), so I could see if the
> bus speed is the same or different across all of them.  I realize it
> might not be all that valuable other than to satisfy some curiosity,
> but it would be an interesting data point nonetheless.
>
> As far as troubleshooting goes, I'll try to switch to the primary
> interface to see if it makes a difference (maybe we can rule something
> out?).  I was only using the secondary so I wouldn't need to employ an
> external transceiver.
>
> -Jesse
>
>
> On Mon, Jul 18, 2016 at 2:46 AM, Miod Vallat <[hidden email]> wrote:
>>> It crashed again last night so I rebuilt the kernel with your patch.
>>> Both hpc0 and hpc1 now report 25 MHz.  I've attached the full dmesg
>>> below for reference.
>>
>> I was wondering if your system had a 33MHz GIO bus and was incorrectly
>> using the 25MHz settings. But since the diff now reports 25MHz, the
>> settings do not change and the diff doesn't change anything.
>>
>> I am a bit surprised, because here my R5000 Indy has a 33MHz GIO bus
>> while all the R4000/R4400 Indys have 25MHz GIO buses, so I would naively
>> expect your R5000 Challenge S system to also use a 33MHz flavour; but
>> there might be good reasons for things to be this way (also, maybe SGI
>> used 25MHz buses for the R5000 model initially, and only switched to
>> 33MHz months or years later).
>>
>> Back to square one...