sshd high cpu : system freeze

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

sshd high cpu : system freeze

Thuban
Hello,
While transferring files over SSH (either scp or sftp used) from/to an USB drive
plugged in a ROCK64 from/to another machine, the CPU usage keep increasing and
finally, the ROCK64 freeze.
Because of the freeze, I'm not sure how to get errors. Please telle me how to
debug this.

Some details :
- Both systems (server and client) are on OpenBSD : client is 6.4 (any
architecture), ROCK64 is
last -current (architecture arm64).
- The usb drive is mounted with this fstab entry :
        782f1ddb783cdd13.b /mnt/bigstorage ffs rw,noexec,nodev,nosuid 1 2
- The issue only happens when transferring from/to the USB hard drive.
- I guess the issue can be related to USB transfert, not SSH.
- dmesg :

OpenBSD 6.4-current (GENERIC.MP) #383: Mon Jan 14 05:05:43 MST 2019
    [hidden email]:/usr/src/sys/arch/arm64/compile/GENERIC.MP
real mem  = 4213792768 (4018MB)
avail mem = 4050677760 (3863MB)
mainbus0 at root: Pine64 Rock64
cpu0 at mainbus0 mpidr 0: ARM Cortex-A53 r0p4
cpu0: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu0: 256KB 64b/line 16-way L2 cache
efi0 at mainbus0: UEFI 2.0.5
efi0: Das U-boot rev 0x0
psci0 at mainbus0: PSCI 1.0
syscon0 at mainbus0: "syscon"
"io-domains" at syscon0 not configured
"grf-gpio" at syscon0 not configured
"power-controller" at syscon0 not configured
"reboot-mode" at syscon0 not configured
rkclock0 at mainbus0
syscon1 at mainbus0: "syscon"
"usb2-phy" at syscon1 not configured
ampintc0 at mainbus0 nirq 160, ncpu 4 ipi: 0, 1: "interrupt-controller"
rkpinctrl0 at mainbus0: "pinctrl"
rkgpio0 at rkpinctrl0
rkgpio1 at rkpinctrl0
rkgpio2 at rkpinctrl0
rkgpio3 at rkpinctrl0
simplebus0 at mainbus0: "amba"
"dmac" at simplebus0 not configured
agtimer0 at mainbus0: tick rate 24000 KHz
com0 at mainbus0: ns16550, no working fifo
com0: console
rkiic0 at mainbus0
iic0 at rkiic0
rkpmic0 at iic0 addr 0x18: RK805
rktemp0 at mainbus0
dwmmc0 at mainbus0: 50 MHz base clock
sdmmc0 at dwmmc0: 4-bit, sd high-speed, mmc high-speed, dma
dwmmc1 at mainbus0: 50 MHz base clock
sdmmc1 at dwmmc1: 8-bit, mmc high-speed, dma
dwge0 at mainbus0
dwge0: address: d6:8f:58:d2:50:ce
rgephy0 at dwge0 phy 0: RTL8169S/8110S/8211 PHY, rev. 6
ehci0 at mainbus0
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
ohci0 at mainbus0: version 1.0
cpu1 at mainbus0 mpidr 1: ARM Cortex-A53 r0p4
cpu1: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu1: 256KB 64b/line 16-way L2 cache
cpu2 at mainbus0 mpidr 2: ARM Cortex-A53 r0p4
cpu2: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu2: 256KB 64b/line 16-way L2 cache
cpu3 at mainbus0 mpidr 3: ARM Cortex-A53 r0p4
cpu3: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu3: 256KB 64b/line 16-way L2 cache
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
scsibus0 at sdmmc1: 2 targets, initiator 0
sd0 at scsibus0 targ 1 lun 0: <SD/MMC, NCard, 0000> SCSI2 0/direct removable
sd0: 29600MB, 512 bytes/sector, 60620800 sectors
scsibus1 at sdmmc0: 2 targets, initiator 0
sd1 at scsibus1 targ 1 lun 0: <SD/MMC, USD00, 0010> SCSI2 0/direct removable
sd1: 15080MB, 512 bytes/sector, 30883840 sectors
umass0 at uhub0 port 1 configuration 1 interface 0 "Western Digital Ext HDD 1021" rev 2.00/20.21 addr 2
umass0: using SCSI over Bulk-Only
scsibus2 at umass0: 2 targets, initiator 0
sd2 at scsibus2 targ 1 lun 0: <WD, Ext HDD 1021, 2021> SCSI2 0/direct fixed serial.10581021383235373034
sd2: 1907727MB, 512 bytes/sector, 3907024896 sectors
vscsi0 at root
scsibus3 at vscsi0: 256 targets
softraid0 at root
scsibus4 at softraid0: 256 targets
bootfile: sd0a:/bsd
boot device: sd0
root on sd0a (532ea4ce9a5e001c.a) swap on sd0b dump on sd0b
WARNING: preposterous clock chip time
WARNING: CHECK AND RESET THE DATE!

--
    thuban

Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Thuban
* Thuban <thubanyeuxdelibad!net> le [14-01-2019 16:16:01 +0000]:

> Hello,
> While transferring files over SSH (either scp or sftp used) from/to an USB drive
> plugged in a ROCK64 from/to another machine, the CPU usage keep increasing and
> finally, the ROCK64 freeze.
> Because of the freeze, I'm not sure how to get errors. Please telle me how to
> debug this.
>
> Some details :
> - Both systems (server and client) are on OpenBSD : client is 6.4 (any
> architecture), ROCK64 is
> last -current (architecture arm64).
> - The usb drive is mounted with this fstab entry :
> 782f1ddb783cdd13.b /mnt/bigstorage ffs rw,noexec,nodev,nosuid 1 2
> - The issue only happens when transferring from/to the USB hard drive.
> - I guess the issue can be related to USB transfert, not SSH.
> - dmesg :
>
> OpenBSD 6.4-current (GENERIC.MP) #383: Mon Jan 14 05:05:43 MST 2019
>     [hidden email]:/usr/src/sys/arch/arm64/compile/GENERIC.MP
> real mem  = 4213792768 (4018MB)
> avail mem = 4050677760 (3863MB)
> mainbus0 at root: Pine64 Rock64
> cpu0 at mainbus0 mpidr 0: ARM Cortex-A53 r0p4
> cpu0: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
> cpu0: 256KB 64b/line 16-way L2 cache
> efi0 at mainbus0: UEFI 2.0.5
> efi0: Das U-boot rev 0x0
> psci0 at mainbus0: PSCI 1.0
> syscon0 at mainbus0: "syscon"
> "io-domains" at syscon0 not configured
> "grf-gpio" at syscon0 not configured
> "power-controller" at syscon0 not configured
> "reboot-mode" at syscon0 not configured
> rkclock0 at mainbus0
> syscon1 at mainbus0: "syscon"
> "usb2-phy" at syscon1 not configured
> ampintc0 at mainbus0 nirq 160, ncpu 4 ipi: 0, 1: "interrupt-controller"
> rkpinctrl0 at mainbus0: "pinctrl"
> rkgpio0 at rkpinctrl0
> rkgpio1 at rkpinctrl0
> rkgpio2 at rkpinctrl0
> rkgpio3 at rkpinctrl0
> simplebus0 at mainbus0: "amba"
> "dmac" at simplebus0 not configured
> agtimer0 at mainbus0: tick rate 24000 KHz
> com0 at mainbus0: ns16550, no working fifo
> com0: console
> rkiic0 at mainbus0
> iic0 at rkiic0
> rkpmic0 at iic0 addr 0x18: RK805
> rktemp0 at mainbus0
> dwmmc0 at mainbus0: 50 MHz base clock
> sdmmc0 at dwmmc0: 4-bit, sd high-speed, mmc high-speed, dma
> dwmmc1 at mainbus0: 50 MHz base clock
> sdmmc1 at dwmmc1: 8-bit, mmc high-speed, dma
> dwge0 at mainbus0
> dwge0: address: d6:8f:58:d2:50:ce
> rgephy0 at dwge0 phy 0: RTL8169S/8110S/8211 PHY, rev. 6
> ehci0 at mainbus0
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
> ohci0 at mainbus0: version 1.0
> cpu1 at mainbus0 mpidr 1: ARM Cortex-A53 r0p4
> cpu1: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
> cpu1: 256KB 64b/line 16-way L2 cache
> cpu2 at mainbus0 mpidr 2: ARM Cortex-A53 r0p4
> cpu2: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
> cpu2: 256KB 64b/line 16-way L2 cache
> cpu3 at mainbus0 mpidr 3: ARM Cortex-A53 r0p4
> cpu3: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
> cpu3: 256KB 64b/line 16-way L2 cache
> usb1 at ohci0: USB revision 1.0
> uhub1 at usb1 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
> scsibus0 at sdmmc1: 2 targets, initiator 0
> sd0 at scsibus0 targ 1 lun 0: <SD/MMC, NCard, 0000> SCSI2 0/direct removable
> sd0: 29600MB, 512 bytes/sector, 60620800 sectors
> scsibus1 at sdmmc0: 2 targets, initiator 0
> sd1 at scsibus1 targ 1 lun 0: <SD/MMC, USD00, 0010> SCSI2 0/direct removable
> sd1: 15080MB, 512 bytes/sector, 30883840 sectors
> umass0 at uhub0 port 1 configuration 1 interface 0 "Western Digital Ext HDD 1021" rev 2.00/20.21 addr 2
> umass0: using SCSI over Bulk-Only
> scsibus2 at umass0: 2 targets, initiator 0
> sd2 at scsibus2 targ 1 lun 0: <WD, Ext HDD 1021, 2021> SCSI2 0/direct fixed serial.10581021383235373034
> sd2: 1907727MB, 512 bytes/sector, 3907024896 sectors
> vscsi0 at root
> scsibus3 at vscsi0: 256 targets
> softraid0 at root
> scsibus4 at softraid0: 256 targets
> bootfile: sd0a:/bsd
> boot device: sd0
> root on sd0a (532ea4ce9a5e001c.a) swap on sd0b dump on sd0b
> WARNING: preposterous clock chip time
> WARNING: CHECK AND RESET THE DATE!


I've got some more.
After mounting the disk with "softdep" option, I saw this message on the
console a short while before the crash :

        ehci_sync_hc: tsleep() = 35
        ehci_sync_hc: tsleep() = 35
        ehci_sync_hc: tsleep() = 35
        ehci_sync_hc: tsleep() = 35
        /var: got error 5 while accessing filesystem
        panic: softdep_deallocate_dependencies: unrecovered I/O error
        Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
        PU  COMMAND
         311215  63029     73    0x100010          0    1  syslogd
         390837  99391      0     0x panic+0x150
        panic() at brelse+0xc4
        brelse() at sd_buf_done+0x10c
        sd_buf_done() at scsi_done+0x34
        scsi_done() at usb_transfer_complete+0x238
        usb_transfer_complete() at ehci_abort_xfer+0x258
        ehci_abort_xfer() at ehci_timeout_task+0x34


Then, the ddb prompt showed up, but I wasn't sure what to do next.
Please, tell me how to give you more relevant information to solve this
issue.

Regards.

Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Karel Gardas

Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.

On Sun, 20 Jan 2019 11:49:32 +0100
Thuban <[hidden email]> wrote:

> I've got some more.
> After mounting the disk with "softdep" option, I saw this message on the
> console a short while before the crash :
>
> ehci_sync_hc: tsleep() = 35
> ehci_sync_hc: tsleep() = 35
> ehci_sync_hc: tsleep() = 35
> ehci_sync_hc: tsleep() = 35
> /var: got error 5 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error
> Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
> PU  COMMAND
> 311215  63029     73    0x100010          0    1  syslogd
> 390837  99391      0     0x panic+0x150
> panic() at brelse+0xc4
> brelse() at sd_buf_done+0x10c
> sd_buf_done() at scsi_done+0x34
> scsi_done() at usb_transfer_complete+0x238
> usb_transfer_complete() at ehci_abort_xfer+0x258
> ehci_abort_xfer() at ehci_timeout_task+0x34
>
>
> Then, the ddb prompt showed up, but I wasn't sure what to do next.
> Please, tell me how to give you more relevant information to solve this
> issue.
>
> Regards.
>


Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Juan Francisco Cantero Hurtado
On Sun, Jan 20, 2019 at 07:24:44PM +0100, Karel Gardas wrote:
>
> Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.

The panic is in softdep but he had also a crash without softdep.

Thuban, can you run fsck on the partitions of the USB drive using amd64
and try to crash the system without softdep?. Run the same commands on
the ddb prompt as your previous mail.

>
> On Sun, 20 Jan 2019 11:49:32 +0100
> Thuban <[hidden email]> wrote:
>
> > I've got some more.
> > After mounting the disk with "softdep" option, I saw this message on the
> > console a short while before the crash :
> >
> > ehci_sync_hc: tsleep() = 35
> > ehci_sync_hc: tsleep() = 35
> > ehci_sync_hc: tsleep() = 35
> > ehci_sync_hc: tsleep() = 35
> > /var: got error 5 while accessing filesystem
> > panic: softdep_deallocate_dependencies: unrecovered I/O error
> > Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
> > PU  COMMAND
> > 311215  63029     73    0x100010          0    1  syslogd
> > 390837  99391      0     0x panic+0x150
> > panic() at brelse+0xc4
> > brelse() at sd_buf_done+0x10c
> > sd_buf_done() at scsi_done+0x34
> > scsi_done() at usb_transfer_complete+0x238
> > usb_transfer_complete() at ehci_abort_xfer+0x258
> > ehci_abort_xfer() at ehci_timeout_task+0x34
> >
> >
> > Then, the ddb prompt showed up, but I wasn't sure what to do next.
> > Please, tell me how to give you more relevant information to solve this
> > issue.
> >
> > Regards.
> >
>
>

--
Juan Francisco Cantero Hurtado http://juanfra.info

Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Thuban
* Juan Francisco Cantero Hurtado <[hidden email]> le [20-01-2019 23:39:45 +0100]:
> On Sun, Jan 20, 2019 at 07:24:44PM +0100, Karel Gardas wrote:
> >
> > Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.
>
> The panic is in softdep but he had also a crash without softdep.
>

True.
Without softdep, I can't get access to the console (or didn't manage
to).

I ran fsck -yf on the disk via on an amd64 machine as suggested.
(Though, crash happens on arm64).
Everything seems ok.

        moria# fsck -fy /dev/sd2a
        ** /dev/rsd2a
        ** File system is already clean
        ** Last Mounted on /vol/Samsung SSD 850-p1
        ** Phase 1 - Check Blocks and Sizes
        ** Phase 2 - Check Pathnames
        ** Phase 3 - Check Connectivity
        ** Phase 4 - Check Reference Counts
        ** Phase 5 - Check Cyl groups
        20863 files, 1492748 used, 37517771 free (739 frags, 4689629 blocks, 0.0% fragme
        moria# fsck -fy /dev/sd2d
        ** /dev/rsd2d
        ** File system is already clean
        ** Last Mounted on /vol/Samsung SSD 850-p2
        ** Phase 1 - Check Blocks and Sizes
        ** Phase 2 - Check Pathnames
        ** Phase 3 - Check Connectivity
        ** Phase 4 - Check Reference Counts
        ** Phase 5 - Check Cyl groups
        4 files, 4 used, 1034203 free (35 frags, 129271 blocks, 0.0% fragmentation)
        moria# fsck -fy /dev/sd2e
        ** /dev/rsd2e
        ** File system is already clean
        ** Last Mounted on /vol/Samsung SSD 850-p3
        ** Phase 1 - Check Blocks and Sizes
        ** Phase 2 - Check Pathnames
        ** Phase 3 - Check Connectivity
        ** Phase 4 - Check Reference Counts
        ** Phase 5 - Check Cyl groups
        26838 files, 2769562 used, 98993405 free (381 frags, 12374128 blocks, 0.0% fragmentation)


I had to crash the server at least 5 times to get an access to the
console. Else, all I was able to do was a hard reboot.

Here the ddb output after a new crash.

        /var: got error 5 while accessing filesystem
        panic: softdep_deallocate_dependencies: unrecovered I/O error
        Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
        PU  COMMAND
         161990  94125      0     0x14000      0x200    0  zerothread
        *365345   9753      0    c+0x150
        panic() at brelse+0xc4
        brelse() at sd_buf_done+0x124
        sd_buf_done() at scsi_done+0x34
        scsi_done() at usb_transfer_complete+0x238
        usb_transfer_complete() at ehci_abort_xfer+0x258
        ehci_abort_xfer() at ehci_timeout_task+0x34
        https://www.openbsd.
        ddb{1}>

        ddb{1}> trace
        db_enter() at panic+0x150
        panic() at brelse+0xc4
        brelse() at sd_buf_done+0x124
        sd_buf_done() at scsi_done+0x34
        scsi_done() at usb_transfer_complete+0x238
        usb_transfer_complete() at ehci_abort_xfer+0x258
        ehci_abort_xfer() at ehci_timeout_task+0x34
        ehci_timeout_task() at usb_abort_task_thread+0xcc
        usb_abort_task_thread() at proc_trampoline+0x10

        ddb{1}> show panic
        softdep_deallocate_dependencies: unrecovered I/O error


        ddb{1}> show uvm
        Current UVM status:
          pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
          988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
        ero)
          min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
          freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
          faults=67079988, trap53705(153706), anget(retries)=20911670(0), amapcopy=1415
        0766
                neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
                cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
        =17018267
          daemon and swap counet=0
                nswapdev=1
                swpages=145355, swpginuse=0, swpgonly=0 paging=0
          kernel pointers:
                objs(kern)=0xffffff8000b10ed0
        ddb{1}>
        Current UVM status:
          pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
          988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
        ero)
          min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
          freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
          faults=67079988, trap)=153705(153706), anget(retries)=20911670(0), amapcopy=1415
        0766
                neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
                cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
        =17018267
          daemon and swap cget=0
                nswapdev=1
                swpages=145355, swpginuse=0, swpgonly=0 paging=0
          kernel pointers:
                objs(kern)=0xffffff8000b10ed0
        ddb{1}>
        Current UVM status:
          pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
          988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
        ero)
          min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
          freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
          faults=67079988, traps(total)=153705(153706), anget(retries)=20911670(0), amapcopy=1415
        0766
                neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
                cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
        =17018267
          daemon and nswget=0
                nswapdev=1
                swpages=145355, swpginuse=0, swpgonly=0 paging=0
          kernel pointers:
                objs(kern)=0xffffff8000b10ed0
        ddb{1}>
        Current UVM status:
          pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
          988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
        ero)
          min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
          freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
          faults=67079988, trap153705(153706), anget(retries)=20911670(0), amapcopy=1415
        0766
                neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
                cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
        =17018267
          daemon and swap couget=0
                nswapdev=1
                swpages=145355, swpginuse=0, swpgonly=0 paging=0
          kernel pointers:
                objs(kern)=0xffffff8000b10ed0
        ddb{1}>
        Current UVM status:
          pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
          988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
        ero)
          min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
          freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
          faults=67079988, trap(153706), anget(retries)=20911670(0), amapcopy=1415
        0766
                neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
                cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
        =17018267
          daemon and swap counts:

        ddb{1}> show bcstats
        Current Buffer Cache status:
        numbufs 24906 busymapped 135, delwri 35
        kvaslots 6553 avail kva slots 6418
        bufpages 197730, dmapages 197730, dirtypages 380
        pendingreads 2, pendingwrites 129
        highflips 0, highflops 0, dmaflips 0


Regards.

Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Juan Francisco Cantero Hurtado
On Wed, Jan 23, 2019 at 07:47:16PM +0100, Thuban wrote:

> * Juan Francisco Cantero Hurtado <[hidden email]> le [20-01-2019 23:39:45 +0100]:
> > On Sun, Jan 20, 2019 at 07:24:44PM +0100, Karel Gardas wrote:
> > >
> > > Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.
> >
> > The panic is in softdep but he had also a crash without softdep.
> >
>
> True.
> Without softdep, I can't get access to the console (or didn't manage
> to).
>
> I ran fsck -yf on the disk via on an amd64 machine as suggested.
> (Though, crash happens on arm64).

I suggested amd64 because the arch has more stable and widely tested
drivers.

> Everything seems ok.
>
> moria# fsck -fy /dev/sd2a
> ** /dev/rsd2a
> ** File system is already clean
> ** Last Mounted on /vol/Samsung SSD 850-p1
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 20863 files, 1492748 used, 37517771 free (739 frags, 4689629 blocks, 0.0% fragme
> moria# fsck -fy /dev/sd2d
> ** /dev/rsd2d
> ** File system is already clean
> ** Last Mounted on /vol/Samsung SSD 850-p2
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 4 files, 4 used, 1034203 free (35 frags, 129271 blocks, 0.0% fragmentation)
> moria# fsck -fy /dev/sd2e
> ** /dev/rsd2e
> ** File system is already clean
> ** Last Mounted on /vol/Samsung SSD 850-p3
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 26838 files, 2769562 used, 98993405 free (381 frags, 12374128 blocks, 0.0% fragmentation)
>
>
> I had to crash the server at least 5 times to get an access to the
> console. Else, all I was able to do was a hard reboot.
>
> Here the ddb output after a new crash.
>
> /var: got error 5 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error

"unrecovered I/O error" sometimes happens due to a bad sector. Use amd64
to fill the SSD with zeroes using dd and use also nick's suggestion from
this thread:

http://openbsd-archive.7691.n7.nabble.com/ahci-error-during-install-of-6-4-td357865.html

The dmesg will show if you have bad sectors or not.



> Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
> PU  COMMAND
> 161990  94125      0     0x14000      0x200    0  zerothread
> *365345   9753      0    c+0x150
> panic() at brelse+0xc4
> brelse() at sd_buf_done+0x124

You're mounting the partitions from fstab, at the start of the init
process. The reason why you only can boot with softdep is because
softdep defers the write of the problematic blocks. sd_buf_done() is
part of softdep and brelse() needs to write to the disk.

If you need to save data from the drives, remove the partitions from
fstab and mount both manually with "ro" as mount option. If you're
lucky, the copy will not fail with ro.


> sd_buf_done() at scsi_done+0x34
> scsi_done() at usb_transfer_complete+0x238
> usb_transfer_complete() at ehci_abort_xfer+0x258
> ehci_abort_xfer() at ehci_timeout_task+0x34
> https://www.openbsd.
> ddb{1}>
>
> ddb{1}> trace
> db_enter() at panic+0x150
> panic() at brelse+0xc4
> brelse() at sd_buf_done+0x124
> sd_buf_done() at scsi_done+0x34
> scsi_done() at usb_transfer_complete+0x238
> usb_transfer_complete() at ehci_abort_xfer+0x258
> ehci_abort_xfer() at ehci_timeout_task+0x34
> ehci_timeout_task() at usb_abort_task_thread+0xcc
> usb_abort_task_thread() at proc_trampoline+0x10
>
> ddb{1}> show panic
> softdep_deallocate_dependencies: unrecovered I/O error
>
>
> ddb{1}> show uvm
> Current UVM status:
>  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>  988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
> ero)
>  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>  freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
>  faults=67079988, trap53705(153706), anget(retries)=20911670(0), amapcopy=1415
> 0766
> neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
> cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
> =17018267
>  daemon and swap counet=0
> nswapdev=1
> swpages=145355, swpginuse=0, swpgonly=0 paging=0
>  kernel pointers:
> objs(kern)=0xffffff8000b10ed0
> ddb{1}>
> Current UVM status:
>  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>  988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
> ero)
>  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>  freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
>  faults=67079988, trap)=153705(153706), anget(retries)=20911670(0), amapcopy=1415
> 0766
> neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
> cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
> =17018267
>  daemon and swap cget=0
> nswapdev=1
> swpages=145355, swpginuse=0, swpgonly=0 paging=0
>  kernel pointers:
> objs(kern)=0xffffff8000b10ed0
> ddb{1}>
> Current UVM status:
>  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>  988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
> ero)
>  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>  freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
>  faults=67079988, traps(total)=153705(153706), anget(retries)=20911670(0), amapcopy=1415
> 0766
> neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
> cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
> =17018267
>  daemon and nswget=0
> nswapdev=1
> swpages=145355, swpginuse=0, swpgonly=0 paging=0
>  kernel pointers:
> objs(kern)=0xffffff8000b10ed0
> ddb{1}>
> Current UVM status:
>  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>  988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
> ero)
>  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>  freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
>  faults=67079988, trap153705(153706), anget(retries)=20911670(0), amapcopy=1415
> 0766
> neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
> cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
> =17018267
>  daemon and swap couget=0
> nswapdev=1
> swpages=145355, swpginuse=0, swpgonly=0 paging=0
>  kernel pointers:
> objs(kern)=0xffffff8000b10ed0
> ddb{1}>
> Current UVM status:
>  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>  988985 VM pages: 110183 active, 132343 inactive, 0 wired, 438054 free (52331 z
> ero)
>  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>  freemin=32966, free-target=43954, inactive-target=89584, wired-max=329661
>  faults=67079988, trap(153706), anget(retries)=20911670(0), amapcopy=1415
> 0766
> neighbor anon/obj pg=1580327/34256987, gets(lock/unlock)=29150164/153708
> cases: anon=19467283, anoncow=1444387, obj=26419482, prcopy=2730679, przero
> =17018267
>  daemon and swap counts:
>
> ddb{1}> show bcstats
> Current Buffer Cache status:
> numbufs 24906 busymapped 135, delwri 35
> kvaslots 6553 avail kva slots 6418
> bufpages 197730, dmapages 197730, dirtypages 380
> pendingreads 2, pendingwrites 129
> highflips 0, highflops 0, dmaflips 0
>
>
> Regards.
>

--
Juan Francisco Cantero Hurtado http://juanfra.info

Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Thuban
* Juan Francisco Cantero Hurtado <[hidden email]> le [24-01-2019 00:09:17 +0100]:

> On Wed, Jan 23, 2019 at 07:47:16PM +0100, Thuban wrote:
> > * Juan Francisco Cantero Hurtado <[hidden email]> le [20-01-2019 23:39:45 +0100]:
> > > On Sun, Jan 20, 2019 at 07:24:44PM +0100, Karel Gardas wrote:
> > > >
> > > > Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.
> > >
> > > The panic is in softdep but he had also a crash without softdep.
> > >
> >
> > True.
> > Without softdep, I can't get access to the console (or didn't manage
> > to).
> >
> > I ran fsck -yf on the disk via on an amd64 machine as suggested.
> > (Though, crash happens on arm64).
>
> I suggested amd64 because the arch has more stable and widely tested
> drivers.
>

ok, it makes sense.

> > Everything seems ok.
> >
> > moria# fsck -fy /dev/sd2a
> > ** /dev/rsd2a
> > ** File system is already clean
> > ** Last Mounted on /vol/Samsung SSD 850-p1
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > 20863 files, 1492748 used, 37517771 free (739 frags, 4689629 blocks, 0.0% fragme
> > moria# fsck -fy /dev/sd2d
> > ** /dev/rsd2d
> > ** File system is already clean
> > ** Last Mounted on /vol/Samsung SSD 850-p2
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > 4 files, 4 used, 1034203 free (35 frags, 129271 blocks, 0.0% fragmentation)
> > moria# fsck -fy /dev/sd2e
> > ** /dev/rsd2e
> > ** File system is already clean
> > ** Last Mounted on /vol/Samsung SSD 850-p3
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > 26838 files, 2769562 used, 98993405 free (381 frags, 12374128 blocks, 0.0% fragmentation)
> >
> >
> > I had to crash the server at least 5 times to get an access to the
> > console. Else, all I was able to do was a hard reboot.
> >
> > Here the ddb output after a new crash.
> >
> > /var: got error 5 while accessing filesystem
> > panic: softdep_deallocate_dependencies: unrecovered I/O error
>
> "unrecovered I/O error" sometimes happens due to a bad sector. Use amd64
> to fill the SSD with zeroes using dd and use also nick's suggestion from
> this thread:
>
> http://openbsd-archive.7691.n7.nabble.com/ahci-error-during-install-of-6-4-td357865.html
>
> The dmesg will show if you have bad sectors or not.
>

I dis this (boy, that's very slow). At first it seems to solve the
problem, but I had a new crash this morrning when transferring big files.


>
>
> > Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
> > PU  COMMAND
> > 161990  94125      0     0x14000      0x200    0  zerothread
> > *365345   9753      0    c+0x150
> > panic() at brelse+0xc4
> > brelse() at sd_buf_done+0x124
>
> You're mounting the partitions from fstab, at the start of the init
> process. The reason why you only can boot with softdep is because
> softdep defers the write of the problematic blocks. sd_buf_done() is
> part of softdep and brelse() needs to write to the disk.
>

I can boot without softdep.
I can't see the kernel panic and ddb> command line on console at every
crash, which make gathering relevant information more difficult.


After the last crash, I tried to copy big files on another hard drive
(**not SSD**). I had the same issue.
I wonder if it's not the ehci who has a bug. In the NetBSD manpage [1], you
can read the following, which make me think of what happens whith my
disck under OpenBSD  :

> BUGS
> The support for hubs that are connected with high speed upstream and low or full speed downstream (i.e., for transaction translators) is limited.

Maybe it's just a coincidence.

I'm thinking about this because just before the crash, I can see there
messages (sometimes, but as I can't access the console as I should) :

        ehci_sync_hc: tsleep() = 35
        ehci_sync_hc: tsleep() = 35
        ehci_sync_hc: tsleep() = 35

Regards.

[1] https://man.openbsd.org/NetBSD-7.0.1/ehci.4


Reply | Threaded
Open this post in threaded view
|

Re: sshd high cpu : system freeze

Juan Francisco Cantero Hurtado
On Fri, Jan 25, 2019 at 06:37:04PM +0100, Thuban wrote:

> * Juan Francisco Cantero Hurtado <[hidden email]> le [24-01-2019 00:09:17 +0100]:
> > On Wed, Jan 23, 2019 at 07:47:16PM +0100, Thuban wrote:
> > > * Juan Francisco Cantero Hurtado <[hidden email]> le [20-01-2019 23:39:45 +0100]:
> > > > On Sun, Jan 20, 2019 at 07:24:44PM +0100, Karel Gardas wrote:
> > > > >
> > > > > Based on my experience, softdep may be very fragile on the fast system with slow drive. I guess what you see may be softdep own issue.
> > > >
> > > > The panic is in softdep but he had also a crash without softdep.
> > > >
> > >
> > > True.
> > > Without softdep, I can't get access to the console (or didn't manage
> > > to).
> > >
> > > I ran fsck -yf on the disk via on an amd64 machine as suggested.
> > > (Though, crash happens on arm64).
> >
> > I suggested amd64 because the arch has more stable and widely tested
> > drivers.
> >
>
> ok, it makes sense.
>
> > > Everything seems ok.
> > >
> > > moria# fsck -fy /dev/sd2a
> > > ** /dev/rsd2a
> > > ** File system is already clean
> > > ** Last Mounted on /vol/Samsung SSD 850-p1
> > > ** Phase 1 - Check Blocks and Sizes
> > > ** Phase 2 - Check Pathnames
> > > ** Phase 3 - Check Connectivity
> > > ** Phase 4 - Check Reference Counts
> > > ** Phase 5 - Check Cyl groups
> > > 20863 files, 1492748 used, 37517771 free (739 frags, 4689629 blocks, 0.0% fragme
> > > moria# fsck -fy /dev/sd2d
> > > ** /dev/rsd2d
> > > ** File system is already clean
> > > ** Last Mounted on /vol/Samsung SSD 850-p2
> > > ** Phase 1 - Check Blocks and Sizes
> > > ** Phase 2 - Check Pathnames
> > > ** Phase 3 - Check Connectivity
> > > ** Phase 4 - Check Reference Counts
> > > ** Phase 5 - Check Cyl groups
> > > 4 files, 4 used, 1034203 free (35 frags, 129271 blocks, 0.0% fragmentation)
> > > moria# fsck -fy /dev/sd2e
> > > ** /dev/rsd2e
> > > ** File system is already clean
> > > ** Last Mounted on /vol/Samsung SSD 850-p3
> > > ** Phase 1 - Check Blocks and Sizes
> > > ** Phase 2 - Check Pathnames
> > > ** Phase 3 - Check Connectivity
> > > ** Phase 4 - Check Reference Counts
> > > ** Phase 5 - Check Cyl groups
> > > 26838 files, 2769562 used, 98993405 free (381 frags, 12374128 blocks, 0.0% fragmentation)
> > >
> > >
> > > I had to crash the server at least 5 times to get an access to the
> > > console. Else, all I was able to do was a hard reboot.
> > >
> > > Here the ddb output after a new crash.
> > >
> > > /var: got error 5 while accessing filesystem
> > > panic: softdep_deallocate_dependencies: unrecovered I/O error
> >
> > "unrecovered I/O error" sometimes happens due to a bad sector. Use amd64
> > to fill the SSD with zeroes using dd and use also nick's suggestion from
> > this thread:
> >
> > http://openbsd-archive.7691.n7.nabble.com/ahci-error-during-install-of-6-4-td357865.html
> >
> > The dmesg will show if you have bad sectors or not.
> >
>
> I dis this (boy, that's very slow). At first it seems to solve the
> problem, but I had a new crash this morrning when transferring big files.
>
>
> >
> >
> > > Stopped at      panic+0x154:        TID    PID    UID     PRFLAGS     PFLAGS  C
> > > PU  COMMAND
> > > 161990  94125      0     0x14000      0x200    0  zerothread
> > > *365345   9753      0    c+0x150
> > > panic() at brelse+0xc4
> > > brelse() at sd_buf_done+0x124
> >
> > You're mounting the partitions from fstab, at the start of the init
> > process. The reason why you only can boot with softdep is because
> > softdep defers the write of the problematic blocks. sd_buf_done() is
> > part of softdep and brelse() needs to write to the disk.
> >
>
> I can boot without softdep.
> I can't see the kernel panic and ddb> command line on console at every
> crash, which make gathering relevant information more difficult.
>
>
> After the last crash, I tried to copy big files on another hard drive
> (**not SSD**). I had the same issue.
> I wonder if it's not the ehci who has a bug. In the NetBSD manpage [1], you
> can read the following, which make me think of what happens whith my
> disck under OpenBSD  :
>
> > BUGS
> > The support for hubs that are connected with high speed upstream and low or full speed downstream (i.e., for transaction translators) is limited.
>
> Maybe it's just a coincidence.
>
> I'm thinking about this because just before the crash, I can see there
> messages (sometimes, but as I can't access the console as I should) :
>
> ehci_sync_hc: tsleep() = 35
> ehci_sync_hc: tsleep() = 35
> ehci_sync_hc: tsleep() = 35

IIUC, you're connecting directly the USB to the board, so you're not
mixing different hubs with different speeds. That's the problem in the
NetBSD man page.

Maybe the problem is in the drivers, I can't help you with that.

If you want to test more things, I have a few suggestions:

- Boot the board without USB devices connected and keep it on for a few
  days. It's mostly to see if the board can run without problems.

- You could connect a wifi or ethernet USB adapter to the board and
  download a big file with a fast network, like "ftp -o/dev/null
  https://cdimage.debian.org/debian-cd/9.7.0/amd64/iso-dvd/debian-9.7.0-amd64-DVD-1.iso". If the problem is in the USB stack, you should see the same crash.

- Run "while true; do openssl speed; done" for a few hours to test the
  CPU with load.

- Check how much free RAM is available when you boot the system. Create
  a mfs mount with that size and run "while true; do dd if=/dev/random
  of=test.dd bs=1M count=1000; cat test.dd > /dev/null; done" (change
  the 1000 with the correct value) for a few hours. It should test if
  the RAM has problems or not.

- Check how much power require your USB drives and then apply the same
  load with an USB load tester ([1][2]). Sometimes the boards can't
  deliver enough power to the devices.

- Check if the system + USB drives have problems with Linux. It usually
  doesn't crash when there is a problem with USB, but the dmesg
  complains a lot.


1.
https://www.banggood.com/RUIDENG-LD25-Electronic-Load-Resistor-USB-Interface-Discharge-Battery-Test-LED-Display-Fan-p-1305445.html
2. https://www.youtube.com/watch?v=GH1HRG8QRCE


--
Juan Francisco Cantero Hurtado http://juanfra.info