AMD EPYC 7551 box panic: pr_find_pagehead

Janne Johansson-3
Recent AMD box with a bunch of nvme drives, never booted anything, crashes
with
panic: pr_find_pagehead: dma256: page header missing
after listing some sd(4) drives on 14-Jun snapshot installation.

That is the only odd output in the dmesg as far as I can see, picture
included.

6.7 release has the same issue.

6.6 installer works and I can boot the installed OS but I have no net on
the box (mcx 40/56/100GE card in it) yet.


--
May the most significant bit of your life be positive.

Attachment: obsd-panic.jpg (100K)

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Mark Kettenis
> From: Janne Johansson <[hidden email]>
> Date: Mon, 15 Jun 2020 19:15:36 +0200
> Content-Type: multipart/mixed; boundary="000000000000ea5a1505a8229334"
>
> Recent AMD box with a bunch of nvme drives, never booted anything, crashes
> with
> panic: pr_find_pagehead: dma256: page header missing
> after listing some sd(4) drives on 14-Jun snapshot installation.
>
> That is the only odd output in the dmesg as far as I can see, picture
> included.

That Micron_9300_MTFD disk showing up twice is suspicious.

Does the machine boot with nvme(4) disabled and/or that particular
NVMe disk removed?

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Janne Johansson-3
On Mon, 15 Jun 2020 at 19:42, Mark Kettenis <[hidden email]> wrote:

> > From: Janne Johansson <[hidden email]>
> > Date: Mon, 15 Jun 2020 19:15:36 +0200
> > Content-Type: multipart/mixed; boundary="000000000000ea5a1505a8229334"
> >
> > Recent AMD box with a bunch of nvme drives, never booted anything,
> > crashes
> > with
> > panic: pr_find_pagehead: dma256: page header missing
> > after listing some sd(4) drives on 14-Jun snapshot installation.
> >
> > That is the only odd output in the dmesg as far as I can see, picture
> > included.
>
> That Micron_9300_MTFD disk showing up twice is suspicious
>
It gets worse, sysctl hw.disknames lists some 96(!) drives


> Does the machine boot with nvme(4) disabled and/or that particular
> NVMe disk removed?
>

I currently only have remote-console access to it, but three of the
four nvmes all get tons of ghosts.


--
May the most significant bit of your life be positive.

Attachment: crazy-many-disks-obsd.jpg (76K)
Attachment: many-drives2-obsd.jpg (119K)
Attachment: many-drives-obsd.jpg (130K)

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Janne Johansson-3
On Mon, 15 Jun 2020 at 19:55, Janne Johansson <[hidden email]> wrote:

>
> I currently only have remote-console access to it, but three of the
> four nvmes all get tons of ghosts.
>
>
and I am sorry in advance for the pictures, but with no network configured
on the switches the box is connected to, all I can do is screenshot the
console window I used for installation (along with remote-cd-iso use for
install66/67.iso)

--
May the most significant bit of your life be positive.

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Kenneth Westerback
On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:

> Recent AMD box with a bunch of nvme drives, never booted anything, crashes
> with
> panic: pr_find_pagehead: dma256: page header missing
> after listing some sd(4) drives on 14-Jun snapshot installation.
>
> That is the only odd output in the dmesg as far as I can see, picture
> included.
>
> 6.7 release has the same issue.
>
> 6.6 installer works and I can boot the installed OS but I have no net on
> the box (mcx 40/56/100GE card in it) yet.
>
>
> --
> May the most significant bit of your life be positive.

The line

scsibus2 at nvme1: 33 targets, initiator 0

is also weird. I have never seen anything but 1 or 2 targets on nvme.

Running a kernel with SCSIDEBUG will produce more information on the
negotiation/discovery interactions.

Clarification of "bunch of nvme drives" and a complete dmesg would
also help.

.... Ken

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Janne Johansson-3
On Mon, 15 Jun 2020 at 19:59, Kenneth R Westerback <[hidden email]> wrote:

> On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:
> > Recent AMD box with a bunch of nvme drives, never booted anything,
> > crashes
> The line
>
> scsibus2 at nvme1: 33 targets, initiator 0
>
> is also weird. I have never seen anything but 1 or 2 targets on nvme.
>
>
My bad. It has four or five nvmes and there is some .. reflection going on.
I seem to recall similar things in the old SCSI-1 days if you had bad
termination; it has been a long time since I saw mirrored devices like this
on the bus.


> Running a kernel with SCSIDEBUG will produce more information on the
> negotiation/discovery interactions.
> Clarification of "bunch of nvme drives" and a complete dmesg would
> also help.
>

If it would help, I could screenshot one page at a time, that seems to be
the best I can do today.

--
May the most significant bit of your life be positive.

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Kenneth Westerback
On Mon, Jun 15, 2020 at 08:03:40PM +0200, Janne Johansson wrote:

> On Mon, 15 Jun 2020 at 19:59, Kenneth R Westerback <[hidden email]> wrote:
>
> > On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:
> > > Recent AMD box with a bunch of nvme drives, never booted anything,
> > > crashes
> > The line
> >
> > scsibus2 at nvme1: 33 targets, initiator 0
> >
> > is also weird. I have never seen anything but 1 or 2 targets on nvme.
> >
> >
> My bad. It has four or five nvmes and there is some .. reflection going on.
> I seem to recall similar things in the old SCSI-1 days if you had bad
> termination; it has been a long time since I saw mirrored devices like this
> on the bus.
>
>
> > Running a kernel with SCSIDEBUG will produce more information on the
> > negotiation/discovery interactions.
> > Clarification of "bunch of nvme drives" and a complete dmesg would
> > also help.
> >
>
> If it would help, I could screenshot one page at a time, that seems to be
> the best I can do today.

Works for me, though I don't recommend sending all those pics to
bugs@.

>
> --
> May the most significant bit of your life be positive.

The other random thing to try is to find the line

sc->sc_link.adapter_buswidth = sc->sc_nn + 1;

in /usr/src/sys/dev/ic/nvme.c

and replace the "sc->sc_nn + 1" with 1 or 2. Perhaps the nvme
controller is returning interesting values for the namespace count in
the identify message. I see there is already some weird code to deal
with Apple oddities. :-)
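
A minimal sketch of that experiment, written as a standalone C function
rather than as a patch to nvme.c. The helper name nvme_buswidth and the
max_ns cap are hypothetical; only the "sc_nn + 1" expression comes from the
line quoted above. It shows the bus width that results when the reported
namespace count is capped instead of trusted as-is:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of nvme.c: derive the SCSI bus width
 * from the namespace count the controller reports (sc_nn in the driver),
 * capping it at max_ns. A max_ns of 0 means "no cap", which reproduces
 * the existing "sc_nn + 1" behaviour.
 */
uint32_t
nvme_buswidth(uint32_t nn, uint32_t max_ns)
{
	if (max_ns != 0 && nn > max_ns)
		nn = max_ns;
	/* NVMe namespace IDs start at 1, hence the extra slot for target 0. */
	return nn + 1;
}
```

Assuming the "33 targets" line means the controller reported 32 namespaces,
nvme_buswidth(32, 0) gives 33, while nvme_buswidth(32, 1) gives 2, the same
as hard-coding 2 in place of "sc->sc_nn + 1".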

.... Ken

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Janne Johansson-3
On Mon, 15 Jun 2020 at 20:18, Kenneth R Westerback <[hidden email]> wrote:

> > If it would help, I could screenshot one page at a time, that seems to be
> > the best I can do today.
>
> Works for me, though I don't recommend sending all those pics to
> bugs@.
>
>
http://c66.it.su.se:8080/obsd/amd-dmesg-jpegs/index.html

Painstakingly screenshotted one at a time, but hopefully readable as a
series of images in a row.

--
May the most significant bit of your life be positive.

Re: AMD EPYC 7551 box panic: pr_find_pagehead

Kenneth Westerback
On Tue, Jun 16, 2020 at 10:34:44AM +0200, Janne Johansson wrote:

> On Mon, 15 Jun 2020 at 20:18, Kenneth R Westerback <[hidden email]> wrote:
>
> > > If it would help, I could screenshot one page at a time, that seems to be
> > > the best I can do today.
> >
> > Works for me, though I don't recommend sending all those pics to
> > bugs@.
> >
> >
> http://c66.it.su.se:8080/obsd/amd-dmesg-jpegs/index.html
>
> Painstakingly screenshotted one at a time, but hopefully readable as a
> series of images in a row.
>
> --
> May the most significant bit of your life be positive.

I see

nvme0: KXG60ZNV256G Toshiba, firmware AGGA4103, serial <blah>
scsibus2 at nvme0: 2 targets, initiator 0
sd0 at scsibus2 targ 1 lun 0: <NVMe, KXG60ZNV256G TOS, AGGA>
sd0: 244198MB, 512 bytes/sector, 500118192 sectors

nvme1: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial <blah>
scsibus3 at nvme1: 33 targets, initiator 0
sd1 at scsibus3 targ 1 lun 0: <NVMe, Micron_9300_MTFD, 1130>
sd1: 3662830MB, 512 bytes/sector, 7501476528 sectors

<sd2 -> sd32 at scsibus3 targ 2 -> 32 lun 0: <NVMe, Micron_9300_MTFD, 1130>>

nvme2: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial <blah>
scsibus4 at nvme2: 33 targets, initiator 0
sd33 at scsibus4 targ 1 lun 0: <NVMe, Micron_9300_MTFD, 1130>
sd33: 3662830MB, 512 bytes/sector, 7501476528 sectors

<sd34 -> sd64 at scsibus4 targ 2 -> 32 lun 0: <NVMe, Micron_9300_MTFD, 1130>>

nvme3: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial <blah>
scsibus5 at nvme3: 33 targets, initiator 0
sd65 at scsibus5 targ 1 lun 0: <NVMe, Micron_9300_MTFD, 1130>
sd65: 3662830MB, 512 bytes/sector, 7501476528 sectors

<sd66 -> sd96 at scsibus5 targ 2 -> 32 lun 0: <NVMe, Micron_9300_MTFD, 1130>>

nvme4: INTEL SSDPED1K375GA, firmware E2010435, serial <blah>
scsibus6 at nvme4: 2 targets, initiator 0
sd97 at scsibus6 targ 1 lun 0: <NVMe, INTEL SSDPED1K37, E201>
sd97: 357707MB, 512 bytes/sector, 732585168 sectors

So FIVE physical drives?

It looks to me like the Micron devices are reporting/configured with
33 namespaces. Each namespace is treated as a separate disk. I don't
know if the device is actually configured this way, or is reporting a
max number where other drives are reporting actual configured
namespaces.
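
As a quick cross-check of the listing above, this sketch adds up the sd(4)
units implied by the reported target counts. It assumes each bus attaches
targets 1 through N-1 as disks and leaves target 0 unused, which is an
inference from the "targ 1" lines and the "sc_nn + 1" bus width rather than
something confirmed here; the function name is made up for illustration:

```c
#include <stddef.h>

/*
 * Illustrative only: count the disks implied by a list of
 * "N targets" values, assuming target 0 is never a disk.
 */
int
count_sd_units(const int *targets, size_t nbus)
{
	int total = 0;
	size_t i;

	for (i = 0; i < nbus; i++) {
		/* A bus with N targets contributes N - 1 disks. */
		if (targets[i] > 1)
			total += targets[i] - 1;
	}
	return total;
}
```

For the five buses listed (2, 33, 33, 33 and 2 targets) this gives
1 + 32 + 32 + 32 + 1 = 98 units, consistent with the disks running from sd0
to sd97.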

I am *guessing* that all the namespaces other than the first one have
0 sectors allocated. But we print out the "sdNN at scsibusXX ..."
line before the size is determined via INQUIRY. Dunno if that is
avoidable.

A SCSIDEBUG kernel will print a lot of interesting information that
would shed more light.

.... Ken