dwxe: resetting interface on watchdog timeout

Sebastien Marie
Hi,

With a pine64, I am regularly experiencing dwxe watchdog
timeouts. Usually this is a sign that something doesn't work in the driver
itself.

The problem I am currently facing is that when a watchdog timeout occurs,
the interface becomes unusable. So I need another system permanently
connected to the serial console in order to log in and reboot the board
to get it working.

The following diff is still only a workaround for the underlying driver
problem. It tries to reset the interface when a watchdog timeout
occurs. But at least the board can come back in a more accessible
state.

When a watchdog timeout occurs, it will try to:
- bring the interface down (if it is running)
- reset it
- bring the interface back up (if it was brought down in the first step)

With it, I have a "stable" connection to the board over the network.

Comments or OK?
--
Sebastien Marie


Index: if_dwxe.c
===================================================================
RCS file: /cvs/src/sys/dev/fdt/if_dwxe.c,v
retrieving revision 1.11
diff -u -p -r1.11 if_dwxe.c
--- if_dwxe.c 3 Jan 2019 00:59:58 -0000 1.11
+++ if_dwxe.c 15 Apr 2019 10:21:39 -0000
@@ -687,7 +687,21 @@ dwxe_ioctl(struct ifnet *ifp, u_long cmd
 void
 dwxe_watchdog(struct ifnet *ifp)
 {
-	printf("%s\n", __func__);
+	struct dwxe_softc *sc = ifp->if_softc;
+	int down_up = 0;
+
+	printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
+	ifp->if_oerrors++;
+
+	if (ifp->if_flags & IFF_RUNNING) {
+		down_up = 1;
+		dwxe_down(sc);
+	}
+
+	dwxe_reset(sc);
+
+	if (down_up == 1)
+		dwxe_up(sc);
 }
 
 int

Re: dwxe: resetting interface on watchdog timeout

Mike Larkin
On Wed, Apr 17, 2019 at 09:44:43AM +0200, Sebastien Marie wrote:

> Hi,
>
> With a pine64, I am regularly experiencing dwxe watchdog
> timeouts. Usually this is a sign that something doesn't work in the driver
> itself.
>
> The problem I am currently facing is that when a watchdog timeout occurs,
> the interface becomes unusable. So I need another system permanently
> connected to the serial console in order to log in and reboot the board
> to get it working.
>
> The following diff is still only a workaround for the underlying driver
> problem. It tries to reset the interface when a watchdog timeout
> occurs. But at least the board can come back in a more accessible
> state.
>
> When a watchdog timeout occurs, it will try to:
> - bring the interface down (if it is running)
> - reset it
> - bring the interface back up (if it was brought down in the first step)
>
> With it, I have a "stable" connection to the board over the network.
>
> Comments or OK?
> --
> Sebastien Marie
>
>

Just to add here, in my TESTS for 6.5, all of my 20 or so PINE64s have
had a really tough time with dwxe(4). I have had to put all of them into
10baseT mode. Previously, they all had "media 100baseTX" in their
/etc/hostname.dwxe0 (and these are supposedly 1Gb devices), so even in
the past it has been really flaky. If this helps improve things, I'm all
for it, but you should probably get oks from someone who knows the
driver better.

-ml

Re: dwxe: resetting interface on watchdog timeout

jungle Boogie
On Wed 17 Apr 2019  9:44 AM, Sebastien Marie wrote:
>Hi,
>
>With a pine64, I am regularly experiencing dwxe watchdog
>timeouts. Usually this is a sign that something doesn't work in the driver
>itself.

Good to know this isn't just affecting my three devices.
Let's hope this patch gets some feedback and makes its way into the build.

Re: dwxe: resetting interface on watchdog timeout

Sebastien Marie
On Wed, Apr 17, 2019 at 04:32:04PM -0700, Jungle Boogie wrote:
> On Wed 17 Apr 2019  9:44 AM, Sebastien Marie wrote:
> > Hi,
> >
> > With a pine64, I am regularly experiencing dwxe watchdog
> > timeouts. Usually this is a sign that something doesn't work in the driver
> > itself.
>
> Good to know this isn't just affecting my three devices.
> Let's hope this patch gets some feedback and makes its way into the build.

You could build a kernel with it and test it to confirm that it works as expected.

It would really help to have feedback from users.

thanks.
--
Sebastien Marie

Re: dwxe: resetting interface on watchdog timeout

Theo de Raadt
Sebastien Marie <[hidden email]> wrote:

> On Wed, Apr 17, 2019 at 04:32:04PM -0700, Jungle Boogie wrote:
> > On Wed 17 Apr 2019  9:44 AM, Sebastien Marie wrote:
> > > Hi,
> > >
> > > With a pine64, I am regularly experiencing dwxe watchdog
> > > timeouts. Usually this is a sign that something doesn't work in the driver
> > > itself.
> >
> > Good to know this isn't just affecting my three devices.
> > Let's hope this patch gets some feedback and makes its way into the build.
>
> You could build a kernel with it and test it to confirm that it works as expected.
>
> It would really help to have feedback from users.

Resetting the chipset on a timer is a workaround.

It means the root cause hasn't been found and fixed.

It is not a fix.

Re: dwxe: resetting interface on watchdog timeout

Mark Kettenis
> From: "Theo de Raadt" <[hidden email]>
> Date: Thu, 18 Apr 2019 01:04:02 -0600
>
> Sebastien Marie <[hidden email]> wrote:
>
> > On Wed, Apr 17, 2019 at 04:32:04PM -0700, Jungle Boogie wrote:
> > > On Wed 17 Apr 2019  9:44 AM, Sebastien Marie wrote:
> > > > Hi,
> > > >
> > > > With a pine64, I am regularly experiencing dwxe watchdog
> > > > timeouts. Usually this is a sign that something doesn't work in
> > > > the driver itself.
> > >
> > > Good to know this isn't just affecting my three devices.  Let's
> > > hope this patch gets some feedback and makes its way into the
> > > build.
> >
> > You could build a kernel with it and test it to confirm that it works as expected.
> >
> > It would really help to have feedback from users.
>
> Resetting the chipset on a timer is a workaround.
>
> It means the root cause hasn't been found and fixed.
>
> It is not a fix.

Agreed.

There are basically two reasons why the watchdog might trigger:

1. The PHY doesn't have a link while the interface is up, and therefore
   the MAC isn't transmitting any packets.

2. The MAC has gotten into a funny state and isn't transmitting any
   packets anymore.

The first might simply happen because the cable isn't connected.  But
at least on some boards we have evidence that there is a problem with
the PHY.  This is fairly easy to diagnose.  If ifconfig(8) reports the
link being down or reports an autonegotiated speed that doesn't make
sense, the PHY needs to be looked at.  Note that the link may be
flapping, so run ifconfig(8) multiple times.

If the PHY is flaky, there are many possible causes:

a. A bug in the PHY driver.

b. A misconfigured clock.

c. Insufficient power.

d. Firmware issues.

It is the last thing that makes this issue hard to debug.  Our drivers
do depend on some basic initialization of the hardware and an
appropriate description of the hardware.  Certain combinations of
U-Boot and device tree may not work.  And there are a lot of hacked up
firmwares out there that may only work with a specific version of the
Linux kernel because of additional hacks there.  For OpenBSD we really
only support the official U-Boot and device trees from mainline Linux.
We provide ports for these.  Those ports are regularly updated though,
and it is impossible for us to check all supported boards when we do
so.  Since quality control upstream isn't great either, this means
that sometimes things get broken.

Cheers,

Mark