RAIDframe parity errors and rebuild

RAIDframe parity errors and rebuild

David Wilk
Howdy all,

I've been testing a 3.8 system with RAIDframe and root on raid in a
RAID1 configuration.  Performance and stability are quite good, but
there's one thing that's a bit irksome and I wonder if I might not be
doing something right.

I've had a couple of crashes (potentially hardware-related), and every
time the RAID requires a parity rebuild.  That seems fine, but it
refuses to bring the array online during this time.  It takes several
hours to rebuild a 232GB RAID1 array!

Is this normal?  This seems like quite a bit of time to be down after
every improper shutdown of the system.

thanks for your thoughts.

Dave


Re: RAIDframe parity errors and rebuild

John Eisenschmidt
 ----- David Wilk ([hidden email]) wrote: -----

> Howdy all,
>
> I've been testing a 3.8 system with RAIDframe and root on raid in a
> RAID1 configuration.  Performance and stability are quite good, but
> there's one thing that's a bit irksome and I wonder if I might not be
> doing something right.
>
> I've had a couple crashes (potentially hardware related) and every
> time the RAID requires a parity rebuild.  That seems fine, but it
> refuses to bring the array on line during this time.  It takes several
> hours to rebuild a 232GB RAID1 array!

Is the raidframe driver causing the panic? Pedro sent out an email on
2/26 about testing a patch that's being included in 3.9-STABLE
(Subject: Re: raid(4) users, please test this). I had a problem with
my 3.2 raidframe mirrors causing the system to panic because a call
wasn't being made to VOP_UNLOCK() when VOP_ISLOCKED() was true. I put
the disks in my unpatched 3.8 box and got an immediate panic. Applied
the patch and they were fine.

> Is this normal?  this seems like quite a bit of time to be down with
> every improper shutdown of the system.

I've used OpenVMS volume shadowing, Solaris DiskSuite (circa Solaris
2.8), software RAID for Mac OS System 8, raidframe, etc., all for
software RAID 1, and all of them took a long time to check after a
crash. Essentially, when the system hard-crashes, it needs to
compare the parity information between both disks sector by sector to
ensure that the mirrors are in sync. From where I sit, the kernelized
raidframe driver stops the system from moving on to multiuser mode
until it has verified that both disks are in sync (the safest route to
take). Parity for my 100GB volume would take about 90 minutes to check
after a crash.

Where I've seen it take less time on other implementations is when the
system is pushed straight to multiuser mode and the check runs in the
background, which raidframe will do if you hit CTRL-C on the console
when it starts checking. You can't skip the fsck, but you can stop
the interactive parity check and it will run in the background.

> thanks for your thoughts.
>
> Dave

--
John W. Eisenschmidt ([hidden email])
  website: http://www.eisenschmidt.org/jweisen/
  my blog: http://thealphajohn.blogspot.com/
  house blog: http://4104-chestnut-street.blogspot.com/

Law of False Alerts: As the rate of erroneous alerts increases, operator
        reliance, or belief, in subsequent warnings decreases.


Re: RAIDframe parity errors and rebuild

David Wilk
On 3/17/06, John Eisenschmidt <[hidden email]> wrote:

>  ----- David Wilk ([hidden email]) wrote: -----
> > Howdy all,
> >
> > I've been testing a 3.8 system with RAIDframe and root on raid in a
> > RAID1 configuration.  Performance and stability are quite good, but
> > there's one thing that's a bit irksome and I wonder if I might not be
> > doing something right.
> >
> > I've had a couple crashes (potentially hardware related) and every
> > time the RAID requires a parity rebuild.  That seems fine, but it
> > refuses to bring the array on line during this time.  It takes several
> > hours to rebuild a 232GB RAID1 array!
>
> Is the raidframe driver causing the panic? Pedro sent out an email on
> 2/26 about testing a patch that's being included in 3.9 -STABLE
> (Subject: Re: raid(4) users, please test this). I had a problem with
> my 3.2 raidframe mirrors causing the system to panic because a call
> wasn't being made to VOP_UNLOCK() when VOP_ISLOCKED() was true. I put
> the disks in my unpatched 3.8 box and I got an immediate panic. Applied
>  the patch and they were fine.

Good question.  I don't think so, although it may have been the problem
once, when I filled up the RAID volume and it dropped into the kernel
debugger.  This last time looks like a problem with my ATA controller
(bus resets resulting in a kernel hang).  I've been tracking
3.8-STABLE, figuring that would be the safest route.  Is this patch
going to make it into that tree, or would one have to use CURRENT?  I
didn't think there was a 3.9-STABLE already...

>
> > Is this normal?  this seems like quite a bit of time to be down with
> > every improper shutdown of the system.
>
> I've used OpenVMS volume shadowing, Solaris DiskSuite (circa Solaris
> 2.8), software RAID for Mac OS System 8, raidframe, etc., all for
> software RAID 1, and all of them took a long time to check after a
> crash. Essentially, when the system hard-crashes, it needs to
> compare the parity information between both disks sector by sector to
> ensure that the mirrors are in sync. From where I sit, the kernelized
> raidframe driver stops the system from moving on to multiuser mode
> until it has verified that both disks are in sync (the safest route to
> take). Parity for my 100GB volume would take about 90 minutes to check
> after a crash.
Right, that makes total sense; however, I was assuming this would occur
in the background (like a hardware RAID solution).
>
> Where I've seen it take less time on other implementations is when it
> pushes the system straight to multiuser mode and checks in the
> background, which raidframe will do if you hit CTRL-C on the console
> when it starts checking. You can't pass by the fsck but you can stop
> the interactive parity check and it will run in the background.

Ah, so that's the key.  Awesome.  I'll give that a try.  Is there
any way to configure this behavior to be the default?  I'm imagining my
server being power-cycled when I'm not around and wanting it to come up
ASAP, especially since disk load would be very light.



Re: RAIDframe parity errors and rebuild

Antonios Anastasiadis
I had the same question, and just changed the relevant line in /etc/rc,
adding '&' at the end:

raidctl -P all &
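For context, here is a sketch of what that change looks like in place. The surrounding comment lines are paraphrased from memory of the 3.x-era rc script, not quoted verbatim:

```shell
# /etc/rc (OpenBSD 3.x) -- illustrative sketch, not the verbatim stock file.
# Stock behavior: check and rewrite parity in the foreground, blocking
# the boot until every RAIDframe array is consistent:
#   raidctl -P all
# Backgrounding the check lets the boot proceed to multiuser immediately,
# at the cost of the parity rewrite competing with fsck for disk I/O:
raidctl -P all &
```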


Re: RAIDframe parity errors and rebuild

Joachim Schipper
On Sat, Mar 18, 2006 at 12:59:30PM +0200, Antonios Anastasiadis wrote:
> I had the same question, and just changed the relevant line in /etc/rc
> adding '&' in the end:
>
> raidctl -P all &

Then again, why is this not the default? Are you certain this actually
works?

                Joachim


Re: RAIDframe parity errors and rebuild

Antonios Anastasiadis
It works just fine here; I've done some power-offs by hand and the
system recovers without problems. The only problem I can think of
concerns the timing of the fsck checks during the boot process. Perhaps
I have been lucky so far.
Then again, I might just be inexcusably blind to an obvious flaw that
will soon blast my data into oblivion :-)


Re: RAIDframe parity errors and rebuild

David Wilk
In reply to this post by Joachim Schipper
This was exactly my thought.  I was hoping someone would have some
'official' knowledge, or opinion.  I still can't get over having to
wait several hours for my root partition to become available after an
improper shutdown.

On 3/18/06, Joachim Schipper <[hidden email]> wrote:

> On Sat, Mar 18, 2006 at 12:59:30PM +0200, Antonios Anastasiadis wrote:
> > I had the same question, and just changed the relevant line in /etc/rc
> > adding '&' in the end:
> >
> > raidctl -P all &
>
> Then again, why is this not the default? Are you certain this actually
> works?
>
>                 Joachim


Re: RAIDframe parity errors and rebuild

Greg Oster
"David Wilk" writes:

> this was exactly my thought.  I was hoping someone would have some
> 'official' knowledge, or opinion.  I still can't get over having to
> wait several hours for my root partition to become available after an
> improper shutdown.
>
> On 3/18/06, Joachim Schipper <[hidden email]> wrote:
> > On Sat, Mar 18, 2006 at 12:59:30PM +0200, Antonios Anastasiadis wrote:
> > > I had the same question, and just changed the relevant line in /etc/rc
> > > adding '&' in the end:
> > >
> > > raidctl -P all &
> >
> > Then again, why is this not the default? Are you certain this actually
> > works?
> >
> >                 Joachim

If you want to be 100% paranoid, then you want to wait for
'raidctl -P all' to update all parity before starting even the fscks.
There *is* a non-zero chance that the parity might be out of sync
with the data, and should a component die before that parity has been
updated, you could end up reading bad data.  This can happen
even if the filesystem has been checked.  What are the odds of this
happening?  Pretty small.

If 'raidctl -P all &' is run, then the larger problem is that both fsck
and raidctl will be fighting for disk cycles -- i.e., the fsck will
take longer to complete.  On more "critical" systems, this is how I
typically have things set up (I'm willing to risk that I'm not
going to have a disk die during the minutes it takes to do the
fsck).

On less critical boxes, I've got a "sleep 3600" before the 'raidctl
-P', so that the parity check doesn't get in the way of the fsck or
the system coming up... about an hour after it comes up, the disks
are then checked...
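That deferred variant might look something like this in /etc/rc (a sketch of the approach described above; the hour-long delay is arbitrary and tunable):

```shell
# Deferred background parity check: let fsck and system startup finish
# undisturbed, then verify and rewrite parity on all RAIDframe devices
# roughly an hour after boot. The subshell keeps rc from waiting on it.
(sleep 3600 && raidctl -P all) &
```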

It's one of those "what are the odds" games... allowing the raidctl
to run in the background seems to have the right mix of paranoia and
practicality...

Later...

Greg Oster