RAIDframe question


RAIDframe question

Peter Matulis
I am running 3.8-stable with RAIDframe RAID-1 and two IDE disks (wd0
and wd1).  Everything seems to work (parity is good), but when I
boot up I get two messages that worry me:

raid0: Device already configured!
"ioctl (RAIDFRAME_CONFIGURE) failed"

Can anyone lend a hand in this important matter?

There are some more seemingly good RAID-related messages after that,
but the machine is now at a remote location and I do not see those
messages in the output of dmesg (the first line above is the last
line of dmesg).

The tail end of dmesg output is:

Kernelized RAIDframe activated
cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
    SENSE KEY: Not Ready
     ASC/ASCQ: Medium Not Present
raid0 (root): (RAID Level 1) total number of sectors is 48234752 (23552 MB) as root
dkcsum: wd0 matches BIOS drive 0x80
dkcsum: wd1 matches BIOS drive 0x81
rootdev=0x1300 rrootdev=0x3600 rawdev=0x3602
raid0: Device already configured!

--
Peter


Re: RAIDframe question

Jurjen Oskam
On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:

> raid0: Device already configured!
> "ioctl (RAIDFRAME_CONFIGURE) failed"
>
> Can anyone lend a hand in this important matter?

Let me guess (since you didn't post any configuration): you
enabled RAID-autoconfiguration by the kernel *and* you
configure the same RAID-device during the boot sequence using
raidctl?

--
Jurjen Oskam


Re: RAIDframe question

Joachim Schipper
In reply to this post by Peter Matulis
On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:

> I am running 3.8-stable with RAIDframe RAID-1 and two IDE disks (wd0
> and wd1).  Everything seems to work (parity is good), but when I
> boot up I get two messages that worry me:
>
> raid0: Device already configured!
> "ioctl (RAIDFRAME_CONFIGURE) failed"
>
> Can anyone lend a hand in this important matter?
>
> There are some more seemingly good RAID-related messages after that,
> but the machine is now at a remote location and I do not see those
> messages in the output of dmesg (the first line above is the last
> line of dmesg).
>
> The tail end of dmesg output is:
>
> Kernelized RAIDframe activated
> cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
>     SENSE KEY: Not Ready
>      ASC/ASCQ: Medium Not Present
> raid0 (root): (RAID Level 1) total number of sectors is 48234752 (23552 MB) as root
> dkcsum: wd0 matches BIOS drive 0x80
> dkcsum: wd1 matches BIOS drive 0x81
> rootdev=0x1300 rrootdev=0x3600 rawdev=0x3602
> raid0: Device already configured!

Is there a raidctl invocation somewhere in rc, rc.local, or somesuch,
other than the part of rc that reads /etc/raid[0-3].conf and checks
parity?

Otherwise, check the contents of /etc/raid[0-3].conf...
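
For reference, a RAID-1 /etc/raid0.conf usually looks something like
this (a sketch only; the wd0d/wd1d partitions are assumptions, use
whatever components your set was actually built on):

    START array
    # numRow numCol numSpare
    1 2 0

    START disks
    /dev/wd0d
    /dev/wd1d

    START layout
    # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
    128 1 1 1

    START queue
    fifo 100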

                Joachim


Re: RAIDframe question

Håkan Olsson
In reply to this post by Jurjen Oskam
On 1 feb 2006, at 08.38, Jurjen Oskam wrote:

> On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:
>
>> raid0: Device already configured!
>> "ioctl (RAIDFRAME_CONFIGURE) failed"
>>
>> Can anyone lend a hand in this important matter?
>
> Let me guess (since you didn't post any configuration): you
> enabled RAID-autoconfiguration by the kernel *and* you
> configure the same RAID-device during the boot sequence using
> raidctl?

/etc/rc includes commands to configure the RAID devices, and if
they've been set up to use autoconfiguration then this is indeed what
happens. Expected and nothing to worry about, although noisy. For my
RAIDframe devices, I just removed the autoconfigure flag.

/H


Re: RAIDframe question

Peter Matulis
--- Håkan Olsson <[hidden email]> wrote:

> On 1 feb 2006, at 08.38, Jurjen Oskam wrote:
>
> > On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:
> >
> >> raid0: Device already configured!
> >> "ioctl (RAIDFRAME_CONFIGURE) failed"
> >>
> >> Can anyone lend a hand in this important matter?
> >
> > Let me guess (since you didn't post any configuration): you
> > enabled RAID-autoconfiguration by the kernel *and* you
> > configure the same RAID-device during the boot sequence using
> > raidctl?
>
> /etc/rc includes commands to configure the RAID devices, and if
> they've been set up to use autoconfiguration then this is indeed what
> happens. Expected and nothing to worry about, although noisy. For my
> RAIDframe devices, I just removed the autoconfigure flag.

Oh, that's a relief.  Yes, now I see the raid commands in /etc/rc.  So I
should leave everything as is?

Side question:
I tried unsuccessfully using the same procedure to set up two disks (sd0
and sd1) attached to a QLogic FibreChannel controller (isp driver).  I
probably don't have the correct terminology but upon startup the boot code
could not be found (would not get beyond the point where the kernel
usually kicks in).  I'm wondering whether RAIDframe has limitations with
this hardware.


Re: RAIDframe question

Joachim Schipper
On Wed, Feb 01, 2006 at 08:45:42AM -0500, Peter wrote:

> --- Håkan Olsson <[hidden email]> wrote:
>
> > On 1 feb 2006, at 08.38, Jurjen Oskam wrote:
> >
> > > On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:
> > >
> > >> raid0: Device already configured!
> > >> "ioctl (RAIDFRAME_CONFIGURE) failed"
> > >>
> > >> Can anyone lend a hand in this important matter?
> > >
> > > Let me guess (since you didn't post any configuration): you
> > > enabled RAID-autoconfiguration by the kernel *and* you
> > > configure the same RAID-device during the boot sequence using
> > > raidctl?
> >
> > /etc/rc includes commands to configure the RAID devices, and if
> > they've been set up to use autoconfiguration then this is indeed what
> > happens. Expected and nothing to worry about, although noisy. For my
> > RAIDframe devices, I just removed the autoconfigure flag.
>
> Oh, that's a relief.  Yes, now I see the raid commands in /etc/rc.  So I
> should leave everything as is?

You should probably disable either /etc/raid0.conf or the autodetection.
The former is most easily achieved by mv
/etc/raid0.conf{,.autodetected}; the latter is achieved by either not
compiling with the option RAID_AUTOCONFIG, or running raidctl -A no
raid0.
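
That is, something like one of these (a sketch; device and file names
taken from your mail):

    # keep kernel autoconfiguration; stop rc from configuring it again
    mv /etc/raid0.conf /etc/raid0.conf.autodetected

    # or keep the rc configuration and clear the autoconfigure flag
    raidctl -A no raid0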

> Side question:
> I tried unsuccessfully using the same procedure to set up two disks (sd0
> and sd1) attached to a QLogic FibreChannel controller (isp driver).  I
> probably don't have the correct terminology but upon startup the boot code
> could not be found (would not get beyond the point where the kernel
> usually kicks in).  I'm wondering whether RAIDframe has limitations with
> this hardware.

I don't know anything about this card, but the isp(4) man page seems to
suggest that adding ISP_COMPILE_FW to the kernel configuration may be
helpful.
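
That is, something like this line in the kernel config (my reading of
the man page only, untested here):

    option ISP_COMPILE_FW    # compile the QLogic firmware into the kernel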

This should not be relevant to RAIDframe operation, though - does it
work without RAIDframe?

Please note that RAIDframe is software RAID - hardware RAID is handled
differently.

                Joachim


Re: RAIDframe question

Greg Oster
In reply to this post by Håkan Olsson
Håkan Olsson writes:

> On 1 feb 2006, at 08.38, Jurjen Oskam wrote:
>
> > On Wed, Feb 01, 2006 at 01:19:58AM -0500, Peter wrote:
> >
> >> raid0: Device already configured!
> >> "ioctl (RAIDFRAME_CONFIGURE) failed"
> >>
> >> Can anyone lend a hand in this important matter?
> >
> > Let me guess (since you didn't post any configuration): you
> > enabled RAID-autoconfiguration by the kernel *and* you
> > configure the same RAID-device during the boot sequence using
> > raidctl?
>
> /etc/rc includes commands to configure the RAID devices, and if
> they've been set up to use autoconfiguration then this is indeed what
> happens. Expected and nothing to worry about, although noisy.

"What he said."

> For my
> RAIDframe devices, I just removed the autoconfigure flag.

Please use the autoconfigure flag.  It is *far* better at gluing
together a RAID set than the regular configuration bits, especially
in the face of drives that move about or drives that fail to spin
up... (the old config code needs to find its way into a bit-bucket..)

You really want to use the autoconfigure bits.. :)  Really. :)
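
Concretely, that's something like (a sketch, assuming the set is raid0
and holds the root filesystem):

    # enable autoconfiguration and mark the set as eligible root device
    raidctl -A root raid0

(raidctl -A yes raid0 does the same without the root flag.)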

Later...

Greg Oster


Re: RAIDframe question

Nick Bender
In reply to this post by Peter Matulis
> Side question:
> I tried unsuccessfully using the same procedure to set up two disks (sd0
> and sd1) attached to a QLogic FibreChannel controller (isp driver).  I
> probably don't have the correct terminology but upon startup the boot code
> could not be found (would not get beyond the point where the kernel
> usually kicks in).  I'm wondering whether RAIDframe has limitations with
> this hardware.

man installboot ?
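
On i386 that amounts to something like this (a sketch from memory; the
disk names are assumptions, and you want the boot blocks on each member
disk so either one can boot):

    cd /usr/mdec
    ./installboot -v /boot biosboot sd0
    ./installboot -v /boot biosboot sd1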

-N


Re: RAIDframe question

Peter Fraser
In reply to this post by Peter Matulis
I had a disk drive fail while running RAIDframe.
The system did not survive the failure. Even worse,
there was data loss.

The system was to be my new web server. The system
had 1 GB of memory.  I was working, slowly, on
configuring apache and web pages. Moving to
a chroot'ed environment was non-trivial.

The disk drive died, the system crashed, and the
system rebooted and came up. Removing the
dead disk, replacing it with a new disk,
and reestablishing the RAID was no problem.

But why was there a crash? I would have thought
that the system should keep running after a disk failure.
And even more to my surprise, about two days
of my work disappeared.

I believe the disk drive died about two days before
the crash. I also believe that RAIDframe did
not handle the disk drive's failure correctly,
and as a result all file writes to the failed
drive queued up in memory; when memory ran
out, the system crashed.

I don't know enough about OpenBSD internals to
know if my guess as to what happened is correct,
but it did worry me about the reliability of
RAIDframe.

I am now trying ccd for my web pages and
ALTROOT in daily(8) for root. I have not had a disk
fail with ccd yet, so I have not determined whether
ccd works better.

Neither RAIDframe nor ccd seems to be up to the
quality of nearly all the other software
in OpenBSD. This statement is also true of
the documentation.


Re: RAIDframe question

Greg Oster
In reply to this post by Peter Matulis
Peter writes:
>
> I tried unsuccessfully using the same procedure to set up two disks (sd0
> and sd1) attached to a QLogic FibreChannel controller (isp driver).  I
> probably don't have the correct terminology but upon startup the boot code
> could not be found (would not get beyond the point where the kernel
> usually kicks in).  I'm wondering whether RAIDframe has limitations with
> this hardware.

RAIDframe doesn't care about underlying hardware.  It's run on top of
a) probably every flavour of SCSI, b) various levels of IDE/pciide,
c) FibreChannel, d) ancient things like HP-IB, and e) other RAIDframe
devices.  If the underlying device can provide something that looks/
smells like a disk partition, that's good enough for RAIDframe.

Later...

Greg Oster


Re: RAIDframe question

Joachim Schipper
In reply to this post by Peter Fraser
On Wed, Feb 01, 2006 at 11:02:22AM -0500, Peter Fraser wrote:

> I had a disk drive fail while running RAIDframe.
> The system did not survive the failure. Even worse,
> there was data loss.
>
> The system was to be my new web server. The system
> had 1 GB of memory.  I was working, slowly, on
> configuring apache and web pages. Moving to
> a chroot'ed environment was non-trivial.
>
> The disk drive died, the system crashed, and the
> system rebooted and came up. Removing the
> dead disk, replacing it with a new disk,
> and reestablishing the RAID was no problem.
>
> But why was there a crash? I would have thought
> that the system should keep running after a disk failure.
> And even more to my surprise, about two days
> of my work disappeared.
>
> I believe the disk drive died about two days before
> the crash. I also believe that RAIDframe did
> not handle the disk drive's failure correctly,
> and as a result all file writes to the failed
> drive queued up in memory; when memory ran
> out, the system crashed.
>
> I don't know enough about OpenBSD internals to
> know if my guess as to what happened is correct,
> but it did worry me about the reliability of
> RAIDframe.
>
> I am now trying ccd for my web pages and
> ALTROOT in daily(8) for root. I have not had a disk
> fail with ccd yet, so I have not determined whether
> ccd works better.
>
> Neither RAIDframe nor ccd seems to be up to the
> quality of nearly all the other software
> in OpenBSD. This statement is also true of
> the documentation.

Crashing IDE drives are likely to confuse the IDE bus; there is little
that OpenBSD can do about this.

There was a thread on RAIDframe vs ccd on misc@ about a month ago;
search the archives. Basically, ccd isn't good for data safety.

I don't think your guess is correct, though - RAIDframe is rather likely
to crash when poked at (with incorrect configurations &c), but otherwise
quite stable.

                Joachim


Re: RAIDframe question

Greg Oster
In reply to this post by Peter Fraser
"Peter Fraser" writes:
> I had a disk drive fail while running RAIDframe.
> The system did not survive the failure. Even worse,
> there was data loss.

Ow.  

> The system was to be my new web server. The system
> had 1 Gig of memory.  I was working, slowly, on
> configuring apache and web pages. Moving to
> a chroot'ed environment was non-trivial.
>
> The disk drive died, the system crashed,

Oh.... so it *wasn't* just a simple case of a drive dying, but the
system crashed too...  Well, RAIDframe can't make any guarantees when
there's a system crash -- if buffers haven't been flushed or there's
still pending meta-data to be written, there's not much RAIDframe can
do about that... "those are filesystem issues".

> and the
> system rebooted and came up. Removing the
> dead disk, replacing it with a new disk,
> and reestablishing the RAID was no problem.
>
> But why was there a crash? I would have thought
> that the system should keep running after a disk failure.

You haven't said what types of disks.  I've had IDE disks fail that
take down the entire system.  I've had IDE disks fail but the system
remains up and happy.  I've had SCSI disks fail that have made the
SCSI cards *very* unhappy (and had the system die shortly after).  
None of these things can be solved by RAIDframe -- if the underlying
device drivers can't "deal" in the face of lossage, RAIDframe can't
do anything about that...

You also haven't given any indication as to the nature of the crash,
or what the panic message was (if any).  (e.g. was it a null-pointer
dereference, or a corrupted filesystem or something that went wrong
in the network stack?)

> And even more to my surprise, about two days
> of my work disappeared.

Of course, you just went to your backups to get that back, right? :)

> I believe the disk drive died about two days before
> the crash. I also believe that RAIDframe did
> not handle the disk drive's failure correctly

Do you have a dmesg related to the drive failure?  e.g. something
that shows RAIDframe complaining that something was wrong, and
marking the drive as failed?  

> and as a result all file writes to the failed
> drive queued up in memory,

I've never seen that behaviour...  I find it hard to believe that
you'd be able to queue up 2 days worth of writes without a) any reads
being done or b) not noticing that the filesystem was completely
unresponsive when a write of associated meta-data never returned...  
(on the first write of meta-data that didn't return, pretty much all
IO to that filesystem should grind to a halt.  Sorry.. I'm not buying
the "it queued up things for two days"... )

> when memory ran out the system crashed.
>
> I don't know enough about OpenBSD internals to
> know if my guess as to what happened is correct,
> but it did worry me about the reliability of
> RAIDframe.

I've been running RAIDframe (albeit not w/ OpenBSD) in both
production and non-production environments now for 7+ years...  
RAIDframe reliability is the least of my worries :)
(RAIDframe has also saved mine and others' data on various occasions
over the years...)
 
> I am now trying ccd for my web pages and
> ALTROOT in daily(8) for root. I have not had a disk
> fail with ccd yet, so I have not determined whether
> ccd works better.

"Good luck."  (see a different thread for my thoughts on using ccd :)

 
> Neither RAIDframe nor ccd seems to be up to the
> quality of nearly all the other software
> in OpenBSD. This statement is also true of the documentation.

My only comment on that is that the version of RAIDframe in OpenBSD
is somewhat dated.  You are also encouraged to find and read the
latest versions of the documentation, and to provide feedback
to the author on what you feel is lacking.

Later...

Greg Oster


Re: RAIDframe question

Peter Fraser
In reply to this post by Peter Matulis
> You haven't said what types of disks.  I've had IDE disks fail that
> take down the entire system.  I've had IDE disks fail but the system
> remains up and happy.  I've had SCSI disks fail that have made the
> SCSI cards *very* unhappy (and had the system die shortly after).  
> None of these things can be solved by RAIDframe -- if the underlying
> device drivers can't "deal" in the face of lossage, RAIDframe can't
> do anything about that...

The system is a SuperServer 5013C-T with two hot-swappable
SATA drives.

> You also haven't given any indication as to the nature of the crash,
> or what the panic message was (if any).  (e.g. was it a null-pointer
> dereference, or a corrupted filesystem or something that went wrong
> in the network stack?)

I came in in the morning and got no response
from the system.  I eventually had to hit the init button.

I did not lose a lot of work, only an hour or
two.  The old web server still worked, so creating a new one
was not high priority.

> I've never seen that behaviour...  I find it hard to believe that
> you'd be able to queue up 2 days worth of writes without a) any reads
> being done or b) not noticing that the filesystem was completely
> unresponsive when a write of associated meta-data never returned...  
> (on the first write of meta-data that didn't return, pretty much all
> IO to that filesystem should grind to a halt.  Sorry.. I'm not buying
> the "it queued up things for two days"... )
 
The system was almost completely idle.  The only changes I
made were edits to a few small files (httpd.conf and friends);
I am sure there was less than a megabyte of changes.

I also assumed that the writes went to the one working disk,
and that a failed recovery from the disk failure caused the problem.


Re: RAIDframe question

Andy Hayward
In reply to this post by Peter Fraser
On 2/1/06, Peter Fraser <[hidden email]> wrote:

> But why was there a crash? I would have thought
> that the system should keep running after a disk failure.
> And even more to my surprise, about two days
> of my work disappeared.
>
> I believe the disk drive died about two days before
> the crash. I also believe that RAIDframe did
> not handle the disk drive's failure correctly,
> and as a result all file writes to the failed
> drive queued up in memory; when memory ran
> out, the system crashed.

What mount options are/were you using?
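
For instance, whether softdep was in play, as in an fstab line like
this (the device and mount point are placeholders, not from your setup):

    /dev/raid0a /var/www ffs rw,softdep 1 2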

-- ach


Re: RAIDframe question

Andy Hayward
In reply to this post by Greg Oster
On 2/1/06, Greg Oster <[hidden email]> wrote:

> "Peter Fraser" writes:
> > and as a result all file writes to the failed
> > drive queued up in memory,
>
> I've never seen that behaviour...  I find it hard to believe that
> you'd be able to queue up 2 days worth of writes without a) any reads
> being done or b) not noticing that the filesystem was completely
> unresponsive when a write of associated meta-data never returned...
> (on the first write of meta-data that didn't return, pretty much all
> IO to that filesystem should grind to a halt.  Sorry.. I'm not buying
> the "it queued up things for two days"... )

I've seen similar on a machine with a filesystem on a RAID-1 partition
and mounted with softdeps enabled. From what I remember the scenario
was something like:

* copied 10 GB or so of data to the new RAID-1 filesystem
* system then left idle for 30 minutes or so
* being an idiot, pulled the wrong plug out of the wall
* upon reboot, and after RAID resync and fsck, most of the copied data
was no longer there

-- ach


Re: RAIDframe question

Greg Oster
Andy Hayward writes:

> On 2/1/06, Greg Oster <[hidden email]> wrote:
> > "Peter Fraser" writes:
> > > and as a result all file writes to the failed
> > > drive queued up in memory,
> >
> > I've never seen that behaviour...  I find it hard to believe that
> > you'd be able to queue up 2 days worth of writes without a) any reads
> > being done or b) not noticing that the filesystem was completely
> > unresponsive when a write of associated meta-data never returned...
> > (on the first write of meta-data that didn't return, pretty much all
> > IO to that filesystem should grind to a halt.  Sorry.. I'm not buying
> > the "it queued up things for two days"... )
>
> I've seen similar on a machine with a filesystem on a RAID-1 partition
> and mounted with softdeps enabled. From what I remember the scenario
> was something like:
>
> * copied 10 GB or so of data to the new RAID-1 filesystem
> * system then left idle for 30 minutes or so
> * being an idiot, pulled the wrong plug out of the wall
> * upon reboot, and after RAID resync and fsck, most of the copied data
> was no longer there

RAIDframe can only write what it's given.  If, after 30 minutes,
the filesystem layers haven't synced all the data, RAIDframe can't
do anything about that...  if left idle for 30 minutes, that
filesystem should have synced itself many times over, to the point
that fsck shouldn't have found anything to complain about...

(I strongly suspect you'd see exactly the same behaviour without
RAIDframe involved here...  I also suspect you wouldn't see the same
behaviour without softdeps, RAIDframe or not.)

Later...

Greg Oster


Re: RAIDframe question

Nick Holland
In reply to this post by Greg Oster
Greg Oster wrote:
> "Peter Fraser" writes:
>> I had a disk drive fail while running RAIDframe.
>> The system did not survive the failure. Even worse,
>> there was data loss.
>
> Ow.  

Welcome to the REALITY of RAID.

If you rely on RAID to always work, and never go down, you Just Don't
Understand.

...
> You haven't said what types of disks.  I've had IDE disks fail that
> take down the entire system.  I've had IDE disks fail but the system
> remains up and happy.  I've had SCSI disks fail that have made the
> SCSI cards *very* unhappy (and had the system die shortly after).  
> None of these things can be solved by RAIDframe -- if the underlying
> device drivers can't "deal" in the face of lossage, RAIDframe can't
> do anything about that...

Doesn't matter about drive type, doesn't really matter about device
drivers, there are PLENTY of things that CAN and WILL cause every drive
on the same channel with the failed drive to go down.  There are even
plenty of things that can fail on the drive which will jump across
channels (imagine a nice little despiking cap shorting out, slamming
your 5V line to ground for a moment until it turns into a puff of smoke;
yes, I've seen this).  RAID can help you get back up faster, but it
can't keep you from ever going down.
...

>> And even more to my surprise, about two days
>> of my work disappeared.
>
> Of course, you just went to your backups to get that back, right? :)
>
>> I believe, the disk drive died about 2 days before
>> the crash. I also believe that RAIDframe did
>> not handle the disk drive's failure correctly
>
> Do you have a dmesg related to the drive failure?  e.g. something
> that shows RAIDframe complaining that something was wrong, and
> marking the drive as failed?  
>
>> and as a result all file writes to the failed
>> drive queued up in memory,
>
> I've never seen that behaviour...  I find it hard to believe that
> you'd be able to queue up 2 days worth of writes without a) any reads
> being done or b) not noticing that the filesystem was completely
> unresponsive when a write of associated meta-data never returned...  
> (on the first write of meta-data that didn't return, pretty much all
> IO to that filesystem should grind to a halt.  Sorry.. I'm not buying
> the "it queued up things for two days"... )

Agreed.
HOWEVER...
I have seen (and heard reports of) OpenBSD firewalls run for LONG
periods of time with a failed hard disk.  Don't ask me how... the thing
kept putting scary messages on the screen (so obviously it was
using..er..trying to use...the bad spots), and kept filtering packets
until the power went out, and of course, it didn't come back up.  Could
lead someone to think the disk wasn't that important. :)
...
>> I am now trying ccd for my web pages and
>> ALTROOT in daily(8) for root. I have not had a disk
>> fail with ccd yet, so I have not determined whether
>> ccd works better.
>
> "Good luck."  (see a different thread for my thoughts on using ccd :)

more than that...he doesn't understand the nature of RAID.

If hardware breaks, don't expect everything else to keep working.  Hope,
sure.  Expect?  No.  I don't care if you are talking about ccd,
RAIDframe, or hardware RAID.  Your machine can still go down due to a
disk failure.  People who don't believe me have just been lucky.  So far.

Further, if you wait until a disk fails to find out how things work, you
are a fool.  Worst down-time disasters I've seen involved RAID systems
where people expected magic to happen when something went wrong.

>> Neither RAIDframe nor ccd seems to be up to the
>> quality of nearly all the other software
>> in OpenBSD. This statement is also true of the documentation.

well, one of the people that writes documentation (me) can't mention
RAID without yelling at people who think it will haul their butts out of
the fire under all circumstances.  That gets a little off-topic for
official documentation, so the editing and reediting process is pretty
painful. :)

Yes, some drivers might be more tolerant of a disk failure than others,
however disk failure is something that's almost impossible to
simulate...and disk failures rarely happen on cue [insert nailgun
comment here], so testing, debugging and improving hardware failure
handling is not a very easy task...and you go through a lot of hard
disks.  Non-destructive testing is just "the best the real world can
do"; it doesn't accurately simulate most types of hard disk failures.

Nick.


Re: RAIDframe question

Diana Eichert
On Wed, 1 Feb 2006, Nick Holland wrote:
SNIP
> Welcome to the REALITY of RAID.
>
> If you rely on RAID to always work, and never go down, you Just Don't
> Understand.
SNIP
> Doesn't matter about drive type, doesn't really matter about device
> drivers, there are PLENTY of things that CAN and WILL cause every drive
> on the same channel with the failed drive to go down.  There are even
> plenty of things that can fail on the drive which will jump across
> channels (imagine a nice little despiking cap shorting out, slamming
> your 5V line to ground for a moment until it turns into a puff of smoke;
> yes, I've seen this).  RAID can help you get back up faster, but it
> can't keep you from ever going down.

Yep, it's amazing what happens to a hard drive when you pull out your
FiveSeven and pop off a few rounds into the system.

diana


Re: RAIDframe question

Peter Fraser
In reply to this post by Peter Matulis
Nick Holland wrote:

> Welcome to the REALITY of RAID.
>
> If you rely on RAID to always work, and never go down, you Just Don't
>  Understand.
>
> ...
>
> If hardware breaks, don't expect everything else to keep working.  Hope,
> sure.  Expect?  No.  I don't care if you are talking about ccd,
> RAIDframe, or hardware RAID.  Your machine can still go down due to a
> disk failure.  People who don't believe me have just been lucky.  So far.
>
> Further, if you wait until a disk fails to find out how things work, you
> are a fool.  Worst down-time disasters I've seen involved RAID systems
> where people expected magic to happen when something went wrong.

I come from a mainframe world that deals in non-stop transaction
processing.  That world expects disks to die, and the system to keep
on running.

Hardware mirroring is done within a disk controller, and software
mirroring is done between controllers.  Software mirroring is done
largely to protect from controller failure, not disk failure.  It is
the norm in such an environment to add and remove disks and disk
controllers on the fly.

Now, I know I should not expect the same reliability on a PC as on a
mainframe, but I have twice had disks fail on Windows servers using
software mirroring, and both times those systems survived.

For about the last three years, whenever I order a workstation I always
spend a bit extra to get mirroring (it's about $25 extra plus the price
of the disk drive), and I advise everyone I know to do the same.
I have yet to have a Windows machine die because of a disk failure
when mirrored.  I have also yet to see any loss of data.  Many
people have thanked me for this advice.

I am careful when I set up a software RAID. The two disks must be
on separate IDE controllers: the master/slave jumpers screw up
when one disk dies, and even cable select seems to cause trouble.

My belief is that if a system dies as a result of a mirrored disk's
death on a properly configured system, there is a bug.

I chose OpenBSD for its security. I use it for my name servers,
firewall, mail, and web, and I have set others up with it for
the same reason.  I completely believe that OpenBSD is the
best choice for protecting against intrusion. I just wish
my data were more secure against loss.

P.S.

For some strange reason, Microsoft allows mirroring, striping, and
concatenation of disks on its server editions, but the workstation
editions only allow striping and concatenation. So hardware mirroring
is the only option for XP.

I prefer software mirroring because it allows for controller failure.
I have had a hardware RAID system's controller fail and write
garbage over the disks.  I have also had a power supply screw
up and cause multiple disk failures on another hardware RAID
system.  Recently I have seen a lot of IDE controller failures.

If you use RAID you still have to do backups!


Re: RAIDframe question

knitti
On 2/2/06, Peter Fraser <[hidden email]> wrote:
> I have yet to have a Windows machine die because of a disk failure
> when mirrored.

OK, I'll take the bait. You are simply documenting that you had luck
in the past, perhaps also thanks to some good hardware (although
I do not trust those $25 "hardware" RAID controllers, be it onboard or
on an extra card).
I've seen Windows die on soft RAID and also on hard RAID. The latter
case was especially nice: one disk died, and the controller either didn't
recognize it or the driver didn't ask the controller, so no one knew
the drive was dead. Several minutes after rebooting, the system
locked up completely (again). Cheap hard RAID, bad driver.
*cough* adaptec *cough*

When I use RAID, I want to know when something is wrong, and
I want to come back up ASAP.
When I want hyper-availability, I have to duplicate
entire machines: routers/firewalls have carp, and servers have
either some application clustering or a hack simulating something
like that.

--knitti