A sad raid/fsck story


sven falempin
Dear readers,

I was running an OpenBSD (6.4) device with a RAID mirror array.
One of the disks failed, so the system asked me to fsck,
which I did before checking the RAID status manually ( :'( ).
THEN I rebooted and softraid told me that one of the hard drives was dead.

But fsck had already destroyed a few files on the mirror.

Probably a user error; nevertheless, in the OpenBSD 'simply works' mindset,
maybe /etc/rc could warn, or even perform some bioctl check on the RAID
array, when the first fsck / mount fails.

Cheers.

( Lost data recovered from backup )

Re: A sad raid/fsck story

Aaron Mason
Thanks for the cautionary tale.  Will definitely keep this in mind for
any RAID arrays I manage.

On Fri, Oct 4, 2019 at 2:04 AM sven falempin <[hidden email]> wrote:

> ...



--
Aaron Mason - Programmer, open source addict
I've taken my software vows - for beta or for worse


Re: A sad raid/fsck story

Nick Holland
In reply to this post by sven falempin
On 10/3/19 10:01 AM, sven falempin wrote:
> Dear readers,
>
> I was running an OpenBSD (6.4) device with a RAID mirror array.
> One of the disks failed, so the system asked me to fsck,

Probably not quite that simple.  More likely, the disk failed,
that took the system down hard, and it needed an fsck on reboot.
Which is normal, RAID or otherwise.

> which I did before checking the RAID status manually ( :'( ).
> THEN I rebooted and softraid told me that one of the hard drives was dead.
>
> But fsck had already destroyed a few files on the mirror.

That seems unlikely.  That's not what fsck does -- fsck's job is to
repair a file system.  If it removes a file, the file was already
damaged.

> Probably a user error; nevertheless, in the OpenBSD 'simply works' mindset,
> maybe /etc/rc could warn, or even perform some bioctl check on the RAID
> array, when the first fsck / mount fails.

I'm not seeing what this has to do with RAID, soft or otherwise.  If your
system needed an fsck, it needed it whether it was a simple drive or a
RAID array.  If you need an fsck, you are likely to have lost data.

> ( Lost data recovered from backup )

And again...nothing to do with either fsck or RAID -- you have to have
a backup.  RAID doesn't change that.

Nick.


Re: A sad raid/fsck story

sven falempin
On Fri, Oct 4, 2019 at 8:10 AM Nick Holland <[hidden email]> wrote:

> ...


Let me reformulate as a question, because I clearly misled you into
thinking that fsck -p from rc would delete files or that having a backup
is a bad idea. @_@
I do lose recent data with fsck -y, and I use it because I have a backup,
but the data loss here was massive (old, untouched files).

How can I check the state of the MIRROR RAID array, to detect a large
number of failures on one of the two disks?

Best.

--
Knowing is not enough; we must apply. Willing is not enough; we must do


Re: A sad raid/fsck story

Gwen Nelson
RAID is not a backup solution and should not be treated as one.

On Fri, Oct 4, 2019, 3:41 PM sven falempin <[hidden email]> wrote:

> ...

Re: A sad raid/fsck story

Nick Holland
In reply to this post by sven falempin
On 10/4/19 8:37 AM, sven falempin wrote:
...
> How [do I] check the state of the MIRROR RAID array, to detect a large
> number of failures on one of the two disks?
>
> Best.
>

fsck has NOTHING to do with the status of your drives.
It's a File System ChecKer.  Your disk can be covered with unreadable
sectors but if the file system on that disk is intact, fsck reports
no problem.  Conversely, your disks can be fine, but your file system
can be scrambled beyond recognition; bad news from fsck doesn't mean
your drive is bad.

To check the status of the disks, you probably want to slip a call
to bioctl into /etc/daily.local:

# bioctl softraid0
Volume      Status               Size Device  
softraid0 0 Online      7945693712896 sd2     RAID1
          0 Online      7945693712896 0:0.0   noencl <sd0a>
          1 Online      7945693712896 0:1.0   noencl <sd1a>

This is a happy array.  If you have a bad drive, one of those
physical drives is not going to be Online.

Nick.
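
One possible way to act on that suggestion is sketched below. This is
not from the thread: the function name check_raid_status and the warning
text are made up for illustration, and the volume name softraid0 is
simply taken from the output above. It flags any volume or disk line
whose status column is something other than "Online", so it does not
depend on the exact wording bioctl uses for a failed member.

```sh
#!/bin/sh
# Sketch for /etc/daily.local (assumption: the volume is softraid0, as above).

# check_raid_status (hypothetical helper): read bioctl output on stdin,
# print every volume or disk line whose status is not "Online", and
# return 1 if any such line was found.
check_raid_status() {
    awk '
        NF == 0 || $1 == "Volume" { next }   # skip blank lines and the header
        {
            # Disk rows start with the disk index, so the status is field 2;
            # volume rows start with the device name, so it is field 3.
            status = ($1 ~ /^[0-9]+$/) ? $2 : $3
            if (status != "Online") { print; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Only run where bioctl exists; daily(8) mails any output to root.
if command -v bioctl >/dev/null 2>&1; then
    bioctl softraid0 | check_raid_status ||
        echo "WARNING: softraid0 has members that are not Online"
fi
```

On a healthy array this prints nothing, so the daily mail stays quiet;
otherwise the offending lines and the warning show up in the report.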


Re: A sad raid/fsck story

sven falempin
On Sat, Oct 5, 2019 at 8:39 AM Nick Holland <[hidden email]> wrote:

> ...

My moral of the story is:

if your RAID array is not mounting, check SMART, check bioctl, fsck
each disk separately,
and then restore or dump the bad drive.

Next,

RAID 5 is cool. Does it know which disk failed the checksum?