softraid(4) RAID1 tools or experimental patches for consistency checking


Constantine A. Murenin
Dear misc@,

I'm curious if anyone has any sort of tools / patches to verify the consistency of softraid(4) RAID1 volumes?


If one adds a new disc (i.e. chunk) to a volume with the RAID1 discipline, the resilvering process of softraid(4) will read data from one of the existing discs and write it back to all of the discs, destroying the divergent copies that could otherwise have been used to reconstruct flipped bits correctly.

Additionally, this resilvering process is also really slow.  Per my notes from a few years ago, softraid has a fixed block size of 64KB (MAXPHYS); if we're talking about spindle-based HDDs, they only support on the order of 80 random IOPS at 7,2k RPM, half of which we have to use for reads and half for writes.  This means it'll take (1TB/64KB/(80/s/2)) = 4,5 days to resilver each 1TB of an average 7,2k RPM HDD; compare this with sequential resilvering, which would take (1TB/120MB/s) = 2,3 hours.  Reality may vary from these imprecise calculations, but the numbers do seem representative of the experience.

The above behaviour is defined here:

http://bxr.su/o/sys/dev/softraid_raid1.c#sr_raid1_rw

369        } else {
370            /* writes go on all working disks */
371            chunk = i;
372            scp = sd->sd_vol.sv_chunks[chunk];
373            switch (scp->src_meta.scm_status) {
374            case BIOC_SDONLINE:
375            case BIOC_SDSCRUB:
376            case BIOC_SDREBUILD:
377                break;
378
379            case BIOC_SDHOTSPARE: /* should never happen */
380            case BIOC_SDOFFLINE:
381                continue;
382
383            default:
384                goto bad;
385            }
386        }


What we could do is something like the following: pretend that any online chunk is not available for writes when the wu (work unit) we're handling is part of the rebuild process from http://bxr.su/o/sys/dev/softraid.c#sr_rebuild, i.e., mimic the BIOC_SDOFFLINE behaviour for BIOC_SDONLINE chunks (discs) whenever the SR_WUF_REBUILD flag is set on the work unit:

  switch (scp->src_meta.scm_status) {
  case BIOC_SDONLINE:
+         if (wu->swu_flags & SR_WUF_REBUILD)
+                 continue; /* must be same as BIOC_SDOFFLINE case */
+         /* FALLTHROUGH */
  case BIOC_SDSCRUB:
  case BIOC_SDREBUILD:


Obviously, there are both pros and cons to such an approach; I've tested a variation of the above in production (not being a fan of weeks-long random-read/write rebuilds); but use it at your own risk.

...

But back to the original problem: this consistency check would have to be filesystem-specific, because we have to know which blocks of softraid have and have not been used by the filesystem, as softraid itself is filesystem-agnostic.  I'd imagine it'd be somewhat similar in concept to the fstrim(8) utility on GNU/Linux -- http://man7.org/linux/man-pages/man8/fstrim.8.html -- and would also open the door for cron-based TRIM support (it would also have to know the softraid format itself).  Any pointers or hints on where to get started, or has anyone worked on this in the past?


Cheers,
Constantine. http://cm.su/


Re: softraid(4) RAID1 tools or experimental patches for consistency checking

Karel Gardas

Tried something like that in the past:
https://marc.info/?l=openbsd-tech&m=144217941801350&w=2

It worked more or less OK, except for the performance.  The problem is
that the data layout turns every read op into 2 read ops, and every
write op into a read op plus 2 write ops, which is not a recipe for
speed.  Caching the checksum blocks helped a lot in some cases, but was
never submitted, since you would ideally also need read-ahead, and that
was not done at all.  The other performance issue is that putting this
slow virtual drive implementation under the already slow FFS is a
recipe for disappointment from the performance point of view.
Certainly no speed demon, and certainly in a completely different
league from the checksum-capable filesystems of the open-source world
(ZFS, btrfs, bcachefs; no, HAMMER2 doesn't count, since it checksums
only metadata, not user data, and can't self-heal).

Yes, you are right that ideally the drive would be filesystem-aware to
optimize the rebuild, but this may be worked around by a more clever
layout that also marks the used blocks.  Anyway, that (and the above)
are IMHO the reasons why development effort goes into checksumming
filesystems rather than checksumming software RAIDs.  I read a paper
somewhere about Linux's mdadm hacked to do checksums, and the result
was pretty much the same (IIRC!), i.e., a performance disappointment.
If you are curious, google for it.

So, work on it if you can tolerate the speed...

On 1/12/20 6:46 AM, Constantine A. Murenin wrote:

> [...]


Re: softraid(4) RAID1 tools or experimental patches for consistency checking

Karel Gardas

A few missing notes on this email:

- my performance testing, results, and conclusions were obtained only on
mechanical drives (Hitachi 7K500 and WD RE 500) and only with a
metadata-intensive workload: basically tar -xf src.tar; unmount; then
rm -rf src; unmount, where src.tar was the src.tar of -stable at that
time.

- at the same time, WAPBL was submitted to tech@, and IIRC it increased
performance a lot: I had also been using limited caching of checksum
blocks (not in the patch, never submitted), and since the WAPBL log
lives in a constant place, RAID1c/s was much happier with it.

- at the time I didn't have any SSD/NVMe drives for testing.  The
situation may well be different with those, especially once someone
considers what is and what is no longer tolerable w.r.t. speed.

On 1/12/20 9:58 PM, Karel Gardas wrote:

> [...]