this post was submitted on 08 Aug 2024

Everything ZFS

A community for the ZFS filesystem.

ZFS is an open-source copy-on-write (COW) filesystem used by enterprises and serious homelabbers for its data safety and extensive feature set.

OpenZFS is the active branch, now developed primarily for Linux with a port to its FreeBSD roots.

This community is here to answer questions and discuss topics related to the use of ZFS in the wild.

Rules:

As always, the main rule is Don't Be a Dick. Be polite with new users asking questions that you may consider obvious. If you don't have something constructive to offer, downvote and move on.

No dirty deletes: your posts are here for posterity, perhaps the next person will get something out of it, even if it's wrong.

Looking for thoughts/opinions

I have a 5-disk raidz1 array. The disks are accumulating CKSUM errors, fairly evenly distributed across all five. I've been lazy and let this progress to the point where there are permanent errors in files.

# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /you/do/not/need/this/level of detail.txt
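
For anyone wanting to track whether counts like these keep growing, here is a minimal sketch that pulls the rows with a non-zero CKSUM column out of `zpool status` output. The function name `cksum_nonzero` is made up for illustration, and the filter assumes the table layout shown above (a `NAME ... CKSUM` header, then one row per device, ending at a blank line).

```shell
# Hypothetical helper: reads `zpool status` text on stdin and prints
# "<name> <cksum>" for every row of the config table whose CKSUM
# column is greater than zero.
cksum_nonzero() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $NF == "CKSUM" { in_table = 1; next }
        # A blank line ends the table.
        in_table && NF == 0 { in_table = 0 }
        # Rows are: NAME STATE READ WRITE CKSUM
        in_table && NF >= 5 && $5 + 0 > 0 { print $1, $5 }
    '
}

# Usage (assumes the pool is called "tank", as in the output above):
#   zpool status -v tank | cksum_nonzero
```

Run from cron and appended to a log, something like this would show whether the errors only jump after scrubs or creep up in between.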

I've done some research and believe (hope) that the cause of these errors is the "domestic" onboard SATA controllers I'm using, and I have ordered an LSI SAS3008 9300-8i HBA as an upgrade.

I know I can fix the permanent error by deleting the affected file, restoring it from backup, and then running a scrub. But I'm torn: should I scrub now and risk stressing the array more on the crappy SATA controllers, or wait until I get the new HBA (in a few weeks - free, cheap, slow shipping)?
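
As a rough sketch of that delete, restore, scrub sequence: the function name `restore_and_scrub` and its arguments are placeholders for illustration, not anything from the setup above. It assumes a good copy of the file exists somewhere off the pool.

```shell
# Hypothetical helper: replace a damaged file from a backup copy, reset the
# pool's error counters, and start a scrub.
#   $1 = pool name, $2 = damaged file on the pool, $3 = known-good backup copy
restore_and_scrub() {
    pool=$1
    damaged=$2
    backup=$3

    # Refuse to delete anything unless the backup copy is actually there.
    [ -f "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }

    rm -f -- "$damaged" &&
    cp -p -- "$backup" "$damaged" &&
    zpool clear "$pool" &&      # reset the READ/WRITE/CKSUM counters
    zpool scrub "$pool" &&      # starts the scrub and returns immediately
    zpool status -v "$pool"     # shows scrub progress and remaining errors
}

# Usage (paths are placeholders):
#   restore_and_scrub tank /tank/path/to/file /backup/path/to/file
```

The scrub runs in the background, so `zpool status -v` needs to be checked again once it finishes to see whether the permanent error entry is gone.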

top 7 comments
[–] [email protected] 2 points 2 months ago* (last edited 2 months ago) (1 children)

I have the same issue. For what it's worth, it's still running just fine after 3 years, apart from the occasional corrupted file after a scrub; thankfully that pool is mostly games and media I can just redownload. The error rate is always about the same, with a corrupted file whenever the controller fucks up. Weirdly, my SSD pool on the same controller seems fine, but it also completes a scrub in less than an hour, unlike the HDD array.

You'll be fine waiting a week if you want to be sure.

I wish there was an option to retry a few times instead of giving up, as the controller will give the correct data if tried again. It seems to happen when the controller is under heavy load for an extended period of time (i.e. 18h of scrubbing); it only does it close to the end.

I've seen some tunables to make the scrub slower; that might help reduce the strain enough to not cause the errors.
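
A sketch of what that could look like on Linux, assuming OpenZFS exposes its module parameters under `/sys/module/zfs/parameters` and that `zfs_vdev_scrub_min_active`, `zfs_vdev_scrub_max_active` and `zfs_scan_vdev_limit` are the relevant knobs on your version (check your release's module parameter docs). The function names are made up for illustration, and whether lowering these actually avoids the controller errors is untested.

```shell
# Assumption: Linux with the zfs kernel module loaded. Override PARAM_DIR
# to point elsewhere if your system differs.
PARAM_DIR="${PARAM_DIR:-/sys/module/zfs/parameters}"

# Print the current values of the scrub-related tunables that exist here.
show_scrub_tunables() {
    for p in zfs_vdev_scrub_min_active zfs_vdev_scrub_max_active zfs_scan_vdev_limit; do
        if [ -r "$PARAM_DIR/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$PARAM_DIR/$p")"
        fi
    done
}

# Lower (or raise) the maximum number of concurrent scrub I/Os per vdev.
# Needs root; the value should not be set below zfs_vdev_scrub_min_active.
set_scrub_max_active() {
    printf '%s\n' "$1" > "$PARAM_DIR/zfs_vdev_scrub_max_active"
}

# Usage:
#   show_scrub_tunables
#   set_scrub_max_active 1
# A running scrub can also be paused with `zpool scrub -p <pool>` and
# resumed later with `zpool scrub <pool>`.
```

Values written this way do not survive a reboot or module reload.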

[–] [email protected] 1 points 2 months ago

I have not been observant enough to notice that the corruption is caused by the scrubs. It makes sense - that would be the only real time my array gets any stress. That being the case - I'll leave the scrub until after I get the HBA installed.

[–] [email protected] 2 points 2 months ago (1 children)

Hello from All.

I don't know what any of that means, but as a scrub tech, my vote is to scrub.

[–] [email protected] 3 points 2 months ago

Why would anyone downvote this? Don't you have a sense of humour? I thought it was funny.

[–] [email protected] 1 points 2 months ago (1 children)

I’d shut it down before it corrupts even more, replace the HBA when it arrives, and run a scrub to see what the damage is.

[–] [email protected] 1 points 2 months ago (1 children)

I know that's the correct response. But, it's been running like this for many months, maybe even years - as I said in the post, I've been lazy. There's nothing on it that can't easily be restored, or replaced, and shutting it down would be a PITA.

[–] [email protected] 1 points 2 months ago

There’s always a chance your backups might get corrupted too if you let it continue like that.