---
layout: post
title: Fixing zfs pool error in zones/dump
date: 17.03.2014 14:08
---

At work we're using [SmartOS][], an Open Solaris clone featuring all kinds of cool stuff. One of the best things is the underlying file system: ZFS.

With ZFS it is easy to create, mirror, and extend storage pools; it's also very easy to snapshot pools and back them up using `zfs send` and `zfs receive`.
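
For example, a backup of a single dataset boils down to a snapshot and a send/receive pipe. A minimal sketch; the dataset and backup pool names here are made up:

~~~shell
# snapshot the dataset, then stream the snapshot into a backup pool
zfs snapshot zones/mydata@backup-2014-03-17
zfs send zones/mydata@backup-2014-03-17 | zfs receive backup/mydata
~~~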

While manually backing up one of the pools today, I wanted to check the status of the whole system with `zpool status`.
This is what it showed:

~~~shell
$ zpool status -v
  pool: zones
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
config:

        NAME        STATE     READ WRITE CKSUM
        zones       DEGRADED    16     0     0
          mirror-0  DEGRADED    32     0     0
            c0t4d0  DEGRADED    32     0     0  too many errors
            c0t6d0  DEGRADED    32     0     0  too many errors
        logs
          c0t9d0    ONLINE       0     0     0
        cache
          c0t8d0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zones/dump:<0x1>
~~~

At first this looks a little bit weird. What is this `zones/dump` even for? Why is it broken?
The answer: it's the dump device; Solaris writes the contents of memory to it when the system crashes.
I googled the error to find out why this volume would get corrupted and whether the disks were really broken or it was just a software error.

Turns out this bug is known: we recently upgraded our SmartOS, which triggers this issue.
The disks and the pool are not really broken; the data in the dump volume is simply misinterpreted.
To correct it, you have to replace the dump volume and then scrub the whole pool.
I executed the following commands to do this (found them in a [forum post](http://www.kdump.cn/forums/viewtopic.php?pid=2761#p2761)):

~~~shell
# create a temporary replacement dump volume
zfs create -o refreservation=0 -V 4G zones/dump2
# point the system dump device at it
dumpadm -d /dev/zvol/dsk/zones/dump2
# recreate zones/dump from scratch
zfs destroy zones/dump
zfs create -o refreservation=0 -V 4G zones/dump
# switch the dump device back and drop the temporary volume
dumpadm -d /dev/zvol/dsk/zones/dump
zfs destroy zones/dump2
~~~

This first creates a new volume and swaps it in as the dump device,
then deletes the old one, creates a new volume with the old name, and puts it back in place.
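
To double-check that the swap worked, you can inspect the dump configuration and the volume. A small sketch; `dumpadm` without arguments just prints the current configuration:

~~~shell
# show which device is currently configured for crash dumps
dumpadm
# confirm the recreated volume exists
zfs list -t volume zones/dump
~~~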

In case the `dumpadm -d` step fails, complaining about the volume being too small, just resize it:

~~~shell
# grow the temporary volume so dumpadm accepts it
zfs set volsize=20G zones/dump2
~~~

See [Swap and Dump Volume Sizes for ZFS File Systems](http://docs.oracle.com/cd/E23824_01/html/821-1459/fsswap-31050.html#SAGDFSfsswap-31050).
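
The scrub itself is a single command, and its progress shows up in `zpool status` (assuming the pool is named `zones`, as above):

~~~shell
# start a scrub of the pool and check on its progress
zpool scrub zones
zpool status zones
~~~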

The scrubbing took 21 hours with our large data set, but thanks to its low priority it was not noticeable in the machines running on this host.
The final status:

~~~shell
  pool: zones
 state: DEGRADED
…

errors: Permanent errors have been detected in the following files:

        <0x17f>:<0x1>
~~~

Well, now the `zones/dump:<0x1>` is gone. But it still shows an error for the same file system; it just isn't named anymore. We're scheduling maintenance soon to reboot the machine. Let's hope this will clear the error. Otherwise we will replace the HDD.

[smartos]: http://smartos.org/