FreeNAS Alert

While away on holiday I received the following email:

FreeNAS: Critical Alerts Device: /dev/ada2, 1 Currently unreadable (pending) sectors

Fortunately, having OpenVPN set up, I was able to VPN into my home network to investigate.

The FreeNAS web interface was showing the red alert button because of the error, but unfortunately it did not show much detail. I connected to my FreeNAS server via SSH to look into the issue further.

Running smartctl -a /dev/ada2 showed that there was indeed 1 pending sector on the drive.
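To pull out just this one value rather than reading the whole report, a small filter can help. This is a minimal sketch, assuming the usual smartctl attribute table layout in which the attribute name is the second column and the raw value the tenth; the function name pending_sectors is made up for illustration.

```shell
# Print the raw value of Current_Pending_Sector from `smartctl -A` output on stdin.
# Assumes the standard attribute table: column 2 is the name, column 10 the raw value.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}
```

On the server it would be used as: smartctl -A /dev/ada2 | pending_sectors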

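After some reading I found that a sector is marked as pending when the drive fails to read it. The drive will not reallocate the sector until something writes to it: if the write succeeds the sector is simply cleared from the pending count, and only if the write fails is it reallocated to a spare.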

To reduce disk activity while I figured out what to do, I stopped all plugins and jails and made sure that no tasks were scheduled to run.

I also decided to temporarily remove the drive from the pool while working on it, although I now suspect that this was unnecessary. After some googling I used glabel status to find the gptid label for the disk's partition (ada2p2) and took it offline as follows:

[root@freenas] ~# glabel status
                                      Name  Status  Components
gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de     N/A  ada0p2
gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de     N/A  ada1p2
gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de     N/A  ada2p2
gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de     N/A  ada3p2
gptid/d95e5fe3-ec7f-11e5-93e4-b05ada874e14     N/A  da0p1
gptid/d97fbbc5-ec7f-11e5-93e4-b05ada874e14     N/A  da0p2
gptid/46d1f085-ec7b-11e5-880a-b05ada874e14     N/A  da1p1
gptid/46e2839e-ec7b-11e5-880a-b05ada874e14     N/A  da1p2
[root@freenas] ~# zpool offline vol0 /dev/gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de

This did technically degrade the array, and if I have to repeat this procedure in the future I would skip this step, especially on a pool with only single-disk redundancy such as this raidz1.

[root@freenas] ~# zpool status -v vol0
  pool: vol0
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 8h37m with 0 errors on Sun Jul 31 16:37:38 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
            gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
            14494392554726482640                        OFFLINE      0     0     0  was /dev/gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de
            gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0

errors: No known data errors
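The lookup from partition name to gptid label, done by eye in the glabel status output earlier, can also be written as a small filter. This is only a sketch: it assumes the three-column layout (Name, Status, Components) shown in that output, and the function name label_for is hypothetical.

```shell
# Print the label for a given partition (e.g. ada2p2) from `glabel status` output on stdin.
# Assumes three whitespace-separated columns: Name, Status, Components.
label_for() {
    awk -v part="$1" '$3 == part { print $1 }'
}
```

With the table above, glabel status | label_for ada2p2 would print gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de, which is the name passed to zpool offline.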

I then started a long SMART test to get some more information:

[root@freenas] ~# smartctl -t long /dev/ada2

Smartctl reported that this would take 417 minutes to complete, so I decided to resume looking into the issue the next day.

Once the test completed I ran smartctl -a /dev/ada2 again. Although the “Current Pending Sector” count was still 1, the test had completed successfully and was not showing any errors.

This was rather irritating as I was hoping to find the location of the sector that was causing the issue.

After some reading I ran the following, which revealed the sector:

[root@freenas] ~# smartctl -l xerror /dev/ada2
Error 1 [0] occurred at disk power-on lifetime: 8046 hours (335 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 36 51 a8 88 40 00  Error: UNC at LBA = 0x13651a888 = 5206288520
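smartctl already prints the decimal LBA next to the hexadecimal one, but the conversion is easy to double-check from the shell before writing anything to the disk. A minimal sketch follows; the function name is made up for illustration.

```shell
# Convert a hexadecimal LBA (as printed by `smartctl -l xerror`) to decimal.
lba_hex_to_dec() {
    printf '%d\n' "$1"
}

lba_hex_to_dec 0x13651a888   # prints 5206288520
```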

Now knowing that the “Logical Block Address” (LBA) of the failing sector was 5206288520, I set a kernel option that allows writing directly to the disk even while GEOM considers it in use:

[root@freenas] ~# sysctl kern.geom.debugflags=16

As I knew that the disk has 4K sectors, I ran the following to write zeros to this sector:

[root@freenas] ~# dd if=/dev/zero of=/dev/ada2 bs=4096 count=1 seek=5206288520

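This resulted in an input/output error, which I found quite confusing. After further reading I found that the LBA is (for historic reasons) normally expressed in 512-byte units, even on a drive like this one with 4K physical sectors, so the seek value only makes sense with a 512-byte block size.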

I adjusted my command as follows:

[root@freenas] ~# dd if=/dev/zero of=/dev/ada2 bs=512 count=1 seek=5206288520
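The unit mismatch is easier to see as arithmetic. The sketch below assumes that the reported LBA counts 512-byte units, as described above; the function names are hypothetical. With bs=4096 the same seek value lands eight times further into the disk, presumably beyond its end, which would be consistent with the input/output error.

```shell
# Byte offset that dd seeks to: seek (in blocks) multiplied by the block size.
dd_offset() {   # usage: dd_offset SEEK BS
    echo "$(( $1 * $2 ))"
}

# Convert an LBA counted in 512-byte units to a 4096-byte block number.
# Only exact when the LBA is a multiple of 8; prints nothing and fails otherwise.
lba512_to_4k() {
    [ "$(( $1 % 8 ))" -eq 0 ] || return 1
    echo "$(( $1 / 8 ))"
}

dd_offset 5206288520 512     # 2665619722240   (about 2.67 TB, the intended sector)
dd_offset 5206288520 4096    # 21324957777920  (about 21.3 TB, eight times too far)
lba512_to_4k 5206288520      # 650786065
```

Because this LBA happens to be a multiple of 8, zeroing the whole 4K block would in principle be dd with bs=4096 count=1 seek=650786065. That form was not tested here.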

As I no longer needed direct access to the disk, I reset the kernel flag and started a short SMART test:

[root@freenas] ~# sysctl kern.geom.debugflags=0
[root@freenas] ~# smartctl -t short /dev/ada2

This completed in 3 minutes, and running smartctl -a /dev/ada2 showed a pending sector count of 0. Interestingly, the reallocated sector count did not increase in my case, so I suspect that the write succeeded in place and the drive was able to recover the sector without remapping it.

I then added the drive back to the pool and started a scrub as follows:

[root@freenas] ~# zpool online vol0 14494392554726482640
[root@freenas] ~# zpool scrub vol0

After waiting approximately 9 hours the scrub completed and was able to repair the zeroed sectors:

[root@freenas] ~# zpool status -v vol0
  pool: vol0
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 44K in 9h9m with 0 errors on Tue Aug  9 11:51:01 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
            gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
            gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     1
            gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0

errors: No known data errors
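On a pool with more disks, spotting the one non-zero counter by eye gets harder. As a rough sketch, the filter below lists only the members with a non-zero READ, WRITE or CKSUM count. It assumes the config table layout shown above, and the function name is made up for illustration.

```shell
# List pool members with a non-zero READ, WRITE or CKSUM count from `zpool status` output on stdin.
# Assumes the config table columns: NAME STATE READ WRITE CKSUM.
devices_with_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|OFFLINE|FAULTED|UNAVAIL|REMOVED)$/ && ($3 + $4 + $5) > 0 {
        print $1, "read=" $3, "write=" $4, "cksum=" $5
    }'
}
```

Run as zpool status vol0 | devices_with_errors, this would report only the e0f4248d disk from the output above, with cksum=1.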

As the reported error (the single checksum error shown against the disk) was caused by my overwriting a sector, I cleared it with zpool clear.

The alert light went back to green in the web UI, and I am fairly confident that this has resolved the issue.

Many thanks to Dan Smith whose blog post was of great help. Also see the FreeBSD Diary for further information.