FreeNAS Alert
While away on holiday I received the following email:
FreeNAS: Critical Alerts Device: /dev/ada2, 1 Currently unreadable (pending) sectors
Fortunately, having OpenVPN set up, I was able to VPN into my home network to do some investigation.
The FreeNAS web interface was showing the red alert button due to the error but unfortunately did not show much detail. I connected to my FreeNAS server via ssh to look into the issue further.
Running smartctl -a /dev/ada2 showed that there was indeed 1 pending sector on the drive.
After some reading I found that this can happen when a drive fails to read a sector: the sector is marked as pending and is only reallocated if a later write to it also fails.
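To watch just the relevant counters rather than reading the full report each time, the attribute table can be filtered. The following is a minimal sketch, not something from the original troubleshooting session: smart_raw is a made-up helper name, and it assumes the usual layout of smartctl -A output, where the attribute name is in column 2 and the raw value in column 10.

```shell
# Hypothetical helper: print the raw value of one SMART attribute.
# Assumes the table printed by `smartctl -A`, with the attribute name in
# column 2 and the raw value in column 10.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (as root on the NAS):
#   smartctl -A /dev/ada2 | smart_raw Current_Pending_Sector
#   smartctl -A /dev/ada2 | smart_raw Reallocated_Sector_Ct
```

Attribute names vary a little between drive vendors, so it is worth checking the names in the full smartctl -A output first.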
In an effort to reduce disk activity while I figured out what to do I stopped all plugins and jails and made sure that no tasks were scheduled to run.
I also decided to temporarily remove the drive from the pool while working on it although I now suspect that this was unnecessary. After some googling I identified the disk and removed it from the pool as follows:
[root@freenas] ~# glabel status
Name                                        Status  Components
gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de  N/A     ada0p2
gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de  N/A     ada1p2
gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de  N/A     ada2p2
gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de  N/A     ada3p2
gptid/d95e5fe3-ec7f-11e5-93e4-b05ada874e14  N/A     da0p1
gptid/d97fbbc5-ec7f-11e5-93e4-b05ada874e14  N/A     da0p2
gptid/46d1f085-ec7b-11e5-880a-b05ada874e14  N/A     da1p1
gptid/46e2839e-ec7b-11e5-880a-b05ada874e14  N/A     da1p2
[root@freenas] ~# zpool offline vol0 /dev/gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de
This did technically degrade the array, and if I have to repeat this procedure in the future I would avoid taking the disk offline, especially on a pool with only single-disk redundancy.
[root@freenas] ~# zpool status -v vol0
pool: vol0
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 8h37m with 0 errors on Sun Jul 31 16:37:38 2016
config:
NAME                                            STATE     READ WRITE CKSUM
vol0                                            DEGRADED     0     0     0
  raidz1-0                                      DEGRADED     0     0     0
    gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
    gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
    14494392554726482640                        OFFLINE      0     0     0  was /dev/gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de
    gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
errors: No known data errors
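The step of matching ada2 to its gptid label by eye in the glabel status listing could also be scripted. This is only a sketch: labels_for_disk is a hypothetical helper, and it assumes the three-column Name / Status / Components layout shown in the glabel output earlier.

```shell
# Hypothetical helper: print the label(s) whose component is a partition
# of the given disk. Reads `glabel status` output on stdin and assumes
# the columns are Name, Status, Components.
labels_for_disk() {
    awk -v disk="$1" '$3 ~ ("^" disk "p[0-9]+$") { print $1 }'
}

# Usage:
#   glabel status | labels_for_disk ada2
```

The pattern is anchored at both ends so that asking for da0 does not also match ada0p2.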
I then started a long S.M.A.R.T. test to get some more information:
[root@freenas] ~# smartctl -t long /dev/ada2
Smartctl reported that this would take 417 minutes to complete, so I decided to resume looking into the issue the next day.
Once the test completed I ran smartctl -a /dev/ada2 again. Although the “Current Pending Sector count” was still 1, the test had completed successfully and was not showing any errors.
This was rather irritating as I was hoping to find the location of the sector that was causing the issue.
After some reading I ran the following which revealed the sector:
[root@freenas] ~# smartctl -l xerror /dev/ada2
Error 1 [0] occurred at disk power-on lifetime: 8046 hours (335 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 01 36 51 a8 88 40 00 Error: UNC at LBA = 0x13651a888 = 5206288520
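As a quick sanity check, the hex LBA in the error log can be converted back to decimal in the shell. This is just arithmetic on the value printed above; the variable names are arbitrary.

```shell
# LBA_48 bytes from the error log, read left to right: 00 01 36 51 a8 88
lba_dec=$((0x13651a888))
echo "$lba_dec"
```

The result should match the decimal value smartctl prints at the end of the same line.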
Now knowing that the “Logical Block Address” of the erroring sector was 5206288520 I set a kernel option to allow direct access to the disk:
[root@freenas] ~# sysctl kern.geom.debugflags=16
As I knew that the disk has 4K sectors I ran the following to write zeros to this sector:
[root@freenas] ~# dd if=/dev/zero of=/dev/ada2 bs=4096 count=1 seek=5206288520
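This resulted in an input/output error, which I found quite confusing. After further reading I found that the LBA (for historic reasons) counts 512-byte logical sectors, not the drive's 4K physical sectors, so with bs=4096 my seek pointed at an offset eight times too large.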
I adjusted my command as follows:
[root@freenas] ~# dd if=/dev/zero of=/dev/ada2 bs=512 count=1 seek=5206288520
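The difference between the two dd invocations comes down to the byte offset that bs multiplied by seek produces. The sketch below works through that arithmetic; it assumes 512-byte logical and 4096-byte physical sectors as described above, and the variable names are illustrative only.

```shell
lba=5206288520     # LBA from the smartctl error log
logical=512        # logical sector size (assumed for this drive)
physical=4096      # physical sector size, as stated above

# Byte offset that each dd invocation actually targets
echo "bs=512  seek=$lba -> byte offset $((lba * logical))"
echo "bs=4096 seek=$lba -> byte offset $((lba * physical))"

# The same location expressed in 4K blocks, if the LBA is 4K-aligned
per_block=$((physical / logical))
if [ $((lba % per_block)) -eq 0 ]; then
    block_4k=$((lba / per_block))
    echo "equivalent 4K seek value: $block_4k"
fi
```

The first offset is about 2.7 TB and the second about 21 TB, which would be well past the end of the disk. Because this LBA happens to be divisible by 8, the same sector could in principle also be addressed with bs=4096 and the smaller seek value printed last, although that variant was not tried here.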
As I no longer needed direct access to the disk I reset the kernel flag and started a short SMART test:
[root@freenas] ~# sysctl kern.geom.debugflags=0
[root@freenas] ~# smartctl -t short /dev/ada2
This completed in 3 minutes, and running smartctl -a /dev/ada2 showed a pending sector count of 0. Strangely, in my case the reallocated sector count did not increase, so I suspect that the drive was able to recover the sector.
I then added the drive back to the pool and started a scrub as follows:
[root@freenas] ~# zpool online vol0 14494392554726482640
[root@freenas] ~# zpool scrub vol0
After approximately 9 hours the scrub completed, having repaired the data I had zeroed:
[root@freenas] ~# zpool status -v vol0
pool: vol0
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 44K in 9h9m with 0 errors on Tue Aug 9 11:51:01 2016
config:
NAME                                            STATE     READ WRITE CKSUM
vol0                                            ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    gptid/dff6e82d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
    gptid/e06f42ea-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
    gptid/e0f4248d-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     1
    gptid/e1787b7c-4f68-11e5-9220-a0b3ccdf05de  ONLINE       0     0     0
errors: No known data errors
As the checksum error reported against the disk was caused by my overwriting the sector, I cleared it with zpool clear.
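Before clearing, it can be useful to confirm that the only non-zero counter belongs to the disk that was deliberately overwritten. A rough sketch follows: devices_with_errors is a hypothetical helper, and it assumes the NAME / STATE / READ / WRITE / CKSUM column layout shown in the zpool status output above, with plain integer counters.

```shell
# Hypothetical helper: list entries from `zpool status` output (on stdin)
# that have a non-zero READ, WRITE or CKSUM counter.
# Assumes columns NAME STATE READ WRITE CKSUM with plain integer counts.
devices_with_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 + $4 + $5) > 0 { print $1, "read=" $3, "write=" $4, "cksum=" $5 }'
}

# Usage:
#   zpool status vol0 | devices_with_errors
```

Large counters may be abbreviated by zpool status (for example with a K suffix), which this simple numeric test does not handle.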
The Alert light went back to green in the webUI and I am fairly confident that this has resolved the issue.
Many thanks to Dan Smith whose blog post was of great help. Also see the FreeBSD Diary for further information.