A disk failure with RAID1. (I)

I have quite a few topics pending for this portfolio, but something new that caught me by surprise has jumped ahead of the rest.

For one reason or another, when I built my desktop computer at home (it's getting on a few years now…) I thought that "things should be done properly" and set up two 500 GB disks as if they were one, each mirroring the other (RAID1). Some people think that all I achieved was losing exactly half of the storage capacity. I think I played it safe in case one day a disk decided to fail out of the blue. Which is exactly what has finally happened today.

Since the two disks cover for each other, what I noticed was a stream of error messages, the computer running slower… and a small, regular "beep". The error messages were the following:
root@padova:/media/usb0/chryse# tail -f /var/log/syslog
Nov 27 21:36:09 padova kernel: [ 2907.886743] ata3: hard resetting link
Nov 27 21:36:12 padova kernel: [ 2911.132043] ata3: softreset failed (device not ready)
Nov 27 21:36:12 padova kernel: [ 2911.132057] ata3: applying PMP SRST workaround and retrying
Nov 27 21:36:12 padova kernel: [ 2911.304076] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 27 21:36:12 padova kernel: [ 2911.379778] ata3.00: configured for UDMA/33
Nov 27 21:36:12 padova kernel: [ 2911.404054] ata3: EH complete
Nov 27 21:36:13 padova kernel: [ 2912.568745] ata3: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
Nov 27 21:36:13 padova kernel: [ 2912.568753] ata3: irq_stat 0x00400000, PHY RDY changed
Nov 27 21:36:13 padova kernel: [ 2912.568758] ata3: SError: { Persist PHYRdyChg 10B8B }
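
To see at a glance which ATA port is complaining, these messages can be counted per port. A minimal sketch, assuming the log line format shown above; the function name count_ata_resets is hypothetical, and it reads the log from standard input.

```shell
#!/bin/sh
# Count "hard resetting link" kernel messages per ATA port.
# Reads syslog-style lines on stdin and prints "<port> <count>" per port.
count_ata_resets() {
    # Keep only the port name (ataN) of lines that report a hard link reset.
    sed -n 's/.*\] \(ata[0-9][0-9]*\): hard resetting link.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage, with the log path used in this post:
#   count_ata_resets < /var/log/syslog
```

This only tells us which port is resetting. Which disk sits behind that port still has to be confirmed separately, as is done further down with smartctl.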

I don't think I should have done it, but just in case, I rebooted the computer. A mistake, because it took forever to boot and get to work. A quick Google search led me to this link. One of the disks in the RAID had failed: I had to find out which disk it was, take the RAID apart, remove the disk from it, make one last emergency backup, physically replace the disk with another one and rebuild the RAID1. Since I don't have the replacement disk (yet), today I'll do the first part, and once everything is finished we'll do the second.

The first step was to check whether the two RAIDs were working correctly. In my case there are two partitions on each disk, one for /boot and another for a Logical Volume Manager, on which I later play a few other "dirty tricks" that needn't worry us for now (I think). So:

root@padova:/media/usb0/chryse# mdadm --detail /dev/md0
root@padova:/media/usb0/chryse# mdadm --detail /dev/md1

I'm not posting the output I get now, because it has changed a bit, as only one disk (/dev/sda) is active in the RAID.
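
Besides mdadm --detail, the kernel summarises the state of every array in /proc/mdstat, where a member map such as [U_] means one disk is missing. A minimal sketch that lists the degraded arrays, assuming the usual /proc/mdstat layout (an "mdN : ..." line followed by a line with the member map); the function name list_degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array whose member map (e.g. [U_]) shows a missing disk.
# Reads /proc/mdstat-style text on stdin.
list_degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }     # remember which array is being described
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    '
}

# Example usage:
#   list_degraded_arrays < /proc/mdstat
```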

As a matter of routine I checked the disk space and the filesystems.

root@padova:/media/usb0/chryse# df -h
root@padova:/media/usb0/chryse# fdisk -l

Checking the errors on each disk (for this we need the smartmontools installed).

root@padova:/media/usb0/chryse# aptitude install smartmontools

root@padova:/media/usb0/chryse# smartctl --all /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12
Device Model: ST3500418AS
Serial Number: 9VM1FWFN
LU WWN Device Id: 5 000c50 0152a171a
Firmware Version: CC35
User Capacity: 500,106,780,160 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Nov 27 22:23:03 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 93) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 108 099 006 Pre-fail Always - 17938660
3 Spin_Up_Time 0x0003 098 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 773
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 72642186
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 23495
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 463
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 083 000 Old_age Always - 1129
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 057 045 Old_age Always - 32 (Min/Max 30/32)
194 Temperature_Celsius 0x0022 032 043 000 Old_age Always - 32 (0 12 0 0)
195 Hardware_ECC_Recovered 0x001a 042 015 000 Old_age Always - 17938660
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 252827544865575
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 1091387300
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3228034447

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

And on the /dev/sdb disk…

root@padova:/media/usb0/chryse# smartctl --all /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12
Device Model: ST3500418AS
Serial Number: 5VM1SELY
LU WWN Device Id: 5 000c50 0165cb18f
Firmware Version: CC34
User Capacity: 500,106,780,160 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Nov 27 22:23:17 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 609) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 96) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 120 094 006 Pre-fail Always - 237507395
3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 097 097 020 Old_age Always - 3922
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 50
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 143059289
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 23543
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 096 096 020 Old_age Always - 4412
183 Runtime_Bad_Block 0x0000 100 100 000 Old_age Offline - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 192
188 Command_Timeout 0x0032 100 095 000 Old_age Always - 379
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 058 045 Old_age Always - 30 (Min/Max 30/30)
194 Temperature_Celsius 0x0022 030 042 000 Old_age Always - 30 (0 12 0 0)
195 Hardware_ECC_Recovered 0x001a 033 017 000 Old_age Always - 237507395
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 269569327390213
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2470981269
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3044039389

SMART Error Log Version: 1
ATA Error Count: 204 (device log contains only the most recent five errors)

CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 204 occurred at disk power-on lifetime: 23117 hours (963 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:39:11.543 READ FPDMA QUEUED
27 00 00 00 00 00 e0 00 00:39:11.517 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:39:11.516 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:39:11.503 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:39:11.477 READ NATIVE MAX ADDRESS EXT

Error 203 occurred at disk power-on lifetime: 23117 hours (963 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:39:08.720 READ FPDMA QUEUED
27 00 00 00 00 00 e0 00 00:39:08.694 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:39:08.693 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:39:08.681 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:39:08.655 READ NATIVE MAX ADDRESS EXT

Error 202 occurred at disk power-on lifetime: 23117 hours (963 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:39:05.890 READ FPDMA QUEUED
27 00 00 00 00 00 e0 00 00:39:05.863 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:39:05.862 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:39:05.850 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:39:05.824 READ NATIVE MAX ADDRESS EXT

Error 201 occurred at disk power-on lifetime: 23117 hours (963 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:39:03.076 READ FPDMA QUEUED
27 00 00 00 00 00 e0 00 00:39:03.049 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:39:03.048 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:39:03.036 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:39:03.010 READ NATIVE MAX ADDRESS EXT

Error 200 occurred at disk power-on lifetime: 23117 hours (963 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:39:00.245 READ FPDMA QUEUED
27 00 00 00 00 00 e0 00 00:39:00.219 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:39:00.218 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:39:00.205 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:39:00.179 READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Here the error was clearly identified, and it was on the /dev/sdb disk (I'm not posting that output either, but it made it crystal clear). So the next steps were to remove this disk from the RAID, reboot the computer, make the very last backup, and physically take the disk out of the computer.
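
Both reports are long, so it can help to pull out only the counters that point most directly at bad sectors. In the output above, /dev/sda shows 2 reallocated sectors and 0 reported uncorrectable errors, while /dev/sdb shows 50 and 192. A minimal sketch, assuming the ten-column attribute table printed by smartctl above; the function name smart_bad_sector_counters and the choice of attributes are my own.

```shell
#!/bin/sh
# Print "<attribute> <raw value>" for the SMART attributes that most directly
# signal bad sectors. Reads smartctl attribute output on stdin.
smart_bad_sector_counters() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Example usage (-A prints only the attribute table):
#   smartctl -A /dev/sdb | smart_bad_sector_counters
```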

To remove the disk from the RAID, the following is needed for each partition of the disk in question, in each RAID:

root@padova:/media/usb0/chryse# mdadm --manage /dev/md0 --fail /dev/sdb1
root@padova:/media/usb0/chryse# mdadm --manage /dev/md0 --remove /dev/sdb1
root@padova:/media/usb0/chryse# mdadm --manage /dev/md1 --fail /dev/sdb2
root@padova:/media/usb0/chryse# mdadm --manage /dev/md1 --remove /dev/sdb2
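
The same fail-then-remove pair repeats for every array, so it can be written as a loop. A minimal sketch that only prints the commands so they can be reviewed before being run as root; the function name print_mdadm_removal and the "array:member" argument format are hypothetical conventions for this example.

```shell
#!/bin/sh
# Print (not run) the mdadm commands that mark each listed member as failed
# and then remove it from its array.
# Arguments are "array:member" pairs, e.g. /dev/md0:/dev/sdb1
print_mdadm_removal() {
    for pair in "$@"; do
        array=${pair%%:*}     # text before the first colon
        member=${pair#*:}     # text after the first colon
        echo "mdadm --manage $array --fail $member"
        echo "mdadm --manage $array --remove $member"
    done
}

# Example usage, with the arrays and partitions from this post:
#   print_mdadm_removal /dev/md0:/dev/sdb1 /dev/md1:/dev/sdb2
```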

I rebooted the computer and, although it still complains, it now runs much faster. I made one very last backup to an external disk (with the external disk mounted at /media/usb0 and having created the chryse directory inside it to hold the backup of EVERYTHING):

root@padova:/media/usb0/chryse# rsync -av / /media/usb0/chryse
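
Because the destination /media/usb0/chryse lives under /, a plain copy of / may also walk into the backup directory itself and into virtual filesystems such as /proc and /sys. A hedged sketch of a variant that skips them: it only prints the command for review, the function name print_backup_command is hypothetical, and the exclude list is an assumption that should be adjusted to the machine.

```shell
#!/bin/sh
# Build and print an rsync command that backs up / while skipping virtual
# filesystems and the mount point that holds the backup.
# The exclude list below is an assumption, not something taken from the post.
print_backup_command() {
    dest=$1
    cmd="rsync -av"
    for skip in /proc /sys /dev /run /tmp /media; do
        cmd="$cmd --exclude=$skip"
    done
    echo "$cmd / $dest"
}

# Example usage:
#   print_backup_command /media/usb0/chryse
```

Patterns starting with a slash are anchored at the root of the transfer, which here is / itself, so --exclude=/media skips only the top-level /media directory.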

Now the last step, once the backup has completed, is to shut down the machine and take out the dodgy disk. I'll keep you posted.
