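Summary
At home I run Proxmox VE as the base hypervisor, with TrueNAS SCALE and a TV recording server as guests on top of it. This consolidates resources into one oversized 4U chassis, which keeps management overhead and power consumption down. Because everything depends on it, I built the Proxmox VE host to be as fault tolerant as possible at the OS level.
To that end, the OS disk is a ZFS mirror of two SSDs, but one of them turned out to be extremely slow, so I investigated the cause. At work this would be a nauseating ticket, but this is a HomeLab, so I could relax and rule things out one at a time until the cause became visible. This post records how that went.
How it started
The machine had been running for about a week after the build. I was rebooting it repeatedly to tune kernel parameters when I noticed a large number of failed command: WRITE FPDMA QUEUED errors in dmesg, and the ZFS pool appeared to degrade briefly, so I started investigating.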
| Jan 11 12:39:42 nas-01 kernel: ata7.00: exception Emask 0x10 SAct 0x80000020 SErr 0x400000 action 0x6 frozen
Jan 11 12:39:43 nas-01 kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 11 12:39:43 nas-01 kernel: ata7: SError: { Handshk }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:28:40:7c:a0/01:00:1f:00:00/40 tag 5 ncq dma 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:f8:40:7d:a0/01:00:1f:00:00/40 tag 31 ncq dma 131072 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7: hard resetting link
Jan 11 12:39:43 nas-01 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 11 12:39:43 nas-01 kernel: ata7.00: configured for UDMA/133
Jan 11 12:39:43 nas-01 kernel: ata7: EH complete
Jan 11 12:39:43 nas-01 kernel: ata7.00: exception Emask 0x10 SAct 0x8000fe02 SErr 0x400000 action 0x6 frozen
Jan 11 12:39:43 nas-01 kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 11 12:39:43 nas-01 kernel: ata7: SError: { Handshk }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:08:40:85:a0/01:00:1f:00:00/40 tag 1 ncq dma 131072 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:48:40:83:a0/02:00:1f:00:00/40 tag 9 ncq dma 262144 out
res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:50:40:87:a0/01:00:1f:00:00/40 tag 10 ncq dma 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:58:40:81:a0/01:00:1f:00:00/40 tag 11 ncq dma 131072 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:60:40:8b:a0/01:00:1f:00:00/40 tag 12 ncq dma 131072 out
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:68:40:88:a0/01:00:1f:00:00/40 tag 13 ncq dma 131072 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:70:40:89:a0/01:00:1f:00:00/40 tag 14 ncq dma 131072 out
res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:78:40:8a:a0/01:00:1f:00:00/40 tag 15 ncq dma 131072 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 12:39:43 nas-01 kernel: ata7.00: cmd 61/00:f8:40:86:a0/01:00:1f:00:00/40 tag 31 ncq dma 131072 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 12:39:43 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 12:39:43 nas-01 kernel: ata7: hard resetting link
Jan 11 12:39:43 nas-01 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 11 12:39:43 nas-01 kernel: ata7.00: configured for UDMA/133
Jan 11 12:39:43 nas-01 kernel: ata7: EH complete
Jan 11 12:40:41 nas-01 kernel: zd64: p1 p2 p3
Jan 11 12:42:44 nas-01 kernel: zd64: p1 p2 p3
Jan 11 13:03:33 nas-01 kernel: ata7.00: exception Emask 0x10 SAct 0x80 SErr 0x400000 action 0x6 frozen
Jan 11 13:03:33 nas-01 kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 11 13:03:33 nas-01 kernel: ata7: SError: { Handshk }
Jan 11 13:03:33 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 13:03:33 nas-01 kernel: ata7.00: cmd 61/48:38:e0:23:22/00:00:21:00:00/40 tag 7 ncq dma 36864 out
res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 13:03:33 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 13:03:33 nas-01 kernel: ata7: hard resetting link
Jan 11 13:03:34 nas-01 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 11 13:03:34 nas-01 kernel: ata7.00: configured for UDMA/133
Jan 11 13:03:34 nas-01 kernel: ata7: EH complete
Jan 11 13:17:48 nas-01 kernel: ata7.00: exception Emask 0x10 SAct 0x40 SErr 0x400000 action 0x6 frozen
Jan 11 13:17:48 nas-01 kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 11 13:17:48 nas-01 kernel: ata7: SError: { Handshk }
Jan 11 13:17:48 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 13:17:48 nas-01 kernel: ata7.00: cmd 61/20:30:10:7d:22/00:00:21:00:00/40 tag 6 ncq dma 16384 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 13:17:48 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 13:17:48 nas-01 kernel: ata7: hard resetting link
Jan 11 13:17:49 nas-01 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 11 13:17:49 nas-01 kernel: ata7.00: configured for UDMA/133
Jan 11 13:17:49 nas-01 kernel: ata7: EH complete
Jan 11 13:18:04 nas-01 kernel: ata7.00: exception Emask 0x10 SAct 0x8400 SErr 0x400000 action 0x6 frozen
Jan 11 13:18:04 nas-01 kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Jan 11 13:18:04 nas-01 kernel: ata7: SError: { Handshk }
Jan 11 13:18:04 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 13:18:04 nas-01 kernel: ata7.00: cmd 61/18:50:90:7e:22/00:00:21:00:00/40 tag 10 ncq dma 12288 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 13:18:04 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 13:18:04 nas-01 kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Jan 11 13:18:04 nas-01 kernel: ata7.00: cmd 61/30:78:a8:7e:22/00:00:21:00:00/40 tag 15 ncq dma 24576 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 11 13:18:04 nas-01 kernel: ata7.00: status: { DRDY }
Jan 11 13:18:04 nas-01 kernel: ata7: hard resetting link
Jan 11 13:18:04 nas-01 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 11 13:18:04 nas-01 kernel: ata7.00: configured for UDMA/133
Jan 11 13:18:04 nas-01 kernel: ata7: EH complete
|
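With this many lines in the log, it helps to count the failures per ATA device before reading them one by one. The following is a minimal sketch, assuming the log lines have the same layout as above; count_ata_failures is a hypothetical helper name, not part of any tool.

```shell
# Count "failed command" lines per ATA device in kernel log text read from stdin.
# Prints one "ataN.MM <count>" line per device, sorted by device name.
count_ata_failures() {
    awk '/failed command:/ {
        # Find the field that names the ATA device, e.g. "ata7.00:".
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                dev = $i
                sub(/:$/, "", dev)
                count[dev]++
                break
            }
        }
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Usage on a live system (journalctl -k prints kernel messages):
#   journalctl -k --no-pager | count_ata_failures
```

If only one device shows up in the result, as ata7.00 does here, the problem is probably confined to a single port, cable, or drive.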
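The question at this point is which disk ata7 actually is.
After some trial and error, the command below answered it. Linux names disks /dev/sdX, so sweeping every name is a brute-force approach, but it gets the job done.
In this case the disk was TEAM_T253512GB_TPBF240909XXXXXXXXXX. Its sysfs path is /devices/pci0000:40/0000:40:08.1/0000:42:00.2/ata7, so the device behind ata7 is /dev/sdd.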
| > udevadm info --query=all --name=/dev/sd{a..z} | grep -E '^(P|S|M)'
P: /devices/pci0000:40/0000:40:08.1/0000:42:00.2/ata7/host6/target6:0:0/6:0:0:0/block/sdd
M: sdd
S: disk/by-id/ata-TEAM_T253512GB_TPBF240909XXXXXXXXXX
S: disk/by-diskseq/12
S: disk/by-path/pci-0000:42:00.2-ata-6
S: disk/by-path/pci-0000:42:00.2-ata-6.0
|
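The same mapping can be read from sysfs, because each /sys/block/sdX entry is a symlink whose target contains the ataN component seen in the P: line above. The following is a sketch under that assumption; ata_port_of_path and list_ata_ports are hypothetical helper names.

```shell
# Extract the ataN component from a sysfs device path (pure string handling).
# Prints nothing if the path has no ataN component (for example an NVMe disk).
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Print "sdX ataN" for every sd* block device that sits behind an ATA port.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        port=$(ata_port_of_path "$(readlink -f "$dev")")
        if [ -n "$port" ]; then
            printf '%s %s\n' "${dev##*/}" "$port"
        fi
    done
}

# Usage:
#   list_ata_ports
```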
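Checking the state
The investigation above identified /dev/sdd as the suspect drive, so let's look at it in detail.
First, use smartctl to confirm that /dev/sdd really has the suspect serial number.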
| > smartctl -i /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TEAM T253512GB
Serial Number: TPBF240909XXXXXXXXXX
LU WWN Device Id: 0 000000 000000000
Firmware Version: HP3414B5
User Capacity: 512,110,190,592 bytes [512 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Jan 11 13:45:10 2025 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
|
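The serial number is TPBF240909XXXXXXXXXX, so it matches. The drive is a SATA 3.2 device linked at 6.0 Gb/s, so it has negotiated SATA III properly. That rules out a downgraded link speed as the cause of the slowness.
The motherboard is a Supermicro H11SSL-i. According to the manual it has SATA 0-7, SATA 8-11, and SATA 12-15, and the block diagram shows that they connect directly to the CPU.

Accordingly, the ATA numbers also range from 0 to 16. Two kinds of cabling are in use: some ports go through an SFF8087 to Reverse SATA cable, and the rest through SFF8087 to MiniSAS. So I also needed to confirm that the cable was not at fault.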
SATA Port | HCTL | Cable |
---|---|---|
0-3 | 1-4 | SFF8087 to Reverse SATA |
4-7 | 5-8 | SFF8087 to MiniSAS |
8-11 | 9-12 | Not used |
12-15 | 13-16 | SFF8087 to Reverse SATA |
Since I was using two drives of the same model, I checked whether sdd and sdf differed in speed.
sdd was very slow (48.6 MiB/s), while sdf was reasonable (384 MiB/s).
| > lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdd 6:0:0:0 TEAM T253512GB TPBF240909XXXXXXXXXX
sdf 14:0:0:0 TEAM T253512GB TPBF2410210XXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdd --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --time_based=1
Run status group 0 (all jobs):
READ: bw=48.6MiB/s (51.0MB/s), 48.6MiB/s-48.6MiB/s (51.0MB/s-51.0MB/s), io=2918MiB (3060MB), run=60021-60021msec
Disk stats (read/write):
sdd: ios=93169/847, merge=0/26, ticks=1914572/17850, in_queue=1933389, util=99.87%
> fio --name=test --readonly --rw=randread --filename /dev/sdf --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --time_based=1
Run status group 0 (all jobs):
READ: bw=384MiB/s (403MB/s), 384MiB/s-384MiB/s (403MB/s-403MB/s), io=22.5GiB (24.2GB), run=60003-60003msec
Disk stats (read/write):
sdf: ios=735683/989, merge=0/19, ticks=1909824/3140, in_queue=1913104, util=99.87%
|
Next, to rule out the motherboard and the SATA cable, I swapped the two drives and repeated the test. The port at HCTL 6:0:0:0 (ata7) turned out to be fine, and only TPBF240909XXXXXXXXXX was unstable.
| > lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdd 6:0:0:0 TEAM T253512GB TPBF2410210XXXXXXXXX
sdf 14:0:0:0 TEAM T253512GB TPBF240909XXXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdd --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --ti
Run status group 0 (all jobs):
READ: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s), io=22.5GiB (24.2GB), run=60003-60003msec
> fio --name=test --readonly --rw=randread --filename /dev/sdf --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --time_based=1
Run status group 0 (all jobs):
READ: bw=48.6MiB/s (50.9MB/s), 48.6MiB/s-48.6MiB/s (50.9MB/s-50.9MB/s), io=2915MiB (3056MB), run=60019-60019msec
|
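When repeating the same fio run across several drives and ports, pulling the bandwidth figure out of the summary line makes the comparison easier. The following is a small sketch, assuming fio's human-readable output reports the READ bandwidth in MiB/s as it does above; fio_read_bw_mib and bw_ratio are hypothetical helper names.

```shell
# Read fio's human-readable output on stdin and print the READ bandwidth
# in MiB/s (the first figure on the "READ: bw=" summary line).
# Assumption: the unit is MiB/s; other units (KiB/s, GiB/s) are not handled.
fio_read_bw_mib() {
    sed -n 's/^ *READ: bw=\([0-9.][0-9.]*\)MiB\/s.*/\1/p'
}

# Print how many times faster the first figure is than the second.
bw_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Usage:
#   fio --name=test --readonly --rw=randread --filename /dev/sdX --bs=32k \
#       --ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --time_based=1 \
#       | fio_read_bw_mib
```

With the figures measured above, bw_ratio 384 48.6 gives 7.9, so the healthy drive is roughly eight times faster than the suspect one on the same workload.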
ATA 13-16 are connected through SFF8087 to Mini SAS, so I checked that cable as well. I tested each port with a spare Samsung SSD and found no problems on any of them.
| # => 4 3 2 [1]
> lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdh 16:0:0:0 Samsung SSD 870 EVO 1TB S74XXXXXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdh --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --ti
Run status group 0 (all jobs):
READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s), io=24.0GiB (25.7GB), run=60003-60003msec
# => 4 3 [2] 1
> lsblk -o NAME,HCTL,MODEL,SERIAL
sdg 15:0:0:0 Samsung SSD 870 EVO 1TB S74XXXXXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdg --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --ti
Run status group 0 (all jobs):
READ: bw=409MiB/s (428MB/s), 409MiB/s-409MiB/s (428MB/s-428MB/s), io=23.9GiB (25.7GB), run=60003-60003msec
# => 4 [3] 2 1
> lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdf 14:0:0:0 Samsung SSD 870 EVO 1TB S74XXXXXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdf --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --ti
Run status group 0 (all jobs):
READ: bw=412MiB/s (432MB/s), 412MiB/s-412MiB/s (432MB/s-432MB/s), io=24.1GiB (25.9GB), run=60003-60003msec
# => [4] 3 2 1
> lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdg 13:0:0:0 Samsung SSD 870 EVO 1TB S74XXXXXXXXXXXX
> fio --name=test --readonly --rw=randread --filename /dev/sdg --bs=32k \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=1m --time_based=1
Run status group 0 (all jobs):
READ: bw=411MiB/s (431MB/s), 411MiB/s-411MiB/s (431MB/s-431MB/s), io=24.1GiB (25.9GB), run=60003-60003msec
|
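With the ports and cables ruled out, I looked at the SMART attributes reported by smartctl and saw Erase_Fail_Count_Chip and Wear_Leveling_Count climbing. I contacted the manufacturer and, in the meantime, restored the mirror with a spare of the same model that I had kept as a maintenance part.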
| # => 4 3 [2] 1
> lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdg 15:0:0:0 TEAM T253512GB TPBF240909XXXXXXXXXX
> date
Sat Jan 11 06:20:30 PM JST 2025
> smartctl -a /dev/sdf
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TEAM T253512GB
Serial Number: TPBF240909XXXXXXXXXX
LU WWN Device Id: 0 000000 000000000
Firmware Version: HP3414B5
User Capacity: 512,110,190,592 bytes [512 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Jan 11 18:23:29 2025 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 8) minutes.
Extended self-test routine
recommended polling time: ( 16) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 050 Old_age Always - 314
12 Power_Cycle_Count 0x0032 100 100 050 Old_age Always - 27
160 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
161 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 100
163 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 124
164 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 14
165 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 23
166 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 1
167 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 6
168 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
169 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 100
175 Program_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
176 Erase_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 34265
177 Wear_Leveling_Count 0x0032 100 100 050 Old_age Always - 433299
178 Used_Rsvd_Blk_Cnt_Chip 0x0032 100 100 050 Old_age Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 050 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 050 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 26
194 Temperature_Celsius 0x0032 100 100 050 Old_age Always - 40
195 Hardware_ECC_Recovered 0x0032 100 100 050 Old_age Always - 19
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 0
198 Offline_Uncorrectable 0x0032 100 100 050 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 050 Old_age Always - 46
232 Available_Reservd_Space 0x0032 100 100 050 Old_age Always - 100
241 Total_LBAs_Written 0x0032 100 100 050 Old_age Always - 28403
242 Total_LBAs_Read 0x0032 100 100 050 Old_age Always - 2212
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 97 hours (4 days + 1 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 40 a0 48 56 26 00 at LBA = 0x00265648 = 2512456
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 a0 48 56 26 40 08 04:45:26.120 WRITE FPDMA QUEUED
61 00 78 48 55 26 40 08 04:45:26.120 WRITE FPDMA QUEUED
61 00 48 48 54 26 40 08 04:45:26.120 WRITE FPDMA QUEUED
61 00 c0 48 53 26 40 08 04:45:26.120 WRITE FPDMA QUEUED
61 00 98 48 52 26 40 08 04:45:26.120 WRITE FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 40 50 40 9c 01 00 at LBA = 0x00019c40 = 105536
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 40 50 40 9c 01 40 08 00:03:15.780 WRITE FPDMA QUEUED
61 40 48 00 97 01 40 08 00:03:15.780 WRITE FPDMA QUEUED
61 40 40 c0 91 01 40 08 00:03:15.770 WRITE FPDMA QUEUED
61 40 38 80 8c 01 40 08 00:03:15.770 WRITE FPDMA QUEUED
61 40 30 40 87 01 40 08 00:03:15.770 WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Self-test routine in progress 80% 312 -
# 2 Offline Self-test routine in progress 80% 312 -
# 3 Offline Self-test routine in progress 80% 312 -
# 4 Offline Self-test routine in progress 80% 312 -
# 5 Offline Self-test routine in progress 80% 312 -
# 6 Offline Self-test routine in progress 80% 312 -
# 7 Offline Self-test routine in progress 80% 312 -
# 8 Offline Self-test routine in progress 80% 312 -
# 9 Offline Self-test routine in progress 80% 312 -
#10 Offline Self-test routine in progress 80% 312 -
#11 Offline Self-test routine in progress 80% 312 -
#12 Offline Self-test routine in progress 80% 312 -
#13 Offline Self-test routine in progress 80% 312 -
#14 Offline Self-test routine in progress 80% 312 -
#15 Offline Self-test routine in progress 80% 312 -
#16 Offline Self-test routine in progress 80% 312 -
#17 Offline Self-test routine in progress 80% 312 -
#18 Offline Self-test routine in progress 80% 312 -
#19 Offline Self-test routine in progress 80% 312 -
#20 Offline Self-test routine in progress 80% 312 -
#21 Offline Self-test routine in progress 80% 312 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
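To see whether specific attributes are rising, it is enough to extract their raw values and compare two readings taken some time apart. The following is a minimal sketch, assuming the attribute table has the ten-column layout shown above; smart_raw_values is a hypothetical helper name.

```shell
# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for each
# attribute name given as an argument.
# Assumption: RAW_VALUE starts in column 10, as in the table above.
smart_raw_values() {
    names="$*"
    awk -v names="$names" '
        BEGIN {
            n = split(names, want, " ")
            for (i = 1; i <= n; i++) keep[want[i]] = 1
        }
        # Attribute rows start with a numeric ID; the header row does not.
        ($2 in keep) && $1 ~ /^[0-9]+$/ { print $2, $10 }
    '
}

# Usage:
#   smartctl -A /dev/sdX | smart_raw_values Erase_Fail_Count_Chip Wear_Leveling_Count
```

Running this before and after a write workload and comparing the two outputs shows whether the counters are still increasing.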
Removing the disk from ZFS
Proxmox VE was installed on a ZFS mirror, so the first step is to remove the failed disk TPBF240909XXXXXXXXXX from ZFS.
| > lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdd 6:0:0:0 TEAM T253512GB TPBF2410210XXXXXXXXX
sdf 14:0:0:0 TEAM T253512GB TPBF240909XXXXXXXXXX
> zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 472G 124G 348G - - 7% 26% 1.00x ONLINE -
|
The pool is named rpool, so next check its details.
| > zpool status rpool
pool: rpool
state: ONLINE
scan: resilvered 17.9M in 00:00:00 with 0 errors on Sat Jan 11 17:39:05 2025
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-TEAM_T253512GB_TPBF240909XXXXXXXXXX-part3 ONLINE 0 0 0
ata-TEAM_T253512GB_TPBF2410210XXXXXXXXX-part3 ONLINE 0 0 0
errors: No known data errors
|
With the by-id name confirmed, take the disk offline.
| > zpool offline rpool ata-TEAM_T253512GB_TPBF240909XXXXXXXXXX-part3
|
Checking the status again shows the following.
| > zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 472G 124G 348G - - 7% 26% 1.00x DEGRADED -
> zpool status rpool
pool: rpool
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 17.9M in 00:00:00 with 0 errors on Sat Jan 11 17:39:05 2025
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-TEAM_T253512GB_TPBF2409090090201032-part3 OFFLINE 0 0 0
ata-TEAM_T253512GB_TPBF2410210030300358-part3 ONLINE 0 0 0
|
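Before pulling a disk it is worth confirming that only the intended device is out of service. The following is a small sketch that lists every vdev whose state is not ONLINE, assuming the five-column config table shown above; zpool_unhealthy_vdevs is a hypothetical helper name.

```shell
# Read `zpool status` output on stdin and print "NAME STATE" for every row of
# the config table whose STATE is not ONLINE.
# Rows of the config table have at least five columns (NAME STATE READ WRITE CKSUM),
# which excludes the two-column "state: DEGRADED" header line.
zpool_unhealthy_vdevs() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ { print $1, $2 }'
}

# Usage:
#   zpool status rpool | zpool_unhealthy_vdevs
# For the overall pool state alone, the health property is enough:
#   zpool list -H -o health rpool
```

In the output above this would report rpool, mirror-0, and the one offlined partition, and nothing else.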
Adding the new disk to ZFS
Pull the disk from its tray, swap in the replacement, then write the partition table to it.
| > lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdd 6:0:0:0 TEAM T253512GB TPBF2410210030300358
sdh 14:0:0:0 TEAM T253512GB TPBF2409090090200500
|
Here the partition table is copied from /dev/sdd to /dev/sdh, so the command is as follows.
| > sgdisk /dev/sdd -R /dev/sdh
The operation has completed successfully.
|
Next, randomize the GUIDs so that they are unique.
| > sgdisk -G /dev/sdh
The operation has completed successfully.
|
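The argument order of sgdisk -R is easy to get backwards: the healthy source disk comes first and the new disk follows -R. The following is a sketch of a dry-run helper that only prints the two commands for review and runs nothing; print_gpt_clone_cmds is a hypothetical name.

```shell
# Print (do not run) the sgdisk commands that copy the partition table from a
# healthy disk to a new disk and then randomize the GUIDs on the new disk.
print_gpt_clone_cmds() {
    src="$1"
    dst="$2"
    # Refuse missing arguments and refuse copying a disk onto itself.
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: print_gpt_clone_cmds <healthy-disk> <new-disk>" >&2
        return 1
    fi
    printf 'sgdisk %s -R %s\n' "$src" "$dst"
    printf 'sgdisk -G %s\n' "$dst"
}

# Usage:
#   print_gpt_clone_cmds /dev/sdd /dev/sdh
```

Comparing the printed device names against the lsblk output before running them for real is a cheap safeguard, since reversing the order would overwrite the healthy disk's partition table.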
Note
Reboot here so that the new GUIDs are recognized.
After the reboot, check the new disk's by-id name and the state of the ZFS pool.
| > ls /dev/disk/by-id -ahl
total 0
drwxr-xr-x 2 root root 560 Jan 12 11:13 .
drwxr-xr-x 9 root root 180 Jan 12 11:13 ..
lrwxrwxrwx 1 root root 9 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY-part1 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY-part2 -> ../../sdf2
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY-part3 -> ../../sdf3
lrwxrwxrwx 1 root root 9 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part2 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Jan 12 11:13 ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part3 -> ../../sdd3
> zpool status
pool: rpool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 17.9M in 00:00:00 with 0 errors on Sat Jan 11 17:39:05 2025
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-TEAM_T253512GB_TPBF2409XXXXXXXXXXXX-part3 OFFLINE 0 0 0
ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part3 ONLINE 0 0 0
|
The device to replace is ata-TEAM_T253512GB_TPBF2409XXXXXXXXXXXX-part3 in the pool rpool, and the replacement disk is ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY. Put together, the command is as follows.
| > zpool replace -f rpool ata-TEAM_T253512GB_TPBF2409XXXXXXXXXXXX-part3 \
ata-TEAM_T253512GB_TPBF2409YYYYYYYYYYYY-part3
|
If the command succeeds, the resilver starts:
| > zpool status
pool: rpool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Jan 12 11:22:32 2025
124G / 124G scanned, 12.4G / 124G issued at 396M/s
12.5G resilvered, 9.98% done, 00:04:48 to go
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 0
ata-TEAM_T253512GB_TPBF240909XXXXXXXXXX-part3 OFFLINE 0 0 0
ata-TEAM_T253512GB_TPBF240909YYYYYYYYYY-part3 ONLINE 0 0 0 (resilvering)
ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part3 ONLINE 0 0 0
errors: No known data errors
|
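Rather than re-reading the full status while waiting, the progress figure can be pulled from the scan section. The following is a minimal sketch, assuming the progress line has the "resilvered, N% done" form shown above; resilver_progress is a hypothetical helper name.

```shell
# Read `zpool status` output on stdin and print the "% done" figure of a
# running resilver, or "none" if no resilver is in progress.
resilver_progress() {
    awk '/resilvered,.*% done/ {
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /%$/) { print $i; found = 1; exit }
             }
         }
         END { if (!found) print "none" }'
}

# Usage:
#   zpool status rpool | resilver_progress
```

On OpenZFS 2.0 or later, zpool wait -t resilver rpool should block until the resilver finishes, which may be simpler when no progress display is needed.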
When it completes, the state changes to ONLINE. The replacement is now done.
| > zpool status
pool: rpool
state: ONLINE
scan: resilvered 125G in 00:24:32 with 0 errors on Sun Jan 12 12:08:04 2025
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-TEAM_T253512GB_TPBF240909YYYYYYYYYY-part3 ONLINE 0 0 0
ata-TEAM_T253512GB_TPBF2410XXXXXXXXXXXX-part3 ONLINE 0 0 0
errors: No known data errors
|
Next, write the bootloader.
| > lsblk -o NAME,HCTL,MODEL,SERIAL
NAME HCTL MODEL SERIAL
sdd 6:0:0:0 TEAM T253512GB TPBF2410XXXXXXXXXXXX
├─sdd1
├─sdd2
└─sdd3
sdf 14:0:0:0 TEAM T253512GB TPBF2409YYYYYYYYYYYY
├─sdf1
├─sdf2
└─sdf3
> proxmox-boot-tool format /dev/sdf2 --force
UUID="11508983877615487063" SIZE="1073741824" FSTYPE="zfs_member" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="sdf" MOUNTPOINT=""
Formatting '/dev/sdf2' as vfat..
mkfs.fat 4.2 (2021-01-31)
Done.
> proxmox-boot-tool init /dev/sdf2 --force
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
UUID="89FD-A1F3" SIZE="1073741824" FSTYPE="vfat" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="sdf" MOUNTPOINT=""
Mounting '/dev/sdf2' on '/var/tmp/espmounts/89FD-A1F3'.
Installing systemd-boot..
Created "/var/tmp/espmounts/89FD-A1F3/EFI/systemd".
Created "/var/tmp/espmounts/89FD-A1F3/EFI/BOOT".
Created "/var/tmp/espmounts/89FD-A1F3/loader".
Created "/var/tmp/espmounts/89FD-A1F3/loader/entries".
Created "/var/tmp/espmounts/89FD-A1F3/EFI/Linux".
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/var/tmp/espmounts/89FD-A1F3/EFI/systemd/systemd-bootx64.efi".
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/var/tmp/espmounts/89FD-A1F3/EFI/BOOT/BOOTX64.EFI".
Random seed file /var/tmp/espmounts/89FD-A1F3/loader/random-seed successfully written (32 bytes).
Created EFI boot entry "Linux Boot Manager".
Configuring systemd-boot..
Unmounting '/dev/sdf2'.
Adding '/dev/sdf2' to list of synced ESPs..
Refreshing kernels and initrds..
Running hook script 'proxmox-auto-removal'..
Running hook script 'zz-proxmox-boot'..
Copying and configuring kernels on /dev/disk/by-uuid/89FD-A1F3
Copying kernel and creating boot-entry for 6.8.12-4-pve
Copying kernel and creating boot-entry for 6.8.12-5-pve
WARN: /dev/disk/by-uuid/EE7D-DAA3 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
Copying and configuring kernels on /dev/disk/by-uuid/EE7E-A82E
Copying kernel and creating boot-entry for 6.8.12-4-pve
Copying kernel and creating boot-entry for 6.8.12-5-pve
|
Now deal with the warning WARN: /dev/disk/by-uuid/EE7D-DAA3 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping.
It appears because the old disk's UUID is still listed in /etc/kernel/proxmox-boot-uuids, so deleting that line from the file fixes it.
| > nano /etc/kernel/proxmox-boot-uuids
|
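As an alternative to editing the file by hand, the stale entries can be found by checking each listed UUID against /dev/disk/by-uuid. The following is a sketch that only prints the UUIDs that still exist and modifies nothing, assuming the file holds one UUID per line; filter_existing_uuids is a hypothetical helper name.

```shell
# Print the UUIDs from a list file (one per line) that still exist as entries
# in a by-uuid directory. The directory defaults to /dev/disk/by-uuid.
filter_existing_uuids() {
    list="$1"
    dir="${2:-/dev/disk/by-uuid}"
    while IFS= read -r uuid; do
        [ -n "$uuid" ] || continue
        if [ -e "$dir/$uuid" ]; then
            printf '%s\n' "$uuid"
        fi
    done < "$list"
}

# Usage (review the result before replacing the real file):
#   filter_existing_uuids /etc/kernel/proxmox-boot-uuids > /tmp/proxmox-boot-uuids.new
#   diff /etc/kernel/proxmox-boot-uuids /tmp/proxmox-boot-uuids.new
```

Recent versions of proxmox-boot-tool also appear to provide a clean subcommand for the same purpose, so checking its help output first may make this helper unnecessary.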
Run the command once more to confirm; the warning should be gone.
| > proxmox-boot-tool init /dev/sdf2 --force
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
UUID="89FD-A1F3" SIZE="1073741824" FSTYPE="vfat" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="sdf" MOUNTPOINT=""
Mounting '/dev/sdf2' on '/var/tmp/espmounts/89FD-A1F3'.
Installing systemd-boot..
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/var/tmp/espmounts/89FD-A1F3/EFI/systemd/systemd-bootx64.efi".
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/var/tmp/espmounts/89FD-A1F3/EFI/BOOT/BOOTX64.EFI".
Random seed file /var/tmp/espmounts/89FD-A1F3/loader/random-seed successfully written (32 bytes).
Created EFI boot entry "Linux Boot Manager".
Configuring systemd-boot..
Unmounting '/dev/sdf2'.
Adding '/dev/sdf2' to list of synced ESPs..
Refreshing kernels and initrds..
Running hook script 'proxmox-auto-removal'..
Running hook script 'zz-proxmox-boot'..
Copying and configuring kernels on /dev/disk/by-uuid/89FD-A1F3
Copying kernel and creating boot-entry for 6.8.12-4-pve
Copying kernel and creating boot-entry for 6.8.12-5-pve
Copying and configuring kernels on /dev/disk/by-uuid/EE7E-A82E
Copying kernel and creating boot-entry for 6.8.12-4-pve
Copying kernel and creating boot-entry for 6.8.12-5-pve
|
That completes the repair of the Proxmox VE ZFS mirror.
References