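Replacing a failed disk in a Ceph cluster
So you have a Ceph cluster. Great, you are awesome; sooner or later you will run into this problem.
- Check the cluster health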
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2830, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e8871: 153 osds: 152 up, 152 in
pgmap v1409465: 66256 pgs, 30 pools, 201 TB data, 51906 kobjects
90439 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
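As a quick cross-check of the status output above, the osdmap line can be reduced to a single number: how many OSDs are not up. The following is only a sketch; the function name osds_not_up is made up for illustration, and it assumes the older plain-text osdmap layout shown above (newer Ceph releases print status differently).

```shell
#!/bin/sh
# Print how many OSDs are not "up", based on the osdmap line of `ceph status`.
# Assumes the text layout shown above:
#   osdmap e8871: 153 osds: 152 up, 152 in
# Field 3 is the total OSD count, field 5 the number that are up.
osds_not_up() {
    awk '$1 == "osdmap" { print $3 - $5 }'
}

# Typical use on a live cluster (not run here):
#   ceph status | osds_not_up
```

A non-zero result is a hint to go looking for a down OSD, which is the next step.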
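- Log in to any Ceph node and look for the failed disk (the listing below shows every OSD on the affected host for context; grep -i down on its own prints only the down line)
# ceph osd tree | grep -i down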
9 2.63 osd.9 up 1
17 2.73 osd.17 up 1
30 2.73 osd.30 up 1
53 2.73 osd.53 up 1
65 2.73 osd.65 up 1
78 2.73 osd.78 up 1
89 2.73 osd.89 up 1
99 2.73 osd.99 down 0
113 2.73 osd.113 up 1
128 2.73 osd.128 up 1
141 2.73 osd.141 up 1
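On a node with many OSDs it can help to print only the names of the down OSDs. A minimal sketch, assuming the five-column layout shown above (id, weight, name, status, reweight); the function name down_osds is hypothetical, and the column positions may differ on other Ceph releases.

```shell
#!/bin/sh
# Print the names of OSDs whose status column reads "down".
# Assumes the layout shown above: id, weight, name, status, reweight.
down_osds() {
    awk '$3 ~ /^osd\.[0-9]+$/ && $4 == "down" { print $3 }'
}

# Typical use on a live cluster (not run here):
#   ceph osd tree | down_osds
```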
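- You now know which node holds the failed OSD and what its OSD number is. Log in to that node and check whether the OSD is mounted; it should not be, because it has failed.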
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 47G 6.0G 38G 14% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sdd1 2.8T 197G 2.5T 8% /var/lib/ceph/osd/ceph-30
/dev/sde1 2.8T 172G 2.6T 7% /var/lib/ceph/osd/ceph-53
/dev/sdc1 2.8T 264G 2.5T 10% /var/lib/ceph/osd/ceph-17
/dev/sdh1 2.8T 227G 2.5T 9% /var/lib/ceph/osd/ceph-89
/dev/sdf1 2.8T 169G 2.6T 7% /var/lib/ceph/osd/ceph-65
/dev/sdi1 2.8T 150G 2.6T 6% /var/lib/ceph/osd/ceph-113
/dev/sdb1 2.7T 1.3T 1.3T 51% /var/lib/ceph/osd/ceph-9
/dev/sdj1 2.8T 1.6T 1.2T 58% /var/lib/ceph/osd/ceph-128
/dev/sdg1 2.8T 237G 2.5T 9% /var/lib/ceph/osd/ceph-78
/dev/sdk1 2.8T 1.5T 1.3T 53% /var/lib/ceph/osd/ceph-141
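The mount check above can also be scripted. This is a sketch under the assumption that OSD data directories follow the /var/lib/ceph/osd/ceph-<id> pattern seen in the df output; the function name osd_is_mounted is invented for this example.

```shell
#!/bin/sh
# Succeed (exit 0) if the df output on stdin lists a mount point for the
# given OSD id, fail (exit 1) otherwise. The last column of df is the
# mount point, so an exact match keeps ceph-9 from matching ceph-99.
osd_is_mounted() {
    awk -v target="/var/lib/ceph/osd/ceph-$1" \
        '$NF == target { found = 1 } END { exit !found }'
}

# Typical use on the OSD node (not run here):
#   df -h | osd_is_mounted 99 || echo "osd.99 is not mounted"
```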
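- Take a new physical drive to your data center and swap out the failed one. I assume your enterprise server supports hot swapping; almost every server does nowadays, but you should still check your server model. In this example I am using an HP DL380 server.
- After the physical swap, wait a little while for the new drive to settle.
- Log in to the node and take the OSD out of the cluster. Keep in mind that the OSD was already DOWN and OUT from the moment the disk failed: when an OSD becomes unavailable, Ceph marks it down and moves it out of the cluster on its own.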
# ceph osd out osd.99
osd.99 is already out.
# service ceph stop osd.99
/etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
# service ceph status osd.99
/etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
# service ceph status
=== osd.9 ===
osd.9: running {"version":"0.72.1"}
=== osd.30 ===
osd.30: running {"version":"0.72.1"}
=== osd.17 ===
osd.17: running {"version":"0.72.1"}
=== osd.128 ===
osd.128: running {"version":"0.72.1"}
=== osd.65 ===
osd.65: running {"version":"0.72.1"}
=== osd.141 ===
osd.141: running {"version":"0.72.1"}
=== osd.89 ===
osd.89: running {"version":"0.72.1"}
=== osd.53 ===
osd.53: running {"version":"0.72.1"}
=== osd.113 ===
osd.113: running {"version":"0.72.1"}
=== osd.78 ===
osd.78: running {"version":"0.72.1"}
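- Now remove the failed OSD from the CRUSH map. As soon as it is removed, Ceph starts making new copies of the PGs that lived on the failed disk and places them on other disks, so the recovery process begins.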
# ceph osd crush remove osd.99
removed item id 99 name 'osd.99' from crush map
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 43 pgs backfill; 56 pgs backfilling; 9 pgs peering; 82 pgs recovering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 192 pgs stuck unclean; 4 requests are blocked > 32 sec; recovery 373488/106903578 objects degraded (0.349%)
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e8946: 153 osds: 152 up, 152 in
pgmap v1409604: 66256 pgs, 30 pools, 201 TB data, 51916 kobjects
… GB used, 316 TB / 413 TB avail
373488/106903578 objects degraded (0.349%)
1 active
66060 active+clean
43 active+remapped+wait_backfill
3 peering
1 active+remapped
56 active+remapped+backfilling
6 stale+peering
4 active+clean+scrubbing+deep
82 active+recovering
recovery io 159 MB/s, 39 objects/s
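- (Optional) Check the disk statistics. The activity looks healthy, and after a while (depending on how much data was on the failed disk) recovery finishes.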
# dstat 10
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
2 3 95 1 0 0|2223k 5938k| 0 0 |1090B 2425B|5853 11k
14 58 1 25 0 2| 130M 627M| 219M 57M|6554B 0 | 28k 111k
14 57 1 26 0 2| 106M 743M| 345M 32M| 0 4096B| 35k 73k
13 61 1 23 0 2| 138M 680M| 266M 67M| 83k 0 | 31k 82k
14 52 1 31 0 2| 99M 574M| 230M 32M| 48k 6963B| 27k 78k
14 51 2 31 0 2| 99M 609M| 291M 31M| 0 0 | 29k 83k
11 57 1 28 0 2| 118M 636M| 214M 57M|9830B 0 | 26k 92k
12 49 4 34 0 1| 97M 432M| 166M 48M| 35k 0 | 22k 100k
13 44 3 38 0 1| 95M 422M| 183M 46M| 0 0 | 22k 88k
13 52 3 30 0 2| 96M 510M| 207M 44M| 0 0 | 25k 109k
14 49 3 32 0 2| 96M 568M| 276M 37M| 16k 0 | 27k 72k
9 54 4 31 0 2| 109M 520M| 136M 45M| 0 0 | 20k 89k
14 44 5 35 0 1| 76M 444M| 192M 13M| 0 0 | 22k 54k
15 47 3 34 0 1| 101M 452M| 141M 20M|3277B 13k| 21k 79k
17 48 3 31 0 1| 108M 445M| 181M 16M| 0 200k| 23k 69k
17 48 3 30 0 1| 154M 406M| 138M 23M| 0 0 | 21k 75k
17 53 3 27 0 1| 169M 399M| 115M 23M| 0 396k| 21k 81k
13 45 4 36 0 1| 161M 330M| 131M 20M| 0 397k| 20k 90k
11 51 5 33 0 1| 116M 416M| 145M 1177k| 0 184k| 20k 69k
14 50 4 31 0 1| 144M 376M| 124M 8752k| 0 0 | 20k 72k
14 42 6 37 0 1| 142M 340M| 138M 19M| 0 0 | 19k 79k
15 47 6 32 0 1| 111M 427M| 129M 11M| 0 819B| 19k 66k
15 50 5 29 0 1| 163M 413M| 139M 5709k| 58k 0 | 20k 90k
14 49 4 32 0 1| 155M 395M| 91M 12M| 0 0 | 18k 93k
18 43 7 31 0 1| 166M 338M| 84M 6493k| 0 0 | 17k 81k
14 49 5 32 0 1| 179M 335M| 98M 3824k| 0 0 | 18k 91k
13 46 9 31 0 1| 157M 299M| 72M 14M| 0 0 | 17k 125k
17 42 9 30 0 1| 188M 269M| 82M 11M| 16k 0 | 16k 102k
22 35 15 27 0 1| 158M 167M|8932k 287k| 0 0 | 13k 88k
7 20 46 26 0 0| 118M 12M| 250k 392k| 0 82k|9333 61k
7 17 60 16 0 0| 124M 1638B| 236k 225k| 0 0 |7512 64k
7 16 63 14 0 0| 117M 1005k| 247k 238k| 0 0 |7429 60k
3 9 82 5 0 0| 41M 17M| 225k 225k| 0 0 |6049 27k
4 8 81 7 0 0| 56M 7782B| 227k 225k| 0 6144B|5933 33k
4 9 79 7 0 0| 60M 9011B| 248k 245k| 0 9011B|6457 36k
4 9 79 7 0 0| 58M 236k| 231k 230k| 0 14k|6210 35k
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9045: 153 osds: 152 up, 152 in
pgmap v1409957: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90448 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
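Because this cluster never returns to HEALTH_OK (the 6 stale+peering PGs predate the disk failure), waiting for HEALTH_OK would never finish. One possible alternative, sketched here, is to look for recovery-related keywords in the status text. The function name recovery_in_progress and the keyword list are assumptions for illustration, not an official Ceph interface.

```shell
#!/bin/sh
# Succeed (exit 0) while the status text on stdin still mentions
# backfill, recovery or degraded objects; fail (exit 1) once it does not.
recovery_in_progress() {
    grep -Eq 'backfill|recover|degraded'
}

# Typical polling loop on a live cluster (not run here):
#   while ceph status | recovery_in_progress; do sleep 30; done
```

With the two status outputs above, the first one (taken right after the CRUSH removal) counts as in progress, and the second one does not.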
- Once data recovery has finished, delete the keyring of the failed OSD and finally remove the OSD itself
# ceph auth del osd.99
updated
# ceph osd rm osd.99
removed osd.99
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9046: 152 osds: 152 up, 152 in
pgmap v1409971: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90445 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
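The removal commands used so far can be collected into a small dry-run helper that only prints them, so they can be reviewed before anything is run. This is a sketch; the function name osd_removal_plan is hypothetical, and the commands are exactly the ones shown in the steps above.

```shell
#!/bin/sh
# Print, in order, the commands used above to retire a failed OSD.
# Nothing is executed; the output is meant to be read and run by hand.
osd_removal_plan() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: osd_removal_plan <numeric-osd-id>" >&2
            return 1
            ;;
    esac
    echo "ceph osd out osd.$1"
    echo "ceph osd crush remove osd.$1"
    echo "# wait here until recovery has finished"
    echo "ceph auth del osd.$1"
    echo "ceph osd rm osd.$1"
}

# Example: osd_removal_plan 99
```

Printing rather than executing is a deliberate choice here: the step between the CRUSH removal and the final cleanup involves waiting for recovery, which is easier to judge by eye than to automate blindly.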
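- Remove the entry for this OSD from ceph.conf (if there is one) and make sure ceph.conf is updated on every node. You can push the new configuration file to the whole cluster with the # ceph-deploy admin command.
- It is time to create a new OSD for the physical disk we inserted. You will see that Ceph gives the new OSD the same number as the failed one, because we removed the failed OSD cleanly. If you get a different OSD number, the failed OSD was not removed cleanly.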
# ceph osd create
99
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9047: 153 osds: 152 up, 152 in
pgmap v1409988: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90442 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
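The reason the same number comes back is, as far as I understand it, that ceph osd create hands out the lowest OSD id that is currently free. The sketch below mimics that rule on a plain list of ids so the expected number can be predicted beforehand. The function name lowest_free_id is made up, and the allocation rule is an assumption about Ceph's behaviour rather than something this guide demonstrates.

```shell
#!/bin/sh
# Read OSD ids (one per line) on stdin and print the smallest
# non-negative integer that is not among them.
lowest_free_id() {
    sort -n | awk '
        BEGIN { next_id = 0 }
        $1 == next_id { next_id++ }   # id is taken, try the following one
        END { print next_id }'
}

# Possible use on a live cluster (not run here), before `ceph osd create`:
#   ceph osd ls | lowest_free_id
```

For a cluster with ids 0 to 152 where only 99 has been removed, this prints 99, which matches the output of ceph osd create above.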
- List the disks, zap the new disk and redeploy it
# ceph-deploy disk list node14
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk list node14
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[ceph_deploy.osd][DEBUG ] Listing disks on node14...
[node14][INFO ] Running command: ceph-disk list
[node14][DEBUG ] /dev/sda
[node14][DEBUG ] /dev/sda1 other, ext4, mounted on /
[node14][DEBUG ] /dev/sda2 swap, swap
[node14][DEBUG ] /dev/sdb
[node14][DEBUG ] /dev/sdb1 ceph data, active, cluster ceph, osd.9, journal /dev/sdb2
[node14][DEBUG ] /dev/sdb2 ceph journal, for /dev/sdb1
[node14][DEBUG ] /dev/sdc
[node14][DEBUG ] /dev/sdc1 ceph data, active, cluster ceph, osd.17, journal /dev/sdc2
[node14][DEBUG ] /dev/sdc2 ceph journal, for /dev/sdc1
[node14][DEBUG ] /dev/sdd
[node14][DEBUG ] /dev/sdd1 ceph data, active, cluster ceph, osd.30, journal /dev/sdd2
[node14][DEBUG ] /dev/sdd2 ceph journal, for /dev/sdd1
[node14][DEBUG ] /dev/sde
[node14][DEBUG ] /dev/sde1 ceph data, active, cluster ceph, osd.53, journal /dev/sde2
[node14][DEBUG ] /dev/sde2 ceph journal, for /dev/sde1
[node14][DEBUG ] /dev/sdf
[node14][DEBUG ] /dev/sdf1 ceph data, active, cluster ceph, osd.65, journal /dev/sdf2
[node14][DEBUG ] /dev/sdf2 ceph journal, for /dev/sdf1
[node14][DEBUG ] /dev/sdg
[node14][DEBUG ] /dev/sdg1 ceph data, active, cluster ceph, osd.78, journal /dev/sdg2
[node14][DEBUG ] /dev/sdg2 ceph journal, for /dev/sdg1
[node14][DEBUG ] /dev/sdh
[node14][DEBUG ] /dev/sdh1 ceph data, active, cluster ceph, osd.89, journal /dev/sdh2
[node14][DEBUG ] /dev/sdh2 ceph journal, for /dev/sdh1
[node14][DEBUG ] /dev/sdi other, btrfs
[node14][DEBUG ] /dev/sdj
[node14][DEBUG ] /dev/sdj1 ceph data, active, cluster ceph, osd.113, journal /dev/sdj2
[node14][DEBUG ] /dev/sdj2 ceph journal, for /dev/sdj1
[node14][DEBUG ] /dev/sdk
[node14][DEBUG ] /dev/sdk1 ceph data, active, cluster ceph, osd.128, journal /dev/sdk2
[node14][DEBUG ] /dev/sdk2 ceph journal, for /dev/sdk1
[node14][DEBUG ] /dev/sdl
[node14][DEBUG ] /dev/sdl1 ceph data, active, cluster ceph, osd.141, journal /dev/sdl2
[node14][DEBUG ] /dev/sdl2 ceph journal, for /dev/sdl1
# ceph-deploy disk zap node14:sdi
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk zap node14:sdi
[ceph_deploy.osd][DEBUG ] zapping /dev/sdi on node14
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[node14][DEBUG ] zeroing last few blocks of device
[node14][INFO ] Running command: sgdisk --zap-all --clear --mbrtogpt -- /dev/sdi
[node14][DEBUG ] Creating new GPT entries.
[node14][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or
[node14][DEBUG ] other utilities.
[node14][DEBUG ] The operation has completed successfully.
# ceph-deploy --overwrite-conf osd prepare node14:sdi
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy --overwrite-conf osd prepare node14:sdi
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node14:/dev/sdi
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to node14
[node14][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[node14][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host node14 disk /dev/sdi journal None activate False
[node14][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdi
[node14][ERROR ] INFO:ceph-disk:Will colocate journal with data on /dev/sdi
[node14][DEBUG ] The operation has completed successfully.
[node14][DEBUG ] The operation has completed successfully.
[node14][DEBUG ] meta-data=/dev/sdi1 isize=2048 agcount=32, agsize=22884224 blks
[node14][DEBUG ] = sectsz=512 attr=2, projid32bit=0
[node14][DEBUG ] data = bsize=4096 blocks=732295168, imaxpct=5
[node14][DEBUG ] = sunit=64 swidth=64 blks
[node14][DEBUG ] naming =version 2 bsize=4096 ascii-ci=0
[node14][DEBUG ] log =internal log bsize=4096 blocks=357568, version=2
[node14][DEBUG ] = sectsz=512 sunit=64 blks, lazy-count=1
[node14][DEBUG ] realtime =none extsz=4096 blocks=0, rtextents=0
[node14][DEBUG ] The operation has completed successfully.
[ceph_deploy.osd][DEBUG ] Host node14 is now ready for osd use.
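Zapping destroys the partition table, so it is worth confirming that the target device does not already carry a live OSD. The sketch below checks a saved disk listing for that. The function name device_hosts_osd is hypothetical, and it assumes sdX-style device names and the "ceph data" wording of the ceph-disk list output shown above.

```shell
#!/bin/sh
# Succeed (exit 0) if the disk listing on stdin shows a "ceph data"
# partition on the given device (for example sdb), fail (exit 1) otherwise.
# Assumes partitions are named <device><number>, as with /dev/sdb1.
device_hosts_osd() {
    awk -v dev="/dev/$1" '
        /ceph data/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\/dev\//) {        # first /dev/... field is the partition
                    part = $i
                    sub(/[0-9]+$/, "", part)  # strip the partition number
                    if (part == dev) found = 1
                    break
                }
            }
        }
        END { exit !found }'
}

# Possible use (not run here):
#   ceph-deploy disk list node14 2>&1 | device_hosts_osd sdi && echo "do not zap sdi"
```

In the listing above, sdb through sdh and sdj through sdl would all be flagged, while sdi (shown as "other, btrfs") would not.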
- Check the new OSD
# ceph osd tree
9 2.63 osd.9 up 1
17 2.73 osd.17 up 1
30 2.73 osd.30 up 1
53 2.73 osd.53 up 1
65 2.73 osd.65 up 1
78 2.73 osd.78 up 1
89 2.73 osd.89 up 1
113 2.73 osd.113 up 1
128 2.73 osd.128 up 1
141 2.73 osd.141 up 1
99 2.73 osd.99 up 1
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 186 pgs backfill; 12 pgs backfilling; 6 pgs peering; 57 pgs recovering; 887 pgs recovery_wait; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 283 pgs stuck unclean; 2 requests are blocked > 32 sec; recovery 784023/106982434 objects degraded (0.733%)
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9190: 153 osds: 153 up, 153 in
pgmap v1413041: 66256 pgs, 30 pools, 200 TB data, 51840 kobjects
90504 GB used, 319 TB / 416 TB avail
784023/106982434 objects degraded (0.733%)
65108 active+clean
186 active+remapped+wait_backfill
887 active+recovery_wait
12 active+remapped+backfilling
6 stale+peering
57 active+recovering
recovery io 383 MB/s, 95 objects/s
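- You will notice that Ceph starts placing PGs (data) on the new OSD in order to rebalance the data and bring the new OSD into the cluster.
That's it: the replacement is done.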