Difference Between 'Ceph Osd Reweight' and 'Ceph Osd Crush Reweight'

December 23, 2014 laurentbarbe

From Gregory and Craig on the mailing list…

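“ceph osd crush reweight” sets the CRUSH weight of the OSD. This
weight is an arbitrary value (generally the size of the disk in TB, or
something similar) and controls how much data the system tries to
allocate to the OSD.
“ceph osd reweight” sets an override weight on the OSD. This value is
in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
data that would otherwise live on this drive. It does *not* change the
weights assigned to the buckets above the OSD, and is a corrective
measure in case the normal CRUSH distribution isn't working out quite
right. (For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to try to compensate for it.)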
Gregory Farnum lists.ceph.com/pipermail/…
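
To make the "(1-weight)" part of the quote concrete, here is a toy model in Python. It is only a sketch: the function names are made up for illustration, and SHA-256 stands in for whatever hash CRUSH really uses. The idea it illustrates is that each (PG, OSD) pair gets a deterministic pseudo-random draw, and the OSD is skipped when the draw falls above the override weight, so roughly (1-weight) of the placements go elsewhere.

```python
import hashlib

def is_rejected(pg_id: int, osd_id: int, override: float) -> bool:
    """Toy stand-in for the override check (hypothetical, not Ceph code)."""
    if override >= 1.0:
        return False  # full override weight: never rejected
    if override <= 0.0:
        return True   # override weight 0: always rejected
    # Deterministic draw in [0, 1) derived from the (pg, osd) pair.
    digest = hashlib.sha256(f"{pg_id}:{osd_id}".encode()).digest()
    draw = int.from_bytes(digest[:2], "big") / 65536
    return draw >= override

def rejected_fraction(osd_id: int, override: float, samples: int = 20000) -> float:
    """Fraction of sampled placements that would have to be re-placed."""
    rejected = sum(is_rejected(pg, osd_id, override) for pg in range(samples))
    return rejected / samples

for w in (1.0, 0.8, 0.5, 0.0):
    print(f"override={w}: about {rejected_fraction(0, w):.2f} of placements re-placed")
```

Because the draw depends only on the pair, the same PGs are rejected every time, which is why lowering the override weight moves a stable subset of data rather than reshuffling everything.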

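Note that 'ceph osd reweight' is not a persistent setting. When an OSD
gets marked out, the OSD weight will be set to 0. When it gets marked in
again, the weight will be changed to 1.
Because of this, 'ceph osd reweight' is a temporary solution. You should
only use it to keep your cluster running while you're ordering more
hardware.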
Craig Lewis lists.ceph.com/pipermail/…
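
The behaviour Craig describes can be pictured with a tiny state model. This is a sketch of the quote only, not of Ceph internals; the class and method names are hypothetical.

```python
class OsdOverride:
    """Toy model of the override weight as described in the quote above."""

    def __init__(self) -> None:
        self.reweight = 1.0  # a fresh, "in" OSD has override weight 1

    def set_reweight(self, value: float) -> None:
        if not 0.0 <= value <= 1.0:
            raise ValueError("override weight must be between 0 and 1")
        self.reweight = value

    def mark_out(self) -> None:
        self.reweight = 0.0  # marking out forces the weight to 0

    def mark_in(self) -> None:
        self.reweight = 1.0  # marking in resets to 1, not to the old value

osd = OsdOverride()
osd.set_reweight(0.8)  # manual correction for an over-full disk
osd.mark_out()
osd.mark_in()
print(osd.reweight)    # the 0.8 correction is gone
```

In this model the manual 0.8 does not survive an out/in cycle, which is the reason the quote calls the setting temporary.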

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040961.html

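I had been puzzling over what happened when one of my OSDs was marked down (on my old Cuttlefish cluster…): I noticed that only the drives of the local machine seemed to fill up. That seems normal, since the weight of the host in the crushmap had not changed.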

Test

Testing on a simple cluster (Giant), with this crushmap rule:

ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit

Take the 8 PGs on pool 3 as an example:

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [0,2]
3.5 [4,1]
3.6 [2,0]
3.7 [2,1]
3.0 [2,1]
3.1 [0,2]
3.2 [2,1]
3.3 [2,4]

Now I try ceph osd out:

$ ceph osd out 0    # This is equivalent to "ceph osd reweight 0 0"
marked out osd.0.

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.2 root default
-2    0.09998     host ceph-01
0 0.04999         osd.0   up  0       # <-- reweight has been set to "0"
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [2,4]  # <-- was [0,2] (PG moved to osd.4)
3.5 [4,1]
3.6 [2,1]  # <-- was [2,0] (PG moved to osd.1)
3.7 [2,1]
3.0 [2,1]
3.1 [2,1]  # <-- was [0,2] (PG moved to osd.1)
3.2 [2,1]
3.3 [2,4]
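
Spotting the moved PGs by eye gets tedious quickly. The following Python sketch parses the two-column output shown above (PG id, then the OSD list) and reports which PGs changed. The helper names are made up for illustration, and the sample text is copied from the two dumps in this post.

```python
def parse_pg_lines(text: str) -> dict:
    """Map each PG id to its list of OSD ids; skip non-PG lines and comments."""
    placements = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "[" not in line:
            continue  # e.g. "dumped all in format plain"
        pgid, osds = line.split(None, 1)
        placements[pgid] = [int(x) for x in osds.strip("[] ").split(",")]
    return placements

def moved_pgs(before: dict, after: dict) -> dict:
    """Return {pgid: (old_osds, new_osds)} for every PG whose OSD list changed."""
    return {pg: (before[pg], after[pg]) for pg in before if before[pg] != after.get(pg)}

BEFORE = """dumped all in format plain
3.4 [0,2]
3.5 [4,1]
3.6 [2,0]
3.7 [2,1]
3.0 [2,1]
3.1 [0,2]
3.2 [2,1]
3.3 [2,4]"""

AFTER_OUT = """dumped all in format plain
3.4 [2,4]
3.5 [4,1]
3.6 [2,1]
3.7 [2,1]
3.0 [2,1]
3.1 [2,1]
3.2 [2,1]
3.3 [2,4]"""

for pg, (old, new) in sorted(moved_pgs(parse_pg_lines(BEFORE), parse_pg_lines(AFTER_OUT)).items()):
    print(pg, old, "->", new)
```

Only the three PGs that had a replica on osd.0 show up in the result; the other five are untouched.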

Now I try the CRUSH equivalent of out (ceph osd crush reweight to 0):

$ ceph osd crush reweight osd.0 0
reweighted item id 0 name 'osd.0' to 0 in crush map

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.15    root default
-2    0.04999     host ceph-01            # <-- the weight of the host changed
0 0               osd.0   up  1       # <-- crush weight is set to "0"
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   
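
The weights that `ceph osd tree` prints for a host and for the root are the sums of the CRUSH weights beneath them, which is why only `crush reweight` changes the `host ceph-01` line. A short Python check, using the values from the two tree listings in this post (the dictionary layout and function names are just for illustration):

```python
def host_weights(tree: dict) -> dict:
    """Sum the CRUSH weights of the OSDs under each host bucket."""
    return {host: sum(osds.values()) for host, osds in tree.items()}

def root_weight(tree: dict) -> float:
    """The root bucket weight is the sum of its host bucket weights."""
    return sum(host_weights(tree).values())

# After "ceph osd out 0": CRUSH weights are unchanged.
after_out = {
    "ceph-01": {"osd.0": 0.04999, "osd.4": 0.04999},
    "ceph-02": {"osd.1": 0.04999},
    "ceph-03": {"osd.2": 0.04999},
}
# After "ceph osd crush reweight osd.0 0": osd.0 contributes nothing.
after_crush = {
    "ceph-01": {"osd.0": 0.0, "osd.4": 0.04999},
    "ceph-02": {"osd.1": 0.04999},
    "ceph-03": {"osd.2": 0.04999},
}

for label, tree in (("osd out", after_out), ("crush reweight", after_crush)):
    print(label, host_weights(tree), "root:", round(root_weight(tree), 2))
```

The sums reproduce the 0.09998 and 0.04999 shown for `host ceph-01`, and the root weights round to the 0.2 and 0.15 displayed in the listings.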

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [4,2]  # <-- was [0,2] (PG moved to osd.4)
3.5 [4,1]
3.6 [2,4]  # <-- was [2,0] (PG moved to osd.4)
3.7 [2,1]
3.0 [2,1]
3.1 [4,2]  # <-- was [0,2] (PG moved to osd.4)
3.2 [2,1]
3.3 [2,1]

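This does not seem very logical, because the weight assigned to the bucket “host ceph-01” is still higher than that of the other buckets. It would probably look different with more PGs…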

Test with more PGs

# Add more PGs to my testpool
$ ceph osd pool set testpool pg_num 128
set pool 3 pg_num to 128

# Check the distribution
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=48 pgs
osd.1=78 pgs
osd.2=77 pgs
osd.4=53 pgs
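
The `grep $i | wc -l` loop counts lines that contain the digit, which works here because the OSD ids are 0, 1, 2 and 4, but would miscount on a cluster that also had, say, osd.10 or osd.14. As a sketch of a stricter count, the Python below tallies exact OSD ids from the bracketed lists. The function name is hypothetical, and the sample data is the 8-PG listing from the first test rather than the 128-PG pool.

```python
from collections import Counter

def pgs_per_osd(osd_lists) -> Counter:
    """Count how many PGs reference each OSD id, given strings like "[0,2]"."""
    counts = Counter()
    for entry in osd_lists:
        osd_ids = [int(x) for x in entry.strip("[] ").split(",") if x.strip()]
        counts.update(osd_ids)
    return counts

# OSD lists of the 8 PGs in pool 3, as shown in the first dump.
sample = ["[0,2]", "[4,1]", "[2,0]", "[2,1]", "[2,1]", "[0,2]", "[2,1]", "[2,4]"]

for osd, count in sorted(pgs_per_osd(sample).items()):
    print(f"osd.{osd}={count} pgs")
```

With two replicas per PG the counts always add up to twice the number of PGs, which is a handy sanity check on any of the listings in this post.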

$ ceph osd reweight 0 0
reweighted osd.0 to 0 (802)
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=96 pgs
osd.2=97 pgs
osd.4=63 pgs

The distribution seems fair. So why, in the same situation, is the distribution different on Cuttlefish?

$ ceph osd reweight 0 1
reweighted osd.0 to 0 (802)
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=96 pgs
osd.2=97 pgs
osd.4=63 pgs

$ ceph osd crush reweight osd.0 0
reweighted item id 0 name 'osd.0' to 0 in crush map

$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=87 pgs
osd.2=88 pgs
osd.4=81 pgs

With crush reweight, everything behaves as expected.
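
A quick way to see why this result looks right: once osd.0 has CRUSH weight 0, the three hosts all weigh 0.04999, so a naive proportional split of the placements should be roughly even. The Python sketch below does that arithmetic with the numbers from this post (128 PGs, two OSDs per PG as seen in the dumps). Note the simplifying assumption: a purely proportional share ignores the rule that replicas must land on different hosts, so it is only a rough yardstick.

```python
def proportional_share(weights: dict, total: float) -> dict:
    """Split `total` placements across buckets in proportion to their weight."""
    weight_sum = sum(weights.values())
    return {name: total * w / weight_sum for name, w in weights.items()}

pg_num, replicas = 128, 2

# Host weights after "ceph osd crush reweight osd.0 0".
host_weights = {"ceph-01": 0.04999, "ceph-02": 0.04999, "ceph-03": 0.04999}

# Observed PG counts per host, from the listing above (osd.0 + osd.4, osd.1, osd.2).
observed = {"ceph-01": 0 + 81, "ceph-02": 87, "ceph-03": 88}

expected = proportional_share(host_weights, pg_num * replicas)
for host in host_weights:
    print(f"{host}: expected about {expected[host]:.1f}, observed {observed[host]}")
```

Each host is expected to hold about a third of the 256 placements, and the observed 81, 87 and 88 are all within a few PGs of that.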

Trying with crush legacy

$ ceph osd crush tunables legacy
adjusted tunables profile to legacy
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=87 pgs
osd.2=88 pgs
osd.4=81 pgs

$ ceph osd crush reweight osd.0 0.04999
reweighted item id 0 name 'osd.0' to 0.04999 in crush map

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.2 root default
-2    0.09998     host ceph-01
0 0.04999         osd.0   up  0   
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   

$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=78 pgs
osd.2=77 pgs
osd.4=101 pgs   # <--- with the legacy tunables, all PGs from osd.0 and osd.4 are here (on host ceph-01)

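So this is an evolution of the distribution algorithm: when an OSD is marked out, it now favors a more global distribution (instead of distributing preferentially by proximity). Indeed, the old distribution can cause problems when there are few OSDs per host and they are nearly full.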
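
The difference shows up clearly when the per-OSD counts from this post are subtracted from the baseline. The Python sketch below does that; it assumes the baseline counts (48/78/77/53) are a fair reference for both cases, and the function name is made up for illustration.

```python
def gained(before: dict, after: dict, removed_osd: int) -> dict:
    """PGs gained by each remaining OSD once `removed_osd` holds nothing."""
    return {osd: after[osd] - before[osd] for osd in before if osd != removed_osd}

# PG counts per OSD id, copied from the listings in this post.
baseline   = {0: 48, 1: 78, 2: 77, 4: 53}   # all OSDs in
out_newer  = {0: 0, 1: 96, 2: 97, 4: 63}    # osd.0 reweighted to 0, before switching to legacy
out_legacy = {0: 0, 1: 78, 2: 77, 4: 101}   # osd.0 reweighted to 0, legacy tunables

print("non-legacy tunables:", gained(baseline, out_newer, removed_osd=0))
print("legacy tunables:    ", gained(baseline, out_legacy, removed_osd=0))
```

With the legacy profile, all 48 PGs of osd.0 land on osd.4, its neighbour on host ceph-01. With the profile in use before the switch, osd.4 takes only 10 of them and the other two hosts absorb the remaining 38.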

When some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
Ceph Docs ceph.com/docs/master/rados/…

To view the number of PGs per OSD:

http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd/