Analyse Ceph object directory mapping on disk

shan

This is really useful for understanding benchmark results and the penalty of Ceph's second write (a behaviour explained in section I.1 here).

I. Using an RBD image and locating its objects

Let's start with a simple 40 MB RBD image and collect some basic information about it:

```bash
$ sudo rbd info volumes/2578a6ed-2bab-4f71-910d-d42f18c80d11_disk
rbd image '2578a6ed-2bab-4f71-910d-d42f18c80d11_disk':
        size 40162 kB in 10 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.97ab74b0dc51
        format: 2
        features: layering
```
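
A quick note on this output: order 22 means every RADOS object covers 2^22 bytes, which is exactly where the reported 4096 kB object size comes from:

```bash
$ echo $(( (1 << 22) / 1024 ))   # object size in kB for an order-22 image
4096
```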

Now let's use my rbd-placement script to check where each object lives. Note that all the blocks must be allocated first, since RBD images are thin-provisioned and objects only appear on first write; if they are not, simply map the device and fill it with dd:

```bash
$ sudo ./rbd-placement volumes 2578a6ed-2bab-4f71-910d-d42f18c80d11_disk
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000000' -> pg 28.b52329a6 (28.6) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000009' -> pg 28.7ac71fc6 (28.6) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000002' -> pg 28.f9256dc8 (28.8) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000005' -> pg 28.141bf9ca (28.a) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000003' -> pg 28.58c5376b (28.b) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000008' -> pg 28.a310d3d0 (28.10) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000001' -> pg 28.88755b97 (28.17) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000004' -> pg 28.e52ce538 (28.18) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000006' -> pg 28.80a6755a (28.1a) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000007' -> pg 28.9c45d2fa (28.1a) -> up ([0,1], p0) acting ([0,1], p0)
```
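
The rbd-placement script itself is not reproduced in this post. As a rough sketch of what it plausibly does, one can combine rbd info and ceph osd map (both shown in this article); the awk parsing below is my assumption, not the original script:

```bash
#!/bin/bash
# Hypothetical sketch of rbd-placement: print where every object of an
# RBD image maps. Usage: ./rbd-placement <pool> <image>
pool=$1; image=$2
info=$(rbd info "$pool/$image")
# "block_name_prefix: rbd_data.97ab74b0dc51" -> rbd_data.97ab74b0dc51
prefix=$(awk '/block_name_prefix/ {print $2}' <<< "$info")
# "size 40162 kB in 10 objects" -> 10
objects=$(awk '/size/ {print $5}' <<< "$info")
for ((i = 0; i < objects; i++)); do
    # RBD data objects are named "<prefix>.<16-digit hex index>"
    ceph osd map "$pool" "$(printf '%s.%016x' "$prefix" "$i")"
done
```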

This image is stored on OSD 0 and OSD 1. From this output I picked up all the PG IDs and the rbd prefix, which lets us mirror the placement in the directory hierarchy using the tree command. (In the filenames below, FileStore escapes the underscore of rbd_data as \u, the hex part after __head_ is the PG hash from the listing above, and the trailing 1c is pool id 28 in hexadecimal.)

```bash
$ sudo tree -Ph '97ab74b0dc51' /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head/
/var/lib/ceph/osd/ceph-0/current/28.6_head/
├── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c
└── [3.2M]  rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c
/var/lib/ceph/osd/ceph-0/current/28.8_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c
/var/lib/ceph/osd/ceph-0/current/28.a_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c
/var/lib/ceph/osd/ceph-0/current/28.b_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c
/var/lib/ceph/osd/ceph-0/current/28.10_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c
/var/lib/ceph/osd/ceph-0/current/28.17_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c
/var/lib/ceph/osd/ceph-0/current/28.18_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c
/var/lib/ceph/osd/ceph-0/current/28.1a_head/
├── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c

0 directories, 10 files
```

II. Analysing the disk geometry

For the sake of simplicity, I used a virtual hard disk drive attached to my virtual machine. The disk is 10 GB.

```bash
root@ceph:~# fdisk -l /dev/sdb1

Disk /dev/sdb1: 10.5 GB, 10484711424 bytes
255 heads, 63 sectors/track, 1274 cylinders, total 20477952 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb1 doesn't contain a valid partition table
```

So we have a total of 20477952 sectors (512-byte blocks): 20477952 × 512 = 10,484,711,424 bytes, i.e. roughly 10 GB (9.76 GiB).
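
As a quick cross-check (this command is not part of the original walkthrough, but blockdev ships with util-linux), the sector count can be read straight from the kernel:

```bash
$ sudo blockdev --getsz /dev/sdb1   # size in 512-byte sectors
20477952
$ echo $(( 20477952 * 512 ))        # total size in bytes
10484711424
```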

III. Printing the block map of each object

From this point on, I will assume that the filesystem backing your OSD data is XFS; the xfs_bmap tool used below does not work on anything else.
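
If your OSDs do not run on XFS, a similar extent listing can usually be obtained with filefrag from e2fsprogs, which relies on the generic FIEMAP ioctl. This is an alternative I am suggesting, not something used in the original experiment:

```bash
# Print the extents of every object file of our image (FIEMAP-based,
# works on most modern filesystems, not only XFS)
$ sudo filefrag -v /var/lib/ceph/osd/ceph-0/current/28.6_head/*97ab74b0dc51*
```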

```bash
$ for i in $(sudo find /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head -name '*97ab74b0dc51*'); do sudo xfs_bmap -v $i; done
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..2943]:       1992544..1995487      0 (1992544..1995487)    2944
   1: [2944..8191]:    1987296..1992543      0 (1987296..1992543)    5248
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..255]:        1987040..1987295      0 (1987040..1987295)     256
   1: [256..1279]:     1986016..1987039      0 (1986016..1987039)    1024
   2: [1280..6599]:    1978848..1984167      0 (1978848..1984167)    5320
/var/lib/ceph/osd/ceph-0/current/28.8_head/rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       19057336..19065527    3 (3698872..3707063)    8192
/var/lib/ceph/osd/ceph-0/current/28.a_head/rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13909496..13917687    2 (3670520..3678711)    8192
/var/lib/ceph/osd/ceph-0/current/28.b_head/rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..639]:        7303544..7304183      1 (2184056..2184695)     640
   1: [640..8191]:     10090000..10097551    1 (4970512..4978063)    7552
/var/lib/ceph/osd/ceph-0/current/28.10_head/rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..639]:        12289352..12289991    2 (2050376..2051015)     640
   1: [640..8191]:     13934072..13941623    2 (3695096..3702647)    7552
/var/lib/ceph/osd/ceph-0/current/28.17_head/rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       19049144..19057335    3 (3690680..3698871)    8192
/var/lib/ceph/osd/ceph-0/current/28.18_head/rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13901304..13909495    2 (3662328..3670519)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..6911]:       13917688..13924599    2 (3678712..3685623)    6912
   1: [6912..8191]:    13932792..13934071    2 (3693816..3695095)    1280
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13924600..13932791    2 (3685624..3693815)    8192
```

My filesystem appears to be somewhat fragmented, since several files map to more than one extent. So before going any further, I am going to defragment some of the files. Here is the process for a single file:

```bash
$ sudo xfs_bmap -v '/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c'
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..255]:        1987040..1987295      0 (1987040..1987295)     256
   1: [256..1279]:     1986016..1987039      0 (1986016..1987039)    1024
   2: [1280..6599]:    1978848..1984167      0 (1978848..1984167)    5320

$ sudo xfs_fsr '/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c'

$ sudo xfs_bmap -v '/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c'
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..6599]:       1860632..1867231      0 (1860632..1867231)    6600
```
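
Instead of checking files one by one, the overall fragmentation of the OSD filesystem can be estimated with xfs_db. This extra step is my suggestion and was not part of the original session:

```bash
# Read-only fragmentation report for the whole filesystem;
# prints something like "actual N, ideal M, fragmentation factor X%"
$ sudo xfs_db -r -c frag /dev/sdb1
```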

After the operation, every object now maps to a single contiguous extent:

```bash
$ for i in $(sudo find /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head -name '*97ab74b0dc51*'); do sudo xfs_bmap -v $i; done
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       1852440..1860631      0 (1852440..1860631)    8192
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..6599]:       1860632..1867231      0 (1860632..1867231)    6600
/var/lib/ceph/osd/ceph-0/current/28.8_head/rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       19057336..19065527    3 (3698872..3707063)    8192
/var/lib/ceph/osd/ceph-0/current/28.a_head/rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13909496..13917687    2 (3670520..3678711)    8192
/var/lib/ceph/osd/ceph-0/current/28.b_head/rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13932792..13940983    2 (3693816..3702007)    8192
/var/lib/ceph/osd/ceph-0/current/28.10_head/rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       14201728..14209919    2 (3962752..3970943)    8192
/var/lib/ceph/osd/ceph-0/current/28.17_head/rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       19049144..19057335    3 (3690680..3698871)    8192
/var/lib/ceph/osd/ceph-0/current/28.18_head/rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13901304..13909495    2 (3662328..3670519)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       14209920..14218111    2 (3970944..3979135)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..8191]:       13924600..13932791    2 (3685624..3693815)    8192
```

IV. Understanding the object mapping

As stated earlier, the disk has a total of 20477952 512-byte blocks, and the objects now have the following mappings:

  • 1852440..1860631, a range of 8192 512-byte blocks: (8192 × 512)/1024/1024 = 4M (the other 8192-block ranges below are the same size; see the quick check after this list)
  • 1860632..1867231
  • 19057336..19065527
  • 13909496..13917687
  • 13932792..13940983
  • 14201728..14209919
  • 19049144..19057335
  • 13901304..13909495
  • 14209920..14218111
  • 13924600..13932791
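
The one exception is the second range, which spans only 6600 blocks and matches the 3.2M file we saw in the tree output. A quick check with bc:

```bash
$ echo "scale=2; 6600 * 512 / 1024 / 1024" | bc
3.22
```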

The average (midpoint) block position of each range, sorted, is:

  • 1856535
  • 1863931
  • 13905399
  • 13913591
  • 13928695
  • 13936887
  • 14205823
  • 14214015
  • 19053239
  • 19061431

We can now compute the sample standard deviation of these positions: ≈ 6020753, i.e. consecutive objects of the same image can sit very far apart on the device.
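
For reference, here is a small awk pipeline that reproduces both the midpoints and the standard deviation from the extent ranges above (the ranges are hard-coded from the previous listing; the exact .5 midpoints shift every value equally, so the deviation is unaffected):

```bash
$ printf '%s\n' \
    1852440..1860631 1860632..1867231 19057336..19065527 \
    13909496..13917687 13932792..13940983 14201728..14209919 \
    19049144..19057335 13901304..13909495 14209920..14218111 \
    13924600..13932791 |
  awk -F'\\.\\.' '
    { mid = ($1 + $2) / 2               # midpoint of the extent
      printf "midpoint: %.1f\n", mid
      s += mid; q += mid * mid; n++ }
    END { m = s / n                     # sample standard deviation
      printf "sample stddev: %.2f\n", sqrt((q - n * m * m) / (n - 1)) }'
```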

The purpose of this article was to demonstrate and explain the penalty of the second write in Ceph. This second write is performed by syncfs, which flushes all the objects into their respective PG directories. Knowing the PG placement of the objects, and the physical mapping of each object on the block device through the filesystem, can help a lot while debugging performance issues. Unfortunately, the problem is hard to tackle because of concurrent client writes and the distributed nature of Ceph. Obviously, everything written here remains theoretical (although most likely true :p), since determining the real placement of the data on the platters is difficult. One last word about the block placement reported by XFS: it gives us values, but we do not know how these extent mappings are actually laid out on the device.