Ceph Performance Part 2: Write Throughput Without SSD Journals
INTRODUCTION
Hello again!
If you are new around these parts, you may want to start out by reading the first article in this series, available here.
For the rest of you, I am sure you are aware by now of the epic battle that Mark Shuttleworth and I are waging over who can generate more page hits on the ceph.com website. I’ve made a totally original and in no way inaccurate illustration to document the saga for future generations:

After writing the first article I realized that my 15 minutes of Internet fame had started, and that I had better milk it for all it’s worth before people start moving back to Lolcats and Slashdot. Unfortunately, I used half of my budgeted time producing the trompe l’oeil shown above, so you’ll have to forgive me for recycling the format of the last article for this one. In fact, you should probably just consider this to be a continuation of the previous article rather than a new one. If that excites you, please continue reading. This time we are going to look at how each of the disk controllers used in the last set of tests performs with 8 spinning disks, where the data and journal partitions are both stored on the same devices.
HARDWARE AND SOFTWARE SETUP
Here’s a recap of the system setup we’ll be testing:
The SAS/RAID Controllers we will be testing.
- High Point Rocket 2720SGL (center top): This is an entry-level SAS controller with a Marvell 9485 RAID chipset. This particular model has a JBOD-mode-only firmware and can be had for a measly $150 on the net.
- Areca ARC-1880 (middle left): This is Areca’s previous-generation high end RAID controller. Still considered to be quite fast. Supports disks in JBOD mode, a pass-through mode, and RAID0-6 setups.
- Areca ARC-1222 (middle right): This is a much older Areca RAID controller and is only really being included in this comparison because we happened to have a spare one lying around. Supports disks in JBOD mode, a pass-through mode, and RAID0-6 setups.
- LSI SAS 9207-8i (bottom left): This is LSI’s newest budget controller using the SAS2308 chipset. Interestingly, they ship it with the IT/JBOD firmware, which does not support any RAID configurations at all. The card can be had for about $240.
- LSI SAS 9211-8i (bottom right): Ah, the 9211-8i. It uses the venerable SAS2008 chipset, widely known and used in ZFS deployments all over the world. It’s basically a SAS controller that supports JBOD mode and very basic RAID0/1 functionality. There appears to be little or no write-through or write-back cache. It can be had for around $225.
- LSI SAS 2208 (not shown): It just so happens that the Supermicro motherboard we purchased has LSI’s current higher-end SAS 2208 chipset on it with 1GB of cache and full JBOD and RAID0-6 mode support. LSI’s equivalent standalone card is the SAS 9265-8i, which retails for around $630.
Other hardware being used in this setup includes:
- Chassis: Supermicro 4U 36-drive SC847A
- Motherboard: Supermicro X9DRH-7F
- CPUS: 2 X Intel XEON E5-2630L (2.0GHz, 6-core)
- RAM: 8 X 4GB Supermicro ECC Registered DDR1333 (32GB total)
- Disks: 8 X 7200RPM Seagate Constellation ES 1TB Enterprise SATA
- NIC: Intel X520-DA2 10GBE
As far as software goes, these tests will use:
- OS: Ubuntu 12.04
- Ceph: 0.50
- Kernel: 3.4 from source
- Tools: blktrace, collectl, perf
In response to the previous article, a reader asked whether hardware crc32c instruction support was enabled. Ceph itself does not currently make use of hardware crc32c (it uses a C-based slice-by-8 implementation), but apparently BTRFS can. A quick look at /proc/crypto shows:
name         : crc32c
driver       : crc32c-intel
module       : crc32c_intel
priority     : 200
refcnt       : 2
selftest     : passed
type         : shash
blocksize    : 1
digestsize   : 4
Theoretically BTRFS should be using hardware crc32c instructions both for the results in this article and for the results in the previous one.
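If you want to check this on your own hardware, something along these lines should do it (a minimal sketch; it assumes a distro where the crc32c_intel kernel module is available):

    # Does the CPU advertise the SSE4.2 crc32 instruction?
    grep -m1 -o sse4_2 /proc/cpuinfo

    # Is the hardware-accelerated driver registered with the kernel crypto layer?
    grep -B1 -A7 'crc32c-intel' /proc/crypto

    # Load it explicitly if it has not been picked up already (it may be built in)
    sudo modprobe crc32c_intel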
Test Setup
In this article the focus is specifically on the raw controller/disk throughput that can be obtained, so these tests are being run directly on the SC847A using localhost TCP socket connections. Unlike in the first article, the controllers’ ability to help the drives deal with conflicting writes to the journal and data partitions will be of paramount importance. Each controller being tested supports a variety of operational modes, so to keep things reasonable we are testing 3 configurations that all use the same number of data and journal disks:
- JBOD Mode (Supported by all controllers in these tests. Acts like a standard SAS controller. Journals are on separate 10G partitions on each drive.)
- Pass-through/8xRAID0 mode (Supported on the Areca controllers and can be simulated on the SAS2208 by creating a single drive RAID0 for each OSD. Uses on-board write-back cache. Journals are on separate 10G partitions on each drive.)
- RAID0 Mode (A single OSD on an 8-disk RAID0 array. A single 80G journal is placed on a separate partition on the RAID0 array. Write-back cache is enabled on the controllers that support it.)
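For the JBOD and pass-through configurations above, each drive carries its own 10G journal partition followed by a data partition. As a rough sketch of what that per-drive layout might look like (the device name and exact commands are illustrative placeholders, not the actual provisioning steps used for these tests):

    # Lay out one of the 8 spinning disks: a 10G journal partition, then a data partition.
    DISK=/dev/sdb   # placeholder device name
    parted -s ${DISK} mklabel gpt
    parted -s ${DISK} mkpart journal 1MiB 10GiB
    parted -s ${DISK} mkpart data 10GiB 100%
    # ${DISK}1 then serves as the OSD journal and ${DISK}2 as the OSD data partition.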
Based on the tests performed in the previous article and some on-line rumors, it appears that Areca’s JBOD mode is indeed using on-board cache and should perform similarly to the pass-through mode. This may give it an advantage over JBOD modes on other controllers.
To generate results, we are using Ceph’s built-in benchmarking command, “RADOS bench”, which writes out a new object for every chunk of data that is to be written. RADOS bench has certain benefits and drawbacks. On one hand it gives you a very clear picture of how fast OSDs can write out new objects at various sizes. What it does not test is how quickly small writes to existing objects are performed. This is relevant because it is exactly what happens when doing small random writes to an RBD volume. Recently Inktank’s own Sam Just wrote another pair of benchmarks, smalliobench and smalliobenchfs, that simulate these kinds of IO and introduced them into the Ceph master git branch. In future articles we’ll start to look at those tools, but for now we’ll again be using RADOS bench. As before, we are running 8 concurrent instances of the benchmark and aggregating the results to ensure that the benchmark itself is not a bottleneck.
RADOS bench gives you some flexibility regarding how big objects should be, how many to keep in flight concurrently, and how long the test should run. We’ve settled on 5-minute tests using the following permutations:
- 4KB Objects, 16 Concurrent Operations (2 per rados bench instance)
- 4KB Objects, 256 Concurrent Operations (32 per rados bench instance)
- 128KB Objects, 16 Concurrent Operations (2 per rados bench instance)
- 128KB Objects, 256 Concurrent Operations (32 per rados bench instance)
- 4MB Objects, 16 Concurrent Operations (2 per rados bench instance)
- 4MB Objects, 256 Concurrent Operations (32 per rados bench instance)
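For those who want to reproduce something similar, the invocations look roughly like the following. This is only a sketch: the pool names are placeholders, the pools are assumed to already exist, and exact flags may differ slightly in the rados tool shipped with Ceph 0.50.

    # 8 parallel rados bench instances, 4KB objects, 2 ops in flight each (16 total),
    # running for 300 seconds; results are aggregated from the per-instance logs.
    for i in $(seq 0 7); do
        rados -p bench${i} bench 300 write -b 4096 -t 2 > rados-bench-${i}.log &
    done
    wait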
For each permutation, we run the same test using BTRFS, XFS, and EXT4 for the underlying OSD filesystems. Filesystems are reformatted and mkcephfs is re-run between every test to ensure that fragmentation from previous tests does not affect the outcome. We left most Ceph tunables in their default state for these tests, except for “filestore xattr use omap = true”, which is needed for EXT4 to work properly. We did pass certain mkfs and mount options to the underlying filesystems where it made sense:
- mkfs.btrfs options: -l 16k -n 16k
- btrfs mount options: -o noatime
- mkfs.xfs options: -f -i size=2048 (-d su=64k,sw=8 for RAID0 tests)
- xfs mount options: -o noatime
- mkfs.ext4 options: (-b 4096 -E stride=16,stripe-width=128 for RAID0 tests)
- ext4 mount options: -o noatime,user_xattr
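For reference, settings like these are typically expressed in ceph.conf so that mkcephfs applies them when it builds the cluster. The snippet below is a sketch for the XFS case rather than the exact file used here; config keys of this form existed in Ceph of this vintage, but check the documentation for your version.

    [osd]
        filestore xattr use omap = true
        osd mkfs type = xfs
        osd mkfs options xfs = -f -i size=2048
        osd mount options xfs = noatime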
During the tests, collectl was used to record various system performance statistics, and perf was used to gather profiling data on the running processes. blktrace was also run against every OSD data disk so that we could potentially go back and examine seek behavior on the underlying block devices.
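For anyone curious, the data collection amounts to running the monitoring tools alongside each benchmark, roughly like this (a sketch only; the exact flags used for these runs are not spelled out in the article, and /dev/sdb is a placeholder for an OSD data disk):

    # Per-CPU, per-disk, memory, and network stats at 1-second intervals, recorded under /tmp
    collectl -sCDmn -i 1 -f /tmp &

    # System-wide CPU profiling with call graphs for the duration of the run
    perf record -a -g -o /tmp/perf.data &

    # Block-level trace of one OSD data disk for later seek analysis
    blktrace -d /dev/sdb -o sdb-trace -D /tmp &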
4KB RADOS BENCH RESULTS
Before we get started, you may want to open the Ceph Performance Part 1 article in another window and scroll down to the first set of tests, as I’ll be comparing the numbers in this article to those found in the previous one. The first thing you might notice here is that some of the controllers have very different performance characteristics than they did when SSDs were used for journals. The RAID controllers that have write-back cache are now leading the pack when used in 8-OSD modes. It looks like the cache may be helping reorder writes to mitigate the extra seeks caused by the journal being stored on the same disks. As in the last article, using a single OSD on a big RAID0 array is pretty slow, though surprisingly it is not the slowest configuration. In the previous article, when SSDs were used for the journals, the cheap SAS controllers in JBOD mode were among the fastest controllers tested (specifically the SAS2308, 2720SGL, and SAS2008). Without SSDs for the journals they are now amongst the slowest. What happened? I suspect the lack of on-board cache is really hurting them: writes are likely taking longer to complete, and with few concurrent operations per OSD there just isn’t enough concurrency to hide it. In the previous article we never showed RADOS bench operation latencies for each of the tests, but we did collect the information, and we’ve done so again now. Let’s compare the results and see if the theory holds up:
Click on any of the images below (and click again) to enlarge them…
16 Concurrent 4K Write Op Latency (BTRFS)
16 Concurrent 4K Write Op Latency (XFS)
16 Concurrent 4K Write Op Latency (EXT4)
Indeed, it’s pretty clear that when there are few concurrent OPs, it really helps to have either a controller with on-board cache or SSD journals. In a bit we’ll see whether this trend holds true with more concurrent OPs. First though, let’s look at the CPU utilization and average disk queue wait times.
Click on any of the images below (and click again) to enlarge them…
16 Concurrent 4K Writes - CPU Utilization
16 Concurrent 4K Writes - Disk Waits
If you’ve read the previous article, there should be no real surprises as far as CPU utilization goes. BTRFS continues to use significantly more CPU resources than the other filesystems, even when it is barely edging out EXT4. As far as wait times go, BTRFS seems to cause extremely high device queue wait times on the Areca controllers despite performing similarly to the SAS2208.
How do things change with 256 Concurrent OPs?
With 256 concurrent 4K writes, the RAID controllers with BBU cache are still leading the pack in 8-OSD modes, but the cheaper SAS controllers have caught up considerably. BTRFS performance is now equal to, or maybe even slightly faster than, that on the more expensive controllers. XFS and EXT4 performance has improved as well, but still lags behind the performance of those filesystems on the controllers with BBU cache. How do the latencies look?
Click on any of the images below (and click again) to enlarge them…
256 Concurrent 4K Write Latency (BTRFS)
256 Concurrent 4K Write Latency (XFS)
256 Concurrent 4K Write Latency (EXT4)
With many concurrent operations in flight, latencies have increased across the board, but not at the same rate. The latencies for the cheaper SAS controllers are now more in line with those of the higher-end RAID controllers. One thing you may notice in this set of tests is that with enough concurrent operations, there is basically no sustained IOP advantage to having journals on SSDs, nor is there any advantage to having 2 additional OSDs (though the journals are on the same disks). It’s not entirely clear what this indicates, but it may be that software limitations are in play here. Indeed, several improvements have recently been made to the OSD threading and locking code in the Ceph master branch that may increase performance of small writes in some cases.
Click on any of the images below (and click again) to enlarge them…
256 Concurrent 4K Writes - CPU Utilization
256 Concurrent 4K Writes - Disk Waits
Again we are seeing BTRFS use relatively more CPU in the higher-performing configurations. Queue wait times are again high for BTRFS on the Areca cards, which we also saw in this test when SSDs were used. The JBOD cards are showing somewhat high queue wait times for XFS as well.
128KB RADOS BENCH RESULTS
The results with a small number of concurrent 128K writes look similar to the 4K results. BTRFS is again the top performer in most cases. Controllers with write-back cache do well, while controllers without it do poorly. The Areca cards do particularly well, even the aging ARC-1222. In fact, the Areca cards perform better in this test with 8 spinning disks serving both data and journals than they did with 6 spinning disks and 2 SSDs. The processors on these controllers must be doing a very good job of ensuring that journal writes have little effect on the 128K writes to the data portion of the disks. While the SAS2208 is the second-fastest controller in this test, it is quite a bit slower than the ARC-1880 and slower than the ARC-1222 when using BTRFS.
Click on any of the images below (and click again) to enlarge them…
16 Concurrent 128K Writes - CPU Utilization
16 Concurrent 128K Writes - Disk Waits
One thing to note here is that the CPU utilization results for BTRFS, while still high, appear to be quite a bit lower per unit of throughput than in the same test from the previous article. Part of this may simply be because overall throughput is slower. However, if you compare the fastest results from this test to the results from the previous article, it is clear that for the same level of performance some controllers burn more CPU than others. The high queue wait times that BTRFS caused on the Areca controllers have disappeared with 128K writes, and XFS now appears to cause higher queue wait times than the other filesystems. The SAS2008 in RAID0 mode is especially prone to high queue waits despite being slower, similar to what we saw in the previous article.
With 256 concurrent 128K writes, the SAS2208 with 8 single-drive RAID0 arrays improves dramatically and now leads the pack, just barely beating the ARC-1880. The ARC-1880 does noticeably better with EXT4, slightly worse with BTRFS, and noticeably worse with XFS. The cache-less SAS controllers improve again and are now roughly on par with the ARC-1222 in 8-OSD mode. BTRFS performance is relatively high on all of these controllers, while EXT4 and XFS lag behind. Single-OSD RAID0 mode on the controllers that support it is again quite slow.
Click on any of the images below (and click again) to enlarge them…
256 Concurrent 128K Writes - CPU Utilization
256 Concurrent 128K Writes - Disk Waits
CPU utilization for BTRFS is still high across the board, but not as high as it was when SSDs were used for the journals. Disk wait times are again high with XFS, just as they were with SSD journals.
4MB RADOS BENCH RESULTS
With 16 concurrent 4MB writes, something very strange happens. The high-end SAS2208 and ARC-1880 controllers are again at the top of the charts, which is not surprising. What is surprising is that the RAID0 configurations on these controllers do well, but only with EXT4! In fact, with EXT4 the SAS2208 configured as an 8-drive RAID0 array is the fastest controller in this test, just barely edging out the same controller configured as single-disk RAID0 arrays with BTRFS. The same is true of the ARC-1880, though performance is about 15% lower in both cases. Even so, we are now starting to see some of the limitations of putting journals on the same disks as the data. Despite having 2 extra spinning disks, the fastest configuration in this test only manages about 450MB/s, while the cheap SAS controllers could approach 700MB/s when SSDs were used for the journals. Speaking of the SAS controllers, they are still slow without SSD-backed journals, but at least they do not look as bad as they did in the smaller write tests. The ARC-1222 is finally showing its age and, despite having write-back cache, only barely keeps up with the SAS controllers. Finally, the SAS2008 in RAID0 mode cannot keep up due to its lack of cache and slower processor, falling well behind everything else.
Click on any of the images below (and click again) to enlarge them…
16 Concurrent 4M Writes - CPU Utilization
16 Concurrent 4M Writes - Disk Waits
CPU utilization with BTRFS again looks high in these tests, but if you look at the scale you will see that it only climbs to about 28%. In previous tests CPU utilization with BTRFS approached 80%, but performance was also much higher. Disk queue wait times look roughly the same everywhere except on the Areca cards: queue wait times are slightly higher on the ARC-1880 and much higher on the ARC-1222 (especially with XFS).
With more concurrent operations we again see the SAS controllers improve, but not enough to overtake the SAS2208 and ARC-1880. Once journal writes are accounted for, every controller except the ARC-1222 can now push over 110MB/s per drive to the BTRFS data partitions. The SAS2208 and ARC-1880 can push roughly 120-130MB/s per drive and are approaching the throughput limits of the drives. Unfortunately, XFS and EXT4 performance is generally much lower, often peaking at 80-100MB/s per drive.
Click on any of the images below (and click again) to enlarge them…
256 Concurrent 4M Writes - CPU Utilization
256 Concurrent 4M Writes - Disk Waits
The CPU utilization numbers and queue wait times are similar to what we saw with 16 concurrent 4M writes, just at a larger scale. In particular, queue wait times on the Areca controllers are much higher in the 8-OSD modes and when using XFS.
CONCLUSIONS
Well, another article down. Now I can get back to playing World of War^H^H^H^H working diligently on important matters. (Really, it’s after hours!) Oh wait, I’m supposed to say something here. OK, first off: BTRFS once again (at least on the surface) appears to perform better than XFS. EXT4 is a mixed bag, sometimes slightly better than BTRFS and sometimes nearly as bad as XFS. In the future we would like to compare these filesystems in more depth and look at how performance changes over time (hint: BTRFS small-write performance tends to degrade quickly). For now, if you want the highest performance on a freshly created filesystem and do not care about CPU utilization, BTRFS is the way to go.
Beyond that, we saw that controllers with write-back cache really do make a big difference when you put journals on the same drives as the data. This is most noticeable when there are few IOs in flight, but it still seems to help even with many IOs in flight. That is the exact opposite of what we saw when the journals were on SSDs; in that case the cheap SAS controllers were often the highest-performing cards regardless of how many IOs were in flight.
Interestingly, switching the journals from dedicated SSDs to shared spinning disks does not appear to hurt IOP throughput much, but it does significantly reduce throughput when writing out large 4MB objects. Tuning various Ceph parameters may help here, but it may also be that there are areas where Ceph could improve small-IO throughput.
As for whether to buy expensive controllers and use every drive bay for spinning disks, or buy cheap controllers and dedicate some bays to SSDs for journals, there are trade-offs. Built properly, a system with SSD journals can deliver higher throughput for large writes, while putting journals on the data disks allows for greater storage density. Without any tuning, both solutions currently deliver similar IOP throughput.
FUTURE WORK
I really want to do an article on scaling and tuning, but before tackling that I think we should take a quick look at Argonaut versus the shiny new Bobtail release that is coming up. There are a lot of performance improvements, along with the new smalliobench benchmarking tool, that will be well worth digging into. Beyond that, we still have everything listed in the last article (minus this set of tests!) to investigate, plus some other investigations I have planned. Lots to do! I hope this has been valuable for everyone. As always, if you have any questions or suggestions, please feel free to email me or leave a comment below.
By the way, if you are going to SC12 this year, stop by the Ceph booth! We are at booth #3361, right next to the whisper suites. I will be at the opening gala and will be wandering around during the conference as well.
Thanks for reading!