Ceph Performance Part 1: Disk Controller Write Throughput

Mark Nelson

Introduction

Here at Inktank our developers have been toiling away at their desks, profiling and optimizing Ceph to make it one of the fastest distributed storage solutions on the planet.  One question we often get asked is how to build an optimally performing Ceph cluster.  This isn’t always an easy question to answer because it depends on many factors including funding, capacity requirements, density requirements, and existing infrastructure.  There are, however, some basic investigations that can be done to start getting an idea of which components in a Ceph storage node matter.

The wise and benevolent management at Inktank (Hi Guys!) agreed to allow me to go on a shopping spree with the corporate credit card to answer these questions.  Without further encouragement, I went to one of our hardware vendors and immediately put in an order for a 36-drive Supermicro SC847A chassis along with 36 SATA drives, 9 Intel 520 SSDs, a variety of controllers, and all of the other random bits needed to actually make this thing work.

For those of you that like shiny things, here are some pictures to feast on:

Front view of the SC847a

The SC847a in all of its glory.

Internal View of the SC847a

Preview of things to come...

Inktank Performance Lab

The Inktank "Performance Lab"

Like all too-good-to-be-true deals, there is a terrible price.  Somehow a wire got crossed and it turns out that there are certain expectations that come with this gear.  Like running tests.  Hundreds of them.  Maybe thousands!  And publishing too!  It’s almost indentured servitude.  Just look at the “Performance Lab” they make me work in (Yes, that is my basement and yes, I made that sign).  My hope is that you, the reader, will benefit from my easily exploited addiction to new toys and read through the articles I am writing that explore Ceph performance on this hardware in alarming detail.  Not only will you likely learn something interesting, but it will help me convince the management that this wasn’t a completely hare-brained idea.

Now I know what you are thinking.  You want to see that chassis spinning up with all of the drives and SSDs going full blast on all 5 controllers simultaneously.  You want my power bill to skyrocket and watch the meter spin around like in “Christmas Vacation” when all of the lights got plugged in.  You’ll get that later, I promise (well, not the spinning meter part).  For now we’ll explore something that the brass have been hounding me to produce for months: a report detailing how a wide variety of SAS/RAID controller setups handle different Ceph workloads on various OSD back-end filesystems.  This first article uses SSDs for the journals, as the performance results tend to be more straightforward than when the journals are placed on the same disks as the data.  Later on, we’ll explore how these controllers perform with journals on the same disks.

So without further ado, and to prove I’m not fleecing the company by selling all of these parts on Ebay for booze money, let’s get started.

Hardware and Software Setup

Let me begin by saying the SC847A is a real beast.  Even at idle with low-voltage CPUs, its seven 80x38mm fans spin at 5K RPM.  When it’s under heavy load they spin up to 8K RPM, and you can easily hear the thing on the main floor of the house.  I imagine I’ve probably already lost a decibel or two of hearing just sitting next to it for a couple of weeks, and I’m used to working in data centers.  Make sure you know what you are getting into if you decide to stick one of these things under your desk.

Let’s take a look at some of the controllers we’ll be testing today:

The SAS/RAID controllers we will be testing.

  • High Point Rocket 2720SGL (center top): This is an entry-level SAS controller with a Marvell 9485 RAID chipset.  This particular model ships with JBOD-only firmware and can be had for a measly $150 on the net.
  • Areca ARC-1880 (middle left): This is Areca’s previous-generation high end RAID controller.  Still considered to be quite fast.  Supports disks in JBOD mode, a pass-through mode, and RAID0-6 setups.
  • Areca ARC-1222 (middle right): This is a much older Areca RAID controller and is only really included in this comparison because we happened to have a spare one lying around.  Supports disks in JBOD mode, a pass-through mode, and RAID0-6 setups.
  • LSI SAS 9207-8i (bottom left): This is LSI’s newest budget controller using the SAS2308 chipset.  Interestingly, they ship it with the IT/JBOD firmware, which does not support any RAID configurations at all.  The card can be had for about $240.
  • LSI SAS 9211-8i (bottom right): Ah, the 9211-8i.  It uses the venerable SAS2008 chipset, widely known and used in ZFS deployments all over the world.  It’s basically a SAS controller that supports JBOD mode and very basic RAID0/1 functionality.  There appears to be little or no write-through or write-back cache.  Can be had for around $225.
  • LSI SAS 2208 (not shown): It just so happens that the Supermicro motherboard we purchased has LSI’s current higher-end SAS 2208 chipset on it with 1GB of cache and full JBOD and RAID0-6 mode support.  LSI’s equivalent standalone card is the SAS 9265-8i, which retails for around $630.

Other hardware being used in this setup includes:

  • Chassis: Supermicro 4U 36-drive SC847A
  • Motherboard: Supermicro X9DRH-7F
  • CPUs: 2 x Intel Xeon E5-2630L (2.0GHz, 6-core)
  • RAM: 8 x 4GB Supermicro ECC Registered DDR1333 (32GB total)
  • Disks: 6 x 7200RPM Seagate Constellation ES 1TB Enterprise SATA
  • SSDs: 2 x Intel 520 180GB
  • NIC: Intel X520-DA2 10GbE

As far as software goes, these tests will use:

  • OS: Ubuntu 12.04
  • Ceph: 0.50
  • Kernel: 3.4 from source
  • Tools: blktrace, collectl, perf

Test Setup

In this article the focus is specifically on the raw controller/disk throughput that can be obtained, so these tests are being run directly on the SC847A using localhost TCP socket connections.  Each controller being tested supports a variety of operational modes.  To keep things reasonable, we are testing 3 configurations that all use the same number of data and journal disks (a ceph.conf sketch of the journal placement follows the list):

  • JBOD mode (supported by all controllers in these tests; acts like a standard SAS controller.  3 10GB journals per SSD).
  • Pass-through/6xRAID0 mode (supported on the Areca controllers; can be simulated on the SAS2208 by creating a single-drive RAID0 for each OSD.  Uses on-board write-back cache.  3 10GB journals per SSD).
  • RAID0 mode (a single OSD on a 6-disk RAID0 array.  A single 60GB journal is on a RAID0 array made up of the two SSDs).  Write-back cache is enabled on the controllers that support it.
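
Concretely, the journal placement in the first two modes boils down to a few lines of ceph.conf per OSD.  Here is a minimal sketch, with hypothetical device and path names (/srv/osd.0 and /dev/sdg1 stand in for the real data mount point and one of the 10GB SSD journal partitions):

    [osd]
        filestore xattr use omap = true   # needed for the EXT4 runs (see below)

    [osd.0]
        osd data = /srv/osd.0
        osd journal = /dev/sdg1           # one of three 10GB partitions on an SSD
        osd journal size = 10000          # journal size in MB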

There is some controversy surrounding Areca’s JBOD mode, as it may actually be using on-board cache and acting more or less the same as their pass-through mode.  In the tests we will see that the two modes tend to perform very similarly on Areca hardware, which may support this belief.

To generate results, we are using Ceph’s built-in benchmarking command, “rados bench”, which writes new objects for every chunk of data that is to be written out.  RADOS bench has certain benefits and drawbacks.  On one hand, it gives you a very clear picture of how fast OSDs can write out new objects at various sizes.  What it does not test is how quickly small writes to existing objects are performed, which is relevant because that is exactly what happens when doing small random writes to an RBD volume.  In the past we have also noticed that RADOS bench itself can become a bottleneck at high throughput levels, so we are running 8 concurrent instances of the benchmark and aggregating the results to get around this issue.

RADOS bench gives you some flexibility regarding how big objects should be, how many to concurrently keep in flight, and how long the test should be run for.  We’ve settled on 5-minute tests using the following permutations (launched roughly as shown in the sketch after this list):

  • 4KB Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 4KB Objects, 256 Concurrent Operations (32 per rados bench instance)
  • 128KB Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 128KB Objects, 256 Concurrent Operations (32 per rados bench instance)
  • 4M Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 4M Objects, 256 Concurrent Operations (32 per rados bench instance)
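
For reference, here is roughly how one of these permutations gets launched.  This is just a sketch: the bench0-7 pool names are illustrative placeholders, and the pools are assumed to already exist:

    # 8 rados bench instances, each writing 4KB objects with 2 ops in
    # flight for 300 seconds; aggregate throughput is the sum of the
    # per-instance results.
    for i in $(seq 0 7); do
        rados -p bench$i bench 300 write -b 4096 -t 2 &
    done
    wait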

For each permutation, we run the same test using BTRFS, XFS, and EXT4 for the underlying OSD filesystems.  Filesystems are reformatted and mkcephfs is re-run between every test to ensure that fragmentation from previous tests does not affect the outcome.  We left most Ceph tunables in their default state for these tests, except for “filestore xattr use omap = true” to ensure that EXT4 worked properly.  We did pass certain mkfs and mount options to the underlying file systems where it made sense (assembled into full commands in the sketch after this list):

  • mkfs.btrfs options: -l 16k -n 16k
  • btrfs mount options: -o noatime
  • mkfs.xfs options: -f -i size=2048 (-d su=64k,sw=6 for RAID0 tests)
  • xfs mount options: -o noatime
  • mkfs.ext4 options: (-b 4096 -E stride=16,stripe-width=96 for RAID0 tests)
  • ext4 mount options: -o noatime,user_xattr
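
Assembled into full commands, the BTRFS case looks roughly like the following (a sketch only; /dev/sdb and /srv/osd.0 are placeholder device and mount-point names):

    # Format one OSD data disk with 16k leaf and node sizes, then
    # mount it without access-time updates.
    mkfs.btrfs -l 16k -n 16k /dev/sdb
    mount -o noatime /dev/sdb /srv/osd.0

    # XFS equivalent for the RAID0 tests:
    #   mkfs.xfs -f -i size=2048 -d su=64k,sw=6 /dev/sdb
    #   mount -o noatime /dev/sdb /srv/osd.0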

During the tests, collectl was used to record various system performance statistics, and perf was used to gather profiling data on the running processes.  blktrace was also run against every OSD data disk so that we could potentially go back and examine seek behavior on the underlying block devices.
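
For the curious, the collection commands look something like this (a sketch; device names and output paths are placeholders, and flags may vary slightly between tool versions):

    # Record CPU and disk detail statistics to a file for later playback.
    collectl -sCD -f /var/log/collectl &

    # Profile all CPUs with call graphs for the duration of a test.
    perf record -a -g -o perf.data sleep 300 &

    # Trace block-layer events on one OSD data disk.
    blktrace -d /dev/sdb -o osd0 -D /var/log/blktrace &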

4KB RADOS Bench Results

Throughput - 16 Concurrent 4K Writes

Before we dig into these results, I want to talk a little bit about what actually happens when RADOS bench is run.  Each instance of RADOS bench concurrently generates and sends multiple chunks of data over TCP socket connections (in this case via localhost) to OSDs in various placement groups.  Depending on the replication level of the pool that is targeted, each OSD forwards the data it receives to secondary OSDs in the same placement group.  Each OSD must write its data out to the journal (or potentially the data disk in the case of BTRFS) before it can send an acknowledgement to the sender that the data has been written.

When data is written to the journal, it is appended to where the last bit of data left off.  This is ideal behavior on spinning disks because it means that fewer seeks are needed to get the data written out.  Eventually that data must be written to the OSD data disk, and significantly more work (which tends to mean significantly more seeks) must be performed due to all of the various things the file system needs to do to keep track of and place the data.  This only gets worse as the file system ages and fragments in exciting ways.  It is important to keep this in mind, because it can have a dramatic effect on performance.

Are these results reasonable?  The journals are on SSDs which have been carefully chosen to exceed the throughput and IOPS capabilities of the underlying data disks, which should hopefully keep them from being a bottleneck in this test.  The data disks, which are 7200RPM SATA drives, are capable of about 150-200 IOPS each.  With 4KB IOs and 6 disks, that’s something like 4MB/s of aggregate throughput, assuming there is no write coalescing happening behind the scenes.  Given these results, it doesn’t really look like much coalescing is happening.  It’s also possible, however, that write coalescing is happening and that some other bottleneck is limiting us.  We have blktrace results, and in another article it would be interesting to dig deeper into these numbers to see what’s going on.
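
That estimate is simple back-of-the-envelope arithmetic, assuming roughly 170 IOPS per drive (the middle of the range above):

    # 6 data disks * ~170 IOPS each * 4KB per IO, with no write coalescing
    echo "$((6 * 170 * 4)) KB/s"   # ~4080 KB/s, i.e. roughly 4MB/s aggregate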

What else can we say about these results?  The most obvious thing is that JBOD mode seems to be doing the best, while single-array RAID0 configurations appear to be universally slow.  It’s entirely possible that tweaks to various queue limits or other parameters may be needed to increase single OSD throughput in these kinds of setups.  The other thing to note here is that BTRFS appears to combine well with JBOD modes.  Let’s take a look at some of the system monitoring data gathered with collectl during the tests to see if it gives us any clues regarding why this is.  Specifically, we will look at the average CPU utilization, the average IO wait time for the OSD data disks, and the average IO wait time for the OSD journal disks.

CPU Utilization - 16 Concurrent 4K Writes

Data Disk Waits - 16 Concurrent 4K Writes

Journal Disk Waits - 16 Concurrent 4K Writes

Very interesting!  CPU utilization is highest in JBOD modes, which makes some sense given that a certain amount of processing is necessary for every object written and those modes are the fastest.  The CPU utilization during the BTRFS tests in JBOD mode is particularly high though!  The top-performing combinations (i.e. JBOD mode with BTRFS) also have the highest wait times.  My reading of this is that in those cases the spinning disks are working pretty hard for the throughput they are getting and are probably performing a lot of seeks.  In non-JBOD modes, I suspect there is a bottleneck somewhere else and the disks aren’t being stressed as hard.

The journal wait times are also interesting.  I think these results are hinting that while the cheap/basic SAS controllers show no wait times at all for the journal disks (they shouldn’t; the SSDs should easily be out-pacing the spinning disks on this kind of workload), the more expensive RAID controllers are introducing extra latency, probably due to the extra caching and processing they do.  Despite this, the fastest performing configuration was the SAS2208 in JBOD mode with BTRFS, which also had the highest journal queue wait times!

Let’s see if these trends continue if we up the number of concurrent operations:

Throughput - 256 Concurrent 4K Writes

Performance on the high-end RAID controllers improves considerably, but only in the configurations that involve multiple OSDs.  Having more concurrent operations in flight appears to let these controllers hide the latency introduced on the journal disks.  Single OSD RAID0 performance, on the other hand, stays almost flat.  I suspect the bottleneck that is holding single OSD RAID0 performance back is still present and that neither the disks nor the controllers are being particularly stressed.  JBOD performance barely improves either; in that case, however, writes were already queuing up on the disks, so increasing the number of operations doesn’t help.  Let’s see if my theory holds up:

CPU Utilization - 256 Concurrent 4K Writes

Data Disk Waits - 256 Concurrent 4K Writes

Journal Disk Waits - 256 Concurrent 4K Writes

Performance in the pass-through configurations on the RAID controllers rises to match the cheaper SAS controllers, but CPU utilization rises along with it.  Average data disk and SSD journal queue wait times also increase dramatically on the Areca controllers.  It looks like this workload is queuing operations up nicely.  Having said that, this doesn’t appear to hurt overall throughput, as the ARC-1880 still performs well in this test.

128KB RADOS Bench Results

Throughput - 16 Concurrent 128K Writes

With 128K writes things look roughly the same, with a couple of exceptions.  JBOD configurations with BTRFS are again very fast, especially on the cheaper SAS controllers and the SAS2208.  On the other hand, the 6-OSD RAID0 configuration on the SAS2208, which was the fastest configuration in the 256 concurrent 4KB test, is one of the slowest configurations in this one.  Single OSD RAID0 mode is again universally slow.  XFS performance is generally slower across the board, with EXT4 typically landing somewhere between XFS and BTRFS.

CPU Utilization - 16 Concurrent 128K Writes

Data Disk Waits - 16 Concurrent 128K Writes

Journal Disk Waits - 16 Concurrent 128K Writes

With 16 concurrent 128K operations, BTRFS generates higher CPU utilization in every configuration on every controller.  This is especially true in the configurations where BTRFS performs well, but it holds even where BTRFS is comparatively slow.  Data disk wait times are again higher in the faster configurations, though interestingly BTRFS wait times have dropped while XFS wait times have grown.  EXT4 wait times remain largely unchanged.  On the journal side, the SAS2208 in JBOD mode again shows the highest IO queue wait times while simultaneously being one of the fastest configurations.

Throughput - 256 Concurrent 128K Writes

With 256 concurrent operations, the pass-through and 6-OSD RAID0 modes again show significant improvement, while single OSD RAID0 mode remains slow.  BTRFS continues to dominate on performance, while EXT4 puts up relatively decent results on the Areca ARC-1880 and LSI SAS2208.

CPU Utilization - 256 Concurrent 128K Writes

Data Disk Waits - 256 Concurrent 128K Writes

Journal Disk Waits - 256 Concurrent 128K Writes

In the configurations where BTRFS throughput increased relative to the 16 concurrent operation tests, CPU utilization increased dramatically as well.  EXT4 CPU utilization looks quite low even where EXT4 performance comes within 70-80% of BTRFS.  XFS again causes high IO wait times on various controllers, and BTRFS wait times look high on the ARC-1880 as well.  On the SSD side, the RAID controllers with write-back cache once again show comparatively high IO wait times for journal writes.

4MB RADOS Bench Results

Throughput - 16 Concurrent 4M Writes

Wow!  With only 16 concurrent IOs, the three cheapest controllers in our batch push over 600MB/s with EXT4 and close to 700MB/s with BTRFS.  In JBOD mode (and pass-through mode on the ARC-1880), BTRFS still manages slightly over 600MB/s, but neither XFS nor EXT4 can keep up.  Surprisingly, EXT4 on the ARC-1880 is able to push roughly 550MB/s using a 6-drive RAID0 array.  This is the first time we have seen a RAID0 configuration perform anywhere close to the 6-OSD configurations, and it is probably worth investigating further in a follow-up article.

CPU Utilization - 16 Concurrent 4M Writes

Data Disk Waits - 16 Concurrent 4M Writes

Journal Disk Waits - 16 Concurrent 4M Writes

Nothing particularly new here.  CPU utilization is high in the BTRFS configurations, while the EXT4 configurations appear to be much easier on the CPU even when pushing nearly as much data as BTRFS.  XFS shows higher data disk wait times while delivering lower throughput than either EXT4 or BTRFS.  My guess is that if we examined the blktrace data, we would see higher seek counts on the data disks with XFS; we can verify this in a follow-up article using a tool called seekwatcher and the blktrace data we recorded.  Queue wait times look quite high in the JBOD modes that bypass any on-board cache, and lower in the modes that make use of it.  Conversely, wait times on the SSD journal disks remain low on the three SAS controllers but are higher, to varying degrees, on the RAID controllers.

Throughput - 256 Concurrent 4M Writes

With 256 concurrent IOs things look even better: 4 of the 6 controllers are able to push over 130MB/s of per-disk throughput with BTRFS.  That is nearly 800MB/s of aggregate throughput to the 6 OSD data disks, and with the journal writes included, a total of 1.6GB/s through the controller.  Not bad!  EXT4 performance is still good, but doesn’t appear to improve much with the additional concurrent IOs.

CPU Utilization - 256 Concurrent 4M Writes

Data Disk Waits - 256 Concurrent 4M Writes

Journal Disk Waits - 256 Concurrent 4M Writes

CPU utilization, data disk wait times, and journal disk wait times all look very similar to what we saw with 4MB writes and 16 concurrent IOs.  No surprises here.

Conclusion

After hours of benchmarking, and various gymnastics to keep Libre Office from crashing while compiling the results, what have we learned?  Generally, JBOD mode performs well, especially with BTRFS.  Single OSD RAID0 throughput is poor across the board, with only a handful of exceptions.  BTRFS shows the highest throughput numbers but also causes higher CPU utilization than XFS or EXT4.  RAID controllers with write-back cache appear to reduce the amount of time IOs sit in the queue for the data disks, though it isn’t clear whether that is due to the cache or simply to lower throughput.  Strangely, those same controllers show higher queue wait times for the SSD journal disks, even though the same drives show very low queue wait times on the cheaper SAS controllers.

Let’s look at how each controller fared in these tests.  Keep in mind that these conclusions may not apply to configurations where the journals live on the same disks as the data.

  • High Point Rocket 2720SGL: This controller doesn’t always deliver the highest performance, but it consistently delivers very high BTRFS performance and is very cheap.  From a pure performance perspective it is a very strong option if the goal is a JBOD configuration with BTRFS.
  • Areca ARC-1880: The ARC-1880 does well in some cases and poorly in others.  It struggles with small numbers of concurrent IOs but does better, sometimes even leading the pack, with many concurrent IOs.  It also performs better with EXT4 than many of the other controllers.  At large IO sizes it does respectably but can’t catch the cheaper controllers running in JBOD mode.  To be fair, the newer ARC-1882 may do better in the areas where the ARC-1880 falls short.
  • Areca ARC-1222: The ARC-1222 is an old card, and it shows.  It generally hovers near the bottom of the pack in every test.  Like the ARC-1880 it has trouble with small numbers of concurrent IOs, but it also lacks the throughput to compete at large IOs.  Probably not worth buying even if you see one cheap on Ebay; there are better budget options.
  • LSI SAS 9207-8i: This controller performs very well with BTRFS and is generally neck-and-neck with the 2720SGL.  It is more expensive, but comes from a more established brand.  If you want a cheap controller and aren’t comfortable buying a Highpoint, this is your card.
  • LSI SAS 9211-8i: In JBOD mode the SAS9211-8i performs very similarly to the SAS9207-8i and the 2720SGL.  Its RAID functionality feels like an afterthought and performs poorly with Ceph.  The card is worth buying if you find it at a good price, but the 9207-8i is often cheaper and has slightly better specs.
  • LSI SAS 2208: The SAS2208 is an interesting controller.  In JBOD mode it mimics the behavior of the cheaper controllers and does well with BTRFS, though it is usually slightly slower.  With six single-disk RAID0 arrays it sometimes performs extremely well and is the top performer, while at other times it does miserably.  Like the ARC-1880, this controller seems friendlier to EXT4 and even XFS, probably thanks to the on-board write-back cache.  It also shows odd SSD queue wait times in certain cases.

Future Work

This article provides a high-level view of how several different controllers perform with Ceph.  There are many more tests and much additional analysis that could be done in future articles.  If you have opinions about what you would like to see, please send me an email or leave a comment on the article.  Here are some of the things that may be worth investigating in follow-ups:

Broader Analysis

  • Testing with 8 spinning disks and journals on the same disks as the data, rather than 6 spinning disks and 2 SSDs for the journals.
  • Examining how performance scales with multiple controllers and more disks/SSDs in the same node.
  • Examining how performance scales across multiple nodes (get out the credit card, Inktank!).
  • Testing performance over 10GbE, and potentially bonded 10GbE as more drives are used, with separate clients.
  • Other tests, including object reads, RBD throughput tests, CephFS throughput tests, metadata tests, etc.

Deeper Analysis

  • Investigating per-process CPU usage, especially in cases where CPU utilization is high.
  • Examining how performance degrades over time.
  • Examining underlying block device performance and seek behavior under various conditions.
  • Examining how various tuning parameters affect performance, especially at small IO sizes and on fast RAID arrays.