Building a NAS, Part 2: Storage napkin math

In part one we gathered some requirements and got a rough idea of the kind of server we’re thinking about. Now it’s time for some napkin math to work out the minimum hardware specs related to storage on the box.

Specifically, we’re going to explore numbers for the SATA I/O bandwidth we require, how that translates into PCI Express bandwidth and lanes, and how all of that influences both our bus topology and the CPUs we can use.

To frame these calculations, we’re going to make some assumptions about the layout of our final server. We’re going to assume that we’ll have a chassis that features hot-swap bays that hook up to a backplane. That backplane then has to hook up to the motherboard. This hookup could be done through either SATA, where the backplane has one SATA connector per disk (lots of cables!), or through SAS cables that can transport the equivalent of 4 SATA cables each.

A good example of this kind of layout is the Supermicro SC826 series of rackmount chassis. They offer 12 hotswap 3.5” bays at the front, with 3 main backplane options: SATA direct-attach (one SATA cable per disk), SAS direct-attach (1 SAS cable per 4 drives - the cable transports 4 separate SAS/SATA links), or SAS expander (1 SAS cable for all 12 drives, via the SAS equivalent of a network switch).

Illustration showing the difference in the amount of cabling between SATA direct attach, SAS direct attach, and SAS with an expander.

12 drive bays is about right for what we want, so let’s calculate using that. I’ll also use the SAS-2 protocol version, because that’s what’s available cheaply on the used market at the moment. SAS-3 is still state of the art, and that means more expensive hardware.

I tend to redo these calculations a bunch of times with different parameters as I narrow in on a build, so this is just one “run”, with the thinking explained.

These calculations are also all going to be back-of-the-envelope. I’ll be rounding generously, making guesstimates, and generally I’ll be happy if we stay within 10-20% of reality. Incidentally, this is also what Google’s “Non-Abstract Large System Design” interviews for SRE have you do - throw a bunch of numbers at the wall to estimate the scale of a problem and guide designs.

Before we begin

If you read part 1 when it came out, there’s one significant change that happened since then: the requirement for networking got upped to 10Gbps. We still want the ability to saturate that bandwidth, so that’s upped the ante for storage performance. However, we don’t need all storage on the machine to be that fast: we can separate it into a fast (SSD-based) pool and a slow (spinning disk based) pool.

That’s why we’re doing math assuming SSDs here, even though in the previous part we’d concluded that we didn’t necessarily need SSDs. We definitely do, and that’s influencing this math.

SATA I/O bandwidth

We need to calculate two things here: how much bandwidth our disks are likely to want to consume, and how much we can offer them. Let’s start with the second half of that.

One SATA3 link runs at a 6Gbps clock rate. However, the data being transported uses 8b/10b encoding, so it’s actually 10-bit words that are being sent at 6Gbps, even though each one only represents 8 bits of what we wanted to send. In other words, there’s a 25% clock rate “tax” to make the transmission robust - or, equivalently, 20% of the raw bit rate is encoding overhead - and we “only” get 4.8Gbps for our actual SATA protocol traffic.

There are some additional overheads from the link and frame protocols that sit on top of the electrical layer, but I couldn’t find any specific figures because the SATA spec is closed and requires payment to access.

However, most protocols out there consider 3-5% protocol overhead acceptable (TCP/IPv4 clocks in around 3% for 1500-byte frames), so let’s guess that SATA is somewhere around there.

Shave that off our 4.8Gbps, and our range is a few dozen Mbps either side of 4.6Gbps. Since this is napkin math, we can stick with that nice round number.

Now on to SAS cables. The cables we’re likely to encounter will be the SFF-8087 form factor, which as I hinted above carry 4 independent SAS links each. For our purposes, we can think of SATA and SAS as equivalent (they’re friendly sibling protocols), so one SAS cable can carry 4*4.6 = 18.4Gbps of SATA data total. Again, this is assuming we’re speaking SAS-2 on the link, which is roughly on par with SATA-3. If you’re calculating for the current state of the art, simply double the bandwidth to get SAS-3 numbers.

As a piece of independent verification, I searched Google and found someone reporting that, over a 4-lane SFF-8087 cable with SAS-2, they were seeing 22Gbps data rates rather than the theoretical 24Gbps. This seems in line with protocol overhead, but is missing the overhead from 8b/10b conversion. If we factor that in as well, that 22Gbps becomes 17.6Gbps - within 5% of the number we got. Close enough.
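
Since I tend to redo this math with different parameters, here’s the whole derivation as a tiny script. It’s only a sketch: the 4% protocol overhead is my guess from the TCP comparison above, not a figure from the SATA spec.

```python
# Napkin math for effective SATA/SAS bandwidth (all rates in Gbps).
SATA3_LINE_RATE = 6.0         # raw clock rate of one SATA-3 / SAS-2 link
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 10 line bits carry 8 data bits
PROTOCOL_OVERHEAD = 0.04      # guess: somewhere in the 3-5% range
LANES_PER_SFF8087 = 4         # one SFF-8087 cable carries 4 SAS lanes

def effective_link_gbps(line_rate_gbps: float) -> float:
    """Usable bandwidth of one link after encoding and protocol overhead."""
    return line_rate_gbps * ENCODING_EFFICIENCY * (1 - PROTOCOL_OVERHEAD)

link = effective_link_gbps(SATA3_LINE_RATE)
cable = link * LANES_PER_SFF8087

print(f"one SATA-3 / SAS-2 link: {link:.1f} Gbps")   # ~4.6 Gbps
print(f"one SFF-8087 cable:      {cable:.1f} Gbps")  # ~18.4 Gbps

# SAS-3 runs a 12Gbps line rate, i.e. roughly double the SAS-2 numbers:
print(f"one SAS-3 SFF cable:     {effective_link_gbps(12.0) * 4:.1f} Gbps")
```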

How much bandwidth from SATA spinning disks?

We’ll take a WD Red as a likely NAS drive to use. Synthetic read and write benchmarks for that drive consistently land around 160MiB/s, and real-world numbers show 100-155MiB/s depending on workload. So if we budget for 160MiB/s per drive, that should leave us a comfortable margin and ensure that the link is not the bottleneck.

Comparing that to our numbers above (which is annoying because link speeds are conventionally in bits/s, so we get to convert to bytes/s), we expect one SATA link to give us about 575MB/s, or roughly 550MiB/s - more than enough. So, if we go with a setup where each disk gets its own SATA cable, we’re golden. Likewise if we go for a SAS setup in which we use 3 SAS cables and dedicate each lane to one disk - 12 lanes, 12 disks, ~550MiB/s each, we’re golden.

What about the SAS expander case? In that scenario, we have 1 SAS cable connecting to the backplane, so 12 drives are sharing those 4 lanes through the expander chip. Well, 18.4Gbps for the SAS lanes, divided across 12 drives, is about 190MB/s (call it 180MiB/s) per drive. Phew, still works, albeit with little margin.
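
Here’s the spinning-disk scenario in the same script form, using the effective lane bandwidth from earlier and the 160MiB/s benchmark figure:

```python
# Per-drive bandwidth in the backplane scenarios (spinning disks).
GBPS_PER_LANE = 4.6             # effective SATA-3 / SAS-2 lane bandwidth
MIB_PER_GBPS = 1e9 / 8 / 2**20  # Gbps -> MiB/s (~119.2)

DRIVES = 12
HDD_MIB_S = 160                 # WD Red-ish synthetic benchmark figure

def per_drive_mib_s(lanes: int, drives: int) -> float:
    """MiB/s available per drive when `drives` share `lanes` SAS/SATA lanes."""
    return lanes * GBPS_PER_LANE * MIB_PER_GBPS / drives

direct_attach = per_drive_mib_s(lanes=12, drives=DRIVES)  # 1 lane per drive
expander      = per_drive_mib_s(lanes=4,  drives=DRIVES)  # 12 drives, 1 cable

print(f"direct attach: {direct_attach:.0f} MiB/s per drive")  # ~550
print(f"expander:      {expander:.0f} MiB/s per drive")       # ~180
print(f"drive wants {HDD_MIB_S} MiB/s -> expander margin: "
      f"{expander / HDD_MIB_S:.2f}x")
```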

How much bandwidth from SATA SSDs?

This is where things start to get dicey. Benchmarks for a Samsung 860 Evo SSD show it clocking in around 400-500MiB/s - in other words, effectively an entire SATA lane per drive. That means that, if we want to get full bandwidth to 12 SATA SSDs, our chassis backplane will need 12 SATA connectors, or 3 SAS connectors (which is effectively the same thing, just taking less space). If we use a SAS expander, we can get peak bandwidth to 4 disks at a time, maybe 5 depending on how the leftover lane bandwidth gets multiplexed across the expander.
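
The “4 disks, maybe 5” estimate comes out of arithmetic like this - a sketch, with 450MiB/s picked as the midpoint of the benchmark range above:

```python
# How many SATA SSDs can run flat-out behind one SAS-2 expander cable?
CABLE_GBPS = 18.4                        # one SFF-8087 cable worth of SAS-2
SSD_MIB_S = 450                          # Samsung 860 Evo-ish: 400-500 MiB/s
SSD_GBPS = SSD_MIB_S * 2**20 * 8 / 1e9   # ~3.8 Gbps per SSD

full_speed_ssds = CABLE_GBPS / SSD_GBPS
print(f"one SSD wants ~{SSD_GBPS:.1f} Gbps")
print(f"SSDs at full speed behind one cable: ~{full_speed_ssds:.1f}")  # ~4.9
```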

So now, this turns back into a question for the end user: given an all-SSD array, are you okay with only having 4 drives worth of bandwidth in and out of the array?

If not, that constrains us to one of the direct-attach backplane options, and bumps up the bandwidth we’ll need to be able to deliver from the motherboard - which brings us to our next section.

PCI Express bandwidth

Working backwards from the storage, we now have some idea of the bandwidth we need to push out from the motherboard: 18.4Gbps if we go with an expander backplane and spinning disks, or up to 55.2Gbps (3*18.4Gbps) for full-speed SSDs on a direct-attach backplane.
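
In script form, the two ends of that range are just a restatement of the numbers above:

```python
# Bandwidth we need to get off the motherboard, by backplane choice.
SAS2_CABLE_GBPS = 18.4   # one SFF-8087 cable worth of SAS-2, from earlier

expander_backplane = 1 * SAS2_CABLE_GBPS   # 1 cable feeds all 12 drives
direct_attach      = 3 * SAS2_CABLE_GBPS   # 3 cables, 1 lane per drive

print(f"expander (fine for spinning disks): {expander_backplane:.1f} Gbps")
print(f"direct attach (full-speed SSDs):    {direct_attach:.1f} Gbps")  # 55.2
```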

We can get that bandwidth off the motherboard either through SATA ports, or through a SAS HBA (Host Bus Adapter) in one of the motherboard’s PCIe slots.

If we use the onboard SATA ports, those connect to the CPU via the southbridge. In theory, every port on the board should get the full rated 6Gbps of bandwidth, but there may be shenanigans there. Be sure to check motherboard specs and manuals to see if connecting many drives has adverse effects like halving port bandwidth, or disabling other ports.

Since that part varies a lot from board to board, I’ll just say that it looks like we’d want 4-12 SATA ports on the motherboard. 4 is easy to find, but 12 puts us in a niche market, so the prices are going to be higher.

The other option is to consume a PCIe slot with a SAS HBA. HBAs are advertised based on the number of SAS ports they offer, rather than the number of physical connectors. For example, an 8-port HBA will usually have 2 SFF-8087 connectors on the board, each serving 4 lanes of SAS.

Illustration showing the busses involved in using an HBA. CPU and HBA are connected by PCI Express, and the HBA connects out with SAS.

So, we’d need between 1 and 3 SFF-8087 connectors on our HBA. 3-connector HBAs don’t really exist, and 4-connector (16-port) HBAs are problematic in a 2U chassis because they’re usually full-height cards, and 2U can only fit half-height PCIe cards. But we’ll worry about spatial packing later.

The next hop up the chain is PCIe bandwidth. To maintain full throughput, we want to have equal or better PCIe bandwidth arriving at the HBA as we do leaving it on the SAS cables. How does that shake out?

PCIe, similar to SAS in some ways, functions with independent “lanes”. A PCIe x16 port has 16 PCIe lanes going to it. To get the total bandwidth into a port, you just multiply out the per-lane bandwidth.

Per-lane bandwidth depends on the version of PCIe. So far, the line rate has roughly doubled with each major version: PCIe gen1 runs at 2.5GT/s per lane, gen2 at 5GT/s, and gen3 at 8GT/s. Gen1 and gen2 use the same 8b/10b encoding as SAS, which leaves 2Gbps and 4Gbps of usable bandwidth per lane; gen3 switched to a much leaner 128b/130b encoding, so it keeps nearly all of its 8Gbps. Shave off a few percent for packet protocol overhead, and the practical numbers look more like 1.9Gbps for gen1, 3.8Gbps for gen2, and 7.5Gbps for gen3.

This matters, especially when we’re dealing with SAS-2, because right around that time was the switch from PCIe gen2 to gen3. So you might have two PCIe x8 cards, but one can only field roughly 8*3.8 = 30Gbps to the CPU (PCIe gen2), while the other can field roughly 8*7.5 = 60Gbps (PCIe gen3). If we want to fully utilize all the SAS ports on the HBA, we need a gen3 card.
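
Here’s that comparison as a quick check. The per-lane rates are the post-encoding numbers from above, and the ~5% packet overhead is a rough guess on my part:

```python
# Can an x8 slot keep up with 8 lanes of SAS-2?
# Per-lane PCIe throughput after encoding (8b/10b for gen1/2, 128b/130b for gen3).
PCIE_GBPS_PER_LANE = {1: 2.5 * 8/10, 2: 5.0 * 8/10, 3: 8.0 * 128/130}
PCIE_PROTOCOL_OVERHEAD = 0.05   # rough guess for packet (TLP/DLLP) overhead

SAS2_LANE_GBPS = 4.6            # effective SAS-2 lane bandwidth from earlier
HBA_SAS_LANES = 8               # an '8-port' HBA: 2 SFF-8087 connectors

need = HBA_SAS_LANES * SAS2_LANE_GBPS
for gen, per_lane in PCIE_GBPS_PER_LANE.items():
    have = 8 * per_lane * (1 - PCIE_PROTOCOL_OVERHEAD)   # x8 slot
    verdict = "ok" if have >= need else "bottleneck"
    print(f"PCIe gen{gen} x8: {have:.0f} Gbps vs {need:.1f} Gbps needed -> {verdict}")
```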

PCIe lanes

Finally, beyond thinking about the PCIe bandwidth we need to serve these drives, we also need to be mindful of the total number of PCIe lanes. CPUs have a limited number of PCIe lanes, and adding more quickly gets expensive: PCIe lane count is one way that Intel segments the CPU market, with affordable SKUs having few lanes.

How many lanes do we need? Well, for our SAS HBA, depending on the storage topology we end up with, we’ll need either 8 or 16 lanes.

For this particular build, we have two other requirements that consume PCIe lanes. We want 10Gbps ethernet, and we want NVMe storage.

A simple 10Gbps PCIe addon card will consume 4 PCIe lanes. A modern NVMe drive will do likewise, assuming you plug it into a PCIe slot with an adapter card. If you’re using the M.2 slot on the motherboard, it might disable one of the PCIe slots, or it might use the chipset’s PCIe lanes rather than use the CPU’s directly-connected lanes (which are the ones that get broken out to PCIe slots).

Adding all that up, for 10Gbps ethernet, 2 NVMe drives and our HBA(s) as discussed above, we’re looking at 20-28 PCIe lanes for our build.
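
Summing up the lane budget, with the per-device counts being the guesstimates above:

```python
# Rough PCIe lane budget for the build (low/high estimates per device).
LANES = {
    "SAS HBA (x8, or x16 if we end up needing two)": (8, 16),
    "10GbE NIC": (4, 4),
    "NVMe drive #1": (4, 4),
    "NVMe drive #2": (4, 4),
}

low = sum(lo for lo, _ in LANES.values())
high = sum(hi for _, hi in LANES.values())

print(f"total PCIe lanes needed: {low}-{high}")         # 20-28
print(f"fits in a Xeon E3's 16 lanes? {low <= 16}")     # False
print(f"fits in a Xeon E5's 40 lanes? {high <= 40}")    # True
```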

Unfortunately, that places us in the “you must be serious, and therefore have serious money” category of hardware. In particular, it rules out the entire entry level Xeon E3 lineup of CPUs, which only have 16 PCIe lanes.

The Xeon E5 line will work, however. Those CPUs have 40 PCIe lanes to offer. Note that the CPU having 40 lanes doesn’t mean all 40 get broken out to PCIe slots on the motherboard: the motherboard might not populate slots for all lanes, and might commandeer some lanes for onboard things like a RAID controller.

So, we’ll still have to look carefully at motherboards, to ensure that they’re providing both the number of lanes we need and the right set of slots (e.g. one x16, two x4). Further muddying the waters, some motherboards we look at might have onboard 10Gbps ethernet, or M.2 slots that use chipset PCIe lanes. These could make our lives easier, but given the generation of hardware we’ll be looking at, I’m not expecting to find that.

Conclusion

I asked the end user for this build what they wanted, and the answer was maximum flexibility in using SATA SSDs - meaning full SATA bandwidth to each of the 12 drive bays. This, combined with the thinking above, tells us a lot about the machine we need to build.

For full bandwidth we’ll need a direct-attach backplane; an expander backplane just won’t cut it, even if we upgraded to a SAS-3 backplane.

For the Supermicro 826 case we’re looking at, we have the option of either 12 SATA ports, or 3 SAS SFF-8087 ports. However, we looked at the landscape of motherboards and concluded that 12 SATA ports would be hard to come by. And besides, have you tried routing 12 cables through tiny slots in a 2U case? It’s a nightmare!

So, 3 SAS SFF-8087 ports it is. To connect those to the motherboard, we can use a PCIe gen3 HBA - but those typically only have 2 SFF-8087 connectors. While we could just buy two HBAs, a cheaper solution is to connect 2 of the SAS ports to our HBA, and connect the 3rd to the motherboard’s SATA ports directly with a reverse breakout cable. That requires 8 PCIe lanes and 4 SATA ports, which almost any motherboard out there can provide.
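
As a last sanity check on that layout, with the same numbers as before:

```python
# Final layout: 2 SFF-8087 connectors on a PCIe gen3 x8 HBA,
# plus 1 connector's worth of drives on 4 motherboard SATA ports.
SAS2_LANE_GBPS = 4.6
PCIE3_X8_GBPS = 8 * 8.0 * 128/130 * 0.95   # ~60 Gbps usable, rough guess

hba_sas_gbps = 2 * 4 * SAS2_LANE_GBPS      # 8 drives behind the HBA
mobo_sata_gbps = 4 * SAS2_LANE_GBPS        # 4 drives on onboard SATA

print(f"HBA needs {hba_sas_gbps:.1f} Gbps, gen3 x8 offers ~{PCIE3_X8_GBPS:.0f} Gbps")
print(f"onboard SATA handles the remaining {mobo_sata_gbps:.1f} Gbps")
print(f"PCIe lanes: 8 (HBA) + 4 (10GbE) + 2*4 (NVMe) = {8 + 4 + 2*4}")  # 20
```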

Adding in our network card and NVMe drives, that puts us at 20 PCIe lanes, so we still need a Xeon E5 regardless.

In the next part, we’ll turn this abstract topology into an actual set of parts to order.