Bulletproof Memory for RAID Servers, Part 2

Just what is the real cost of the memory in a RAID server? Seems like a simple question, right? For volatile memories such as DRAM and SRAM, the cost is pretty much the purchase cost of the memory DIMMs. Sure, DRAM and SRAM modules might occasionally fail and require replacement, but the associated failure rate is pretty low, so the reliability tax on the failures is also relatively low. Not true for non-volatile memory. No matter what technology a RAID server design team uses to add non-volatile memory, there will be costs beyond the acquisition cost of the memory, and those extra costs should be factored into the system design if the design is to be competitive.

As we discussed in Part 1 of this blog entry series, RAID servers must use non-volatile memory for their write caches to prevent data loss during power failures. There are many ways to achieve nonvolatility. One way is to back up the entire server with an uninterruptible power supply. That takes a lot of battery power or a diesel-driven generator (or a hydroelectric turbine, if there’s one handy). Another way is to use a much smaller battery to back up the RAM used as a write cache. Yet another is to use NAND Flash as a write cache. All of these design approaches have problems and no matter the approach, the server processor must be involved in safely preparing for the imminent loss of power. Let’s examine these last two design approaches more closely, assuming that diesel generators and water power are out of the question.

Backup batteries have short lives and require regular maintenance, which they often do not get. NAND Flash memory has relatively slow write times, so it makes a poor write cache when used directly. Worse, NAND Flash memory exhibits write-induced wearout failure. You really must minimize the number of times you write to Flash memory. For both of these reasons, using Flash memory like it’s RAM is clearly a misapplication of Flash memory technology.

So what’s the real cost?

Back to the original question posed in this blog entry: What’s the real cost of the memory in a RAID server? Let’s run a thought experiment and see where it takes us. Consider a battery-backed RAM. Besides the cost of the RAM, which is the same whether there’s battery backup or not, there’s the cost of the battery. What’s the cost of a battery pack? It’s on the order of $100 for the RAID server customer. However, if your customers are replacing these batteries annually as they should, then there’s roughly $500 worth of batteries to buy per server over the course of a four-year lifespan for the memory. (That’s $100 initially for the first battery and $100 per year for each year following.)

However, that’s not the only cost. Someone must go into the server room, take the server down, replace the battery, and then bring the server back up. For the sake of argument, let’s say it takes an hour for an IT tech to do all of this for one server. What’s the burdened cost of an hour of an IT tech’s time? Well, that number varies, but again it’s on the order of $100. And you need to do it four times over the course of the 4-year life of the server memory. That’s another $400. (We’re ignoring recycling costs here, but batteries should be recycled properly.)

So if battery maintenance occurs as it should, the cost of non-volatile server memory is roughly the cost of the memory plus $900 in battery and maintenance costs ($500 for batteries plus $400 for labor). These costs greatly exceed the cost of the memory itself.
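The arithmetic behind that $900 figure can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function and parameter names are made up for illustration, and the defaults are the rough "on the order of $100" estimates used above, not measured prices.

```python
def scheduled_battery_cost(battery_cost=100, labor_cost_per_swap=100,
                           annual_replacements=4):
    """Rough battery and labor cost when batteries are replaced on schedule.

    Assumes one initial battery plus one replacement per year, and one
    hour of IT tech time per replacement (the estimates used in the text).
    """
    batteries = battery_cost * (1 + annual_replacements)  # initial + replacements
    labor = labor_cost_per_swap * annual_replacements     # one swap per replacement
    return batteries, labor, batteries + labor


batteries, labor, total = scheduled_battery_cost()
print(f"batteries ${batteries}, labor ${labor}, total ${total}")
# prints: batteries $500, labor $400, total $900
```

Plugging in your own battery price or burdened labor rate shows how quickly the total moves away from the purchase price of the memory.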

But what if battery maintenance doesn’t take place as it should? What if the battery fails in service? What’s the cost then? Well, in this scenario, you need to make some big assumptions. First, you need to assume that the batteries are all properly monitored so that there’s an alert as soon as a battery fails. If not, then the RAID servers are always subject to catastrophic data loss because their write caches are unprotected from power failures. Actually, it’s not so easy to sense battery failure without putting a load on the battery, but let’s ignore this detail for now.

Next you need to assume that there’s a replacement battery handy, sitting ready to go on the shelf next to the server room, and that someone knows where this battery is stored. Otherwise the RAID server with the failed battery will need to be taken out of service and replaced with another server until a new battery can be found, flown in, or otherwise delivered from the warehouse, wherever that is. Battery spares are cheaper to keep on the shelf than spare RAID servers so it’s likely that it’ll be a spare battery on the shelf. Likely as not, the battery on the shelf won’t be fully charged, but let’s ignore that detail for now as well.

Finally, you need to assume that there’s always an IT tech on hand who knows how to replace a server backup battery and can act quickly when a battery fails.

These are all big assumptions and they are all most assuredly bad assumptions, but they set a lower bound on the associated maintenance costs. An unattainable lower bound, most certainly, but a lower bound nevertheless.

$300 for one failure, $500 for two

If you make all of these assumptions, then the costs for server-memory nonvolatility using battery backup include the initial $100 battery cost, plus the cost of replacing the failed batteries over the four-year life of the server memory. In the highly unlikely event that there’s only one failure during that time, the 1-time replacement cost is about $200 ($100 for the replacement battery plus $100 for the labor cost to replace it) for a total of $300 for the initial battery plus one replacement. If the battery fails twice during the four years, then the total cost is $500.
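The same back-of-the-envelope approach covers the replace-on-failure scenario. Again, this is a sketch under the optimistic assumptions listed above (instant alert, spare battery on the shelf, tech on hand); the function name and parameters are illustrative, and downtime cost is deliberately left out.

```python
def failure_driven_battery_cost(failures, battery_cost=100,
                                labor_cost_per_swap=100):
    """Rough cost when batteries are replaced only after they fail.

    Counts the initial battery plus, for each failure, one replacement
    battery and one labor charge. Lost business from downtime is ignored.
    """
    per_failure = battery_cost + labor_cost_per_swap
    return battery_cost + failures * per_failure


for failures in (1, 2):
    print(f"{failures} failure(s): ${failure_driven_battery_cost(failures)}")
# prints: 1 failure(s): $300
#         2 failure(s): $500
```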

While this second scenario sets a lower bound on cost, it’s clearly built on unrealistic assumptions. There will most certainly be unplanned downtime with this scenario.

Batteries almost never fail at convenient times. They seem to have a sixth sense about these things. Batteries fail at night and when the IT team is otherwise occupied. So you also need to figure in the cost of lost business due to the unplanned server outage. Realistically, that’s clearly going to happen.

Lost time counts too

Now the dollar value of lost data is really tough to set. However, as discussed in the previous blog entry, an hour’s loss of server time could easily cost a large customer thousands or millions of dollars, especially if that server customer is Amazon, Google, or a fast-transaction securities trader that relies on response times that are microseconds faster than those of competing traders. For such customers, the cost of server memory is clearly irrelevant because uninterrupted server uptime is so very valuable to them. These customers know to the penny what server uptime is worth per minute, per second, and even per millisecond. That’s how valuable server uptime is to this class of customer.

These customers don’t want to know how much the memory in the server costs. They want to know how the server’s design will prevent unplanned downtime.

The server design team must therefore have bulletproof, nonvolatile memory as a goal. This memory should not require annual maintenance, so that the server’s design avoids both frequent planned downtime and unplanned downtime due to memory failure. The economics of this goal are simply undeniable.

If you’re thinking that this discussion is leading to an argument for why AgigA Tech’s approach to non-volatile server memory is worth more money, you’re wrong. After taking maintenance costs into account, AgigA Tech’s AGIGARAM modules actually cost less. Taking the cost of lost data and server downtime into account, AGIGARAM modules cost a lot less. That’s something to be discussed in the next blog entry.

Friday, November 13th, 2009 at 22:45