Niall’s virtual diary archives – Wednesday 8th August 2012


Wednesday 8th August 2012: 5.40pm. I'm glad to say that every item on my summer to-do list from the last entry has been completed or nearly completed - unfortunately, at the expense of being very tired and not thinking properly. This morning, for the first time since 1997, I forgot my passport for my flight to Belgium to say goodbye to Natasja - and this being post 9/11, they wouldn't let me fly, so that was €190 down the tube and much disappointment caused for all. The tiredness resulted from being up till 2.30am replying to essential email organising the relocation to Canada, organising my next trip to see people around London, booking flights for same despite various people continuing to be vague about their availability, and setting a burn-in test running on my shiny new 3Tb hard drive, which is going to act as a separately transited ZFS pool mirror for all our important data during the move to Canada. (Not helped by the Sandforce SSD in my main desktop deciding to die for the third time yesterday, once again hosing my Windows install. This morning I ordered a 256Gb Samsung 830 SSD at some expense, and vowed never, ever to use any Sandforce-based SSD again.)

Why didn't I start all that earlier? Well, Sarah was visiting to say goodbye for the preceding four days, and with Megan working yesterday I took them to Kinsale for the day. In short, I couldn't have started anything earlier, even though I kept dipping out of activities with Megan and Sarah to grab an hour or two sitting in the car with my laptop over a 3G link to attend to essential communiqués etc. (and hard drive burn-in tests take several days). All in all, outta time in almost every endeavour, and that results in Niall putting up mental resistance to adding on more stress via subconscious blocks like forgetting passports. Unhelpful. The truth is, though, that the extra two days will be exceptionally useful - I now have enough days to get in my final OU Pure Maths coursework on time, and to organise the removers, who of course want contacting when I'm busy and won't respond when I contact them - e.g. they rang when I was grabbing some very much needed sleep this afternoon. Still, I deeply regret missing Belgium. It's always fun visiting there.

You may have noticed that I upgraded the CSS (for those non-ancient pages on nedprod which use CSS) to use some newer features. There are more rounded corners than before, particularly on the navigation pane separator; hovering over links produces a fire effect I learned from developing the Deeper Economics website; and I used some CSS3 selectors to apply box shadowing to any standalone (i.e. centred) images, which I think works really well. I ended up doing this as part of modularising the RSS feed floating pane so it can appear on the homepage of my software libraries with a running commit feed. Good stuff.

Other changes still to come include a proper SSL certificate for nedprod, and with that I can turn on the SPDY fast HTTP extension to improve page load times still further. I rented a verified personal identity SSL certificate from StartSSL for a fairly reasonable US$60 for two years. Basically, what this does is say that someone called Niall Douglas residing in a given locality in a given country has provided a minimal amount of proof that they do have that name and do reside at the address they supplied. You can then attach this "proof" as a digital signature to your email, your website and so on, under the theory that it makes it somewhat harder for another to impersonate you. Now, I'm not bothered about anyone impersonating me; rather, I bothered renting this because recent versions of Windows throw up a warning if you try to install unsigned programs, and this includes the v1.50 alpha 1 release of BEurtle. Future releases of BEurtle will now be properly signed rather than self-signed, and therefore won't raise a warning. Obviously I also get the advantages of signed email, my email program not complaining every time I fetch mail, etc. as well.

Last bit of news: I passed that damn PGCert in Educational and Social Research with the Institute of Education in London - in fact, I think I'll get a merit if I've calculated my weighted average correctly. Glad to be away from there - it was an eye opener. And very glad that some £3,200 of my hard earned money was not wasted.

Ok, so here's kinda why I'm writing a virtual diary entry now rather than later - here's me thinking out loud about how I'm going to configure my ZFS storage pool, on the basis that this may help others. What I've got is my Proxmox cloud node running a copy of Ubuntu Server 12.04 LTS in a KVM virtual machine. Into that I've installed the Linux kernel port of ZFS v28 by Lawrence Livermore National Laboratory (a US government lab with huge data needs, which is why they ported ZFS to Linux - so I definitely trust the port); installing it is literally as easy as adding a PPA to Ubuntu. I have three main storage hard drives which originate from an external USB hard drive solution originally called an "Icy Box":
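
For the record, the install went something like this (the PPA and package names below are from memory of that era, so treat them as assumptions - they may well have changed since):

```shell
# Sketch: installing the LLNL ZFS-on-Linux port on Ubuntu 12.04 via PPA.
# PPA/package names are assumptions from memory, not gospel.
sudo add-apt-repository ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs   # builds the kernel modules via DKMS
sudo modprobe zfs                 # load the ZFS module
zpool status                      # should report "no pools available"
```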

  1. IcyBox1: A Jan 2008 1Tb 3-platter Samsung Spinpoint F1 HD103UJ drive (1,000,204,886,016 bytes, 512 byte sectors, 6.7w/19.6w idle/start). This was my first solution to a stack of DVDs several feet high which had been growing since I studied at Hull University, and I remember having much fun carrying all 25kg of them in a backpack through the Madrid metro when I left Spain for London. I remember many students at St. Andrews marvelling that so much data could fit into a single drive, and in fairness so did I at that time. I also remember being fairly appalled during the load of the DVDs onto that drive that some of the older DVDs, especially those written at Hull and some of those in Madrid, had become unreadable despite much loving care and that arduous trek through the Madrid metro among other occasions of physical data transfer. Basically, DVD±R is a bad way to store long term data.
  2. IcyBox2: A May 2010 2Tb 3-platter Western Digital Caviar Green WD20EARS-00MVWB0 "load/unload click of death" drive (2,000,398,934,016 bytes, 4096 byte sectors, 2.9w/14w idle/start). This was purchased just as the Samsung drive approached capacity, and just after the 3-platter Green design came onto the market.
  3. In test now: A June 2012 3Tb 3-platter Western Digital Red WD30EFRX-68AX9N0 drive (3,000,592,982,016 bytes, 4096 byte sectors, 3.84w/13.71w idle/start). Unlike the earlier drives, this I bought about six months too early in order to secure our data for Canada. Once again, this is one of the very first 3-platter 3Tb drive designs on the market, and the first I've owned to contain fancy enterprise style vibration gyroscopes.

As you can tell, I really don't like more than three platters for long term storage :-). I have about 1.7Tb of data in total, and currently the 1Tb drive holds what I had back when I had just 1Tb of data. Apart from that, I don't have any redundancy, though the 2Tb drive is only ever plugged in when it's time to back up data - thankfully, as a result, it has a low load cycle count and won't die from the infamous "load/unload click of death" bug in those Caviar Green drives. I also need to flash new firmware onto that drive, as it has the original "I can't do SMART properly" firmware, but I dare not until I have a full backup. All in all, it isn't a good long term data storage solution, but it could be worse.

So, how should I configure these drives as a fully redundant ZFS storage pool? This is slightly tricky. The naive solution is a simple 1:1 mirror: concatenate the 1Tb and 2Tb drives and mirror them against the 3Tb drive, making sure to account for the fact that the 3Tb drive is slightly short (3,000,592,982,016 vs. 3,000,603,820,032 bytes). However, the 1Tb drive has 512 byte sectors, and apparently ZFS won't mirror across dissimilar sector sizes, though you can override this by forcing a 4Kb sector size during pool creation. Another problem is that ZFS apparently can't nest vdevs, so you can't concatenate two physical units into a vdev and then mirror that vdev against another physical unit. You could instead stripe 3x 1Tb, i.e. partition the 2Tb drive into two and stripe all three 1Tb slices into a 3Tb vdev. The problem with that is it would bottleneck on the 2Tb drive, which must be constantly read and written at seek locations about half a drive apart, i.e. slowly, and the 128Kb stripe used by ZFS doesn't divide evenly by three, so you'd get dreadful alignment penalties. Assuming a 5% chance of individual disk failure, the striped side fails if either of its two drives does (roughly 10%, as in RAID 0), so the risk of data loss is 10% x 5% = 0.5% - about 30x better than the roughly 15% chance of losing data across three unprotected drives.
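
For reference, the 4Kb sector override mentioned above is the `ashift` property at pool creation time (the pool name and device paths below are placeholders, not my actual disks):

```shell
# ashift=12 forces 4096-byte (2^12) sectors on every vdev created now,
# so a 512-byte-sector drive won't drag the pool down to 512b alignment
# and can later be swapped for a 4Kb-sector drive.
# "tank" and the by-id paths are illustrative placeholders.
zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/drive-a /dev/disk/by-id/drive-b
```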

What if you want to expand the pool later? Well, ZFS won't let you change vdevs after configuration, so the only expansion route is to add another mirror pair as a separate vdev, i.e. probably 2x 3Tb drives, or to replace each drive with a larger one and resilver. This is because redundancy is per vdev, so if you lose a vdev you lose the pool. As I only accumulate about 500Gb a year, having to add in chunks of 3Tb at a cost of €400 a pop seems way overkill, never mind that I dislike losing the older drives. Adding a second vdev raises the pool's overall chance of data loss to 0.75% (0.5% + 0.25%), but against the roughly 25% chance of losing one of five unprotected drives, that's still a 33x improvement.

What about RAID-Z? For this you need a minimum of two storage units, with one failure tolerated. As one blog post I found suggests, you can partition up the three drives as follows:

  • 2x 2Tb partitions on 2Tb and 3Tb drives
  • 2x 1Tb partitions on 1Tb and 3Tb drives

As with the 1:1 mirroring solution, here you also get 3Tb of available space. Assuming a 5% chance of individual disk failure, the risk of data loss is 5% x 5% = 0.25%, or half the previous solution. Because ZFS can stripe, mirror or RAID-Z but not concatenate, the bottleneck moves onto the 3Tb drive, which gets striped over about 66% of its total area. It isn't anything like as bad as the earlier mirroring solution for bottlenecking, though, as the 4Tb (raw) vdev gets filled in preference to the 2Tb one.
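
The partitioned layout above would be created roughly like this (partition device names are illustrative - in practice you'd use stable /dev/disk/by-id paths):

```shell
# Sketch of the partitioned RAID-Z layout: two raidz vdevs in one pool.
# sda = 1Tb drive, sdb = 2Tb drive, sdc = 3Tb drive (illustrative names).
#   vdev 1: the 2Tb partitions on the 2Tb and 3Tb drives
#   vdev 2: the 1Tb drive plus a 1Tb partition on the 3Tb drive
zpool create -o ashift=12 tank \
    raidz /dev/sdb1 /dev/sdc1 \
    raidz /dev/sda1 /dev/sdc2
zpool status tank   # shows both raidz1 vdevs striped into one pool
```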

What about expansion? Well, all you can do is add another vdev same as with mirroring, because vdevs are immutable with ZFS. This sucks.

What about RAID-Z2? This requires at least three storage units, with two failures tolerated. However, interestingly, you could deliberately create a degraded pool, i.e. one which is effectively RAID-Z but can be "improved" to RAID-Z2 by adding a device - or you could have two parity devices with a "missing" data device, at the cost of degraded read/write latency. This is, as far as I can see, the only way of creating an expandable vdev in ZFS, though at the initial cost of sacrificing 66% of your storage capacity, i.e. you'd get 2Tb of available space now.
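
One way to create that deliberately degraded pool is the sparse-file trick (a sketch only - sizes, pool name and device paths are illustrative, and you'd want to test this carefully before trusting data to it):

```shell
# Sketch: build a RAID-Z2 vdev with a sparse file standing in for the
# missing third member, then take the file offline so the pool runs
# degraded but still tolerates one real drive failure.
truncate -s 3T /tmp/fake-disk        # sparse file, occupies almost no space
zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/drive-a /dev/disk/by-id/drive-b /tmp/fake-disk
zpool offline tank /tmp/fake-disk    # pool now degraded by design
rm /tmp/fake-disk
# Later, "improve" back to full RAID-Z2 with a real third drive:
#   zpool replace tank /tmp/fake-disk /dev/disk/by-id/drive-c
```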

In short, colour me not impressed. As much as ZFS is cool and everything, it isn't really suited to three device configurations because it was never intended for them - it's aimed at a dozen or so physical devices, where a 10-15% parity overhead is just right because it reduces a 60% chance of data loss (assuming 5% individual drive failure) to just 0.25% for RAID-Z or mirroring (assuming no failures during rebuild), and to just 0.0125% with RAID-Z2 (if across all drives). That's a huge win, and that's why RAID-whatever and ZFS make sense with lots of drives. With just three drives, the parity overhead is large and the inability to reconfigure is inconvenient and expensive; in short, ZFS isn't the right tool for this job.
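
The arithmetic above can be sanity-checked in a couple of lines, using the same back-of-envelope approximations as in the text (summing per-drive probabilities rather than computing exact binomials):

```shell
# Back-of-envelope data-loss odds, assuming p = 5% per drive per period.
awk 'BEGIN {
  p = 0.05
  printf "three unprotected drives: %.2f%%\n", 3*p*100       # ~15%
  printf "mirror of stripe vs 3Tb:  %.2f%%\n", 2*p*p*100     # 0.50%
  printf "RAID-Z / plain mirror:    %.2f%%\n", p*p*100       # 0.25%
  printf "RAID-Z2 (three-way):      %.4f%%\n", p*p*p*100     # 0.0125%
}'
```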

What we really need is the ability to do "block pointer rewrite", as online vdev reconfiguration is known in ZFS jargon. Sadly, there's not a lot of chance of seeing that outside Oracle's proprietary ZFS enhancements, which effectively means we won't see it at all in the foreseeable future, seeing as ZFS v28 is already two years out of date. I had very high hopes for BTRFS - indeed, my secure off-site replicated backup solution is based on BTRFS with two copies of everything stored and remotely replicated by DRBD, and BTRFS's design is much more flexible for the low drive count user than ZFS. However, as Chris Mason (lead BTRFS developer) left Oracle for greener pastures in June, my hopes there are gone, especially as it's not in any commercial company's interest to push either BTRFS or indeed ZFS for low drive count use cases - in the end, where's the (serious) money in solving home users' long term data storage issues? In short, ZFS v28 is as good as it's going to get for the foreseeable future, especially as Oracle are highly unlikely to release any of the post-v28 improvements to the public. In reality, with the loss of BTRFS momentum, ZFS is quite literally the only game in town. I guess I'm just going to have to take those 2x 3Tb expansions on the chin!

So, basically I think I've decided on RAID-Z for the existing 1Tb + 2Tb + 3Tb configuration - it's twice as safe as mirroring, without so much of the read/write bottleneck and load placed on one drive. Further expansion, though, would be in 2x 3Tb mirrors, because unlike RAID-Z, mirrors don't lose your data if you break a pool. At 500Gb/year, even that is two and a half years away, assuming that employment at RIM doesn't slow the rate of data acquisition down - which it very likely will. [Note added three months later: I didn't go with RAID-Z in the end, as mirroring is inherently more fault tolerant - if a unit fails, you simply replace it and resilver, whereas with RAID-Z it's a full rebalance which hammers the drives. Instead I "glued" the 1Tb and 2Tb drives together using Linux LVM to make them a fake 3Tb drive, then supplied the two 3Tb drives to KVM for FreeNAS to use as two ZFS storage units. FreeNAS has no idea it's in a virtual machine working with virtual hard drives, nor does it matter. It all "just works": even if write speeds are in the 100Mbit range, read speeds reach about 400Mbit, which is good enough.]
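
The LVM "glue" in that note amounts to a linear concatenation, roughly like this (device names and the volume group name are illustrative, not my actual configuration):

```shell
# Sketch: concatenate the 1Tb and 2Tb drives into one ~3Tb linear
# logical volume, then hand that device to KVM as a virtual disk for
# FreeNAS. sda/sdb and "fake3tb" are placeholder names.
sudo pvcreate /dev/sda /dev/sdb              # 1Tb and 2Tb drives
sudo vgcreate fake3tb /dev/sda /dev/sdb      # one volume group spanning both
sudo lvcreate -l 100%FREE -n disk0 fake3tb   # one big linear LV
ls -l /dev/fake3tb/disk0                     # attach this to the KVM guest
```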

Let's just hope that something much better comes along in two years' time. For now, ZFS as the only game in town will have to do. Be happy y'all!


Contact the webmaster: Niall Douglas @ webmaster2<at symbol> (Last updated: 2012-08-08 17:40:00 +0000 UTC)