Tim Burton – Media IT My ramblings covering NLEs / Servers / Storage and the odd bit of photography.

15 Jan 2010

The death of RAID controllers?

In any data center you will see a raft of disk arrays. The majority either have RAID controllers (hopefully two of them) or are dumb JBODs behind a RAID card along the lines of an LSI MegaRAID.  The theory is the same in both cases: abstract the physical disks with a RAID algorithm so they appear to the host as one large LUN.  The filesystem has no awareness of the underlying structure of the stripe or the physical hardware, and the majority of filesystems out there wouldn't do anything with that information if it were presented anyway.  Then there is software RAID (LVM, Windows RAID etc.), where the hardware array is replaced with a simple JBOD and HBA, but this simply moves the RAID overhead a notch up the path, onto the host OS, and in doing so ties access to one node only.
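
As a rough illustration of that abstraction, here is a minimal Python sketch of how a controller might map a logical block on the LUN to a physical disk. It assumes plain striping with no parity, and the function name, the four disks and the 16-block stripe unit are all made up for the example; real controllers add parity rotation and much more.

```python
def locate_block(lba, num_disks, stripe_unit_blocks):
    """Map a logical block address on the LUN to (disk index, block offset on that disk).

    Plain striping only (no parity); all parameters are illustrative.
    """
    stripe_unit = lba // stripe_unit_blocks      # which stripe unit the block falls in
    offset_in_unit = lba % stripe_unit_blocks    # position inside that stripe unit
    disk = stripe_unit % num_disks               # stripe units rotate across the disks
    row = stripe_unit // num_disks               # how many full stripes precede it
    return disk, row * stripe_unit_blocks + offset_in_unit


# The host just asks for block 70 of the LUN; the controller works out where it lives.
print(locate_block(70, num_disks=4, stripe_unit_blocks=16))
```

The host only ever sees the `lba` side of this mapping, which is why the filesystem cannot reason about stripes or individual drives.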

Enter, stage left, ZFS and RAIDZ.  Heralded as one of the most advanced filesystems available today, ZFS is a real gem, and it was a real shame when Apple scrapped support for it in 10.6.  All striping is handled in the filesystem with pools, and SSDs can be added to a pool for acceleration through caching.  Jeff Bonwick has written a great blog post about RAIDZ and the RAID5 write hole (the reason we have battery backup on controllers) and also a post on End to End Data Integrity of ZFS, covering re-silvering and how ZFS verifies its structure.  The Sun X4500/X4540 is the best example of this filesystem in action: a single 4U box holding 48 disks with big bandwidth to the network, yet no hardware RAID cards, just simple LSI SAS HBAs.  It's all very clever, however ZFS is a DAS filesystem, not a SAN filesystem.
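
For anyone who hasn't met the write hole before, here is a small Python sketch of the general RAID5 problem. It is only an illustration of XOR parity with made-up four-byte blocks, not of how ZFS or any particular controller is implemented.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus their parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Normal case: lose d1 and rebuild it from the survivors and the parity.
recovered = xor_blocks(d0, d2, parity)
print(recovered == d1)

# Write hole: d0 is rewritten, but power fails before the parity is updated.
d0_new = b"ZZZZ"
stale_parity = parity

# If d1 is lost later, the rebuild silently produces the wrong data.
bad = xor_blocks(d0_new, d2, stale_parity)
print(bad == d1)
```

The data and parity writes are not atomic, so a controller needs battery-backed cache to finish the parity update after a power loss.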

The first time I thought about the future of RAID controllers as we know them was when testing some DDN gear.  I noticed on the back of their 6620 60 disk array that the FC ports on the controllers look suspiciously like QLogic fibre HBA PCIe cards.  On further inspection it seems the whole controller is a custom x86 server, and the host connection is just a matter of adding the relevant card (a 10Gb NIC for iSCSI, for example).  Their 9900 series goes one step further: using a 4 socket (Dell) server, they can run the RAID on 2 of the sockets and allow a guest filesystem to run on the other two.  Nice and closely coupled, however it still doesn't give the filesystem stripe / block level awareness.

Isilon's hardware could be compared to the X4500 in topology at a node level, however they expand their pool of disks beyond one node into a cluster, with InfiniBand stitching it all together.  Their 'RAID' means they can protect against the loss of not only drives, but whole nodes.  The power of the whole cluster can be leveraged in rebalancing to protect the data after a failure, in contrast to the time taken to rebuild a 2TB SATA disk in a RAID5/6 array, where you are limited by the throughput of that one drive and by completing the XOR.  Nodes can be easily added to and removed from the cluster and it is very easy to manage.  The cons are that you can't have a client connected to the IB fabric; clients must connect via Gb Ethernet using normal protocols such as NFS or CIFS/SMB.  For higher speed access you need the 10GbE accelerators, and you are still bound by the speed limitations of the aforementioned protocols.  Isilon is perfect where you have lots of GbE attached nodes wanting to hit the cluster (render nodes). Where it falls down is where you want a few really high performance workstations or servers sharing data (uncompressed HD on an ingest or edit system), which requires lots of accelerator nodes to be added.  They have killed RAID controllers in their system, and harnessed the whole cluster topology into a very capable and easy to use product.
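
To put some rough numbers on that rebuild comparison, here is a back-of-envelope Python sketch. The 100 MB/s sustained drive throughput and the 20 drives sharing the work are my assumptions for illustration, not measured or vendor figures, and it ignores parity calculation and competing host I/O.

```python
def rebuild_hours(capacity_tb, throughput_mb_s):
    """Best-case hours to rewrite capacity_tb of data at throughput_mb_s (decimal units)."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / throughput_mb_s / 3600


# Traditional RAID5/6: everything funnels into the one replacement drive.
single_drive = rebuild_hours(2.0, 100.0)

# Cluster-style rebalance: assume 20 drives each absorb a share of the writes.
spread_over_20 = rebuild_hours(2.0, 100.0 * 20)

print(f"single drive: {single_drive:.1f} h, spread over 20 drives: {spread_over_20:.2f} h")
```

Even in this idealised form the single-drive rebuild takes hours during which the array is exposed, and the gap only widens as drives get bigger.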

My final piece is on IBM's GPFS, which like Isilon's OneFS is a cluster filesystem. However, its clients access the cluster via a proprietary protocol, where data is read from and written to a pool of Linux servers over 10GbE DCE (Data Center Ethernet) rather than channeled through one server to the rest of the cluster on the IB backend.  A single client can therefore get very high transfer speeds by communicating in parallel, perfect for a network of very high power workstations.  They currently still use hardware RAID to present 10 disks (8+2 RAID6) to the GPFS servers as a single LUN, and therefore still have the rebuild time issues.  However, a little birdie tells me that soon they will be setting all their arrays to run as JBODs, presenting each drive individually, and the filesystem will handle the protection.  The key difference between this and Isilon, though, is that the disks will be presented to all the servers on the fabric, not just one local node, which means the bandwidth is a lot more flexible.  It's not announced yet, but I think 2010 could be a very exciting year for large storage systems.

To summarise: for DAS we have ZFS, for clustered NAS we have OneFS, in the SAN we are seeing the controllers being commoditised, and with the arrival of a new GPFS we are even seeing JBOD at a SAN level.  There is no need for a layer between the disks and the hosts, so could we kill the JBOD off as well by bringing Ethernet or iSCSI right to the HDD controllers?  3.5" disks with an RJ45 and PoE?

Comments (2) Trackbacks (0)
  1. http://twit.tv/floss58

    Very good podcast from FLOSS Weekly on ZFS.

  2. Isilon is a symmetrical cluster, so it's always going to struggle, as they all do, with maintaining performance whilst maintaining coherency and locking. For sure it will scale to a certain point, but in the end the backend network traffic gets you, (n-1)^2, and performance will begin to tail off. Ultimately, as it scales to really large node counts, it's effectively becoming an archive system.

