Notes on Building a Linux Storage Server

Martin Smith
martin at spamcop.net
V0.7 February 2005

Introduction

Recently I've been setting up MythTV and this has made me re-evaluate how much storage is 'enough'. I also realised that I have free space scattered across several machines and it's often tricky to find enough space to do things even though there's enough in total.

I decided to address this by putting in a machine primarily dedicated to hosting storage. I started by looking at what could be bought off the shelf. The solutions I saw had several drawbacks.

One was cost. It seems that most people think that if you're buying storage you must be a business with buckets of money to spend. It seemed to me that I was unlikely to get a storage server with 750GB - 1TB of total space for under £2000.

There are lower cost solutions aimed at home use but they seemed to be very limited, sometimes only having a single drive. I knew I could build something from scratch for less money that would be more flexible and that could also do other server tasks.

Another issue was that some of the off the shelf solutions are based on an embedded version of Windows 2000 Server. There's no way I want to have Windows on my network in any kind of server role and I also see no reason to pay for software for such a simple task when there are better free alternatives.

The requirements I put together were quite simple:
Before coming up with my shopping list I then had to answer the following questions:

SCSI or IDE

The IDE v. SCSI question has been a long running and partly religious war. In this case the job was made slightly easier as I could see no way that a SCSI solution could be built within the budget or that would fit within a standard ATX tower case.

I got prices for some drives and worked out the total. A 36GB SCSI drive would cost about £100 and I'd need 20 of them to meet my minimum space target. A 76GB drive costs about £220 and I'd need 10 of them. A 146GB drive costs £440 at current prices and I'd need five. All of these add up to at least £2000 just for the drives and I'd also need to allow for controller cards. Interestingly the cost per megabyte is almost exactly the same regardless of drive density.

With IDE a 250GB drive cost about £200 at the time I built the machine and three of them will meet the minimum space target I had in mind. Since I originally wrote this prices have fallen further and drives can now be picked up for around £130 even over here in old rip-off blighty. I bet people in more consumer friendly countries can get them for even less.
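The arithmetic behind that comparison can be laid out in a few lines of Python. This is only a sketch using the prices and drive counts quoted above; the function name and the choice to work in pounds per gigabyte rather than per megabyte are my own.

```python
# Drive options quoted in the text: (label, capacity in GB, price in GBP, drives needed)
OPTIONS = [
    ("SCSI 36GB", 36, 100, 20),
    ("SCSI 76GB", 76, 220, 10),
    ("SCSI 146GB", 146, 440, 5),
    ("IDE 250GB", 250, 200, 3),
]


def summarise(options):
    """Return (label, total_gb, total_cost, cost_per_gb) for each option."""
    rows = []
    for label, size_gb, price, count in options:
        total_gb = size_gb * count
        total_cost = price * count
        rows.append((label, total_gb, total_cost, round(total_cost / total_gb, 2)))
    return rows


if __name__ == "__main__":
    for label, total_gb, total_cost, per_gb in summarise(OPTIONS):
        print(f"{label:<11} {total_gb:>4} GB  £{total_cost:>4}  £{per_gb:.2f}/GB")
```

On these figures the three SCSI options all land between roughly £2.80 and £3.00 per gigabyte, while the IDE option comes out at about £0.80.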

It was soon clear that large IDE drives offered the only way I could meet my goals and I didn't spend any more time considering the SCSI option. It may have led to a better performing solution but that was academic in the circumstances.

Besides, a lot of effort has been going into the IDE drive market lately. They may still slightly lag SCSI in some areas but it's not as clear cut as it used to be and it's becoming much more difficult to justify the extra cost of SCSI drives. It's hard to escape the thought that they are now more expensive because everyone expects them to be rather than this being a true cost of their manufacture.

Parallel or Serial ATA

Having decided to go the IDE route the next question was which type of interface to get. This isn't necessarily a simple question as it impacts the choice of motherboard among other things.

The results of my research were that people were using SATA with Linux but some chipsets were problematic. There didn't seem to be any real performance advantage to the current generation of SATA drives but they do simplify cabling and this can also help with cooling the system.

It seemed that the two most popular on board chipsets are the Intel ICH5 and the Promise 2037x. The Promise chips connect via PCI, the Intel ones via the southbridge which should boost performance. Neither of these chipsets can do RAID 5. Each of them supports 2 SATA and 2 PATA devices. I understood from what I read that the Promise chipset cannot handle a mixture of drive types but I didn't test this in practice.

I could see no clear advantage either way, assuming the motherboard had a well supported chipset. I could also see that a mixture of drives could prove to be necessary. Some boards provide four SATA connections but they can be split between two different controller chips. This can reduce the number of available connections. This was the biggest pitfall I fell into as I'll explain later.

Software or Hardware RAID

I regarded RAID as essential as there is no cost effective way for a home user to back up hundreds of gigabytes of data. I could have looked at DLT tape solutions but they'd cost probably double what the entire machine did and another pile of cash for tapes. Somebody could clean up here if they could work out a way to reduce the ridiculous cost of large tape drives.

This really turned into another easy decision. I'd been burned before with the low cost IDE RAID solutions. My opinion now is that they don't offer anything more than the Linux software RAID driver does. In fact some offered less as they support fewer RAID levels than the kernel offers. Few of them do RAID 5 for example.

There are some good quality IDE RAID controllers available e.g. from 3Ware but they cost in the region of £300. If I ended up with a split between PATA and SATA drives I might also have needed two controller cards. That would have blown the budget out of the water.

I decided that software RAID would be adequate for my needs. It would offer the most flexibility whilst saving the cost of one or more additional controller cards.

Choosing Components

Having made the overall decisions I set about selecting the components that I'd be using to build the machine. This resulted in the following list:
Some people have commented that this is over specced, and indeed it is if all you want is a storage server. I didn't have any space left to put an extra machine so it's also doubling up as a general purpose server as well as serving up video.

This shopping list put me just slightly over budget but the extra did buy me a fair bit of extra performance. It would have been easy enough to scale this back to e.g. a Celeron or Duron, less memory and a cheaper case and motherboard.

The choice of Intel wasn't religious either. I have a mixture of Intel and AMD machines on my network. I believed that the P4 would run cooler and with several hot drives in the case I wanted to try and minimize the heat from other components.

I've always been happy with AOpen cases. This one is no exception, combining a decent build quality with a sensible price. The one I got is black and silver and looks good as well.

I had the following spare parts in the cupboard that I planned to use:
Believing that covered everything I planned the following disk layout.

Promise PATA Controller
Intel PATA Controller
Intel SATA Controller
I didn't realise it at the time but this sensible sounding layout was a mistake as we'll see later. The rationale was sound enough though. A separate disk for the operating system allowed me to avoid all the potential horrors of booting from RAID devices. As I wanted to use RAID 5 this could get very complex. I'd have had to create a separate RAID 1 partition on two of the disks and this looked like a potential nightmare.

I didn't mind not having RAID for the operating system disk as the machine gets backed up every night by my Amanda server. If the OS disk failed then I'd just drop in a new one and restore the system from a full backup tape.

I also wanted to avoid having the disks in the array sharing controllers. If more than one disk in a RAID 5 setup fails then the array is dead. Putting two disks on the same IDE channel is just asking for a failing disk to lock up the bus and make them both inaccessible. In this situation all the effort setting up RAID is wasted and your data is toasted.

It's important to remember this when working on the system or when replacing a failed disk. If you pull the wrong disk or otherwise boot the machine with 2 or more disks missing from the RAID 5 array e.g. because you forgot to reconnect them then the array fails. A double check before rebooting can save a lot of aggravation.

Choosing an Operating System

My initial plan was to make this machine a test installation for the Fedora Core 1 release of Linux. I had several machines running RedHat 9.0 and I needed to decide what to do with them.

I wiped the old contents of the 40GB disk and installed Fedora over the network. This installation was extremely fast and easy. The main irritation was that they made it hard to use ReiserFS. I had to exit the install and reboot with the correct 'linux reiserfs' option to enable this. I find this very annoying as I've used ReiserFS for years on many machines and I'm very happy with it. Note to Fedora: please put ReiserFS back in the default filesystem options. I can make my own mind up thanks.

I had some more second thoughts about using Fedora when I tried to use the up2date application. I got all sorts of XML and RPM database errors and it kept locking up. I've also seen some of this on RedHat 9.0 lately and the expiring SSL certificate farce was another inconvenience. RedHat have been clocking up a few black marks in my estimation lately. Getting an email saying they didn't want to take my money for RHN any more was something else I didn't like.

Also around this point the IBM 40GB drive started to make some very strange noises at power on. The BIOS kept telling me the disk was faulty in that inaudible voice that motherboards have these days. I can never tell exactly what they say, even when I put my ear next to the speaker.

I ran the Drive Fitness Test utility on it and it passed but it was making me very nervous. It looked like some sort of intermittent mechanical seeking problem. I decided to replace it and bought an 80GB drive. The only one I could easily obtain at the time was a Western Digital Caviar model. I had real doubts about this as the last WD drive I bought managed to survive a mere two weeks before dying with a mountain of read / write errors. I'd heard that their more recent products were much improved so I thought I'd risk it and give them another chance. I haven't had any problem with it yet and it's been in there for nine months now.

This was when I discovered the mistake in my original disk layout plans. The Promise 20378 chip turned out to only be supported under Linux for SATA. That meant the drive I'd connected to the third IDE channel on the motherboard was not visible. That was very annoying and caused me to say a few bad words. The only way out I could see was to double up the operating system disk with one of the array disks on the primary IDE channel. That wasn't going to be ideal but it would still mean that the array disks would be on separate channels. I just hoped that it wouldn't have too much of an impact on performance.

While I think the mistake I made was understandable, it was a reminder of two things. Firstly that it's vital to read motherboard and chipset specifications in extreme detail. It also told me that despite all the advances Linux has made we're still second class citizens sometimes when it comes to some hardware support. A Windows user would be able to use all the IDE channels on the board, why shouldn't I be able to?

Binary drivers from manufacturers are one option but they often have their own problems, particularly for storage. I'd be very unlikely to use one if there was a chance that I might upgrade my kernel and suddenly find the array inaccessible.

Having got tired of rebuilding the RPM database I decided to ditch Fedora and use Gentoo Linux for this machine. I'd already installed it on my desktop and laptop and been impressed by it. This was its first chance to run a server for me.

I should probably note at this point that I've since installed Fedora on two other machines without seeing the problems I had with it on this one so I don't know why they occurred.

I partitioned the 80GB disk for an LVM install with a layout that looked like this:
All my machines now use LVM. It provides a lot of flexibility for a very small overhead. I decided against having a separate /boot partition. This is another slightly religious area. I find them fiddly and annoying and often forget to mount them when I'm building a new kernel. As I believe their original rationale of getting round BIOS limitations is no longer relevant I don't bother with them.

During the Gentoo install I set up logical volumes for /tmp, /var, /usr, /home, /opt and /usr/portage/distfiles. The last is Gentoo specific and allows me to easily avoid backing up the package cache while also stopping it filling up /usr. I made the partitions pretty big and still ended up with 50GB of uncommitted space that can be assigned to any existing or new volume as the need arises.

The Gentoo install took a very long time. I'd started from a stage one install and was building everything from scratch. If time had been a real factor I could have started with a package based install. I wanted total control of the software that got installed and to have it optimised for the machine and I was prepared to wait. Once the initial set up was done most of the time was spent in waiting for things to compile, rather than being sat in front of the machine fiddling with it.

Gentoo does not have the same sort of installation as some other distributions as it involves a good deal of manual effort. The installers for distributions like Fedora and SuSE can set up RAID / LVM configurations during the installation process using graphical configuration tools. This can make the process easier but the concepts still need to be understood.

RAID Options

I now present my opinionated guide to different RAID levels. For a much more detailed explanation try this web site.

Level                       Min Disks  Good For
Concatenation               2          Nothing, use LVM for this functionality.
Striping (RAID 0)           2          Performance. Useful for transient video data or machines that are regularly fully backed up. Increases data loss when there's a drive failure.
Mirroring (RAID 1)          2          Reliability. Survives a single disk failure with good performance.
Stripe w. Parity (RAID 5)   3          Reliability. Survives a single failure but with degraded performance. Lower cost than mirroring.
Mirrored Stripe (RAID 0+1)  4          Increases performance and reliability at additional cost.

Setting Up LVM + RAID

I'd decided on a RAID 5 volume using the three disks. This would give me about 500GB of usable space and allow for recovery of a single disk failure.
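The reason a three disk RAID 5 array can lose one disk and carry on is that the parity block in each stripe is the XOR of the data blocks. The following is a toy illustration of that idea only. It does not model the way the Linux RAID driver rotates parity across the disks or its chunk layout, and the block contents and function name are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a three-disk array: two data blocks plus one parity block.
data_a = b"MythTV  "
data_b = b"storage!"
parity = xor_blocks([data_a, data_b])

# Simulate losing the disk that held data_b and rebuild it from the survivors.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)
```

Lose a second block from the same stripe and there is nothing left to XOR against, which is why two failed disks kill the array.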

An alternative I considered was RAID 0+1 using four 160GB disks. This would have given me 320GB of usable space and allowed for the recovery of one or, depending on which disks fail, two failed disks. This would also have given me better performance, but this wasn't the overriding criterion for my system and would have been over budget.
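The usable space figures for the two layouts follow from simple rules: RAID 5 gives up one disk's worth of space to parity and RAID 0+1 gives up half to the mirror. A small sketch, assuming equal sized disks and using a function name and level labels of my own choosing:

```python
def usable_capacity(level, disk_gb, disks):
    """Usable space in GB for equal-sized disks at a given RAID level."""
    if level == "raid0":
        return disk_gb * disks
    if level == "raid1":
        return disk_gb
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return disk_gb * (disks - 1)
    if level == "raid0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, at least four")
        return disk_gb * disks // 2
    raise ValueError(f"unknown level: {level}")


print(usable_capacity("raid5", 250, 3))    # three 250GB disks
print(usable_capacity("raid0+1", 160, 4))  # four 160GB disks
```

These are nominal figures in the drive makers' decimal gigabytes, so the space the system actually reports will be somewhat lower.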

The configuration I have set up is LVM on top of RAID then using ReiserFS on top of that. The following diagram should show how the various levels of storage entities stack up.

RAID and LVM diagram

Working up from the bottom of the diagram we have the physical disks, in my case three of them. These are formed into a metadevice /dev/md0 by the RAID driver. This is the level that is providing the failure recovery support in this configuration. Above that the /dev/md0 is made into an LVM PV (Physical Volume) and used to form a VG (Volume Group). Space from this volume group is then assigned to a number of LVs (Logical Volumes) and these are the devices on which filesystems are created.

Software RAID was easy to set up. There are two sets of tools available to do this: raidtools and mdadm. Because I didn't realise that mdadm existed I initially used raidtools. If I was doing it again I'd probably use the newer alternative. My /etc/raidtab file looked like this:
raiddev /dev/md0
        raid-level      5
        nr-raid-disks   3
        nr-spare-disks  0
        persistent-superblock 1
        parity-algorithm left-symmetric
        chunk-size      128
        device          /dev/ide/host0/bus0/target1/lun0/part1
        raid-disk       0
        device          /dev/ide/host0/bus1/target0/lun0/part1
        raid-disk       1
        device          /dev/ide/host2/bus0/target0/lun0/part1
        raid-disk       2
I created the /etc/raidtab above listing the three disks and ran mkraid /dev/md0. This succeeded and I was then able to watch the array reconstruction progress in /proc/mdstat. I could see it was going to take about 2 hours for the array to have its parity blocks initialised. It can be used during this time but I decided to let the process finish. If a disk fails during reconstruction then data may be lost.
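For anyone starting with mdadm instead, the same array can be described on a single command line. I didn't build mine this way, so treat the following as an untested sketch: it reads the raidtab shown above and prints what I believe is the equivalent mdadm --create invocation. It assumes a single array with the raid-disk entries in ascending order, and it leaves out nr-spare-disks and persistent-superblock on the assumption that mdadm's defaults (no spares, superblock written) match. The function name is my own.

```python
import shlex

RAIDTAB = """\
raiddev /dev/md0
        raid-level      5
        nr-raid-disks   3
        nr-spare-disks  0
        persistent-superblock 1
        parity-algorithm left-symmetric
        chunk-size      128
        device          /dev/ide/host0/bus0/target1/lun0/part1
        raid-disk       0
        device          /dev/ide/host0/bus1/target0/lun0/part1
        raid-disk       1
        device          /dev/ide/host2/bus0/target0/lun0/part1
        raid-disk       2
"""


def raidtab_to_mdadm(text):
    """Translate a single-array raidtab into an mdadm --create argument list."""
    settings = {}
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        key, value = fields
        if key == "device":
            devices.append(value)
        elif key != "raid-disk":  # assumption: devices are listed in slot order
            settings[key] = value
    return [
        "mdadm", "--create", settings["raiddev"],
        f"--level={settings['raid-level']}",
        f"--raid-devices={settings['nr-raid-disks']}",
        f"--chunk={settings['chunk-size']}",
        f"--layout={settings['parity-algorithm']}",
        *devices,
    ]


if __name__ == "__main__":
    # Print only: actually running the command needs root and overwrites the member disks.
    print(shlex.join(raidtab_to_mdadm(RAIDTAB)))
```

The script deliberately only prints the command so that it can be checked by eye before anything touches the disks.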

Once the array had finished rebuilding itself I created a new LVM physical volume on it and a new volume group for video storage. It seems unlikely that I'll be putting any more disks in this machine but using LVM is still worthwhile as it makes it possible to parcel out chunks of the space as necessary.

Running vgdisplay shows the following:
--- Volume group ---
VG Name               system
VG Access             read/write
VG Status             available/resizable
VG #                  0
MAX LV                256
Cur LV                7
Open LV               7
MAX LV Size           2 TB
Max PV                256
Cur PV                1
Act PV                1
VG Size               53.47 GB
PE Size               32 MB
Total PE              1711
Alloc PE / Size       704 / 22 GB
Free  PE / Size       1007 / 31.47 GB
VG UUID               94G08x-Q6uf-2t0j-XAkj-uUf0-YKWh-Vk517n

--- Volume group ---
VG Name               video
VG Access             read/write
VG Status             available/resizable
VG #                  1
MAX LV                256
Cur LV                1
Open LV               1
MAX LV Size           2 TB
Max PV                256
Cur PV                1
Act PV                1
VG Size               465.69 GB
PE Size               64 MB
Total PE              7451
Alloc PE / Size       4800 / 300 GB
Free  PE / Size       2651 / 165.69 GB
VG UUID               vpj4Qg-OEJS-Vocp-i6zZ-5P36-E29b-EFPGZl
I created a 300 GB volume, ran mkreiserfs on it and mounted it. A quick edit of /etc/exports and the Samba smb.conf and I had a working network storage server.
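The sizes in the vgdisplay output can be cross checked from the extent counts, since LVM hands out space in whole physical extents (PEs). A quick sketch using the figures for the video volume group, on the assumption that the "GB" LVM reports are binary gigabytes of 1024 MB:

```python
def pe_to_gb(extents, pe_size_mb):
    """Convert a count of LVM physical extents to binary gigabytes."""
    return extents * pe_size_mb / 1024


# Figures copied from the "video" volume group in the vgdisplay output above.
PE_SIZE_MB = 64
total, allocated, free = 7451, 4800, 2651

print(f"VG size   {pe_to_gb(total, PE_SIZE_MB):.2f} GB")
print(f"Allocated {pe_to_gb(allocated, PE_SIZE_MB):.2f} GB")
print(f"Free      {pe_to_gb(free, PE_SIZE_MB):.2f} GB")
```

The same sum with 32 MB extents reproduces the figures for the system volume group. It also suggests why the nominal 500GB array shows up as about 465 GB: drive makers count in decimal gigabytes while LVM appears to count in binary ones.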

Conclusions

I managed to build a storage server to my requirements for just over my original budget, though it would have been easy enough to meet it by reducing the spec. The machine also now hosts my printer and my home automation X10 interface and is a pleasingly fast general purpose server. I've also started using it as a Postgres server for a development project in addition to its video serving duties.
 
The machine isn't as quiet as it might be because I put in some extra fans. The Hitachi drives also make occasional strange squeaking noises as they recalibrate. It sometimes sounds like there's an animal trapped inside the case. These drives have now been running for a year and have given me no problems (well until I just tempted fate I suppose). This is a welcome relief after the problems I've had with some models of Deskstar drives before.

The only real advantage a dedicated storage solution would have had is fitting into a much smaller case. Compared to the disadvantages of it being based on Windows 2000 and costing a lot more that's something I can live without.

The latest change I've made to it is to add another 250GB drive. I reckon I could still get another 2-3 drives in there if it proves necessary, and as we all know it probably will.


Last Updated 7th February 2005.