Notes on Building a Linux Storage Server
Martin Smith
martin at spamcop.net
V0.7 February 2005
Introduction
Recently I've been setting up MythTV and this has made me re-evaluate
how much storage is 'enough'. I also realised that I have free space
scattered across several machines and it's often tricky to find enough
space to do things even though there's enough in total.
I decided to address this by putting in a machine primarily dedicated
to hosting storage. I started by looking at what could be bought off
the shelf. The solutions I saw had several drawbacks.
One was cost. It seems that most people think that if you're buying
storage you must be a business with buckets of money to spend. It
seemed to me that I was unlikely to get a storage server with 750GB to
1TB of total space for under £2000.
There are lower cost solutions aimed at home use but they seemed to be
very limited, sometimes only having a single drive. I knew I could
build something from scratch for less money that would be more flexible
and that could also do other server tasks.
Another issue was that some of the off the shelf solutions are
based on an embedded version of Windows 2000 Server. There's no way I
want to have Windows on my network in any kind of server role and I
also see no reason to pay for software for such a simple task when
there are better free alternatives.
The requirements I put together were quite simple:
- Must cost around £1000. I'd be prepared to go slightly higher if
justified.
- Must have a minimum of 750GB raw space (before RAID overhead).
- Must fit a standard ATX case
- Must have RAID support, preferably RAID 5
- Must support NFS and SMB access
- Must be usable for other general purpose server tasks, specifically
printing.
Before coming up with my shopping list I then had to answer the
following questions:
- SCSI or IDE
- Parallel or Serial ATA
- Software or Hardware RAID
SCSI or IDE
The IDE v. SCSI question has been a long running and partly religious
war. In this case the job was made slightly easier as I could see no
way that a SCSI solution could be built within the budget or fitted
within a standard ATX tower case.
I got prices for some drives and worked out the total. A 36GB SCSI
drive would cost about £100 and I'd need 20 of them to meet my minimum
space target. A 76GB drive costs about £220 and I'd need 10 of them. A
146GB drive costs £440 at current prices and I'd need five. All of
these add up to £2000 or more just for the drives and I'd also need to
allow for controller cards. Interestingly the cost per megabyte is
almost exactly the same regardless of drive density.
With IDE a 250GB drive cost about £200 at the time I built the machine
and three of them will meet the minimum space target I had in mind.
Since I originally wrote this prices have fallen further and drives can
now be picked up for around £130 even over here in old rip-off blighty.
I bet people in more consumer friendly countries can get them for even
less.
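As a rough check on the arithmetic above, here is a small Python sketch that totals each option. The sizes, prices and drive counts are the ones quoted in the text (the IDE figure uses the £200 build-time price); the function and variable names are just for illustration.

```python
# Drive options quoted in the text. The drive counts are the author's
# own figures, not recomputed here.
OPTIONS = [
    # (label, size in GB, price per drive in GBP, number of drives)
    ("SCSI 36GB", 36, 100, 20),
    ("SCSI 76GB", 76, 220, 10),
    ("SCSI 146GB", 146, 440, 5),
    ("IDE 250GB", 250, 200, 3),
]


def summarise(options):
    """Return (label, raw GB, total cost, cost per GB) for each option."""
    rows = []
    for label, size_gb, price, count in options:
        rows.append((label, size_gb * count, price * count, price / size_gb))
    return rows


for label, raw_gb, total, per_gb in summarise(OPTIONS):
    print(f"{label:<11} {raw_gb:>4} GB raw  GBP {total:>4}  GBP {per_gb:.2f}/GB")
```

On these figures the three SCSI options all land between roughly £2.80 and £3.00 per GB, against £0.80 per GB for the IDE drives.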
It was soon clear that large IDE drives offered the only way I could
meet my goals and I didn't spend any more time considering the SCSI
option. It may have led to a better performing solution but that
was academic in the circumstances.
Besides, a lot of effort has been going into the IDE drive market
lately. IDE drives may still slightly lag SCSI in some areas but it's
not as clear cut as it used to be and it's becoming much more difficult
to justify the extra cost of SCSI drives. It's hard to escape the
thought that SCSI drives are now more expensive because everyone
expects them to be, rather than because they truly cost more to
manufacture.
Parallel or Serial ATA
Having decided to go the IDE route the next question was which type of
interface to get. This isn't necessarily a simple question as it
impacts the choice of motherboard among other things.
The results of my research were that people were using SATA with Linux
but some chipsets were problematic. There didn't seem to be any real
performance advantage to the current generation of SATA drives but they
do simplify cabling and this can also help with cooling the system.
It seemed that the two most popular on board chipsets were the Intel
ICH5 and the Promise 2037x. The Promise chips connect via PCI, the
Intel ones via the southbridge, which should boost performance. Neither
of these chipsets can do RAID 5. Each of them supports 2 SATA and 2
PATA devices. I understood from what I read that the Promise chipset
cannot handle a mixture of drive types but I didn't test this in
practice.
I could see no clear advantage either way, assuming the motherboard had
a well supported chipset. I could also see that a mixture of drives
could prove to be necessary. Some boards provide four SATA connections
but they
can be split between two different controller chips. This can reduce
the number of available connections. This was the biggest pitfall I
fell into as I'll explain later.
Software or Hardware RAID
I regarded RAID as essential as there is no cost effective way for a
home user to back up hundreds of gigabytes of data. I could have looked
at DLT tape solutions but they'd probably cost double what the entire
machine did, plus another pile of cash for tapes.
Somebody could clean up here if they could work out a way to reduce the
ridiculous cost of large tape drives.
This really turned into another easy decision. I'd been burned before
with the low cost IDE RAID solutions. My opinion now is that they don't
offer anything more than the Linux software RAID driver does. In fact
some offer less as they support fewer RAID levels than the kernel does.
Few of them do RAID 5 for example.
There are some good quality IDE RAID controllers available e.g. from
3Ware but they cost in the region of £300. If I ended up with a split
between PATA and SATA drives I might also have needed two controller
cards. That would have blown the budget out of the water.
I decided that software RAID would be adequate for my needs. It would
offer the most flexibility whilst saving the cost of one or more
additional controller cards.
Choosing Components
Having made the overall decisions I set about selecting the components
that I'd be using to build the machine. This resulted in the following
list:
Some people have commented that this is over specced, and indeed it is
if all you want is a storage server. I didn't have any space left to
put an extra machine so it's also doubling up as a general purpose
server as well as serving up video.
This shopping list put me just slightly over budget but the extra did
buy me a fair bit of extra performance. It would have been easy enough
to scale this back to e.g. a Celeron or Duron, less memory and a
cheaper case and motherboard.
The choice of Intel wasn't religious either. I have a mixture of Intel
and AMD
machines on my network. I believed that the P4 would run cooler and
with several hot
drives in the case I wanted to try and minimize the heat from other
components.
I've always been happy with AOpen cases. This one is no exception,
combining a decent build quality with a sensible price. The one I got
is black and silver and looks good as well.
I had the following spare parts in the cupboard that I planned to use:
- HP CD ROM Drive
- IBM 40GB Deskstar (for operating system)
- Floppy Drive
- Cooling Fans
Believing that covered everything I planned the following disk layout.
Promise PATA Controller
Intel PATA Controller
- Array Disk 1 (on Primary Channel)
- Array Disk 2 (on Secondary Channel)
Intel SATA Controller
I didn't realise it at the time but this sensible sounding layout was a
mistake as we'll see later. The rationale was sound enough though. A
separate disk for the operating system allowed me to avoid all the
potential horrors of booting from RAID devices. As I wanted to use RAID
5 this could get very complex. I'd have had to create a separate RAID 1
partition on two of the disks and this looked like a potential
nightmare.
I didn't mind not having RAID for the operating system disk as the
machine gets backed up every night by my Amanda server. If the OS disk
failed then I'd just drop in a new one and restore the system from a
full backup tape.
I also wanted to avoid having the disks in the array sharing
controllers. If more than one disk in a RAID 5 setup fails then the
array is dead. Putting two disks on the same IDE channel is just asking
for a failing disk to lock up the bus and make them both inaccessible.
In this situation all the effort setting up RAID is wasted and your
data is toasted.
It's important to remember this when working on the system or when
replacing a failed disk. If you pull the wrong disk or otherwise boot
the machine with 2 or more disks missing from the RAID 5 array e.g.
because you forgot to reconnect them then the array fails. A double
check before rebooting can save a lot of aggravation.
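The reasoning about shared channels can be sketched as a small Python model. This is only an illustration: the channel labels are made up rather than real device paths, and it assumes the worst case described above, where a locked-up channel takes every disk on it offline.

```python
# Map each array disk to the channel it hangs off.
# Channel names are illustrative labels, not real device paths.
SEPARATE = {"disk1": "ide-primary", "disk2": "ide-secondary", "disk3": "sata-0"}
SHARED = {"disk1": "ide-primary", "disk2": "ide-primary", "disk3": "sata-0"}

RAID5_MAX_MISSING = 1  # RAID 5 tolerates the loss of exactly one member


def survives_channel_lockup(layout, channel):
    """True if a RAID 5 array stays up when every disk on `channel` vanishes."""
    missing = [disk for disk, ch in layout.items() if ch == channel]
    return len(missing) <= RAID5_MAX_MISSING


def worst_case(layout):
    """True only if the array survives the lockup of any single channel."""
    return all(survives_channel_lockup(layout, ch) for ch in set(layout.values()))


print(worst_case(SEPARATE))  # True
print(worst_case(SHARED))    # False
```

The same count applies when working on the machine: if two or more members are missing at boot, for whatever reason, the RAID 5 array will not start.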
Choosing an Operating System
My initial plan was to make this machine a test installation for the
Fedora Core 1 release of Linux. I had several machines running RedHat
9.0 and I needed to decide what to do with them.
I wiped the old contents of the 40GB disk and installed Fedora over the
network. This installation was extremely fast and easy. The main
irritation was that they made it hard to use ReiserFS. I had to exit
the install and reboot with the correct 'linux reiserfs' option to
enable this. I find this very annoying as I've used ReiserFS for years
on many machines and I'm very happy with it. Note to Fedora: please put
ReiserFS back in the default filesystem options. I can make my own mind
up thanks.
I had some more second thoughts about using Fedora when I tried to use
the up2date application. I got all sorts of XML and RPM database errors
and it kept locking up. I've also seen some of this on RedHat 9.0
lately and the expiring SSL certificate farce was another
inconvenience. RedHat have been clocking up a few black marks in my
estimation lately. Getting an email saying they didn't want to take my
money for RHN any more was something else I didn't like.
Also around this point the IBM 40GB drive started to make some very
strange noises at power on. The BIOS kept telling me the disk was
faulty in that inaudible voice that motherboards have these days. I can
never tell exactly what they say, even when I put my ear next to the
speaker.
I ran the Drive Fitness Test utility on it and it passed but it was
making me very nervous. It looked like some sort of intermittent
mechanical seeking problem. I decided to replace it and bought an 80GB
drive. The only one I could easily obtain at the time was a Western
Digital Caviar model. I had real doubts about this as the last WD drive
I bought managed to survive a mere two weeks before dying with a
mountain of read / write errors. I'd heard that their more recent
products were much improved so I thought I'd risk it and give them
another chance. I haven't had any problem with it yet and it's been in
there for nine months now.
This was when I discovered the mistake in my original disk layout
plans. The Promise 20378 chip turned out to only be supported under
Linux for SATA. That meant the drive I'd connected to the third IDE
channel on the motherboard was not visible. That was very annoying and
caused me to say a few bad words. The only way out I could see was to
double up the operating system disk with one of the array disks on the
primary IDE channel. That wasn't going to be ideal but it would still
mean that the array disks would be on separate channels. I just hoped
that it wouldn't have too much of an impact on performance.
While I think the mistake I made was understandable, it was a reminder
of two things. Firstly, it's vital to read motherboard and chipset
specifications in extreme detail. Secondly, despite all the advances
Linux has made, we're still sometimes second class citizens when it
comes to hardware support. A Windows user would be able to use all the
IDE channels on the board, why shouldn't I be able to?
Binary drivers from manufacturers are one option but they often have
their own problems,
particularly for storage. I'd be very unlikely to use one if there was
a chance that I might upgrade my kernel and suddenly find the array
inaccessible.
Having got tired of rebuilding the RPM database I decided to ditch
Fedora and use Gentoo Linux for this machine. I'd already installed it
on my desktop and laptop and been impressed by it. This was its first
chance to run a server for me.
I should probably note at this point that I've since installed Fedora
on two other machines without seeing the problems I had with it on this
one so I don't know why they occurred.
I partitioned the 80GB disk for an LVM install with a layout that looked
like this:
- hda1   512 MB  root
- hda2  2048 MB  swap
- hda3    77 GB  LVM Physical volume
All my machines now use LVM. It provides a lot of flexibility for a
very small overhead. I decided against having a separate /boot
partition. This is another slightly religious area. I find them fiddly
and annoying and often forget to mount them when I'm building a new
kernel. As I believe their original rationale of getting round BIOS
limitations is no longer relevant I don't bother with them.
During the Gentoo install I set up logical volumes for /tmp, /var,
/usr, /home, /opt and /usr/portage/distfiles. The last is Gentoo
specific and allows me to easily avoid backing up the package cache
while also stopping it filling up /usr. I made the volumes pretty big
and still ended up with 50GB of uncommitted space that can be assigned
to any existing or new volume as the need arises.
The Gentoo install took a very long time. I'd started from a stage one
install and was building everything from scratch. If time had been a
real factor I could have started with a package based install. I wanted
total control of the software that got installed and to have it
optimised for the machine and I was prepared to wait. Once the initial
set up was done most of the time was spent waiting for things to
compile, rather than sitting in front of the machine fiddling with it.
Gentoo does not have the same sort of installation as some other
distributions as it involves a good deal of manual effort. The
installers for distributions like Fedora and SuSE can set up RAID / LVM
configurations during the installation process using graphical
configuration tools. This can make the process easier but the concepts
still need to be understood.
RAID Options
I now present my opinionated guide to different RAID levels. For a much
more detailed explanation try this
web site.
Level | Min Disks | Good For
Concatenation | 2 | Nothing, use LVM for this functionality.
Striping (RAID 0) | 2 | Performance. Useful for transient video data or machines that are regularly fully backed up. Increases the risk of data loss, as a single drive failure destroys the whole array.
Mirroring (RAID 1) | 2 | Reliability. Survives single disk failure with good performance.
Stripe w. Parity (RAID 5) | 3 | Reliability. Survives single failure but with degraded performance. Lower cost than mirroring.
Mirrored Stripe (RAID 0+1) | 4 | Increases performance and reliability at additional cost.
Setting Up LVM + RAID
I'd decided on a RAID 5 volume using the three disks. This would give
me about 500GB of usable space and allow for recovery from a single
disk failure.
The alternative I considered was RAID 0 + 1 using four 160GB disks.
This would have given me 320GB of usable space and allowed for recovery
from one or, depending on which disks failed, two failed disks. This
would also have given me better performance, but that wasn't the
overriding criterion for my system and it would have been over budget.
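The usable space figures above follow from simple rules for each RAID level. Here is a minimal Python sketch of those rules, assuming identical disks. The function name and level labels are just for illustration.

```python
def usable_gb(level, disks, size_gb):
    """Usable capacity in GB for `disks` identical drives of `size_gb` each."""
    if level == "raid0":
        return disks * size_gb
    if level == "raid1":
        return size_gb                  # every disk holds a full copy
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * size_gb    # one disk's worth goes to parity
    if level == "raid0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, four or more")
        return disks // 2 * size_gb     # half the disks mirror the other half
    raise ValueError(f"unknown RAID level: {level}")


print(usable_gb("raid5", 3, 250))    # 500
print(usable_gb("raid0+1", 4, 160))  # 320

# Drive makers count 1 GB as 10**9 bytes; tools that count in units of
# 2**30 bytes will report the same 500 GB as a smaller number.
print(round(500 * 10**9 / 2**30, 2))  # 465.66
```

The last line is why the RAID 5 array shows up as roughly 466 GB rather than 500 GB in the vgdisplay output later on.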
The configuration I have set up is LVM on top of RAID then using
ReiserFS on top of that. The following diagram should show how the
various levels of storage entities stack up.
Working up from the bottom of the diagram we have the physical disks,
in my case three of them. These are formed into a metadevice /dev/md0
by the
RAID driver. This is the level that is providing the failure recovery
support in this configuration. Above that the /dev/md0 is made into an
LVM PV (Physical Volume) and used to form a VG (Volume Group). Space
from this volume group is then assigned to a number of LVs (Logical
Volumes) and these are the devices on which filesystems are created.
Software RAID was easy to set up. There are two sets of tools available
to do this: raidtools and mdadm. Because I didn't realise that mdadm
existed I initially used raidtools. If I were doing it again I'd
probably use the newer alternative. My /etc/raidtab file looked like
this:
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    nr-spare-disks          0
    persistent-superblock   1
    parity-algorithm        left-symmetric
    chunk-size              128
    device                  /dev/ide/host0/bus0/target1/lun0/part1
    raid-disk               0
    device                  /dev/ide/host0/bus1/target0/lun0/part1
    raid-disk               1
    device                  /dev/ide/host2/bus0/target0/lun0/part1
    raid-disk               2
I created the /etc/raidtab above listing the three disks and ran mkraid
on md0. This succeeded and I was then able to see the process of array
reconstruction in /proc/mdstat. I could see it was going to take about
2 hours for the array to initialise its parity blocks. The array can be
used during this time but I decided to let the process finish. If a
disk fails during reconstruction then data may be lost.
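Rather than reading /proc/mdstat by eye, the progress line can be picked out with a short script. The sketch below is in Python. The sample text is made up for illustration, and the exact layout of /proc/mdstat varies between kernel versions, so the regular expression should be treated as an assumption to check against your own system.

```python
import re

# Illustrative /proc/mdstat excerpt; the device names and figures are made up.
SAMPLE = """\
md0 : active raid5 hdg1[2] hdc1[1] hdb1[0]
      488391808 blocks level 5, 128k chunk, algorithm 2 [3/3] [UUU]
      [=>...................]  resync =  8.5% (20900000/244195904) finish=118.2min speed=31500K/sec
"""

# Matches lines such as "resync =  8.5% (...) finish=118.2min".
PROGRESS = re.compile(r"(resync|recovery)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min")


def rebuild_progress(mdstat_text):
    """Return (kind, percent done, minutes left), or None if no rebuild is running."""
    match = PROGRESS.search(mdstat_text)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), float(match.group(3))


print(rebuild_progress(SAMPLE))
# On a live system: rebuild_progress(open("/proc/mdstat").read())
```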
Once the array had finished rebuilding itself I created a new LVM
physical volume on it and a new volume group for video storage. It
seems unlikely that I'll be putting any more disks in this machine but
using LVM is still worthwhile as it makes it possible to parcel out
chunks of the space as necessary.
Running vgdisplay shows the following:
--- Volume group ---
VG Name               system
VG Access             read/write
VG Status             available/resizable
VG #                  0
MAX LV                256
Cur LV                7
Open LV               7
MAX LV Size           2 TB
Max PV                256
Cur PV                1
Act PV                1
VG Size               53.47 GB
PE Size               32 MB
Total PE              1711
Alloc PE / Size       704 / 22 GB
Free PE / Size        1007 / 31.47 GB
VG UUID               94G08x-Q6uf-2t0j-XAkj-uUf0-YKWh-Vk517n

--- Volume group ---
VG Name               video
VG Access             read/write
VG Status             available/resizable
VG #                  1
MAX LV                256
Cur LV                1
Open LV               1
MAX LV Size           2 TB
Max PV                256
Cur PV                1
Act PV                1
VG Size               465.69 GB
PE Size               64 MB
Total PE              7451
Alloc PE / Size       4800 / 300 GB
Free PE / Size        2651 / 165.69 GB
VG UUID               vpj4Qg-OEJS-Vocp-i6zZ-5P36-E29b-EFPGZl
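The sizes in this output are just the extent counts multiplied by the PE size. The Python sketch below redoes that sum for a few of the figures, assuming vgdisplay counts 1 GB as 1024 MB. The helper name is only for illustration.

```python
def pe_to_gb(extents, pe_size_mb):
    """Convert a count of physical extents to GB (1 GB = 1024 MB here)."""
    return extents * pe_size_mb / 1024


# Figures copied from the vgdisplay output above: (label, extents, PE size in MB)
FIGURES = [
    ("system total", 1711, 32),
    ("system free", 1007, 32),
    ("video total", 7451, 64),
    ("video allocated", 4800, 64),
    ("video free", 2651, 64),
]

for label, extents, pe_mb in FIGURES:
    print(f"{label:<16} {pe_to_gb(extents, pe_mb):.2f} GB")
```

Knowing the PE size also shows the granularity available when carving out or growing a volume: 64 MB steps in the video group and 32 MB steps in the system group.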
I created a 300 GB volume, ran mkreiserfs on it and mounted it. A quick
edit of /etc/exports and the Samba smb.conf and I had a working network
storage server.
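To pull the whole sequence together, here is a Python sketch that prints one plausible set of commands for building the stack from the bottom up. It only prints them and runs nothing. The exact commands used on this machine aren't recorded in these notes, so treat the list as an assumption. The logical volume name "recordings" and the mount point "/mnt/video" are hypothetical.

```python
def build_plan(md_device, vg_name, pe_size, lv_name, lv_size, mount_point):
    """Return the shell commands, in order, for the RAID -> LVM -> ReiserFS stack."""
    lv_path = f"/dev/{vg_name}/{lv_name}"
    return [
        f"mkraid {md_device}",                             # build the array from /etc/raidtab
        f"pvcreate {md_device}",                           # make the array an LVM physical volume
        f"vgcreate -s {pe_size} {vg_name} {md_device}",    # volume group with the chosen PE size
        f"lvcreate -L {lv_size} -n {lv_name} {vg_name}",   # carve out a logical volume
        f"mkreiserfs {lv_path}",                           # put a ReiserFS filesystem on it
        f"mount {lv_path} {mount_point}",                  # and mount it
    ]


# "recordings" and "/mnt/video" are hypothetical names; the other values
# come from the raidtab and vgdisplay output in this section.
for command in build_plan("/dev/md0", "video", "64M", "recordings", "300G", "/mnt/video"):
    print(command)
```

Printing the plan first makes it easy to double check device names before anything destructive is run as root.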
Conclusions
I managed to build a storage server to my requirements for just over
my original budget, though it would have been easy enough to meet it by
reducing the spec. The machine also now hosts my printer and my home
automation X10 interface and is a pleasingly fast general purpose
server. I've also started using it as a Postgres server for a
development project in addition to its video serving duties.
The machine isn't as quiet as it might be because I put in some extra
fans. The Hitachi drives also make occasional strange squeaking noises
as they recalibrate. It sometimes sounds like there's an animal trapped
inside the case. These drives have now been running for a year and have
given me no problems (well, until I just tempted fate I suppose). This
is a welcome relief after the problems I've had with some models of
Deskstar drives before.
The only real advantage a dedicated storage solution would have had is
fitting into a much smaller case. Compared to the disadvantages of it
being based on Windows 2000 and costing a lot more that's something I
can live without.
The latest change I've made to it is to add another 250GB drive. I
reckon I could still get another 2-3 drives in there if it proves
necessary, and as we all know it probably will.
Martin Smith
Last Updated 7th February 2005.