who’s AFRAID of real hardware RAID?

Recently we bought a low-end IBM xSeries 306m server to handle generic IT utility tasks, such as hosting an installation of Request Tracker, Cacti and, in the near future, Nagios. The server came with a pair of 160GB SATA disks attached to a ServeRAID-8e HostRAID controller. I quickly discovered that HostRAID is an awful hack; it’s not real hardware RAID, but software-emulated RAID, utilizing the host system’s SATA controller to do the actual I/O to the disks, but with the RAID processing done in software using a proprietary driver, in my case, a driver called adpahci. In other words, it’s "A Fake RAID", which some pundits have noted collapses into the fitting acronym AFRAID.

Several admins have criticized HostRAID for a number of reasons:

  • Performance is terrible because the AFRAID controller must do programmed I/O (PIO) through the CPU
  • The drivers are, by nature, proprietary, since the RAID logic is licensed from a third party
  • Limited sophistication in array rebuilds, since the controller has a minimal BIOS and online rebuilds are not possible
  • Disks in an AFRAID array are probably unusable outside of the array, given that the driver is chipset-specific

Although I don’t really care about performance for such a low-end utility box, I have been seriously bitten by the second point. We use Red Hat Enterprise Linux 4 Update 3 on all production servers, including my utility box, but IBM only provides binary HostRAID drivers up to RHEL4 Update 2. You can allegedly rebuild the drivers using a SHIM from Adaptec, but it doesn’t work: although the SHIM package contains C drivers for all the Adaptec HostRAID controllers (aar81xx, adp94xx, adpahci, adpsata, etc.), the only binary blob you can obtain is the one for the aar81xx. Ergo, I am S.O.L. I’m stuck with a RHEL 4 Update 3 userland on a RHEL 4 Update 2 kernel.

I guess the appropriate solution if you’re going to buy this model of server (with SATA) is to ditch the on-board AFRAID and buy a ServeRAID-7t SATA controller, which has a real 80302 processor and 64MB of cache memory, or any of the other ServeRAID products which fit in the server.

On a final note, what the heck is with IBM’s insane naming schemes for all of its ServeRAID products? I can’t keep the 6i+, 7t, 7k, 7e, 8e, 8i, 6M, etc. straight — can you? Have a look at this driver matrix and your eyes will glaze over. Why don’t they name the controllers something meaningful?

memories of Farallon PhoneNet

My 10-year high school reunion is happening over the August long weekend this year, and the event got me thinking about some of the technology we used during those years.

Every Ontario elementary school and high school student of a certain vintage will remember the ubiquitous Unisys ICON terminals, a topic that I will actually leave to a later entry (we had a lot of fun with those ICONs, especially upon discovering that one could write a C program to fork() from any PID on the system, including /sbin/init, with extremely useful results). However, I started thinking about Farallon PhoneNet, a fabulous networking technology for Macintoshes back in the day, and I thought I should record for posterity what kind of equipment it took to produce the Mackenzie High Times.


new computer woes

So my trusty 6-year-old desktop, jupiter, died after a power outage a couple of weeks ago. I suspect the motherboard got fried, because trying to power on the system did nothing, although a monitor plugged into the back of the PSU could still power up.

I’d been thinking of getting a new computer for some time, because the Pentium III 800 MHz processor and 768 MB of PC133 RAM weren’t really cutting it for running VMware Workstation, so this failure pushed me into action. I decided to purchase a standard "template" system from Canada Computers with the following specifications:

  • ASUS P5LD2-VM motherboard
  • Pentium 4 3.2 GHz CPU
  • 512 MB of DDR400 PC3200 RAM
  • 250 GB Western Digital SATA hard disk
  • LG 16x dual-layer DVD writer

The P5LD2-VM has onboard sound, video (using an Intel 945G chipset) and Intel Gigabit Ethernet, so I decided to just use those.

To this system I added another 512 MB of RAM, a second 250GB SATA hard disk (for a software RAID-1 mirror), and an APC Back-UPS CS 500 uninterruptible power supply.

I picked up the system on Saturday, took it home and powered it up. Immediately I saw a problem: one of the SATA hard disks wasn’t being properly detected. After fiddling around with the connections on the motherboard, I was able to get both disks to show up, but only if I used SATA ports 1 and 3, rather than 1 and 2. Plugging any device into ports 3 and 4 caused them to not show up in the BIOS.

I resolved to take the system back to have Canada Computers’ technicians diagnose the issue (eventually they reset the BIOS and everything was fine) but in the meantime I could still install Fedora Core 5 on it. Or so I thought.

I started by installing the i386 version of Fedora, which succeeded, but then I realized that the Pentium 4 is an EM64T CPU, so I should install x86_64 Fedora. Trying to do so, however, caused the installer to lock up right before the first boot, and resulted in a corrupted system — for example, /etc/inittab would be missing. I observed other weird behaviour, like the fact that the primary software RAID partition, /dev/md2, would be in a rebuilding state immediately after the install, even though the installer said to reboot the system.
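For reference, the rebuild state of a Linux software RAID array is visible in /proc/mdstat. The following is a minimal sketch, assuming the usual mdstat layout in which a "resync =" or "recovery =" progress line follows the array it belongs to; the function name md_rebuilding is made up for illustration.

```shell
#!/bin/sh
# Sketch: list md arrays that are currently resyncing or recovering.
# md_rebuilding is an illustrative name, not part of mdadm or any distro tool.

md_rebuilding() {
    # Reads the file given as $1, defaulting to /proc/mdstat.
    # Remembers the most recent "mdN :" header and prints that name
    # whenever a resync/recovery progress line follows it.
    awk '/^md[0-9]+ :/ { name = $1 }
         /(resync|recovery) *=/ { print name }' "${1:-/proc/mdstat}"
}
```

Calling md_rebuilding with no arguments right after an install would print md2 if that array were still rebuilding, and nothing if every array were in sync.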

I subsequently tried to install the Fedora Unity Re-Spin of x86_64 Fedora Core 5, with similar results; at least I was able to get through the installer and onto first boot, but when starting up X the system would lock up hard. SUSE Linux 10.1, which I tried just to see if it would behave differently, had the same issues.

I came to the conclusion that the on-board Intel 945G video chipset is no good, at least with the 64-bit drivers in X. So I ran out to Canada Computers again and bought the cheapest PCI Express video card I could find: an ASUS Extreme AX300SE (basically an ATI Radeon X300SE). Then I tried to reinstall Fedora Core 5, and it worked perfectly! So I would advise everyone to stay away from the Intel 945G chipset for on-board video.

By the way, the UPS was broken too; I opened a support case with APC and they are planning to courier me a replacement. I guess I just have bad luck with computer equipment.

MTBF for Sun drives: 4 months or less

Boy, I’m glad I wrote down directions for replacing a drive in an LVM mirror, because c0t1d0 just died on me. That’s right, the drive that I didn’t replace last time.

Keep in mind that I purchased this Sun server less than 4 months ago. I wonder if the assembly line workers at Seagate were smoking pot 6 months ago when they put this batch together?

Linux WiFi improvements on the horizon

Wireless device support — and indeed, wireless reliability — has been frankly awful in Linux up to this point. Even among the devices that work (at least some of the time), there are frequent problems. For example, my IBM Thinkpad T42 laptop comes with an IPW2200BG adapter that mostly works — except after suspend, when it will refuse to function unless the driver is unloaded and reloaded. I’m using NetworkManager to "magically" manage my network connection; when it works, it works fabulously, but there are no docs. None. I challenge you to try and find a man page or any scrap of documentation about NetworkManager anywhere on the Internet.

At least on the driver side there might be some hope on the horizon. Devicescape, a WiFi software stack specialist, has just released their "Advanced Datapath" IEEE 802.11 driver stack under an open source license, and several kernel developers are trying to get it integrated into the Linux kernel. Of course, as with all integrations, this won’t happen overnight, but when it does, many wireless features (WPA, WEP, software MAC, and so on) that currently require add-ons, such as the userland wpa_supplicant for WPA, could be handled directly by this stack.

I’m looking forward to the day when I don’t have to do this magic incantation to get wireless working after suspend:

# /etc/init.d/NetworkManager stop
# /sbin/modprobe -r ipw2200
# /sbin/modprobe ipw2200
# sleep 10
# /etc/init.d/NetworkManager start
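
The sequence above could be wrapped in a small script so that it takes one command after resume. This is only a sketch, assuming the same init script path and module name as above; reload_wifi, run and DRY_RUN are made-up names, and with DRY_RUN=1 the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: reload the ipw2200 driver around a NetworkManager restart.
# reload_wifi, run and DRY_RUN are illustrative names, not part of any tool.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_wifi() {
    # Same steps, in the same order, as the manual incantation.
    run /etc/init.d/NetworkManager stop
    run /sbin/modprobe -r ipw2200
    run /sbin/modprobe ipw2200
    run sleep 10
    run /etc/init.d/NetworkManager start
}
```

Run as root, reload_wifi performs the five steps in order; setting DRY_RUN=1 first is a harmless way to check what it would do.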

A complete non-sequitur: survey questions that make no sense.

replacing a failed Sun LVM mirror

The problem with mirroring your disks is that one side of the mirror will invariably fail two weeks later. This has happened to me several times, first under NetBSD (with its excellent RAIDFrame technology, a worthy competitor, functionally, to Sun Volume Manager) and now with the Sun LVM mirror that I set up several weeks ago and documented in this very blog.

I called Sun support, and they shipped me a new disk. Here’s how I went about replacing the failed device, without incurring any downtime (yay, Sun hot-swappable parts!).

home router replaced!

I finally decided to replace my home router, a Sun Ultra 10 running FreeBSD. There were a couple of reasons for this:

  1. I was running FreeBSD 5.x, which meant that the keyboard wouldn’t work — I could only control the system remotely over SSH or through a serial console. This was fixed in later versions of FreeBSD 5.x but I didn’t want to bother upgrading, since the box isn’t the fastest machine
  2. Using a desktop workstation for routing and running ppp consumes more power than it’s worth, and makes a fair amount of noise
  3. Using a 400 MHz UltraSPARC IIi-based workstation with 512 MB of ECC RAM for a simple firewall and router seemed like a bit of overkill 🙂
  4. I want to free up the Ultra 10 for testing out Solaris 10 and possibly upgrading my Solaris 9 SCSA designation.
  5. I want to (finally!) equip my home with wireless… yes, I’m a little late getting on the bandwagon.


Broadcom NetXtreme issues part 2

Here’s a follow-up to my previous post about the Broadcom BCM570x Gig-E adapters on HP-DL380 servers. HP pointed us to the following advisory:

Advisory: Primary Port of Integrated NC7782 Gigabit Server Adapters with NFS protocol with Certain Firmware Versions Stops Transmitting under Linux, Resulting in Lost Network Connectivity

However, reading the advisory indicates that the problem only afflicts the primary port of the Ethernet adapter. We’ve been seeing problems on the secondary port, as well as an add-on card.

This has been raised with HP, so we’ll see what they say.

Broadcom NetXtreme Gigabit Ethernet adapter problems

Recently we’ve been seeing a lot of error messages while using the Broadcom BCM570x series of Gigabit Ethernet adapters under SUSE Linux Enterprise Server 9. The symptoms are that the interface will simply hang under high traffic and refuse to pass more packets, eventually giving the error:

Dec 1 01:17:46 dev03 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Dec 1 01:17:46 dev03 kernel: tg3: eth2: transmit timed out, resetting
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
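
To get a rough idea of how often each interface is affected, the reset messages can be counted per interface. The following is a minimal sketch, assuming syslog lines in the format shown above; count_tg3_resets is a made-up name, and the log file location will vary by system.

```shell
#!/bin/sh
# Sketch: count tg3 transmit-timeout resets per interface in syslog output.
# count_tg3_resets is an illustrative name, not an existing utility.

count_tg3_resets() {
    # Matches lines such as:
    #   ... kernel: tg3: eth2: transmit timed out, resetting
    # and prints "<interface> <count>" for each interface seen.
    # Reads the files given as arguments, or standard input if none.
    awk '/tg3: .*: transmit timed out, resetting/ {
             for (i = 1; i < NF; i++)
                 if ($i == "tg3:") {
                     iface = $(i + 1)
                     sub(/:$/, "", iface)   # strip the trailing colon
                     n[iface]++
                     break
                 }
         }
         END { for (k in n) print k, n[k] }' "$@" | sort
}
```

It can be pointed at a log file (for example count_tg3_resets /var/log/messages, if that is where kernel messages end up) or fed from a pipe.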

It’s become a very serious issue for us because we have Broadcom BCM570x controllers on board all of our HP-DL380 servers. The problem seems to occur more frequently now that we’ve upgraded to an SP2 (and beyond) SLES9 kernel, although we have had problems dating back several months with older kernels.

Doing some research on the Internet, I’ve found that this is a very common problem out in the field. In a summary document I prepared for management, I wrote the following:

Other customers in the field have reported the same problems running Red Hat Enterprise Server 3, Debian GNU/Linux, FreeBSD/NetBSD and even Novell NetWare (internal communication with Novell PSE). In many of the reported incidents, customers were running identical server hardware (HP/Compaq Proliant DL-3x0 series) to CBC.ca. [HP IT Resource Centre thread #898761, where customers have reported issues with a variety of HP hardware and operating systems.]

There are a number of root causes to the problem, including Linux driver instability (the Tigon3 (tg3) driver was created by reverse-engineering the Broadcom bcm5700 driver due to the low quality of the latter) and manufacturing defects (defects in some Broadcom 5704 chips afflicted Sun’s initial customer shipment of Sun Fire V210 and V240 servers in 2003, leading to Sun Alert #55620. The impact of such defects beyond Sun is unclear because Broadcom refused to provide further details).

Right now, we’re awaiting feedback from HP and Novell on how they plan to resolve this issue. In the meantime, we’re going to stockpile some Intel Gigabit Ethernet cards.