Hardware

new computer woes

So my trusty 6-year-old desktop, jupiter, died after a power outage a couple of weeks ago. I suspect the motherboard got fried, because trying to power on the system did nothing, although a monitor plugged into the back of the PSU could still power up.

I’d been thinking of getting a new computer for some time, because the Pentium III 800 MHz processor and 768 MB of PC133 RAM weren’t really cutting it for running VMware Workstation, so this failure pushed me into action. I decided to purchase a standard "template" system from Canada Computers with the following specifications:

  • ASUS P5LD2-VM motherboard
  • Pentium 4 3.2 GHz CPU
  • 512 MB of DDR400 PC3200 RAM
  • 250 GB Western Digital SATA hard disk
  • LG 16x dual-layer DVD writer

The P5LD2-VM has onboard sound, video (using an Intel 945G chipset) and Intel Gigabit Ethernet, so I decided to just use those.

To this system I added another 512 MB of RAM, a second 250GB SATA hard disk (for a software RAID-1 mirror), and an APC Back-UPS CS 500 uninterruptible power supply.

I picked up the system on Saturday, took it home and powered it up. Immediately I saw a problem: one of the SATA hard disks wasn’t being properly detected. After fiddling around with the connections on the motherboard, I was able to get both disks to show up, but only if I used SATA ports 1 and 3, rather than 1 and 2. Any device plugged into port 2 or port 4 did not show up in the BIOS.

I resolved to take the system back to have Canada Computers’ technicians diagnose the issue (eventually they reset the BIOS and everything was fine) but in the meantime I could still install Fedora Core 5 on it. Or so I thought.

I started by installing the i386 version of Fedora, which succeeded, but then I realized that the Pentium 4 is an EM64T CPU, so I should install x86_64 Fedora. Trying to do so, however, caused the installer to lock up right before the first boot, and resulted in a corrupted system — for example, /etc/inittab would be missing. I observed other weird behaviour, like the fact that the primary software RAID partition, /dev/md2, would be in a rebuilding state immediately after the install, even though the installer said to reboot the system.

I subsequently tried to install the Fedora Unity Re-Spin of x86_64 Fedora Core 5, with similar results; at least I was able to get through the installer and onto first boot, but when starting up X the system would lock up hard. SUSE Linux 10.1, which I tried just to see if it would behave differently, had the same issues.

I came to the conclusion that the on-board Intel 945G video chipset is no good, at least with the 64-bit drivers in X. So I ran out to Canada Computers again and bought the cheapest PCI Express video card I could find: an ASUS Extreme AX300SE (basically an ATI Radeon X300SE). Then I tried to reinstall Fedora Core 5, and it worked perfectly! So I would advise everyone to stay away from the Intel 945G chipset for on-board video.

By the way, the UPS was broken too; I opened a support case with APC and they are planning to courier me a replacement. I guess I just have bad luck with computer equipment.

MTBF for Sun drives: 4 months or less

Boy, I’m glad I wrote down directions for replacing a drive in an LVM mirror, because c0t1d0 just died on me. That’s right, the drive that I didn’t replace last time.

Keep in mind that I purchased this Sun server less than 4 months ago. I wonder if the assembly line workers at Seagate were smoking pot 6 months ago when they put this batch together?

Linux WiFi improvements on the horizon

Wireless device support — and indeed, wireless reliability — has been frankly awful in Linux up to this point. Even among the devices that work (at least some of the time), there are frequent problems. For example, my IBM Thinkpad T42 laptop comes with an IPW2200BG adapter that mostly works — except after suspend, when it will refuse to function unless the driver is unloaded and reloaded. I’m using NetworkManager to "magically" manage my network connection; when it works, it works fabulously, but there are no docs. None. I challenge you to try and find a man page or any scrap of documentation about NetworkManager anywhere on the Internet.

At least on the driver side there might be some hope on the horizon. Devicescape, a WiFi software stack specialist, has just released their "Advanced Datapath" IEEE 802.11 driver stack under an open source license, and several kernel developers are trying to get it integrated into the Linux kernel. Of course, as with all integrations, this won’t happen overnight, but when it does, many wireless features (WPA, WEP, software MAC, and so on) that currently require add-ons, such as the userland wpa_supplicant for WPA, could be handled directly by this stack.

I’m looking forward to the day when I don’t have to do this magic incantation to get wireless working after suspend:

# /etc/init.d/NetworkManager stop
# /sbin/modprobe -r ipw2200
# /sbin/modprobe ipw2200
# sleep 10
# /etc/init.d/NetworkManager start

A complete non-sequitur: survey questions that make no sense.
how quickly?

replacing a failed Sun LVM mirror

The problem with mirroring your disks is that one side of the mirror will invariably fail two weeks later. This has happened to me several times, first under NetBSD (with its excellent RAIDFrame technology, a worthy competitor, functionally, to Sun Volume Manager) and now with the Sun LVM mirror that I set up several weeks ago and documented in this very blog.

I called Sun support, and they shipped me a new disk. Here’s how I went about replacing the failed device, without incurring any downtime (yay, Sun hot-swappable parts)! Continue reading…

home router replaced!

I finally decided to replace my home router, a Sun Ultra 10 running FreeBSD. There were a couple of reasons for this:

  1. I was running FreeBSD 5.x, which meant that the keyboard wouldn’t work: I could only control the system remotely over SSH or through a serial console. This was fixed in later versions of FreeBSD 5.x but I didn’t want to bother upgrading, since the box isn’t the fastest machine.
  2. Using a desktop workstation for routing and running ppp consumes more power than it’s worth, and makes a fair amount of noise.
  3. Using a 400 MHz UltraSPARC IIi-based workstation with 512 MB of ECC RAM for a simple firewall and router seemed like a bit of overkill :-)
  4. I want to free up the Ultra 10 for testing out Solaris 10 and possibly upgrading my Solaris 9 SCSA designation.
  5. I want to (finally!) equip my home with wireless… yes, I’m a little late getting on the bandwagon.

Continue reading…

and then it turns out…

Intel EtherExpress PRO/1000 adapters experience the same issue as the Broadcoms do, under certain conditions:

Linux kernel message: “NETDEV WATCHDOG: eth0: transmit timed out”

The workaround is to disable TCP segmentation offloading (TSO) with ethtool.
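For reference, here is a minimal sketch of that workaround, wrapped in Python so it can be looped over several interfaces. It assumes ethtool is installed and that the interface is named eth0; the function names are my own, not part of any library. The underlying command is `ethtool -K eth0 tso off`, and as far as I know the setting does not survive a reboot, so it would need to go into a boot script as well.

```python
import subprocess


def tso_off_command(interface):
    """Build the ethtool invocation that turns TSO off for one interface."""
    # -K changes offload settings; "tso off" disables TCP segmentation offload.
    return ["ethtool", "-K", interface, "tso", "off"]


def disable_tso(interface, dry_run=True):
    """Print the command, and run it unless dry_run is set (needs root)."""
    cmd = tso_off_command(interface)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is an assumed interface name; dry run only prints the command.
    disable_tso("eth0")
```

Calling `disable_tso("eth0", dry_run=False)` as root would actually apply the change.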

I feel as though there is no good Gigabit Ethernet controller that one can use with Linux. Can someone prove me wrong?

Broadcom NetXtreme issues part 2

Here’s a follow-up to my previous post about the Broadcom BCM570x Gig-E adapters on HP-DL380 servers. HP pointed us to the following advisory:

Advisory: Primary Port of Integrated NC7782 Gigabit Server Adapters with NFS protocol with Certain Firmware Versions Stops Transmitting under Linux, Resulting in Lost Network Connectivity

However, reading the advisory indicates that the problem only afflicts the primary port of the Ethernet adapter. We’ve been seeing problems on the secondary port, as well as an add-on card.

This has been raised with HP, so we’ll see what they say.

Broadcom NetXtreme Gigabit Ethernet adapter problems

Recently we’ve been seeing a lot of error messages while using the Broadcom BCM570x series of Gigabit Ethernet adapters under SUSE Linux Enterprise Server 9. The symptoms are that the interface will simply hang under high traffic and refuse to pass more packets, eventually giving the error:


Dec 1 01:17:46 dev03 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Dec 1 01:17:46 dev03 kernel: tg3: eth2: transmit timed out, resetting
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2

It’s become a very serious issue for us because we have Broadcom BCM570x controllers on board all of our HP-DL380 servers. The problem seems to occur more frequently now that we’ve upgraded to an SP2 (and beyond) SLES9 kernel, although we have had problems dating back several months with older kernels.
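To get a feel for how often each interface hangs, the watchdog lines can be tallied straight out of syslog. The following is a rough sketch that only assumes the log format shown above; the function name and the sample list are made up for illustration, and in practice the lines would come from wherever your syslog writes kernel messages.

```python
import re
from collections import Counter

# Matches lines such as:
#   Dec 1 01:17:46 dev03 kernel: NETDEV WATCHDOG: eth2: transmit timed out
WATCHDOG = re.compile(
    r"kernel: NETDEV WATCHDOG: (?P<iface>\w+): transmit timed out"
)


def count_watchdog_timeouts(lines):
    """Return a Counter of watchdog timeouts keyed by interface name."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


# Sample lines copied from the log excerpt above.
sample = [
    "Dec 1 01:17:46 dev03 kernel: NETDEV WATCHDOG: eth2: transmit timed out",
    "Dec 1 01:17:46 dev03 kernel: tg3: eth2: transmit timed out, resetting",
    "Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, "
    "ofs=2c00 enable_bit=2",
]

if __name__ == "__main__":
    for iface, hits in count_watchdog_timeouts(sample).items():
        print(iface, hits)
```

Only the NETDEV WATCHDOG line is counted, so the follow-up tg3 reset messages for the same event are not double-counted.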

Doing some research on the Internet, I’ve found that this is a very common problem out in the field. In a summary document I prepared for management, I wrote the following:

Other customers in the field have reported the same problems running RedHat Enterprise Server 3, Debian GNU Linux, FreeBSD/NetBSD and even Novell Netware (internal communication with Novell PSE). In many of the reported incidents, customers were running identical server hardware (HP/Compaq Proliant DL-3x0 series) to CBC.ca. [HP IT Resource Centre thread #898761 where customers have reported issues with a variety of HP hardware and operating systems.]

There are a number of root causes to the problem, including Linux driver instability (the Tigon3 (tg3) driver was created by reverse-engineering the Broadcom bcm5700 driver due to the low quality of the latter) and manufacturing defects (defects in some Broadcom 5704 chips afflicted Sun’s initial customer shipment of Sun Fire V210 and V240 servers in 2003, leading to Sun Alert #55620. The impact of such defects beyond Sun is unclear because Broadcom refused to provide further details).

Right now, we’re awaiting feedback from HP and Novell on how they plan to resolve this issue. In the meantime, we’re going to stockpile some Intel Gigabit Ethernet cards.

tape hardware, part two

While on the topic of tape hardware and backups… never mind my little DLT7000 drive at home. How do you back up a 4TB Titan NAS?

We bought one of these servers at work last year; we’re finally getting around to using it for something. Our current challenge is trying to figure out how to back up a 1TB Interwoven content store (we’ve just bought almost the entire product line from Interwoven) without IT screaming at us for taking up their entire tape rotation schedule. This is on top of having to back up a large MediaBin store as well.

I’ll be happy when the Titan is actually up and running, though. We’ve been having some problems getting the CIFS partitions running, because the Titan really needs an Active Directory server in order to enforce permissions, and all we have is a Windows NT 4 domain controller (think again about hacking it; it’s on an internal network). The problem is that we never originally intended the Titan to be used for Windows shares; the unit was purchased long before we decided to go with Interwoven on Windows entirely.

Interesting technical challenges abound…

upgrading tape hardware to DLT

After my girlfriend’s Powerbook crashed, taking with it several months of her un-backed-up data, I decided enough was enough with my own antiquated backup hardware (ExaByte 8505 8mm tape drive, 5GB/10GB) and I bought a DLT7000 (35GB/70GB) drive off eBay, thus increasing my backup capacity sevenfold. With eight tapes, the whole adventure cost me approximately CAD$150.

I started thinking about what the purpose of hardware compression on tape drives is. In principle, it seems like it’s a good idea; offload compression, which is a CPU-intensive activity, onto the drive. The only problem is that it makes the estimation of whether or not a backup will fit onto a tape a virtual impossibility. I want to know, before I even start writing to the tape, whether or not a backup is going to fit. I don’t want to start writing to tape and then, 2 hours later, find I just hit End Of Media. It’s not something that you can recover from.

I don’t see a technical solution around this problem, so what I do is turn off hardware compression and just gzip the data to a holding disk. This is one of the great features of AMANDA; you can stage the entire backup to a temporary disk, and then write the backup to tape from that disk.
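The "will it fit?" check is simple once the data is already compressed on a holding disk. Here is a rough sketch of the idea; it is not how AMANDA does it internally, the function names are my own, and the 35 GB figure is just the DLT7000 native capacity mentioned above. A real tape loses some space to filemarks and block padding, so you would want a safety margin.

```python
import gzip
import os
import shutil
import tempfile

# Native (uncompressed) DLT7000 capacity, taken as 35 GB.
DLT7000_NATIVE_BYTES = 35 * 10**9


def stage_compressed(source_path, holding_dir):
    """gzip source_path into holding_dir and return the staged file's path."""
    staged = os.path.join(holding_dir, os.path.basename(source_path) + ".gz")
    with open(source_path, "rb") as src, gzip.open(staged, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return staged


def fits_on_tape(staged_paths, capacity_bytes=DLT7000_NATIVE_BYTES):
    """True if the staged files together are no larger than the tape."""
    total = sum(os.path.getsize(path) for path in staged_paths)
    return total <= capacity_bytes


if __name__ == "__main__":
    # Demonstration with a throwaway file standing in for a real backup image.
    with tempfile.TemporaryDirectory() as holding:
        source = os.path.join(holding, "example.dat")
        with open(source, "wb") as handle:
            handle.write(b"backup data " * 100000)
        staged = stage_compressed(source, holding)
        print(os.path.getsize(staged), fits_on_tape([staged]))
```

Because the sizes are measured after compression, the answer is known before a single block is written to tape.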

So, as far as I can tell, hardware compression is not very useful; it seems like a scenario where solving one technical problem (moving slow compression activities onto hardware) creates another (inability to know a priori if you’re going to run out of tape before you start writing the backup).