Time for a backup

The following story is probably not very interesting to anyone out there, but it serves as a reminder (at least to myself) of why backups are so important. Today I was lucky enough to spot disaster creeping up on me in time, but tomorrow I might be too late...

The whole thing started when my wife told me that "there are a lot of funny error messages on our homepage". "Error messages on our homepage, what do you mean?" I replied, and then pointed my browser at the website to see for myself. This is what I saw:

user warning: Can't create/write to file '/tmp/#sql_4d86_0.MYI' (Errcode: 30) query: SELECT t.* FROM term_node r INNER JOIN term_data t ON r.tid = t.tid INNER JOIN vocabulary v ON t.vid = v.vid WHERE r.vid = 109 ORDER BY v.weight, t.weight, t.name in /var/www/drupal-6.8/modules/taxonomy/taxonomy.module on line 618.

Not being able to write to a file in /tmp? Can't be, I thought: /tmp is on the root filesystem (OK, bad idea, I know, but that's how it is set up right now). So I logged into osgiliath (my server box) and started probing:

osgiliath:~# touch /tmp/foo
touch: cannot touch `/tmp/foo': Read-only file system

Hahaha, or maybe not so funny after all? Another check revealed this:

osgiliath:~# mount
/dev/hda3 on / type ext3 (rw,errors=remount-ro)
[...]
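
(A side note: mount still reports rw here because it reads /etc/mtab, which the kernel does not update when it remounts a filesystem read-only on its own; the errors=remount-ro option means that this is exactly what ext3 will do if it detects an internal problem. The kernel's own view can be checked directly; a small sketch:)

osgiliath:~# grep ext3 /proc/mounts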

Interesting... so I really was sitting on a read-only root filesystem. This reminded me of a similar situation last summer, when I had run into a read-only root filesystem while away from home, on vacation in Japan. At that time I had been unable to fix the problem remotely: in a desperate move I had tried to reboot the machine while logged in via ssh, hoping that it would perform a self-healing file system check during startup, but the system never came back online, leaving me cursing in the lobby of a hotel in Hiroshima and without email access for the remainder of the trip. It was only after I had returned home that I was able to get the system up and running again.

Thinking of this unpleasant experience half a year ago, I decided that today I would get to the bottom of the problem, in order to prevent it from happening a third time. I also decided to document the progress of my investigation, mainly for my personal education and entertainment, which is when I started writing this little report. Switching to present tense... :-)

My first source of information is, of course, the system log files. To begin with, I manually examine the current syslog (/var/log/syslog). I find that spamd has been writing error messages to the syslog indicating that it cannot write to the /tmp directory. Since spamd runs fairly often, the first occurrence of such an error message should give me a good indication of when the problem started. The next step is semi-automatic: examining all syslog files for the first occurrence of the spamd error:

osgiliath:/var/log# l syslog*
-rw-r----- 1 root adm 1856166 Feb 11 23:00 syslog
-rw-r----- 1 root adm 2210530 Feb 11 06:25 syslog.0
-rw-r----- 1 root adm  172319 Feb 10 06:25 syslog.1.gz
-rw-r----- 1 root adm  152769 Feb  9 06:26 syslog.2.gz
-rw-r----- 1 root adm  160936 Feb  8 06:30 syslog.3.gz
-rw-r----- 1 root adm  170812 Feb  7 06:28 syslog.4.gz
-rw-r----- 1 root adm  147703 Feb  6 06:28 syslog.5.gz
-rw-r----- 1 root adm  181149 Feb  5 06:28 syslog.6.gz
osgiliath:/var/log# for i in 6 5 4 3 2 1; do zcat syslog.$i.gz >>syslog.full; done
osgiliath:/var/log# cat syslog.0 syslog >>syslog.full
osgiliath:/var/log# grep Read-only syslog.full | head -1
Feb  8 23:38:44 osgiliath spamd[8507]: util: secure_tmpfile failed to create file '/tmp/.spamassassin8507eLOmdntmp': Read-only file system
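
(As an aside, the temporary syslog.full file is not strictly necessary; the same search could be run as a single pipeline, roughly like this:)

osgiliath:/var/log# (for i in 6 5 4 3 2 1; do zcat syslog.$i.gz; done; cat syslog.0 syslog) | grep Read-only | head -1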

Aha, the first entry occurred about 3 days ago (it is now February 11). In theory, the source of the problem should become evident by looking at the syslog entries immediately prior to this first spamd error. Those entries are located in syslog.2.gz. Manual inspection does not turn up anything useful, so I just guess what to search for:

osgiliath:/var/log# zcat syslog.2.gz | grep -i mount
Feb  8 23:14:02 osgiliath kernel: Remounting filesystem read-only

This looks promising... another manual inspection reveals the following entries:

Feb  8 23:14:00 osgiliath kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  8 23:14:00 osgiliath kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=50753922, high=3, low=422274, sector=50753918
Feb  8 23:14:00 osgiliath kernel: ide: failed opcode was: unknown
Feb  8 23:14:00 osgiliath kernel: end_request: I/O error, dev hda, sector 50753918
Feb  8 23:14:00 osgiliath kernel: journal_bmap: journal block not found at offset 2060 on hda3
Feb  8 23:14:00 osgiliath kernel: Aborting journal on device hda3.
Feb  8 23:14:00 osgiliath hddtemp[24701]: /dev/hda: IC35L120AVV207-1: 42 C
Feb  8 23:14:00 osgiliath hddtemp[24701]: /dev/sda: HDS722525VLSA80: 41 C
Feb  8 23:14:02 osgiliath kernel: ext3_abort called.
Feb  8 23:14:02 osgiliath kernel: EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Feb  8 23:14:02 osgiliath kernel: Remounting filesystem read-only

Looking for other occurrences of the problem gives me a second series of entries:

Feb  9 04:00:21 osgiliath kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  9 04:00:21 osgiliath kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=51787320, high=3, low=1455672, sector=51787318
Feb  9 04:00:21 osgiliath kernel: ide: failed opcode was: unknown
Feb  9 04:00:21 osgiliath kernel: end_request: I/O error, dev hda, sector 51787318
Feb  9 04:00:21 osgiliath kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Feb  9 04:00:21 osgiliath kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=51787323, high=3, low=1455675, sector=51787318
Feb  9 04:00:21 osgiliath kernel: ide: failed opcode was: unknown
Feb  9 04:00:21 osgiliath kernel: end_request: I/O error, dev hda, sector 51787318

So, hmm... I don't really know what this is all about. Is the hard disk dying? Could be; after all, the disk is about 5½ years old and has been in almost continuous use. On the other hand, the hddtemp syslog entry is also suspicious, appearing so close to the hard disk failure... Googling brings no definite explanation, so I must assume the worst and prepare to add a new hard disk to the system.
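
(One way to get a more definite answer would be to ask the drive itself for its SMART status; a quick sketch, assuming the smartmontools package is installed:)

osgiliath:~# smartctl -H /dev/hda
osgiliath:~# smartctl -A /dev/hda

The first command prints the drive's overall health self-assessment, the second its vendor attributes; a non-zero and growing Reallocated_Sector_Ct or Current_Pending_Sector would point to the disk itself going bad.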

But first I need to fix the problem at hand. The attempted procedure is this:

  • reboot the machine using an emergency CD-ROM
  • perform a file-system check and fix all errors on /dev/hda3 (see the sketch after this list)
  • reboot
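
Given the I/O errors in the log, a variant worth noting for the check step is letting the checker also scan the device for bad blocks, so that any it finds end up in the filesystem's bad-block list; a rough sketch (with the filesystem unmounted):

fsck.ext3 -c /dev/hda3

(The -c option makes e2fsck run badblocks(8) in read-only mode before checking, which of course takes considerably longer.)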

1 hour later...

The procedure went as planned, although I had some trouble working with the rescue/emergency environment that the boot CD-ROM provided. I had decided to use a Debian install CD-ROM (debian-Lenny-DI-rc2-i386-CD-1.iso, to be precise), and I had to learn the uncomfortable way that the d-i (Debian Installer) environment is a BusyBox environment and therefore rather limited. No man pages, no cpio (for a last-minute backup :-), no diff; tar is present but works differently from what I am used to (and since there are no man pages, and tar has been stripped of all help texts, it is practically impossible to find out how it works), etc.

I eventually fixed the file system using the command

fsck.ext3 /dev/hda3

and a lot of manual intervention. What follows is an abridged transcript of the fsck session, mainly provided for my own sake and for historical purposes.

superblock has an invalid ext3 journal (inode 8).
clear<y>?
 
*** ext3 journal has been deleted - filesystem is now ext2 only ***
 
/dev/hda3 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Journal inode is not in use, but contains data. Clear<y>? y
 
Inodes that were part of a corrupted orphan linked list found. Fix<y>?
 
Inode 179183 was part of the orphaned inode list. FIXED.
Inode 179185 was part of the orphaned inode list. FIXED.
Inode 179186 was part of the orphaned inode list. FIXED.
Inode 179187 was part of the orphaned inode list. FIXED.
Inode 179188 was part of the orphaned inode list. FIXED.
 
Error reading block 360903 (Attempt to read block from filesystem resulted in short read) while getting next inode from scan. Ignore error<y>?
Force rewrite<y>?
 
[...]
 
Inode 211815 was part of the orphaned inode list. FIXED.
Inode 211820 was part of the orphaned inode list. FIXED.
 
Deleted inode 212060 has zero dtime. Fix<y>?
 
Pass 2: Checking directory structure
 
Pass 3: Checking directory connectivity
 
Pass 4: Checking reference counts
 
Pass 5: Checking group summary information
 
Block bitmap differences: -(515--8719) -382977 -(444438--444439)  -(444483--444487) -(444630--444649) -(448685--448692)
Fix<y>
 
Free blocks count wrong for group #0 (24034, counted=32236).
Fix<y>
 
[...]
 
Inode bitmap differences: -179183 -(179185--179188) -211815 -211820 -212060
Fix<y>
 
Free inodes count wrong for group #11 (16253, counted=16258).
Fix<y>
 
[...]
 
Free inodes count wrong (1211362, counted=1211370).
Fix<y>
 
Recreate journal to make the filesystem ext3 again?
Fix<y>
 
Creating journal (32768 blocks): Done.
 
*** journal has been re-created - filesystem is now ext3 again ***
 
/dev/hda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/hda3: 10230/1221600 files (1.2% non-contiguous), 126986/2441880 blocks