28 May 2012

Hetzner Root Server with RAID Does Not Boot After Disk Swap: Re-Installing the GRUB Bootloader

(I am writing this to give something back to the Web, because it helped me solve a problem and my post might help someone else.)

Two weeks ago a disk of my Hetzner root server was slowly degrading. I asked for an exchange, got the go-ahead from Hetzner support, and removed the disk from the RAID1. Hetzner installed a new disk and booted the server. I added the new disk to the RAID, waited for the resync, done.
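
Before calling a disk swap finished, it is worth confirming that the resync really completed. A minimal sketch, assuming the usual /proc/mdstat layout of Linux software RAID; the function name degraded_arrays is my own and not part of any tool:

```shell
# List md arrays whose member status (e.g. [U_]) shows a missing disk.
# Reads /proc/mdstat-formatted text from stdin, prints one array name per line.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/  { name = $1 }     # remember the array this block describes
        /\[U*_[U_]*\]/ { print name }    # "_" in the status marks a missing member
    '
}

# Usage on the server:
#   degraded_arrays < /proc/mdstat
```

No output means every array has all its members. An array that is still resyncing is listed too, because its status stays at [U_] until the resync is done.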

Two weeks later the server just disappeared from the net with all services. No connection possible on any port. I don't know whether the earlier disk change was related to this problem, but I suppose it was.

Here is what I did to fix it:

Check if it boots into the rescue system

Log in to the Hetzner management site (https://robot.your-server.de/) and enable the rescue-system. Then request an automatic hardware reset.

Connecting to the server and logging in as root works. This means there is no hardware problem and no persistent routing problem.

Just to be sure: request another hardware reset and let it do the normal boot. Trying to connect again fails. This eliminates a temporary routing problem or a switch problem (I once had a switch problem, where after a power outage, the switch needed an "I am here" from the server in order to send packets on the link. The only way to make the server known to the switch was a reset to force a DHCP request during boot.)
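
The reasoning in these two steps boils down to one reachability check, repeated after each kind of boot. A small sketch, assuming bash and the coreutils timeout command on the machine you test from; port_open is my own helper name and the host name in the usage line is a placeholder:

```shell
# Return 0 if a TCP connection to host ($1) on port ($2) succeeds within 5 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is called explicitly.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage (server.example.com is a placeholder):
#   port_open server.example.com 22 && echo "ssh reachable" || echo "ssh unreachable"
```

Reachable in the rescue system but unreachable after a normal boot points at the installed system, not at the hardware or the network.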

I need to know what happens on the console during boot

I asked Hetzner support to attach a LARA remote console. The console shows the local screen output and accepts keyboard input. It is even possible to re-configure the BIOS.

Request hardware reset and watch the console. I see the memory test, scanning for devices, 2 disks found (fine), then "Booting from local disk"...

Since the disks were found, it must be a problem on the disk. Maybe the bootloader is broken.

Check the file system and then re-install the bootloader

Request the rescue-system and then a hardware-reset. Connect to the rescue system.

Possible to mount the RAID?

# mount /dev/md1 /mnt

Yes. Do a file system check (first umount the file system):

# umount /mnt
# fsck /dev/md1

fsck shows many errors. I fixed them by holding down the "y" key. The auto-repair option of fsck (-y) would have done the same.
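
When fsck runs with -y there are no prompts to watch, so the exit status is the quickest summary of what happened. A minimal sketch that decodes the documented bit mask from fsck(8); describe_fsck_status is my own helper name and it covers only the most common bits:

```shell
# Translate an fsck exit status (a bit mask, see fsck(8)) into words.
describe_fsck_status() {
    status=$1
    [ "$status" -eq 0 ] && { echo "no errors"; return 0; }
    [ $((status & 1)) -ne 0 ] && echo "errors corrected"
    [ $((status & 2)) -ne 0 ] && echo "system should be rebooted"
    [ $((status & 4)) -ne 0 ] && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ] && echo "operational error"
    return 0
}

# Usage in the rescue system, with the file system unmounted:
#   fsck -y /dev/md1; describe_fsck_status $?
```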

Re-Install the bootloader

This is a Hetzner installimage-setup, so there should be a grub bootloader. Check for the /boot/grub/ folder. Again mount the RAID.

# mount /dev/md1 /mnt
# ls /mnt/boot/grub

It is there, so there is a good chance that the bootloader is grub, not lilo. Now re-install grub on the disk. Actually on both disks, so that the server can still boot if one of them fails.

Make a chroot environment:

# mount /dev/md1 /mnt
# mount -t none -o bind /dev /mnt/dev
# mount -t proc -o bind /proc /mnt/proc
# mount -t sysfs -o bind /sys /mnt/sys
# chroot /mnt
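
The same mounts can be wrapped in a small script together with the matching cleanup for later. This is only a sketch: run, prepare_chroot and teardown_chroot are names I made up, the mounts use the more common --bind and "-t proc proc" spellings, and the cleanup step is my assumption about what should happen after leaving the chroot and before the reboot. With DRY_RUN=1 the commands are only printed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# $1 = root device (here /dev/md1), $2 = mount point (here /mnt).
prepare_chroot() {
    run mount "$1" "$2" &&
    run mount --bind /dev "$2/dev" &&
    run mount -t proc proc "$2/proc" &&
    run mount -t sysfs sysfs "$2/sys"
}

# Undo the mounts in reverse order; $1 = mount point.
teardown_chroot() {
    run umount "$1/sys" &&
    run umount "$1/proc" &&
    run umount "$1/dev" &&
    run umount "$1"
}

# Usage in the rescue system:
#   prepare_chroot /dev/md1 /mnt && chroot /mnt
#   ... work inside the chroot, then exit ...
#   teardown_chroot /mnt
```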


# grub

Look for the file stage1 to find the boot partitions (if /boot is a separate partition, search for /grub/stage1 instead):

grub> find /boot/grub/stage1

Install the bootloader on both disks. Each disk is mapped to hd0 in turn, because whichever disk the BIOS boots from is hd0 from the point of view of the bootloader at boot time.

grub> device (hd0) /dev/sda
 device (hd0) /dev/sda
grub> root (hd0,1)
 root (hd0,1)
  Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
 setup (hd0)
  Checking if "/boot/grub/stage1" exists... yes
  Checking if "/boot/grub/stage2" exists... yes
  Checking if "/boot/grub/e2fs_stage1_5" exists... yes
  Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  17 sectors are embedded.
  Running "install /boot/grub/stage1 (hd0) (hd0)1+17 p (hd0,1)...

The same for the other disk

grub> device (hd0) /dev/sdb
grub> root (hd0,1)
grub> setup (hd0)
grub> quit
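
The two interactive sessions differ only in the disk name, so they can also be generated. A minimal sketch, assuming GRUB legacy with its --batch option and the same partition index on both disks; grub_install_commands is my own helper name:

```shell
# Emit GRUB legacy shell commands that install the bootloader on one disk.
# $1 = disk device (e.g. /dev/sda)
# $2 = zero-based index of the partition that holds /boot/grub (here 1)
# The disk is always mapped to (hd0), matching what the BIOS sees at boot time.
grub_install_commands() {
    printf 'device (hd0) %s\n' "$1"
    printf 'root (hd0,%s)\n' "$2"
    printf 'setup (hd0)\n'
    printf 'quit\n'
}

# Usage inside the chroot:
#   for disk in /dev/sda /dev/sdb; do
#       grub_install_commands "$disk" 1 | grub --batch
#   done
```

Printing the commands first and reading them before piping them into grub is a cheap way to catch a wrong disk or partition number.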

And reboot - works.

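(My mistake was probably not re-installing the bootloader after swapping the disk. I expected that the RAID1 resync would make both disks identical at the sector level. That assumption is wrong for the boot sector: the RAID is built from partitions (partition type 0xfd above), so the resync copies only the contents of those partitions, not the MBR and the sectors right after it, where grub puts stage1 and the embedded stage1.5. The fsck errors may also indicate that the boot sector was affected by the disk problems, who knows. These things happen, especially to part-time admins.)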
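
Whether a disk carries a grub boot sector at all can be checked from the rescue system without changing anything. A minimal sketch, assuming the stage1 boot code contains the text "GRUB", which is the case for GRUB legacy as far as I know; has_grub_signature is my own helper name:

```shell
# Return 0 if the first 512-byte sector of a disk or image file contains "GRUB".
# Only reads from the device, never writes.
has_grub_signature() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}

# Usage in the rescue system:
#   for disk in /dev/sda /dev/sdb; do
#       has_grub_signature "$disk" && echo "$disk: grub found" || echo "$disk: no grub"
#   done
```

A freshly swapped disk that reports "no grub" after the resync would confirm the suspicion.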