ext3-check_descriptors

So a co-worker of mine had his little UPS under his desk go out and when he rebooted he got something like the following:

Reading all physical volumes. This may take a while
Found volume group "VolGroup00" using metadata type LVM2
2 logical volumes in volume group "VolGroup00" now active
EXT3-fs error (device dm-0): ext3_check_descriptors: Inode bitmap for group 45 not in group (block 1441793)!
EXT3-fs: group descriptors corrupted!
mount: error mounting /dev/root on /sysroot as ext3: Invalid argument
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!

His computer, running Fedora Core 5 (FC5), wouldn't boot, but we were at least able to pull his data off; this is what we did to recover it. The first thing we did was boot off of the Fedora 1 CD into rescue mode.
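
If you haven't used it before, rescue mode on the Fedora install CDs of that era is started from the installer boot prompt (the exact syntax can vary by release):

boot: linux rescue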

Rescue mode was unable to find the installation, so since he had an LVM partition, we ran the following commands to bring it up by hand:

# lvm

lvm> lvscan

lvm> lvchange -ay /dev/VolGroup00

lvm> lvscan

lvm> exit
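
As an aside, activating the whole volume group with vgchange should accomplish the same thing as the lvchange above (same VolGroup00 name):

lvm> vgchange -ay VolGroup00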

# fsck -p -y /dev/VolGroup00/LogVol00

At this point it found all kinds of corrupt inodes and even deleted the ext3 journal. Depending on what is corrupt, you may have better luck with this part. Since it blew away the journal, we had to recreate it:

# tune2fs -j /dev/VolGroup00/LogVol00

This made it so we could mount the partition, but all the folders were named after their inode numbers, like #567847.

# mkdir /mnt1

# mount /dev/VolGroup00/LogVol00 /mnt1
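
Files and directories that fsck re-links by inode number typically end up under lost+found at the top of the filesystem, so that is a good place to start looking:

# ls /mnt1/lost+found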

He was then able to tar up his stuff, scp it to another box and then rebuild from scratch.
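
Something along these lines would do it (the hostname and paths here are made up, and piping straight over ssh avoids needing scratch space on the damaged box):

# tar czf - /mnt1 | ssh user@otherbox 'cat > /backup/recovered.tar.gz'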

STARTTLS: read error=generic SSL error (0)

Ok, so I had a customer running a newer version of sendmail, with tons of these error messages in the logs:

STARTTLS: read error=generic SSL error (0)

After doing some research, I found everyone recommending that you turn off the error reporting by recompiling sendmail with a different configuration. The real problem, though, is that the error is usually being written by a single errant sendmail process, and that process just needs to be killed. It's actually good to get the error message so you know there is something to fix. A normal sendmail restart does not kill the errant process, so look at the message in the logs and kill that particular PID. So, for example, given the following log entry:

sendmail[21313]: STARTTLS: read error=generic SSL error (0)

You would use the following command:

kill -9 21313
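
If you don't want to eyeball the log, something like this pulls the PID of the most recent offender (assuming the stock Fedora/RHEL log location of /var/log/maillog):

# grep 'STARTTLS: read error' /var/log/maillog | tail -n 1 | sed 's/.*sendmail\[\([0-9]*\)\].*/\1/'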

Dell vs HP, winner? Dell by a landslide

A customer of mine, who has an HP server (a ProLiant ML350), recently had a RAID 5 array crash, taking down their system. I immediately got on the phone with HP support, since this customer has one of their best warranty options available. I ended up getting routed to India, where the technician I talked to only had access to the owner's name and the model of the server. The technician decided that it was a hard drive problem and was going to send out a drive, but he had to ask me what drive he should send. I was flabbergasted. How can this guy do his job in any way without knowing the components of the system? After I told him what kind of drive was in the system, he said he would send out a drive and a technician. Several hours later I got a call from a tech in Salt Lake, who confirmed that the problem was indeed not with the hard drive.

So back to the phones I went, again getting routed to India. On another try I made it to someone in the States, but they had no idea what Red Hat Linux was or which department to send me to. After getting back to the operator a third time, I was finally passed to a supervisor, who passed me to another supervisor, who would then have the right person call me back. Anyway, after 7 hours of waiting on hold and explaining the situation as many times, I finally talked to a guy who has worked at HP for 23 years (via DEC & Compaq), at the same desk, doing the same job the entire time. Guess what? He was able to narrow down the problem and have the server back up in 15 minutes! Great support at that point, but what about the 7 hours it took to get to him? I will not be recommending HP anytime soon, that's for sure.

Now, let's compare that to recent warranty work I went through last month on a Dell server, a PowerEdge 2850. The server wouldn't POST, so we called Dell, did some basic troubleshooting to narrow down the point of failure, and they sent parts and a tech. The warranty on this server was a 4-hour response time, and the parts arrived about 2 hours later by special courier, followed immediately by a technician trained and certified on every Dell server he works on. Dell had sent all the parts necessary to fix the server. I assumed the tech would then spend the next hour or so troubleshooting which part was bad and needed replacement, but I was wrong. The technician simply replaced every single part that they sent and had the server back up and running in less than 15 minutes. I was amazed at this new philosophy, but it makes complete sense: since the repair facilities recheck all the parts they get back anyway, why do it twice while the customer is down? The technician simply replaced all the parts and sent the old ones back to Dell, where they could determine what was defective. You don't need to worry about an intermittent problem that may come back after the technician leaves, or about wasted time troubleshooting. I will always recommend a Dell, simply because of the level of support I have consistently received from them.