Server :: Troubleshooting/resolving Random Kernel Panic On Multiple?
Jun 22, 2010
I'm having a strange problem on some of our Debian servers. It all started about three weeks ago when we moved our virtual environment (VMware ESX3) from a SAN to a NAS (NetApp). At first I thought it had to do with that move, but since the other 9 servers are working perfectly I eliminated that idea. For over a year all 12 Debian 5 servers have been working great without any notable failures. All servers are (or were) up to date with the latest patches. About three weeks ago I started having kernel panics with the following message on three of our servers:
Code:
Bad EIP value
EIP  0x0 SS:ESP 0068:f6d7da18
Kernel panic - not syncing: Fatal exception in interrupt

Other times it looks like just a dump of hexadecimal data. The only difference between those 3 servers is that they have several mounted shares connecting to the NAS using CIFS. So I was thinking that it might have to do with an update of some kind in regards to smb. I recovered an image from a month ago, before the troubles began, copied over the data and MySQL databases and configured the 'old environment with recent data' exactly the same with MySQL master-master replication, document synchronization and load balancing. This task I performed last night (no other way since it's a production environment). Up to this time neither of the two 'restored' servers had a kernel panic. The one that has not been restored is having one at random about every hour and a half. Following are the different versions between the 'at this time' working server(s) and the failing one:
I'm on an Acer Aspire 7535 Laptop running a dual boot with 10.10 and Windows 7. Although a fresh install of Ubuntu 10.10 doesn't seem to give me any problems, after a few updates, my laptop would lock up and give a black screen while using 10.10. The kernel panic seems to happen randomly, either taking minutes or hours to happen, whether I am doing anything or idling.
I have had a serious problem with my laptop recently. I have been having unexpected reboots and kernel panics and I don't know why. I can't tell if it is hardware or software, and what is causing it.
It has happened on multiple power bricks from different outlets and locations, my slice battery and my main 9-cell battery.
I am running Debian Testing/Stretch with kernel 4.1 and XFCE4 on a Lenovo T420 with 16 GB of RAM, an i5-2540M, a USB3 ExpressCard, an OCZ SSD, an Intel 7260 AC wifi card and an Ericsson F5521GW 3G card. I also have a modded BIOS for the wireless cards, and I have had that for over half a year without any problems.
I have no clue what the cause is from the logs.
I also booted into Windows for a bit, then it blue screened, but I don't know if it was because of a driver I had tried to put on earlier for the 3g card, because Linux was out of commission and I needed 3g. The driver didn't work, gave a code 10 or something and gave me a blue screen that said something along the lines of device driver attempting to corrupt the system has been caught. Windows also won't boot anymore.
I'm not worried about Windows though, and what I really need is Linux to work, being that this was the first time I booted up into windows in several months.
Memtest runs fine, and passes all tests.
Right when I turned on my computer after one incident, I wrote a script to check the CPU temp and write it to a file every second. Once the computer turned off, I read the file and it said the CPU was at 39 degrees. Not something to turn off over.
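A temperature logger of the kind described above could look roughly like the following. This is only a sketch, not the poster's actual script: the sysfs path `/sys/class/thermal/thermal_zone0/temp` and the log file name are assumptions (the zone index differs between machines), and the function names are made up for illustration. Each line is flushed and synced so that the last reading has a chance of surviving a sudden power-off.

```python
import os
import time
from pathlib import Path

# Assumed sysfs location; the thermal zone index varies between machines.
DEFAULT_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '39000\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def log_temperature(zone_path: Path, log_path: Path, samples: int,
                    interval: float = 1.0) -> None:
    """Append one timestamped reading per interval to log_path."""
    with log_path.open("a") as log:
        for _ in range(samples):
            temp = parse_millidegrees(zone_path.read_text())
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {temp:.1f}\n")
            # Push the line to disk immediately, so an abrupt shutdown
            # is less likely to lose the most recent reading.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical log file name; one hour of readings at one per second.
    log_temperature(DEFAULT_ZONE, Path("cpu_temp.log"), samples=3600)
```

A single zone only reports one sensor, so a reading of 39 degrees there does not rule out another component overheating.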
I cannot find any indication of a problem in /var/log/kern.log. kern.log was extracted from the computer right after it rebooted randomly. Find it here: [URL] ...
I have an IBM server with a: "SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)", meaning kernel module CONFIG_FUSION. This is the real hardware, not emulated by vmware. With recent kernels (2.6.30-35) the boot results in one of three situations:
1.) Most common: kernel panic, with a stack dump far too long to show anything useful. I did briefly have a terminal to debug this, and it's just not being able to communicate with the SCSI subsystems; googling the errors found nothing of use.
2.) Sometimes: a hang waiting for the SCSI disks after the controller fails to initialize properly; it does not pass even after a couple of hours.
3.) Very rarely: a long wait for the SCSI controller to initialize but a successful boot afterwards. ACPI subsystems are taking all CPU power. This is the state the machine is in now, working somewhat.
I can boot the machine with a Gentoo live install disk (kernel 2.6.29 I think) and under that version the controller initializes instantly and there are no problems whatsoever using the disks, so the hardware seems to be solid!
There were some MPT changes in 2.6.25rc5 so that's the latest kernel I've tried. The errors during boot become a bit more verbose but that's about it. Unfortunately I don't have a serial terminal with me anymore so I cannot capture these errors, but it was something about an "unexpected doorbell" with 2.6.24 and nothing memorable with .25. Long story short though, googling the errors provides no solutions and the little I find points to the controller somehow being too slow to initialize.
I'm running out of ideas here and the server is burning fans like there's no tomorrow (ACPI taking everything it can), so is my only option to downgrade to 2.6.29 (and downgrade udev, lvm2, and so on)? I can't believe I'm the only one for whom this has been broken for several minor versions.
- 2.6.29-gentoo (live disk) - works, blazing fast initialization
- 2.6.32-gentoo-r7 - random boot failures and successes
- 2.6.34-gentoo-r1 - same
- 2.6.35-rc5 vanilla - same, better errors, zero useful google results.
I'm not very keen on longshot attempts since it can take a couple of hours of panic-loops to get the machine booted up again and the server functions cannot be transferred to another machine. Also, I'm a bit hesitant to mention this, but the boot has only ever succeeded when a serial cable with something at the other end is connected. At first it was the testing serial terminal and now it's just a connection to a UPS.
Attached: bootlog of a successful-ish boot with 2.6.34 (had to cut some out to meet the size limit):
Linux version 2.6.34-gentoo-r1 (root@livecd) (gcc version 4.3.4 (Gentoo Hardened 4.3.4 p1.1, pie-10.1.5) ) #1 SMP Sun Jul 18 18:54:48 EEST 2010
BIOS-provided physical RAM map:
I have a CentOS 4.8 Linux box running in VMware ESXi 4 and the kernel is 2.6.9-89.0.11.ELsmp. Recently this box has been quite unstable; it has a kernel panic about once a week... but we didn't make any configuration changes on it. I have attached the kernel panic console screen and lsmod for the server. [URL]...
We have a server running CentOS 5 Linux 2.6.18-128.1.16.el5xen #1 SMP Tue Jun 30 06:39:23 EDT 2009 x86_64 x86_64 x86_64 GNU Linux. We've seen that at random times the server will just reboot with nothing logged in messages. I tried to enable kdump but was only able to get a 5.4 GB dump since our /var directory is limited to 10GB. Here are the messages I see before and after the server restart. I had thought that when a kernel panics, it is supposed to halt the system and not reboot it. My /proc/sys/kernel/panic is set to 0. I can run an update but want to have some idea of what is causing the issue and whether the update will fix anything.
May 13 20:05:22 hlotmt01 xinetd: EXIT: bpcd status=0 pid=1071 duration=1(sec)
May 13 20:05:22 hlotmt01 xinetd: START: bpcd pid=1072 from=10.203.1.1
May 13 20:05:23 hlotmt01 xinetd: EXIT: bpcd status=0 pid=1072 duration=1(sec)
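The meaning of the `/proc/sys/kernel/panic` value mentioned above can be spelled out in a few lines. The sketch below follows the documented kernel behaviour (zero: stay halted, positive: reboot after that many seconds, negative: reboot immediately); the function names are made up for illustration. With the value at 0, a real panic would normally leave the machine halted, so a silent reboot may point to something other than an ordinary panic, although that is an inference and not something the post confirms.

```python
from pathlib import Path

PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def describe_panic_timeout(value: int) -> str:
    """Explain what a given kernel.panic setting makes the kernel do."""
    if value == 0:
        return "halt: the kernel loops forever after a panic"
    if value < 0:
        return "reboot immediately after a panic"
    return f"reboot {value} seconds after a panic"


def read_panic_timeout(path: Path = PANIC_SYSCTL) -> int:
    """Read the current setting (Linux only)."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    print(describe_panic_timeout(read_panic_timeout()))
```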
I want to build a custom system and I need your opinions. I have an old laptop which I want to configure as a system for troubleshooting purpose, my idea is to have multi-boot system with multiple root file systems, e.g. one root file system has only BIND to work as DNS server, another root file system has only Samba, etc., and I can choose which system to boot into from grub, or a custom menu after booting grub.
I thought of setting multiple partitions and install a full system on each one, but I thought that there might be a better way to do this, I'd like to hear your opinions.
I've been troubleshooting the audio for my iMac for the past few weeks (ugh) and right now I'm working through this: [URL]... When I use
configure --with-cards=hda-intel
it tells me:
Code:
The file /lib/modules/2.6.35-28-generic/build/include/linux/autoconf.h does not exist. install the package with full kernel sources for your distrubution or use --with-kernel=dir option to specifiy another directory with kernel sources (default is /lib/modules/2.6.35-28-generic/build).
When I searched for how to get kernel source I got this:
I have recompiled a few kernels, but all on 32bit systems so not sure if that has anything to do with it.
Running Arch Linux 64bit, most recent version.
My first thought was that it might be my grub bootloader configuration, so I had a big play around with that but it didn't fix it. I also made sure support was built in for the filesystems. However, almost everything that fstab mounts is ext3 anyway, and certainly root and /boot are. Now I'm thinking it may be a memory error, so I will run a check when I shut down.
Dell laptop booting from a USB stick with a CentOS 5.5 minimum installation.
Uncompressing Linux...OK, booting the kernel.
Red Hat nash version 18.104.22.168 starting
sda: assuming drive cache: write through
sda: assuming drive cache: write through
mount: error 6 mounting ext3
mount: error 2 mounting none
switchroot: mount failed: 22
umount /initrd-dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
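The numbers in the messages above can be decoded if one assumes they are plain errno values, which is how nash usually reports mount failures (an assumption, since the post does not say so). The sketch below looks up the symbolic name and description for each code using only the standard library.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return e.g. '6 ENXIO: <description>' for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


# The three codes that appear in the boot messages above.
for code in (6, 2, 22):
    print(describe_errno(code))
```

On Linux, 6 is ENXIO, 2 is ENOENT and 22 is EINVAL. If the assumption holds, ENXIO on the first mount hints that the root device itself was not available at that moment, which would be a different problem from missing ext3 support.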
1. Does minimum installation not drop on a kernel or initrd with ext3 support? I can't imagine that's true, but have to ask.
2. The USB stick is single partition ext3. Maybe there is some limitation specifically related to USB stick booting that requires boot to be FAT16 or FAT32? Except the CentOS 5.5 installer refuses to let me install on either FAT.
3. How can I do the equivalent of lsmod on a linux installation that will not boot? i.e. I have CentOS x86_64 running in VirtualBox, I can plug the USB stick in there, so how do I get information on the USB stick's kernel and initrd if I can't boot from it?
4. Is it possible to rebuild the i386 based initrd on this USB stick, when the computer is not booted from that stick, with a system that's x86_64 based?
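For question 3, one way to see which kernel modules an initrd contains without booting it is to read the image directly from another machine. The sketch below assumes the initrd is a gzip-compressed cpio archive in the "newc" format, which is what CentOS 5 era images normally are; the image path in the usage line is hypothetical. It lists the `.ko` files, which is the closest offline equivalent to lsmod.

```python
import gzip

HEADER_LEN = 110  # 6-byte magic followed by 13 fields of 8 hex digits


def _align4(offset: int) -> int:
    """Round an offset up to the next multiple of four."""
    return (offset + 3) & ~3


def list_cpio_names(data: bytes) -> list:
    """Return the file names stored in an uncompressed newc cpio archive."""
    names = []
    offset = 0
    while offset + HEADER_LEN <= len(data):
        header = data[offset:offset + HEADER_LEN]
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError(f"no newc cpio header at offset {offset}")
        file_size = int(header[54:62], 16)   # 7th header field
        name_size = int(header[94:102], 16)  # 12th field, counts the NUL
        name_start = offset + HEADER_LEN
        raw_name = data[name_start:name_start + name_size - 1]
        name = raw_name.decode("utf-8", "replace")
        if name == "TRAILER!!!":  # marks the end of the archive
            break
        names.append(name)
        # The name and the file data are each padded to a 4-byte boundary.
        data_start = _align4(name_start + name_size)
        offset = _align4(data_start + file_size)
    return names


def modules_in_initrd(raw: bytes) -> list:
    """List the kernel modules (*.ko) inside a gzip-compressed initrd."""
    names = list_cpio_names(gzip.decompress(raw))
    return sorted(n for n in names if n.endswith(".ko"))


if __name__ == "__main__":
    # Hypothetical location of the stick's /boot as seen from the VM.
    with open("/mnt/usbstick/boot/initrd.img", "rb") as fh:
        for module in modules_in_initrd(fh.read()):
            print(module)
```

If ext3.ko and jbd.ko are listed but no USB storage modules are, that would suggest the image was built for a non-USB root disk, though the post itself does not establish this.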
System Info: Dell Latitude i686 Laptop which has run CentOS 5.5 and Fedora 12,13,14 in the past, and boots from Fedora 14 Live CD transferred to a USB stick. So I know USB booting is possible on this machine, and this stick.
The process of creating the stick:
CentOS 5.5 i386 on a USB stick. Old Dell i686 laptop which has previously run CentOS 5.5 installed from DVD, and has successfully booted from this same USB stick holding transferred Fedora 12,13,14 Live CDs. CentOS 5.5 was installed onto the USB drive directly by the CentOS 5.5 DVD installer (running virtualized in VirtualBox 4.02 on Mac OS X 10.6.5.). No errors or complaints during installation.
For whatever reason, the installer did not do some things correctly. First Grub wasn't working correctly, I got that sorted out and have the Grub+CentOS splash screen, it finds vmlinuz and the initrd, and then I get a kernel panic.
Ext3 was built into the kernel and that's why I'm getting this message. I do not know how the installer would have dropped a kernel or initrd during installation that doesn't contain such a basic thing, which obviously comes in Linux kernel 2.6.18-89 EL.
I have the following strange thing with a RHEL4 installation. Last week the system rebooted and now something is really fucked up. During boot we get the following messages (never mind any 'strange' typos; my colleague typed them 'blind' from the screen)
The strange thing is that we never see a 'could not mount blabla' or similar messages. First we thought it was a failing kernel update by plesk, but even after manually updating the kernel with RHN RPM's, still the same message. Booting with rescue mode and then chroot the system works. After that we even can start things like plesk and so on.
We double checked things with another RHEL4 install, and at least two things were odd:
1: the working machine has /dev/dm-0 and /dev/dm-1, the broken one doesn't
2: some files on /dev didn't have group root, but 252
We tried to recreate the /dev/dm-X nodes with [vgmknodes -v], output:
A fdisk /dev/sda shows: /dev/sda2 XX XXX XXXXX Linux LVM (I removed the numbers because this line is from another machine, but rest was identical)
We have a copy of the boot partition so if one need more info please let me know.
last part of init extracted from initrd-2.6.9-78.0.8.ELsmp.img:
I am running an HP Pavilion dv6000 with the Broadcom card that never seems to work under Linux. I recently talked with my friend, who said he had found a way to get it to work. Following his instructions, I opened Synaptic and marked the package bcmwl-kernel-source for installation. I went through the whole process and it said it had installed successfully. I restarted the computer, and when I tried to boot my operating system I got this error: "Kernel panic - not syncing : VFS : Unable to mount root fs on unknown - block(8,1)". I have previous versions of Linux on my computer so I can still get into those if need be, but I don't know how to undo what I did, or why it isn't working for that matter. Does anyone have any ideas as to why I am getting this error and how I can fix it?
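The "block(8,1)" part of the message above is a device number: major 8, minor 1. Under the standard Linux numbering, major 8 covers the first sixteen SCSI/SATA disks with sixteen minors each, so the sketch below maps such a pair to its usual /dev name. The function name is made up for illustration and other majors are deliberately not handled.

```python
def sd_device_name(major: int, minor: int) -> str:
    """Map a (major 8, minor) pair to the conventional /dev/sdXN name."""
    if major != 8:
        raise ValueError("this sketch only handles major 8 (sda to sdp)")
    # Each disk owns 16 minors: 0 is the whole disk, 1-15 its partitions.
    disk, partition = divmod(minor, 16)
    name = "/dev/sd" + chr(ord("a") + disk)
    return name + (str(partition) if partition else "")


print(sd_device_name(8, 1))  # /dev/sda1
```

So the kernel knew which device it was supposed to mount (the first partition of the first disk) but could not open it. A common cause is a missing or incomplete initramfs for the kernel being booted, though nothing in the post confirms that this is what happened here.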
I have one machine where I have several versions installed on different partitions. The base partition (/dev/hda1) is Slack 12.1. On a spare partition (/dev/hdc4) I had installed Slackware64-current. Last week I slackpkg upgraded and installed the 22.214.171.124 kernel, and now that partition will not boot. I know that with the new kernels the hd* designation has been removed, and have already redone that fstab (accessing it from a different boot) to reflect the sd*. Here is the slack64 section of my lilo.conf:
Code:
# Linux bootable partition config begins
image = /other/spare4/boot/vmlinuz
I have a server running Red Hat Enterprise Linux ES (126.96.36.199.ELsmp). When it starts up I get the following error:
Uncompressing Linux... Ok, booting the kernel.
Red Hat nash version 188.8.131.52 starting
WARNING: can't access (null)
exec of init ((null)) failed!!!: 14
unmount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
After that I got no response from the OS. I have the installation CD, so I tried to start the rescue mode, while going through the steps I received an error stating that mounting to /mnt/sysimage failed and that if I want to I can access a shell. I really don't know what to do from here
This is what I did: I downloaded the latest stable kernel archive from kernel.org and extracted it into the download directory (I don't think that matters though). Then I downloaded and installed the ncurses archive (needed for menuconfig). Then I opened a terminal, navigated to the directory that was extracted from the archive, and issued the following commands:
The server boots up properly and the system is running. Browsing the web sites on the server seems to work fine, but when you make requests to it, such as a large MySQL request, it takes a long time to process them, even to the point that it freezes. It seems that any sizeable request just overloads it. I have checked the logs, and the most I can get from them is that one of my wiki sites is getting nailed by spam bots; its database is one of those that were repaired, and that could be slowing things down A LOT. I did find a couple of databases that were marked as crashed, and they have been repaired. I am not really sure what is causing the errors since the HD space is fine, RAM is fine, and the internet connection is fine... I am not as familiar with diagnosing Linux as I am with a Windows PC. I am not very shell friendly, and I am not good with the basic troubleshooting commands needed to identify the source of the problem and correct it. Could anyone point me the way...
Thanks in advance and I apologize for being a ra-tard on these. I should know this stuff better than I do, running a server like this. It was created in 2006 and has been problem free since. Another thing: mass emailing isn't working either. It may be mysqld related. Sorry for the fragments and such. It also seems that SMTP has suddenly stopped working properly.
Another error I've been seeing:
PHP Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 16 bytes)
Sometimes I see this as 32 or 64 bytes as well.
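The byte count in the error above is easier to read in mebibytes. The short sketch below does the conversion; the helper name is made up for illustration.

```python
def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


print(bytes_to_mib(33554432))  # 32.0
```

So the script hit a 32 MB limit, which is what a PHP `memory_limit` setting of 32M would produce. The small "tried to allocate" figure is only the request that crossed the limit, not the total amount the script needed.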
FTP connection: it is also slow when making FTP connections. It is extremely slow but they do finally complete. It can be just a simple connection to display content. Slow.
I have changed my mail server. Previously, when I opened my domain [url] it went to [url]. Now I have changed my mail server from Openwebmail to Zimbra, so when I open mail.mydomain.com again, the proxy server takes me to [url].
If I bypass the proxy server, it opens mail.mydomain.com (my Zimbra page). I have also updated the internal DNS address, but Squid is not picking it up.
How do I update Squid's DNS entry? How do I update Squid's cache records?
I got two computers with the exact same symptoms today: Both seem to be alive. They answer pings and they host virtual machines which are working. However, I cannot login via SSH. I get
$ ssh -v obelix
OpenSSH_5.3p1 Debian-3ubuntu6, OpenSSL 0.9.8k 25 Mar 2009
debug1: Reading configuration data /home/mike/.ssh/config
debug1: Applying options for obelix
On the console I can switch ttys and enter my login credentials. After that I see the GNU/Debian Linux welcome message and no prompt. The system seems to hang. I'm able to switch tty again and log in, with the same result. (I have tried that three times now and I'm starting to run out of ttys.) A search on the Internet suggested some resource shortage. How could I release some resources then? How could I investigate the situation further before just trying a hard power off?
I've got a server that I've been trying to get Ubuntu Server 10.04.01 LTS installed and running on and I've gotten incredibly close to having it work, but the Network is still giving me issues. The installer went off without a hitch (once I fixed some optical drive issues) and it has actually detected the "Intel Corporation 82576 Gigabit Network Connection (rev 01)" Dual-port onboard NIC which shows up when I run lspci. I statically assigned an IP, Subnet, DNS Server and Gateway to eth0 and things looked good. ifconfig showed that it was "UP", and I could ping localhost, but if I tried to ping anything on the LAN it said "Destination Host Unreachable".
Here's the ifconfig and lspci (snippet)
======== ifconfig ========
eth0 Link encap:Ethernet HWaddr 00:25:90:2d:26:98
     inet addr:172.30.1.75 Bcast:172.30.3.255 Mask:255.255.252.0
     UP BROADCAST MULTICAST MTU:1500 Metric:1
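Two things can be checked from the ifconfig output above without touching the server. The sketch below uses Python's standard `ipaddress` module to confirm that the address, netmask and broadcast address agree with each other, and then looks at the flags line. The helper name is made up for illustration.

```python
import ipaddress

# Address and netmask exactly as shown in the ifconfig output.
iface = ipaddress.ip_interface("172.30.1.75/255.255.252.0")
print(iface.network)                    # 172.30.0.0/22
print(iface.network.broadcast_address)  # 172.30.3.255


def up_without_running(flags_line: str) -> bool:
    """True if the interface is marked UP but lacks the RUNNING flag."""
    flags = flags_line.split()
    return "UP" in flags and "RUNNING" not in flags


print(up_without_running("UP BROADCAST MULTICAST MTU:1500 Metric:1"))  # True
```

The addressing is self-consistent, so a wrong netmask is unlikely to be the cause. The flags line, however, shows UP without RUNNING, which usually means the driver sees no link on that port. Since the NIC is dual-port, it may be worth checking whether the cable is plugged into the port that the system calls eth0, although the post does not confirm this.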
Recently I've been playing with a few Linux distros, since my Windows XP got infected with something that makes anything Flash Player related crash or even restart the computer (I'm mentioning this because it might be related). A reinstall didn't fix it, so I decided to switch back to the Linux world and tried some of the newest distros. Everything was nice until recently, when some distros gave me a kernel panic, or blinking Caps Lock and Scroll Lock lights, after installation. The trouble is that Fedora 14 just did the same thing after I went away from the keyboard for maybe 5 minutes with a web browser open. At first I thought it was a badly made distribution, but now I'm a bit scared that my PC is dying. I did a basic search on Google, and people say it may be related to the hardware being too old or to drivers, but I don't encounter this while I'm using the machine, only when I'm AFK. I have almost no experience with Fedora, so I was just wondering if someone could point me in the right direction.
I have a CR-48 running Ubuntu 11.04. I've had some continual issues with it, along with other Linux distros I tried. There are times my laptop freezes completely during mid use. If I power off and back on, 70% of the time it doesn't detect the hard drive upon next boot. If I go into BIOS, no SSD detected. If I pull the battery, hit the power button, put battery back in, it'll run fine for a while.
I'm up in the air if it's a hard drive issue or motherboard issue. Once I got this kernel panic I sat here for a half hour on another system typing it up. I hope somebody can look at it and make sense of it and tell me what essentially is happening.
I installed the BIND DNS server on openSUSE 11.1 and it ran fine for two weeks. However, starting last week it has been unable to resolve a lot of internet sites (e.g. opensuse.org). I would say it is working partially. I tried restarting the named service; it worked fine for only about 15 minutes and then went back to the partial condition. All the records in the zone files can be resolved. Only some internet websites cannot be resolved. The same sites can be resolved by another internal DNS server.
The Fedora 11 Live CD was tested on a Dell and it is good. On the HP Pavilion a705w, with an Intel Celeron 2.93GHz and 760MB RAM, the install fails with a kernel panic. The CD was burned on the HP.
A colleague suggested that the HP may not be compatible with the kernel. Are there startup settings that I might try to diagnose the panic issue? Perhaps there is a verbose boot up process that pauses so one can take notes?
After installing the latest updates (I don't know which ones) yesterday, I get a kernel panic (keyboard LEDs flashing and a black screen) when starting X (the login manager). I am running 64-bit with the proprietary ATI Catalyst driver. How can I start in text-only mode?